Lightweight Cryptography Primitives
Performance of Masked Algorithms


These performance figures are out of date. They are based on previous implementations of the baseline algorithms in plain C. Some of the algorithms have since been rewritten in assembly code but the figures have not been updated yet.


This page lists performance figures for masked implementations of the algorithms. All figures were calculated on an Arduino Due, which is an ARM Cortex M3 running at 84 MHz.

The average performance of the baseline unmasked algorithm is compared with its masked counterpart to produce the figures below. The figures show the amount of overhead. For example, if an algorithm has an overhead of 6.91 for four shares, then that means that the 4-share masked version is on average 6.91 times slower than the baseline unmasked version.

Ideally, a masked algorithm with N shares would have an overhead of roughly N, but this won't normally be the case. Calls to the system random number generator may be slow: the Arduino Due produces a new random number every 84 clock cycles which can introduce delays. As the number of shares increases, the delays due to random number generation become more significant.

Some algorithms can do better than N. Spook, for example, only masks the initialization and finalization steps, with the rest using the regular unmasked code, so its overhead can fall below N. But as N increases, the random number generation overhead becomes more significant.

"Degree of Masking" in the table below indicates how much of the algorithm runtime is masked. A value of "Init/Final" indicates that initialization and finalization tasks that involve the key are masked but everything else uses the baseline unmasked code. A value of "Init" indicates that only initialization is masked. A value of "Full" indicates that every block operation is fully masked.

Performance on ARM Cortex M3

Where a NIST submission contains multiple algorithms in a family, bold italics indicates the primary algorithm in the family. Lower numbers are better.

Algorithm | Degree of Masking | 2 shares | 3 shares | 4 shares | 5 shares | 6 shares

It was observed that about 30% of the overhead of the 4-share version was due to the Arduino Due's TRNG, which produces a new 32-bit random word every 84 clock cycles; the code frequently had to stop and wait for the TRNG. On a different CPU with a faster TRNG, the results would be better.

Pyjamask and Spook were designed by the authors with masking in mind as a primary goal and they have been masked according to the authors' recommendations. The other algorithms were not designed with masking as a primary goal.

ISAP also provides side-channel protection, but it is built into the standard design with no masking required. If it were to appear in the above table, all columns would be set to 1.00.


The following table ranks the primary algorithms in increasing order of 4-share overhead:

Algorithm | Degree of Masking | 4 shares

ISAP has been included for comparison purposes. The baseline version of that algorithm implements side-channel protection without the need for random masking, so the "4 shares" value is effectively 1.00.

The following table divides the 4-share overhead by the ChaChaPoly ranking from the baseline ARM Cortex M3 performance rankings. This gives an indication as to the relative performance of the masked algorithms in software, with the fastest at the top of the table:

Algorithm | Degree of Masking | 4 shares | ChaChaPoly | 4 shares / ChaChaPoly

Optimisation tricks

This section describes some optimisations for masked ciphers that I encountered while writing the above implementations. These tricks may help other implementers of masked algorithms in C and assembly code.

Random number generation can add a lot of overhead to the runtime. Some CPUs offer a built-in TRNG, but it may take many clock cycles to generate each new random word (84 on the Arduino Due). It helps to space out the RNG calls with regular instructions between them, so that less time is spent polling for the next random word.

Operate on individual shares as much as possible: do everything on A, then everything on B, and so on. This reduces the register spills that occur when switching between shares. The AND steps are where this becomes difficult, because all shares are needed at once. Try to group the simpler XOR masking steps before and after the AND steps so that the shares can be operated on independently in most of the code.

Sometimes it can help to operate on the shares in reverse order just before an AND step. The 3-share AND code operates on A, then B, then C. If the previous steps were operating on C, then B, then A, then A and parts of B are already in registers ready for the start of the AND.

Also see my page on masking utilities.