Lightweight Cryptography Primitives
This page lists performance figures for masked implementations of the algorithms. All figures were calculated on an Arduino Due, which is an ARM Cortex M3 running at 84MHz.
The average performance of the baseline unmasked algorithm is compared with that of its masked counterpart to produce the overhead figures below. For example, an overhead of 6.91 for four shares means that the 4-share masked version is on average 6.91 times slower than the baseline unmasked version.
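Expressed as a formula (the symbols below are just shorthand for this page: T_masked,N is the average runtime of the N-share masked version and T_unmasked is the average runtime of the baseline):

```latex
\mathrm{overhead}_N = \frac{T_{\mathrm{masked},N}}{T_{\mathrm{unmasked}}},
\qquad \text{e.g.}\ \mathrm{overhead}_4 = 6.91
```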
Ideally, a masked algorithm with N shares would have an overhead of roughly N, but this is rarely the case in practice. Calls to the system random number generator can be slow: the Arduino Due only produces a new random number every 84 clock cycles, which can introduce delays. As the number of shares increases, the delays due to random number generation become more significant.
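To see why the randomness cost alone grows faster than N, assume a standard ISW-style masked AND gadget (the actual gadgets used by the implementations may differ in detail). Each masked AND then consumes a quadratic number of fresh random words:

```latex
\text{fresh random words per masked AND} = \frac{N(N-1)}{2},
\qquad \text{worst-case RNG wait} \le 84 \cdot \frac{N(N-1)}{2} \text{ cycles on the Due}
```

So a 4-share AND can consume six fresh random words where a 2-share AND needs only one, which is why the RNG delays matter more as N grows.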
Some algorithms can do better than N. Spook, for example, masks only the initialization and finalization steps, with the rest using the regular unmasked code, so it can achieve an overhead below N. But as N increases, the random number generation overhead still becomes more significant.
"Degree of Masking" in the table below indicates how much of the algorithm runtime is masked. A value of "Init/Final" indicates that initialization and finalization tasks that involve the key are masked but everything else uses the baseline unmasked code. A value of "Init" indicates that only initialization is masked. A value of "Full" indicates that every block operation is fully masked.
Where a NIST submission contains multiple algorithms in a family, bold italics indicates the primary algorithm in the family. Lower numbers are better.
Algorithm | Degree of Masking | 2 shares | 3 shares | 4 shares | 5 shares | 6 shares |
ASCON-128 | Init/Final | 3.93 | 8.52 | 14.40 | 23.35 | 34.03 |
ASCON-128 | Full | 8.03 | 18.36 | 33.44 | 52.34 | 74.72 |
ASCON-128a | Init/Final | 4.33 | 9.83 | 16.71 | 27.21 | 38.78 |
ASCON-128a | Full | 7.48 | 17.47 | 31.70 | 48.52 | 69.33 |
ASCON-80pq | Init/Final | 3.87 | 8.54 | 14.44 | 23.38 | 33.44 |
ASCON-80pq | Full | 7.90 | 18.39 | 33.55 | 51.45 | 73.47 |
GIFT-COFB | Full | 6.28 | 19.27 | 36.84 | 62.16 | 91.22 |
GIMLI-24 | Init | 2.69 | 6.28 | 11.93 | 19.80 | 29.33 |
GIMLI-24 | Full | 10.50 | 31.41 | 60.37 | 106.37 | 159.46 |
KNOT-AEAD-128-256 | Init | 2.01 | 4.88 | 8.09 | 12.74 | 19.36 |
KNOT-AEAD-128-256 | Full | 6.12 | 17.99 | 32.06 | 52.46 | 81.93 |
KNOT-AEAD-128-384 | Init | 2.92 | 7.74 | 14.01 | 23.93 | 35.57 |
KNOT-AEAD-128-384 | Full | 4.60 | 13.91 | 26.86 | 43.77 | 65.54 |
KNOT-AEAD-192-384 | Init | 2.13 | 4.94 | 8.62 | 14.44 | 21.22 |
KNOT-AEAD-192-384 | Full | 4.67 | 14.19 | 27.46 | 44.74 | 67.07 |
KNOT-AEAD-256-512 | Init | 2.32 | 5.47 | 9.77 | 14.92 | 23.67 |
KNOT-AEAD-256-512 | Full | 4.93 | 13.98 | 26.97 | 43.61 | 65.27 |
Pyjamask-128-AEAD | Full | 2.94 | 3.05 | 5.31 | 5.74 | 8.14 |
Pyjamask-96-AEAD | Full | 2.94 | 3.06 | 5.33 | 5.73 | 8.14 |
SPIX | Init/Final | 4.55 | 12.17 | 23.47 | 38.60 | 59.69 |
SPIX | Full | 6.64 | 19.44 | 38.13 | 62.76 | 93.72 |
SpoC-128 | Full | 6.65 | 19.39 | 38.57 | 63.26 | 95.21 |
SpoC-64 | Full | 4.02 | 11.06 | 20.38 | 35.05 | 50.44 |
Spook-128-384-mu | Init/Final | 1.71 | 3.33 | 5.50 | 8.07 | 11.47 |
Spook-128-512-mu | Init/Final | 1.73 | 3.35 | 5.41 | 8.15 | 11.22 |
Spook-128-384-su | Init/Final | 1.72 | 3.34 | 5.53 | 8.09 | 11.52 |
Spook-128-512-su | Init/Final | 1.74 | 3.36 | 5.43 | 8.17 | 11.26 |
TinyJAMBU-128 | Full | 3.16 | 8.04 | 14.59 | 25.65 | 43.13 |
TinyJAMBU-192 | Full | 3.24 | 8.25 | 14.94 | 26.24 | 44.27 |
TinyJAMBU-256 | Full | 3.25 | 8.31 | 15.08 | 26.49 | 44.86 |
Xoodyak | Init | 1.94 | 3.76 | 6.39 | 10.31 | 14.51 |
Xoodyak | Full | 5.89 | 15.86 | 28.53 | 47.70 | 69.57 |
About 30% of the overhead of the 4-share versions was due to the Arduino Due's TRNG, which produces a new 32-bit random word only every 84 clock cycles; the code frequently had to stall waiting for the next word. On a different CPU with a faster TRNG, the results would be better.
Pyjamask and Spook were designed by the authors with masking in mind as a primary goal and they have been masked according to the authors' recommendations. The other algorithms were not designed with masking as a primary goal.
ISAP and DryGASCON also provide side-channel protection but it is built into the standard design with no masking required. If they were to appear in the above table, all columns would be set to "1.00".
The following table ranks the primary algorithms in increasing order of 4-share overhead:
Algorithm | Degree of Masking | 4 shares |
DryGASCON128 | Full | 1.00 |
ISAP-K-128A | Init/Final | 1.00 |
Pyjamask-128-AEAD | Full | 5.31 |
Spook-128-512-su | Init/Final | 5.43 |
Xoodyak | Init | 6.39 |
KNOT-AEAD-128-256 | Init | 8.09 |
GIMLI-24 | Init | 11.93 |
ASCON-128 | Init/Final | 14.40 |
TinyJAMBU-128 | Full | 14.59 |
SpoC-64 | Full | 20.38 |
SPIX | Init/Final | 23.47 |
Xoodyak | Full | 28.53 |
KNOT-AEAD-128-256 | Full | 32.06 |
ASCON-128 | Full | 33.44 |
GIFT-COFB | Full | 36.84 |
SPIX | Full | 38.13 |
GIMLI-24 | Full | 60.37 |
DryGASCON and ISAP have been included for comparison purposes. The baseline versions of those algorithms implement side-channel protection without the need for random masking, so their "4 shares" value is effectively 1.00.
Pyjamask and Spook are near the head of the table because they were explicitly designed with masking in mind, so their masking overhead is lower than that of the other algorithms. However, the baseline version of Pyjamask is fairly slow in software, so the table above does not give a good picture of actual software performance.
The following table divides the 4-share overhead by the ChaChaPoly ranking from the baseline ARM Cortex M3 performance rankings. This gives an indication of the relative software performance of the masked algorithms, with the fastest at the top of the table (a worked example follows the table):
Algorithm | Degree of Masking | 4 shares | ChaChaPoly | 4 shares / ChaChaPoly |
DryGASCON128 | Full | 1.00 | 0.22 | 4.55 |
Spook-128-512-su | Init/Final | 5.43 | 0.90 | 6.03 |
Xoodyak | Init | 6.39 | 0.86 | 7.43 |
ASCON-128 | Init/Final | 14.40 | 1.11 | 12.97 |
GIMLI-24 | Init | 11.93 | 0.91 | 13.11 |
TinyJAMBU-128 | Full | 14.59 | 0.81 | 18.01 |
KNOT-AEAD-128-256 | Init | 8.09 | 0.38 | 21.29 |
ASCON-128 | Full | 33.44 | 1.11 | 30.13 |
Xoodyak | Full | 28.53 | 0.86 | 33.17 |
GIFT-COFB | Full | 36.84 | 1.05 | 35.09 |
ISAP-K-128A | Init/Final | 1.00 | 0.02 | 50.00 |
SPIX | Init/Final | 23.47 | 0.40 | 58.68 |
SpoC-64 | Full | 20.38 | 0.31 | 65.74 |
GIMLI-24 | Full | 60.37 | 0.91 | 66.34 |
Pyjamask-128-AEAD | Full | 5.31 | 0.07 | 75.86 |
KNOT-AEAD-128-256 | Full | 32.06 | 0.38 | 84.37 |
SPIX | Full | 38.13 | 0.40 | 95.33 |
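For example, the final column for the Xoodyak (Init) row is obtained by dividing its 4-share overhead by its ChaChaPoly figure:

```latex
\frac{6.39}{0.86} \approx 7.43
```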
This section describes some optimisations for masked ciphers that I encountered while writing the above implementations. These tricks may help other implementers of masked algorithms in C and assembly code.
Random number generation can add a lot of overhead to the runtime. Some CPUs offer a built-in TRNG, but it may take many clock cycles to generate each new random word (84 on the Arduino Due). It helps if the RNG calls can be spaced out, with other useful instructions between them, so that less time is spent polling for the next random word.
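One way to do this is to poll the TRNG opportunistically into a small queue between other masking steps, instead of blocking at the exact point a random word is needed. The sketch below is only illustrative: trng_ready() and trng_read() are hypothetical stand-ins for whatever the target CPU actually provides.

```c
#include <stdint.h>
#include <stddef.h>

int trng_ready(void);       /* hypothetical: non-zero if a 32-bit word is ready */
uint32_t trng_read(void);   /* hypothetical: fetch the ready word */

#define RAND_QUEUE_SIZE 8

static uint32_t rand_queue[RAND_QUEUE_SIZE];
static size_t rand_count = 0;

/* Call between other masking steps: drains the TRNG into the queue
 * whenever a word happens to be ready, instead of busy-waiting. */
void rand_poll(void)
{
    if (rand_count < RAND_QUEUE_SIZE && trng_ready())
        rand_queue[rand_count++] = trng_read();
}

/* Fetch the next random word, busy-waiting only if the queue ran dry. */
uint32_t rand_next(void)
{
    if (rand_count == 0) {
        while (!trng_ready())
            ;   /* stall: no useful work left to hide the latency */
        return trng_read();
    }
    return rand_queue[--rand_count];
}
```

Calling rand_poll() after each linear masking step keeps the queue topped up, so the AND steps that actually consume the randomness rarely have to spin.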
Operate on individual shares as much as possible: do everything on share A, then everything on share B, and so on. This reduces the register spills that occur when switching between shares. The AND steps are where this becomes difficult, because all shares are needed at once. Try to group the simpler XOR masking steps before and after the AND steps so that the shares can be operated on independently in most of the code.
Sometimes it can help to operate on the shares in reverse order just before an AND step. The 3-share AND code operates on A, then B, then C; if the preceding steps operated on C, then B, then A, then A and parts of B are already in registers at the start of the AND.
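As a combined illustration of the last two points, here is a minimal sketch of a 3-share masked AND in the ISW style, using the hypothetical rand_next() helper from the earlier sketch. It is not the exact code from the implementations above, and a production version (especially in assembly) would also pin down the evaluation order so the compiler cannot recombine shares.

```c
#include <stdint.h>

uint32_t rand_next(void);   /* hypothetical RNG helper from the earlier sketch */

/* 3-share ISW-style masked AND: computes z = x & y on shared values,
 * where x = x[0] ^ x[1] ^ x[2], and similarly for y and z.  The output
 * shares are assembled one share at a time so that, in hand-written
 * assembly, values left in registers by the preceding linear steps can
 * be reused first. */
void masked_and_3(uint32_t z[3], const uint32_t x[3], const uint32_t y[3])
{
    /* Fresh randomness: one word per share pair, N*(N-1)/2 = 3 for N = 3. */
    uint32_t r01 = rand_next();
    uint32_t r02 = rand_next();
    uint32_t r12 = rand_next();

    /* Cross terms.  The parentheses matter: each partial product is
     * blinded by a random word before the matching product from the
     * other share is folded in. */
    uint32_t r10 = (r01 ^ (x[0] & y[1])) ^ (x[1] & y[0]);
    uint32_t r20 = (r02 ^ (x[0] & y[2])) ^ (x[2] & y[0]);
    uint32_t r21 = (r12 ^ (x[1] & y[2])) ^ (x[2] & y[1]);

    /* Output shares, grouped per share (A, then B, then C). */
    z[0] = (x[0] & y[0]) ^ r01 ^ r02;
    z[1] = (x[1] & y[1]) ^ r10 ^ r12;
    z[2] = (x[2] & y[2]) ^ r20 ^ r21;
}
```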
Also see my page on masking utilities.