Lightweight Cryptography Primitives
Performance of Masked Algorithms


These performance figures are out of date. They are based on previous implementations of the baseline algorithms in plain C. Some of the algorithms have since been rewritten in assembly code but the figures have not been updated yet.


This page lists performance figures for masked implementations of the algorithms. All figures were calculated on an Arduino Due, which is an ARM Cortex M3 running at 84 MHz.

The average performance of the baseline unmasked algorithm is compared with its masked counterpart to produce the figures below. The figures show the amount of overhead. For example, if an algorithm has an overhead of 6.91 for four shares, then that means that the 4-share masked version is on average 6.91 times slower than the baseline unmasked version.

Ideally, a masked algorithm with N shares would have an overhead of roughly N, but this won't normally be the case. Calls to the system random number generator may be slow: the Arduino Due produces a new random number every 84 clock cycles which can introduce delays. As the number of shares increases, the delays due to random number generation become more significant.

Some algorithms can do better than N. Spook, for example, only masks the initialization and finalization steps, with the rest using the regular unmasked code, so its overhead can fall below N. But as N increases, the random number generation overhead becomes more significant.

"Degree of Masking" in the table below indicates how much of the algorithm runtime is masked. A value of "Init/Final" indicates that initialization and finalization tasks that involve the key are masked but everything else uses the baseline unmasked code. A value of "Init" indicates that only initialization is masked. A value of "Full" indicates that every block operation is fully masked.

Performance on ARM Cortex M3

Where a NIST submission contains multiple algorithms in a family, bold italics indicates the primary algorithm in the family. Lower numbers are better.

Algorithm | Degree of Masking | 2 shares | 3 shares | 4 shares | 5 shares | 6 shares

It was observed that about 30% of the overhead of the 4-share version was due to the Arduino Due's TRNG, which produces a new 32-bit random word every 84 clock cycles; the code frequently had to stop and wait for the TRNG. On a different CPU with a faster TRNG, the results would be better.

Pyjamask and Spook were designed by the authors with masking in mind as a primary goal and they have been masked according to the authors' recommendations. The other algorithms were not designed with masking as a primary goal.

ISAP also provides side-channel protection, but it is built into the standard design with no masking required. If it were to appear in the above table, all columns would be set to 1.00.


The following table ranks the primary algorithms in increasing order of 4-share overhead:

Algorithm | Degree of Masking | 4 shares

ISAP has been included for comparison purposes. The baseline version of that algorithm implements side-channel protection without the need for random masking, so the "4 shares" value is effectively 1.00.

The following table divides the 4-share overhead by the ChaChaPoly ranking from the baseline ARM Cortex M3 performance rankings. This gives an indication as to the relative performance of the masked algorithms in software, with the fastest at the top of the table:

Algorithm | Degree of Masking | 4 shares | ChaChaPoly | 4 shares / ChaChaPoly

Optimisation tricks

This section describes some optimisations for masked ciphers that I encountered while writing the above implementations. These tricks may help other implementers of masked algorithms in C and assembly code.

Random number generation can add a lot of overhead to the runtime. Some CPUs offer a built-in TRNG, but it may take many clock cycles to generate each new random word (84 on the Arduino Due). It helps to space out the RNG calls with regular instructions between them, so that less time is spent polling for the next random word.

Operate on individual shares as much as possible: do everything on A, then everything on B, and so on. This reduces the register spills that occur when switching between shares. The AND steps are where this becomes difficult, because all shares are needed at once. Try to group the simpler XOR masking steps before and after the AND steps so that the shares can be operated on independently in most of the code.

Sometimes it can help to operate on the shares in reverse order just before an AND step. The 3-share AND code operates on A, then B, then C. If the previous steps were operating on C, then B, then A, then A and parts of B are already in registers ready for the start of the AND.

Also see my page on masking utilities.