Lightweight Cryptography Primitives
Phase 2 Performance Figures

Introduction

NIST set a cut-off of 18 September 2020 for status updates from the Round 2 candidate submission teams. Since that date, some newer implementations have been contributed by others or written by myself.

This page compares the performance of the original baseline versions with the newer submissions for "Phase 2" of the project. The original performance page has been updated with the new figures.

For phase 2, I am focusing mainly on the 32-bit ARM Cortex M3 microprocessor in the Arduino Due device that I used for previous testing. ESP32 and AVR figures may be included if they provide interesting results.

ASCON and GASCON

The baseline versions of ASCON-128, ASCON-128a, and ASCON-80pq for 32-bit platforms use the 32-bit bit-sliced representation. Plaintext and associated data are converted into bit-sliced form prior to being absorbed by the permutation. Squeezed ciphertext and the tag are converted from bit-sliced form back to regular form on output.
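To illustrate what that conversion involves, here is a loop-based sketch of splitting a 64-bit word into its even-indexed and odd-indexed bits and back again. The function names are hypothetical and the library's real converters use faster SWAR-style bit tricks; this version just makes the transformation explicit:

```c
#include <stdint.h>

/* Split a 64-bit word into two 32-bit words holding its even-indexed
 * and odd-indexed bits (the 32-bit bit-sliced representation). */
static void to_sliced(uint64_t x, uint32_t *even, uint32_t *odd)
{
    uint32_t e = 0, o = 0;
    for (int i = 0; i < 32; ++i) {
        e |= (uint32_t)((x >> (2 * i)) & 1) << i;
        o |= (uint32_t)((x >> (2 * i + 1)) & 1) << i;
    }
    *even = e;
    *odd = o;
}

/* Inverse: re-interleave the two halves into a regular 64-bit word. */
static uint64_t from_sliced(uint32_t even, uint32_t odd)
{
    uint64_t x = 0;
    for (int i = 0; i < 32; ++i) {
        x |= (uint64_t)((even >> i) & 1) << (2 * i);
        x |= (uint64_t)((odd >> i) & 1) << (2 * i + 1);
    }
    return x;
}
```

The payoff on 32-bit platforms is that ASCON's 64-bit rotations by odd amounts become pairs of cheap 32-bit rotations on the two halves.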

The GASCON core function was part of the DryGASCON submission to Round 2. It is identical to ASCON except that the inputs and outputs of the permutation are already in 32-bit bit-sliced form. In their status update, the DryGASCON authors suggested that if DryGASCON is admitted to Round 3, GASCON should be included in Round 3 as well.

Sébastien Riou of the DryGASCON submission team contributed versions of ASCON-128, ASCON-128a, and ASCON-80pq where the GASCON permutation was directly substituted for ASCON. This avoids the need to convert back and forth between bit-sliced form and regular form. After applying his patch and adding a few tweaks of my own, replacing ASCON with GASCON provided between 12% and 17% improvement in performance on ARM Cortex M3:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GASCON-128a | Sébastien Riou | 1.48 | 1.44 | 1.56 | 1.58 | 1.52
ASCON-128a | Baseline | 1.29 | 1.26 | 1.33 | 1.32 | 1.30
GASCON-128 | Sébastien Riou | 1.10 | 1.10 | 1.41 | 1.39 | 1.25
GASCON-80pq | Sébastien Riou | 1.09 | 1.10 | 1.37 | 1.39 | 1.24
ASCON-80pq | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.12
ASCON-128 | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.11

AVR tells a slightly different story: the bit-sliced representation doesn't improve ASCON performance there because the AVR lacks a barrel shifter. The two AVR versions are almost identical except for the diffusion layer.

The diffusion layer of ASCON-AVR operates on 64-bit words whereas the diffusion layer of GASCON-AVR operates on 32-bit words. This can lead to slightly more housekeeping for the 32-bit version to deal with two sets of carry bits during rotations. With some extra loop unrolling or clever register management, it may be possible to improve this, but the same trick would also make ASCON-AVR faster.

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
ASCON-128a | Baseline | 2.79 | 2.57 | 4.10 | 3.91 | 3.02
GASCON-128a | Rhys Weatherley | 2.68 | 2.51 | 3.88 | 3.74 | 2.92
ASCON-80pq | Baseline | 2.05 | 1.95 | 3.65 | 3.52 | 2.36
ASCON-128 | Baseline | 2.05 | 1.95 | 3.65 | 3.50 | 2.36
GASCON-80pq | Rhys Weatherley | 1.99 | 1.91 | 3.48 | 3.38 | 2.29
GASCON-128 | Rhys Weatherley | 1.98 | 1.90 | 3.48 | 3.37 | 2.28

I have recently created an assembly code version of 32-bit bit-sliced ASCON for ARM Cortex M3 and similar microprocessors, which can be found within "src/combined/internal-ascon-arm-cm3.S" in the source tree. The assembly code was generated by the code under "src/genarm". For fairness, I also converted GASCON:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GASCON-128a | Rhys Weatherley | 2.14 | 1.88 | 2.11 | 1.97 | 2.01
ASCON-128a | Rhys Weatherley | 1.86 | 1.70 | 1.80 | 1.78 | 1.78
GASCON-128 | Rhys Weatherley | 1.67 | 1.54 | 2.03 | 1.88 | 1.77
GASCON-80pq | Rhys Weatherley | 1.64 | 1.51 | 1.91 | 1.78 | 1.71
ASCON-128 | Rhys Weatherley | 1.54 | 1.44 | 1.78 | 1.68 | 1.61
ASCON-80pq | Rhys Weatherley | 1.52 | 1.43 | 1.71 | 1.65 | 1.57
ASCON-128a | Baseline | 1.29 | 1.26 | 1.33 | 1.32 | 1.30
ASCON-80pq | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.12
ASCON-128 | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.11

COMET

I implemented ARM Cortex M3 assembly versions of the CHAM-128, CHAM-64, and SPECK-64 block ciphers to accelerate COMET:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
COMET-128_CHAM-128/128 | Rhys Weatherley | 1.57 | 1.56 | 2.91 | 2.69 | 2.05
COMET-64_SPECK-64/128 | Rhys Weatherley | 1.42 | 1.43 | 2.86 | 2.75 | 1.94
COMET-128_CHAM-128/128 | Baseline | 1.22 | 1.25 | 2.20 | 2.08 | 1.61
COMET-64_SPECK-64/128 | Baseline | 1.16 | 1.14 | 2.31 | 2.24 | 1.58
COMET-64_CHAM-64/128 | Rhys Weatherley | 0.70 | 0.75 | 1.35 | 1.37 | 0.97
COMET-64_CHAM-64/128 | Baseline | 0.40 | 0.43 | 0.79 | 0.81 | 0.57

Both CHAM-128 and SPECK-64 are fully unrolled and fit entirely within the ARM's register set. Memory loads and stores are only required during function setup and cleanup. CHAM-64 requires some local stack space to hold part of the key schedule as there aren't enough registers.
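As a sketch of why SPECK-64 unrolls so cleanly, here is a plain C version of the SPECK-64/128 key schedule and round function, using the rotation constants (8, 3) from the SPECK specification. This is an illustration of the structure, not the library's source; the assembly version unrolls the round loop and keeps x, y, and the round keys in registers:

```c
#include <stdint.h>

#define SPECK64_ROUNDS 27

static uint32_t ror32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
static uint32_t rol32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

/* Expand a 128-bit key (four 32-bit words, key[0] = k0) into round keys. */
static void speck64_128_schedule(uint32_t rk[SPECK64_ROUNDS], const uint32_t key[4])
{
    uint32_t l[3] = { key[1], key[2], key[3] };
    rk[0] = key[0];
    for (uint32_t i = 0; i < SPECK64_ROUNDS - 1; ++i) {
        uint32_t t = (rk[i] + ror32(l[i % 3], 8)) ^ i;  /* next l word */
        rk[i + 1] = rol32(rk[i], 3) ^ t;                /* next round key */
        l[i % 3] = t;
    }
}

/* One block encrypt: the round is a single add-rotate-xor pair. */
static void speck64_encrypt(uint32_t *x, uint32_t *y, const uint32_t rk[SPECK64_ROUNDS])
{
    for (int i = 0; i < SPECK64_ROUNDS; ++i) {
        *x = (ror32(*x, 8) + *y) ^ rk[i];
        *y = rol32(*y, 3) ^ *x;
    }
}

/* Inverse of the round function, applied with the keys in reverse. */
static void speck64_decrypt(uint32_t *x, uint32_t *y, const uint32_t rk[SPECK64_ROUNDS])
{
    for (int i = SPECK64_ROUNDS - 1; i >= 0; --i) {
        *y = ror32(*y ^ *x, 3);
        *x = rol32((*x ^ rk[i]) - *y, 8);
    }
}
```

With only two state words, three key-schedule temporaries, and no S-box tables, the whole cipher fits comfortably in the ARM's register set.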

DryGASCON

Sébastien Riou of the DryGASCON submission team contributed ARM Cortex M* assembly code versions of DryGASCON128 with key sizes 16, 32, and 56. The baseline version only had key size 16 and was written in C.

His submission also aligns the "x" words so that the entire "x" array fits within a single cache line in the CPU. This allows him to do away with my complex constant-time method for selecting an "x" word.
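The idea can be sketched in C. The struct layout, sizes, and names below are my illustrative assumptions, not his actual code; the point is that once the whole array shares one cache line, an ordinary indexed load touches the same line no matter which word is selected, so the access pattern leaks nothing through the data cache:

```c
#include <stdint.h>

/* Hypothetical layout: the four 32-bit "x" words packed into a
 * 16-byte array and aligned on a 16-byte boundary so the array can
 * never straddle a cache-line boundary. */
struct dry_x {
    uint32_t w[4];
} __attribute__((aligned(16)));

/* Constant-time selection becomes a plain indexed load; no
 * bit-masking dance over all four words is needed. */
static inline uint32_t select_x(const struct dry_x *x, uint32_t index)
{
    return x->w[index & 3];
}
```

The `__attribute__((aligned(16)))` syntax is GCC/Clang-specific, which matches the ARM toolchains the library targets.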

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
DryGASCON128k32 | Sébastien Riou | 0.59 | 0.62 | 1.05 | 1.04 | 0.79
DryGASCON128k56 | Sébastien Riou | 0.59 | 0.62 | 1.05 | 1.03 | 0.79
DryGASCON128k16 | Sébastien Riou | 0.59 | 0.62 | 1.03 | 1.02 | 0.78
DryGASCON128k16 | Baseline | 0.16 | 0.18 | 0.28 | 0.30 | 0.22

DryGASCON128-HASH shows a similar improvement:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
DryGASCON128-HASH | Sébastien Riou | 0.29 | 0.29 | 0.88 | 0.48
DryGASCON128-HASH | Baseline | 0.08 | 0.07 | 0.25 | 0.13

GIFT-128

I implemented ARM Cortex M3 assembly versions of the GIFT-128 block cipher to support accelerated versions of the ESTATE, GIFT-COFB, HYENA, and SUNDAE-GIFT submissions to Round 2. Versions were created for the "full", "small", and "tiny" variants of GIFT-128. The following figures are for the full fixsliced variant:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GIFT-COFB | Rhys Weatherley | 1.01 | 1.01 | 1.16 | 1.15 | 1.08
GIFT-COFB | Baseline | 1.02 | 1.01 | 1.09 | 1.09 | 1.05
HYENA | Rhys Weatherley | 0.68 | 0.74 | 0.87 | 0.88 | 0.80
SUNDAE-GIFT-0 | Rhys Weatherley | 0.57 | 0.61 | 1.04 | 1.05 | 0.78
SUNDAE-GIFT-0 | Baseline | 0.58 | 0.62 | 1.01 | 1.02 | 0.77
HYENA | Baseline | 0.62 | 0.65 | 0.81 | 0.84 | 0.73
ESTATE_TweGIFT-128 | Rhys Weatherley | 0.53 | 0.57 | 1.04 | 1.04 | 0.74
SUNDAE-GIFT-64 | Rhys Weatherley | 0.54 | 0.58 | 0.84 | 0.86 | 0.69
SUNDAE-GIFT-64 | Baseline | 0.55 | 0.59 | 0.82 | 0.84 | 0.69
SUNDAE-GIFT-96 | Rhys Weatherley | 0.54 | 0.58 | 0.83 | 0.85 | 0.69
SUNDAE-GIFT-96 | Baseline | 0.55 | 0.59 | 0.81 | 0.83 | 0.68
SUNDAE-GIFT-128 | Rhys Weatherley | 0.54 | 0.58 | 0.81 | 0.83 | 0.68
SUNDAE-GIFT-128 | Baseline | 0.54 | 0.59 | 0.79 | 0.82 | 0.67
ESTATE_TweGIFT-128 | Baseline | 0.48 | 0.51 | 0.92 | 0.92 | 0.66

The assembly versions provided a modest improvement in performance, but it wasn't as substantial as for other submissions. The C compiler actually does a pretty good job on my C implementations of the GIFT-128 block cipher.

Further improvements will be investigated later. The GIFT authors' ARM implementations have some other tricks that I haven't implemented yet, such as deferring word rotations from one step to be performed during the following step.

Gimli

I implemented an ARM Cortex M3 assembly version of the GIMLI-24 permutation. The implementation is fully unrolled with the entire state held in registers. The GIMLI-24 AEAD mode shows a 30% improvement over the baseline:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GIMLI-24 | Rhys Weatherley | 1.08 | 1.09 | 1.29 | 1.28 | 1.18
GIMLI-24 | Baseline | 0.84 | 0.85 | 0.97 | 0.98 | 0.91
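For reference, the structure that the assembly unrolls looks like this in C, adapted from the designers' published reference code; treat it as a sketch rather than the library's exact source:

```c
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* The full GIMLI-24 permutation over a 12-word (384-bit) state. */
static void gimli24_permute(uint32_t state[12])
{
    for (unsigned round = 24; round > 0; --round) {
        for (unsigned col = 0; col < 4; ++col) {  /* SP-box on each column */
            uint32_t x = rol32(state[col], 24);
            uint32_t y = rol32(state[4 + col], 9);
            uint32_t z = state[8 + col];
            state[8 + col] = x ^ (z << 1) ^ ((y & z) << 2);
            state[4 + col] = y ^ x ^ ((x | z) << 1);
            state[col]     = z ^ y ^ ((x & y) << 3);
        }
        if ((round & 3) == 0) {          /* small swap + round constant */
            uint32_t t = state[0]; state[0] = state[1]; state[1] = t;
            t = state[2]; state[2] = state[3]; state[3] = t;
            state[0] ^= 0x9e377900U | round;
        } else if ((round & 3) == 2) {   /* big swap */
            uint32_t t = state[0]; state[0] = state[2]; state[2] = t;
            t = state[1]; state[1] = state[3]; state[3] = t;
        }
    }
}
```

Because the state is only 12 words and the SP-box uses shifts and rotates that the Cortex M3 barrel shifter handles for free, the whole permutation can stay in registers.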

Similar improvements were seen for GIMLI-24-HASH:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
GIMLI-24-HASH | Rhys Weatherley | 0.54 | 0.47 | 0.86 | 0.62
GIMLI-24-HASH | Baseline | 0.45 | 0.35 | 0.61 | 0.46

ISAP

Using the ARM Cortex M3 assembly version of ASCON provides an improvement to the performance of ISAP-A:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
ISAP-A-128A | Rhys Weatherley | 0.24 | 0.26 | 0.13 | 0.14 | 0.18
ISAP-A-128A | Baseline | 0.17 | 0.19 | 0.10 | 0.11 | 0.13
ISAP-A-128 | Rhys Weatherley | 0.08 | 0.08 | 0.03 | 0.04 | 0.05
ISAP-A-128 | Baseline | 0.05 | 0.05 | 0.02 | 0.02 | 0.03

Pyjamask

Unrolling the circulant matrix multiplication step of Pyjamask produces a three-fold performance improvement in the C version of the algorithm:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Pyjamask-96-AEAD | Unrolled | 0.22 | 0.25 | 0.25 | 0.27 | 0.25
Pyjamask-128-AEAD | Unrolled | 0.22 | 0.24 | 0.24 | 0.25 | 0.24
Pyjamask-96-AEAD | Baseline | 0.07 | 0.07 | 0.07 | 0.08 | 0.07
Pyjamask-128-AEAD | Baseline | 0.06 | 0.07 | 0.07 | 0.07 | 0.07

Circulant multiplication has two arguments, a matrix "x" and a state word "y", where the matrix is a constant. The reference implementation from the authors rotates the matrix and XORs it into the result wherever there is a 1 bit in the state word:

result ^= x & -((y >> bit) & 1);  /* XOR in the matrix if this state bit is set */
x = rightRotate1(x);              /* rotate the matrix for the next bit */

However, circulant multiplication is commutative, so we can swap the arguments. Because the matrix is a constant, we only need to perform XORs and rotations for the set bits in the matrix and can ignore the unset bits. The matrix values in the standard algorithm have an average of 12 set bits, which reduces the number of XORs and rotations significantly. The resulting implementation is approximately 10 times faster than the baseline version:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Pyjamask-96-AEAD | Reversed Multiplication | 0.66 | 0.67 | 0.81 | 0.83 | 0.74
Pyjamask-128-AEAD | Reversed Multiplication | 0.67 | 0.63 | 0.80 | 0.79 | 0.72
Pyjamask-96-AEAD | Baseline | 0.07 | 0.07 | 0.07 | 0.08 | 0.07
Pyjamask-128-AEAD | Baseline | 0.06 | 0.07 | 0.07 | 0.07 | 0.07
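The swapped form can be sketched as follows. The constant 0x80000005 below is a made-up example matrix word with three set bits, not one of Pyjamask's real constants; with a real 12-set-bit constant the unrolled function is simply 12 XOR-with-rotate operations:

```c
#include <stdint.h>

static uint32_t ror32(uint32_t x, unsigned n)
{
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Baseline form: rotate the matrix word once per state bit,
 * 32 iterations regardless of the matrix contents. */
static uint32_t circulant_ref(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (int bit = 31; bit >= 0; --bit) {
        result ^= x & -((y >> bit) & 1);
        x = ror32(x, 1);
    }
    return result;
}

/* Swapped and unrolled form for the hypothetical constant 0x80000005
 * (set bits 31, 2, and 0): one XOR and rotate per set bit of the
 * matrix, because circulant multiplication is commutative. */
static uint32_t circulant_swapped(uint32_t y)
{
    return y ^ ror32(y, 29) ^ ror32(y, 31);
}
```

On the Cortex M3 each of those terms is a single EOR with a rotated operand, which is where most of the speed-up comes from.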

According to the Pyjamask authors, swapping the arguments to circulant multiplication should have no effect on the algorithm's resistance against power analysis when used in masked form.

I also experimented with ARM Cortex M3 assembly versions of Pyjamask, but there wasn't much difference in performance from the plain C version. So for now I am sticking with the C version.

The AVR version of Pyjamask also shows a significant improvement by swapping the arguments:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Pyjamask-96-AEAD | Rhys Weatherley | 1.47 | 1.45 | 2.13 | 2.11 | 1.64
Pyjamask-128-AEAD | Rhys Weatherley | 1.39 | 1.34 | 1.95 | 1.91 | 1.52
Pyjamask-96-AEAD | Baseline | 0.66 | 0.67 | 0.96 | 0.96 | 0.74
Pyjamask-128-AEAD | Baseline | 0.63 | 0.64 | 0.89 | 0.90 | 0.71

Saturnin

The S-box and MDS steps in Saturnin rotate the words of the state at various points. In a previous change, I made the S-box word rotations implicit within the higher-level round function. In the latest change, I did the same for MDS. Doing this provides a modest improvement in the performance of the C version:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
SATURNIN-Short | Rhys Weatherley | | | 1.82 | 1.66 | 1.73
SATURNIN-Short | Baseline | | | 1.62 | 1.69 | 1.66
SATURNIN-CTR-Cascade | Rhys Weatherley | 0.39 | 0.42 | 0.42 | 0.44 | 0.42
SATURNIN-CTR-Cascade | Baseline | 0.34 | 0.36 | 0.37 | 0.38 | 0.36

Similar performance improvements are also seen for SATURNIN-Hash:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
SATURNIN-Hash | Rhys Weatherley | 0.28 | 0.23 | 0.57 | 0.36
SATURNIN-Hash | Baseline | 0.24 | 0.20 | 0.49 | 0.31

SPARKLE

I implemented fully unrolled ARM Cortex M3 assembly versions of the SPARKLE-256, SPARKLE-384, and SPARKLE-512 permutations. There was up to a 70% improvement in performance in some of the algorithms.

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Schwaemm128-128 | Rhys Weatherley | 1.60 | 1.58 | 2.84 | 2.39 | 2.01
Schwaemm256-128 | Rhys Weatherley | 1.74 | 1.63 | 1.90 | 1.93 | 1.80
Schwaemm192-192 | Rhys Weatherley | 1.47 | 1.50 | 1.98 | 1.81 | 1.68
Schwaemm128-128 | Baseline | 1.17 | 1.15 | 1.93 | 1.80 | 1.46
Schwaemm256-256 | Rhys Weatherley | 1.18 | 1.16 | 1.15 | 1.09 | 1.14
Schwaemm256-128 | Baseline | 1.08 | 1.12 | 1.08 | 1.10 | 1.09
Schwaemm192-192 | Baseline | 0.90 | 0.92 | 1.04 | 1.07 | 0.99
Schwaemm256-256 | Baseline | 0.79 | 0.80 | 0.74 | 0.72 | 0.76

The improvements to hashing performance were even more spectacular:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
Esch256 | Rhys Weatherley | 0.89 | 0.78 | 1.50 | 1.06
Esch384 | Rhys Weatherley | 0.45 | 0.37 | 1.50 | 0.47
Esch256 | Baseline | 0.38 | 0.34 | 0.65 | 0.46
Esch384 | Baseline | 0.26 | 0.21 | 0.33 | 0.26

The SPARKLE-256 and SPARKLE-384 implementations fit entirely within ARM registers, with memory operations only at the start and end of the permutation functions. SPARKLE-512 holds 10 of the 16 state words in registers at a time, and swaps the remaining words between memory and registers as needed in each round.
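The SPARKLE round function is built from the Alzette ARX-box applied to each (x, y) pair of state words. A sketch, using the rotation layers (31, 24), (17, 17), (0, 31), and (24, 16) from the SPARKLE specification; this is my illustration, not the library's exact source:

```c
#include <stdint.h>

static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32 - n));
}

/* One Alzette ARX-box on a pair of state words with round constant c:
 * four add/xor layers, each followed by XORing in the constant. */
static void alzette(uint32_t *x, uint32_t *y, uint32_t c)
{
    *x += ror32(*y, 31); *y ^= ror32(*x, 24); *x ^= c;
    *x += ror32(*y, 17); *y ^= ror32(*x, 17); *x ^= c;
    *x += *y;            *y ^= ror32(*x, 31); *x ^= c;
    *x += ror32(*y, 24); *y ^= ror32(*x, 16); *x ^= c;
}

/* Inverse box, included here just to confirm the step is invertible. */
static void alzette_inv(uint32_t *x, uint32_t *y, uint32_t c)
{
    *x ^= c; *y ^= ror32(*x, 16); *x -= ror32(*y, 24);
    *x ^= c; *y ^= ror32(*x, 31); *x -= *y;
    *x ^= c; *y ^= ror32(*x, 17); *x -= ror32(*y, 17);
    *x ^= c; *y ^= ror32(*x, 24); *x -= ror32(*y, 31);
}
```

Each line maps to a small number of Cortex M3 instructions because the rotations fold into the barrel-shifted operands of the adds and XORs, which is why full unrolling pays off.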

TinyJAMBU

I replaced the common TinyJAMBU permutation function with three separate unrolled versions for 128-bit, 192-bit, and 256-bit key sizes. This provided a modest improvement:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
TinyJAMBU-128 | Rhys Weatherley | 0.71 | 0.74 | 1.27 | 1.27 | 0.94
TinyJAMBU-192 | Rhys Weatherley | 0.63 | 0.67 | 1.14 | 1.16 | 0.85
TinyJAMBU-128 | Baseline | 0.59 | 0.62 | 1.10 | 1.11 | 0.81
TinyJAMBU-256 | Rhys Weatherley | 0.56 | 0.59 | 1.04 | 1.06 | 0.76
TinyJAMBU-192 | Baseline | 0.54 | 0.57 | 1.01 | 1.03 | 0.74
TinyJAMBU-256 | Baseline | 0.49 | 0.52 | 0.94 | 0.96 | 0.68

Improvements were also seen on ESP32 and AVR.
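For reference, one 32-round block of the TinyJAMBU state update looks like this in C, following the 32-bit formulation in the TinyJAMBU specification; the unrolled versions repeat this block with successive key words. This sketch is mine, not the library's exact source:

```c
#include <stdint.h>

/* Process 32 rounds of the TinyJAMBU NLFSR in one step.  The 128-bit
 * state is four 32-bit words; the bit-level taps 47, 70, 85, and 91
 * become the shift/or pairs below. */
static void tinyjambu_steps32(uint32_t s[4], uint32_t key_word)
{
    uint32_t t1 = (s[1] >> 15) | (s[2] << 17);   /* tap at bit 47 */
    uint32_t t2 = (s[2] >> 6)  | (s[3] << 26);   /* tap at bit 70 */
    uint32_t t3 = (s[2] >> 21) | (s[3] << 11);   /* tap at bit 85 */
    uint32_t t4 = (s[2] >> 27) | (s[3] << 5);    /* tap at bit 91 */
    uint32_t feedback = s[0] ^ t1 ^ (~(t2 & t3)) ^ t4 ^ key_word;
    s[0] = s[1];
    s[1] = s[2];
    s[2] = s[3];
    s[3] = feedback;
}

/* Inverse step, included to confirm the update permutes the state. */
static void tinyjambu_steps32_inv(uint32_t s[4], uint32_t key_word)
{
    uint32_t s1 = s[0], s2 = s[1], s3 = s[2];
    uint32_t t1 = (s1 >> 15) | (s2 << 17);
    uint32_t t2 = (s2 >> 6)  | (s3 << 26);
    uint32_t t3 = (s2 >> 21) | (s3 << 11);
    uint32_t t4 = (s2 >> 27) | (s3 << 5);
    uint32_t s0 = s[3] ^ t1 ^ (~(t2 & t3)) ^ t4 ^ key_word;
    s[0] = s0; s[1] = s1; s[2] = s2; s[3] = s3;
}
```

Unrolling per key size removes the `i mod klen` indexing of the key words, which is where the modest improvement in the C versions comes from.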

I then implemented an ARM Cortex M3 assembly code version of all three permutations, which provides between 38% and 50% improvement over the baseline versions:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
TinyJAMBU-128 | Rhys Weatherley | 0.93 | 0.95 | 1.63 | 1.61 | 1.21
TinyJAMBU-192 | Rhys Weatherley | 0.81 | 0.84 | 1.45 | 1.44 | 1.08
TinyJAMBU-256 | Rhys Weatherley | 0.70 | 0.73 | 1.28 | 1.29 | 0.94
TinyJAMBU-128 | Baseline | 0.59 | 0.62 | 1.10 | 1.11 | 0.81
TinyJAMBU-192 | Baseline | 0.54 | 0.57 | 1.01 | 1.03 | 0.74
TinyJAMBU-256 | Baseline | 0.49 | 0.52 | 0.94 | 0.96 | 0.68

The assembly versions of the TinyJAMBU-128 and TinyJAMBU-192 permutations fit entirely within the ARM's register set. The state words and key words are loaded from memory into registers and kept there until the state words are stored back to memory at the end of the permutation function.

TinyJAMBU-256 almost fits entirely within registers. Three of the eight key words need to be loaded from memory each round because there aren't enough registers to keep the entire 256-bit key cached in registers.

The ARM assembly source code can be found in "src/combined/internal-tinyjambu-arm-cm3.S" in the source tree.

Xoodyak

I implemented an ARM Cortex M3 assembly version of the Xoodoo permutation. The implementation is fully unrolled with the entire state held in registers. The Xoodyak AEAD mode almost doubled in speed:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Xoodyak | Rhys Weatherley | 1.66 | 1.51 | 1.73 | 1.60 | 1.62
Xoodyak | Baseline | 0.85 | 0.87 | 0.84 | 0.85 | 0.86

Similar improvements were seen for the hashing mode:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
Xoodyak | Rhys Weatherley | 0.71 | 0.65 | 1.43 | 0.93
Xoodyak | Baseline | 0.38 | 0.35 | 0.79 | 0.51