Lightweight Cryptography Primitives
Phase 2 Performance Figures

Introduction

NIST set a cut-off of 18 September 2020 for status updates from the Round 2 candidate submission teams. Since that date, some newer implementations have been contributed by others or written by myself.

This page compares the performance of the original baseline versions with the newer submissions for "Phase 2" of the project. The original performance page has been updated with the new figures.

For phase 2, I am focusing mainly on the 32-bit ARM Cortex M3 microprocessor in the Arduino Due device that I used for previous testing. ESP32 and AVR figures may be included if they provide interesting results.

ASCON and GASCON

The baseline versions of ASCON-128, ASCON-128a, and ASCON-80pq for 32-bit platforms use the 32-bit bit-sliced representation. Plaintext and associated data are converted into bit-sliced form prior to being absorbed by the permutation. Squeezed ciphertext and the tag are converted from bit-sliced form back to regular form on output.
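To illustrate what that conversion involves, here is a loop-based sketch of splitting a 64-bit word into its even-indexed and odd-indexed bits and back again. The function names are hypothetical and the library's real converters use faster SWAR-style bit tricks; this version just makes the transformation explicit:

```c
#include <stdint.h>

/* Split a 64-bit word into two 32-bit words holding its even-indexed
 * and odd-indexed bits (the 32-bit bit-sliced representation). */
static void to_sliced(uint64_t x, uint32_t *even, uint32_t *odd)
{
    uint32_t e = 0, o = 0;
    for (int i = 0; i < 32; ++i) {
        e |= (uint32_t)((x >> (2 * i)) & 1) << i;
        o |= (uint32_t)((x >> (2 * i + 1)) & 1) << i;
    }
    *even = e;
    *odd = o;
}

/* Inverse: re-interleave the two halves into a regular 64-bit word. */
static uint64_t from_sliced(uint32_t even, uint32_t odd)
{
    uint64_t x = 0;
    for (int i = 0; i < 32; ++i) {
        x |= (uint64_t)((even >> i) & 1) << (2 * i);
        x |= (uint64_t)((odd >> i) & 1) << (2 * i + 1);
    }
    return x;
}
```

The payoff on 32-bit platforms is that ASCON's 64-bit rotations by odd amounts become pairs of cheap 32-bit rotations on the two halves.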

The GASCON core function was part of the DryGASCON submission to Round 2. It is identical to ASCON except that the inputs and outputs of the permutation are already in 32-bit bit-sliced form. In their status update, the DryGASCON authors suggested that if DryGASCON is admitted to Round 3, GASCON should be included in Round 3 as well.

Sébastien Riou of the DryGASCON submission team contributed versions of ASCON-128, ASCON-128a, and ASCON-80pq where the GASCON permutation was directly substituted for ASCON. This avoids the need to convert back and forth between bit-sliced form and regular form. After applying his patch and adding a few tweaks of my own, replacing ASCON with GASCON provided between 12% and 17% improvement in performance on ARM Cortex M3:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GASCON-128a | Sébastien Riou | 1.48 | 1.44 | 1.56 | 1.58 | 1.52
ASCON-128a | Baseline | 1.29 | 1.26 | 1.33 | 1.32 | 1.30
GASCON-128 | Sébastien Riou | 1.10 | 1.10 | 1.41 | 1.39 | 1.25
GASCON-80pq | Sébastien Riou | 1.09 | 1.10 | 1.37 | 1.39 | 1.24
ASCON-80pq | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.12
ASCON-128 | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.11

AVR tells a slightly different story: the bit-sliced representation doesn't improve ASCON performance there because the AVR lacks a barrel shifter. The two AVR versions are almost identical except for the diffusion layer.

The diffusion layer of ASCON-AVR operates on 64-bit words whereas the diffusion layer of GASCON-AVR operates on 32-bit words. This can lead to slightly more housekeeping for the 32-bit version to deal with two sets of carry bits during rotations. With some extra loop unrolling or clever register management, it may be possible to improve this, but the same trick would also make ASCON-AVR faster.

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
ASCON-128a | Baseline | 2.79 | 2.57 | 4.10 | 3.91 | 3.02
GASCON-128a | Rhys Weatherley | 2.68 | 2.51 | 3.88 | 3.74 | 2.92
ASCON-80pq | Baseline | 2.05 | 1.95 | 3.65 | 3.52 | 2.36
ASCON-128 | Baseline | 2.05 | 1.95 | 3.65 | 3.50 | 2.36
GASCON-80pq | Rhys Weatherley | 1.99 | 1.91 | 3.48 | 3.38 | 2.29
GASCON-128 | Rhys Weatherley | 1.98 | 1.90 | 3.48 | 3.37 | 2.28

I have recently created an assembly code version of 32-bit bit-sliced ASCON for ARM Cortex M3 and similar microprocessors, which can be found within "src/combined/internal-ascon-arm-cm3.S" in the source tree. The assembly code was generated by the code under "src/genarm". For fairness, I also converted GASCON:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GASCON-128a | Rhys Weatherley | 2.14 | 1.88 | 2.11 | 1.97 | 2.01
ASCON-128a | Rhys Weatherley | 1.86 | 1.70 | 1.80 | 1.78 | 1.78
GASCON-128 | Rhys Weatherley | 1.67 | 1.54 | 2.03 | 1.88 | 1.77
GASCON-80pq | Rhys Weatherley | 1.64 | 1.51 | 1.91 | 1.78 | 1.71
ASCON-128 | Rhys Weatherley | 1.54 | 1.44 | 1.78 | 1.68 | 1.61
ASCON-80pq | Rhys Weatherley | 1.52 | 1.43 | 1.71 | 1.65 | 1.57
ASCON-128a | Baseline | 1.29 | 1.26 | 1.33 | 1.32 | 1.30
ASCON-80pq | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.12
ASCON-128 | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.11

COMET

I implemented ARM Cortex M3 assembly versions of the CHAM-128, CHAM-64, and SPECK-64 block ciphers to accelerate COMET:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
COMET-128_CHAM-128/128 | Rhys Weatherley | 1.57 | 1.56 | 2.91 | 2.69 | 2.05
COMET-64_SPECK-64/128 | Rhys Weatherley | 1.42 | 1.43 | 2.86 | 2.75 | 1.94
COMET-128_CHAM-128/128 | Baseline | 1.22 | 1.25 | 2.20 | 2.08 | 1.61
COMET-64_SPECK-64/128 | Baseline | 1.16 | 1.14 | 2.31 | 2.24 | 1.58
COMET-64_CHAM-64/128 | Rhys Weatherley | 0.70 | 0.75 | 1.35 | 1.37 | 0.97
COMET-64_CHAM-64/128 | Baseline | 0.40 | 0.43 | 0.79 | 0.81 | 0.57

Both CHAM-128 and SPECK-64 are fully unrolled and fit entirely within the ARM's register set. Memory loads and stores are only required during function setup and cleanup. CHAM-64 requires some local stack space to hold part of the key schedule as there aren't enough registers.
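As a sketch of why SPECK-64 unrolls so cleanly, here is a plain C version of the SPECK-64/128 key schedule and round function, using the rotation constants (8, 3) from the SPECK specification. This is an illustration of the structure, not the library's source; the assembly version unrolls the round loop and keeps x, y, and the round keys in registers:

```c
#include <stdint.h>

#define SPECK64_ROUNDS 27

static uint32_t ror32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
static uint32_t rol32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

/* Expand a 128-bit key (four 32-bit words, key[0] = k0) into round keys. */
static void speck64_128_schedule(uint32_t rk[SPECK64_ROUNDS], const uint32_t key[4])
{
    uint32_t l[3] = { key[1], key[2], key[3] };
    rk[0] = key[0];
    for (uint32_t i = 0; i < SPECK64_ROUNDS - 1; ++i) {
        uint32_t t = (rk[i] + ror32(l[i % 3], 8)) ^ i;  /* next l word */
        rk[i + 1] = rol32(rk[i], 3) ^ t;                /* next round key */
        l[i % 3] = t;
    }
}

/* One block encrypt: the round is a single add-rotate-xor pair. */
static void speck64_encrypt(uint32_t *x, uint32_t *y, const uint32_t rk[SPECK64_ROUNDS])
{
    for (int i = 0; i < SPECK64_ROUNDS; ++i) {
        *x = (ror32(*x, 8) + *y) ^ rk[i];
        *y = rol32(*y, 3) ^ *x;
    }
}

/* Inverse of the round function, applied with the keys in reverse. */
static void speck64_decrypt(uint32_t *x, uint32_t *y, const uint32_t rk[SPECK64_ROUNDS])
{
    for (int i = SPECK64_ROUNDS - 1; i >= 0; --i) {
        *y = ror32(*y ^ *x, 3);
        *x = rol32((*x ^ rk[i]) - *y, 8);
    }
}
```

With only two state words, three key-schedule temporaries, and no S-box tables, the whole cipher fits comfortably in the ARM's register set.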

DryGASCON

Sébastien Riou of the DryGASCON submission team contributed ARM Cortex M* assembly code versions of DryGASCON128 with key sizes 16, 32, and 56. The baseline version only had key size 16 and was written in C.

His submission also aligns the "x" words so that the entire "x" array fits within a single cache line in the CPU. This allows him to do away with my complex constant-time method for selecting an "x" word.
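The idea can be sketched in C. The struct layout, sizes, and names below are my illustrative assumptions, not his actual code; the point is that once the whole array shares one cache line, an ordinary indexed load touches the same line no matter which word is selected, so the access pattern leaks nothing through the data cache:

```c
#include <stdint.h>

/* Hypothetical layout: the four 32-bit "x" words packed into a
 * 16-byte array and aligned on a 16-byte boundary so the array can
 * never straddle a cache-line boundary. */
struct dry_x {
    uint32_t w[4];
} __attribute__((aligned(16)));

/* Constant-time selection becomes a plain indexed load; no
 * bit-masking dance over all four words is needed. */
static inline uint32_t select_x(const struct dry_x *x, uint32_t index)
{
    return x->w[index & 3];
}
```

The `__attribute__((aligned(16)))` syntax is GCC/Clang-specific, which matches the ARM toolchains the library targets.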

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
DryGASCON128k32 | Sébastien Riou | 0.59 | 0.62 | 1.05 | 1.04 | 0.79
DryGASCON128k56 | Sébastien Riou | 0.59 | 0.62 | 1.05 | 1.03 | 0.79
DryGASCON128k16 | Sébastien Riou | 0.59 | 0.62 | 1.03 | 1.02 | 0.78
DryGASCON128k16 | Baseline | 0.16 | 0.18 | 0.28 | 0.30 | 0.22

DryGASCON128-HASH shows a similar improvement:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
DryGASCON128-HASH | Sébastien Riou | 0.29 | 0.29 | 0.88 | 0.48
DryGASCON128-HASH | Baseline | 0.08 | 0.07 | 0.25 | 0.13

GIFT-128

I implemented ARM Cortex M3 assembly versions of the GIFT-128 block cipher to support accelerated versions of the ESTATE, GIFT-COFB, HYENA, and SUNDAE-GIFT submissions to Round 2. Versions were created for the "full", "small", and "tiny" variants of GIFT-128. The following figures are for the full fixsliced variant:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GIFT-COFB | Rhys Weatherley | 1.01 | 1.01 | 1.16 | 1.15 | 1.08
GIFT-COFB | Baseline | 1.02 | 1.01 | 1.09 | 1.09 | 1.05
HYENA | Rhys Weatherley | 0.68 | 0.74 | 0.87 | 0.88 | 0.80
SUNDAE-GIFT-0 | Rhys Weatherley | 0.57 | 0.61 | 1.04 | 1.05 | 0.78
SUNDAE-GIFT-0 | Baseline | 0.58 | 0.62 | 1.01 | 1.02 | 0.77
HYENA | Baseline | 0.62 | 0.65 | 0.81 | 0.84 | 0.73
ESTATE_TweGIFT-128 | Rhys Weatherley | 0.53 | 0.57 | 1.04 | 1.04 | 0.74
SUNDAE-GIFT-64 | Rhys Weatherley | 0.54 | 0.58 | 0.84 | 0.86 | 0.69
SUNDAE-GIFT-64 | Baseline | 0.55 | 0.59 | 0.82 | 0.84 | 0.69
SUNDAE-GIFT-96 | Rhys Weatherley | 0.54 | 0.58 | 0.83 | 0.85 | 0.69
SUNDAE-GIFT-96 | Baseline | 0.55 | 0.59 | 0.81 | 0.83 | 0.68
SUNDAE-GIFT-128 | Rhys Weatherley | 0.54 | 0.58 | 0.81 | 0.83 | 0.68
SUNDAE-GIFT-128 | Baseline | 0.54 | 0.59 | 0.79 | 0.82 | 0.67
ESTATE_TweGIFT-128 | Baseline | 0.48 | 0.51 | 0.92 | 0.92 | 0.66

The assembly versions provided a modest improvement in performance, but it wasn't as substantial as for other submissions. The C compiler actually does a pretty good job on my C implementations of the GIFT-128 block cipher.

Further improvements will be investigated later. The GIFT authors' ARM implementations have some other tricks that I haven't implemented yet, such as deferring word rotations from one step to be performed during the following step.

Gimli

I implemented an ARM Cortex M3 assembly version of the GIMLI-24 permutation. The implementation is fully unrolled with the entire state held in registers. The GIMLI-24 AEAD mode shows a 30% improvement over the baseline:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
GIMLI-24 | Rhys Weatherley | 1.08 | 1.09 | 1.29 | 1.28 | 1.18
GIMLI-24 | Baseline | 0.84 | 0.85 | 0.97 | 0.98 | 0.91
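For reference, the structure that the assembly unrolls looks like this in C, adapted from the designers' published reference code; treat it as a sketch rather than the library's exact source:

```c
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* The full GIMLI-24 permutation over a 12-word (384-bit) state. */
static void gimli24_permute(uint32_t state[12])
{
    for (unsigned round = 24; round > 0; --round) {
        for (unsigned col = 0; col < 4; ++col) {  /* SP-box on each column */
            uint32_t x = rol32(state[col], 24);
            uint32_t y = rol32(state[4 + col], 9);
            uint32_t z = state[8 + col];
            state[8 + col] = x ^ (z << 1) ^ ((y & z) << 2);
            state[4 + col] = y ^ x ^ ((x | z) << 1);
            state[col]     = z ^ y ^ ((x & y) << 3);
        }
        if ((round & 3) == 0) {          /* small swap + round constant */
            uint32_t t = state[0]; state[0] = state[1]; state[1] = t;
            t = state[2]; state[2] = state[3]; state[3] = t;
            state[0] ^= 0x9e377900U | round;
        } else if ((round & 3) == 2) {   /* big swap */
            uint32_t t = state[0]; state[0] = state[2]; state[2] = t;
            t = state[1]; state[1] = state[3]; state[3] = t;
        }
    }
}
```

Because the state is only 12 words and the SP-box uses shifts and rotates that the Cortex M3 barrel shifter handles for free, the whole permutation can stay in registers.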

Similar improvements were seen for GIMLI-24-HASH:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
GIMLI-24-HASH | Rhys Weatherley | 0.54 | 0.47 | 0.86 | 0.62
GIMLI-24-HASH | Baseline | 0.45 | 0.35 | 0.61 | 0.46

ISAP

Using the ARM Cortex M3 assembly version of ASCON provides an improvement to the performance of ISAP-A:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
ISAP-A-128A | Rhys Weatherley | 0.24 | 0.26 | 0.13 | 0.14 | 0.18
ISAP-A-128A | Baseline | 0.17 | 0.19 | 0.10 | 0.11 | 0.13
ISAP-A-128 | Rhys Weatherley | 0.08 | 0.08 | 0.03 | 0.04 | 0.05
ISAP-A-128 | Baseline | 0.05 | 0.05 | 0.02 | 0.02 | 0.03

Pyjamask

Unrolling the circulant matrix multiplication step of Pyjamask produces a three-fold performance improvement in the C version of the algorithm:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Pyjamask-96-AEAD | Unrolled | 0.22 | 0.25 | 0.25 | 0.27 | 0.25
Pyjamask-128-AEAD | Unrolled | 0.22 | 0.24 | 0.24 | 0.25 | 0.24
Pyjamask-96-AEAD | Baseline | 0.07 | 0.07 | 0.07 | 0.08 | 0.07
Pyjamask-128-AEAD | Baseline | 0.06 | 0.07 | 0.07 | 0.07 | 0.07

Circulant multiplication has two arguments, a matrix "x" and a state word "y", where the matrix is a constant. The reference implementation from the authors rotates the matrix and XORs it into the result wherever there is a 1 bit in the state word:

result ^= x & -((y >> bit) & 1);  /* XOR in the matrix if this state bit is set */
x = rightRotate1(x);              /* rotate the matrix for the next bit */

However, circulant multiplication is commutative, so we can swap the arguments. Because the matrix is a constant, we only need to perform XORs and rotations for the set bits in the matrix and can ignore the unset bits. The matrix values in the standard algorithm have an average of 12 set bits, which reduces the number of XORs and rotations significantly. The resulting implementation is approximately 10 times faster than the baseline version:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Pyjamask-96-AEAD | Reversed Multiplication | 0.66 | 0.67 | 0.81 | 0.83 | 0.74
Pyjamask-128-AEAD | Reversed Multiplication | 0.67 | 0.63 | 0.80 | 0.79 | 0.72
Pyjamask-96-AEAD | Baseline | 0.07 | 0.07 | 0.07 | 0.08 | 0.07
Pyjamask-128-AEAD | Baseline | 0.06 | 0.07 | 0.07 | 0.07 | 0.07
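The swapped form can be sketched as follows. The constant 0x80000005 below is a made-up example matrix word with three set bits, not one of Pyjamask's real constants; with a real 12-set-bit constant the unrolled function is simply 12 XOR-with-rotate operations:

```c
#include <stdint.h>

static uint32_t ror32(uint32_t x, unsigned n)
{
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Baseline form: rotate the matrix word once per state bit,
 * 32 iterations regardless of the matrix contents. */
static uint32_t circulant_ref(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (int bit = 31; bit >= 0; --bit) {
        result ^= x & -((y >> bit) & 1);
        x = ror32(x, 1);
    }
    return result;
}

/* Swapped and unrolled form for the hypothetical constant 0x80000005
 * (set bits 31, 2, and 0): one XOR and rotate per set bit of the
 * matrix, because circulant multiplication is commutative. */
static uint32_t circulant_swapped(uint32_t y)
{
    return y ^ ror32(y, 29) ^ ror32(y, 31);
}
```

On the Cortex M3 each of those terms is a single EOR with a rotated operand, which is where most of the speed-up comes from.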

According to the Pyjamask authors, swapping the arguments to circulant multiplication should have no effect on the algorithm's resistance against power analysis when used in masked form.

I also experimented with ARM Cortex M3 assembly versions of Pyjamask, but there wasn't much difference in performance from the plain C version. So for now I am sticking with the C version.

The AVR version of Pyjamask also shows a significant improvement by swapping the arguments:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Pyjamask-96-AEAD | Rhys Weatherley | 1.47 | 1.45 | 2.13 | 2.11 | 1.64
Pyjamask-128-AEAD | Rhys Weatherley | 1.39 | 1.34 | 1.95 | 1.91 | 1.52
Pyjamask-96-AEAD | Baseline | 0.66 | 0.67 | 0.96 | 0.96 | 0.74
Pyjamask-128-AEAD | Baseline | 0.63 | 0.64 | 0.89 | 0.90 | 0.71

Saturnin

The S-box and MDS steps in Saturnin rotate the words of the state at various points. In a previous change, I made the S-box word rotations implicit within the higher-level round function. In the latest change, I did the same for MDS. Doing this provides a modest improvement in the performance of the C version:

Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
SATURNIN-Short | Rhys Weatherley | | | 1.82 | 1.66 | 1.73
SATURNIN-Short | Baseline | | | 1.62 | 1.69 | 1.66
SATURNIN-CTR-Cascade | Rhys Weatherley | 0.39 | 0.42 | 0.42 | 0.44 | 0.42
SATURNIN-CTR-Cascade | Baseline | 0.34 | 0.36 | 0.37 | 0.38 | 0.36

Similar performance improvements are also seen for SATURNIN-Hash:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
SATURNIN-Hash | Rhys Weatherley | 0.28 | 0.23 | 0.57 | 0.36
SATURNIN-Hash | Baseline | 0.24 | 0.20 | 0.49 | 0.31

SPARKLE

I implemented fully unrolled ARM Cortex M3 assembly versions of the SPARKLE-256, SPARKLE-384, and SPARKLE-512 permutations. There was up to a 70% improvement in performance in some of the algorithms.

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Schwaemm128-128 | Rhys Weatherley | 1.60 | 1.58 | 2.84 | 2.39 | 2.01
Schwaemm256-128 | Rhys Weatherley | 1.74 | 1.63 | 1.90 | 1.93 | 1.80
Schwaemm192-192 | Rhys Weatherley | 1.47 | 1.50 | 1.98 | 1.81 | 1.68
Schwaemm128-128 | Baseline | 1.17 | 1.15 | 1.93 | 1.80 | 1.46
Schwaemm256-256 | Rhys Weatherley | 1.18 | 1.16 | 1.15 | 1.09 | 1.14
Schwaemm256-128 | Baseline | 1.08 | 1.12 | 1.08 | 1.10 | 1.09
Schwaemm192-192 | Baseline | 0.90 | 0.92 | 1.04 | 1.07 | 0.99
Schwaemm256-256 | Baseline | 0.79 | 0.80 | 0.74 | 0.72 | 0.76

The improvements to hashing performance were even more spectacular:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
Esch256 | Rhys Weatherley | 0.89 | 0.78 | 1.50 | 1.06
Esch384 | Rhys Weatherley | 0.45 | 0.37 | 1.50 | 0.47
Esch256 | Baseline | 0.38 | 0.34 | 0.65 | 0.46
Esch384 | Baseline | 0.26 | 0.21 | 0.33 | 0.26

The SPARKLE-256 and SPARKLE-384 implementations fit entirely within ARM registers, with memory operations only at the start and end of the permutation functions. SPARKLE-512 holds 10 of the 16 state words in registers at a time, and swaps the remaining words between memory and registers as needed in each round.
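The SPARKLE round function is built from the Alzette ARX-box applied to each (x, y) pair of state words. A sketch, using the rotation layers (31, 24), (17, 17), (0, 31), and (24, 16) from the SPARKLE specification; this is my illustration, not the library's exact source:

```c
#include <stdint.h>

static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32 - n));
}

/* One Alzette ARX-box on a pair of state words with round constant c:
 * four add/xor layers, each followed by XORing in the constant. */
static void alzette(uint32_t *x, uint32_t *y, uint32_t c)
{
    *x += ror32(*y, 31); *y ^= ror32(*x, 24); *x ^= c;
    *x += ror32(*y, 17); *y ^= ror32(*x, 17); *x ^= c;
    *x += *y;            *y ^= ror32(*x, 31); *x ^= c;
    *x += ror32(*y, 24); *y ^= ror32(*x, 16); *x ^= c;
}

/* Inverse box, included here just to confirm the step is invertible. */
static void alzette_inv(uint32_t *x, uint32_t *y, uint32_t c)
{
    *x ^= c; *y ^= ror32(*x, 16); *x -= ror32(*y, 24);
    *x ^= c; *y ^= ror32(*x, 31); *x -= *y;
    *x ^= c; *y ^= ror32(*x, 17); *x -= ror32(*y, 17);
    *x ^= c; *y ^= ror32(*x, 24); *x -= ror32(*y, 31);
}
```

Each line maps to a small number of Cortex M3 instructions because the rotations fold into the barrel-shifted operands of the adds and XORs, which is why full unrolling pays off.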

TinyJAMBU

I replaced the common TinyJAMBU permutation function with three separate unrolled versions for 128-bit, 192-bit, and 256-bit key sizes. This provided a modest improvement:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
TinyJAMBU-128 | Rhys Weatherley | 0.71 | 0.74 | 1.27 | 1.27 | 0.94
TinyJAMBU-192 | Rhys Weatherley | 0.63 | 0.67 | 1.14 | 1.16 | 0.85
TinyJAMBU-128 | Baseline | 0.59 | 0.62 | 1.10 | 1.11 | 0.81
TinyJAMBU-256 | Rhys Weatherley | 0.56 | 0.59 | 1.04 | 1.06 | 0.76
TinyJAMBU-192 | Baseline | 0.54 | 0.57 | 1.01 | 1.03 | 0.74
TinyJAMBU-256 | Baseline | 0.49 | 0.52 | 0.94 | 0.96 | 0.68

Improvements were also seen on ESP32 and AVR.
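For reference, one 32-round block of the TinyJAMBU state update looks like this in C, following the 32-bit formulation in the TinyJAMBU specification; the unrolled versions repeat this block with successive key words. This sketch is mine, not the library's exact source:

```c
#include <stdint.h>

/* Process 32 rounds of the TinyJAMBU NLFSR in one step.  The 128-bit
 * state is four 32-bit words; the bit-level taps 47, 70, 85, and 91
 * become the shift/or pairs below. */
static void tinyjambu_steps32(uint32_t s[4], uint32_t key_word)
{
    uint32_t t1 = (s[1] >> 15) | (s[2] << 17);   /* tap at bit 47 */
    uint32_t t2 = (s[2] >> 6)  | (s[3] << 26);   /* tap at bit 70 */
    uint32_t t3 = (s[2] >> 21) | (s[3] << 11);   /* tap at bit 85 */
    uint32_t t4 = (s[2] >> 27) | (s[3] << 5);    /* tap at bit 91 */
    uint32_t feedback = s[0] ^ t1 ^ (~(t2 & t3)) ^ t4 ^ key_word;
    s[0] = s[1];
    s[1] = s[2];
    s[2] = s[3];
    s[3] = feedback;
}

/* Inverse step, included to confirm the update permutes the state. */
static void tinyjambu_steps32_inv(uint32_t s[4], uint32_t key_word)
{
    uint32_t s1 = s[0], s2 = s[1], s3 = s[2];
    uint32_t t1 = (s1 >> 15) | (s2 << 17);
    uint32_t t2 = (s2 >> 6)  | (s3 << 26);
    uint32_t t3 = (s2 >> 21) | (s3 << 11);
    uint32_t t4 = (s2 >> 27) | (s3 << 5);
    uint32_t s0 = s[3] ^ t1 ^ (~(t2 & t3)) ^ t4 ^ key_word;
    s[0] = s0; s[1] = s1; s[2] = s2; s[3] = s3;
}
```

Unrolling per key size removes the `i mod klen` indexing of the key words, which is where the modest improvement in the C versions comes from.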

I then implemented an ARM Cortex M3 assembly code version of all three permutations, which provides between 38% and 50% improvement over the baseline versions:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
TinyJAMBU-128 | Rhys Weatherley | 0.93 | 0.95 | 1.63 | 1.61 | 1.21
TinyJAMBU-192 | Rhys Weatherley | 0.81 | 0.84 | 1.45 | 1.44 | 1.08
TinyJAMBU-256 | Rhys Weatherley | 0.70 | 0.73 | 1.28 | 1.29 | 0.94
TinyJAMBU-128 | Baseline | 0.59 | 0.62 | 1.10 | 1.11 | 0.81
TinyJAMBU-192 | Baseline | 0.54 | 0.57 | 1.01 | 1.03 | 0.74
TinyJAMBU-256 | Baseline | 0.49 | 0.52 | 0.94 | 0.96 | 0.68

The assembly versions of the TinyJAMBU-128 and TinyJAMBU-192 permutations fit entirely within the ARM's register set. The state words and key words are loaded from memory into registers and kept there until the state words are stored back to memory at the end of the permutation function.

TinyJAMBU-256 almost fits entirely within registers. Three of the eight key words need to be loaded from memory each round because there aren't enough registers to keep the entire 256-bit key cached in registers.

The ARM assembly source code can be found in "src/combined/internal-tinyjambu-arm-cm3.S" in the source tree.

Xoodyak

I implemented an ARM Cortex M3 assembly version of the Xoodoo permutation. The implementation is fully unrolled with the entire state held in registers. The Xoodyak AEAD mode almost doubled in speed:

Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average
Xoodyak | Rhys Weatherley | 1.66 | 1.51 | 1.73 | 1.60 | 1.62
Xoodyak | Baseline | 0.85 | 0.87 | 0.84 | 0.85 | 0.86

Similar improvements were seen for the hashing mode:

Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average
Xoodyak | Rhys Weatherley | 0.71 | 0.65 | 1.43 | 0.93
Xoodyak | Baseline | 0.38 | 0.35 | 0.79 | 0.51