Lightweight Cryptography Primitives
NIST set a cut-off of 18 September 2020 for status updates from the Round 2 candidate submission teams. Since that date, some newer implementations have been contributed by others or written by myself.
This page compares the performance of the original baseline versions with the newer submissions for "Phase 2" of the project. The original performance page has been updated with the new figures.
For phase 2, I am focusing mainly on the 32-bit ARM Cortex M3 microprocessor in the Arduino Due device that I used for previous testing. ESP32 and AVR figures may be included if they provide interesting results.
The baseline versions of ASCON-128, ASCON-128a, and ASCON-80pq for 32-bit platforms use the 32-bit bit-sliced representation. Plaintext and associated data are converted into bit-sliced form prior to being absorbed by the permutation. Squeezed ciphertext and the tag are converted from bit-sliced form to regular form on output.
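The conversion described above can be sketched in C as follows. This is a minimal sketch, assuming that the bit-sliced form places the even-numbered bits of each 64-bit word in one 32-bit word and the odd-numbered bits in another. The names `sliced64_t`, `to_sliced`, `from_sliced`, and `sliced_rotr` are hypothetical and are not taken from the source tree.

```c
#include <stdint.h>

/* Rotate right helpers; the count is reduced so a shift by the full
 * word width never happens. */
static uint32_t rotr32(uint32_t v, unsigned n)
{
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}

static uint64_t rotr64(uint64_t v, unsigned n)
{
    n &= 63;
    return n ? (v >> n) | (v << (64 - n)) : v;
}

/* Hypothetical container for one 64-bit word in bit-sliced form. */
typedef struct {
    uint32_t even; /* bits 0, 2, 4, ..., 62 of the regular word */
    uint32_t odd;  /* bits 1, 3, 5, ..., 63 of the regular word */
} sliced64_t;

/* Gather the even-numbered bits of x into the low 32 bits. */
static uint32_t gather_even_bits(uint64_t x)
{
    x &= 0x5555555555555555ULL;
    x = (x | (x >> 1)) & 0x3333333333333333ULL;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFULL;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFULL;
    return (uint32_t)x;
}

/* Spread the 32 bits of v out to the even-numbered bit positions. */
static uint64_t spread_to_even_bits(uint32_t v)
{
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8)) & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2)) & 0x3333333333333333ULL;
    x = (x | (x << 1)) & 0x5555555555555555ULL;
    return x;
}

/* Regular form to bit-sliced form (done on input). */
static sliced64_t to_sliced(uint64_t x)
{
    sliced64_t s;
    s.even = gather_even_bits(x);
    s.odd = gather_even_bits(x >> 1);
    return s;
}

/* Bit-sliced form back to regular form (done on output). */
static uint64_t from_sliced(sliced64_t s)
{
    return spread_to_even_bits(s.even) | (spread_to_even_bits(s.odd) << 1);
}

/* Rotate a bit-sliced 64-bit word right by n bits (0..63) using only
 * 32-bit rotations. An odd count also swaps the two halves. */
static sliced64_t sliced_rotr(sliced64_t s, unsigned n)
{
    sliced64_t r;
    unsigned k = n >> 1;
    if (n & 1) {
        r.even = rotr32(s.odd, k);
        r.odd = rotr32(s.even, k + 1);
    } else {
        r.even = rotr32(s.even, k);
        r.odd = rotr32(s.odd, k);
    }
    return r;
}
```

The point of the representation is visible in `sliced_rotr`: every 64-bit rotation in the diffusion layer turns into two 32-bit rotations, which a CPU with a barrel shifter can fold into other instructions. The price is the `to_sliced` and `from_sliced` work on every block of input and output.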
The GASCON core function was part of the DryGASCON submission to Round 2. It is identical to ASCON except that the inputs and outputs of the permutation are already in 32-bit bit-sliced form. In their status update, the authors suggested that GASCON also be included in Round 3 if DryGASCON is admitted.
Sébastien Riou of the DryGASCON submission team contributed versions of ASCON-128, ASCON-128a, and ASCON-80pq where the GASCON permutation was directly substituted for ASCON. This avoids the need to convert back and forth between bit-sliced form and regular form. After applying his patch and adding a few tweaks of my own, replacing ASCON with GASCON provided between 12% and 17% improvement in performance on ARM Cortex M3:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
GASCON-128a | Sébastien Riou | 1.48 | 1.44 | 1.56 | 1.58 | 1.52 |
ASCON-128a | Baseline | 1.29 | 1.26 | 1.33 | 1.32 | 1.30 |
GASCON-128 | Sébastien Riou | 1.10 | 1.10 | 1.41 | 1.39 | 1.25 |
GASCON-80pq | Sébastien Riou | 1.09 | 1.10 | 1.37 | 1.39 | 1.24 |
ASCON-80pq | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.12 |
ASCON-128 | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.11 |
AVR tells a slightly different story because the bit-sliced representation doesn't help improve ASCON performance on AVR due to the lack of a barrel shifter. The two AVR versions are almost identical except for the diffusion layer.
The diffusion layer of ASCON-AVR operates on 64-bit words whereas the diffusion layer of GASCON-AVR operates on 32-bit words. This can lead to slightly more housekeeping for the 32-bit version to deal with two sets of carry bits during rotations. With some extra loop unrolling or clever register management, it may be possible to improve this, but the same trick would also make ASCON-AVR faster.
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
ASCON-128a | Baseline | 2.79 | 2.57 | 4.10 | 3.91 | 3.02 |
GASCON-128a | Rhys Weatherley | 2.68 | 2.51 | 3.88 | 3.74 | 2.92 |
ASCON-80pq | Baseline | 2.05 | 1.95 | 3.65 | 3.52 | 2.36 |
ASCON-128 | Baseline | 2.05 | 1.95 | 3.65 | 3.50 | 2.36 |
GASCON-80pq | Rhys Weatherley | 1.99 | 1.91 | 3.48 | 3.38 | 2.29 |
GASCON-128 | Rhys Weatherley | 1.98 | 1.90 | 3.48 | 3.37 | 2.28 |
I have recently created an assembly code version of 32-bit bit-sliced ASCON for ARM Cortex M3 and similar microprocessors, which can be found within "src/combined/internal-ascon-arm-cm3.S" in the source tree. The assembly code was generated by the code under "src/genarm". For fairness, I also converted GASCON:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
GASCON-128a | Rhys Weatherley | 2.14 | 1.88 | 2.11 | 1.97 | 2.01 |
ASCON-128a | Rhys Weatherley | 1.86 | 1.70 | 1.80 | 1.78 | 1.78 |
GASCON-128 | Rhys Weatherley | 1.67 | 1.54 | 2.03 | 1.88 | 1.77 |
GASCON-80pq | Rhys Weatherley | 1.64 | 1.51 | 1.91 | 1.78 | 1.71 |
ASCON-128 | Rhys Weatherley | 1.54 | 1.44 | 1.78 | 1.68 | 1.61 |
ASCON-80pq | Rhys Weatherley | 1.52 | 1.43 | 1.71 | 1.65 | 1.57 |
ASCON-128a | Baseline | 1.29 | 1.26 | 1.33 | 1.32 | 1.30 |
ASCON-80pq | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.12 |
ASCON-128 | Baseline | 0.99 | 1.02 | 1.22 | 1.23 | 1.11 |
I implemented ARM Cortex M3 assembly versions of the CHAM-128, CHAM-64, and SPECK-64 block ciphers to accelerate COMET:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
COMET-128_CHAM-128/128 | Rhys Weatherley | 1.57 | 1.56 | 2.91 | 2.69 | 2.05 |
COMET-64_SPECK-64/128 | Rhys Weatherley | 1.42 | 1.43 | 2.86 | 2.75 | 1.94 |
COMET-128_CHAM-128/128 | Baseline | 1.22 | 1.25 | 2.20 | 2.08 | 1.61 |
COMET-64_SPECK-64/128 | Baseline | 1.16 | 1.14 | 2.31 | 2.24 | 1.58 |
COMET-64_CHAM-64/128 | Rhys Weatherley | 0.70 | 0.75 | 1.35 | 1.37 | 0.97 |
COMET-64_CHAM-64/128 | Baseline | 0.40 | 0.43 | 0.79 | 0.81 | 0.57 |
Both CHAM-128 and SPECK-64 are fully unrolled and fit entirely within the ARM's register set. Memory loads and stores are only required during function setup and cleanup. CHAM-64 requires some local stack space to hold part of the key schedule as there aren't enough registers.
Sébastien Riou of the DryGASCON submission team contributed ARM Cortex M* assembly code versions of DryGASCON128 with key sizes 16, 32, and 56. The baseline version only had key size 16 and was written in C.
His submission also aligns the "x" words so that the entire "x" array fits within a single cache line in the CPU. This allows him to do away with my complex constant-time method for selecting an "x" word.
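For context, a constant-time selection of the kind that the cache line alignment makes unnecessary can be sketched as follows. This is only an illustration of the general masking technique, not the code from the baseline version. The function name `select_word_ct` and the assumption that "x" consists of four 32-bit words are mine.

```c
#include <stdint.h>

/* Assumed number of 32-bit words in the "x" array. */
#define X_WORDS 4

/* Return x[index] without a secret-dependent memory address or branch.
 * Every word is read on every call, and a mask picks out the wanted one.
 * The index must be less than X_WORDS. */
uint32_t select_word_ct(const uint32_t x[X_WORDS], unsigned index)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < X_WORDS; ++i) {
        /* diff is zero only when i == index. */
        uint32_t diff = (uint32_t)(i ^ index);
        /* (diff - 1) has its top bit set only when diff == 0, so the
         * mask is all-ones for the wanted word and zero otherwise. */
        uint32_t mask = (uint32_t)0 - ((diff - 1U) >> 31);
        result |= x[i] & mask;
    }
    return result;
}
```

A plain `x[index]` lookup is much cheaper, but it is only safe when the access time cannot depend on the index, which is what keeping the whole array within one cache line is intended to achieve.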
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
DryGASCON128k32 | Sébastien Riou | 0.59 | 0.62 | 1.05 | 1.04 | 0.79 |
DryGASCON128k56 | Sébastien Riou | 0.59 | 0.62 | 1.05 | 1.03 | 0.79 |
DryGASCON128k16 | Sébastien Riou | 0.59 | 0.62 | 1.03 | 1.02 | 0.78 |
DryGASCON128k16 | Baseline | 0.16 | 0.18 | 0.28 | 0.30 | 0.22 |
DryGASCON128-HASH shows a similar improvement:
Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average |
DryGASCON128-HASH | Sébastien Riou | 0.29 | 0.29 | 0.88 | 0.48 |
DryGASCON128-HASH | Baseline | 0.08 | 0.07 | 0.25 | 0.13 |
I implemented ARM Cortex M3 assembly versions of the GIFT-128 block cipher to support accelerated versions of the ESTATE, GIFT-COFB, HYENA, and SUNDAE-GIFT submissions to Round 2. Versions were created for the "full", "small", and "tiny" variants of GIFT-128. The following figures are for the full fixsliced variant:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
GIFT-COFB | Rhys Weatherley | 1.01 | 1.01 | 1.16 | 1.15 | 1.08 |
GIFT-COFB | Baseline | 1.02 | 1.01 | 1.09 | 1.09 | 1.05 |
HYENA | Rhys Weatherley | 0.68 | 0.74 | 0.87 | 0.88 | 0.80 |
SUNDAE-GIFT-0 | Rhys Weatherley | 0.57 | 0.61 | 1.04 | 1.05 | 0.78 |
SUNDAE-GIFT-0 | Baseline | 0.58 | 0.62 | 1.01 | 1.02 | 0.77 |
HYENA | Baseline | 0.62 | 0.65 | 0.81 | 0.84 | 0.73 |
ESTATE_TweGIFT-128 | Rhys Weatherley | 0.53 | 0.57 | 1.04 | 1.04 | 0.74 |
SUNDAE-GIFT-64 | Rhys Weatherley | 0.54 | 0.58 | 0.84 | 0.86 | 0.69 |
SUNDAE-GIFT-64 | Baseline | 0.55 | 0.59 | 0.82 | 0.84 | 0.69 |
SUNDAE-GIFT-96 | Rhys Weatherley | 0.54 | 0.58 | 0.83 | 0.85 | 0.69 |
SUNDAE-GIFT-96 | Baseline | 0.55 | 0.59 | 0.81 | 0.83 | 0.68 |
SUNDAE-GIFT-128 | Rhys Weatherley | 0.54 | 0.58 | 0.81 | 0.83 | 0.68 |
SUNDAE-GIFT-128 | Baseline | 0.54 | 0.59 | 0.79 | 0.82 | 0.67 |
ESTATE_TweGIFT-128 | Baseline | 0.48 | 0.51 | 0.92 | 0.92 | 0.66 |
The assembly versions provided a modest improvement in performance, but it wasn't as substantial as for other submissions. The C compiler actually does a pretty good job on my GIFT-128 block cipher implementations in C.
Further improvements will be investigated later. The GIFT authors' ARM implementations have some other tricks that I haven't implemented yet, such as deferring word rotations from one step to be performed during the following step.
I implemented an ARM Cortex M3 assembly version of the GIMLI-24 permutation. The implementation is fully unrolled with the entire state held in registers. The GIMLI-24 AEAD mode shows a 30% improvement over the baseline:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
GIMLI-24 | Rhys Weatherley | 1.08 | 1.09 | 1.29 | 1.28 | 1.18 |
GIMLI-24 | Baseline | 0.84 | 0.85 | 0.97 | 0.98 | 0.91 |
Similar improvements were seen for GIMLI-24-HASH:
Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average |
GIMLI-24-HASH | Rhys Weatherley | 0.54 | 0.47 | 0.86 | 0.62 |
GIMLI-24-HASH | Baseline | 0.45 | 0.35 | 0.61 | 0.46 |
Using the ARM Cortex M3 assembly version of ASCON provides an improvement to the performance of ISAP-A:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
ISAP-A-128A | Rhys Weatherley | 0.24 | 0.26 | 0.13 | 0.14 | 0.18 |
ISAP-A-128A | Baseline | 0.17 | 0.19 | 0.10 | 0.11 | 0.13 |
ISAP-A-128 | Rhys Weatherley | 0.08 | 0.08 | 0.03 | 0.04 | 0.05 |
ISAP-A-128 | Baseline | 0.05 | 0.05 | 0.02 | 0.02 | 0.03 |
Unrolling the circulant matrix multiplication step of Pyjamask produces a three-fold performance improvement in the C version of the algorithm:
Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
Pyjamask-96-AEAD | Unrolled | 0.22 | 0.25 | 0.25 | 0.27 | 0.25 |
Pyjamask-128-AEAD | Unrolled | 0.22 | 0.24 | 0.24 | 0.25 | 0.24 |
Pyjamask-96-AEAD | Baseline | 0.07 | 0.07 | 0.07 | 0.08 | 0.07 |
Pyjamask-128-AEAD | Baseline | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 |
Circulant multiplication has two arguments, a matrix "x" and a state word "y", where the matrix is a constant. The reference implementation from the authors rotates the matrix and XORs it with the result wherever there is a 1 bit in the state word.
However, circulant multiplication is commutative so we can swap the arguments. Because the matrix is a constant, we only need to perform XORs and rotations for set bits in the matrix and ignore the unset bits. The matrix values in the standard algorithm have an average of 12 set bits, which reduces the number of XORs and rotations significantly. The resulting implementation is approximately 10 times faster than the baseline version:
Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
Pyjamask-96-AEAD | Reversed Multiplication | 0.66 | 0.67 | 0.81 | 0.83 | 0.74 |
Pyjamask-128-AEAD | Reversed Multiplication | 0.67 | 0.63 | 0.80 | 0.79 | 0.72 |
Pyjamask-96-AEAD | Baseline | 0.07 | 0.07 | 0.07 | 0.08 | 0.07 |
Pyjamask-128-AEAD | Baseline | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 |
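The two ways of performing the multiplication can be sketched in C as follows. This is a minimal sketch under my own assumptions about bit ordering (the top bit of the selecting word pairs with the unrotated value), and the function names are hypothetical. It is not the code from the reference implementation or from the source tree.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (n is reduced modulo 32). */
static uint32_t rotr32(uint32_t v, unsigned n)
{
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}

/* Reference-style form: walk the 32 bits of the state word y and XOR
 * in the current rotation of the matrix x under a mask. All 32 steps
 * are always performed because y is not known in advance. */
uint32_t circulant_mul_by_state_bits(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (int bit = 31; bit >= 0; --bit) {
        uint32_t mask = (uint32_t)0 - ((y >> bit) & 1U);
        result ^= x & mask;
        x = rotr32(x, 1);
    }
    return result;
}

/* Swapped form: walk the bits of the constant matrix x and XOR in the
 * current rotation of y only where x has a 1 bit. The branch depends
 * only on the public constant x, never on the state word. */
uint32_t circulant_mul_by_matrix_bits(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (int bit = 31; bit >= 0; --bit) {
        if ((x >> bit) & 1U)
            result ^= y;
        y = rotr32(y, 1);
    }
    return result;
}
```

When `x` is a compile-time constant and the second loop is fully unrolled, the steps for clear bits disappear entirely. What remains is one rotation and one XOR per set bit of the matrix.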
According to the Pyjamask authors, swapping the arguments to circulant multiplication should have no effect on the algorithm's resistance against power analysis when used in masked form.
I also experimented with ARM Cortex M3 assembly versions of Pyjamask but there wasn't much difference in performance compared to the plain C version. So for now I am sticking with the C version.
The AVR version of Pyjamask also shows a significant improvement by swapping the arguments:
Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
Pyjamask-96-AEAD | Rhys Weatherley | 1.47 | 1.45 | 2.13 | 2.11 | 1.64 |
Pyjamask-128-AEAD | Rhys Weatherley | 1.39 | 1.34 | 1.95 | 1.91 | 1.52 |
Pyjamask-96-AEAD | Baseline | 0.66 | 0.67 | 0.96 | 0.96 | 0.74 |
Pyjamask-128-AEAD | Baseline | 0.63 | 0.64 | 0.89 | 0.90 | 0.71 |
The S-box and MDS steps in Saturnin rotate the words of the state at various points. In a previous change, I made the S-box word rotations implicit within the higher level round function. In the latest change, I did the same for MDS. Doing this provides a modest improvement in the performance of the C version:
Algorithm | Version | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
SATURNIN-Short | Rhys Weatherley | | | 1.82 | 1.66 | 1.73 |
SATURNIN-Short | Baseline | | | 1.62 | 1.69 | 1.66 |
SATURNIN-CTR-Cascade | Rhys Weatherley | 0.39 | 0.42 | 0.42 | 0.44 | 0.42 |
SATURNIN-CTR-Cascade | Baseline | 0.34 | 0.36 | 0.37 | 0.38 | 0.36 |
Similar performance improvements are also seen for SATURNIN-Hash:
Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average |
SATURNIN-Hash | Rhys Weatherley | 0.28 | 0.23 | 0.57 | 0.36 |
SATURNIN-Hash | Baseline | 0.24 | 0.20 | 0.49 | 0.31 |
I implemented fully unrolled ARM Cortex M3 assembly versions of the SPARKLE-256, SPARKLE-384, and SPARKLE-512 permutations. There was up to a 70% improvement in performance in some of the algorithms.
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
Schwaemm128-128 | Rhys Weatherley | 1.60 | 1.58 | 2.84 | 2.39 | 2.01 |
Schwaemm256-128 | Rhys Weatherley | 1.74 | 1.63 | 1.90 | 1.93 | 1.80 |
Schwaemm192-192 | Rhys Weatherley | 1.47 | 1.50 | 1.98 | 1.81 | 1.68 |
Schwaemm128-128 | Baseline | 1.17 | 1.15 | 1.93 | 1.80 | 1.46 |
Schwaemm256-256 | Rhys Weatherley | 1.18 | 1.16 | 1.15 | 1.09 | 1.14 |
Schwaemm256-128 | Baseline | 1.08 | 1.12 | 1.08 | 1.10 | 1.09 |
Schwaemm192-192 | Baseline | 0.90 | 0.92 | 1.04 | 1.07 | 0.99 |
Schwaemm256-256 | Baseline | 0.79 | 0.80 | 0.74 | 0.72 | 0.76 |
The improvement in hashing performance was even more spectacular:
Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average |
Esch256 | Rhys Weatherley | 0.89 | 0.78 | 1.50 | 1.06 |
Esch384 | Rhys Weatherley | 0.45 | 0.37 | 1.50 | 0.47 |
Esch256 | Baseline | 0.38 | 0.34 | 0.65 | 0.46 |
Esch384 | Baseline | 0.26 | 0.21 | 0.33 | 0.26 |
The SPARKLE-256 and SPARKLE-384 implementations fit entirely within ARM registers, with memory operations at the start and end of the permutation functions only. SPARKLE-512 holds 10 of the 16 state words in registers at a time, and swaps the remaining words between memory and registers as needed in each round.
I replaced the common TinyJAMBU permutation function with three separate unrolled versions for 128-bit, 192-bit, and 256-bit key sizes. This provided a modest improvement:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
TinyJAMBU-128 | Rhys Weatherley | 0.71 | 0.74 | 1.27 | 1.27 | 0.94 |
TinyJAMBU-192 | Rhys Weatherley | 0.63 | 0.67 | 1.14 | 1.16 | 0.85 |
TinyJAMBU-128 | Baseline | 0.59 | 0.62 | 1.10 | 1.11 | 0.81 |
TinyJAMBU-256 | Rhys Weatherley | 0.56 | 0.59 | 1.04 | 1.06 | 0.76 |
TinyJAMBU-192 | Baseline | 0.54 | 0.57 | 1.01 | 1.03 | 0.74 |
TinyJAMBU-256 | Baseline | 0.49 | 0.52 | 0.94 | 0.96 | 0.68 |
Improvements were also seen on ESP32 and AVR.
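The idea behind the separate unrolled versions can be sketched in C as follows. This is a minimal sketch rather than the code in the source tree. The function names are hypothetical, each step processes 32 rounds at once, and the feedback taps (bits 47, 70, 85, and 91) reflect the TinyJAMBU specification as I understand it, so check them against the specification before relying on them.

```c
#include <stdint.h>

/* One step of 32 rounds: update s0 from the other three state words
 * and one key word. */
#define TINYJAMBU_STEP(s0, s1, s2, s3, kword)            \
    do {                                                 \
        uint32_t t1 = ((s1) >> 15) | ((s2) << 17);       \
        uint32_t t2 = ((s2) >> 6) | ((s3) << 26);        \
        uint32_t t3 = ((s2) >> 21) | ((s3) << 11);       \
        uint32_t t4 = ((s2) >> 27) | ((s3) << 5);        \
        (s0) ^= t1 ^ ~(t2 & t3) ^ t4 ^ (kword);          \
    } while (0)

/* Common version: works for any key size, but shuffles the state words
 * and computes a key index modulo the key size on every step. */
void tinyjambu_permute_common(uint32_t state[4], const uint32_t *key,
                              unsigned key_words, unsigned steps)
{
    for (unsigned i = 0; i < steps; ++i) {
        uint32_t s0 = state[0];
        TINYJAMBU_STEP(s0, state[1], state[2], state[3], key[i % key_words]);
        state[0] = state[1];
        state[1] = state[2];
        state[2] = state[3];
        state[3] = s0;
    }
}

/* Version specialised for a 128-bit key (4 key words). Four steps bring
 * both the state words and the key words back to their starting
 * positions, so the shuffling and the modulo disappear. Each group is
 * 4 steps, i.e. 128 rounds. */
void tinyjambu_permute_128(uint32_t state[4], const uint32_t key[4],
                           unsigned groups)
{
    uint32_t s0 = state[0], s1 = state[1], s2 = state[2], s3 = state[3];
    for (; groups > 0; --groups) {
        TINYJAMBU_STEP(s0, s1, s2, s3, key[0]);
        TINYJAMBU_STEP(s1, s2, s3, s0, key[1]);
        TINYJAMBU_STEP(s2, s3, s0, s1, key[2]);
        TINYJAMBU_STEP(s3, s0, s1, s2, key[3]);
    }
    state[0] = s0;
    state[1] = s1;
    state[2] = s2;
    state[3] = s3;
}
```

Under the same reasoning, a 192-bit key (6 key words) would line up with the 4 state words again only after 12 steps, and a 256-bit key (8 key words) after 8 steps. That is why a single unrolled body does not suit all three key sizes.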
I then implemented an ARM Cortex M3 assembly code version of all three permutations, which provides between 38% and 50% improvement over the baseline versions:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
TinyJAMBU-128 | Rhys Weatherley | 0.93 | 0.95 | 1.63 | 1.61 | 1.21 |
TinyJAMBU-192 | Rhys Weatherley | 0.81 | 0.84 | 1.45 | 1.44 | 1.08 |
TinyJAMBU-256 | Rhys Weatherley | 0.70 | 0.73 | 1.28 | 1.29 | 0.94 |
TinyJAMBU-128 | Baseline | 0.59 | 0.62 | 1.10 | 1.11 | 0.81 |
TinyJAMBU-192 | Baseline | 0.54 | 0.57 | 1.01 | 1.03 | 0.74 |
TinyJAMBU-256 | Baseline | 0.49 | 0.52 | 0.94 | 0.96 | 0.68 |
The assembly versions of the TinyJAMBU-128 and TinyJAMBU-192 permutations fit entirely within the ARM's register set. The state words and key words are loaded from memory into registers and kept there until the state words are stored back to memory at the end of the permutation function.
TinyJAMBU-256 almost fits entirely within registers. Three of the eight key words need to be loaded from memory each round because there aren't enough registers to keep the entire 256-bit key cached in registers.
The ARM assembly source code can be found in "src/combined/internal-tinyjambu-arm-cm3.S" in the source tree.
I implemented an ARM Cortex M3 assembly version of the Xoodoo permutation. The implementation is fully unrolled with the entire state held in registers. The Xoodyak AEAD mode almost doubled in speed:
Algorithm | Contributor | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
Xoodyak | Rhys Weatherley | 1.66 | 1.51 | 1.73 | 1.60 | 1.62 |
Xoodyak | Baseline | 0.85 | 0.87 | 0.84 | 0.85 | 0.86 |
Similar improvements were seen for the hashing mode:
Algorithm | Contributor | 1024 bytes | 128 bytes | 16 bytes | Average |
Xoodyak | Rhys Weatherley | 0.71 | 0.65 | 1.43 | 0.93 |
Xoodyak | Baseline | 0.38 | 0.35 | 0.79 | 0.51 |