Lightweight Cryptography Primitives
There is a lot of variation in the capabilities of embedded microprocessors. Some are superscalar; others are not. Some have specialised vector instructions; others do not. Clock speeds can also vary considerably. All this means that "cycles per byte" or "megabytes per second" are pretty meaningless when trying to rank the algorithms on relative performance on any given microprocessor.
The approach I take here is "ChaChaPoly Units". The library contains a reasonably efficient 32-bit non-vectorized implementation of the ChaChaPoly AEAD scheme from my Arduino cryptography library. This makes it a known quantity to compare with other algorithms side by side.
If an algorithm is measured at 0.8 ChaChaPoly Units on a specific embedded microprocessor at a specific clock speed, then it runs at 0.8 times the speed of ChaChaPoly on that microprocessor, which is slightly slower. If the algorithm is instead measured at 2 ChaChaPoly Units, then it is twice as fast as ChaChaPoly on the same microprocessor. The higher the number of units, the better the algorithm.
The number of ChaChaPoly Units for each algorithm will vary for each microprocessor that is tested and for different choices of optimisation options. The figures below should be used as a rough guide to the relative performance of the algorithms, not an absolute measurement.
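As a rough illustration of the unit definition above, the figure is the ChaChaPoly time divided by the algorithm's time for the same packet on the same microprocessor. The helper name `chachapoly_units` and its tick parameters are hypothetical and are not part of the library; this is a sketch, not the benchmark code that produced the tables.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: relative speed of an algorithm in ChaChaPoly Units.
 * Both tick counts must be measured on the same microprocessor, at the
 * same clock speed, and for the same packet size. */
static double chachapoly_units(uint32_t chachapoly_ticks,
                               uint32_t algorithm_ticks)
{
    assert(algorithm_ticks != 0);
    /* Fewer ticks than ChaChaPoly gives a value above 1.0 (faster);
     * more ticks gives a value below 1.0 (slower). */
    return (double)chachapoly_ticks / (double)algorithm_ticks;
}
```

For example, an algorithm that takes 500 ticks where ChaChaPoly takes 1000 would score 2 units, while one that takes 1000 ticks where ChaChaPoly takes 800 would score 0.8 units.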
For hash algorithms we use BLAKE2s as the basic unit. BLAKE2s is based on ChaCha20 so it is the most logical hashing counterpart to ChaChaPoly.
This page details the performance results for 32-bit platforms. A separate page that details preliminary results for the 8-bit AVR platform can be found here.
The masking performance page contains comparisons of masked versions of the algorithms with their baseline versions.
All tests were run on an Arduino Due which is an ARM Cortex M3 running at 84MHz. The code was optimised for size rather than speed, which is the default optimisation option for the Arduino IDE. I found that "-Os" size optimisation often did better on the Due than "-O2" or "-O3" with the compiler that I had. Your own results may vary.
Each algorithm was tested with two packet sizes: 128 and 16 bytes. Some algorithms can have better performance on small packet sizes. The associated data is always zero-length.
Each value in the table below is the algorithm's speed as a multiple of ChaChaPoly's speed on the same packet size. Higher numbers mean better performance. The table is ordered from best average performance down.
Where a NIST submission contains multiple algorithms in a family, bold italics indicates the primary algorithm in the family.
All algorithms have been accelerated to some degree with armv7m-compatible assembly code.
Algorithm | Key Bits | Nonce Bits | Tag Bits | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
Schwaemm128-128 (SPARKLE) | 128 | 128 | 128 | 1.66 | 1.62 | 3.08 | 2.53 | 2.11 |
Xoodyak | 128 | 128 | 128 | 1.77 | 1.66 | 2.34 | 2.16 | 1.97 |
Schwaemm256-128 (SPARKLE) | 128 | 256 | 128 | 1.93 | 1.61 | 2.13 | 1.97 | 1.90 |
Schwaemm192-192 (SPARKLE) | 192 | 192 | 192 | 1.75 | 1.52 | 2.12 | 1.88 | 1.80 |
ASCON-128a | 128 | 128 | 128 | 1.86 | 1.70 | 1.80 | 1.78 | 1.78 |
ASCON-128 | 128 | 128 | 128 | 1.54 | 1.44 | 1.78 | 1.68 | 1.61 |
ASCON-80pq | 160 | 128 | 128 | 1.52 | 1.43 | 1.71 | 1.65 | 1.57 |
Schwaemm256-256 (SPARKLE) | 256 | 256 | 256 | 1.23 | 1.16 | 1.22 | 1.11 | 1.18 |
TinyJAMBU-128 | 128 | 96 | 64 | 0.87 | 0.89 | 1.58 | 1.57 | 1.17 |
GIFT-COFB | 128 | 128 | 128 | 1.01 | 1.01 | 1.16 | 1.15 | 1.08 |
TinyJAMBU-192 | 192 | 96 | 64 | 0.73 | 0.76 | 1.35 | 1.36 | 1.00 |
TinyJAMBU-256 | 256 | 96 | 64 | 0.67 | 0.70 | 1.27 | 1.28 | 0.93 |
Grain-128AEAD | 128 | 96 | 64 | 0.30 | 0.33 | 0.63 | 0.67 | 0.45 |
AES-128-GCM | 128 | 96 | 128 | 0.36 | 0.38 | 0.50 | 0.52 | 0.44 |
AES-192-GCM | 192 | 96 | 128 | 0.34 | 0.35 | 0.46 | 0.48 | 0.40 |
AES-256-GCM | 256 | 96 | 128 | 0.31 | 0.33 | 0.42 | 0.44 | 0.37 |
Romulus-N | 128 | 128 | 128 | 0.27 | 0.29 | 0.32 | 0.34 | 0.31 |
Delirium (Elephant) | 128 | 96 | 128 | 0.23 | 0.25 | 0.38 | 0.39 | 0.30 |
PHOTON-Beetle-AEAD-ENC-128 | 128 | 128 | 128 | 0.18 | 0.20 | 0.33 | 0.35 | 0.25 |
Romulus-M | 128 | 128 | 128 | 0.16 | 0.17 | 0.22 | 0.23 | 0.19 |
ISAP-A-128A | 128 | 128 | 128 | 0.24 | 0.26 | 0.13 | 0.14 | 0.18 |
Romulus-T | 128 | 128 | 128 | 0.07 | 0.07 | 0.10 | 0.11 | 0.09 |
PHOTON-Beetle-AEAD-ENC-32 | 128 | 128 | 128 | 0.05 | 0.06 | 0.13 | 0.14 | 0.08 |
ISAP-A-128 | 128 | 128 | 128 | 0.08 | 0.08 | 0.03 | 0.04 | 0.05 |
ISAP-K-128A | 128 | 128 | 128 | 0.07 | 0.07 | 0.04 | 0.04 | 0.05 |
Dumbo (Elephant) | 128 | 96 | 64 | 0.03 | 0.03 | 0.05 | 0.05 | 0.04 |
Jumbo (Elephant) | 128 | 96 | 64 | 0.03 | 0.03 | 0.04 | 0.04 | 0.04 |
ISAP-K-128 | 128 | 128 | 128 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
The hash algorithms are compared against BLAKE2s instead of ChaChaPoly:
Algorithm | Hash Bits | 1024 bytes | 128 bytes | 16 bytes | Average |
Esch256 (SPARKLE) | 256 | 0.90 | 0.79 | 1.51 | 1.07 |
SHA256-ASM | 256 | 1.11 | 0.79 | 1.15 | 1.02 |
Xoodyak | 256 | 0.71 | 0.65 | 1.43 | 0.93 |
ASCON-HASHA | 256 | 0.63 | 0.46 | 0.58 | 0.56 |
SHA256-C | 256 | 0.56 | 0.40 | 0.60 | 0.52 |
ASCON-HASH | 256 | 0.48 | 0.38 | 0.57 | 0.48 |
Esch384 (SPARKLE) | 384 | 0.46 | 0.38 | 0.59 | 0.48 |
Romulus-H | 256 | 0.10 | 0.09 | 0.22 | 0.14 |
PHOTON-Beetle-HASH | 256 | 0.02 | 0.02 | 0.16 | 0.07 |
SHA256-ASM uses a fully unrolled version of the SHA256 block transformation function in armv7m-compatible assembly code. SHA256-C is a straightforward C version of SHA256 with very little unrolling, designed for small code and memory sizes.
The speed of SHA256 compared with the other candidates can be attributed in part to the "rate" of 64 bytes for SHA256, which allows it to process more data per block operation. The other algorithms have rates of 4, 8, 16, or 32 bytes. If all algorithms had the same rate, then the ordering would be more like this:
Algorithm | Average | Rate | Average / Rate |
ASCON-HASHA | 0.56 | 8 | 0.0700 |
Esch256 (SPARKLE) | 1.07 | 16 | 0.0669 |
ASCON-HASH | 0.48 | 8 | 0.0600 |
Xoodyak | 0.93 | 16 | 0.0581 |
Esch384 (SPARKLE) | 0.48 | 16 | 0.0300 |
PHOTON-Beetle-HASH | 0.07 | 4 | 0.0175 |
SHA256-ASM | 1.02 | 64 | 0.0159 |
SHA256-C | 0.52 | 64 | 0.0081 |
Romulus-H | 0.14 | 32 | 0.0044 |
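The "Average / Rate" column is simply the average figure divided by the rate in bytes. A minimal sketch of that normalisation and the resulting re-ordering, using four rows copied from the tables above, could look like the following. The struct, field, and function names are illustrative only and do not come from the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One row of the rate comparison (illustrative names, not library API). */
struct hash_row {
    const char *name;
    double average;  /* average BLAKE2s units on ARM Cortex M3 */
    unsigned rate;   /* bytes absorbed per block operation */
};

/* Performance per byte of rate: the "Average / Rate" column. */
static double average_per_rate(const struct hash_row *row)
{
    assert(row->rate != 0);
    return row->average / (double)row->rate;
}

/* qsort() comparator: highest "Average / Rate" first. */
static int compare_rows(const void *a, const void *b)
{
    double x = average_per_rate((const struct hash_row *)a);
    double y = average_per_rate((const struct hash_row *)b);
    return (x < y) - (x > y);
}

/* Four rows from the tables above, initially in "Average" order. */
static struct hash_row sample_rows[] = {
    { "Esch256 (SPARKLE)", 1.07, 16 },
    { "SHA256-ASM",        1.02, 64 },
    { "Xoodyak",           0.93, 16 },
    { "ASCON-HASHA",       0.56,  8 },
};

/* Re-order the sample rows by normalised performance. */
static void sort_sample_rows(void)
{
    qsort(sample_rows, sizeof sample_rows / sizeof sample_rows[0],
          sizeof sample_rows[0], compare_rows);
}
```

After `sort_sample_rows()` the rows run ASCON-HASHA, Esch256, Xoodyak, SHA256-ASM, which matches the relative order of those four entries in the table above.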
The tests below were run on an ESP32 Dev Module running at 240MHz. The ordering is mostly the same as on ARM Cortex M3, with a few reversals where architectural differences give some algorithms an added advantage.
Algorithm | Key Bits | Nonce Bits | Tag Bits | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
Schwaemm128-128 (SPARKLE) | 128 | 128 | 128 | 1.07 | 1.06 | 1.68 | 1.60 | 1.32 |
Schwaemm256-128 (SPARKLE) | 128 | 256 | 128 | 1.11 | 1.09 | 1.04 | 1.04 | 1.06 |
Xoodyak | 128 | 128 | 128 | 0.91 | 0.92 | 1.06 | 1.07 | 0.99 |
Schwaemm192-192 (SPARKLE) | 192 | 192 | 192 | 0.87 | 0.90 | 1.02 | 1.00 | 0.95 |
ASCON-128a | 128 | 128 | 128 | 0.86 | 0.88 | 0.92 | 0.93 | 0.90 |
GIFT-COFB | 128 | 128 | 128 | 0.80 | 0.83 | 0.90 | 0.90 | 0.86 |
TinyJAMBU-128 | 128 | 96 | 64 | 0.62 | 0.64 | 1.12 | 1.12 | 0.83 |
TinyJAMBU-192 | 192 | 96 | 64 | 0.55 | 0.57 | 1.01 | 1.02 | 0.75 |
Schwaemm256-256 (SPARKLE) | 256 | 256 | 256 | 0.77 | 0.78 | 0.70 | 0.70 | 0.73 |
AES-128-GCM | 128 | 96 | 128 | 0.59 | 0.60 | 0.82 | 0.83 | 0.70 |
AES-192-GCM | 192 | 96 | 128 | 0.54 | 0.56 | 0.76 | 0.77 | 0.65 |
TinyJAMBU-256 | 256 | 96 | 64 | 0.47 | 0.49 | 0.89 | 0.91 | 0.65 |
ASCON-128 | 128 | 128 | 128 | 0.67 | 0.46 | 0.86 | 0.66 | 0.63 |
ASCON-80pq | 160 | 128 | 128 | 0.67 | 0.44 | 0.84 | 0.61 | 0.61 |
AES-256-GCM | 256 | 96 | 128 | 0.50 | 0.52 | 0.68 | 0.69 | 0.59 |
Grain-128AEAD | 128 | 96 | 64 | 0.33 | 0.32 | 0.60 | 0.59 | 0.43 |
PHOTON-Beetle-AEAD-ENC-128 | 128 | 128 | 128 | 0.16 | 0.18 | 0.30 | 0.32 | 0.23 |
Romulus-N | 128 | 128 | 128 | 0.17 | 0.20 | 0.18 | 0.24 | 0.20 |
Delirium (Elephant) | 128 | 96 | 128 | 0.14 | 0.15 | 0.22 | 0.23 | 0.18 |
Romulus-M | 128 | 128 | 128 | 0.09 | 0.11 | 0.12 | 0.17 | 0.12 |
ISAP-A-128A | 128 | 128 | 128 | 0.13 | 0.15 | 0.08 | 0.09 | 0.10 |
PHOTON-Beetle-AEAD-ENC-32 | 128 | 128 | 128 | 0.04 | 0.05 | 0.12 | 0.13 | 0.07 |
Romulus-T | 128 | 128 | 128 | 0.04 | 0.05 | 0.07 | 0.09 | 0.06 |
ISAP-K-128A | 128 | 128 | 128 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 |
ISAP-A-128 | 128 | 128 | 128 | 0.03 | 0.03 | 0.01 | 0.02 | 0.02 |
Dumbo (Elephant) | 128 | 96 | 64 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
Jumbo (Elephant) | 128 | 96 | 64 | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 |
ISAP-K-128 | 128 | 128 | 128 | 0.0040 | 0.0047 | 0.0018 | 0.0020 | 0.0025 |
Hash algorithms:
Algorithm | Hash Bits | 1024 bytes | 128 bytes | 16 bytes | Average |
Xoodyak | 256 | 0.35 | 0.33 | 0.73 | 0.47 |
SHA256-C | 256 | 0.47 | 0.37 | 0.55 | 0.47 |
Esch256 (SPARKLE) | 256 | 0.38 | 0.34 | 0.64 | 0.45 |
Esch384 (SPARKLE) | 384 | 0.24 | 0.20 | 0.30 | 0.25 |
ASCON-HASHA | 256 | 0.27 | 0.20 | 0.25 | 0.24 |
ASCON-HASH | 256 | 0.19 | 0.16 | 0.24 | 0.20 |
Romulus-H | 256 | 0.07 | 0.06 | 0.09 | 0.09 |
PHOTON-Beetle-HASH | 256 | 0.02 | 0.02 | 0.15 | 0.06 |
SHA256-C does quite well on ESP32. In large part this is because SHA256's "rate" is 64 bytes which allows it to process more data per block operation than the other algorithms whose rate is 4, 8, 16, or 32 bytes per block operation.
All of the algorithms suffer on ESP32 because the CPU does not have a native word rotation instruction. BLAKE2s and SHA256 have a lower percentage of word rotations per round, so they are less affected by the CPU's shortcomings.
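To make the rotation cost concrete: on a CPU without a rotate instruction, a 32-bit word rotation has to be built from two shifts and an OR, as in the sketch below. The function name `rotl32` is illustrative and is not taken from the library. On targets that do have a rotate instruction, such as ARM Cortex M3, compilers usually recognise this pattern and emit the single instruction instead.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 <= n < 32) using only shifts
 * and OR. The "& 31u" keeps the right shift count in range when n is 0,
 * which avoids an undefined shift by 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

An algorithm whose round function is dominated by rotations pays this three-operation cost on every rotation, which is consistent with the observation above that designs with proportionally fewer rotations per round are less affected.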
Based on the above data, the NIST submissions can be roughly grouped with those of similar performance. Changes in CPU, optimisation options, loop unrolling, or assembly code replacement might modify the rank of an algorithm.
Only the primary algorithm in each family is considered for this ranking. I took the average of the ARM Cortex M3 and ESP32 figures from the above tables to compute an average across different architectures. I then grouped the algorithms into 0.1-wide buckets; for example everything with rank 3 has an average between 0.30 and 0.39 ChaChaPoly units.
AEAD algorithm rankings:
Rank | Algorithms |
14 | SPARKLE, Xoodyak |
11 | ASCON |
10 | TinyJAMBU |
9 | GIFT-COFB |
5 | AES-128-GCM |
4 | Grain-128AEAD |
2 | PHOTON-Beetle, Romulus |
1 | ISAP |
0 | Elephant |
Hash algorithm rankings:
Rank | Algorithms |
7 | SHA256, SPARKLE, Xoodyak |
3 | ASCON |
0 | PHOTON-Beetle, Romulus |
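The bucketing described above can be sketched as follows. To avoid floating-point rounding at bucket boundaries, this sketch stores each figure in hundredths of a unit (so 1.97 becomes 197); that representation and the function names are my own assumptions for illustration, not how the rankings were originally computed.

```c
#include <assert.h>

/* Rank = floor(average of the two platform figures / 0.1).
 * With both inputs in hundredths of a unit, the average is (a + b) / 2
 * hundredths and one bucket is 10 hundredths wide, so the rank reduces
 * to the integer division (a + b) / 20. */
static unsigned rank_bucket(unsigned cortex_m3_hundredths,
                            unsigned esp32_hundredths)
{
    return (cortex_m3_hundredths + esp32_hundredths) / 20u;
}

/* Usage: three rows checked against the AEAD ranking table above. */
static void rank_bucket_demo(void)
{
    assert(rank_bucket(197u, 99u) == 14u);  /* Xoodyak: 1.97 and 0.99 */
    assert(rank_bucket(161u, 63u) == 11u);  /* ASCON-128: 1.61 and 0.63 */
    assert(rank_bucket(4u, 2u) == 0u);      /* Dumbo: 0.04 and 0.02 */
}
```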
There have been many improvements to the performance of my implementations since Round 2, and some tweaks to the algorithms themselves to change the number of rounds or other aspects of the algorithms. This section summarises the changes.
ARM Cortex M3 has seen the greatest performance improvement with the introduction of assembly code versions of most algorithms. We compare the baseline C versions from Round 2 with the current ChaChaPoly figures.
I did have some ARM Cortex M3 assembly code versions in my Round 2 repository, but they were implemented after the cut-off date for Round 2 status updates.
Changes in the primary AEAD algorithm performance for ARM Cortex M3, ordered from highest to lowest "New" ChaChaPoly values:
Algorithm | Round 2 | New | Notable changes other than the use of assembly code |
Xoodyak | 0.86 | 1.97 | Final round tweak improved performance on small packets |
SPARKLE | 1.09 | 1.90 | |
ASCON | 1.11 | 1.61 | |
TinyJAMBU | 0.81 | 1.17 | |
GIFT-COFB | 1.05 | 1.08 | |
Grain-128AEAD | 0.37 | 0.45 | |
Romulus | 0.19 | 0.31 | Switched to fixsliced SKINNY-128-384+ |
Elephant (Delirium) | 0.05 | 0.30 | Optimised 32-bit and 64-bit versions of Keccak-p[200] in C |
PHOTON-Beetle | 0.08 | 0.25 | Highly unrolled 32-bit version in C |
ISAP-A | 0.13 | 0.18 | |
ISAP-K | 0.02 | 0.05 | Optimised 64-bit version of Keccak-p[400] in C |
Elephant (Dumbo) | 0.02 | 0.04 | Improved bit-sliced implementation of Spongent |
Note: The primary version of Elephant is the Spongent-based Dumbo, but the Keccak-based Delirium has improved significantly so I included that as well.
Changes in the primary hash algorithm performance for ARM Cortex M3:
Algorithm | Round 2 | New |
SPARKLE | 0.46 | 1.07 |
Xoodyak | 0.51 | 0.93 |
ASCON | 0.30 | 0.48 |
Romulus | N/A | 0.14 |
PHOTON-Beetle | 0.02 | 0.07 |
The ESP32 implementations are still in C, so the improvements in the AEAD encryption schemes were more modest with a few notable changes:
Algorithm | Round 2 | New | Notable changes |
SPARKLE | 1.06 | 1.06 | |
Xoodyak | 0.83 | 0.99 | Final round tweak improved performance on small packets |
TinyJAMBU | 0.71 | 0.83 | Separate the permutations for 128, 192, and 256 bit key sizes and unroll |
GIFT-COFB | 0.86 | 0.86 | |
ASCON | 0.63 | 0.63 | |
Grain-128AEAD | 0.43 | 0.43 | |
PHOTON-Beetle | 0.08 | 0.23 | Highly unrolled 32-bit version in C |
Romulus | 0.11 | 0.20 | Switched to fixsliced SKINNY-128-384+ |
Elephant (Delirium) | 0.06 | 0.18 | Optimised 32-bit version of Keccak-p[200] in C |
ISAP-A | 0.10 | 0.10 | |
ISAP-K | 0.02 | 0.02 | |
Elephant (Dumbo) | 0.02 | 0.02 | Improved bit-sliced implementation of Spongent |
Changes in the primary hash algorithm performance for ESP32:
Algorithm | Round 2 | New |
Xoodyak | 0.47 | 0.47 |
SPARKLE | 0.45 | 0.45 |
ASCON | 0.20 | 0.20 |
Romulus | N/A | 0.09 |
PHOTON-Beetle | 0.02 | 0.06 |
My Round 2 implementations were focused on 32-bit and 8-bit architectures. I have since added some implementations in C that are designed for 64-bit systems:
64-bit systems are detected by the LW_UTIL_CPU_IS_64BIT define in internal-util.h. Currently x86-64 and arm64 platforms are recognized. Patches welcome to support other 64-bit architectures.
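For readers unfamiliar with this kind of compile-time detection, the sketch below shows one common way to recognise x86-64 and arm64 from compiler-predefined macros and to select a word size from the result. `EXAMPLE_CPU_IS_64BIT`, `example_word_t`, and `example_word_bits` are hypothetical stand-ins; the actual definition of LW_UTIL_CPU_IS_64BIT in internal-util.h may well differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the library's detection macro. The tests
 * below rely on macros predefined by GCC/Clang (__x86_64__, __aarch64__)
 * and MSVC (_M_X64, _M_ARM64). */
#if defined(__x86_64__) || defined(_M_X64) || \
    defined(__aarch64__) || defined(_M_ARM64)
#define EXAMPLE_CPU_IS_64BIT 1
#else
#define EXAMPLE_CPU_IS_64BIT 0
#endif

/* Choose the working word type at compile time. */
#if EXAMPLE_CPU_IS_64BIT
typedef uint64_t example_word_t;
#else
typedef uint32_t example_word_t;
#endif

/* Report the selected word width, with a sanity check that the type
 * chosen above agrees with the detection result. */
static unsigned example_word_bits(void)
{
    unsigned bits = (unsigned)(sizeof(example_word_t) * 8u);
    assert(bits == (EXAMPLE_CPU_IS_64BIT ? 64u : 32u));
    return bits;
}
```

Supporting another 64-bit architecture in a scheme like this would mean adding that compiler's predefined macro to the first `#if`.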