Lightweight Cryptography Primitives
Performance on 32-bit platforms

Introduction

There is a lot of variation in the capabilities of embedded microprocessors. Some are superscalar; others are not. Some have specialised vector instructions; others do not. Clock speeds can also vary considerably. All this means that "cycles per byte" or "megabytes per second" are pretty meaningless when trying to rank the algorithms on relative performance on any given microprocessor.

The approach I take here is to measure performance in "ChaChaPoly Units". The library contains a reasonably efficient 32-bit non-vectorized implementation of the ChaChaPoly AEAD scheme from my Arduino cryptography library. This makes it a known quantity that other algorithms can be compared against side by side.

If an algorithm is measured at 0.8 ChaChaPoly Units on a specific embedded microprocessor at a specific clock speed, then it runs at 0.8 times the speed of ChaChaPoly on that microprocessor; that is, it is slower. If the algorithm is instead measured at 2 ChaChaPoly Units, then it is twice as fast as ChaChaPoly on the same microprocessor. The higher the number of units, the better the algorithm.
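As a minimal sketch of the arithmetic, assuming each algorithm is timed over the same packet on the same device, the figure is the ratio of ChaChaPoly's elapsed time to the algorithm's elapsed time. The helper name and the timings below are hypothetical; they are not measurements from this page and not part of the library.

```python
def chachapoly_units(reference_time, algorithm_time):
    """Relative speed of an algorithm versus the ChaChaPoly reference.

    Both arguments are elapsed times for processing the same packet on the
    same device, in the same unit (for example microseconds). This helper
    and its name are illustrative assumptions, not part of the library.
    """
    if reference_time <= 0 or algorithm_time <= 0:
        raise ValueError("timings must be positive")
    return reference_time / algorithm_time


# Hypothetical timings, chosen only to illustrate the two cases in the text.
slower = chachapoly_units(reference_time=100.0, algorithm_time=125.0)
faster = chachapoly_units(reference_time=100.0, algorithm_time=50.0)
print(f"{slower:.2f} units")  # algorithm takes longer than ChaChaPoly
print(f"{faster:.2f} units")  # algorithm takes half the time of ChaChaPoly
```

Because the result is a ratio of two timings taken on the same device, clock speed cancels out, which is what makes the figure usable as a rough cross-platform guide.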

The number of ChaChaPoly Units for each algorithm will vary for each microprocessor that is tested and for different choices of optimisation options. The figures below should be used as a rough guide to the relative performance of the algorithms, not an absolute measurement.

For hash algorithms we use BLAKE2s as the basic unit. BLAKE2s is based on ChaCha20 so it is the most logical hashing counterpart to ChaChaPoly.

This page details the performance results for 32-bit platforms. Preliminary results for the 8-bit AVR platform are detailed on a separate page.

The masking performance page contains comparisons of masked versions of the algorithms with their baseline versions.

Performance on ARM Cortex M3

All tests were run on an Arduino Due, which is an ARM Cortex M3 running at 84 MHz. The code was optimised for size rather than speed, which is the default optimisation option for the Arduino IDE. I found that "-Os" size optimisation often did better on the Due than "-O2" or "-O3" with the compiler that I had. Your own results may vary.

Each algorithm was tested with two packet sizes: 128 and 16 bytes. Some algorithms can have better performance on small packet sizes. The associated data is always zero-length.

Each value in the table below is the algorithm's speed relative to ChaChaPoly on the same packet size, in ChaChaPoly Units. Higher numbers mean better performance. The table is ordered from best average performance down.

Where a NIST submission contains multiple algorithms in a family, bold italics indicates the primary algorithm in the family.

All algorithms have been accelerated to some degree with armv7m-compatible assembly code.

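| Algorithm | Key Bits | Nonce Bits | Tag Bits | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Schwaemm128-128 (SPARKLE) | 128 | 128 | 128 | 1.66 | 1.62 | 3.08 | 2.53 | 2.11 |
| Xoodyak | 128 | 128 | 128 | 1.77 | 1.66 | 2.34 | 2.16 | 1.97 |
| Schwaemm256-128 (SPARKLE) | 128 | 256 | 128 | 1.93 | 1.61 | 2.13 | 1.97 | 1.90 |
| Schwaemm192-192 (SPARKLE) | 192 | 192 | 192 | 1.75 | 1.52 | 2.12 | 1.88 | 1.80 |
| ASCON-128a | 128 | 128 | 128 | 1.86 | 1.70 | 1.80 | 1.78 | 1.78 |
| ASCON-128 | 128 | 128 | 128 | 1.54 | 1.44 | 1.78 | 1.68 | 1.61 |
| ASCON-80pq | 160 | 128 | 128 | 1.52 | 1.43 | 1.71 | 1.65 | 1.57 |
| Schwaemm256-256 (SPARKLE) | 256 | 256 | 256 | 1.23 | 1.16 | 1.22 | 1.11 | 1.18 |
| TinyJAMBU-128 | 128 | 96 | 64 | 0.87 | 0.89 | 1.58 | 1.57 | 1.17 |
| GIFT-COFB | 128 | 128 | 128 | 1.01 | 1.01 | 1.16 | 1.15 | 1.08 |
| TinyJAMBU-192 | 192 | 96 | 64 | 0.73 | 0.76 | 1.35 | 1.36 | 1.00 |
| TinyJAMBU-256 | 256 | 96 | 64 | 0.67 | 0.70 | 1.27 | 1.28 | 0.93 |
| Grain-128AEAD | 128 | 96 | 64 | 0.30 | 0.33 | 0.63 | 0.67 | 0.45 |
| AES-128-GCM | 128 | 96 | 128 | 0.36 | 0.38 | 0.50 | 0.52 | 0.44 |
| AES-192-GCM | 192 | 96 | 128 | 0.34 | 0.35 | 0.46 | 0.48 | 0.40 |
| AES-256-GCM | 256 | 96 | 128 | 0.31 | 0.33 | 0.42 | 0.44 | 0.37 |
| Romulus-N | 128 | 128 | 128 | 0.27 | 0.29 | 0.32 | 0.34 | 0.31 |
| Delirium (Elephant) | 128 | 96 | 128 | 0.23 | 0.25 | 0.38 | 0.39 | 0.30 |
| PHOTON-Beetle-AEAD-ENC-128 | 128 | 128 | 128 | 0.18 | 0.20 | 0.33 | 0.35 | 0.25 |
| Romulus-M | 128 | 128 | 128 | 0.16 | 0.17 | 0.22 | 0.23 | 0.19 |
| ISAP-A-128A | 128 | 128 | 128 | 0.24 | 0.26 | 0.13 | 0.14 | 0.18 |
| Romulus-T | 128 | 128 | 128 | 0.07 | 0.07 | 0.10 | 0.11 | 0.09 |
| PHOTON-Beetle-AEAD-ENC-32 | 128 | 128 | 128 | 0.05 | 0.06 | 0.13 | 0.14 | 0.08 |
| ISAP-A-128 | 128 | 128 | 128 | 0.08 | 0.08 | 0.03 | 0.04 | 0.05 |
| ISAP-K-128A | 128 | 128 | 128 | 0.07 | 0.07 | 0.04 | 0.04 | 0.05 |
| Dumbo (Elephant) | 128 | 96 | 64 | 0.03 | 0.03 | 0.05 | 0.05 | 0.04 |
| Jumbo (Elephant) | 128 | 96 | 64 | 0.03 | 0.03 | 0.04 | 0.04 | 0.04 |
| ISAP-K-128 | 128 | 128 | 128 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |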

The hash algorithms are compared against BLAKE2s instead of ChaChaPoly:

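| Algorithm | Hash Bits | 1024 bytes | 128 bytes | 16 bytes | Average |
| --- | --- | --- | --- | --- | --- |
| Esch256 (SPARKLE) | 256 | 0.90 | 0.79 | 1.51 | 1.07 |
| SHA256-ASM | 256 | 1.11 | 0.79 | 1.15 | 1.02 |
| Xoodyak | 256 | 0.71 | 0.65 | 1.43 | 0.93 |
| ASCON-HASHA | 256 | 0.63 | 0.46 | 0.58 | 0.56 |
| SHA256-C | 256 | 0.56 | 0.40 | 0.60 | 0.52 |
| ASCON-HASH | 256 | 0.48 | 0.38 | 0.57 | 0.48 |
| Esch384 (SPARKLE) | 384 | 0.46 | 0.38 | 0.59 | 0.48 |
| Romulus-H | 256 | 0.10 | 0.09 | 0.22 | 0.14 |
| PHOTON-Beetle-HASH | 256 | 0.02 | 0.02 | 0.16 | 0.07 |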

SHA256-ASM uses a fully unrolled version of the SHA256 block transformation function in armv7m-compatible assembly code. SHA256-C is a straight-forward C version of SHA256 with very little unrolling, designed for small code and memory sizes.

The speed of SHA256 compared with the other candidates can be attributed in part to the "rate" of 64 bytes for SHA256, which allows it to process more data per block operation. The other algorithms have rates of 4, 8, 16, or 32 bytes. If all algorithms had the same rate, then the ordering would be more like this:

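| Algorithm | Average | Rate | Average / Rate |
| --- | --- | --- | --- |
| ASCON-HASHA | 0.56 | 8 | 0.0700 |
| Esch256 (SPARKLE) | 1.07 | 16 | 0.0669 |
| ASCON-HASH | 0.52 | 8 | 0.0650 |
| Xoodyak | 0.93 | 16 | 0.0581 |
| Esch384 (SPARKLE) | 0.48 | 16 | 0.0300 |
| PHOTON-Beetle-HASH | 0.07 | 4 | 0.0175 |
| SHA256-ASM | 1.02 | 64 | 0.0159 |
| SHA256-C | 0.52 | 64 | 0.0081 |
| Romulus-H | 0.14 | 32 | 0.0044 |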
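
The normalisation in the table above is a plain division of each average by the algorithm's rate in bytes, followed by a sort. A minimal sketch of that calculation, using a subset of the figures from the table; the function name is an illustrative assumption:

```python
def normalise_by_rate(results):
    """Divide each average by the algorithm's rate and sort best-first.

    `results` maps an algorithm name to an (average, rate_in_bytes) pair.
    """
    normalised = {name: avg / rate for name, (avg, rate) in results.items()}
    return sorted(normalised.items(), key=lambda item: item[1], reverse=True)


# Averages and rates copied from the table above (subset).
table = {
    "SHA256-ASM": (1.02, 64),
    "Esch256 (SPARKLE)": (1.07, 16),
    "ASCON-HASHA": (0.56, 8),
    "Xoodyak": (0.93, 16),
}

for name, value in normalise_by_rate(table):
    print(f"{name:20s} {value:.4f}")
```

With these inputs SHA256-ASM drops from second place to last, which illustrates how much of its raw speed comes from the 64-byte rate.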

Performance on ESP32

The tests below were run on an ESP32 Dev Module running at 240 MHz. The ordering is mostly the same as on ARM Cortex M3, with a few reversals where the architectural differences give some algorithms an added advantage.

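| Algorithm | Key Bits | Nonce Bits | Tag Bits | Encrypt 128 bytes | Decrypt 128 bytes | Encrypt 16 bytes | Decrypt 16 bytes | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Schwaemm128-128 (SPARKLE) | 128 | 128 | 128 | 1.07 | 1.06 | 1.68 | 1.60 | 1.32 |
| Schwaemm256-128 (SPARKLE) | 128 | 256 | 128 | 1.11 | 1.09 | 1.04 | 1.04 | 1.06 |
| Xoodyak | 128 | 128 | 128 | 0.91 | 0.92 | 1.06 | 1.07 | 0.99 |
| Schwaemm192-192 (SPARKLE) | 192 | 192 | 192 | 0.87 | 0.90 | 1.02 | 1.00 | 0.95 |
| ASCON-128a | 128 | 128 | 128 | 0.86 | 0.88 | 0.92 | 0.93 | 0.90 |
| GIFT-COFB | 128 | 128 | 128 | 0.80 | 0.83 | 0.90 | 0.90 | 0.86 |
| TinyJAMBU-128 | 128 | 96 | 64 | 0.62 | 0.64 | 1.12 | 1.12 | 0.83 |
| TinyJAMBU-192 | 192 | 96 | 64 | 0.55 | 0.57 | 1.01 | 1.02 | 0.75 |
| Schwaemm256-256 (SPARKLE) | 256 | 256 | 256 | 0.77 | 0.78 | 0.70 | 0.70 | 0.73 |
| AES-128-GCM | 128 | 96 | 128 | 0.59 | 0.60 | 0.82 | 0.83 | 0.70 |
| AES-192-GCM | 192 | 96 | 128 | 0.54 | 0.56 | 0.76 | 0.77 | 0.65 |
| TinyJAMBU-256 | 256 | 96 | 64 | 0.47 | 0.49 | 0.89 | 0.91 | 0.65 |
| ASCON-128 | 128 | 128 | 128 | 0.67 | 0.46 | 0.86 | 0.66 | 0.63 |
| ASCON-80pq | 160 | 128 | 128 | 0.67 | 0.44 | 0.84 | 0.61 | 0.61 |
| AES-256-GCM | 256 | 96 | 128 | 0.50 | 0.52 | 0.68 | 0.69 | 0.59 |
| Grain-128AEAD | 128 | 96 | 64 | 0.33 | 0.32 | 0.60 | 0.59 | 0.43 |
| PHOTON-Beetle-AEAD-ENC-128 | 128 | 128 | 128 | 0.16 | 0.18 | 0.30 | 0.32 | 0.23 |
| Romulus-N | 128 | 128 | 128 | 0.17 | 0.20 | 0.18 | 0.24 | 0.20 |
| Delirium (Elephant) | 128 | 96 | 128 | 0.14 | 0.15 | 0.22 | 0.23 | 0.18 |
| Romulus-M | 128 | 128 | 128 | 0.09 | 0.11 | 0.12 | 0.17 | 0.12 |
| ISAP-A-128A | 128 | 128 | 128 | 0.13 | 0.15 | 0.08 | 0.09 | 0.10 |
| PHOTON-Beetle-AEAD-ENC-32 | 128 | 128 | 128 | 0.04 | 0.05 | 0.12 | 0.13 | 0.07 |
| Romulus-T | 128 | 128 | 128 | 0.04 | 0.05 | 0.07 | 0.09 | 0.06 |
| ISAP-K-128A | 128 | 128 | 128 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 |
| ISAP-A-128 | 128 | 128 | 128 | 0.03 | 0.03 | 0.01 | 0.02 | 0.02 |
| Dumbo (Elephant) | 128 | 96 | 64 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
| Jumbo (Elephant) | 128 | 96 | 64 | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 |
| ISAP-K-128 | 128 | 128 | 128 | 0.0040 | 0.0047 | 0.0018 | 0.0020 | 0.0025 |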

Hash algorithms:

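| Algorithm | Hash Bits | 1024 bytes | 128 bytes | 16 bytes | Average |
| --- | --- | --- | --- | --- | --- |
| Xoodyak | 256 | 0.35 | 0.33 | 0.73 | 0.47 |
| SHA256-C | 256 | 0.47 | 0.37 | 0.55 | 0.47 |
| Esch256 (SPARKLE) | 256 | 0.38 | 0.34 | 0.64 | 0.45 |
| Esch384 (SPARKLE) | 384 | 0.24 | 0.20 | 0.30 | 0.25 |
| ASCON-HASHA | 256 | 0.27 | 0.20 | 0.25 | 0.24 |
| ASCON-HASH | 256 | 0.19 | 0.16 | 0.24 | 0.20 |
| Romulus-H | 256 | 0.07 | 0.06 | 0.09 | 0.09 |
| PHOTON-Beetle-HASH | 256 | 0.02 | 0.02 | 0.15 | 0.06 |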

SHA256-C does quite well on ESP32. In large part this is because SHA256's "rate" is 64 bytes which allows it to process more data per block operation than the other algorithms whose rate is 4, 8, 16, or 32 bytes per block operation.

All of the algorithms suffer on ESP32 because the CPU does not have a native word rotation instruction. BLAKE2s and SHA256 have a lower percentage of word rotations per round, so they are less affected by the CPU's shortcomings.
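
To illustrate why a missing rotate instruction costs time, here is a minimal sketch of a 32-bit left rotation built only from shifts, an OR, and a mask. It is written in Python for readability and is not the library's code; the exact instruction sequence a compiler emits on ESP32 is not covered here and may differ.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits using only shifts, OR and a mask.

    Without a single rotate instruction, each rotation becomes several
    operations (two shifts plus an OR) instead of one.
    """
    n &= 31      # keep the rotation count within 0..31
    x &= MASK32  # treat the input as an unsigned 32-bit word
    return ((x << n) | (x >> (32 - n))) & MASK32


print(hex(rotl32(0x12345678, 8)))  # 0x34567812
```

An algorithm whose round function is dominated by rotations pays this extra cost on every rotation, whereas one that mixes in more additions and XORs per rotation is affected proportionally less.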

Overall group rankings

Based on the above data, the NIST submissions can be roughly grouped with those of similar performance. Changes in CPU, optimisation options, loop unrolling, or assembly code replacement might modify the rank of an algorithm.

Only the primary algorithm in each family is considered for this ranking. I took the average of the ARM Cortex M3 and ESP32 figures from the above tables to compute an average across different architectures. I then grouped the algorithms into 0.1-wide buckets; for example everything with rank 3 has an average between 0.30 and 0.39 ChaChaPoly units.
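
The bucketing described above can be sketched as follows. The function name is an illustrative assumption, and the inputs are the per-platform averages of a few primary AEAD algorithms copied from the tables above. Values are converted to integer hundredths first so that the bucket does not depend on floating-point rounding.

```python
def rank(m3_average, esp32_average):
    """Bucket number for the mean of two per-platform averages.

    Each bucket is 0.1 ChaChaPoly Units wide, so rank 3 covers a mean
    from 0.30 up to (but not including) 0.40.
    """
    m3 = round(m3_average * 100)
    esp32 = round(esp32_average * 100)
    # The mean in hundredths is (m3 + esp32) / 2; a bucket spans 10 hundredths.
    return (m3 + esp32) // 20


# (ARM Cortex M3 average, ESP32 average) for some primary AEAD algorithms.
averages = {
    "Xoodyak": (1.97, 0.99),
    "ASCON-128": (1.61, 0.63),
    "TinyJAMBU-128": (1.17, 0.83),
    "GIFT-COFB": (1.08, 0.86),
    "Grain-128AEAD": (0.45, 0.43),
}

for name, (m3, esp32) in averages.items():
    print(name, rank(m3, esp32))
```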

AEAD algorithm rankings:

| Rank | Algorithms |
| --- | --- |
| 14 | SPARKLE, Xoodyak |
| 11 | ASCON |
| 10 | TinyJAMBU |
| 9 | GIFT-COFB |
| 5 | AES-128-GCM |
| 4 | Grain128-AEAD |
| 2 | PHOTON-Beetle, Romulus |
| 1 | ISAP |
| 0 | Elephant |

Hash algorithm rankings:

| Rank | Algorithms |
| --- | --- |
| 7 | SHA256, SPARKLE, Xoodyak |
| 3 | ASCON |
| 0 | PHOTON-Beetle, Romulus |

Changes in ARM Cortex M3 performance since Round 2

There have been many improvements to the performance of my implementations since Round 2, and some tweaks to the algorithms themselves to change the number of rounds or other aspects of the algorithms. This section summarises the changes.

ARM Cortex M3 has seen the greatest performance improvement with the introduction of assembly code versions of most algorithms. The tables below compare the ChaChaPoly Units figures for the baseline C versions from Round 2 with the current figures.

I did have some ARM Cortex M3 assembly code versions in my Round 2 repository, but they were implemented after the cut-off date for Round 2 status updates.

Changes in the primary AEAD algorithm performance for ARM Cortex M3, ordered from highest to lowest "New" ChaChaPoly values:

| Algorithm | Round 2 | New | Notable changes other than the use of assembly code |
| --- | --- | --- | --- |
| Xoodyak | 0.86 | 1.97 | Final round tweak improved performance on small packets |
| SPARKLE | 1.09 | 1.90 | |
| ASCON | 1.11 | 1.61 | |
| TinyJAMBU | 0.81 | 1.17 | |
| GIFT-COFB | 1.05 | 1.08 | |
| Grain128-AEAD | 0.37 | 0.45 | |
| Romulus | 0.19 | 0.31 | Switched to fixsliced SKINNY-128-384+ |
| Elephant (Delirium) | 0.05 | 0.30 | Optimised 32-bit and 64-bit versions of Keccak-p[200] in C |
| PHOTON-Beetle | 0.08 | 0.25 | Highly unrolled 32-bit version in C |
| ISAP-A | 0.13 | 0.18 | |
| ISAP-K | 0.02 | 0.05 | Optimised 64-bit version of Keccak-p[400] in C |
| Elephant (Dumbo) | 0.02 | 0.04 | Improved bit-sliced implementation of Spongent |

Note: The primary version of Elephant is the Spongent-based Dumbo, but the Keccak-based Delirium has improved significantly so I included that as well.

Changes in the primary hash algorithm performance for ARM Cortex M3:

| Algorithm | Round 2 | New |
| --- | --- | --- |
| SPARKLE | 0.46 | 1.07 |
| Xoodyak | 0.51 | 0.93 |
| ASCON | 0.30 | 0.48 |
| Romulus | N/A | 0.14 |
| PHOTON-Beetle | 0.02 | 0.07 |

Changes in ESP32 performance since Round 2

The ESP32 implementations are still in C, so the improvements in the AEAD encryption schemes were more modest, with a few notable changes:

| Algorithm | Round 2 | New | Notable changes |
| --- | --- | --- | --- |
| SPARKLE | 1.06 | 1.06 | |
| Xoodyak | 0.83 | 0.99 | Final round tweak improved performance on small packets |
| TinyJAMBU | 0.71 | 0.83 | Separate the permutations for 128, 192, and 256 bit key sizes and unroll |
| GIFT-COFB | 0.86 | 0.86 | |
| ASCON | 0.63 | 0.63 | |
| Grain128-AEAD | 0.43 | 0.43 | |
| PHOTON-Beetle | 0.08 | 0.23 | Highly unrolled 32-bit version in C |
| Romulus | 0.11 | 0.20 | Switched to fixsliced SKINNY-128-384+ |
| Elephant (Delirium) | 0.06 | 0.18 | Optimised 32-bit version of Keccak-p[200] in C |
| ISAP-A | 0.10 | 0.10 | |
| ISAP-K | 0.02 | 0.02 | |
| Elephant (Dumbo) | 0.02 | 0.02 | Improved bit-sliced implementation of Spongent |

Changes in the primary hash algorithm performance for ESP32:

| Algorithm | Round 2 | New |
| --- | --- | --- |
| Xoodyak | 0.47 | 0.47 |
| SPARKLE | 0.45 | 0.45 |
| ASCON | 0.20 | 0.20 |
| Romulus | N/A | 0.09 |
| PHOTON-Beetle | 0.02 | 0.06 |

Algorithms with native 64-bit support

My round 2 implementations were focused on 32-bit and 8-bit architectures. I have since added some implementations in C that are designed for 64-bit systems:

64-bit systems are detected by the LW_UTIL_CPU_IS_64BIT define in internal-util.h. Currently x86-64 and arm64 platforms are recognized. Patches welcome to support other 64-bit architectures.