Introduction

There is a lot of variation in the capabilities of embedded microprocessors. Some are superscalar; others are not. Some have specialised vector instructions; others do not. Clock speeds can also vary considerably. All this means that absolute figures such as "cycles per byte" or "megabytes per second" are of little use when trying to rank the algorithms by relative performance on any given microprocessor.

The approach I take here is "ChaChaPoly Units". The library contains a reasonably efficient 32-bit non-vectorized implementation of the ChaChaPoly AEAD scheme from my Arduino cryptography library. This makes it a known quantity that the other algorithms can be compared against side by side.

If an algorithm is measured at 0.8 ChaChaPoly Units on a specific embedded microprocessor at a specific clock speed, then it runs at 0.8 times the speed of ChaChaPoly on that microprocessor; that is, it is slower than ChaChaPoly. If the algorithm is instead measured at 2 ChaChaPoly Units, then it is twice as fast as ChaChaPoly on the same microprocessor. The higher the number of units, the better the algorithm.
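
As a minimal sketch of the arithmetic, assuming a unit figure is simply the ratio of the two elapsed times for the same packet on the same microprocessor. The helper name chachapoly_units is hypothetical and is not part of the library:

```c
#include <assert.h>

/* Hypothetical helper, not part of the library.
 * Both times must be measured for the same packet on the same
 * microprocessor and be expressed in the same time unit.
 * Assumption: units = ChaChaPoly time / algorithm time. */
static inline double chachapoly_units(double algorithm_time,
                                      double chachapoly_time)
{
    assert(algorithm_time > 0.0 && chachapoly_time > 0.0);
    return chachapoly_time / algorithm_time;
}
```

Under that assumption, an algorithm that needs 50 microseconds for a packet that ChaChaPoly processes in 100 microseconds scores 2 units, and one that needs 125 microseconds scores 0.8 units.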

The number of ChaChaPoly Units for each algorithm will vary for each microprocessor that is tested and for different choices of optimisation options. The figures below should be used as a rough guide to the relative performance of the algorithms, not an absolute measurement.

For hash algorithms we use BLAKE2s as the basic unit. BLAKE2s is based on ChaCha20 so it is the most logical hashing counterpart to ChaChaPoly.

This page details the performance results for 32-bit platforms. A separate page that details preliminary results for the 8-bit AVR platform can be found here.

The masking performance page contains comparisons of masked versions of the algorithms with their baseline versions.

Performance on ARM Cortex M3

All tests were run on an Arduino Due which is an ARM Cortex M3 running at 84MHz. The code was optimised for size rather than speed, which is the default optimisation option for the Arduino IDE. I found that "-Os" size optimisation often did better on the Due than "-O2" or "-O3" with the compiler that I had. Your own results may vary.

Each algorithm was tested with two packet sizes: 128 and 16 bytes. Some algorithms can have better performance on small packet sizes. The associated data is always zero-length.

Each value in the table below is the algorithm's speed as a multiple of ChaChaPoly's speed on the same packet size. Higher numbers mean better performance. The table is ordered from best average performance down.

Where a NIST submission contains multiple algorithms in a family, bold italics indicates the primary algorithm in the family.

All algorithms have been accelerated to some degree with armv7m-compatible assembly code.

Algorithm	Key Bits	Nonce Bits	Tag Bits	Encrypt 128 bytes	Decrypt 128 bytes	Encrypt 16 bytes	Decrypt 16 bytes	Average
Schwaemm128-128 (SPARKLE)	128	128	128	1.66	1.62	3.08	2.53	2.11
*Xoodyak*	128	128	128	1.77	1.66	2.34	2.16	1.97
*Schwaemm256-128* (SPARKLE)	128	256	128	1.93	1.61	2.13	1.97	1.90
Schwaemm192-192 (SPARKLE)	192	192	192	1.75	1.52	2.12	1.88	1.80
ASCON-128a	128	128	128	1.86	1.70	1.80	1.78	1.78
*ASCON-128*	128	128	128	1.54	1.44	1.78	1.68	1.61
ASCON-80pq	160	128	128	1.52	1.43	1.71	1.65	1.57
Schwaemm256-256 (SPARKLE)	256	256	256	1.23	1.16	1.22	1.11	1.18
*TinyJAMBU-128*	128	96	64	0.87	0.89	1.58	1.57	1.17
*GIFT-COFB*	128	128	128	1.01	1.01	1.16	1.15	1.08
TinyJAMBU-192	192	96	64	0.73	0.76	1.35	1.36	1.00
TinyJAMBU-256	256	96	64	0.67	0.70	1.27	1.28	0.93
*Grain-128AEAD*	128	96	64	0.30	0.33	0.63	0.67	0.45
AES-128-GCM	128	96	128	0.36	0.38	0.50	0.52	0.44
AES-192-GCM	192	96	128	0.34	0.35	0.46	0.48	0.40
AES-256-GCM	256	96	128	0.31	0.33	0.42	0.44	0.37
*Romulus-N*	128	128	128	0.27	0.29	0.32	0.34	0.31
Delirium (Elephant)	128	96	128	0.23	0.25	0.38	0.39	0.30
*PHOTON-Beetle-AEAD-ENC-128*	128	128	128	0.18	0.20	0.33	0.35	0.25
Romulus-M	128	128	128	0.16	0.17	0.22	0.23	0.19
*ISAP-A-128A*	128	128	128	0.24	0.26	0.13	0.14	0.18
Romulus-T	128	128	128	0.07	0.07	0.10	0.11	0.09
PHOTON-Beetle-AEAD-ENC-32	128	128	128	0.05	0.06	0.13	0.14	0.08
ISAP-A-128	128	128	128	0.08	0.08	0.03	0.04	0.05
ISAP-K-128A	128	128	128	0.07	0.07	0.04	0.04	0.05
*Dumbo* (Elephant)	128	96	64	0.03	0.03	0.05	0.05	0.04
Jumbo (Elephant)	128	96	64	0.03	0.03	0.04	0.04	0.04
ISAP-K-128	128	128	128	0.01	0.01	0.01	0.01	0.01

The hash algorithms are compared against BLAKE2s instead of ChaChaPoly:

Algorithm	Hash Bits	1024 bytes	128 bytes	16 bytes	Average
*Esch256* (SPARKLE)	256	0.90	0.79	1.51	1.07
*SHA256-ASM*	256	1.11	0.79	1.15	1.02
*Xoodyak*	256	0.71	0.65	1.43	0.93
ASCON-HASHA	256	0.63	0.46	0.58	0.56
SHA256-C	256	0.56	0.40	0.60	0.52
*ASCON-HASH*	256	0.48	0.38	0.57	0.48
Esch384 (SPARKLE)	384	0.46	0.38	0.59	0.48
*Romulus-H*	256	0.10	0.09	0.22	0.14
*PHOTON-Beetle-HASH*	256	0.02	0.02	0.16	0.07

SHA256-ASM uses a fully unrolled version of the SHA256 block transformation function in armv7m-compatible assembly code. SHA256-C is a straightforward C version of SHA256 with very little unrolling, designed for small code and memory sizes.

The speed of SHA256 compared with the other candidates can be attributed in part to the "rate" of 64 bytes for SHA256, which allows it to process more data per block operation. The other algorithms have rates of 4, 8, 16, or 32 bytes. If all algorithms had the same rate, then the ordering would be more like this:

Algorithm	Average	Rate	Average / Rate
ASCON-HASHA	0.56	8	0.0700
Esch256 (SPARKLE)	1.07	16	0.0669
ASCON-HASH	0.48	8	0.0600
Xoodyak	0.93	16	0.0581
Esch384 (SPARKLE)	0.48	16	0.0300
PHOTON-Beetle-HASH	0.07	4	0.0175
SHA256-ASM	1.02	64	0.0159
SHA256-C	0.52	64	0.0081
Romulus-H	0.14	32	0.0044
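
The last column of this table is a plain division of the average by the rate in bytes. A small sketch that reproduces it (the helper name average_per_rate is hypothetical):

```c
#include <assert.h>

/* Hypothetical helper: divide an average score (in BLAKE2s units)
 * by the rate, i.e. the number of bytes absorbed per block operation. */
static inline double average_per_rate(double average, unsigned rate_bytes)
{
    assert(rate_bytes > 0);
    return average / (double)rate_bytes;
}
```

For Esch256 this gives 1.07 / 16, about 0.0669, and for SHA256-ASM it gives 1.02 / 64, about 0.0159, matching the rows above.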

Performance on ESP32

The tests below were run on an ESP32 Dev Module running at 240MHz. The ordering is mostly the same as on the ARM Cortex M3, with a few reversals where architectural differences give some algorithms an added advantage.

Algorithm	Key Bits	Nonce Bits	Tag Bits	Encrypt 128 bytes	Decrypt 128 bytes	Encrypt 16 bytes	Decrypt 16 bytes	Average
Schwaemm128-128 (SPARKLE)	128	128	128	1.07	1.06	1.68	1.60	1.32
*Schwaemm256-128* (SPARKLE)	128	256	128	1.11	1.09	1.04	1.04	1.06
*Xoodyak*	128	128	128	0.91	0.92	1.06	1.07	0.99
Schwaemm192-192 (SPARKLE)	192	192	192	0.87	0.90	1.02	1.00	0.95
ASCON-128a	128	128	128	0.86	0.88	0.92	0.93	0.90
*GIFT-COFB*	128	128	128	0.80	0.83	0.90	0.90	0.86
*TinyJAMBU-128*	128	96	64	0.62	0.64	1.12	1.12	0.83
TinyJAMBU-192	192	96	64	0.55	0.57	1.01	1.02	0.75
Schwaemm256-256 (SPARKLE)	256	256	256	0.77	0.78	0.70	0.70	0.73
AES-128-GCM	128	96	128	0.59	0.60	0.82	0.83	0.70
AES-192-GCM	192	96	128	0.54	0.56	0.76	0.77	0.65
TinyJAMBU-256	256	96	64	0.47	0.49	0.89	0.91	0.65
*ASCON-128*	128	128	128	0.67	0.46	0.86	0.66	0.63
ASCON-80pq	160	128	128	0.67	0.44	0.84	0.61	0.61
AES-256-GCM	256	96	128	0.50	0.52	0.68	0.69	0.59
*Grain-128AEAD*	128	96	64	0.33	0.32	0.60	0.59	0.43
*PHOTON-Beetle-AEAD-ENC-128*	128	128	128	0.16	0.18	0.30	0.32	0.23
*Romulus-N*	128	128	128	0.17	0.20	0.18	0.24	0.20
Delirium (Elephant)	128	96	128	0.14	0.15	0.22	0.23	0.18
Romulus-M	128	128	128	0.09	0.11	0.12	0.17	0.12
*ISAP-A-128A*	128	128	128	0.13	0.15	0.08	0.09	0.10
PHOTON-Beetle-AEAD-ENC-32	128	128	128	0.04	0.05	0.12	0.13	0.07
Romulus-T	128	128	128	0.04	0.05	0.07	0.09	0.06
ISAP-K-128A	128	128	128	0.03	0.03	0.02	0.02	0.02
ISAP-A-128	128	128	128	0.03	0.03	0.01	0.02	0.02
*Dumbo* (Elephant)	128	96	64	0.01	0.01	0.02	0.02	0.02
Jumbo (Elephant)	128	96	64	0.01	0.02	0.02	0.02	0.02
ISAP-K-128	128	128	128	0.0040	0.0047	0.0018	0.0020	0.0025

Hash algorithms:

Algorithm	Hash Bits	1024 bytes	128 bytes	16 bytes	Average
*Xoodyak*	256	0.35	0.33	0.73	0.47
*SHA256-C*	256	0.47	0.37	0.55	0.47
*Esch256* (SPARKLE)	256	0.38	0.34	0.64	0.45
Esch384 (SPARKLE)	384	0.24	0.20	0.30	0.25
ASCON-HASHA	256	0.27	0.20	0.25	0.24
*ASCON-HASH*	256	0.19	0.16	0.24	0.20
*Romulus-H*	256	0.07	0.06	0.09	0.09
*PHOTON-Beetle-HASH*	256	0.02	0.02	0.15	0.06

SHA256-C does quite well on ESP32. In large part this is because SHA256's "rate" is 64 bytes, which allows it to process more data per block operation than the other algorithms, whose rates are 4, 8, 16, or 32 bytes per block operation.

All of the algorithms suffer on ESP32 because the CPU does not have a native word rotation instruction. BLAKE2s and SHA256 have a lower percentage of word rotations per round, so they are less affected by the CPU's shortcomings.
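
For illustration only, this is the common portable C idiom for a 32-bit left rotation; it is not taken from the library's source, and the name rotl32 is hypothetical. On a CPU with a rotate instruction, compilers can usually reduce the expression to that one instruction. Without one, the work has to be done with separate shift and OR operations for every rotation in every round.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits using only shifts and OR.
 * Masking both shift counts with 31 keeps n == 0 and n == 32 well
 * defined (shifting a 32-bit value by 32 is undefined in C). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31u)) | (x >> ((32u - n) & 31u));
}
```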

Overall group rankings

Based on the above data, the NIST submissions can be roughly grouped with those of similar performance. Changes in CPU, optimisation options, loop unrolling, or assembly code replacement might modify the rank of an algorithm.

Only the primary algorithm in each family is considered for this ranking. I took the average of the ARM Cortex M3 and ESP32 figures from the above tables to compute an average across different architectures. I then grouped the algorithms into 0.1-wide buckets; for example, everything with rank 3 has an average between 0.30 and 0.39 ChaChaPoly units.
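
The bucketing can be sketched as follows. The helper name bucket_rank is hypothetical. It takes the two averages in hundredths of a unit (1.97 becomes 197) so that the arithmetic stays in integers and a value such as 1.00 is not pushed into the wrong bucket by floating-point rounding:

```c
/* Hypothetical helper: rank = floor(mean of the two averages / 0.1).
 * Inputs are in hundredths of a unit, so the mean in hundredths is
 * (a + b) / 2 and the 0.1-wide bucket is that mean divided by 10. */
static inline unsigned bucket_rank(unsigned cortex_m3_hundredths,
                                   unsigned esp32_hundredths)
{
    return (cortex_m3_hundredths + esp32_hundredths) / 20u;
}
```

With the Xoodyak averages of 1.97 and 0.99 this gives rank 14, and with the TinyJAMBU-128 averages of 1.17 and 0.83 it gives rank 10.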

AEAD algorithm rankings:

Rank	Algorithms
14	SPARKLE, Xoodyak
11	ASCON
10	TinyJAMBU
9	GIFT-COFB
5	AES-128-GCM
4	Grain-128AEAD
2	PHOTON-Beetle, Romulus
1	ISAP
0	Elephant

Hash algorithm rankings:

Rank	Algorithms
7	SHA256, SPARKLE, Xoodyak
3	ASCON
0	PHOTON-Beetle, Romulus

Changes in ARM Cortex M3 performance since Round 2

There have been many improvements to the performance of my implementations since Round 2, and some tweaks to the algorithms themselves to change the number of rounds or other aspects of the algorithms. This section summarises the changes.

ARM Cortex M3 has seen the greatest performance improvement with the introduction of assembly code versions of most algorithms. We compare the baseline C versions from Round 2 with the current ChaChaPoly figures.

I did have some ARM Cortex M3 assembly code versions in my Round 2 repository, but they were implemented after the cut-off date for Round 2 status updates.

Changes in the primary AEAD algorithm performance for ARM Cortex M3, ordered from highest to lowest "New" ChaChaPoly values:

Algorithm	Round 2	New	Notable changes other than the use of assembly code
Xoodyak	0.86	1.97	Final round tweak improved performance on small packets
SPARKLE	1.09	1.90
ASCON	1.11	1.61
TinyJAMBU	0.81	1.17
GIFT-COFB	1.05	1.08
Grain-128AEAD	0.37	0.45
Romulus	0.19	0.31	Switched to fixsliced SKINNY-128-384+
Elephant (Delirium)	0.05	0.30	Optimised 32-bit and 64-bit versions of Keccak-p[200] in C
PHOTON-Beetle	0.08	0.25	Highly unrolled 32-bit version in C
ISAP-A	0.13	0.18
ISAP-K	0.02	0.05	Optimised 64-bit version of Keccak-p[400] in C
Elephant (Dumbo)	0.02	0.04	Improved bit-sliced implementation of Spongent

Note: The primary version of Elephant is the Spongent-based Dumbo, but the Keccak-based Delirium has improved significantly so I included that as well.

Changes in the primary hash algorithm performance for ARM Cortex M3:

Algorithm	Round 2	New
SPARKLE	0.46	1.07
Xoodyak	0.51	0.93
ASCON	0.30	0.48
Romulus	N/A	0.14
PHOTON-Beetle	0.02	0.07

Changes in ESP32 performance since Round 2

The ESP32 implementations are still in C, so the improvements in the AEAD encryption schemes were more modest, with a few notable changes:

Algorithm	Round 2	New	Notable changes
SPARKLE	1.06	1.06
Xoodyak	0.83	0.99	Final round tweak improved performance on small packets
TinyJAMBU	0.71	0.83	Separated the permutations for the 128, 192, and 256-bit key sizes and unrolled them
GIFT-COFB	0.86	0.86
ASCON	0.63	0.63
Grain-128AEAD	0.43	0.43
PHOTON-Beetle	0.08	0.23	Highly unrolled 32-bit version in C
Romulus	0.11	0.20	Switched to fixsliced SKINNY-128-384+
Elephant (Delirium)	0.06	0.18	Optimised 32-bit version of Keccak-p[200] in C
ISAP-A	0.10	0.10
ISAP-K	0.02	0.02
Elephant (Dumbo)	0.02	0.02	Improved bit-sliced implementation of Spongent

Changes in the primary hash algorithm performance for ESP32:

Algorithm	Round 2	New
Xoodyak	0.47	0.47
SPARKLE	0.45	0.45
ASCON	0.20	0.20
Romulus	N/A	0.09
PHOTON-Beetle	0.02	0.06

Algorithms with native 64-bit support

My Round 2 implementations were focused on 32-bit and 8-bit architectures. I have since added some implementations in C that are designed for 64-bit systems:

ASCON defaults to using 64-bit words on 64-bit platforms (also used by ISAP-A).
Keccak-p[200] variant that is optimised for 64-bit words, with each row held in a 64-bit register (used by Elephant).
Keccak-p[400] variant that is optimised for 64-bit words, with each row held in a pair of 64-bit registers (used by ISAP-K).
TinyJAMBU variant that divides the 128-bit state up into two 64-bit words instead of four 32-bit words. This halves the number of shift and OR operations that are needed to implement the permutation on 64-bit systems.
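
To illustrate why wider words reduce the shift and OR count, the sketch below extracts a window of bits that straddles a word boundary in a 128-bit state. It is not the library's TinyJAMBU code; the helper names and the bit offset used in the example are chosen purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 32 state bits starting at bit offset `bit`, with the state held as
 * four 32-bit words (word 0 holds bits 0..31).
 * Requires a window that straddles two words: bit % 32 != 0, bit < 96. */
static inline uint32_t window32(const uint32_t s[4], unsigned bit)
{
    unsigned w = bit / 32u, r = bit % 32u;
    assert(r != 0 && w < 3);
    return (s[w] >> r) | (s[w + 1] << (32u - r));
}

/* 64 state bits starting at bit offset `bit`, with the same state held
 * as two 64-bit words. Requires 0 < bit < 64. */
static inline uint64_t window64(const uint64_t s[2], unsigned bit)
{
    assert(bit > 0 && bit < 64);
    return (s[0] >> bit) | (s[1] << (64u - bit));
}
```

One call to window64 at offset 47 yields the same 64 bits as two calls to window32 at offsets 47 and 79, so the 64-bit layout needs one shift, shift, OR group where the 32-bit layout needs two.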

64-bit systems are detected by the LW_UTIL_CPU_IS_64BIT define in internal-util.h. Currently x86-64 and arm64 platforms are recognized. Patches welcome to support other 64-bit architectures.
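
A compile-time check of this kind might look like the sketch below. The names EXAMPLE_CPU_IS_64BIT, example_word_t, and example_word_bits are hypothetical, and the actual definition of LW_UTIL_CPU_IS_64BIT in internal-util.h may differ. The sketch relies on the __x86_64__ and __aarch64__ macros that GCC and Clang predefine for those targets.

```c
#include <stdint.h>

/* Hypothetical stand-in for the library's LW_UTIL_CPU_IS_64BIT define.
 * __x86_64__ and __aarch64__ are predefined by GCC and Clang when
 * compiling for x86-64 and arm64 respectively. */
#if defined(__x86_64__) || defined(__aarch64__)
#define EXAMPLE_CPU_IS_64BIT 1
#else
#define EXAMPLE_CPU_IS_64BIT 0
#endif

/* Pick the natural word type for a permutation's state at compile time. */
#if EXAMPLE_CPU_IS_64BIT
typedef uint64_t example_word_t;
#else
typedef uint32_t example_word_t;
#endif

/* Number of bits in the selected word type. */
static inline unsigned example_word_bits(void)
{
    return (unsigned)(sizeof(example_word_t) * 8u);
}
```

Supporting another 64-bit architecture in a scheme like this would mean adding that compiler's predefined macro to the first condition.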
