You can get a lot of embedded processing power for a euro these days.
An ARM Cortex-M0-based STM32F030 costs €1.11¹ and has approximately the computing power of a 1994-era 486 costing about €416².
How does modern authenticated encryption run on such devices?
We’ll measure encryption of plaintexts of different lengths. Each encryption will include 16 bytes of additional authenticated data (AAD). Nonce lengths and key sizes are chosen to match each algorithm’s requirements. AES-based algorithms will be tested with both 128-bit and 256-bit keys, NORX32 only uses 128-bit keys, and ChaCha20-Poly1305 only uses 256-bit keys.
For each encryption, we’ll count the number of cycles. We’ll also measure the stack usage and program size.
We count cycles by setting up the standard ARM SysTick peripheral to tick down once per cycle. When it reaches zero, we increment a counter and the SysTick is reloaded with its maximum value (`0xffffff`).
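The bookkeeping behind that cycle count can be sketched as below. This is a minimal illustration rather than the benchmark's actual code: `cycles_elapsed` and `SYSTICK_RELOAD` are names made up here, and the sketch assumes the wrap counter is bumped by the SysTick interrupt handler while the current value is read from the SysTick current-value register (`SysTick->VAL` in CMSIS terms).

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter; this is its largest reload value. */
#define SYSTICK_RELOAD 0xffffffu

/* Combine the number of wraps seen so far with the current counter value.
 * The counter runs from SYSTICK_RELOAD down to 0 and then reloads, so one
 * full wrap lasts SYSTICK_RELOAD + 1 cycles.
 * Hypothetical helper: on target, `wraps` would be incremented in the
 * SysTick interrupt handler and `current` read from the hardware. */
uint64_t cycles_elapsed(uint32_t wraps, uint32_t current)
{
    assert(current <= SYSTICK_RELOAD);
    return (uint64_t)wraps * (SYSTICK_RELOAD + 1u)
         + (SYSTICK_RELOAD - current);
}
```

Taking one such reading before an encryption and one after, then subtracting, gives the cycle count for that call.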
We measure stack usage by filling the stack with a pattern before a test starts (from the bottom upwards to its current extent), and then checking how the pattern was overwritten after the test.
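A sketch of that stack-painting idea follows. The function names and the `0xa5` fill byte are assumptions for illustration. On the real device, `limit` would be the current stack pointer and `top` the initial one; here they are plain pointers so the logic can be tried on any buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xa5u  /* arbitrary pattern byte (an assumption) */

/* Paint [lowest, limit) with the pattern. On target, `limit` would be the
 * current stack pointer, so live stack contents are left alone. */
void stack_paint(uint8_t *lowest, uint8_t *limit)
{
    assert(lowest <= limit);
    for (uint8_t *p = lowest; p < limit; p++)
        *p = STACK_FILL;
}

/* The stack grows downwards from `top`. Scan upwards from the lowest
 * address until the pattern is disturbed; everything from that point up to
 * `top` counts as used. */
size_t stack_used(const uint8_t *lowest, const uint8_t *top)
{
    assert(lowest <= top);
    const uint8_t *p = lowest;
    while (p < top && *p == STACK_FILL)
        p++;
    return (size_t)(top - p);
}
```

One caveat of this approach: if the deepest bytes the test writes happen to equal the fill pattern, the result is a slight underestimate.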
We measure program size statically by reading the size of the text section of each test program. We subtract from this the size of a test program which does nothing. All code is built with `-Os` (optimise for size first, speed second) and linked with `-gc-sections` to remove unused functions.
Cifra is a collection of cryptography primitives in standard C, targeted at small embedded devices. The code is intended to be clear, simple, and small. The aim is understanding and quality code, not speed records.
The functions beginning `aeadperf_` are the code we’re benchmarking.
Our hardware is an STM32F030F4P6 soldered to a breakout board, which is connected directly to an STLinkV2 debugger. The total cost is:
Item | Supplier | Cost |
---|---|---|
STM32F030F4P6 | Farnell | £0.80 / €1.11 |
STLinkV2 clone | Aliexpress | £2.06 / €2.87 |
TSSOP20 breakout | Aliexpress | £2.68 for 20 / €3.73 |
Total | | £5.54 / €7.71 |
Galois/Counter Mode (GCM) is a block cipher mode by McGrew and Viega, standardised in NIST SP 800-38D.
It encrypts the plaintext in counter mode, and authenticates it using a polynomial MAC called GHASH.
Cifra’s implementation of GHASH has side-channel countermeasures, which makes it slower than other implementations.
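To make the GHASH cost concrete, here is a sketch of the GF(2^128) multiplication at its heart, following the bit-by-bit right-shift algorithm in SP 800-38D. It is not Cifra's code, and `gf128_mul` is a name chosen for this example. It uses masks instead of data-dependent branches, which is the general flavour of a side-channel countermeasure, but I make no claim about how any particular compiler treats it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* out = x * y in GF(2^128), using GCM's bit ordering: the most significant
 * bit of byte 0 is the coefficient of the lowest power. */
void gf128_mul(uint8_t out[16], const uint8_t x[16], const uint8_t y[16])
{
    assert(out != NULL && x != NULL && y != NULL);

    uint8_t z[16] = {0};
    uint8_t v[16];
    memcpy(v, y, 16);

    for (int i = 0; i < 128; i++) {
        /* Select bit i of x and turn it into an all-zeros/all-ones mask. */
        uint8_t bit = (uint8_t)((x[i / 8] >> (7 - (i % 8))) & 1);
        uint8_t mask = (uint8_t)(0 - bit);
        for (int j = 0; j < 16; j++)
            z[j] ^= (uint8_t)(v[j] & mask);

        /* Multiply v by the field's generator: shift right one bit and
         * fold in the reduction constant 0xe1 if a bit fell off the end. */
        uint8_t carry = (uint8_t)(v[15] & 1);
        for (int j = 15; j > 0; j--)
            v[j] = (uint8_t)((v[j] >> 1) | (v[j - 1] << 7));
        v[0] >>= 1;
        v[0] ^= (uint8_t)(0xe1 & (0 - carry));
    }
    memcpy(out, z, 16);
}
```

Each 16-byte block of input costs one such multiplication, and the loop above touches all 128 bits regardless of their values, which hints at why a careful software GHASH is slow on a small core.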
EAX is a construction by Bellare, Rogaway and Wagner. It encrypts the plaintext in counter mode, and authenticates it using CMAC.
CCM is a construction by Housley, Whiting and Ferguson. It encrypts the plaintext in counter mode, and authenticates it using CBC-MAC.
Because plain CBC-MAC is only secure for messages of a fixed, known length, CCM has a convoluted internal structure and cannot encrypt a message without knowing its length beforehand.
CCM is widely used in communications protocols such as Bluetooth, IPsec, and WPA2.
NORX is a candidate in the CAESAR competition, designed by Aumasson, Jovanovic and Neves. It’s a very new AEAD algorithm with flavours of Salsa/ChaCha (the core permutation) and Keccak (the sponge structure).
The notation `NORX32-4-1` means an instance of NORX using 32-bit words, 4 rounds and no parallelisation.
One NORX round is worth two Salsa/ChaCha rounds, so this is about the same as ChaCha8.
You can expect this to have a lower security bound than ChaCha20, but also be about 2.5 times quicker.
ChaCha20-Poly1305 is a construction recently standardised in RFC 7539, glueing together the ChaCha20 stream cipher and the Poly1305 one-time MAC to give a general-purpose AEAD scheme.
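To give a feel for how little machinery the ChaCha20 half needs, here is a sketch of its quarter round, the add-rotate-xor step from which RFC 7539 builds the whole block function. The names `rotl32` and `chacha_quarter_round` are mine, not Cifra's.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits, for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* Apply the ChaCha quarter round to four words of the 16-word state,
 * selected by the indices a, b, c and d. */
void chacha_quarter_round(uint32_t s[16], int a, int b, int c, int d)
{
    assert(a >= 0 && a < 16 && b >= 0 && b < 16);
    assert(c >= 0 && c < 16 && d >= 0 && d < 16);

    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 16);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 12);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 8);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 7);
}
```

Only 32-bit additions, xors and rotations are involved, with no tables and no multiplications, which suits a core as small as the Cortex-M0.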
For encrypting a 256-byte message:
Algorithm | Cycles | Stack | Code size | Likely throughput³ |
---|---|---|---|---|
AES-128-CCM | 200048 | 680B | 2316B | 70.27KB/s |
AES-128-EAX | 210087 | 800B | 2604B | 70.07KB/s |
AES-128-GCM | 327313 | 700B | 2644B | 41.30KB/s |
AES-256-CCM | 271787 | 744B | 2400B | 51.45KB/s |
AES-256-EAX | 285730 | 864B | 2684B | 51.35KB/s |
AES-256-GCM | 362200 | 764B | 2728B | 37.49KB/s |
ChaCha20-Poly1305 | 163980 | 756B | 2728B | 94.23KB/s |
NORX32-4-1 | 25115 | 336B | 1808B | 717.02KB/s |
Even adjusting for the different security bound, NORX leads in every metric.
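For readers who want to turn a cycle count into a throughput figure themselves, the arithmetic is sketched below. The helper name is hypothetical, and the clock frequency is a parameter because it is not stated here. 48 MHz is the STM32F030's maximum and is only an assumption about how the part was run.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes processed per second, given the cycles taken for `bytes` bytes at
 * a core clock of `clock_hz`. The result is rounded down. */
uint64_t throughput_bytes_per_sec(uint64_t cycles, uint64_t bytes,
                                  uint64_t clock_hz)
{
    assert(cycles > 0);
    return bytes * clock_hz / cycles;
}
```

Under that 48 MHz assumption, NORX's 25115 cycles for 256 bytes works out to roughly 489,000 bytes per second. That is lower than the table's throughput column, which is computed for one long message and so discounts the set-up costs that a 256-byte measurement includes.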
In this chart you can clearly see the 16-byte block size of AES and Poly1305. You can also see the 40-byte input block size of NORX32, and the 64-byte block size of ChaCha20.
The slight decrease in cycles for larger message sizes between whole blocks in CCM and GCM is due to relatively slow code which adds padding – it needs to add more padding for these sizes. This is an area for improvement.
¹ From Farnell, in single quantities. Costs decrease vastly with quantity, or if you buy from Chinese suppliers.
² Source: 486DX2 50 MHz cost adjusted for today’s money.
³ This is for one long message. It therefore discounts set-up costs.