jbp.io Archive
01 June 2015

Benchmarking Modern Authenticated Encryption on €1 devices

You can get a lot of embedded processing power for a euro these days.

An ARM Cortex-M0-based STM32F030 costs €1.11 [1] and has approximately the computing power of a 1994-era 486 costing about €416 [2].

How does modern authenticated encryption run on such devices?

Methodology

We’ll measure encryption of plaintexts of different lengths. Each encryption will include 16 bytes of additional authenticated data (AAD). Nonce lengths and key sizes are chosen to match each algorithm’s requirements. AES-based algorithms will be tested with both 128-bit and 256-bit keys, NORX32 only uses 128-bit keys, and ChaCha20-Poly1305 only uses 256-bit keys.

For each encryption, we’ll count the number of cycles. We’ll also measure the stack usage and program size.

We count cycles by setting up the standard ARM systick peripheral to tick down once per cycle. When it reaches zero, we increment a counter and the systick is reloaded with its maximum value (0xffffff).
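As a rough illustration of how a wrap counter and the current systick value combine into a cycle count, here is a minimal sketch that runs on a host machine. The `cycle_snapshot` type and the function names are hypothetical and are not taken from the benchmark code. It assumes the counter was started at its reload value with the wrap counter at zero.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter; this is the reload value from the text. */
#define SYSTICK_RELOAD 0xffffffu

/* Hypothetical snapshot of the measurement state. */
typedef struct {
    uint32_t wraps;    /* incremented each time the counter reaches zero */
    uint32_t current;  /* current counter value, counting down from SYSTICK_RELOAD */
} cycle_snapshot;

/* Cycles elapsed since the counter was started at SYSTICK_RELOAD with wraps == 0. */
static uint64_t cycles_since_start(cycle_snapshot s)
{
    assert(s.current <= SYSTICK_RELOAD);
    /* One full period of the counter is SYSTICK_RELOAD + 1 cycles. */
    return (uint64_t)s.wraps * (SYSTICK_RELOAD + 1u)
         + (SYSTICK_RELOAD - s.current);
}

/* Cycles elapsed between two snapshots taken around the code under test. */
static uint64_t cycles_between(cycle_snapshot start, cycle_snapshot end)
{
    return cycles_since_start(end) - cycles_since_start(start);
}
```

On real hardware the wrap counter and the current value cannot be read in a single step, so a measurement harness also has to cope with a wrap happening between the two reads. This sketch leaves that out.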

We measure stack usage by filling the stack with a pattern before a test starts (from the bottom upwards to its current extent), and then checking how the pattern was overwritten after the test.
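The stack-painting idea can be sketched as follows, using an ordinary byte buffer to stand in for the unused part of the stack. The pattern byte and the function names are assumptions made for illustration, not the ones used in the benchmark code. The sketch also assumes a stack that grows downwards, as on ARM, so the deepest use is at the lowest overwritten address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_PATTERN 0xa5u  /* arbitrary fill byte, chosen for illustration */

/* Fill the unused region with the pattern. On the device, 'region' would run
 * from the lowest stack address up to the current stack pointer. */
static void stack_paint(uint8_t *region, size_t len)
{
    assert(region != NULL);
    memset(region, STACK_PATTERN, len);
}

/* Scan upwards from the lowest address until the first byte that no longer
 * holds the pattern. Everything from there to the top counts as used. */
static size_t stack_used(const uint8_t *region, size_t len)
{
    size_t untouched = 0;
    assert(region != NULL);
    while (untouched < len && region[untouched] == STACK_PATTERN)
        untouched++;
    return len - untouched;
}
```

A test would call `stack_paint` before running the code under test and `stack_used` afterwards. The result is a lower bound: if the code happens to write the pattern byte at its deepest point, or reserves stack space it never writes to, that use goes uncounted.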

We measure program size statically by reading the size of the text section of each test program. We subtract from this the size of a test program which does nothing. All code is built with -Os (optimise for size first, speed second) and linked with -gc-sections to remove unused functions.

Code

Cifra is a collection of cryptography primitives in standard C, targeted at small embedded devices. The code is intended to be clear, simple, and small. The aim is understanding and quality code, not speed records.

The functions beginning with aeadperf_ are the code we’re benchmarking.

Hardware

Our hardware is an STM32F030F4P6 soldered to a breakout board, which is connected directly to an STLinkV2 debugger. The total cost is:

Item              Supplier    Cost
STM32F030F4P6     Farnell     £0.80 / €1.11
STLinkV2 clone    Aliexpress  £2.06 / €2.87
TSSOP20 breakout  Aliexpress  £2.68 for 20 / €3.73
Total                         £5.54 / €7.71

The contenders

AES-GCM

Galois/Counter Mode (GCM) is a block cipher mode by McGrew and Viega, standardised in NIST SP 800-38D.

It encrypts the plaintext in counter mode, and authenticates it using a polynomial MAC called GHASH.

Cifra’s implementation of GHASH has side-channel countermeasures, which make it slower than other implementations.

AES-EAX

EAX is a construction by Bellare, Rogaway and Wagner. It encrypts the plaintext in counter mode, and authenticates it using CMAC.

AES-CCM

CCM is a construction by Housley, Whiting and Ferguson. It encrypts the plaintext in counter mode, and authenticates it using CBC-MAC.

Because plain CBC-MAC is only secure when the message length is fixed in advance, CCM has a convoluted internal structure and cannot encrypt a message without knowing its length beforehand.

CCM is widely used in communications protocols such as Bluetooth, IPsec, and WPA2.

NORX32-4-1

NORX is a candidate in the CAESAR competition, designed by Aumasson, Jovanovic and Neves. It’s a very new AEAD algorithm with flavours of Salsa/ChaCha (the core permutation) and Keccak (the sponge structure).

The notation NORX32-4-1 means an instance of NORX using 32-bit words, 4 rounds and no parallelisation. One NORX round is worth two Salsa/ChaCha rounds, so this is about the same as ChaCha8. You can expect this to have a lower security bound than ChaCha20, but also be about 2.5 times quicker.

ChaCha20-Poly1305

This is a construction recently standardised in RFC7539, glueing together the ChaCha20 stream cipher and Poly1305 one-time MAC to give a general purpose AEAD scheme.

Results

For encrypting a 256-byte message:

Algorithm          Cycles  Stack  Code size  Likely throughput [3]
AES-128-CCM        200048  680B   2316B      70.27KB/s
AES-128-EAX        210087  800B   2604B      70.07KB/s
AES-128-GCM        327313  700B   2644B      41.30KB/s
AES-256-CCM        271787  744B   2400B      51.45KB/s
AES-256-EAX        285730  864B   2684B      51.35KB/s
AES-256-GCM        362200  764B   2728B      37.49KB/s
ChaCha20-Poly1305  163980  756B   2728B      94.23KB/s
NORX32-4-1         25115   336B   1808B      717.02KB/s

Even adjusting for the different security bound, NORX leads in every metric.
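The throughput figures are for one long message, so they discount set-up costs. This post doesn’t spell out how they were computed or which clock speed they assume. One plausible way to get such a figure is to take the marginal cost per byte between two message lengths and divide the clock rate by it. The sketch below does that; the `measurement` type, the function name and the clock rate passed in are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical measurement: cycles taken to encrypt 'len' bytes. */
typedef struct {
    uint32_t len;
    uint64_t cycles;
} measurement;

/* Steady-state throughput in bytes per second, from the marginal cost between
 * a shorter and a longer message. Set-up cost is common to both measurements
 * and cancels out in the subtraction. */
static uint64_t throughput_bytes_per_s(uint64_t clock_hz,
                                       measurement shorter,
                                       measurement longer)
{
    uint64_t extra_bytes, extra_cycles;
    assert(longer.len > shorter.len);
    assert(longer.cycles > shorter.cycles);
    extra_bytes = longer.len - shorter.len;
    extra_cycles = longer.cycles - shorter.cycles;
    return clock_hz * extra_bytes / extra_cycles;
}
```

For example, with made-up measurements of 30000 cycles for 256 bytes and 50000 cycles for 512 bytes, and an assumed 48MHz clock, the marginal cost is 20000 cycles per 256 bytes, which gives 614400 bytes per second.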

In this chart you can clearly see the 16-byte block size of AES and Poly1305. You can also see the 40-byte input block size of NORX32, and the 64-byte block size of ChaCha20.

Between whole-block sizes, the cycle count for CCM and GCM drops slightly as the message gets longer. This is due to relatively slow code which adds padding: the further a message falls short of a whole number of blocks, the more padding it needs. This is an area for improvement.


  1. From Farnell, in single quantities. Costs decrease vastly with quantity, or if you buy from Chinese suppliers.

  2. Source: 486DX2 50MHz cost, adjusted for today’s money.

  3. This is for one, long message. It therefore discounts set-up costs.