01 June 2015

Benchmarking Modern Authenticated Encryption on €1 devices

Methodology
- Code
- Hardware
The contenders
Results

You can get a lot of embedded processing power for a euro these days.

An ARM Cortex-M0-based STM32F030 costs €1.11¹ and has approximately the computing power of a 1994-era 486 costing about €416².

How does modern authenticated encryption run on such devices?

Methodology

We’ll measure encryption of plaintexts of different lengths. Each encryption will include 16 bytes of additional authenticated data (AAD). Nonce lengths and key sizes are chosen to match each algorithm’s requirements. AES-based algorithms will be tested with both 128-bit and 256-bit keys, NORX32 only uses 128-bit keys, and ChaCha20-Poly1305 only uses 256-bit keys.

For each encryption, we’ll count the number of cycles. We’ll also measure the stack usage and program size.

We count cycles by setting up the standard ARM systick peripheral to tick down once per cycle. When it reaches zero, we increment a counter and the systick is reloaded with its maximum value (0xffffff).
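The arithmetic for turning the wrap counter and the current SysTick value into an elapsed cycle count can be sketched as follows. This is a host-runnable illustration rather than the actual test harness; the names `cycles_elapsed`, `SYSTICK_RELOAD`, `wraps` and `current` are hypothetical, and the sketch assumes the counter starts at the reload value with the wrap counter at zero.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter; reloading with 0xffffff gives a
 * period of 0x1000000 cycles. */
#define SYSTICK_RELOAD 0xffffffu
#define SYSTICK_PERIOD ((uint64_t)SYSTICK_RELOAD + 1u)

/* Hypothetical helper: total cycles since the counter was started at
 * SYSTICK_RELOAD with wraps == 0.
 *   wraps   - number of times the counter reached zero and reloaded
 *   current - counter value read back at the end of the measurement */
static uint64_t cycles_elapsed(uint32_t wraps, uint32_t current)
{
    assert(current <= SYSTICK_RELOAD);
    return (uint64_t)wraps * SYSTICK_PERIOD + (SYSTICK_RELOAD - current);
}
```

On real hardware the two values have to be read carefully, since a wrap between reading the wrap counter and reading the current value would skew the result.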

We measure stack usage by filling the stack with a pattern before a test starts (from the bottom upwards to its current extent), and then checking how the pattern was overwritten after the test.
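A minimal sketch of this stack-painting technique, operating on a plain byte buffer so it can be run on a host. The function names and the fill byte `0xa5` are assumptions made for illustration; on the target, the region would be the memory between the lowest stack address and the current stack pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_PATTERN 0xa5u /* arbitrary fill byte (an assumption) */

/* Fill the unused part of a stack region with the pattern.
 * `lowest` is the lowest address of the region; `unused` is the number
 * of bytes between it and the current stack pointer. */
static void stack_paint(volatile uint8_t *lowest, size_t unused)
{
    assert(lowest != NULL);
    for (size_t i = 0; i < unused; i++)
        lowest[i] = STACK_PATTERN;
}

/* Count how many painted bytes were overwritten. The stack grows
 * downwards on ARM, so untouched pattern bytes remain at the low end;
 * everything from the first non-pattern byte upwards counts as used. */
static size_t stack_used(const volatile uint8_t *lowest, size_t unused)
{
    assert(lowest != NULL);
    size_t intact = 0;
    while (intact < unused && lowest[intact] == STACK_PATTERN)
        intact++;
    return unused - intact;
}
```

One known weakness of this approach: if the code under test happens to leave the pattern byte itself at the deepest point of its stack, the result is a slight undercount.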

We measure program size statically by reading the size of the text section of each test program. We subtract from this the size of a test program which does nothing. All code is built with -Os (optimise for size first, speed second) and linked with -gc-sections to remove unused functions.

Code

Cifra is a collection of cryptography primitives in standard C, targeted towards small embedded devices. The code is intended to be clear, simple, and small. The aim is understanding and quality code, not speed records.

The functions beginning aeadperf_ are the code we’re benchmarking.

Hardware

Our hardware is an STM32F030F4P6 soldered to a breakout board, which is connected directly to an STLinkV2 debugger. The total cost is:

Item	Supplier	Cost
STM32F030F4P6	Farnell	£0.80 / €1.11
STLinkV2 clone	Aliexpress	£2.06 / €2.87
TSSOP20 breakout	Aliexpress	£2.68 for 20 / €3.73
Total		£5.54 / €7.71

The contenders

AES-GCM

Galois Counter Mode is a block cipher mode by McGrew and Viega standardised in SP800-38D.

It encrypts the plaintext in counter mode, and authenticates it using a polynomial MAC called GHASH.

Cifra’s implementation of GHASH has side-channel countermeasures, which makes it slower than other implementations.
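To illustrate what GHASH computes and why countermeasures cost cycles, here is a sketch of the GF(2^128) multiplication at its core, following the bit ordering of SP800-38D. It is not Cifra's code: the function name `gf128_mul` is hypothetical, and the mask-based style is just one common way to avoid secret-dependent branches and table lookups.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Multiply x by y in GF(2^128) using GCM's bit ordering: the most
 * significant bit of byte 0 is the coefficient of the lowest power.
 * Secret-dependent branches are replaced by masks. */
static void gf128_mul(uint8_t out[16], const uint8_t x[16], const uint8_t y[16])
{
    assert(out && x && y);
    uint8_t z[16] = {0};
    uint8_t v[16];
    memcpy(v, y, 16);

    for (int i = 0; i < 128; i++) {
        /* mask is 0xff if bit i of x is set, else 0x00 */
        uint8_t xbit = (x[i / 8] >> (7 - (i % 8))) & 1u;
        uint8_t mask = (uint8_t)(0u - xbit);
        for (int j = 0; j < 16; j++)
            z[j] ^= v[j] & mask;

        /* multiply v by the field generator: shift right one bit,
         * then fold in the reduction constant if a bit fell off */
        uint8_t carry = v[15] & 1u;
        for (int j = 15; j > 0; j--)
            v[j] = (uint8_t)((v[j] >> 1) | (v[j - 1] << 7));
        v[0] >>= 1;
        v[0] ^= (uint8_t)(0xe1u & (0u - carry));
    }
    memcpy(out, z, 16);
}
```

Every one of the 128 iterations touches all 16 bytes whether or not the corresponding bit is set. A table-driven implementation would be much faster, but indexes memory by secret data.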

AES-EAX

EAX is a construction by Bellare, Rogaway and Wagner. It encrypts the plaintext in counter mode, and authenticates it using CMAC.
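CMAC derives its two subkeys by repeatedly "doubling" the encryption of the all-zero block in GF(2^128), as specified in SP800-38B. The doubling step can be sketched as below; `cmac_double` is a hypothetical name and this is not taken from Cifra.

```c
#include <assert.h>
#include <stdint.h>

/* Double a 128-bit block in GF(2^128), as used to derive CMAC subkeys:
 * shift left one bit, and XOR 0x87 into the last byte if the top bit
 * was set. Written without a secret-dependent branch. */
static void cmac_double(uint8_t out[16], const uint8_t in[16])
{
    assert(out && in);
    uint8_t top = in[0] >> 7;
    for (int i = 0; i < 15; i++)
        out[i] = (uint8_t)((in[i] << 1) | (in[i + 1] >> 7));
    out[15] = (uint8_t)((in[15] << 1) ^ (0x87u & (0u - top)));
}
```

With L the encryption of the zero block, the first subkey is `cmac_double(L)` and the second is `cmac_double` of the first.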

AES-CCM

CCM is a construction by Housley, Whiting and Ferguson. It encrypts the plaintext in counter mode, and authenticates it using CBC-MAC.

Because plain CBC-MAC is not secure for variable-length messages, CCM has a convoluted internal structure and cannot encrypt messages without knowing the length beforehand.
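The length requirement is visible in the very first block CCM feeds to CBC-MAC, which encodes the total message length. A sketch of that block's layout, following RFC 3610 section 2.2 (the function name `ccm_format_b0` and its parameter names are hypothetical, and this is not Cifra's code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build CCM's first CBC-MAC block B0.
 *   L       - size of the length field in bytes (2..8); nonce is 15-L bytes
 *   M       - tag length in bytes (4, 6, ..., 16)
 *   has_aad - non-zero if additional authenticated data is present
 *   msg_len - total plaintext length, which must be known up front */
static void ccm_format_b0(uint8_t b0[16], const uint8_t *nonce,
                          unsigned L, unsigned M, int has_aad,
                          uint64_t msg_len)
{
    assert(L >= 2 && L <= 8);
    assert(M >= 4 && M <= 16 && (M % 2) == 0);
    /* the length must fit in L bytes */
    assert(L == 8 || (msg_len >> (8 * L)) == 0);

    /* flags byte: AAD present, encoded tag length, encoded L */
    b0[0] = (uint8_t)((has_aad ? 0x40u : 0x00u) | (((M - 2) / 2) << 3) | (L - 1));
    memcpy(b0 + 1, nonce, 15 - L);

    /* message length, big-endian, in the last L bytes */
    for (unsigned i = 0; i < L; i++) {
        b0[15 - i] = (uint8_t)(msg_len & 0xff);
        msg_len >>= 8;
    }
}
```

Since this block is processed before any plaintext, a streaming interface cannot start authenticating until the caller commits to a length.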

CCM is widely used in communications protocols like Bluetooth, IPSec, and WPA2.

NORX32-4-1

NORX is a candidate in the CAESAR competition, by Aumasson, Jovanovic and Neves. It’s a very new AEAD algorithm with flavours of Salsa/ChaCha (the core permutation) and Keccak (the sponge structure).

The notation NORX32-4-1 means an instance of NORX using 32-bit words, 4 rounds and no parallelisation. One NORX round is worth two Salsa/ChaCha rounds, so this is about the same as ChaCha8. You can expect this to have a lower security bound than ChaCha20, but also be about 2.5 times quicker.

ChaCha20-Poly1305

This is a construction recently standardised in RFC7539, glueing together the ChaCha20 stream cipher and Poly1305 one-time MAC to give a general purpose AEAD scheme.
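For a sense of why ChaCha20 suits a small 32-bit core, here is a sketch of its quarter round as defined in RFC 7539 section 2.1: only 32-bit additions, XORs and rotations. The names `rotl32` and `chacha_quarter_round` are chosen for illustration and are not Cifra's.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

/* ChaCha quarter round on four words of the 16-word state,
 * selected by index. */
static void chacha_quarter_round(uint32_t s[16], unsigned a, unsigned b,
                                 unsigned c, unsigned d)
{
    assert(a < 16 && b < 16 && c < 16 && d < 16);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 16);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 12);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 8);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 7);
}
```

A full ChaCha20 block applies this to the columns and diagonals of the state for 20 rounds, which is where most of the cycles in the ChaCha20-Poly1305 figures are likely to go.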

Results

For encrypting a 256-byte message:

Algorithm	Cycles	Stack	Code size	Likely throughput³
AES-128-CCM	200048	680B	2316B	70.27KB/s
AES-128-EAX	210087	800B	2604B	70.07KB/s
AES-128-GCM	327313	700B	2644B	41.30KB/s
AES-256-CCM	271787	744B	2400B	51.45KB/s
AES-256-EAX	285730	864B	2684B	51.35KB/s
AES-256-GCM	362200	764B	2728B	37.49KB/s
ChaCha20-Poly1305	163980	756B	2728B	94.23KB/s
NORX32-4-1	25115	336B	1808B	717.02KB/s

Even adjusting for the different security bound, NORX leads in every metric.

In this chart you can clearly see the 16-byte block size of AES and Poly1305. You can also see the 40-byte input block size of NORX32, and the 64-byte block size of ChaCha20.

The slight decrease in cycles for larger message sizes between whole blocks in CCM and GCM is due to relatively slow code which adds padding – it needs to add more padding for these sizes. This is an area for improvement.

¹ From Farnell, in single quantities. Costs vastly decrease with quantity, or if you buy from Chinese suppliers.
² Source: 486DX2 50MHz cost adjusted for today’s money.
³ This is for one, long message. It therefore discounts set-up costs.