Lightweight Cryptography Primitives
 All Data Structures Files Functions Variables Typedefs Macros Pages
Utilities for masked ciphers

Table of Contents

This page describes the utilities in the library that support the implementation of masked ciphers in plain C.

References

R. Cramer, I. Damgård, and Y. Ishai. Share Conversion, Pseudorandom Secret-Sharing and Applications to Secure Computation. In J. Kilian, editor, Theory of Cryptography Conference – TCC 2005, volume 3378 of Lecture Notes in Computer Science, pages 342–362. Springer, 2005, PDF.

Matthieu Rivain and Emmanuel Prouff. Provably secure higher-order masking of AES. In Stefan Mangard and François-Xavier Standaert, editors, CHES, volume 6225 of Lecture Notes in Computer Science, pages 413–427.Springer, 2010, PDF.

Why do we need masking?

Applications in embedded devices may encrypt or decrypt many times with the same key, either because the key is fixed into the device, or the same key is used on every packet in a session.

Each time the key is used, the power consumption of the device will fluctuate slightly based on the 0 and 1 bits from the key and the plaintext data. Over time it is possible to build up a statistical profile of a particular key's power consumption compared with other keys.

In the simplest form, encrypting with a 1 bit from the key uses a different amount of power than encrypting with a 0 bit. If the power consumption can be measured accurately enough, it may be possible to simply read the bits from the power consumption trace. This is called Simple Power Analysis (SPA).

SPA has historically been used to break RSA encryption on embedded devices. A zero bit involves a square operation, and a 1 bit involves square-and-multiply. The difference in power consumption means that the 0 and 1 bits of the RSA key can be literally read straight off an oscilloscope's screen.

A more powerful technique is Differential Power Analysis (DPA) which can pull useful information out of otherwise noisy data given enough statistical samples. The power consumption will bias one way or the other based on the key in a manner that DPA can detect.

More information on power analysis: https://en.wikipedia.org/wiki/Power_analysis

What is masking?

Masking is a technique to mitigate against power analysis side channels. For a traditional block cipher, the key and the plaintext input are split up into multiple random "shares", the cipher is applied to each of the shares, and then the results are combined to produce the expected result.

Consider a classical block cipher operation E(K, P) which encrypts a plaintext P under a key K. We can split this into three shares as follows:

Because the values K2, K3, P2, and P3 are different every time the block operation is invoked, it is difficult for SPA and DPA to collect usable statistical data and the attack is mitigated. There may still be other ways to recover the key, such as directly hacking into the device and stealing it, but power analysis is made more difficult.

The complexity of masking comes in the fourth step above: computing the three block operations in parallel. If the block cipher involves only XOR, rotate, and shift operations, then it truly can be computed in parallel with no interaction between the shares until the results are combined at the end.

However, all secure ciphers will also include AND and OR operations to mix the data more thoroughly. As we'll see below, this introduces some complexity into the calculations. The utilities in this library handle the complexity to make it easier to write plain C code that uses masking.

Example: Bit-sliced GIFT-128

We begin with an example of converting parts of the bit-sliced version of GIFT-128 into masked form.

At the start of a GIFT-128 encryption operation, the 128-bit plaintext input and the 128-bit key are loaded into 32-bit registers from the big-endian input buffers:

uint32_t s0 = be_load_word32(input);
uint32_t s1 = be_load_word32(input + 4);
uint32_t s2 = be_load_word32(input + 8);
uint32_t s3 = be_load_word32(input + 12);
uint32_t k0 = be_load_word32(key);
uint32_t k1 = be_load_word32(key + 4);
uint32_t k2 = be_load_word32(key + 8);
uint32_t k3 = be_load_word32(key + 12);

Using the masking utilities, this becomes:

mask_uint32_t s0, s1, s2, s3;
mask_uint32_t k0, k1, k2, k3;
mask_input(s0, be_load_word32(input));
mask_input(s1, be_load_word32(input + 4));
mask_input(s2, be_load_word32(input + 8));
mask_input(s3, be_load_word32(input + 12));
mask_input(k0, be_load_word32(key));
mask_input(k1, be_load_word32(key + 4));
mask_input(k2, be_load_word32(key + 8));
mask_input(k3, be_load_word32(key + 12));

The mask_uint32_t type contains the shares for a masked 32-bit word. The default implementation of this type has 4 shares:

typedef struct
{
uint32_t a;
uint32_t b;
uint32_t c;
uint32_t d;

The mask_input() macro generates random material and masks the value:

s0.a = be_load_word32(input) ^ s0.b ^ s0.c ^ s0.d;

The aead_random_generate_32() function is internal to the library and generates a random 32-bit value each time it is called. Recent 32-bit ARM and other microcontrollers come with an instruction or peripheral register that provides a random 32-bit value, so obtaining random data is not hard.

Note: You need to call aead_random_init() before any use of the masking utilities to initialize the random number source. But that can be placed into your application's startup code or your AEAD mode's setup phase.

The masking is repeated for the other plaintext and key words. Later, we compute the bit-sliced GIFT-128 S-box as follows:

s1 ^= s0 & s2;
s0 ^= s1 & s3;
s2 ^= s0 | s1;
s3 ^= s2;
s1 ^= s3;
s3 ^= 0xFFFFFFFFU;
s2 ^= s0 & s1;
t = s0;
s0 = s3;
s3 = t;

Using the masking utilities, this becomes:

mask_and(s1, s0, s2);
mask_and(s0, s1, s3);
mask_or(s2, s0, s1);
mask_xor(s3, s2);
mask_xor(s1, s3);
mask_and(s2, s0, s1);
mask_swap(s0, s3);

As can be seen, the structure of the masked version of the S-box is very similar to the original, which aids in debugging the implementation during development.

Finally, once encryption is complete, we need to store the ciphertext to the output buffer in big-endian order:

be_store_word32(output, mask_output(s0));
be_store_word32(output + 4, mask_output(s1));
be_store_word32(output + 8, mask_output(s2));
be_store_word32(output + 12, mask_output(s3));

After macro expansion, this is equivalent to:

be_store_word32(output, s0.a ^ s0.b ^ s0.c ^ s0.d);
be_store_word32(output + 4, s1.a ^ s1.b ^ s1.c ^ s1.d);
be_store_word32(output + 8, s2.a ^ s2.b ^ s2.c ^ s2.d);
be_store_word32(output + 12, s3.a ^ s3.b ^ s3.c ^ s3.d);

Types

The library provides explicit types for 2, 3, 4, 5, and 6 share versions of 16-bit, 32-bit, and 64-bit words:

The library also contains the following generic types:

The generic types are defined to one of the previous types based on the value of AEAD_MASKING_SHARES. For example, if the number of shares is 3, then the generic types are mapped as follows:

Your code can use the explicit versions if you need an exact number of shares, or you can use the generic types for the default number of shares that is specified by AEAD_MASKING_SHARES, usually 4. Using the generic types is recommended because then it is easy to recompile your code to use a different sharing ratio.

Operations

The following generic macros are defined to assist with working with masked words. The equivalent plain C operation is shown on the right:

mask_input(w, x) w = mask(x)
mask_output(w) return unmask(w)
mask_xor_const(w, c) w ^= c
mask_xor(w1, w2) w1 ^= w2
mask_xor3(w1, w2, w3) w1 ^= (w2 ^ w3)
mask_not(w) w = ~w
mask_and(w1, w2, w3) w1 ^= (w2 & w3)
mask_and_not(w1, w2, w3) w1 ^= ((~w2) & w3)
mask_or(w1, w2, w3) w1 ^= (w2 | w3)
mask_shl(w1, w2, bits) w1 = (w2 << bits)
mask_shr(w1, w2, bits) w1 = (w2 >> bits)
mask_rol(w1, w2, bits) w1 = (w2 <<< bits)
mask_ror(w1, w2, bits) w1 = (w2 >>> bits)
mask_swap(w1, w2) swaps w1 and w2
mask_swap_move(w1, w2, mask, shift) swaps bits in w1 and w2

These operations will work on 16-bit, 32-bit, and 64-bit masked words, adapting automatically to the underlying masked word type.

There are also versions of these macros for each explcit sharing ratio between 2 and 6. For example, if you wanted to force the use of the 3-share versions, you would use the following macros instead of the generic ones:

mask_x3_input(w, x) w = mask(x)
mask_x3_output(w) return unmask(w)
mask_x3_xor_const(w, c) w ^= c
mask_x3_xor(w1, w2) w1 ^= w2
mask_x3_xor3(w1, w2, w3) w1 ^= (w2 ^ w3)
mask_x3_not(w) w = ~w
mask_x3_and(w1, w2, w3) w1 ^= (w2 & w3)
mask_x3_and_not(w1, w2, w3) w1 ^= ((~w2) & w3)
mask_x3_or(w1, w2, w3) w1 ^= (w2 | w3)
mask_x3_shl(w1, w2, bits) w1 = (w2 << bits)
mask_x3_shr(w1, w2, bits) w1 = (w2 >> bits)
mask_x3_rol(w1, w2, bits) w1 = (w2 <<< bits)
mask_x3_ror(w1, w2, bits) w1 = (w2 >>> bits)
mask_x3_swap(w1, w2) swaps w1 and w2
mask_x3_swap_move(w1, w2, mask, shift) swaps bits in w1 and w2

How do the operations work?

This section provides details on how each of the operations are defined, and how that leads to an effective masked implementation. For simplicity, we will use the 3-share versions unless otherwise stated.

Input and output

The input to a cipher must be masked and converted into shares before any other operations can be performed:

uint32_t x = ...;

Behind the scenes, mask_x3_input() expands to:

w.a = x ^ w.b ^ w.c;

As can be seen, masking involves generating random values for the second and subsequent shares and then XOR'ing it with the input to create the first share. After masking, the original value is no longer operative; w.a, w.b, and w.c are all randomized. For the 4-share and higher variants, more random words are generated and XOR'ed with x.

To recover the results at the end of the cipher, the shares are recombined using mask_x3_output():

uint32_t y = mask_x3_output(w);

This expands to:

uint32_t y = w.a ^ w.b ^ w.c;

XOR

XOR operations are very simple. The mask_x3_xor() macro applies XOR to each of the shares individually:

mask_x3_xor(w1, w2) :
w1.a ^= w2.a
w1.b ^= w2.b
w1.c ^= w2.c

Things are slightly different when XOR'ing round constants into a masked word:

w.a ^= c

Here, only the first share in the masked word is updated. The rest of the shares are left as-is. To show why we only update the first share, let's see what happens with the 2-share version if we were to XOR the constant with all shares instead of just the first:

w.b = rand
w.a = x ^ w.b
w.a ^= c
w.b ^= c
y = w.a ^ w.b
= (x ^ rand ^ c) ^ (rand ^ c)
= x

We expected "x ^ c" at the end, not x. The constant cancelled itself out when the final state was unmasked. If we only XOR the round constant into the first share, we get:

w.b = rand
w.a = x ^ w.b
w.a ^= c
y = w.a ^ w.b
= (x ^ rand ^ c) ^ rand
= x ^ c

This gives us the result we wanted.

Finally, mask_xor3() can be used to XOR three masked words together:

mask_x3_xor3(w1, w2, w3) :
w1.a ^= (w2.a ^ w3.a)
w1.b ^= (w2.b ^ w3.b)
w1.c ^= (w2.c ^ w3.c)

The 3-word version of XOR can be more efficient than two separate calls to mask_xor() because the intermediate values for each share can be kept in registers longer.

NOT

NOT operations are a special case of XOR'ing with the all-1's round constant. The following are equivalent for 32-bit masked words:

mask_x3_xor_const(w, 0xFFFFFFFF);

The mask_x3_not() function is more convenient because it works on 16-bit, 32-bit, and 64-bit words without the programmer having to supply an all-1's constant of the right size.

Let's expand mask_x3_not() and see what it is doing:

w.b = rand1
w.c = rand2
w.a = x ^ w.b ^ w.c
w.a = ~w.a
y = w.a ^ w.b ^ w.c
= ~(x ^ rand1 ^ rand2) ^ rand1 ^ rand2
= (~x) ^ rand1 ^ rand2 ^ rand1 ^ rand2
= ~x

AND

The AND operation is the most complex of the masked operations. Here is the 4-share version:

mask_x4_and(w1, w2, w3) :
w1.a ^= (w2.a & w3.a)
mix(w1.a, w2.a, w3.a, w1.b, w2.b, w3.b)
mix(w1.a, w2.a, w3.a, w1.c, w2.c, w3.c)
mix(w1.a, w2.a, w3.a, w1.d, w2.d, w3.d)
w1.b ^= (w2.b & w3.b)
mix(w1.b, w2.b, w3.b, w1.c, w2.c, w3.c)
mix(w1.b, w2.b, w3.b, w1.d, w2.d, w3.d)
w1.c ^= (w2.c & w3.c)
mix(w1.c, w2.c, w3.c, w1.d, w2.d, w3.d)
w1.d ^= (w2.d & w3.d)
mix(a1, a2, a3, b1, b2, b3) :
a1 ^= temp
b1 ^= temp ^ (b3 & a2) ^ (b2 & a3)

Essentially this is performing a diagonal matrix operation where each share is mixed with all of the shares below it and to the right. At each step, a new random number is generated and mixed into the state.

Note: The mask_and(), mask_and_not(), and mask_or() macros require a local variable called "temp" in the current scope that is of type uint16_t, uint32_t, or uint64_t depending upon the type of masked word you are operating on.

For a masked word with N shares, each AND operation requires (N² - N) / 2 random number generator calls and mix operations. This has a significant impact on performance compared with the non-masked version of AND. The higher the number of shares, or the higher the number of AND operations, the slower the masked cipher will become.

AND-NOT

A common variant of AND is to invert the first argument. For example, here is step chi of Xoodoo:

x00 ^= (~x10) & x20;
x10 ^= (~x20) & x00;
x20 ^= (~x00) & x10;
x01 ^= (~x11) & x21;
x11 ^= (~x21) & x01;
x21 ^= (~x01) & x11;
x02 ^= (~x12) & x22;
x12 ^= (~x22) & x02;
x22 ^= (~x02) & x12;
x03 ^= (~x13) & x23;
x13 ^= (~x23) & x03;
x23 ^= (~x03) & x13;

While it is possible to create temporaries to hold the NOT-inverted intermediate values, it is easier and more memory-efficient to use mask_and_not():

mask_and_not(x00, x10, x20);
mask_and_not(x10, x20, x00);
mask_and_not(x20, x00, x10);
mask_and_not(x01, x11, x21);
mask_and_not(x11, x21, x01);
mask_and_not(x21, x01, x11);
mask_and_not(x02, x12, x22);
mask_and_not(x12, x22, x02);
mask_and_not(x22, x02, x12);
mask_and_not(x03, x13, x23);
mask_and_not(x13, x23, x03);
mask_and_not(x23, x03, x13);

OR

The OR operation is implemented in terms of AND using DeMorgan's Law: (A | B) = ~(~(A) & ~(B)). The 4-share version of OR is:

mask_x4_or(w1, w2, w3) :
w1.a ^= (w2.a | w3.a)
mix(w1.a, ~w2.a, ~w3.a, w1.b, w2.b, w3.b)
mix(w1.a, ~w2.a, ~w3.a, w1.c, w2.c, w3.c)
mix(w1.a, ~w2.a, ~w3.a, w1.d, w2.d, w3.d)
w1.b ^= (w2.b & w3.b)
mix(w1.b, w2.b, w3.b, w1.c, w2.c, w3.c)
mix(w1.b, w2.b, w3.b, w1.d, w2.d, w3.d)
w1.c ^= (w2.c & w3.c)
mix(w1.c, w2.c, w3.c, w1.d, w2.d, w3.d)
w1.d ^= (w2.d & w3.d)

As mentioned previously, the NOT operation only affects the first share, so only some of the terms need to be inverted to convert OR into AND.

Shift and rotate

The mask_x3_shl(), mask_x3_shr(), mask_x3_rol(), and mask_x3_ror() functions perform a left shift, right shift, left rotate, and right rotate respectively. Like mask_x3_xor(), they operate on each share individually.

mask_x3_shl(w1, w2, bits) :
w1.a = (w2.a << bits)
w1.b = (w2.b << bits)
w1.c = (w2.c << bits)

The first two arguments can be the same masked word for in-place shift and rotate operations:

mask_x3_shl(w, w, bits) :
w.a <<= bits
w.b <<= bits
w.c <<= bits

Swap

The mask_swap() function swaps two masked words and the mask_swap_move() function performs a swap on some of the bits in two masked words.