A Detailed Look at the OpenSSL Implementation of AES: Key Schedule, S-Box, and Counter Mode

Introduction: Peering Under the Hood of OpenSSL’s AES Engine
In the quiet hum of a modern data center, countless bytes are being shuttled between servers and clients, wrapped in layers of cryptographic protection. You don’t see them, but AES—the Advanced Encryption Standard—is the silent workhorse behind HTTPS, VPNs, file encryption, and a thousand other digital transactions. We trust it implicitly. But trust in cryptography is not blind faith; it is earned through rigorous analysis, careful implementation, and the sobering awareness that even a perfectly designed algorithm can be rendered useless by a single coding misstep.
OpenSSL is one of the most widely deployed open-source cryptographic libraries in the world. From enterprise load balancers to tiny embedded IoT sensors, its AES implementation runs constantly. Yet few developers ever look past API calls like AES_set_encrypt_key() and AES_ctr128_encrypt() (low-level entry points that newer releases steer users away from in favor of the EVP interface). We treat OpenSSL as a black box, assuming its code is battle-tested and secure. And it largely is, except when it isn’t. Remember Heartbleed? That was a buffer over-read in OpenSSL’s TLS heartbeat extension, not a cipher flaw, but it underscored a painful truth: implementation matters as much as algorithm design. An attack on an AES implementation, such as a cache-timing side channel that leaks S-box access patterns, can recover a secret key just as effectively as a mathematical break of the cipher itself.
Understanding how OpenSSL implements AES is not just an academic exercise. It’s a crash course in the trade-offs between speed, memory, and security—the very same trade-offs that every high‑performance security engineer must navigate. In this post, we’ll take a detailed look at three critical components of OpenSSL’s AES engine: the key schedule, the S‑box, and the Counter (CTR) mode. These are the nuts and bolts that determine whether your encryption leaks secrets through side channels, whether your key schedule is constant-time, and whether your CTR implementation can be safely parallelized. We’ll dive into the actual C code, examine assembly optimizations, and explore the decades‑old cat‑and‑mouse game between cryptographers and side‑channel attackers. By the end, you’ll have a mental model of OpenSSL’s AES that goes far beyond “it just works.”
1. The AES Algorithm: A Quick Refresher
Before dissecting OpenSSL’s implementation, we need a common vocabulary. AES is a symmetric block cipher that operates on 16‑byte (128‑bit) blocks using key sizes of 128, 192, or 256 bits. The algorithm—originally designed by Joan Daemen and Vincent Rijmen as Rijndael—consists of a fixed number of rounds (10 for AES‑128, 12 for AES‑192, 14 for AES‑256). Each round applies four transformations:
- SubBytes – a nonlinear byte substitution using a fixed S‑box (a 16×16 lookup table).
- ShiftRows – a cyclic rotation of the rows of the state matrix.
- MixColumns – a linear mixing operation on each column (absent in the final round).
- AddRoundKey – XOR of the state with a round key derived from the master key via the key schedule.
The round keys themselves are generated by expanding the master key into a linear array of 32‑bit words. For AES‑128, 44 words (11 round keys) are produced using the RotWord and SubWord operations along with round constants.
From a software perspective, the most computationally intensive parts are the S‑box substitution (which involves table lookups) and the MixColumns multiplication (which in most implementations uses a second set of lookup tables for speed). OpenSSL, like many libraries, historically employed large precomputed tables (T‑tables) to perform SubBytes, ShiftRows, and MixColumns in a single pass. However, these tables became the target of devastating cache‑timing attacks. Over the years, OpenSSL has evolved to include constant‑time tableless implementations and hardware‑accelerated paths via AES‑NI instructions.
Now let’s zoom into the three critical components that every serious implementer must get right.
2. The Key Schedule: From Master Key to Round Keys
The key schedule is the first piece of code that runs when you call AES_set_encrypt_key(). Its job is to expand the relatively short user key (16, 24, or 32 bytes) into a longer array of round keys (16×(rounds+1) bytes). If the key schedule is flawed—either leaking key material through timing or using an insecure derivation—the entire encryption collapses.
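The sizes involved follow directly from the key length. The sketch below is a minimal illustration, not OpenSSL code: the struct and function names are made up for this example, and the formulas (Nk = key bits / 32, Nr = Nk + 6, 4 × (Nr + 1) words) are the standard AES parameters.

```c
#include <stddef.h>

/* Parameters of the AES key schedule for a given key length.
   Names are illustrative only; they are not part of OpenSSL. */
typedef struct {
    int nk;        /* key length in 32-bit words (4, 6, or 8) */
    int rounds;    /* number of rounds (10, 12, or 14) */
    int words;     /* 32-bit words in the expanded schedule */
    size_t bytes;  /* size of the expanded schedule in bytes */
} aes_sched_params;

/* Returns 0 on success, -1 if key_bits is not 128, 192, or 256. */
int aes_sched_params_for(int key_bits, aes_sched_params *p)
{
    if (key_bits != 128 && key_bits != 192 && key_bits != 256)
        return -1;
    p->nk = key_bits / 32;
    p->rounds = p->nk + 6;
    p->words = 4 * (p->rounds + 1);
    p->bytes = 16 * (size_t)(p->rounds + 1);
    return 0;
}
```

For a 128-bit key this gives 10 rounds, 44 words, and 176 bytes of round keys; for a 256-bit key, 14 rounds, 60 words, and 240 bytes.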
How OpenSSL Expands the Key
OpenSSL’s key expansion routine for AES‑128 is found in crypto/aes/aes_core.c (or in platform‑specific assembly). The core loop processes the 44‑word schedule. For the first 4 words (Nk = 4), the routine simply copies the master key. Then, for each subsequent word at index i:
- If i mod Nk == 0: rotate the previous word by one byte (RotWord), substitute each byte via the S-box (SubWord), XOR with the round constant rcon[i/Nk], and then XOR the result with the word Nk positions back.
- Otherwise: XOR the word Nk positions back with the previous word.
The round constants are small values derived from 2^{i-1} in GF(2^8). OpenSSL stores them in a static array.
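The round constants can also be generated instead of stored, which makes their structure easier to see. The following is a minimal sketch, assuming the byte form of the constants (a table used on 32-bit words would typically hold each byte in the most significant position); the function names are hypothetical.

```c
#include <stdint.h>

/* Multiply by x (0x02) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1.
   The reduction is applied with a multiply instead of a branch. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a >> 7) * 0x1b));
}

/* Fill out[0..n-1] with the round constants x^0, x^1, ..., x^(n-1). */
void aes_rcon_bytes(uint8_t *out, int n)
{
    uint8_t r = 0x01;
    for (int i = 0; i < n; i++) {
        out[i] = r;
        r = xtime(r);
    }
}
```

The first eight constants are plain powers of two (0x01 through 0x80). The ninth and tenth are 0x1b and 0x36, because doubling 0x80 overflows the byte and triggers the polynomial reduction.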
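For AES-192 and AES-256 the procedure is similar but with different values of Nk (6 and 8) and more rounds (12 and 14); the block size Nb is 4 words for every key length. AES-256 adds one extra step: when i mod Nk == 4, the previous word goes through SubWord (without RotWord or a round constant) before the XOR.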
A naive implementation uses loops and table lookups. The loop bounds and branches depend only on the key length, which is public, so they do not leak key material by themselves. The lookups are the problem: SubWord indexes the S-box (or a T-table) with key-derived bytes, so the addresses touched depend on the secret. Because the key schedule usually runs once per key rather than once per block, an attacker gets far fewer observations here than in the encryption hot path. Even so, an attacker who shares a cache with the victim (for example via a FLUSH+RELOAD-style probe on the last-level cache) may be able to learn something about the key from the table lines touched during expansion.
Constant‑Time Key Schedule: Why It Matters
Suppose an attacker can measure the time it takes to expand a key. If the expansion contains a branch or a memory access that depends on key bytes, the timing may reveal information about the key. Note that the extra SubWord in AES-256 is selected by the word index, not by key data, so it leaks nothing by itself; the exposure comes from the S-box lookups inside SubWord. A key schedule that aims to be constant-time executes the same instruction sequence and touches the same memory regardless of the key data. This is achieved by:
- Running a fixed number of iterations determined only by the (public) key length: 44, 52, or 60 words.
- Using constant‑time rotations and substitutions (table lookups that don’t branch on secret data).
- Avoiding early exit conditions.
Here is a simplified sketch in the style of the table-based path in aes_core.c (the function name is illustrative, not OpenSSL’s). It shows the fixed iteration count and how the substitution step reuses the encryption tables:
static void aes_key_schedule_128(const unsigned char *userkey, AES_KEY *key) {
    u32 *rk = key->rd_key;
    int i;
    u32 temp;

    /* The first four words are the user key itself (big-endian loads). */
    rk[0] = GETU32(userkey     );
    rk[1] = GETU32(userkey +  4);
    rk[2] = GETU32(userkey +  8);
    rk[3] = GETU32(userkey + 12);
    for (i = 0; i < 10; i++) { /* fixed 10 rounds for AES-128 */
        temp = rk[3];
        /* SubWord(RotWord(temp)): each mask keeps the byte lane of the
           T-table entry that holds plain S[x] (the lane weighted by 01). */
        rk[4] = rk[0] ^
            (Te2[(temp >> 16) & 0xff] & 0xff000000) ^
            (Te3[(temp >>  8) & 0xff] & 0x00ff0000) ^
            (Te0[(temp      ) & 0xff] & 0x0000ff00) ^
            (Te1[(temp >> 24)       ] & 0x000000ff) ^
            rcon[i];
        rk[5] = rk[1] ^ rk[4];
        rk[6] = rk[2] ^ rk[5];
        rk[7] = rk[3] ^ rk[6];
        rk += 4;
    }
}
Notice that the loop runs exactly 10 times (constant), and the S‑box substitution is performed using the same T‑table lookups that are used for encryption. The Te0‑Te3 tables are 1024‑byte each and combine SubBytes, ShiftRows, and MixColumns into a single lookup. The key schedule borrows these tables, which is efficient but also means that an attacker who can spy on table accesses during key expansion can recover the key. This has led to the development of dedicated constant‑time key‑expansion routines that use bit‑slicing or vector permute instructions.
Hardware‑Accelerated Key Schedule
Modern x86 processors with AES-NI (AES New Instructions) provide the AESKEYGENASSIST instruction, which computes the SubWord and RotWord steps and XORs in the round constant in hardware, without any table lookups. OpenSSL’s AES-NI assembly (generated from the perlasm sources under crypto/aes/asm/, such as aesni-x86_64.pl) uses it for fast key expansion. The instruction takes a 128-bit register holding part of the schedule plus an 8-bit immediate round constant and returns the substituted and rotated words; a few shuffles and XORs then combine that result with the previous round key to form the next one. Because the substitution happens inside the execution unit rather than through memory, its timing does not depend on the key bytes. This is the gold standard.
3. The S‑Box: Security’s Nonlinear Heart
The AES S‑box is an 8‑bit bijection designed to introduce nonlinearity into the cipher. It is constructed from two operations:
- Take the multiplicative inverse in GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x + 1. The element 0 is mapped to 0.
- Apply a fixed affine transformation (bit-wise matrix multiply and XOR with 0x63).
The result is a 16×16 substitution table. The S‑box is the only nonlinear component in AES; if an attacker can recover the exact values being looked up during encryption, they can deduce the key.
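The two-step construction can be checked by computing S-box entries directly. This is a minimal sketch with hypothetical function names; it uses a slow, branchy field multiplication, which is fine for generating a public table but is not how a production library would handle secret data.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1.
   Branches on bits of b, so use it only on public values. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a >> 7) * 0x1b));
        b >>= 1;
    }
    return p;
}

/* Inverse via a^254 = a^(-1); 0 maps to 0, as the S-box definition requires. */
static uint8_t gf_inv(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = gf_mul(r, a);
    return r;
}

/* Rotate an 8-bit value left by n bits (1 <= n <= 7). */
static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* S(x) = affine(inverse(x)). The affine matrix multiply reduces to
   XORing the inverse with four of its rotations, then adding 0x63. */
uint8_t aes_sbox_byte(uint8_t x)
{
    uint8_t b = gf_inv(x);
    return (uint8_t)(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63);
}
```

For example, the inverse of 0x53 is 0xca, and applying the affine step to 0xca gives 0xed, which matches the published table entry for 0x53.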
Table‑Based S‑box: The Classic Implementation
The simplest way to implement SubBytes is to precompute a 256‑byte array SBOX[256] and perform state[i] = SBOX[state[i]]. OpenSSL’s T‑table approach actually combines SubBytes with ShiftRows and MixColumns, but at its core each T‑table is a 256‑entry lookup of 32‑bit words. The first T‑table (Te0) for byte x contains the result of SubBytes(x) mixed with the MixColumns matrix multiplication. The encryption loop looks like:
/* s0..s3 are the four state columns; taking each byte from a
   different column is what implements ShiftRows. */
temp = Te0[(s0 >> 24)       ] ^
       Te1[(s1 >> 16) & 0xff] ^
       Te2[(s2 >>  8) & 0xff] ^
       Te3[(s3      ) & 0xff] ^
       rk[j];
This is extremely fast: four table lookups per column per round. For AES-128 with 10 rounds and 4 columns, that’s 160 table lookups per block. On modern CPUs with fast caches, this yields throughput on the order of a hundred or more MB/s per core.
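To see what the T-tables precompute, it helps to write MixColumns out for a single column without any tables. The sketch below is illustrative only (the function names are not OpenSSL’s). Each output byte is a GF(2^8) combination of the inputs with weights 02, 03, 01, 01; a table such as Te0 stores those weighted multiples of S[x] packed into one 32-bit word.

```c
#include <stdint.h>

/* Multiply by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t mul2(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a >> 7) * 0x1b));
}

/* Multiply by 03, which is (02 * a) XOR a. */
static uint8_t mul3(uint8_t a)
{
    return (uint8_t)(mul2(a) ^ a);
}

/* Apply MixColumns to one 4-byte column in place. */
void aes_mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(mul2(a0) ^ mul3(a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ mul2(a1) ^ mul3(a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ mul2(a2) ^ mul3(a3));
    c[3] = (uint8_t)(mul3(a0) ^ a1 ^ a2 ^ mul2(a3));
}
```

With the commonly used test column db 13 53 45, this produces 8e 4d a1 bc. A T-table implementation gets the same result by XORing four precomputed words instead of doing the field arithmetic at run time.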
The Cache‑Timing Problem
A table lookup accesses memory at an address derived from a secret byte. On most processors, the time to load a value from main memory depends on whether the address is cached. If the attacker can observe encryption times for many plaintext blocks (or trigger cache evictions), they can infer which table indices were accessed. Because the index is the secret state byte (which depends on both plaintext and key), the attacker can eventually recover the key.
The classic attacks by Bernstein (2005) and by Osvik, Shamir, and Tromer (2006) exploited the T-table accesses in OpenSSL’s AES implementation. In outline, such an attack proceeds as follows:
- The attacker controls the plaintext or ciphertext.
- They measure encryption time with high precision.
- By varying input bytes and detecting timing differences, they deduce which cache lines were hit or missed.
- After thousands of measurements, they recover the full round key.
OpenSSL’s T-tables are 1024 bytes each (256 × 4 bytes), so each one spans 16 cache lines of 64 bytes (or 32 lines on a 32-byte cache line architecture). With 4-byte entries, a 64-byte line holds 16 consecutive entries, so a secret byte b maps to cache line b >> 4 (b >> 3 for 32-byte lines), assuming the table is line-aligned. A single observation therefore reveals only the upper 4 bits of each secret byte (upper 5 bits with 32-byte lines), but combining many lookups across rounds lets an attacker recover the entire key.
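The mapping from a secret index to a cache line is simple arithmetic. The helper below is a hypothetical illustration that assumes the table starts on a cache-line boundary; a real table may be offset, which shifts which entries share a line but not how many do.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the cache line, relative to a line-aligned table base,
   that is touched when reading table[b]. */
unsigned table_cache_line(uint8_t b, size_t entry_size, size_t line_size)
{
    return (unsigned)(((size_t)b * entry_size) / line_size);
}
```

With 4-byte entries and 64-byte lines, indices 0x00 through 0x0f all land on line 0, 0x10 lands on line 1, and 0xff lands on line 15. An attacker who learns the line number learns b >> 4 and nothing about the low four bits.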
Mitigations: Constant‑Time S‑box
To counter cache‑timing attacks, OpenSSL evolved from the T‑table approach to a bitsliced implementation. “Bitslicing” re‑encodes the AES state across many registers, so that each bit of the S‑box output is computed using logical operations (AND, XOR, OR) without any secret‑dependent memory accesses.
OpenSSL’s bitsliced implementation for x86-64 (the bsaes code under crypto/aes/asm/, which relies on SSSE3 and is used when AES-NI is not available) processes 8 blocks in parallel, with the bits of the state spread across eight 128-bit registers so that each register holds one bit position of every byte. The S-box is computed using a series of logical operations based on the finite field inversion formula. This approach is fully data-independent and thus immune to cache-timing attacks. The cost is that it only pays off when several blocks can be processed at once, which suits parallel modes such as CTR but not serial ones such as CBC encryption. Exact cycles-per-byte figures depend heavily on the CPU and should be measured rather than assumed.
OpenSSL has also added a vector-permute implementation (using the SSSE3 pshufb instruction) that computes the S-box in constant time with 16-byte tables held in registers. The pshufb instruction performs sixteen parallel lookups into a 16-entry table stored in a 128-bit register. By splitting each 8-bit input into a high nibble and a low nibble and doing the field inversion on nibbles, a handful of pshufb operations stand in for the 256-entry S-box without any secret-dependent memory access. This is constant-time and reasonably fast, though still well behind AES-NI.
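A much simpler (and slower) way to remove the secret-dependent address is to read every table entry and keep only the wanted one with a mask. This is not the bitsliced or vector-permute technique described above and is not presented as OpenSSL’s code; it is a minimal sketch, with a hypothetical function name, of what "data-independent access pattern" means in practice.

```c
#include <stddef.h>
#include <stdint.h>

/* Return table[secret_index] while touching every entry in the same order
   regardless of the index. Requires len <= 256 so that the XOR below
   always fits in 8 bits. */
uint8_t ct_lookup(const uint8_t *table, size_t len, uint8_t secret_index)
{
    uint8_t result = 0;
    for (size_t i = 0; i < len; i++) {
        /* diff is 0 only when i equals the secret index */
        uint32_t diff = (uint32_t)i ^ (uint32_t)secret_index;
        /* (diff - 1) wraps to all ones when diff == 0, so the byte
           extracted here is 0xff for a match and 0x00 otherwise */
        uint8_t mask = (uint8_t)(((diff - 1) >> 8) & 0xff);
        result |= (uint8_t)(table[i] & mask);
    }
    return result;
}
```

The loop does 256 loads per lookup instead of one, which is why production code prefers register-only approaches. A compiler is also free to transform such code, so real libraries inspect the generated assembly or write it by hand.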
OpenSSL’s Evolution
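OpenSSL 1.0.1 (2012) added AES-NI support along with bitsliced and vector-permute assembly, selected at run time through the EVP interface when the library is built with its assembly modules (the AES_ASM macro marks such builds). On x86-64 the relevant pieces are:
- aes_core.c (table-based C code, not constant-time)
- bsaes-x86_64.pl (bitsliced, constant-time, requires SSSE3)
- vpaes-x86_64.pl (vector-permute, constant-time, requires SSSE3)
- aesni-x86_64.pl (hardware accelerated, constant-time)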
In later versions, the EVP layer uses AES-NI on capable hardware and otherwise falls back to the bitsliced or vector-permute constant-time code where the CPU supports it. The old table-based path is still available for legacy platforms but is best avoided from a security perspective.
4. CTR Mode: Turning a Block Cipher into a Stream Cipher
Counter (CTR) mode is one of the most widely used modes of operation. It converts a block cipher into a stream cipher by encrypting successive counter values and XORing the output with the plaintext. The counter block must never repeat under the same key; it is typically composed of a nonce (unique per message) and a block counter incremented for each block.
Why CTR Mode Matters
CTR mode is popular because:
- It supports parallel encryption/decryption (each keystream block can be computed independently).
- Random access decryption is possible (decrypt any block without processing previous ones).
- No padding is required (the stream can be truncated to exact plaintext length).
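Older OpenSSL releases (the 1.0.x series) provide AES_ctr128_encrypt(), which encrypts an arbitrary-length buffer using a 128-bit counter. It was removed from the public API in OpenSSL 1.1.0, where the same functionality is reached through EVP (for example EVP_aes_128_ctr()). The implementation must handle partial blocks at the end and increment the counter correctly (big-endian), while the caller must ensure that the counter never repeats across different messages with the same key.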
How OpenSSL Implements CTR
In the 1.0.x source tree the entry point lives in crypto/aes/aes_ctr.c, a thin wrapper around the generic CRYPTO_ctr128_encrypt() in crypto/modes/ctr128.c. The function takes:
void AES_ctr128_encrypt(const unsigned char *in,
                        unsigned char *out,
                        size_t length,
                        const AES_KEY *key,
                        unsigned char counter[AES_BLOCK_SIZE],
                        unsigned char ecount_buf[AES_BLOCK_SIZE],
                        unsigned int *num);
The num parameter tracks how many bytes of the current keystream block have already been used (for partial updates). The ecount_buf holds the last encrypted keystream block. On each call, the function:
- If *num > 0, use the remaining bytes from ecount_buf first.
- For each full 16-byte block remaining, encrypt the current counter value using AES_encrypt() (or a direct assembly routine), XOR with the input, store to output, then increment the counter.
- For a final partial block, encrypt the counter, XOR only the necessary bytes, and keep the keystream block in ecount_buf with num updated so the next call can use the rest.
The counter increment is performed in big‑endian order: treat the 16‑byte counter as a 128‑bit big‑endian integer and add one. OpenSSL uses a straightforward loop that increments from the least significant byte (last byte of the array) and propagates carries. Since the counter is public (non‑secret), the timing of this loop does not leak secrets, but it must be correct to avoid counter overlap.
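The streaming state machine and the big-endian increment described above can be sketched in a few dozen lines. This is a simplified byte-at-a-time illustration, not OpenSSL’s source: the function names, the block_fn callback type, and toy_block_encrypt are all hypothetical, and toy_block_encrypt is a stand-in used only to exercise the mode logic. It is not a cipher and offers no security.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 16

/* Hypothetical callback: encrypt one 16-byte block under an opaque key. */
typedef void (*block_fn)(const uint8_t in[BLOCK], uint8_t out[BLOCK], const void *key);

/* Add one to a 16-byte big-endian counter, carrying from the last byte. */
void ctr128_inc(uint8_t counter[BLOCK])
{
    for (int i = BLOCK - 1; i >= 0; i--) {
        if (++counter[i] != 0)
            break; /* no carry out of this byte, so we are done */
    }
}

/* Streaming CTR: counter, ecount_buf, and *num persist across calls, so a
   message can be processed in chunks of any size. */
void ctr128_crypt(const uint8_t *in, uint8_t *out, size_t len,
                  const void *key, block_fn encrypt,
                  uint8_t counter[BLOCK], uint8_t ecount_buf[BLOCK],
                  unsigned int *num)
{
    unsigned int n = *num;
    for (size_t i = 0; i < len; i++) {
        if (n == 0) {
            /* keystream block used up: encrypt the counter, then advance it */
            encrypt(counter, ecount_buf, key);
            ctr128_inc(counter);
        }
        out[i] = in[i] ^ ecount_buf[n];
        n = (n + 1) % BLOCK;
    }
    *num = n;
}

/* Placeholder for AES so the example is self-contained. NOT a cipher. */
void toy_block_encrypt(const uint8_t in[BLOCK], uint8_t out[BLOCK], const void *key)
{
    const uint8_t *k = (const uint8_t *)key;
    for (int i = 0; i < BLOCK; i++)
        out[i] = (uint8_t)((in[i] ^ k[i]) * 167u + (uint8_t)(13 * i));
}
```

Because encryption and decryption are the same XOR, calling ctr128_crypt on the ciphertext with the same starting counter and *num = 0 returns the plaintext. Splitting a message across several calls gives the same output as one call, which is the purpose of ecount_buf and num.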
Parallelism and the Threat of Cache‑Timing in CTR
Because CTR mode encrypts only the counter values (which are known and often sequential), the plaintext never enters the block cipher, and an attacker usually cannot choose the cipher input the way chosen-plaintext cache-timing attacks assume. That makes such attacks harder to mount, but not impossible. In a table-based implementation each lookup index is a state byte, and from the first round on the state is the counter XORed with round-key material. The table lines touched therefore still depend on the secret key, and an attacker who knows the counter and can observe cache behavior may learn key bits. This is why constant-time implementations of the underlying block cipher are still necessary even in CTR mode.
Another subtlety: if the attacker can influence the counter value (e.g., through a chosen‑IV attack), they could force specific counter bytes into the cache lines and observe timing variations. In practice, nonces are usually random or sequential and not attacker‑controlled, but the principle of defense‑in‑depth recommends using constant‑time block encryption regardless of the mode.
Hardware Acceleration for CTR
With AES-NI, CTR mode becomes extremely efficient. The AESENC instruction performs one full encryption round on a block (AESENCLAST handles the final round), so AES-128 needs ten such instructions per block after the initial key XOR. For CTR, OpenSSL’s assembly routine (aesni_ctr32_encrypt_blocks) keeps several independent blocks in flight at once (up to 8 on x86-64) to hide the instruction latency. As the name suggests, it increments only the low 32 bits of the counter block, and the calling code handles wrap-around. The throughput on modern Intel processors exceeds 1 GB/s per core.
5. Beyond the Basics: Side‑Channel Defenses, Benchmarks, and Comparisons
Constant‑Time Programming in OpenSSL
The journey from the original T-table AES to today’s constant-time implementations was not smooth. OpenSSL had to balance performance with security. For many years, the portable C build relied on the fast but vulnerable table-based implementation. Constant-time behavior came from the assembly paths (AES-NI, bitsliced, or vector-permute), so a build configured with no-asm, or a CPU without the required instructions, ended up on the table-based code.
In current releases the choice is made at run time from the CPU’s capability bits rather than fixed at build time. If AES-NI is present, the constant-time hardware path is used. If not, a constant-time software implementation is selected where one exists (vector-permute or bitsliced code on x86 with SSSE3, bitsliced NEON code on ARM). The old T-table path is still compiled for platforms without such support and is what a build with assembly disabled falls back to.
Performance Numbers
Let’s compare approximate throughputs on a 3.0 GHz Intel Skylake core (measured with openssl speed -evp aes-128-ctr):
- T‑table (C only): ~150 MB/s
- Bitsliced constant‑time: ~80 MB/s
- Vector‑permute (SSSE3): ~280 MB/s
- AES‑NI: ~1.5 GB/s (single thread)
The AES-NI advantage is enormous: by these figures, roughly 5× faster than the best software constant-time method and 10× faster than the T-table code. This is why modern systems overwhelmingly prefer hardware AES.
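Throughput figures like these are easier to compare across machines when converted to cycles per byte. The helper below is a small sketch of that arithmetic (the function name is made up for this example); it assumes the clock frequency stays fixed during the measurement, which turbo boost can violate.

```c
/* Cycles per byte = clock frequency (Hz) / throughput (bytes per second). */
double cycles_per_byte(double clock_hz, double bytes_per_sec)
{
    return clock_hz / bytes_per_sec;
}
```

At 3.0 GHz, 1.5 GB/s corresponds to 2 cycles per byte and 150 MB/s to 20 cycles per byte.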
Comparisons with Other Libraries
- BoringSSL (Google’s fork of OpenSSL) has removed all non‑constant‑time table‑based AES. It always uses AES‑NI or a bitsliced fallback.
- libsodium uses ChaCha20 instead of AES by default, but its AES implementation (when available) also uses hardware acceleration.
- WolfSSL offers a small footprint and constant‑time AES using bitslicing.
OpenSSL’s advantage is its maturity and pervasive adoption, but its legacy of vulnerable table‑based implementations means that older versions (pre‑1.1.0) should be avoided for security‑critical applications.
Conclusion: The Unfinished Journey
OpenSSL’s AES engine is a fascinating case study in the evolution of secure software engineering. From the early days of naive table‑based implementations to the modern era of hardware‑assisted constant‑time cryptography, each iteration reflects a response to a discovered vulnerability. The key schedule, once a simple loop with no side‑channel considerations, is now crafted to be data‑independent. The S‑box, once a single 256‑byte array that leaked every key through your CPU cache, is now computed in registers using bitslicing or vector permutations. CTR mode, while inherently less exposed, still relies on a secure block cipher underneath.
Yet the game is not over. New side-channel attacks continue to emerge: branch prediction attacks (Spectre), power analysis, and electromagnetic leakage. OpenSSL must constantly adapt, and its code base now carries shared constant-time helpers (branch-free selection and comparison primitives) that other parts of the library build on.
For the security engineer, the lesson is clear: never trust a cryptographic implementation simply because it is widely used. Peek under the hood. Understand the trade‑offs. And when possible, let the hardware do the heavy lifting with AES‑NI, but always ensure a constant‑time fallback is in place. The quiet hum of the data center deserves nothing less.