Implementing The Rsa Cryptosystem: Key Generation, Encryption, And Signature With Chinese Remainder Theorem

2025-11-30 · Leonardo Benicio

Summary

A comprehensive technical exploration of implementing the rsa cryptosystem: key generation, encryption, and signature with chinese remainder theorem, covering key concepts, practical implementations, and real-world applications.

Here is the expanded blog post, structured as a comprehensive deep dive into the cryptosystem. The original introduction is preserved and significantly expanded upon to reach the requested depth and length.

The Quiet Giant: Performance, Security, and the RSA Cryptosystem

Part I: The Invisible Handshake

Introduction

In the vast, invisible architecture of the modern internet, few protocols are as foundational—and as quietly taken for granted—as the RSA cryptosystem. Every time you type https:// into your browser and a little padlock icon appears, you are witnessing the culmination of a mathematical feat that was, for the first thirty years of public-key cryptography, considered by many to be theoretically elegant but computationally impractical. That you are reading this text, likely over a secure connection that utilizes an asymmetric handshake, is a testament to the fact that RSA was not only practical but also brilliantly adaptable. It is the quiet giant upon which a significant portion of the digital world’s trust infrastructure is built.

But as with any giant, its strength comes with a heavy cost. RSA’s security is predicated on the computational intractability of factoring large composite numbers—specifically, the product of two large primes, typically 2048, 3072, or even 4096 bits in length. This security guarantee, however, is married to a significant performance penalty. The core operations of RSA—encryption, decryption, signing, and verification—are all built on modular exponentiation. When you are raising a large number (like your encrypted message) to a multi-thousand-bit exponent (like a private key) modulo another multi-thousand-bit number (the modulus), you are not performing simple arithmetic. You are asking your CPU to perform tens of millions of bit-level operations. In the early days of the web, a single SSL handshake could take several seconds. For a web server handling thousands of concurrent connections, this was not just a performance bottleneck; it was a potential deal-breaker for the entire model of secure e-commerce.

The question, then, is not whether RSA works, but how it works under the hood, and why it works at all in a world that demands millisecond latency. The answer lies in a century of number theory, a suite of brilliant algorithmic optimizations, and a keen understanding of the hardware that does the heavy lifting. This article is not another high-level explanation of public-key cryptography. Instead, we will journey into the absolute core of the RSA algorithm. We will explore the mechanical engineering of the cryptosystem: the modular exponentiation algorithms (specifically the Binary Method), the practical computational cost of large-number arithmetic, and the specific optimizations (like the Chinese Remainder Theorem) that transform RSA from a theoretical curiosity into a production-ready workhorse.

Moreover, we will confront the elephant in the room: the profound tension between performance and security. We will analyze how key size affects throughput, how side-channel attacks exploit the performance optimizations we put in place, and why the world is (slowly) migrating toward elliptic curve cryptography (ECC) not because RSA is broken, but because the cost of its security is becoming too heavy for a mobile-first, high-concurrency world.

Roadmap:

Section II: The Primal Foundation — A deep(er) dive into the number theory behind RSA, including why Euler’s theorem works and how key generation actually happens.
Section III: The Heavy Lifter — Modular Exponentiation — We dissect the binary method and the “square-and-multiply” algorithm, with a worked example for a 2048-bit key.
Section IV: The Arithmetic of Giants — We look at how a 2048-bit number is stored in memory and the cost of multiplication, addition, and reduction.
Section V: The Great Optimizer — The Chinese Remainder Theorem — A detailed breakdown of how CRT reduces decryption time by ~75% and why it is standard practice.
Section VI: The Expensive Padding — OAEP and PSS — We explore how RSA must be padded to be secure, and the performance cost of that padding.
Section VII: The Performance-Security Trade-off — A Deep Analysis — We analyze key size vs. throughput, the impact of side-channel attacks (timing, power analysis), and the role of hardware acceleration (AES-NI, ARM Crypto Extensions).
Section VIII: The Future — ECC, Post-Quantum, and the Slow Death of RSA — We conclude with a look at the alternatives and the inevitability of migration.

Section II: The Primal Foundation — A Deeper Dive into Number Theory

To understand the performance of RSA, we must first understand the structure of the numbers it manipulates. The security of RSA is built on the RSA problem: given a modulus ( N = p \times q ) (where ( p ) and ( q ) are large primes) and an exponent ( e ), it is computationally hard to compute the ( e )-th root of a ciphertext ( C ) modulo ( N ) without knowing the factorization of ( N ). This is the trapdoor.

But the mechanics of RSA rely on a different piece of math: Euler’s Theorem. This theorem states that for any integer ( a ) coprime to ( N ), we have:

[ a^{\phi(N)} \equiv 1 \pmod{N} ]

Where ( \phi(N) ) is Euler’s totient function. For an RSA modulus ( N = p \times q ), we know that ( \phi(N) = (p-1)(q-1) ).

Key Generation: The Cost of Discovery

The performance journey of RSA begins before any encryption happens. Key generation is computationally expensive, but it only happens once per key pair (or infrequently when renewing certificates). Generating a 2048-bit RSA key pair involves several steps:

Prime Generation: Find two random, large, probable primes ( p ) and ( q ).
- This is a probabilistic algorithm (e.g., Miller-Rabin test). The algorithm runs many iterations to achieve a high confidence level (e.g., ( 2^{-128} ) chance of error).
- For a 2048-bit modulus, ( p ) and ( q ) are each 1024 bits long. Finding a 1024-bit prime requires generating random 1024-bit numbers and testing them.
- The Prime Number Theorem suggests the density of primes near ( 2^{1024} ) is about ( 1/\ln(2^{1024}) \approx 1/710 ). So, roughly one in every 710 odd numbers is prime. However, the Miller-Rabin test has a failure probability of ( 1/4 ) per iteration for a composite number. To achieve a false positive rate of ( 2^{-80} ), we need many iterations (e.g., 40). This is a significant cost source.
Modulus Calculation: Compute ( N = p \times q ). This is trivial but requires a large-integer multiplication.
Exponent Selection: Choose a public exponent ( e ). Common choices are ( e = 65537 ) (( 2^{16} + 1 )) or ( e = 3 ). The exponent must be coprime to ( \phi(N) ). Performance-wise, ( e = 65537 ) is great because it has only two bits set to 1 (the 0th and the 16th bit). This makes modular exponentiation very fast (we will see why in the next section).
Private Exponent Calculation: Compute ( d = e^{-1} \mod \phi(N) ) using the Extended Euclidean Algorithm. This is fast but involves working with giant numbers.

The key takeaway: Key generation is expensive because of prime generation. For a server generating a new key pair, this might take tens to hundreds of milliseconds. It is negligible in a web server’s lifetime but is a critical consideration for embedded systems or IoT devices.

Section III: The Heavy Lifter — Modular Exponentiation

The core of RSA is the operation:

[ \text{Ciphertext } C = M^e \mod N ] [ \text{Plaintext } M = C^d \mod N ]

Where ( M ) is the message (as an integer ( < N )), ( e ) is the public exponent, ( d ) is the private exponent, and ( N ) is the modulus.

The Naïve Approach — A Disaster

A naïve developer might think: “I’ll just compute ( M^e ) and then take the modulus.” This is disastrous for two reasons:

Size: ( M^e ) is astronomically large. If ( M ) is a 2048-bit number and ( e = 65537 ), then ( M^e ) has roughly ( 2048 \times 65537 \approx 134, \text{million bits} ). That is about 16 MB of memory just for the intermediate value. It would take an impossible amount of time and memory.
Time: Multiplying a 2048-bit number by itself 65,536 times is computationally infeasible.

The Solution: The Binary Method (Square-and-Multiply)

Modular exponentiation is solved by a simple but powerful algorithm: the Binary Method, also known as Square-and-Multiply. The idea is based on the fact that any exponent ( e ) can be written in binary.

Let’s break it down for ( e = 65537 ). In binary, 65537 is ( 1_0000_0000_0000_0001_2 ). This is a 17-bit number with only two 1s (the most significant and the least significant bits).

The algorithm works like this:

Set ( result = 1 ).
For each bit of the exponent, from the most significant bit down to the least significant bit:
- Square: ( result = result^2 \mod N )
- Multiply: If the current bit of the exponent is 1, multiply the result by the base: ( result = result \times M \mod N ).

Why is this fast?

Instead of ( e ) multiplications, we only need ( \lfloor log_2(e) \rfloor ) squarings, plus the number of 1s in the exponent.
For ( e = 65537 ), we need 16 squarings and 2 multiplications (one for the first 1, one for the last 1).
For a 2048-bit exponent (like a private key ( d )), we need 2047 squarings and, on average, 1024 multiplications (since 50% of bits are 1).

A Worked Example (Small Numbers)

Let’s say ( M = 5 ), ( N = 7 ), and ( e = 3 ) (binary 11).

Start: result = 1
Bit 1 (MSB): Square: ( 1^2 = 1 ). Multiply (bit is 1): ( 1 \times 5 = 5 ).
Bit 0 (LSB): Square: ( 5^2 = 25 \equiv 4 \mod 7 ). Multiply (bit is 1): ( 4 \times 5 = 20 \equiv 6 \mod 7 ).

Result: ( 5^3 \mod 7 = 125 \mod 7 = 6 ). Correct!

The Real Performance Bottleneck

Now, the algorithm is ( O(\log e) ) multiplications, but each multiplication is between two numbers of size ( N ) (2048 bits). A 2048-bit multiplication is not trivial. On a modern 64-bit CPU, a 2048-bit number is stored as an array of 32 unsigned 64-bit integers. Multiplying two such numbers (without using special CPU instructions) requires ( 32 \times 32 = 1024 ) 64-bit multiplications, plus the addition of the carry bits.

This is where the “performance” story gets interesting. The CPU is doing a massive amount of mathematical heavy lifting in a very tight loop. For a private key operation (decryption or signing), a 2048-bit exponent means about 2047 squarings and ~1024 multiplications. Each of those squarings requires a full 1024-bit x 1024-bit multiplication (since you are squaring a 2048-bit number, but you only care about the lower 2048 bits). This is the cost.

Theoretical Throughput Estimate

Encryption (public key, ( e=65537 )): ~18 modular multiplications. Very fast.
Decryption (private key, 2048-bit ( d )): ~3071 modular multiplications. Much slower.
Ratio: Decryption is about 170x slower than encryption in terms of multiplication count. This asymmetry is fundamental to RSA.

Section IV: The Arithmetic of Giants — 2048-bit Numbers in a 64-bit World

We need to understand the “machine room” of RSA. A 2048-bit integer is a big number. Modern CPUs, however, operate on words of 64 bits (or 32 on older systems). This discrepancy forces software implementations to use bignum arithmetic (or big integer arithmetic).

Representation

A 2048-bit number is typically stored as an array of ( k ) limbs. On a 64-bit system, ( k = 2048 / 64 = 32 ) limbs.

Limb 0: Bits 0-63 (least significant)
Limb 1: Bits 64-127
Limb 2: Bits 128-191
…
Limb 31: Bits 1984-2047 (most significant)

Multiplication: The Core Cost

The most expensive operation is multiplication. For two 2048-bit numbers ( A ) and ( B ), the product ( A \times B ) is a 4096-bit number. We only care about the lower 2048 bits (since we are taking mod N), but computing that product fully is complex.

The classic schoolbook long multiplication algorithm for 32-limb numbers requires ( O(k^2) = O(32^2) = 1024 ) 64-bit multiply operations. But each multiply produces a 128-bit result (two 64-bit limbs: low and high). We must add these results together, handling overflows.

Karatsuba Multiplication: An Optimization

To improve performance, many libraries (like OpenSSL and GMP) use the Karatsuba algorithm. This is a divide-and-conquer approach that reduces the number of multiplications from ( O(n^2) ) to ( O(n^{\log_2 3}) \approx O(n^{1.585}) ).

For RSA-sized numbers (32 limbs), Karatsuba is significantly faster than schoolbook. For example:

Schoolbook: ( 32^2 = 1024 ) multiplications.
Karatsuba (recursive split): approximately ( 32^{1.585} \approx 243 ) multiplications (plus extra additions). This is a 4x improvement.

Montgomery Reduction: Avoiding Division

After multiplication, we have a 4096-bit intermediate product. We need ( result \mod N ). The naïve way is to do a division, but division is even slower than multiplication (often 2-10x slower).

This is where Montgomery Reduction comes in. This is a truly beautiful piece of low-level optimization. Instead of computing ( X \mod N ) directly, Montgomery multiplication transforms the numbers into a “Montgomery domain.” In this domain, multiplication and reduction are combined. The algorithm:

Convert ( A ) and ( B ) to Montgomery form: ( A’ = A \times R \mod N ), where ( R = 2^{2048} ).
Compute ( P’ = A’ \times B’ \times R^{-1} \mod N ). This is the Montgomery multiplication.
- The magic is that ( P’ ) is computed without using a division. Instead, it uses a series of pre-computed constants and additions.
Convert back: ( P = P’ \times R^{-1} \mod N ).

The key benefit: Montgomery multiplication is roughly as fast as a standard multiplication, but it integrates the reduction. For RSA decryption (many squarings), you convert to Montgomery form once, then perform ~3000 Montgomery multiplications, then convert back. This massively reduces the cost of the reduction step (which would otherwise be a division per multiplication).

Performance Reality

On a modern x86-64 CPU (e.g., Intel Ice Lake), a 2048-bit Montgomery multiplication for RSA can cost around 100-150 CPU cycles per limb? Actually, more accurately, a single 2048-bit multiply + Montgomery reduction can consume around 500-1000 cycles depending on the library and CPU microarchitecture. This places the cost of a single private key operation (decryption) at roughly 3000 * 800 cycles = 2.4 million cycles. At 3 GHz, that’s about 0.8 ms for a single decryption. A server can handle ~1250 decryptions per second on a single core. This is the reality of the “heavy lifter.”

Section V: The Great Optimizer — The Chinese Remainder Theorem (CRT)

We just established that decryption (private key operation) is ~170x slower than encryption. Private key operations are used for signature generation (a primary function of a web server) and for decryption (e.g., for a VPN or email client). This disparity was a major problem.

The solution, discovered soon after RSA’s publication, is the Chinese Remainder Theorem (CRT) . CRT is a classic result from number theory that allows us to split a big problem into two smaller problems, solve them quickly, and combine the results.

The Math

We know ( N = p \times q ). Instead of computing:

[ M = C^d \mod N ]

We compute:

Pre-compute (once during key generation): ( dp = d \mod (p-1) ), ( d_q = d \mod (q-1) ), and ( q{inv} = q^{-1} \mod p ).
Two smaller exponentiations:
- ( M_1 = C^{d_p} \mod p )
- ( M_2 = C^{d_q} \mod q )
Combine (Garner’s formula):
- ( h = q_{inv} \times (M_1 - M_2) \mod p )
- ( M = M_2 + h \times q )

Why is this faster?

The exponents ( d_p ) and ( d_q ) are half the bit-length of ( d ). For a 2048-bit ( d ), they are 1024-bit numbers.
The size of the modulus is also halved (1024 bits vs 2048 bits).

Speedup Calculation:

Let’s compute the theoretical speedup. The cost of exponentiation is roughly ( O(\log e) \times O(k^2) ) where ( k ) is the number of limbs.

Without CRT: ( k = 32 ) limbs (2048 bits). Exponentiation cost: ( \sim 2048 \times 32^2 ).
With CRT: We need two exponentiations, each with ( k = 16 ) limbs (1024 bits). Each exponentiation cost: ( \sim 1024 \times 16^2 ).

Total cost with CRT: ( 2 \times (1024 \times 256) = 524,288 ) limb operations.

Total cost without CRT: ( 2048 \times 1024 = 2,097,152 ) limb operations.

Speedup = 2,097,152 / 524,288 = 4x.

Yes, CRT provides a factor of 4 speedup for private key operations. In practice, due to the overhead of the combination step and the fact that the exponent is still roughly half the length, the real-world speedup is often closer to 3.5x to 4x. This was a critical optimization that made RSA server-side operations acceptable.

Security Warning: The CRT is fragile

CRT is powerful but introduces a security risk. If a hardware fault (e.g., a cosmic ray flipping a bit, or an intentional fault injection attack) occurs during either ( M_1 ) or ( M_2 ), the combined result ( M ) will be wrong. An attacker can then, with high probability, factor the modulus ( N ) by computing ( gcd(M - M’, N) ).

This is the Bellcore attack (or fault attack). To defend against this, implementations typically verify the signature after decryption by checking that ( M^e \mod N = C ). If the check fails, the module returns an error. This adds a small overhead (a fast public key exponentiation) but is essential for security.

Section VI: The Expensive Padding — OAEP and PSS

RSA without proper padding is completely insecure. If you simply take a message ( M ) and compute ( M^e \mod N ), you are vulnerable to:

Textbook RSA Attack: If ( e = 3 ) and ( M ) is small, ( M^e < N ), then no modular reduction occurs, and you can take the cube root directly.
Chosen Ciphertext Attacks: Attackers can manipulate ciphertexts to decrypt unintended messages.

The Solution: OAEP (Optimal Asymmetric Encryption Padding)

OAEP is a padding scheme designed to prevent these attacks. It works in three steps:

Mask Generation: The message is XORed with a mask derived from a random seed.
Masking the Seed: The seed is XORed with another mask derived from the masked message.
Concatenation: The masked seed and masked message are concatenated and then converted to an integer and encrypted.

The decryption side reverses this.

The Performance Cost of OAEP

While OAEP is secure, it adds overhead:

Random Number Generation: OAEP requires a cryptographically secure random nonce. Generating high-quality randomness (e.g., from /dev/urandom) is slow (microseconds per call).
Hash Function Calls: OAEP uses a hash function (e.g., SHA-256) multiple times. For a 2048-bit modulus, the hash might be called 2-3 times. Each hash call is fast (nanoseconds), but it adds to the total time.

Signatures: PSS (Probabilistic Signature Scheme)

For digital signatures, RSA uses PSS instead of OAEP. PSS is similarly complex, involving random nonces and mask generation functions. It ensures that a signature is unforgeable.

Why Padding Matters for Performance

The critical insight: Padding makes the message the same size as the modulus. Even if your plaintext is 10 bytes (a credit card number), OAEP pads it to 2048 bits (256 bytes) before encryption. This means that decryption always involves a full-length exponentiation. You cannot “short-circuit” the process if the message is small. This is a fundamental property of RSA: the data size is the modulus size.

Comparison to ECC

Elliptic Curve Cryptography (ECC) does not have this padding problem. For ECDH (key exchange), you exchange points on the curve, not padded messages. For ECDSA (signatures), the signature is based on the hash, not the entire message. ECC signatures are typically 512-1024 bits long, regardless of the underlying key size (256-bit key yields 512-bit signature). This makes ECC more bandwidth-efficient and, in some sense, faster for signing.

Section VII: The Performance-Security Trade-off — A Deep Analysis

We have established that RSA is inherently asymmetric in performance (encryption is fast, decryption is slow) and that key size is the primary driver of cost. But the relationship between key size and security is not linear.

Key Size vs. Security (NIST Estimates)

2048-bit RSA: Equivalent to ~112 bits of symmetric security. Estimated secure until ~2030.
3072-bit RSA: Equivalent to ~128 bits of symmetric security. Estimated secure until ~2058.
4096-bit RSA: Equivalent to ~192 bits of symmetric security. Estimated secure for the foreseeable future, but very expensive.

The Cost of Doubling Key Size

What happens when you go from 2048-bit keys to 4096-bit keys?

The exponent size doubles (2048 bits to 4096 bits).
The modulus size doubles.
The number of limbs doubles (32 to 64).

Using our cost model:

Exponentiation cost scales as ( O(log e) \times O(k^2) ).
Doubling the key size: ( k ) doubles, ( \log e ) doubles.
Cost becomes ( 2 \times (2^2) = 8x ) slower. In reality, due to the overhead of the arithmetic, it’s often a factor of 6-8x slower.

Practical Web Server Impact

If a 2048-bit RSA handshake takes 0.8 ms on a core, a 4096-bit handshake will take 5-6 ms. That is significant. For a server handling 10,000 new TLS connections per second (a high-traffic site), the CPU time spent on the handshake alone jumps from 8 seconds per second (8 cores) to ~50 seconds per second (50 cores). This is impossible. This is why most websites use 2048-bit (or 3072-bit) certificates and migrate to ECDSA for the server authentication.

Side-Channel Attacks: The Hidden Cost of Optimization

Performance optimizations can open side-channel vulnerabilities. We already mentioned the CRT fault attack. Another is the timing attack.

Constant-Time Comparison: When verifying a signature, you must compare the computed hash to the expected hash. If you use memcmp() or a loop that stops at the first mismatched byte, an attacker can measure the timing of the comparison and deduce the correct signature byte-by-byte.
Montgomery Multiplication and Branches: Early Montgomery multiplication implementations used conditional branches to handle carries and reductions. These branches could leak timing information.
Solution: All critical cryptographic code must be written in constant-time. Branches are replaced with bitwise operations. This is hard to write and can be slightly slower (e.g., by preventing the CPU from using branch prediction).

Hardware Acceleration

To mitigate performance issues, modern CPUs include specialized instructions:

x86-64: MULX, ADOX, ADCX (from the BMI2/ADX extensions) allow for more efficient multiplication and addition loops.
ARMv8-A: Crypto Extensions include dedicated instructions for RSA and ECC (e.g., SM3, SM4, AESE).
Intel QAT (QuickAssist Technology): Dedicated hardware accelerators for RSA and ECC. They offload the entire decryption/signing operation from the CPU. This is used in high-end network appliances and cloud servers (e.g., AWS Nitro).

The presence of hardware acceleration has allowed RSA to persist. Without it, the performance gap with ECC would be even more stark.

Section VIII: The Future — ECC, Post-Quantum, and the Slow Death of RSA

RSA is not dead, but its decline is inevitable. The performance asymmetry, the large key sizes, and the rise of mobile and IoT devices have made ECC the preferred choice for new systems.

The ECC Advantage

Key Size: A 256-bit ECC key provides equivalent security to a 3072-bit RSA key. The operations are on 256-bit numbers, not 3072-bit numbers. This results in dramatically faster key generation, signing, and key exchange (typically 2-5x faster for signing, 10-20x faster for key generation).
No Padding Overhead: ECDH key exchange does not require padding. ECDSA signatures are small.
Elliptic Curve Diffie-Hellman (ECDHE) is now the standard for TLS key exchange. It is fast, forward-secret, and resistant to many attacks.

Why RSA Still Exists

Despite ECC’s advantages, RSA is stubbornly persistent. Why?

Legacy: Billions of devices have embedded RSA public keys (in firmware, in smart cards). Replacing them is a multi-decade effort.
Compatibility: Older clients (e.g., Internet Explorer on Windows XP, some embedded browsers) do not support ECC certificates.
Trust in Complexity: Some cryptographers mistrust the complexity of ECC (the security of the discrete log on elliptic curves is less well-understood than factoring). RSA is “conservative.”
Digital Signatures: RSA still dominates for signing large batches of data (e.g., code signing, software updates) because verification (public exponent) is very fast.

The Death Knell: Post-Quantum Cryptography

The final nail in the RSA coffin will be the arrival of large-scale quantum computers. Shor’s algorithm will efficiently factor large numbers, breaking RSA completely. Post-quantum cryptographic algorithms (like CRYSTALS-Kyber for key exchange and CRYSTALS-Dilithium for signatures) will replace both RSA and ECC.

The Transition Timeline

2024-2030: ECC becomes dominant for new protocols. RSA is relegated to fallback and legacy.
2030-2040: Post-quantum standards are deployed. RSA support is disabled in new TLS versions.
2040+: RSA becomes a historical curiosity, like DES.

Conclusion: The Quiet Giant’s Legacy

RSA was the first practical public-key cryptosystem. It took a beautiful mathematical idea—the hardness of factoring—and turned it into the backbone of internet security. It survived the transition from 512-bit keys to 4096-bit keys, from software-only implementations to hardware accelerators. It endured attacks on its math, its implementation, and its side channels. It is a testament to the power of algorithmic optimization (CRT, Montgomery, Karatsuba) and sound engineering.

But the story of RSA is also a cautionary tale. It reminds us that performance and security are inextricably linked. A fast cryptosystem is not necessarily a secure one (textbook RSA). A secure one is not necessarily fast (4096-bit decryption). The future of cryptography will be driven by these trade-offs, and the best systems will be those that are mathematically sound, algorithmically efficient, and resilient against the next generation of threats.

When you next see that padlock icon, take a moment to appreciate the silent work happening inside your CPU: a thousand-line algorithm, a multi-million-cycle computation, a mathematical impossibility made practical. That is the legacy of the quiet giant.