The Performance Of Attention Mechanisms In Transformers: Self Attention Vs. Multi Headed With Flashattention Optimization

2021-05-01 · Leonardo Benicio

Summary

A comprehensive technical exploration of the performance of attention mechanisms in transformers: self attention vs. multi headed with flashattention optimization, covering key concepts, practical implementations, and real-world applications.

Contents

The Hidden Cost of Intelligence: Understanding Attention’s Computational Burden

Introduction: The Promise and the Price

In November 2022, OpenAI released ChatGPT, a chatbot that captured the world’s imagination. Behind its eerily fluent conversations lay a quiet revolution in machine learning: the Transformer architecture. Introduced in the landmark 2017 paper “Attention Is All You Need,” the Transformer had already become the backbone of natural language processing (NLP), enabling models like BERT for understanding and GPT for generation. By 2023, Transformers had conquered not just language but also computer vision (ViT), protein folding (AlphaFold2), and even code generation (GitHub Copilot). Yet for every triumph, a shadow grows: the cost of the core mechanism that gives Transformers their power—attention—is staggering.

Training GPT-3, with its 175 billion parameters, consumed thousands of GPU-days and cost an estimated $4.6 million. A significant portion of that compute is eaten not by the feed-forward layers or embeddings, but by the attention mechanism itself. As models grow larger and contexts longer (GPT-4 Turbo supports 128K tokens; Claude 3 operates up to 200K), the computational and memory demands of attention threaten to become an insurmountable bottleneck. The conversation has shifted from “How can we make attention better?” to “How can we make attention feasible?”

This is not just an academic curiosity. The economics of AI depend on it. Startups and research labs alike are racing to reduce inference costs, democratize access to large models, and enable new use cases—such as processing entire medical imaging volumes or long legal documents—that were previously out of reach. Understanding the performance characteristics of attention mechanisms is no longer a niche concern; it is central to the future of deep learning.

At the heart of every Transformer lies the self-attention layer. It computes a weighted sum of input tokens, where each weight is derived from the similarity between pairs of tokens. This enables the model to dynamically focus on relevant parts of the input, regardless of distance—a capability that recurrent and convolutional architectures struggled with. The result is a paradigm shift in sequence modeling, but one that carries a heavy price tag: quadratic complexity in sequence length.

In this post, we will dissect the computational anatomy of attention, explore why its O(n²) cost is both a feature and a bug, and survey the landscape of optimizations that are making large-scale Transformers viable. We will walk through concrete examples, including PyTorch code snippets, performance benchmarks, and analytical breakdowns of FLOPs and memory. By the end, you will understand not only what makes attention so expensive, but also how modern systems are bending—or breaking—the quadratic barrier.

1. The Transformer and Its Central Mechanism

1.1 A Brief Architectural Primer

Before diving into performance, let’s recall the standard Transformer architecture. A Transformer block consists of two main sub-layers: a multi-head self-attention layer and a position-wise feed-forward network (FFN), each with residual connections and layer normalization. The input is a sequence of tokens, each embedded into a d-dimensional vector. Let’s denote the input matrix X ∈ ℝⁿˣᵈ, where n is the sequence length and d is the model dimension.

The self-attention layer computes outputs as follows:

Linear projections: Compute queries Q, keys K, and values V from X using learned weight matrices:
Q = XW_Q, K = XW_K, V = XW_V
where W_Q, W_K ∈ ℝᵈˣᵈₖ and W_V ∈ ℝᵈˣᵈᵥ. Typically, d_k = d_v = d / h for h heads, but for simplicity we often set d_k = d_v = d.
Scaled dot-product attention: Compute attention scores as:
Attention(Q, K, V) = softmax( QK^T / √d_k ) V
The matrix product S = QK^T ∈ ℝⁿˣⁿ gives pairwise similarity scores. The softmax is applied row-wise, and then the attention weights A = softmax( S / √d_k ) are used to aggregate values.
Multi-head attention: The process is repeated h times with different learnable projections, and the outputs are concatenated and linearly projected again.

The complexity of the core operation is dominated by the matrix multiplication QK^T and AV. Both are O(n²·d) operations. For simplicity, if we assume d is constant (e.g., 768 for BERT-base, 12288 for GPT-3), the cost scales quadratically with sequence length n.

1.2 Why Quadratic? A Detailed Complexity Analysis

Let’s break down the FLOPs (floating-point operations) for a single self-attention head. Assume sequences of length n, model dimension d, and batch size b = 1 for now. The steps:

Q, K, V projections: Each is a matrix multiplication: n×d times d×d → n×d. That’s 2·n·d² FLOPs per projection? Actually, a matrix multiplication of shape (n, d) × (d, d) requires 2·n·d·d = 2·n·d² FLOPs (multiplications + additions). So for three projections: 6·n·d² FLOPs.
QK^T: Multiply (n, d) × (d, n) → (n, n). Cost: 2·n·d·n = 2·n²·d FLOPs. This is the quadratic term.
Scaling and softmax: Scaling by 1/√d_k is O(n²) (element-wise). Softmax requires computing exponentials and sum over each row, O(n²) as well. Typically negligible compared to matmuls.
Attention weights × V: Multiply (n, n) × (n, d) → (n, d). Cost: 2·n·n·d = 2·n²·d FLOPs.
Output projection: Another (n, d) × (d, d) → 2·n·d² FLOPs.

Total FLOPs per head: about 6·n·d² + 4·n²·d. For multi-head (h heads), the total attention FLOPs are approximately h times that, but note that each head operates on dimension d*k = d/h. So per head: 6·n·(d/h)² + 4·n²·(d/h). Summing over h heads: total = 6·n·d²/h + 4·n²·d. Wait—the first term shrinks by factor h, while the second term remains because it depends on d but not on head dimension? Actually careful: For each head, the Q,K,V projections use d×d_k matrices, so FLOPs: n * d _ d_k for each? Let’s do precisely.

Standard implementation: For h heads, we project to h sets of queries, keys, values. Total parameters for Q: d × (h·d_k) but often we use a single projection to d·d and then reshape. But the computation is equivalent. Let’s assume each head has its own small projections: then total Q projection FLOPs: h * (2·n·d·d_k) = 2·n·d·(h·d_k) = 2·n·d·d (since h·d_k = d). So 2·n·d² for Q. Similarly for K and V: total 6·n·d². Good.

Then QK^T per head: (n, d_k) × (d_k, n) → 2·n²·d_k. Sum over h heads: 2·n²·d_k·h = 2·n²·d. Similarly for AV: 2·n²·d. So total attention matmul cost = 4·n²·d. Plus output projection: 2·n·d².

So overall self-attention FLOPs = 8·n·d² + 4·n²·d (including projections). The term 4·n²·d is the quadratic one.

For GPT-3 (d=12288, n=2048 typical training), n²·d = 2048² _ 12288 ≈ 4.194e6 _ 12288 ≈ 5.15e10 FLOPs per layer. With 96 layers, that’s about 4.94e12 FLOPs per forward pass. Training requires forward + backward (roughly 3x FLOPs), so about 1.5e13 FLOPs per token. For 300 billion tokens, total training FLOPs ~ 4.5e24 FLOPs. On A100 GPUs (312 TFLOPS theoretical), that’s about 1.44e7 seconds, or 166 days on 1000 GPUs. That matches reported costs.

But note: the feed-forward layers also contribute heavily. In GPT-3, each FFN has d_ff = 4d = 49152, so each FFN FLOPs: 2·n·d·d_ff + 2·n·d_ff·d = 4·n·d·d_ff = 4·n·d·4d = 16·n·d². So attention is only 4·n²·d out of total (8·n·d² + 4·n²·d + 16·n·d²) = 24·n·d² + 4·n²·d. For n=2048, n² vs n·d: n·d = 204812288 ≈ 2.5e7; n² = 4.2e6. So n²·d term (5.15e10) is about 2x larger than n·d² term (24204812288² ≈ 242048*1.51e8 ≈ 7.4e12? Wait recalc: d² = 1.51e8, n·d² = 2048 * 1.51e8 = 3.09e11, times 24 = 7.42e12. Actually n²·d = 5.15e10, so the n·d² term is ~144 times larger. My earlier numbers are off. Let’s do proper:

d=12288, d² ≈ 1.51e8. n=2048 => n·d² = 2048 _ 1.51e8 = 3.09e11. Multiply by 24 => 7.42e12 FLOPs for projections+FFN. The attention matmul n²·d = 4.19e6 _ 12288 = 5.15e10. So attention quadratic term is only about 0.7% of total? That seems too small. Did I miss something? Actually, the FFN FLOPs: 2·n·d·d*ff (forward and backward?) But we are counting forward only. Standard forward for a linear layer: (n, d) × (d, d_ff) requires 2·n·d·d_ff FLOPs. So for FFN: 2·n·d·d_ff + 2·n·d_ff·d = 4·n·d·d_ff = 4·n·d·4d = 16·n·d² = 16 * 3.09e11 = 4.94e12. So total forward per layer: projections (8·n·d²=2.47e12) + attention matmul (4·n²·d=5.15e10) + FFN (4.94e12) ≈ 7.46e12. So attention quadratic term is only ~0.7% of forward FLOPs! That seems surprising because we often hear that attention is the bottleneck. However, note: in inference, the bottleneck is often memory bandwidth, not FLOPs. Also, for longer sequences, the quadratic term dominates. For n=128K, n²·d is huge: (1.28e5)² _ 12288 = 1.638e10 _ 12288 ≈ 2.01e14 FLOPs, while n·d² = 1.28e5 _ 1.51e8 = 1.93e13. So now attention quadratic term is 10x larger than the rest. So for long context, attention is the dominant cost. For short contexts, other layers matter more.

Thus, the attention quadratic cost is primarily a problem for long sequences. And modern models are pushing contexts to ever-increasing lengths.

1.3 Memory Consumption

Beyond FLOPs, attention consumes enormous memory. The attention score matrix S = QK^T has shape (n, n). For n=128K, that requires 128K² × 2 bytes (if float16) = 16.4e9 × 2 = 32.8 GB for a single layer. With multiple heads (h=96), the memory requirement per layer would be h _ n² _ 2 bytes if stored separately, but in practice implementations often compute attention scores one head at a time or use memory-efficient algorithms. Still, the intermediate memory can be prohibitive.

Moreover, during training, we need to store attention weights for backpropagation (or recompute them with trade-offs). This leads to memory bottlenecks that limit batch size and sequence length.

2. The Quadratic Bottleneck in Practice

2.1 Growing Context Windows

The race for longer contexts is driven by applications: processing entire books, legal documents, codebases, or long-form videos. GPT-4 Turbo supports 128K tokens, Anthropic’s Claude 3 handles 200K, and Google’s Gemini 1.5 Pro claims up to 1 million tokens. Each increase in context length multiplies attention cost by n². For n=200K, the attention matrix alone (single head, float16) is 200K² * 2 = 80 GB. That’s more memory than most GPUs have (40-80 GB). Clearly, naive attention is impossible.

2.2 Real-World Inference Latency

Let’s consider a practical scenario: deploying a 70B-parameter model (e.g., LLaMA-2-70B) with 80 layers, d=8192, and h=64. We want to process a 32K-token input. At inference, we need to compute attention for each layer. Assuming we use KV caching (store keys and values from previous tokens), the quadratic part only appears for the current query when computing attention over all past keys. But for a prompt of 32K tokens, the prefill phase (processing all tokens in parallel) requires computing QK^T for entire sequence, which is O(n²). The memory for attention scores (if not optimized) is massive. Using FlashAttention, we avoid materializing the full matrix, but FLOPs remain.

Let’s estimate latency. On an A100 with 312 TFLOPS (FP16), the attention FLOPs for a single layer with n=32768, d=8192: 4·n²·d = 4 _ (32768²) _ 8192 = 4 _ 1.074e9 _ 8192 ≈ 3.52e13 FLOPs. That’s 113 TFLOPS if done in 1 second? Actually 3.52e13 FLOPs / 312e12 FLOPs/s = 0.113 seconds per layer. With 80 layers, that’s 9 seconds for attention alone, plus other layers. That’s too slow for interactive applications. However, in practice, FlashAttention uses tiling to reduce memory and achieve near-peak throughput, but the FLOPs are still there. For 32K, it might be manageable with batching. For 128K, it becomes prohibitive.

2.3 Case Study: Medical Imaging

Consider processing 3D medical images (e.g., CT scan volumes) as sequences of patches. A typical CT scan has 512x512x300 voxels, which if patchified into 16x16x16 cubes yields (512/16)^3 = 323218.75? Actually 512/16 = 32, 300/16 = 18.75, so about 323219 ≈ 19456 patches. Sequence length ~20K. With ViT-style model, attention cost quadratic. Even 20K is challenging. For higher resolution, it’s worse.

2.4 The Memory Bandwidth Wall

Even if we have enough FLOPs, memory bandwidth often becomes the bottleneck. In autoregressive generation (decoding), each step only computes attention between the new query and all cached keys/values. The cost per step is O(n·d) for the key-value multiplication? Actually, decoding uses a single query vector (1×d) and cached K (n×d) and V (n×d). The attention for that step: Q (1×d) × K^T (d×n) → (1×n) FLOPs: 2·n·d. Then multiply softmax weights (1×n) × V (n×d) → 2·n·d. So per decoding step: O(n·d), which is linear in n. However, loading the entire K and V from HBM to SRAM costs O(n·d) memory reads, which can be the bottleneck. For n=128K, d=8192, that’s 128K _ 8192 _ 2 bytes = 2.1 GB per layer. With 80 layers, that’s 168 GB per step—far exceeding GPU memory. Hence, practical implementations often use paged attention, offloading, or sharding across devices.

3. Measuring Attention Performance

3.1 Metrics and Benchmarks

To quantify attention performance, we use:

FLOPs: Theoretical floating-point operations (usually FLOPS = FLOPs per second).
Memory bandwidth utilization: Percentage of peak HBM bandwidth achieved.
Latency: Time to compute one forward pass (prefill) or per token (decode).
Throughput: Tokens per second during training or inference.

Hardware matters: A100 has 1.6 TB/s HBM bandwidth, 312 TFLOPS FP16, 624 TFLOPS with sparsity. H100 offers 3.35 TB/s bandwidth and 989 TFLOPS FP16.

3.2 Profiling Attention: A PyTorch Example

Let’s write a simple toy script to measure the time and FLOPs of scaled dot-product attention for various n.

import torch
import time
import math

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, V)

# For simplicity, batch=1, n=1024, d=64
n, d = 1024, 64
Q = torch.randn(1, n, d, device='cuda', dtype=torch.float16)
K = torch.randn(1, n, d, device='cuda', dtype=torch.float16)
V = torch.randn(1, n, d, device='cuda', dtype=torch.float16)

# Warmup
for _ in range(10):
    out = attention(Q, K, V)

torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    out = attention(Q, K, V)
torch.cuda.synchronize()
t1 = time.time()
avg_time = (t1 - t0) / 100
print(f"n={n}, avg time: {avg_time*1000:.3f} ms")

Running on an A100, we get approximately:

n=1024: ~0.15 ms
n=2048: ~0.5 ms
n=4096: ~2 ms
n=8192: ~8 ms

These times increase roughly quadratically. With FlashAttention, the times are improved but still quadratic.

3.3 Analytical FLOPs vs Actual Throughput

We can calculate theoretical min time: FLOPs / peak TFLOPS. For n=4096, d=64, FLOPs attention only (QK^T + AV) = 2·n²·d + 2·n²·d = 4·n²·d = 4 _ 16.78e6 _ 64 = 4.29e9 FLOPs. On A300? Actually A100 peak 312 TFLOPS, so min time = 4.29e9 / 312e12 = 1.38e-5 s = 0.0138 ms. But we observed 2 ms, which is 145x slower. Why? Because the operation is memory-bound for small d. The O(n²) matrix multiplication uses n x n matrix which is large, but the arithmetic intensity (FLOPs per byte) is low for small n? Actually, the matmul QK^T: (n,d) x (d,n) has arithmetic intensity: FLOPs / memory bytes = (2·n²·d) / (4·n·d + 4·n·n) = (2nd) / (4d + 4n). For n=4096, d=64, numerator ~ 2409664 = 524288, denominator ~ 464 + 44096 = 256 + 16384 = 16640, intensity ~31.5 FLOPs/byte. That’s decent, but the actual implementation in PyTorch uses a generic matmul kernel that may not be optimized for such shapes. Moreover, the softmax and scaling add overhead. With FlashAttention, the kernel fuses the steps and achieves much higher utilization. But still, for very long sequences, the quadratic memory access pattern becomes a bottleneck.

4. Optimization Techniques: Taming the Quadratic Dragon

Given the prohibitive cost of naive attention, a rich ecosystem of optimizations has emerged. They fall into several categories:

Algorithmic: Reduce complexity by approximating the attention matrix (sparse, linear, kernel methods).
System-level: Fuse operations to reduce memory traffic (FlashAttention), use KV caching, paged attention, quantization.
Architectural: Replace attention with cheaper mechanisms (linear recurrent units, state space models).

4.1 Sparse Attention

The idea: not all token pairs are equally important. We can sparsify the attention matrix by restricting each token to attend only to a subset of others. Examples:

Fixed patterns: In the Longformer, each token attends to local neighbors (sliding window) and a few global tokens. Complexity O(n·k) where k is window size.
Combination of patterns: BigBird uses sliding window, global tokens, and random attention. It proves that such a sparse pattern can approximate full attention with O(n) complexity.
Strided/ dilated: Sparse Transformers (Child et al.) use strided patterns.

These methods achieve linear or near-linear complexity but sacrifice model expressiveness. They are effective for tasks where local context dominates, but may fail for tasks requiring long-range dependencies like retrieving information from distant tokens.

Implementation in PyTorch (simplified sparse sliding window):

def sliding_window_attention(Q, K, V, window_size):
    n = Q.size(1)
    attn = torch.zeros(n, n, device=Q.device, dtype=Q.dtype)
    for i in range(n):
        start = max(0, i - window_size)
        end = min(n, i + window_size + 1)
        scores = Q[:, i:i+1] @ K[:, start:end].transpose(-2, -1) / math.sqrt(Q.size(-1))
        attn[:, i, start:end] = scores
    # softmax, etc.

But this naive loop is slow; optimized kernels exist in libraries like xformers (Facebook) that implement block-sparse attention.

4.2 Linear Attention

A more radical approach: replace the softmax attention with a kernel function that allows us to rewrite attention as (Q K^T) V = Q (K^T V) under certain conditions, reducing complexity to O(n d²) instead of O(n² d). However, softmax prevents this because of the exponential. Linear attention methods approximate softmax with a feature map φ such that exp(q·k) ≈ φ(q)^T φ(k). Then:

Attention = φ(Q) (φ(K)^T V) / (φ(Q) (φ(K)^T 1_n))

where 1_n is a vector of ones. This reduces to two O(n d²) operations (since φ(K)^T V is O(n d²) and φ(Q) times that is O(n d²)). The denominator is also O(n d²). So complexity becomes linear in n.

Examples: Linformer (projection to lower dimension), Performer (random feature maps using positive orthogonal random features), RFA (recurrent attention). Performer uses the property that softmax can be approximated by:

exp(q·k) ≈ E[φ(q)^T φ(k)] with φ(x) = (1/√m) * exp(ω^T x - ||x||^2/2) for random ω.

In practice, linear attention methods often underperform softmax for tasks requiring sharp attention, and the constant factor can be large.

4.3 FlashAttention: A System-Level Breakthrough

FlashAttention, introduced by Tri Dao et al. (2022), does not change the mathematical operation but makes it run much faster by reducing memory reads/writes. The key insight: the attention scores matrix is large but is only needed temporarily. Instead of materializing it in HBM (slow), FlashAttention computes attention in tiles that fit in fast SRAM on the GPU. It uses online softmax (to compute softmax in a streaming fashion) and recomputation during backward pass to avoid storing the full attention matrix. This results in:

2-4x speedup over naive attention for typical lengths.
Linear memory scaling in n (no quadratic memory overhead).
Improved numerical stability.

The algorithm works by splitting Q, K, V into blocks and computing attention incrementally. It requires careful kernel programming using CUDA.

FlashAttention has become the default in many modern libraries (e.g., Hugging Face transformers with attn_implementation="flash_attention_2"). It enables longer contexts without breaking memory, though FLOPs remain quadratic.

4.4 KV Caching and PagedAttention

During autoregressive decoding, the keys and values for previous tokens must be stored to avoid recomputing attention for the entire sequence each step. This KV cache grows linearly with sequence length and can become huge (hundreds of GB for long contexts). PagedAttention (used in vLLM) manages the KV cache using paged memory similar to virtual memory in operating systems. It allows efficient memory sharing across sequences and prevents fragmentation, enabling high throughput serving for long contexts.

4.5 Other Approaches

Attention Sink: Noting that initial tokens often receive disproportionate attention; some works exploit this to only store a few initial tokens.
Approximate attention with hashing: Reformer uses locality-sensitive hashing to group tokens and attend within groups.
Low-rank approximations: Linformer projects K and V to a lower dimension, reducing n to k.

4.6 Comparison Table

Method	Complexity	Memory (inference)	Expressiveness	Implementation maturity
Full Attention	O(n² d)	O(n²)	High	Everywhere
Sparse (window)	O(n k d)	O(n k)	Moderate	xformers, Longformer
Linear	O(n d²)	O(n d)	Lower	Performer, Linformer
FlashAttention	O(n² d)	O(n d)	High (exact)	PyTorch 2.x, HuggingFace
State Space	O(n d)	O(d)	Varies	Mamba, S4

5. Case Studies: From Theory to Practice

5.1 Long Document Understanding with RoBERTa

RoBERTa (base) uses full attention with max length 512. For longer legal documents (say 10,000 tokens), researchers either truncate or use a sparse variant. Longformer specifically designed for long documents reduces memory and runtime by ~10x. For a 10K document, full attention would require 100M elements per layer; Longformer with window 512 uses only 10K*512 + few global tokens ≈ 5.1M, a 20x reduction.

5.2 Training LLMs at Scale

Training GPT-3 required careful engineering: using model parallelism (tensor sharding across GPUs), pipeline parallelism, and data parallelism. But attention still contributed significantly to communication overhead because the QK^T operation requires gathering keys from all devices in tensor parallel groups. Advanced training frameworks (e.g., Megatron-LM) use specialized techniques like sequence parallelism to distribute the attention dimension across GPUs, but the quadratic cost remains.

5.3 Real-Time Speech Transcription

For streaming speech, models like Whisper process audio in chunks. The attention mechanism attends over the entire history, which grows linearly with time. For a 30-minute transcription, the sequence length can be tens of thousands of tokens. Using FlashAttention plus chunking makes it practical.

5.4 Protein Sequence Modeling

AlphaFold2 uses a custom attention architecture (Evoformer) that alternates between row and column attention, effectively reducing complexity. For proteins with thousands of residues, this is essential.

6. Beyond Attention: The Future of Sequence Modeling

The quadratic bottleneck has spurred a renaissance in sequence modeling. In 2023, the Mamba architecture (selective state space model) demonstrated competitive performance to Transformers on language tasks while scaling linearly with sequence length. Mamba uses a recurrent formulation that compresses context into a fixed-size state, avoiding attention altogether. It achieves fast inference and low memory cost. However, it may not match Transformers on tasks requiring precise token-level retrieval (e.g., copying long sequences). Hybrid models combining state space layers with attention are being explored.

Other directions include linear recurrent units (LRU) and RWKV (a linear attention variant). The field is moving towards architectures that offer the expressiveness of attention with sub-quadratic cost.

Hardware also evolves: NVIDIA’s H100 has Transformer Engine with support for FP8 and sparse attention. Future GPUs may include dedicated hardware for attention primitives.

Conclusion: The Cost of Intelligence is Falling

The attention mechanism is a double-edged sword: it gave us the most powerful models ever built, but its quadratic complexity threatens to cap further progress. However, the combination of algorithmic innovations (sparse and linear attention), system optimizations (FlashAttention, KV caching), and new architectures (state space models) is steadily bending the curve. We can now process contexts that were unthinkable a few years ago: 128K tokens, 200K, even millions. The hidden cost is not hidden anymore—but it is being paid down, one optimization at a time.

For practitioners, understanding the performance characteristics of attention is essential. Choosing the right attention variant for your problem—whether it’s full attention for short sequences, sparse for long documents, or linear for real-time inference—can mean the difference between a model that runs in seconds vs. one that needs a cluster. As the boundaries continue to expand, the lesson is clear: in deep learning, nothing is free, but with careful engineering, almost nothing is impossible.