The Johnson-Lindenstrauss Lemma and the Geometry of High-Dimensional Data

Explore the surprising geometry of high-dimensional spaces: the Johnson-Lindenstrauss lemma showing that random projections preserve pairwise distances, the concentration phenomena that make it work, and its profound applications in nearest-neighbor search, compressed sensing, and machine learning.
High-dimensional space is deeply weird. If you grew up with Euclidean geometry in two and three dimensions, almost every intuition you have is wrong in dimension one thousand. The volume of a sphere concentrates near its surface. Two random vectors are almost certainly nearly orthogonal. The Gaussian distribution is not a bell-shaped cloud but an impossibly thin shell. And — most remarkably for algorithm designers — you can project a set of points from a million dimensions down to a few thousand, or even a few hundred, while preserving all pairwise distances to within a tiny multiplicative error. This is the Johnson-Lindenstrauss lemma, and it is one of the most useful theorems in all of computer science.
The JL lemma states, roughly: given any set of (n) points in (\mathbb{R}^d), there exists a linear map into (\mathbb{R}^k) where (k = O(\varepsilon^{-2} \log n)) such that all pairwise distances are preserved within a factor of (1 \pm \varepsilon). The target dimension (k) depends logarithmically on (n) and not at all on (d). You can have a billion points in a billion dimensions, and the JL lemma says you can embed them into a few thousand dimensions while losing almost no geometric information.
This post develops the JL lemma from first principles. We will explore the counterintuitive geometry of high-dimensional spaces, prove the JL lemma via random projections (Gaussian, sparse, and subgaussian), connect it to the Restricted Isometry Property (RIP) in compressed sensing, discuss Locality-Sensitive Hashing (LSH) as an algorithmic application, survey practical uses in approximate nearest-neighbor search and randomized linear algebra, and reflect on what the lemma tells us about the nature of high-dimensional data — namely, that the curse of dimensionality has a surprising antidote.
1. The Weirdness of High Dimensions
Before we prove the JL lemma, we need to understand why it is plausible. High-dimensional Euclidean space behaves in ways that defy our 3D intuition.
1.1 Volume Concentrates at the Surface
Consider a sphere of radius (R) in (\mathbb{R}^d). Its volume is (V_d(R) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} R^d). The ratio of the volume of a shell of thickness (\varepsilon R) (between (R - \varepsilon R) and (R)) to the total volume is:
[ \frac{V_d(R) - V_d(R(1 - \varepsilon))}{V_d(R)} = 1 - (1 - \varepsilon)^d \approx 1 - e^{-\varepsilon d} ]
For large (d), this approaches 1 exponentially fast. Virtually all the volume of a high-dimensional sphere is in a thin shell just inside its surface. There is no “interior” in the familiar sense.
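A three-line numerical check of this formula (a minimal sketch; the shell thickness of 1% is an arbitrary choice):
import numpy as np
eps = 0.01
for d in [2, 10, 100, 1000, 10000]:
    print(f"d = {d:6d}: fraction of volume in the outer {eps:.0%} shell = {1 - (1 - eps) ** d:.4f}")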
1.2 Random Vectors Are Nearly Orthogonal
Draw two independent random vectors uniformly from the unit sphere in (\mathbb{R}^d). What is the distribution of their inner product (\langle \mathbf{u}, \mathbf{v} \rangle)?
By rotational symmetry, we can fix (\mathbf{u} = (1, 0, \ldots, 0)). Then (\langle \mathbf{u}, \mathbf{v} \rangle = v_1), the first coordinate of (\mathbf{v}). The distribution of (v_1) has mean 0 and variance (1/d). By concentration, (|v_1| \leq O(1/\sqrt{d})) with high probability. The angle between the vectors is (\arccos(v_1) \approx \pi/2 \pm O(1/\sqrt{d})).
In dimension (d = 10000), two random vectors have inner product at most about 0.01 with overwhelming probability. “Nearly orthogonal” is the default state in high dimensions. This is why random projections work: a random projection into a moderately high-dimensional subspace preserves the geometry of a set of points because the random basis vectors are all nearly orthogonal to each other and to the data.
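A quick simulation of this claim (a minimal sketch; the dimension and the sample count are arbitrary choices):
import numpy as np
rng = np.random.default_rng(0)
d = 10000
u = rng.standard_normal((200, d))
v = rng.standard_normal((200, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)   # uniform random unit vectors
v /= np.linalg.norm(v, axis=1, keepdims=True)
dots = np.abs(np.sum(u * v, axis=1))
print(f"mean |<u, v>| = {dots.mean():.4f}, predicted scale 1/sqrt(d) = {1 / np.sqrt(d):.4f}")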
1.3 The Gaussian Annulus Theorem
The standard Gaussian distribution (\mathcal{N}(0, I_d)) in (\mathbb{R}^d) has density proportional to (e^{-|\mathbf{x}|^2/2}). The distribution of the norm (|\mathbf{x}|) is the chi distribution with (d) degrees of freedom. Its mean is approximately (\sqrt{d}) and its variance is (O(1)). Most of the probability mass lies in an annulus of width (O(1)) at radius (\sqrt{d}).
For large (d), (\sqrt{d}) is much larger than the annulus width. The Gaussian distribution is essentially a uniform distribution over a thin spherical shell of radius (\sqrt{d}). This is the “Gaussian concentration” phenomenon that drives the JL lemma: a Gaussian random vector has almost all its norm concentrated at a specific radius, and its projection onto any fixed direction has variance 1, independent of dimension.
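The same concentration can be checked directly (a minimal sketch): the norm of a standard Gaussian vector hugs (\sqrt{d}) with an (O(1)) spread.
import numpy as np
rng = np.random.default_rng(0)
for d in [10, 100, 1000, 10000]:
    norms = np.linalg.norm(rng.standard_normal((500, d)), axis=1)
    print(f"d = {d:6d}: mean norm = {norms.mean():8.2f}  (sqrt(d) = {np.sqrt(d):8.2f}),  std = {norms.std():.3f}")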
1.4 Concentration of Measure
All these phenomena are manifestations of the concentration of measure principle: for a Lipschitz function on a high-dimensional space, the function’s value is tightly concentrated around its median. Formally, if (f: \mathbb{S}^{d-1} \to \mathbb{R}) is 1-Lipschitz, then for a random point (\mathbf{x}) on the sphere:
[ \Pr[|f(\mathbf{x}) - \text{median}(f)| \geq t] \leq 2e^{-(d-1)t^2/2} ]
This is a remarkably strong statement: the deviation probability decays exponentially in the dimension (d). In high dimensions, “almost all points are almost indistinguishable” from the perspective of any Lipschitz statistic. The JL lemma can be viewed as an application of concentration of measure to the function (f(\mathbf{x}) = |\mathbf{x}|) after an appropriate random projection.
2. The Johnson-Lindenstrauss Lemma: Statement and Significance
The JL lemma was proved by William B. Johnson and Joram Lindenstrauss in 1984, originally as a tool in the geometry of Banach spaces. It has since become a foundational result in computer science.
2.1 Formal Statement
Lemma (Johnson-Lindenstrauss, 1984). For any (0 < \varepsilon < 1) and any set of (n) points (X \subset \mathbb{R}^d), there exists a linear map (f: \mathbb{R}^d \to \mathbb{R}^k) with (k = O(\varepsilon^{-2} \log n)) such that for all (\mathbf{u}, \mathbf{v} \in X):
[ (1 - \varepsilon) |\mathbf{u} - \mathbf{v}|^2 \leq |f(\mathbf{u}) - f(\mathbf{v})|^2 \leq (1 + \varepsilon) |\mathbf{u} - \mathbf{v}|^2 ]
The map (f) can be taken as a random projection: (f(\mathbf{x}) = \frac{1}{\sqrt{k}} A \mathbf{x}) where (A) is a (k \times d) random matrix with appropriately chosen entries. The probability that a random projection succeeds for all (O(n^2)) pairs is at least (1 - 1/n).
2.2 What the Lemma Says (and Does Not Say)
The JL lemma makes three remarkable claims:
The target dimension is independent of the original dimension (d). It depends only on the number of points (n) and the desired accuracy (\varepsilon). For (n = 10^6) and (\varepsilon = 0.1), we need (k \approx 100 \ln(10^6) \approx 1380) dimensions, regardless of whether (d = 10^4) or (d = 10^9).
The embedding is linear. A random linear map suffices. This is computationally efficient: computing the embedding requires only a matrix multiplication, which can be done in (O(kd)) time (and faster with sparse or structured random matrices).
The embedding is oblivious to the data. The random matrix (A) can be chosen without looking at (X). The same random map works for any set of (n) points with high probability.
What the JL lemma does NOT say:
- It does not say the embedding preserves structure beyond pairwise distances (angles, volumes, manifolds may not be preserved).
- It does not say (k) can be reduced further — the (\Omega(\varepsilon^{-2} \log n)) bound is known to be tight, even for non-linear embeddings, in the natural parameter regime (Larsen & Nelson, 2017).
- It does not say the original dimension (d) is irrelevant in all respects — if (d < k), the lemma is trivially satisfied (we can use the identity map).
2.3 Why This Is Astonishing
Imagine you have one million documents, each represented as a TF-IDF vector in a vocabulary of one million terms. Each document is a point in (\mathbb{R}^{1000000}). Computing all pairwise distances to find nearest neighbors would be prohibitively expensive. The JL lemma says: project these vectors into about 1400 dimensions using a random matrix, and all pairwise distances are preserved to within 10% error. Nearest neighbors in the projected space are approximately the same as nearest neighbors in the original space. The computational savings are dramatic: after a one-time projection costing (O(ndk) \approx 1.4 \times 10^{15}) operations, each distance computation takes about 1400 operations instead of (10^6), so the all-pairs cost drops from (O(n^2 d) \approx 10^{18}) to (O(n^2 k) \approx 1.4 \times 10^{15}) — the difference between infeasible and merely expensive.
3. Proof via Gaussian Random Projections
We now prove the JL lemma using a Gaussian random matrix. This is the cleanest proof and illustrates all the essential ideas.
3.1 The Random Projection
Let (A) be a (k \times d) matrix where each entry is drawn independently from (\mathcal{N}(0, 1)). Define the random projection:
[ f(\mathbf{x}) = \frac{1}{\sqrt{k}} A \mathbf{x} ]
For any fixed vector (\mathbf{x} \in \mathbb{R}^d) (think of (\mathbf{x}) as the difference (\mathbf{u} - \mathbf{v}) between two points), we want to show that (|f(\mathbf{x})|^2 \approx |\mathbf{x}|^2) with high probability.
3.2 The Distribution of the Projected Norm
Let (\mathbf{x} \in \mathbb{R}^d) be a fixed non-zero vector. Consider (\mathbf{y} = A\mathbf{x} \in \mathbb{R}^k). Each coordinate (y_i = \sum_{j=1}^{d} A_{ij} x_j) is a linear combination of independent Gaussians. Since the (A_{ij} \sim \mathcal{N}(0, 1)) are independent:
[ y_i \sim \mathcal{N}\left(0, \sum_{j=1}^{d} x_j^2\right) = \mathcal{N}(0, |\mathbf{x}|^2) ]
By scaling, we can assume (|\mathbf{x}| = 1), so (y_i \sim \mathcal{N}(0, 1)) i.i.d. Then:
[ |f(\mathbf{x})|^2 = \frac{1}{k} \sum_{i=1}^{k} y_i^2 ]
This is the average of (k) independent (\chi^2_1) random variables (each (y_i^2 \sim \chi^2_1)). The expected value is (\mathbb{E}[|f(\mathbf{x})|^2] = \frac{1}{k} \cdot k \cdot 1 = 1), and the variance is (\text{Var}(|f(\mathbf{x})|^2) = \frac{1}{k^2} \cdot k \cdot 2 = 2/k).
3.3 Concentration via the Chi-Square Tail Bound
We need a tail bound for the average of (k) i.i.d. (\chi^2_1) variables. A standard Chernoff-style bound (or the Laurent-Massart lemma for chi-square) gives:
[ \Pr\left[\frac{1}{k} \sum_{i=1}^{k} y_i^2 \geq 1 + \varepsilon\right] \leq e^{-k(\varepsilon^2 - \varepsilon^3)/4} ] [ \Pr\left[\frac{1}{k} \sum_{i=1}^{k} y_i^2 \leq 1 - \varepsilon\right] \leq e^{-k(\varepsilon^2 - \varepsilon^3)/4} ]
For (\varepsilon \in (0, 1/2]), we can use the cleaner bound:
[ \Pr\left[||f(\mathbf{x})|^2 - 1| \geq \varepsilon\right] \leq 2e^{-k\varepsilon^2/8} ]
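Because (|f(\mathbf{x})|^2) for a unit vector is exactly a scaled chi-square variable, this bound can be checked by direct simulation (a minimal sketch; the values of (k), (\varepsilon), and the number of trials are arbitrary):
import numpy as np
rng = np.random.default_rng(0)
k, eps, trials = 2000, 0.1, 200000
sq_norms = rng.chisquare(k, size=trials) / k        # distribution of ||f(x)||^2 for a unit vector x
empirical = np.mean(np.abs(sq_norms - 1.0) >= eps)
bound = 2 * np.exp(-k * eps ** 2 / 8)
print(f"empirical tail probability = {empirical:.2e}, bound = {bound:.2e}")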
3.4 Union Bound over All Pairs
There are ({n \choose 2} < n^2/2) pairs of points. Fix (\mathbf{x} = \mathbf{u} - \mathbf{v}) for each pair. By the union bound, the probability that any pair violates the (\varepsilon)-distortion guarantee is:
[ \Pr[\text{failure}] \leq {n \choose 2} \cdot 2e^{-k\varepsilon^2/8} \leq n^2 e^{-k\varepsilon^2/8} ]
Set this to be at most some small constant (\delta) (say, (\delta = 1/2)). Solve for (k):
[ k \geq \frac{8}{\varepsilon^2} \ln\left(\frac{2n^2}{\delta}\right) = \frac{8}{\varepsilon^2} (2\ln n + \ln(2/\delta)) = O\left(\frac{\log n}{\varepsilon^2}\right) ]
With (k = \lceil 24 \ln n / \varepsilon^2 \rceil), the failure probability is at most (n^2 e^{-3 \ln n} = 1/n), so the projection preserves every pairwise distance with probability at least (1 - 1/n). This completes the proof.
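Packaged as a tiny helper (a minimal sketch; the constant 8 and the (2n^2/\delta) inside the logarithm come from the bound above, and the function name is our own):
import numpy as np
def jl_target_dimension(n, eps, delta=None):
    # k >= (8 / eps^2) * ln(2 n^2 / delta), with delta defaulting to 1/n as in the proof above
    if delta is None:
        delta = 1.0 / n
    return int(np.ceil(8 * np.log(2 * n ** 2 / delta) / eps ** 2))
print(jl_target_dimension(10 ** 6, 0.1))  # a few tens of thousands with these conservative constants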
3.5 Where the Logarithmic Dependence Comes From
The (\log n) factor arises from the union bound over (O(n^2)) pairs. Each pair’s distortion probability decays exponentially in (k\varepsilon^2). To make (n^2) such probabilities small simultaneously, we need (k\varepsilon^2 \gtrsim \log(n^2) = 2\log n). This is why the target dimension depends logarithmically on the number of points — and why, for large (n), the JL dimension is remarkably small compared to the original dimension.
4. Sparse and Subgaussian Random Projections
The Gaussian JL construction is elegant but requires (O(kd)) time and a dense random matrix. For large (d), we want faster embedding algorithms. Several variants achieve this.
4.1 The Achlioptas Sparse Projection
Dimitris Achlioptas (2003) showed that a sparse random matrix works: each entry (A_{ij}) is independently:
[ A_{ij} = \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases} ]
This matrix is sparse (only (1/3) of entries are non-zero on average) and avoids floating-point multiplication: the projection is computed with additions and subtractions and then scaled once by (\sqrt{3/k}) rather than (\sqrt{1/k}), which compensates for the zero entries so that (\mathbb{E}|f(\mathbf{x})|^2 = |\mathbf{x}|^2). The JL guarantee still holds with the same (k = O(\varepsilon^{-2} \log n)), though the constants are slightly larger.
import numpy as np
def achlioptas_projection(X, k):
    n, d = X.shape
    # Sparse random matrix: entries +1, 0, -1 with probabilities 1/6, 2/3, 1/6
    A = np.random.choice([-1, 0, 1], size=(k, d), p=[1/6, 2/3, 1/6])
    # Scale by sqrt(3/k): the factor 3 compensates for the entries' variance of 1/3
    return np.sqrt(3.0 / k) * (A @ X.T).T
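A quick sanity check that the sparse projection roughly preserves norms (a minimal sketch; the sizes are arbitrary):
X = np.random.randn(200, 5000)
X_proj = achlioptas_projection(X, k=1000)
ratios = np.linalg.norm(X_proj, axis=1) / np.linalg.norm(X, axis=1)
print(f"norm ratios: min = {ratios.min():.3f}, max = {ratios.max():.3f}")  # expect values close to 1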
4.2 Subgaussian Random Variables
The Gaussian proof relied on the fact that a linear combination of independent Gaussians is Gaussian, and the norm of a Gaussian vector concentrates. This generalizes to any subgaussian distribution — one whose tail decays at least as fast as a Gaussian. If each (A_{ij}) is an independent subgaussian random variable with mean 0 and variance 1, the same JL guarantee holds.
The key property is the Hanson-Wright inequality, which bounds the concentration of quadratic forms in subgaussian random variables. This allows us to show that (|A\mathbf{x}|^2) concentrates around its expectation (|\mathbf{x}|^2) for any subgaussian (A).
Common choices:
- Rademacher: (\pm 1) with probability (1/2) each (variance 1, bounded)
- Steinhaus: complex unit circle (for complex-valued projections)
- FJLT (Ailon & Chazelle, 2006): not an entry distribution but a structured construction — a random sign flip, a fast Hadamard transform, and a sparse random projection — achieving (O(d \log d + k \log n)) embedding time (more in Section 7.3).
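The Rademacher choice listed above is particularly easy to implement; here is a minimal sketch (the function name and default arguments are our own):
import numpy as np
def rademacher_projection(X, k, seed=None):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Entries +1/-1 with equal probability: mean 0, variance 1, subgaussian
    A = rng.choice([-1.0, 1.0], size=(k, d))
    return (X @ A.T) / np.sqrt(k)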
4.3 The Subspace Johnson-Lindenstrauss
The standard JL lemma embeds a finite set of points. The subspace JL lemma (Sarlos, 2006) generalizes this: a random projection with (k = O(\varepsilon^{-2} n)) — note the dependence is now linear in (n), since an (n)-dimensional subspace cannot be embedded into fewer than (n) dimensions — approximately preserves not just pairwise distances but the norms of all vectors in any (n)-dimensional subspace of (\mathbb{R}^d). More precisely, for any (n)-dimensional subspace (V \subset \mathbb{R}^d), with high probability:
[ (1 - \varepsilon) |\mathbf{x}| \leq |f(\mathbf{x})| \leq (1 + \varepsilon) |\mathbf{x}| \quad \forall \mathbf{x} \in V ]
The proof uses a union bound over an (\varepsilon)-net of the unit sphere in (V) (which has size roughly ((3/\varepsilon)^n)), the same technique used to establish the Restricted Isometry Property (RIP) in compressed sensing. The finite-point JL lemma gets away with the much smaller (k = O(\varepsilon^{-2} \log n)) because it only needs to preserve the (O(n^2)) difference vectors, not every vector in a subspace.
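A numerical way to see the subspace guarantee (a minimal sketch; the sizes are arbitrary): project an orthonormal basis of a random subspace and inspect the singular values of the result, which bound the distortion over the entire subspace.
import numpy as np
rng = np.random.default_rng(0)
d, n_sub, k = 5000, 20, 2000
U, _ = np.linalg.qr(rng.standard_normal((d, n_sub)))    # orthonormal basis of a random subspace
A = rng.standard_normal((k, d)) / np.sqrt(k)             # Gaussian JL projection
s = np.linalg.svd(A @ U, compute_uv=False)
# Every vector in the subspace has its norm scaled by a factor in [s.min(), s.max()]
print(f"singular values of the projected basis: min = {s.min():.3f}, max = {s.max():.3f}")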
5. The Restricted Isometry Property and Compressed Sensing
The JL lemma is deeply connected to compressed sensing, a theory that allows reconstructing sparse signals from far fewer measurements than the Nyquist-Shannon sampling theorem requires.
5.1 The RIP Definition
A matrix (A \in \mathbb{R}^{k \times d}) satisfies the Restricted Isometry Property of order (s) with constant (\delta_s) if for all (s)-sparse vectors (\mathbf{x} \in \mathbb{R}^d) (those with at most (s) non-zero entries):
[ (1 - \delta_s) |\mathbf{x}|^2 \leq |A\mathbf{x}|^2 \leq (1 + \delta_s) |\mathbf{x}|^2 ]
The RIP says: (A) approximately preserves the norm of all sparse vectors. This is a stronger condition than the JL lemma, which only preserves norms of vectors in a fixed finite set. But the proof technique is the same: a random matrix (Gaussian, subgaussian) satisfies RIP with (k = O(s \log(d/s))) measurements, and the proof uses concentration of measure plus a union bound over an (\varepsilon)-net of the set of all (s)-sparse unit vectors.
5.2 From RIP to Signal Recovery
If (A) satisfies RIP of order (2s), then any (s)-sparse signal (\mathbf{x}^*) can be recovered from measurements (\mathbf{y} = A\mathbf{x}^*) by solving the convex optimization problem:
[ \min_{\mathbf{x}} |\mathbf{x}|_1 \quad \text{subject to} \quad A\mathbf{x} = \mathbf{y} ]
This is the basis of compressed sensing (Candès, Romberg, & Tao, 2006; Donoho, 2006). The (\ell_1) minimization recovers the sparsest solution, and RIP guarantees this is exactly (\mathbf{x}^*).
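A minimal basis-pursuit demonstration via a generic LP solver (a sketch under our own choices of sizes and sparsity; it assumes SciPy's linprog is available and rewrites the (\ell_1) problem as a linear program over ((\mathbf{x}, \mathbf{t})) with (-\mathbf{t} \leq \mathbf{x} \leq \mathbf{t})):
import numpy as np
from scipy.optimize import linprog
rng = np.random.default_rng(0)
d, k, s = 200, 80, 5
A = rng.standard_normal((k, d)) / np.sqrt(k)
x_true = np.zeros(d)
x_true[rng.choice(d, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x_true
# min ||x||_1 subject to Ax = y, as an LP over the stacked variable (x, t)
c = np.concatenate([np.zeros(d), np.ones(d)])
A_eq = np.hstack([A, np.zeros((k, d))])
I = np.eye(d)
A_ub = np.vstack([np.hstack([I, -I]), np.hstack([-I, -I])])   # x - t <= 0 and -x - t <= 0
b_ub = np.zeros(2 * d)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
x_hat = res.x[:d]
print(f"recovery error: {np.linalg.norm(x_hat - x_true):.2e}")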
The connection to JL: both rely on the fact that random dimensionality reduction preserves the geometry of restricted sets of vectors. JL preserves distances among a finite set. RIP preserves norms over an infinite (but low-complexity) set of sparse vectors. Both are consequences of concentration of measure in high dimensions.
5.3 Why This Matters for Data Systems
Compressed sensing has revolutionized medical imaging (MRI), where reducing the number of measurements directly reduces scan time. It also appears in single-pixel cameras, radio astronomy, and seismic imaging. The intellectual thread — from the geometry of high-dimensional spaces, through JL and RIP, to practical signal acquisition — is one of the great success stories of applied mathematics in the 21st century.
6. Locality-Sensitive Hashing (LSH)
The JL lemma’s most direct algorithmic application is Locality-Sensitive Hashing, a technique for approximate nearest-neighbor search in high dimensions.
6.1 The Nearest-Neighbor Problem
Given a dataset of (n) points in (\mathbb{R}^d) and a query point (\mathbf{q}), find the closest point in the dataset. Exact nearest-neighbor search in high dimensions suffers from the curse of dimensionality: all known exact algorithms are essentially linear scans, requiring (O(nd)) time.
But for many applications, an approximate nearest neighbor suffices: find a point within a factor ((1 + \varepsilon)) of the true nearest distance. LSH solves this in sublinear time.
6.2 Hash Functions from Random Projections
An LSH family for Euclidean distance can be constructed from random projections:
[ h_{\mathbf{a}, b}(\mathbf{x}) = \left\lfloor \frac{\langle \mathbf{a}, \mathbf{x} \rangle + b}{w} \right\rfloor ]
where (\mathbf{a} \in \mathbb{R}^d) is a random Gaussian vector, (b \in [0, w)) is a random offset, and (w) is a “bucket width” parameter.
The key property: nearby points (small (|\mathbf{u} - \mathbf{v}|)) are likely to hash to the same bucket, while distant points are likely to hash to different buckets. Specifically:
[ \Pr[h(\mathbf{u}) = h(\mathbf{v})] = p(|\mathbf{u} - \mathbf{v}|) ]
where (p(\cdot)) is a decreasing function. The JL lemma guarantees that random projections preserve distances, which is why a distance-based hash function works.
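A direct implementation of this hash family (a minimal sketch; the bucket width (w) and the helper name are our own choices):
import numpy as np
def make_euclidean_lsh(d, w=4.0, seed=None):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(d)       # random Gaussian direction
    b = rng.uniform(0.0, w)          # random offset in [0, w)
    def h(x):
        return int(np.floor((a @ x + b) / w))
    return h
h = make_euclidean_lsh(d=128, seed=0)
u = np.random.randn(128)
v = u + 0.05 * np.random.randn(128)   # a nearby point
print(h(u), h(v))                     # nearby points usually land in the same bucket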
6.3 Amplification
A single hash function has limited discrimination power. LSH amplifies it by building (L) independent hash tables, each keyed by the concatenation of (m) independent hash functions:
- For each of (L) tables, compute a hash key by concatenating (m) random projection hashes.
- At query time, hash the query point into each table, retrieve all points in the matching buckets, and compute exact distances to those candidates.
- Return the closest candidate.
The parameters (L) and (m) control the trade-off between recall (fraction of true nearest neighbors found) and query time. This achieves sublinear query time (O(n^{\rho})) where (\rho < 1) depends on the approximation factor.
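A minimal multi-table index along these lines (a sketch, not a tuned implementation; the class name and the default values of (L), (m), and (w) are illustrative):
import numpy as np
from collections import defaultdict
class SimpleLSH:
    def __init__(self, d, L=10, m=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.A = rng.standard_normal((L, m, d))      # m random directions per table
        self.b = rng.uniform(0.0, w, size=(L, m))    # m random offsets per table
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []
    def _keys(self, x):
        # One key per table: the concatenation of m bucket indices
        return [tuple(np.floor((self.A[l] @ x + self.b[l]) / self.w).astype(int))
                for l in range(len(self.tables))]
    def add(self, x):
        idx = len(self.points)
        self.points.append(np.asarray(x))
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)
    def query(self, q):
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table[key])
        if not candidates:
            return None
        cands = list(candidates)
        dists = [np.linalg.norm(self.points[i] - q) for i in cands]
        return cands[int(np.argmin(dists))]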
6.4 Practical LSH Implementations
LSH powers production nearest-neighbor systems:
- Spotify’s Annoy: Uses random projection trees (a variant of LSH) to find similar songs based on audio feature vectors.
- Google’s ScaNN: accelerates nearest-neighbor search using quantized, compressed vector representations.
- FAISS (Facebook/Meta): Includes LSH as one of several indexing strategies for billion-scale similarity search.
The practical lesson: the JL lemma is not just a theoretical curiosity. It is the reason why approximate nearest-neighbor search is fast enough to power recommendation systems, image retrieval, and semantic search at internet scale.
7. Randomized Numerical Linear Algebra
The JL lemma has sparked a broader revolution: randomized numerical linear algebra (RandNLA). The core idea: random projections can dramatically accelerate matrix computations while providing provable error guarantees.
7.1 Randomized SVD
Given a matrix (M \in \mathbb{R}^{m \times n}), computing its singular value decomposition takes (O(mn \min(m, n))) time. For large matrices, this is prohibitive. Randomized SVD (Halko, Martinsson, & Tropp, 2011) uses a JL matrix to reduce the dimensionality before computing the SVD:
- Generate a random matrix (\Omega \in \mathbb{R}^{n \times k}) where (k) is slightly larger than the desired rank (r) (e.g., (k = r + 10)).
- Compute the sketch (Y = M\Omega \in \mathbb{R}^{m \times k}).
- Orthogonalize (Y) to get a basis (Q \in \mathbb{R}^{m \times k}) for the range of (M).
- Compute (B = Q^T M \in \mathbb{R}^{k \times n}) and SVD of (B) (small).
- Recover approximate singular vectors of (M) from those of (B).
The JL lemma (or more precisely, the subspace JL property) guarantees that (Q) captures the dominant singular directions of (M) with high probability. The computational savings are substantial: for a (10^6 \times 10^6) matrix of rank 100, randomized SVD can be 10-100x faster than deterministic methods.
import numpy as np
def randomized_svd(M, k, n_oversamples=10):
    m, n = M.shape
    # Step 1: Random projection (sketch of the column space)
    Omega = np.random.randn(n, k + n_oversamples)
    Y = M @ Omega                        # m x (k+p)
    # Step 2: Orthogonalize
    Q, _ = np.linalg.qr(Y)               # m x (k+p)
    # Step 3: Project M onto Q's space
    B = Q.T @ M                          # (k+p) x n
    # Step 4: SVD of the small matrix
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    # Step 5: Recover approximate singular vectors of M
    U = Q @ U_tilde                      # m x (k+p)
    return U[:, :k], S[:k], Vt[:k, :]
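A quick check on a synthetic low-rank matrix (a sketch; the sizes and rank are arbitrary):
rng = np.random.default_rng(0)
M = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 1500))   # rank-50 matrix
U, S, Vt = randomized_svd(M, k=50)
err = np.linalg.norm(M - (U * S) @ Vt) / np.linalg.norm(M)
print(f"relative reconstruction error: {err:.2e}")   # near machine precision for an exactly low-rank M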
7.2 Sketching in Streaming Algorithms
In the streaming model, data arrives as a sequence and we have limited memory (sublinear in the stream length). The JL lemma provides the foundation for sketching algorithms that maintain a compact summary (a “sketch”) of a high-dimensional vector or matrix under updates.
For example, the CountSketch (Charikar, Chen, & Farach-Colton, 2002) uses a sparse random projection to estimate the frequency of items in a stream with (O(\varepsilon^{-2} \log n)) space. The analysis directly invokes the JL lemma: the sketch preserves the (\ell_2) norm of the frequency vector (and thus the heavy hitters) with (1 \pm \varepsilon) error.
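A minimal CountSketch-style frequency estimator in the spirit of the description above (a sketch; the width, depth, and hash construction are illustrative choices of ours, not the paper's exact scheme):
import numpy as np
class CountSketch:
    def __init__(self, width=1024, depth=5, seed=0):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width))
        self.salt = rng.integers(0, 2 ** 31 - 1, size=depth)   # per-row hashing salt
    def _bucket_sign(self, row, item):
        # Derive a bucket index and a sign from one hash (simplified; ideally independent hashes)
        h = hash((int(self.salt[row]), item))
        return h % self.width, 1.0 if (h >> 20) & 1 else -1.0
    def update(self, item, count=1.0):
        for r in range(self.depth):
            b, s = self._bucket_sign(r, item)
            self.table[r, b] += s * count
    def estimate(self, item):
        ests = []
        for r in range(self.depth):
            b, s = self._bucket_sign(r, item)
            ests.append(s * self.table[r, b])
        return float(np.median(ests))
cs = CountSketch()
for item in ["a"] * 1000 + ["b"] * 10:
    cs.update(item)
print(cs.estimate("a"), cs.estimate("b"))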
7.3 The Broader Impact
Randomized linear algebra has transformed how we compute with large datasets. Libraries like scikit-learn use randomized SVD for PCA. Apache Spark’s MLlib uses randomized methods for dimensionality reduction. The intellectual thread from JL to practical linear algebra is a reminder that the best theoretical ideas often have the greatest practical impact — even if the path from lemma to library takes decades.
One particularly elegant application is the Fast Johnson-Lindenstrauss Transform (FJLT) of Ailon and Chazelle (2006), which achieves (O(d \log d + k \log n)) embedding time by composing a sparse random matrix with a structured matrix derived from the Hadamard transform. The key insight is that the Hadamard transform (applied after a random sign flip) “spreads out” the mass of a sparse vector before the random projection, ensuring that the projection captures information from all coordinates without needing a dense matrix. For high-dimensional sparse data (like text TF-IDF vectors), this reduces the embedding cost from (O(k d)) to roughly (O(d \log d)), which for (d = 10^6) and (k \approx 10^3) is a speedup of two to three orders of magnitude.
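A closely related construction that is easy to sketch is the subsampled randomized Hadamard transform: randomly flip signs, apply a fast Walsh-Hadamard transform, then keep a random subset of coordinates. Below is a minimal sketch of that idea (the function names are ours, the input length must be a power of two, and this is not the exact FJLT construction):
import numpy as np
def fwht(x):
    # Fast Walsh-Hadamard transform on a copy; len(x) must be a power of two
    x = x.astype(float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x
def srht(x, k, seed=0):
    rng = np.random.default_rng(seed)
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)      # random diagonal sign flip D
    z = fwht(signs * x) / np.sqrt(d)              # normalized Hadamard transform H
    idx = rng.choice(d, size=k, replace=False)    # uniform coordinate subsampling S
    return np.sqrt(d / k) * z[idx]
x = np.random.randn(1024)
y = srht(x, k=256)
print(np.linalg.norm(y) / np.linalg.norm(x))      # should be close to 1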
8. Open Problems and Modern Advances
The JL lemma continues to inspire new research, both in tightening the bounds and in extending the result to new settings.
8.1 Optimality of the JL Bound
The (\Omega(\varepsilon^{-2} \log n)) lower bound on the target dimension was proved by Larsen and Nelson (2017), resolving a long-standing open problem. Their proof exhibits a hard instance: a carefully chosen set of (n) points for which any embedding achieving distortion (1 \pm \varepsilon) on all pairs must use (\Omega(\varepsilon^{-2} \log n)) dimensions. The result is tight: for worst-case point sets, the JL lemma cannot be improved in its dependence on (\varepsilon) or (n).
Importantly, the Larsen-Nelson lower bound applies to arbitrary maps, not just linear ones, so no embedding — however non-linear — beats (\Omega(\varepsilon^{-2} \log n)) on worst-case point sets. The bound is worst-case, however: for structured datasets, non-linear methods like neural networks can sometimes achieve better empirical compression than random projections, because real data rarely resembles the hard instances. Characterizing which structural properties of data permit compression below the JL bound is a major open problem. A breakthrough here would have implications for the representational power of deep learning — it could establish fundamental limits on how much neural networks can compress data while preserving distances.
8.2 Database-Friendly Random Projections
The sparse construction of Achlioptas (2003) — whose paper is in fact titled “Database-friendly random projections” — improved computational efficiency, but each input coordinate still touches a constant fraction of the output coordinates. Later sparse JL constructions, such as the sparse JL transform of Kane and Nelson and CountSketch-style embeddings with only a handful of non-zero entries per column, push this further, so that updating one input coordinate touches only a few output coordinates. Such embeddings are used for streaming sketches and have also appeared in privacy-preserving data release.
A related thread is feature hashing (or the “hashing trick”), where high-dimensional categorical features are mapped to a lower-dimensional vector via a hash function. Feature hashing can be viewed as a sparse JL embedding where the random matrix has exactly one non-zero entry per column. The JL guarantee is weaker (more variance), but the computational simplicity makes it widely used in machine learning pipelines.
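A minimal version of the hashing trick (a sketch; the md5-based hash and the function name are our own choices, made for determinism):
import hashlib
import numpy as np
def feature_hash(features, k):
    # features: iterable of (name, value) pairs; returns a k-dimensional vector
    v = np.zeros(k)
    for name, value in features:
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        index = h % k
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0   # one non-zero per feature, with a pseudo-random sign
        v[index] += sign * value
    return v
print(feature_hash([("word:lemma", 2.0), ("word:random", 1.0)], k=16))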
8.3 The Johnson-Lindenstrauss Transform in Neural Networks
Recent work has explored connections between JL and neural network theory. A randomly initialized wide neural network at initialization approximates a kernel machine, and the JL lemma can be used to analyze the concentration of the kernel matrix. Moreover, the “information bottleneck” principle suggests that neural networks learn compressed representations, and the JL lemma provides a lower bound on the achievable compression while preserving task-relevant geometry.
An intriguing open question: do trained neural networks achieve compression factors approaching the JL bound? There is some evidence that the representations learned by deep networks for classification tasks have dimensionality closer to the JL bound than to the intrinsic dimensionality of the data, suggesting that neural networks implicitly exploit the kind of structure that JL formalizes.
9. DIY: Experiments to Build Intuition
The best way to internalize the JL lemma is to experiment with it. Here are three small experiments you can run in any environment with NumPy.
9.1 Verify the Distortion
import numpy as np
def jl_experiment(n=1000, d=10000, eps=0.25):
    # Generate random points
    X = np.random.randn(n, d)
    # Target dimension from the bound in Section 3.4 (conservative constants);
    # for very small eps this can exceed d, in which case projecting is pointless
    k = int(np.ceil(24 * np.log(n) / eps ** 2))
    print(f"Target dimension: k = {k} (original d = {d})")
    # Gaussian random projection
    A = np.random.randn(k, d) / np.sqrt(k)
    X_proj = (A @ X.T).T  # n x k
    # Compare pairwise distances over a sample of pairs
    max_distortion = 0.0
    for i in range(min(100, n)):
        for j in range(i + 1, min(100, n)):
            d_orig = np.linalg.norm(X[i] - X[j])
            d_proj = np.linalg.norm(X_proj[i] - X_proj[j])
            distortion = abs(d_proj / d_orig - 1.0)
            max_distortion = max(max_distortion, distortion)
    print(f"Maximum distortion (over sampled pairs): {max_distortion:.4f}")
    print(f"Is it ≤ eps ({eps})? {'Yes' if max_distortion <= eps else 'No'}")
9.2 The Logarithmic Dependence on n
Fix (d = 10^5), (\varepsilon = 0.2). Vary (n) from (10^2) to (10^5) and plot (k) vs. (\log n). Observe the linear relationship. Then fix (n = 10^4) and vary (\varepsilon) from (0.05) to (0.5); observe the (1/\varepsilon^2) scaling of (k).
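A sketch of this experiment that prints the values instead of plotting them (it uses the conservative bound from Section 3.4; the constants are illustrative):
import numpy as np
eps, delta = 0.2, 0.5
for n in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5]:
    k = int(np.ceil(8 * np.log(2 * n ** 2 / delta) / eps ** 2))
    print(f"n = {n:7d}, ln n = {np.log(n):5.2f}, k = {k}")   # k grows linearly in ln n
n = 10 ** 4
for eps in [0.05, 0.1, 0.2, 0.3, 0.5]:
    k = int(np.ceil(8 * np.log(2 * n ** 2 / delta) / eps ** 2))
    print(f"eps = {eps:4.2f}, k = {k}")                      # k grows like 1/eps^2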
9.3 Nearest-Neighbor Preservation
Generate (n = 5000) points in (d = 2000) dimensions. For each point, find its 10 nearest neighbors in the original space. Then project to various target dimensions (k \in {10, 50, 100, 200, 500}) and check what fraction of the 10 nearest neighbors are preserved (still in the top 10 in the projected space). You will observe that preservation degrades gracefully — (k = 100) might preserve 70% of neighbors, while (k = 500) preserves 95%.
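A sketch of this experiment (brute-force neighbor computation via the Gram-matrix identity; the sizes are reduced a little from the text to keep the run fast):
import numpy as np
rng = np.random.default_rng(0)
n, d, top = 2000, 2000, 10
X = rng.standard_normal((n, d))
def knn_sets(Y, top):
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b>
    G = Y @ Y.T
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2 * G
    np.fill_diagonal(D2, np.inf)
    return [set(np.argsort(row)[:top]) for row in D2]
true_nn = knn_sets(X, top)
for k in [10, 50, 100, 200, 500]:
    A = rng.standard_normal((k, d)) / np.sqrt(k)
    proj_nn = knn_sets(X @ A.T, top)
    overlap = np.mean([len(a & b) / top for a, b in zip(true_nn, proj_nn)])
    print(f"k = {k:4d}: fraction of true 10-NN preserved = {overlap:.2f}")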
9.4 Testing the Limits: When Does JL Break?
Push the JL lemma to its breaking point by setting (k) deliberately too small. For (n = 1000) points and (d = 1000), set (k = 5) (far below the JL bound, which is at least several hundred for (\varepsilon = 0.1)). Observe the maximum distortion — you should see values of 0.5 or larger, meaning distances are distorted by 50% or more. This vividly demonstrates that the JL bound is not overly conservative: when you go below it, the geometry genuinely degrades. The logarithmic dependence is generous but not infinite; you still need a meaningful number of projection dimensions.
10. Conclusion: Geometry as Computation
The Johnson-Lindenstrauss lemma is a theorem about geometry, but its deepest implications are computational. It tells us that high-dimensional data is far more compressible than naive counting would suggest. The “intrinsic dimension” of a dataset — measured not by the number of coordinates but by the minimal dimension needed to preserve its structure — can be exponentially smaller than the ambient dimension.
This has philosophical resonance. The world presents itself to us in high dimensions — every pixel in an image, every word in a vocabulary, every sensor in a network. But the meaningful structure is low-dimensional. The JL lemma is a mathematical articulation of the hope that we can find that low-dimensional structure efficiently, without understanding it, by simply projecting randomly enough times. That randomness should yield structure is one of the beautiful paradoxes of high-dimensional probability.
For the practitioner, the takeaways are concrete:
- When working with high-dimensional data, try random projection before complex dimensionality reduction. Gaussian random projection is fast, parameter-free (beyond choosing (k)), and comes with the JL guarantee. PCA, t-SNE, and UMAP are more powerful for visualization, but random projection is often sufficient for downstream tasks like classification and clustering.
- Set (k) using the JL formula as a starting point: (k = 4 \ln n / (\varepsilon^2/2 - \varepsilon^3/3)) (the Dasgupta-Gupta/Achlioptas bound) is a common choice for both Gaussian and sparse projections. These formulas are conservative; in practice, smaller (k) often works, but they give a theoretically justified upper bound.
- Remember that JL preserves Euclidean distances, not necessarily other structures. If your application depends on angles (cosine similarity), consider normalizing your vectors first (so Euclidean distance and cosine similarity are equivalent). If it depends on manifold structure, JL may not help.
- The magic is in the logarithm. The (\log n) factor means you can handle exponentially growing datasets with linearly growing projection dimension. One million points need only about twice the projection dimension of one thousand points.
The Johnson-Lindenstrauss lemma, like many of the deepest results in computer science, reveals that the apparent complexity of high-dimensional data is an illusion. Most dimensions are empty. A few random directions capture almost everything that matters. And that, in the end, is what makes computation in high dimensions possible at all.
As a final reflection: there is a parallel between the JL lemma and the concept of “attention” in transformer architectures. The JL lemma tells us that random projections into a logarithmic number of dimensions suffice to preserve all pairwise relationships among (n) points. Transformers, with their quadratic (O(n^2)) attention mechanism, are computationally expensive precisely because they refuse to project — they insist on computing all pairwise similarities exactly. The quest for efficient attention (linear transformers, performers, reformer) can be viewed as an attempt to apply JL-like insights to deep learning: approximate the full attention matrix using low-dimensional sketches. Whether JL-style random projections can replace quadratic attention without sacrificing model quality is an active and exciting research frontier. The geometry of high-dimensional spaces, it seems, still has things to teach us about making neural networks both smarter and faster.