Learning in High Dimension Always Amounts to Extrapolation
In this post I’ll attempt to shed some light on the conclusion that is drawn (in part1) from the above image:
We shouldn’t use interpolation/extrapolation in the way the terms are defined below when talking about generalization, because for high dimensional data deep learning models always have to extrapolate, regardless of the dimension of the underlying data manifold.
The image is taken from Balestriero, Pesenti, and LeCun (2021) and the goal of this post is to reproduce it. In the process of doing that, we hopefully get a better understanding of what the interpolation/extrapolation debate (here and here and here) is about. To be honest, the whole debate is not going to be any clearer after reading this post. David Hume describes quite nicely what is probably going on in this discussion in his “An Enquiry Concerning Human Understanding”:
It might reasonably be expected, in questions, which have been canvassed and disputed with great eagerness, since the first origin of science and philosophy, that the meaning of all terms, at least, should have been agreed upon among the disputants; and our enquiries, in the course of two thousand years, have been able to pass from words to the true and real subject of the controversy. For how easy may it seem to give exact definitions of the terms employed in reasoning, and make these definitions, not the mere sounds of words, the object of future scrutiny and examination? But if we consider the matter more narrowly, we shall be apt to draw a quite opposite conclusion. From this circumstance alone, that a controversy has been long kept on foot, and remains still undecided, we may presume, that there is some ambiguity in the expression, and that the disputants affix different ideas to the terms employed in the controversy.
Basically, it seems like a lot of arguments between people are a result of not properly defining what is discussed (and the last author on this paper probably agrees). I’ve never read Hume, but adding a fancy philosopher quote is a necessary step towards creating the illusion that I know what I’m talking about. In reality, the reason I’m writing this blogpost is that I didn’t understand the Learning in High Dimension paper at all on first reading. With this post I hope to get a better understanding of it, and share that understanding.
At the end of this post, we will hopefully know more about the following terms:
- The curse of dimensionality
- Convex hull
- Ambient dimension
- Intrinsic dimension / data manifold dimension
- Interpolation / extrapolation
The Key Idea
The key idea in this paper is that we shouldn’t be using interpolation and extrapolation (in the way the terms are defined in this paper) as indicators of generalization performance, because models are likely extrapolating. That is true even for data manifolds with moderate intrinsic dimensions. What counts is the dimensionality of the data representation, which is often much higher than the intrinsic dimension. The authors think researchers that equate generalization with extrapolation should develop appropriate geometrical definitions of extrapolation that actually align with generalization in the context of high dimensional data. So let’s dive in!
When are we interpolating?
In this section: Interpolation, Extrapolation, Convex Hull, Convex Combination, Linear Combination, Curse of Dimensionality
The definition of interpolation in the paper is the following:
Definition 1. Interpolation occurs for a sample \(\mathbf{x}\) whenever this sample belongs to the convex hull of a set of samples \(\mathbf{X} \triangleq \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\), if not, extrapolation occurs.
So what is the convex hull of a set of samples?
A vector \(\mathbf{x}\) lies within the convex hull of a set of samples \(\mathbf{x}_1, \dots, \mathbf{x}_N\) if we can write it as a convex combination of the samples:
\[\mathbf{x} = \lambda_1 \mathbf{x}_1 + \dots + \lambda_N \mathbf{x}_N\]Subject to the constraints: \(\lambda_i \geq 0\) and \(\sum_i \lambda_i = 1\). Let’s have a look at what that means in a dimension we can still visualize.
Everything within (or on) this square is a convex combination of the four samples, i.e. it lies in their convex hull; everything outside it but still in \(\mathbb{R}^2\) can only be written as a linear combination of the samples, obtained by relaxing the constraints on the \(\lambda_i\)’s. Note that we can easily calculate the probability of lying within the convex hull of the samples here if we assume that all values lie between zero and three. Let’s say we sample points uniformly between zero and three for both dimensions, then the probability that a new sample lies within the convex hull of the four ‘training’ samples is the area of the convex hull divided by the total area:
\[p(\mathbf{x} \in \text{Hull}(\mathbf{X})) = \frac{1}{9}\]In general, for a convex hull with area \(c\) in \(\mathbb{R}^2\) and data points in \([x_0, x_1]\), the probability of lying in the convex hull of the training samples is \(\frac{c}{(x_1 - x_0)^2}\).
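As a quick sanity check, here’s a tiny Monte-Carlo version of this calculation (assuming the four samples sit at the corners of \([1, 2] \times [1, 2]\); any unit-area hull inside \([0, 3]^2\) gives the same answer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed placement of the four 'training' samples: corners of [1, 2] x [1, 2].
new_points = rng.uniform(0.0, 3.0, size=(1_000_000, 2))

# For an axis-aligned square, hull membership reduces to a simple box check.
inside = np.all((new_points >= 1.0) & (new_points <= 2.0), axis=1)
print(inside.mean())  # ~0.111, i.e. roughly 1/9
```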
Now let’s see what happens if we move to three dimensions. We’ll stretch out the square from above into a cube and calculate the probability of lying within this cube.
This cube has a volume of one, and the probability of lying within this cube has become:
\[p(\mathbf{x} \in \text{Hull}(\mathbf{X})) = \frac{1}{27}\]In general, for a convex hull with volume \(c\) in \(\mathbb{R}^3\) and data points in \([x_0, x_1]\) the probability of lying in the convex hull is \(\frac{c}{(x_1 - x_0)^3}\). This provides some intuition as to why the probability of lying within the convex hull (i.e., being in interpolation regime) quickly goes down as the dimensionality of the problem goes up – exponentially quickly.
This intuition is formalized in the paper by a theorem describing the limiting behaviour of the probability of lying in the convex hull for new (i.i.d.) samples from a Gaussian.
Theorem 1 (Bárány and Füredi, 1988). Given a \(d\)-dimensional dataset \(\mathbf{X} \triangleq \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\) with i.i.d. samples \(\mathbf{x}_n \sim \mathcal{N}(0, \mathbb{I}_d)\) for all \(n\), the probability that a new sample \(\mathbf{x} \sim \mathcal{N}(0, \mathbb{I}_d)\) is in interpolation regime (recall Def. 1) has the following limiting behavior:
\[\lim_{d \rightarrow \infty} p(\mathbf{x} \in \text{Hull}(\mathbf{X})) = \begin{cases} 1 & \text{if } N > d^{-1}2^{d/2} \\ 0 & \text{if } N < d^{-1}2^{d/2} \end{cases}\]This theorem says that the probability that a new sample lies in the convex hull of \(N\) samples tends to one for increasing dimension if \(N > d^{-1}2^{d/2}\), but tends to zero if \(N\) is smaller than that. This means we need to increase the number of datapoints \(N\) exponentially with \(d\) if we want to have any chance of being in interpolation regime. See below for a plot of the minimal \(N\) needed per dimension \(d\) (\(N = d^{-1}2^{d/2}\)).
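A minimal sketch to generate such a plot (variable names are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

dims = np.arange(2, 31)
min_num_samples = 2.0 ** (dims / 2) / dims  # N = d^{-1} 2^{d/2}

plt.semilogy(dims, min_num_samples)
plt.xlabel("dimension $d$")
plt.ylabel("minimal $N$ to likely be in interpolation regime")
plt.title(r"$N = d^{-1}\,2^{d/2}$")
plt.show()
```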
In this section we learned what the convex hull of datapoints is, and what it means to be interpolating or extrapolating w.r.t. the convex hull of samples. We got some intuition about the curse of dimensionality, and might already see why it is pretty unlikely for a new sample to be in interpolation regime of real-world datasets. We are ready to start reproducing the first plot.
Going further, we should keep in mind that the following sentences are equivalent:
- The new sample lies in the convex hull of training samples.
- The new sample is within interpolation regime of the training samples.
- The new sample is a convex combination of training samples.
Reproducing the First Plot
In this section: Ambient Dimension
This image is a stack of six plots. Each plot shows the estimated probability that a new sample from a Gaussian with dimension \(d\) will lie in the convex hull of \(N\) training samples of this Gaussian. For example, for the bottom plot the estimated probability that a new sample \(\mathbf{x} \sim \mathcal{N}(0, \mathbb{I}_2)\) lies in the convex hull of the \(N = 10\) samples \(\mathbf{x}_1, \dots, \mathbf{x}_{10} \sim \mathcal{N}(0, \mathbb{I}_2)\) is roughly 50%. For a Gaussian with dimension 7 we already need more than \(10^{2.4} \approx 250\) training examples to have a roughly 50% chance of a new sample being in the convex hull of those training examples. The black line through the six plots denotes how \(N\) should change to keep the probability of a new sample being in the convex hull at 50% for increasing dimension \(d\). This shows that the necessary number of datapoints \(N\) increases exponentially with \(d\), because the line is straight on a logarithmic scale.
This dimension \(d\) of the Gaussian is called the ambient dimension. The ambient dimension of the data is the number of dimensions we use to represent it. If we take the well-known MNIST dataset as an example, the ambient dimension that is often used is the number of pixels: \(d = 28 \times 28 = 784\).
How to get the estimate of being in the convex hull?
We want to reproduce the above image, so how do we estimate the probability that a new sample lies in the convex hull?

The authors of the paper use a Monte-Carlo estimate. For each of the six plots above, we can do the following to get a single point on the line. Sample \(N\) points from a Gaussian with dimension \(d\) to form a training set of samples: \(\mathbf{x}_1, \dots, \mathbf{x}_{N}\). Then, sample a new point \(\mathbf{x} \sim \mathcal{N}(0, \mathbb{I}_d)\) and determine whether it lies in the convex hull of the training set. Repeat this whole procedure 500,000 times, giving 500,000 binary decisions on whether the sample was inside the hull or not. The average of these gives the Monte-Carlo estimate of the probability that a new sample lies in the convex hull of the training set – a single point on the plot. To get the entire plot we need to do this for \(N\) between 1 and \(10^3 = 1000\) for every \(d\). You can imagine that this will take a while, so we will cheat slightly and only sample the convex hull once for all of the 500,000 trials we do per value of \(N\). Additionally, we only take 10 different values of \(N\) for each \(d\).
Below, we’ll implement everything to do this. We are going to need a few functions:
- sample_from_gaussian: a function to sample from a multivariate Gaussian.
- is_in_convex_hull_batch: returns a vector of boolean values that specify whether each sample in a batch of new vectors falls inside the convex hull.
- probability_in_convex_hull_batch: returns the estimated probability that a new sample lies in the convex hull of training samples.
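A minimal sketch of the first helper could look like this (the histogram dimensions are my choice):

```python
import numpy as np
import matplotlib.pyplot as plt

def sample_from_gaussian(dimension, num_samples, rng=None):
    """Draw num_samples i.i.d. samples from N(0, I_dimension)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.multivariate_normal(np.zeros(dimension), np.eye(dimension), size=num_samples)

# Sanity check: histogram of the first coordinate for a few dimensions.
for d in (2, 4, 7):
    samples = sample_from_gaussian(d, 10_000)
    plt.hist(samples[:, 0], bins=50, alpha=0.5, density=True, label=f"d = {d}")
plt.legend()
plt.show()
```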
Above you see three overlapping histograms plotted, one for each of the dimensions of the sampled Gaussians.
Probability of lying in the convex hull
Recall that a vector \(\mathbf{x}\) lies within the convex hull spanned by samples \(\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^d\)
if we can write it as a convex combination of the samples \(\mathbf{x} = \lambda_1\mathbf{x}_1 + \dots + \lambda_N\mathbf{x}_N\)
subject to \(\lambda_i \geq 0\) and \(\sum_i \lambda_i = 1\).
Now to find out whether a new point \(\mathbf{x}\) is in the convex hull we just need to verify whether it is a convex combination of the hull samples. This can be framed as a quadratic programming problem, and that’s what we will do for the second plot below, but for the first plot we can use a very simple batched implementation using SciPy.
The QP formulation is necessary for the second plot because the method used below does not work for degenerate convex hulls, which is exactly what we get there (intrinsic dimension lower than the ambient dimension). Don’t worry if these terms don’t mean anything to you yet, I’ll explain them later.
Below we use a batched implementation of finding whether points lie in the convex hull by relying on scipy.spatial.Delaunay.
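A minimal sketch of these two helpers, using Delaunay’s find_simplex (which returns -1 for points outside the triangulation):

```python
import numpy as np
from scipy.spatial import Delaunay

def is_in_convex_hull_batch(hull_points, new_points):
    """Boolean per new point: True iff it lies inside the convex hull of hull_points.

    A point is inside the hull exactly when it falls inside some simplex of the
    Delaunay triangulation of the hull points; find_simplex returns -1 otherwise.
    """
    triangulation = Delaunay(hull_points)
    return triangulation.find_simplex(new_points) >= 0

def probability_in_convex_hull_batch(hull_points, new_points):
    """Monte-Carlo estimate: fraction of new points that land inside the hull."""
    return float(np.mean(is_in_convex_hull_batch(hull_points, new_points)))
```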
Let’s see if it works for the convex hull we visualized above in \(\mathbb{R}^2\). In addition to the point inside and the point outside the hull above, let’s add two more points outside the hull to get a quarter of the points inside the hull.
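A sketch of that check with assumed coordinates: the hull is the unit square from before, with one test point inside and three outside.

```python
hull_points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
test_points = np.array([
    [1.5, 1.5],  # inside the hull
    [2.5, 2.5],  # outside
    [0.5, 1.5],  # outside
    [2.5, 0.5],  # outside
])
print("Probability inside hull (batch method):",
      probability_in_convex_hull_batch(hull_points, test_points))
```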
>> Probability inside hull (batch method): 0.25
Now we are ready to reproduce the first plot. We sample \(N\) points for an ambient dimension \(d\) from a multivariate Gaussian \(\mathcal{N}(\mathbf{0}, \mathbb{I}_d)\). These points form the convex hull. Then we sample 500,000 new points (num_trials) from the same Gaussian and check whether each falls inside or outside the hull, to estimate the probability of being inside the convex hull. We do this for different values of \(N\) per dimension \(d \in \{2, \dots, 7\}\). The values of \(N\) below (in num_convex_hull_samples_per_dim) are chosen to roughly match those in the plot (I eyeballed it).
NB: the code below runs reasonably quickly for dimensions 2 to 6, but takes a few minutes for dimension 7.
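A sketch of the loop, reusing the helpers from above; the grid of \(N\) values is my own guess (starting above \(d + 1\) so Qhull can build a triangulation at all):

```python
num_trials = 500_000
dimensions = range(2, 8)
# Ten roughly log-spaced values of N per dimension, eyeballed to match the plot.
num_convex_hull_samples_per_dim = {
    d: np.unique(np.logspace(np.log10(d + 2), 3, num=10).astype(int)) for d in dimensions
}

probabilities = {}
for d in dimensions:
    print(f"Working on {d}-D")
    new_points = sample_from_gaussian(d, num_trials)  # 500,000 trial points
    probabilities[d] = []
    for n in num_convex_hull_samples_per_dim[d]:
        hull_points = sample_from_gaussian(d, int(n))  # the 'cheat': one hull reused for all trials
        probabilities[d].append(probability_in_convex_hull_batch(hull_points, new_points))
```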
>> Working on 2-D
>> Working on 3-D
>> Working on 4-D
>> Working on 5-D
>> Working on 6-D
>> Working on 7-D
Open to see the function plot_probabilities() to plot this (very uninteresting).
Jippie2! There it is! So beautiful. Very reproduce. Much similar.
Reproducing the Second Plot
In this section: Intrinsic Dimension, Data Manifold, Quadratic Programming
This plot looks incredibly similar to the first one. The difference lies in the underlying data manifold that is sampled from, i.e. the intrinsic dimension (denoted by \(\bar{d}\)) will be different. For the first plot, we sampled from a \(d\)-dimensional Gaussian. The intrinsic dimension of this Gaussian was the same as the dimension we used to represent it: \(d = \bar{d}\). For this plot, we sample 1-dimensional data and represent it in a higher dimension (\(d \in \{2, \ldots, 7\}\)). In each case the representation takes the form of a nonlinear differentiable manifold (\(\bar{d} = 1\)). The point of the plot is to show that even for data with a 1-dimensional manifold, if the ambient dimension goes up, you will still need exponentially more data to stay in interpolation regime with a high probability.
So how are we going to sample from a nonlinear differentiable 1-dimensional data manifold with a higher dimension \(d\)? We’ll take the following steps.
- Sample \(d\) independent points from a \(d\)-dimensional Gaussian.
- Interpolate between these points with a spline to get a 1D data manifold.
- Sample from a uniform distribution between 0 and 1 and put these through the spline to get samples.
These samples are now taken from a spline \(f: \mathbb{R} \rightarrow \mathbb{R}^{d}\), meaning the ambient dimension will be \(d\), whereas the intrinsic dimension is 1.
Why would this work?
This works because \(d\) independent Gaussian samples almost surely span all of \(\mathbb{R}^d\), so the spline through them is not contained in any lower-dimensional subspace: its ambient dimension is genuinely \(d\), while its intrinsic dimension stays 1.
Now we can sample from 1-dimensional manifolds in any ambient dimension. We can’t visualize dimensions above 3, but we can look at the first two components of the spline. Below let’s sample from the manifold in 2 and 10 dimensions.
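A minimal sketch of such a sampler. I use scipy.interpolate.CubicSpline with at least four anchor points so the cubic spline is well-defined; the exact spline routine and the number of anchors are my choices, not necessarily the ones used for the plots.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def sample_from_spline_manifold(ambient_dim, num_samples, rng=None):
    """Sample points from a 1-D nonlinear manifold embedded in R^ambient_dim."""
    rng = np.random.default_rng() if rng is None else rng
    # Anchor points of the curve: independent Gaussian samples in R^ambient_dim.
    num_anchors = max(ambient_dim, 4)
    anchors = rng.standard_normal((num_anchors, ambient_dim))
    # Parametrise the curve on [0, 1] and interpolate each coordinate.
    spline = CubicSpline(np.linspace(0.0, 1.0, num_anchors), anchors)
    # Evaluate at uniformly drawn parameters: intrinsic dimension 1, ambient dimension d.
    return spline(rng.uniform(0.0, 1.0, size=num_samples))

# Samples from the manifold in 2 and 10 ambient dimensions;
# the figure shows the first two coordinates of samples like these.
samples_2d = sample_from_spline_manifold(2, 1_000)
samples_10d = sample_from_spline_manifold(10, 1_000)
```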
In the above image one can see that the spline represented in 10 dimensions on the right is much more wobbly than the one that has ambient dimension 2; both have intrinsic dimension 1.
Does it lie in the convex hull? A QP problem
Unfortunately, the batched method for finding out whether a new point lies inside the convex hull doesn’t work anymore. This is because our samples all come from a 1D manifold, which causes geometric degeneracy. From the SciPy docs:
QhullError
Raised when Qhull encounters an error condition, such as geometrical degeneracy
when options to resolve are not enabled.
Instead, we will frame it as a quadratic programming problem3 and solve it with a QP solver that can handle degeneracy.
We want to find coefficients such that \(\lambda_1 \mathbf{x}_1 + \dots + \lambda_N \mathbf{x}_N = \mathbf{x}\), while respecting the constraints on the \(\lambda_i\)’s from before. If we can find coefficients that satisfy these constraints, the new point \(\mathbf{x}\) lies in the convex hull.
Recall what we have already defined so far:
- \(N\) the number of convex hull samples
- \(d\) the ambient dimension
- \(\boldsymbol{\lambda} \in \mathbb{R}^{N}\) the convex combination coefficients
- the convex hull samples \(\mathbf{X} \in \mathbb{R}^{N \times d}\)
- the new sample \(\mathbf{x} \in \mathbb{R}^d\)
Our quadratic program problem is the following:
\[\begin{align} \text{min}_{\boldsymbol{\lambda}} &\left(\mathbf{X}^T\boldsymbol{\lambda} - \mathbf{x}\right)^T \left(\mathbf{X}^T\boldsymbol{\lambda} - \mathbf{x}\right) \\ s.t. & \sum_i \lambda_i = 1 \\ & \lambda_i \geq 0 \end{align}\]If we solve this QP problem, and \(\left(\mathbf{X}^T\boldsymbol{\lambda} - \mathbf{x}\right)^T \left(\mathbf{X}^T\boldsymbol{\lambda} - \mathbf{x}\right) = 0\), it means we have \(\mathbf{x} = \mathbf{X}^T\boldsymbol{\lambda}\), and our new sample is a convex combination of the samples, i.e. lies in the convex hull.
We can rewrite the objective into the standard form as follows:
\[\begin{align} \text{min}_{\boldsymbol{\lambda}} &\frac{1}{2}\boldsymbol{\lambda}^T(\mathbf{X}\mathbf{X}^T)\boldsymbol{\lambda} + (-\mathbf{X}\mathbf{x})^T\boldsymbol{\lambda} \end{align}\]This follows from expanding the objective into \(\boldsymbol{\lambda}^T\mathbf{X}\mathbf{X}^T\boldsymbol{\lambda} - 2(\mathbf{X}\mathbf{x})^T\boldsymbol{\lambda} + \mathbf{x}^T\mathbf{x}\): multiplying by a constant doesn’t change the minimizer, and neither does adding a constant, so we can halve the expression and drop \(\mathbf{x}^T\mathbf{x}\). For the QP library that we are going to use, we need to rewrite the inequality constraints into the form \(G\boldsymbol{\lambda} \leq h\). To this end we define the matrix \(G\) as follows:
\[G = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0\\ 0 & 0 & -1 & 0\\ 0 & 0 & 0 & -1 \end{pmatrix}\]i.e. \(G = -\mathbb{I}_N\) (written out here for \(N = 4\) hull samples), and the vector \(h\):
\[h = \begin{pmatrix} 0\\ 0\\ 0\\ 0 \end{pmatrix}\]The equality constraint needs to be written into the form \(A\lambda = b\):
\[A = \begin{pmatrix} 1 & 1 & 1 & 1 \\ \end{pmatrix}\]and the vector \(b\):
\[b = \begin{pmatrix} 1 \end{pmatrix}\]Together, \(A\) and \(b\) ensure that the \(\lambda_i\) sum to 1, while \(G\) and \(h\) ensure that all entries are non-negative.
Below we implement this, and we assume a new vector is in the convex hull if the QP problem has a ‘loss’ smaller than \(10^{-5}\) (meaning \(\left(\mathbf{X}^T\boldsymbol{\lambda} - \mathbf{x}\right)^T \left(\mathbf{X}^T\boldsymbol{\lambda} - \mathbf{x}\right) \leq 10^{-5}\)).
Note the definitions of the following variables in the code below to keep it the same as in the example that I used.
\[\mathbf{P} := \mathbf{X}\mathbf{X}^T\] \[\mathbf{q} := -\mathbf{X}\mathbf{x}\]Let’s test it again for the example points and convex hull that we also used above (the probability should be \(\frac{1}{4}\)).
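A sketch of that check. Rather than a dedicated QP package, I hand the same \(P\), \(q\), \(G\), \(h\), \(A\), \(b\) to scipy.optimize.minimize with the SLSQP method, which accepts exactly these equality and inequality constraints; the test at the end reuses the hull and test points assumed earlier.

```python
import numpy as np
from scipy.optimize import minimize

def is_in_convex_hull_qp(hull_points, new_point, tol=1e-5):
    """Hull membership via the QP above; True if the residual is below tol."""
    n = hull_points.shape[0]
    P = hull_points @ hull_points.T            # P := X X^T
    q = -hull_points @ new_point               # q := -X x
    G, h = -np.eye(n), np.zeros(n)             # G lambda <= h  <=>  lambda_i >= 0
    A, b = np.ones((1, n)), np.array([1.0])    # A lambda  = b  <=>  sum_i lambda_i = 1

    result = minimize(
        lambda lam: 0.5 * lam @ P @ lam + q @ lam,
        x0=np.full(n, 1.0 / n),
        constraints=[
            {"type": "ineq", "fun": lambda lam: h - G @ lam},  # must stay >= 0
            {"type": "eq", "fun": lambda lam: A @ lam - b},
        ],
        method="SLSQP",
    )
    residual = hull_points.T @ result.x - new_point
    return float(residual @ residual) <= tol

inside = [is_in_convex_hull_qp(hull_points, p) for p in test_points]
print("Probability inside hull (quadratic programming method):", np.mean(inside))
```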
>> Probability inside hull (quadratic programming method): 0.25
We’re ready to reproduce the figure. We take the following steps to do so:
- Sample a spline for a dimension \(d \in \{2, \ldots, 7\}\).
- Get \(N\) convex hull samples from this same spline.
- Get a new sample from the spline.
- See whether it’s in the convex hull of the \(N\) samples by solving the quadratic programming problem.
- Repeat num_trials times.
CAVEAT: The code below is incredibly slow because of the new method used to determine whether a point is in the convex hull: framing it as a QP problem. The paper does 500,000 trials, but below we do just 100 to be a bit faster (it runs in a few minutes). It still works, but obviously the estimate is much worse.
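A sketch of that loop, reusing sample_from_spline_manifold and is_in_convex_hull_qp from above, together with the same eyeballed grid of \(N\) values:

```python
num_trials = 100  # the paper uses 500,000; see the caveat above
dimensions = range(2, 8)

spline_probabilities = {}
for d in dimensions:
    print(f"Working on {d}-D")
    n_values = num_convex_hull_samples_per_dim[d]
    # The 'cheat': sample the manifold once per ambient dimension and reuse it for every N.
    all_points = sample_from_spline_manifold(d, int(max(n_values)) + num_trials)
    new_points, hull_pool = all_points[:num_trials], all_points[num_trials:]
    spline_probabilities[d] = []
    for n in n_values:
        hull_points = hull_pool[: int(n)]
        inside = [is_in_convex_hull_qp(hull_points, p) for p in new_points]
        spline_probabilities[d].append(np.mean(inside))
```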
>> Working on 2-D
>> Working on 3-D
>> Working on 4-D
>> Working on 5-D
>> Working on 6-D
>> Working on 7-D
We can see that even though we cheated by only sampling the data points once per ambient dimension, and only doing 100 Monte-Carlo trials instead of 500,000, we get away with it!
Reproducing the Final Plot
In this section: Convex Hull Dimension
The final plot requires the introduction of some new terminology; it’s the plot that answers the question: “so how can we control the probability of interpolating, i.e. of a new sample lying in the convex hull?” The answer is (somewhat unsurprisingly by now): by keeping the dimension of the convex hull fixed. The convex hull dimension, referred to in the paper as the dimension of the lowest-dimensional affine subspace containing the entire data manifold, is the dimension of the convex hull of the data points. Recall the discussion of what the convex hull is above? Below on the left we take two of those datapoints, which lie on a line in \(\mathbb{R}^3\), so the ambient dimension is 3, but the convex hull dimension is 1; it’s a line. If we take 4 points that lie on a plane, like in the middle below, we have an ambient dimension of 3 and a convex hull dimension of 2. Finally, if we take the full set of 8 data points, we get a convex hull dimension of 3, which equals the ambient dimension. This convex hull dimension is denoted by \(d^*\) in the paper.
Now what we do to reproduce the final plot is show that if we keep the convex hull dimension \(d^*\) fixed but increase the ambient dimension, it no longer has the effect it had before! We don’t need more datapoints \(N\) to keep the probability of a new sample lying in the convex hull at 50% for a higher ambient dimension. The paper uses a convex hull dimension of \(d^* = 4\) and ambient dimensions of \(d \in \{5, 6, 7\}\).
How to control the convex hull dimension while increasing the ambient dimension?
We’ll use the same intuition about Gaussians as before. To get data with a convex hull dimension of 4, we sample points from a 4-dimensional Gaussian; as soon as we have enough of them, their convex hull will almost surely have dimension 4. To see why, imagine the 3D plots from above. If we sample two data points in this 3D space, we get at most a convex hull dimension of 1 (a line). If we then sample another point and it happens to lie on this line, the convex hull dimension won’t increase; it’ll still be a line. But the chances of that happening are zero, so the dimension will increase to 2, a plane. Sample yet another point, and it will again almost surely not lie in that plane, so the convex hull dimension grows to 3. Generalizing this intuition to higher dimensions: the probability that a new Gaussian sample lands exactly in the affine subspace spanned by the existing points is zero (whenever the dimension of that subspace is smaller than the dimension of the ambient space).
To increase the ambient dimension without increasing the convex hull dimension, we can simply multiply the data points of a lower dimension with an invertible matrix of the desired dimension. This doesn’t increase the convex hull dimension, because multiplying by an invertible matrix preserves rank (\(rank(\mathbf{A}\mathbf{B}) = rank(\mathbf{A})\) whenever \(\mathbf{B}\) is invertible), but it does increase the ambient dimension. We’re basically just embedding a 4D Gaussian into a higher dimensional space.
For example, we have \(N\) samples from the 4D Gaussian \(\mathbf{X} \in \mathbb{R}^{N \times 4}\) (with \(rank(\mathbf{X}) = 4\)), and we want to embed these in an ambient dimension of \(d=7\). We add 3 columns of zeros to \(\mathbf{X}\) to get \(\hat{\mathbf{X}} \in \mathbb{R}^{N \times 7}\) (still with \(rank(\hat{\mathbf{X}}) = 4\)), and multiply it with the sampled invertible matrix \(\mathbf{A} \in \mathbb{R}^{7 \times 7}\), giving us the final samples \(\mathbf{X}^{\star} \in \mathbb{R}^{N \times 7}\) (with \(rank(\mathbf{X}^{\star}) = rank(\hat{\mathbf{X}}\mathbf{A}) = rank(\hat{\mathbf{X}}) = 4\), since \(\mathbf{A}\) is invertible).
To get a random invertible \(d\times d\) matrix we just sample \(d\) times from a \(d\)-dimensional Gaussian again, which almost surely gives an invertible matrix (by the same intuition as above). Let’s implement it!
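A minimal sketch of this embedding (names are mine). The one subtlety worth a comment: the convex hull samples and the new samples have to share the same embedding matrix, otherwise a new sample would almost surely leave the hull’s 4-dimensional subspace.

```python
import numpy as np

def make_embedded_gaussian_sampler(convex_hull_dim, ambient_dim, rng=None):
    """Return a sampler for a convex_hull_dim-dimensional Gaussian embedded in R^ambient_dim."""
    rng = np.random.default_rng() if rng is None else rng
    # d x d Gaussian matrix: invertible with probability one, so it preserves rank.
    embedding_matrix = rng.standard_normal((ambient_dim, ambient_dim))

    def sample(num_samples):
        low_dim = rng.standard_normal((num_samples, convex_hull_dim))
        # Zero-pad to the ambient dimension, then apply the shared embedding;
        # the convex hull dimension stays at convex_hull_dim.
        padded = np.hstack([low_dim, np.zeros((num_samples, ambient_dim - convex_hull_dim))])
        return padded @ embedding_matrix

    return sample

# Hull samples and new samples from the same embedded Gaussian (d* = 4, d = 7).
sample = make_embedded_gaussian_sampler(convex_hull_dim=4, ambient_dim=7)
hull_points, new_points = sample(100), sample(1_000)
```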
Now we’re finally ready to reproduce the final figure.
>> Working on 5-D
>> Working on 6-D
>> Working on 7-D
And we’re done! If it also bothers you that the colors of these plots don’t match, go check out the code for this blogpost yourself and change it!
Final Ponderings
What we have learned is that no matter the intrinsic dimension of your underlying data manifold, for increasing convex hull dimension we need exponentially more data to have any chance of being in interpolation regime. At this point that doesn’t sound too surprising anymore, since we defined the interpolation regime as lying in the convex hull. But that’s precisely what this paper has achieved: it illuminated some of the terms that were previously undefined in the debate, and made them unsurprising. Of course you might say that everything is very different for real-world data and for expressive deep learning models that have a much better behaved representation space for this data. What the authors show in the paper, however, beyond the figures we reproduced here, is that the same conclusions hold. Despite the likely lower-dimensional intrinsic data manifold of natural images, finding samples in interpolation regime becomes exponentially difficult with respect to the considered ambient data dimension. Current deep learning methods almost surely operate in an extrapolation regime in both the data space and their embedding space. Of course, as the authors of this paper also acknowledge, this doesn’t mean that deep learning methods are perfect. What definition of interpolation/extrapolation should we then use when talking about generalization performance? Perhaps here there’s one answer from another researcher. I would be interested to see more. If you have comments or questions, don’t hesitate to reach out on my twitter.
Acknowledgements
If you want to be a computer scientist like me and still be able to understand concepts from mathematics, take the following steps:
- Meet a cute boy at a houseparty
- Find out he’s doing a PhD in maths
- Motivate said cute boy / maths PhD to become your boyfriend
- Quarantine together during a global pandemic
- Pester him with questions about convex hulls
- ???
- Profit
Additionally, I’m grateful for comments from Randall Balestriero, Robert Kirk, and Aryaman Fasciati.
Sources
Balestriero, R., Pesenti, J., and LeCun, Y. (2021). Learning in High Dimension Always Amounts to Extrapolation.
Bárány, I. and Füredi, Z. (1988). On the shape of the convex hull of random points.
1. The experiments that mostly endorse this conclusion are the experiments on real datasets in the paper.
2. Jippie means “Yippie” in Dutch.
3. You could also frame it as a linear programming problem.