The primary content of dimension_reduction.py is the function performPCA, a flexible and efficient implementation of the Principal Component Analysis (PCA) algorithm. The general idea of PCA is that it allows one to take high-dimensional data sets and quantitatively describe the relevance of the various features; given a threshold of minimum explained variance one can use PCA to project this high-dimensional data down to a lower dimensional space in a way that retains the overall structure of the data set.
Suppose we have data points x1,x2,...,xn sitting in some p-dimensional ambient space. Rather than considering these points separately, we can store each data point as a row of a single n-by-p matrix X in the following way:
The general goal of PCA is to fit a hyperplane to the data points in a way which minimizes the total squared orthogonal distance between the points and the hyperplane. For each dimension parameter D between 1 and p, the PCA algorithm returns the orthonormal basis vectors (called "principal directions") spanning the D-dimensional hyperplane of best fit. More generally one might want to weight the error contributions of each point differently; in this case we can select weights w1,w2,...,wn corresponding to each data point and store them as diagonal entries of an n-by-n matrix W:
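To fix notation, here is a minimal NumPy sketch of how the matrices X and W described above can be assembled; the shapes and layout follow the text, though the actual performPCA implementation may store these quantities differently:

```python
import numpy as np

# Hypothetical example: n = 5 data points in a p = 3 dimensional ambient space.
rng = np.random.default_rng(0)
points = [rng.normal(size=3) for _ in range(5)]

# Store each data point x_i as the i-th row of the n-by-p matrix X.
X = np.vstack(points)            # shape (5, 3)

# Store the per-point weights w_1, ..., w_n on the diagonal of the
# n-by-n matrix W.
weights = np.array([1.0, 1.0, 2.0, 0.5, 1.0])
W = np.diag(weights)             # shape (5, 5)

print(X.shape, W.shape)          # → (5, 3) (5, 5)
```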
There are several algorithms which allow us to compute the basis vectors of our hyperplanes. However, given the specific goals I have in mind for this project it is important that we use the Singular Value Decomposition (SVD) method instead of the more standard covariance method. The reason for this is that, instead of assuming the hyperplane passes through the center of mass of the data points, I want to allow for the possibility of forcing the plane to pass through a specified point denoted here as x0 = (x0,1,...,x0,p). As such, the "Thin SVD" algorithm allows us to find (1) an n-by-p semi-orthonormal matrix U, (2) a p-by-p diagonal matrix Σ of decreasing singular values, and (3) a p-by-p orthonormal matrix V satisfying the following:
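As a sketch of the Thin SVD step, the following uses NumPy's np.linalg.svd with full_matrices=False to obtain the thin factors. The particular convention shown here, subtracting x0 from every row and scaling row i by the square root of w_i before decomposing, is one plausible formulation rather than a guaranteed match for performPCA:

```python
import numpy as np

def weighted_centered_svd(X, weights, x0):
    """Thin SVD of the weighted, x0-centered data matrix.

    Sketch only: assumes each row has x0 subtracted and is scaled by
    sqrt(w_i) before the decomposition.
    """
    A = np.sqrt(weights)[:, None] * (X - x0)
    # full_matrices=False gives the "thin" factors: U is n-by-p, s holds
    # the p singular values in decreasing order, and Vt is p-by-p.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U, s, Vt.T            # columns of V are the principal directions

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w = np.ones(50)
x0 = X.mean(axis=0)              # or any forced center point
U, s, V = weighted_centered_svd(X, w, x0)
print(U.shape, s.shape, V.shape)  # → (50, 3) (3,) (3, 3)
```

Note that passing an arbitrary x0 instead of the center of mass is exactly what forces the fitted hyperplane through a specified point.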
From here, we can simply take the first D columns of the matrix V to be the basis vectors spanning our hyperplane of best fit. As an instructive example of typical PCA results, consider the following figure generated by the dimension_reduction_tests.py script in the unit_tests repository:
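The step of taking the first D columns of V can be sketched as follows; the projection formula shown is the standard orthogonal projection onto the affine hyperplane through x0, not necessarily the exact code in dimension_reduction.py:

```python
import numpy as np

# Project data onto the D-dimensional hyperplane of best fit through x0.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
x0 = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - x0, full_matrices=False)
V = Vt.T

D = 2
basis = V[:, :D]                           # first D principal directions
X_proj = x0 + (X - x0) @ basis @ basis.T   # orthogonal projection onto the plane
print(X_proj.shape)                        # → (20, 4)
```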
Beyond simply providing the principal directions, the PCA algorithm allows us to quantify the "explained variance" of each direction. The concept of explained variance is a deep one but thankfully it has a helpful geometric interpretation. Suppose we construct a line passing through our center point x0 and parallel to our chosen principal direction. We define the distance between an original data point and its projection onto this line as the "perpendicular component" (i.e. the length of the dotted lines in the below figures). Similarly, we define the distance between the projected data point and x0 to be the "parallel component" (i.e. the distance along the line from the red dot in the below figures). Now consider how the projections of the ellipse onto the blue and green principal directions differ with regard to parallel and perpendicular components:
There are additional interpretations of the parallel and perpendicular components that shed light on other terms commonly used to describe them:
Once the original data points have been projected onto the line, we can think of these projected points as being sampled from a 1-dimensional random variable. In the case that the center point x0 is the center of mass of the data set, the mean squared parallel component is simply the variance of this random variable. In the more general case of an arbitrary center point, the mean squared parallel component can be described as the 2nd moment of the variable about the chosen center point. As such, "variance" is often used to describe the parallel component.
In the context of machine learning and orthogonal regression, one can think of our line as a line of best fit through the data set. Through this framing one can see that the mean squared perpendicular component is simply the mean squared error of such a fit. This is why the term "error" is often used to describe the perpendicular component.
Because orthogonal projection is used for defining the parallel and perpendicular components, there is a natural relationship between the two of them given by the Pythagorean Theorem. In particular, this represents a decomposition of the vector connecting the center point to the data point:
Keeping in mind that each principal direction vector in V is a unit vector, the equation for parallel component needed here is given by the following:
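Since each principal direction is a unit vector, the parallel component is just the dot product of the centered data point with that direction, and the Pythagorean decomposition described above can be checked numerically. A small sketch with hypothetical values:

```python
import numpy as np

# Decompose the vector from the center x0 to a data point x into components
# parallel and perpendicular to a unit principal direction v.
x = np.array([3.0, 4.0, 0.0])
x0 = np.zeros(3)
v = np.array([1.0, 0.0, 0.0])            # unit vector

diff = x - x0
parallel = diff @ v                       # signed length along the line
perpendicular = np.linalg.norm(diff - parallel * v)

# Pythagorean relation: ||x - x0||^2 == parallel^2 + perpendicular^2
assert np.isclose(diff @ diff, parallel**2 + perpendicular**2)
print(parallel, perpendicular)            # → 3.0 4.0
```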
The reason that the term "explained variance" is used for PCA has to do with the two projection figures above. Note how the blue points in the left figure have low error and high variance, while in contrast the green points in the right figure have high error and low variance. This terminology references the fact that the blue points are spread further away from the central red point than the green points tend to be. With this in mind, the percent explained variance associated with the jth principal direction is given by the following equation of weights and lengths:
Although this equation is nice in that it has a clear geometric motivation with nice associated visuals, it turns out that there is a computationally simpler way to obtain explained variance using the singular values σ1,...,σp found previously in the above matrix Σ. The reasons why are left to the "concept of explained variance is a deep one" comment above, but for now here is the simplified equation which uses singular values:
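Assuming the usual convention that the explained variance of the jth direction is its squared singular value divided by the sum of all squared singular values, the computation is a one-liner:

```python
import numpy as np

# Percent explained variance from singular values (hypothetical values shown).
s = np.array([4.0, 2.0, 1.0])             # singular values from Sigma
explained = s**2 / np.sum(s**2)
print(explained)                           # → [0.76190476 0.19047619 0.04761905]
```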
The idea here is that we think of these data points as being noisily sampled from a collection of high dimensional objects and we want to be able to estimate the dimension of the object from which each point is taken. More generally, we can estimate the dimension associated with any specific point in our ambient space by looking at the "influence" various points in our data would have on it.
For the sake of the persistent dimension estimation algorithm, it suffices to assume that x0 is selected as one of the data points x1,x2,...,xn. The distance from each point in our data set to this point x0 can then be computed using the standard L2-norm in the following way:
We can determine how much influence each point has on x0 using the softmax function. In general we want points near x0 to be more heavily weighted than points further away, thus allowing us to examine the behavior of data points in a local neighborhood of our selected point. More specifically, for a given "softmax distance" parameter d0 we compute the weights w1,w2,...,wn stored in the n-by-n weight matrix W as follows:
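The weight computation can be sketched as below. The exact functional form, a softmax over the negated distances scaled by d0, is an assumption about the method; what matters for the sketch is that nearer points receive larger weights:

```python
import numpy as np

def softmax_weights(X, x0, d0):
    """Influence weights via a softmax over distances to x0.

    Sketch only: the form exp(-d_i / d0), normalized to sum to 1,
    is an assumed convention for the softmax distance parameter d0.
    """
    d = np.linalg.norm(X - x0, axis=1)    # L2 distance from each point to x0
    logits = -d / d0                       # nearer points get larger weight
    w = np.exp(logits - logits.max())      # subtract max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
w = softmax_weights(X, X[0], d0=0.5)       # x0 chosen as one of the data points
print(w.sum())                             # weights sum to 1
```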
From here, we compute the relevant information using the performPCA function summarized above. The resulting singular values σ1 ≥ σ2 ≥ … ≥ σp allow us to define the marginal (resp. cumulative) explained variance mj (resp. cj) of each PCA basis vector vj in the following way:
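Assuming mj is the explained-variance ratio from the previous section and cj is its running total over the first j directions, the two sequences can be computed as:

```python
import numpy as np

# Marginal and cumulative explained variance from singular values
# (hypothetical values shown).
s = np.array([3.0, 2.0, 1.0])             # singular values, decreasing
m = s**2 / np.sum(s**2)                   # marginal explained variance m_j
c = np.cumsum(m)                          # cumulative explained variance c_j
print(c[-1])                              # cumulative total reaches 1
```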
Artificially setting c0 = 0 allows us to define a piecewise linear function D:[0, 1] → [0, p] taking a cumulative explained variance as an input and giving an estimated dimension as an output. More specifically:
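With c0 = 0 prepended, the function D is exactly linear interpolation through the knots (c0, 0), (c1, 1), ..., (cp, p), which np.interp computes directly. A sketch with hypothetical cumulative variances:

```python
import numpy as np

def estimated_dimension(t, c):
    """Piecewise linear map D: [0, 1] -> [0, p].

    c holds the cumulative explained variances c_1 <= ... <= c_p = 1;
    c_0 = 0 is prepended so a threshold t interpolates between
    consecutive integer dimensions.
    """
    p = len(c)
    knots = np.concatenate([[0.0], c])    # c_0 = 0, then c_1, ..., c_p
    return np.interp(t, knots, np.arange(p + 1))

c = np.array([0.6, 0.9, 1.0])             # example cumulative variances
print(estimated_dimension(0.3, c))        # halfway between c_0 = 0 and c_1 = 0.6
```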
The idea of including "persistence" is that we can examine the range of softmax distances for which a given point will be assigned a particular dimension estimate; the wider the range the more likely it is to be a valid estimate. This is very similar to the notion of barcodes from persistent homology. In that context, spheres of a given radius are drawn around each data point and the intersections of these spheres are used to generate a simplicial complex. Topological features of the complex are then shown as a function of the sphere radius using a barcode diagram; bars that exist for a wider range of radii are considered more significant and less likely to just be noise.
The persistent dimension estimation algorithm presented in this project takes a similar approach. The idea of this algorithmic generalization is to simply allow the softmax parameter d0 to be a continuous variable that is used as an input. In practice we will end up selecting a finite discrete sampling of distances at which the computations are performed, and we can then linearly interpolate to fill in the gaps; this only provides an approximation of the correct dimension estimate for some values of d0, but as we will see this method ends up having a fairly small level of error.
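The sweep-and-interpolate scheme can be sketched as follows. Here estimate_dimension_at is a hypothetical stand-in for the full pipeline (softmax weights, then weighted PCA, then cumulative explained variance, then the function D); only the sampling and interpolation structure is the point of the sketch:

```python
import numpy as np

def estimate_dimension_at(d0):
    # Placeholder for the full pipeline's response to a softmax distance d0.
    return 2.0 + 0.1 * d0

# Evaluate the estimate on a finite grid of softmax distances...
d0_grid = np.linspace(0.1, 2.0, 20)
dims = np.array([estimate_dimension_at(d) for d in d0_grid])

# ...then linearly interpolate to approximate the estimate at any d0.
def dimension(d0):
    return np.interp(d0, d0_grid, dims)

print(dimension(1.0))
```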
The main goal of this page is to outline the persistent dimension estimation algorithm in an abstract setting. However, the pages listed below serve as more concrete examples of how the algorithm can be used. "Dimension Demo: Part 1" looks at how estimated dimension results can appear across a variety of data sets in the case that softmax distance is constant. In contrast, "Dimension Demo: Part 2" is a deep dive into the persistence aspect of the algorithm, examining the implications of particular softmax distance choices across the data sets.
A general overview of a novel pointwise dimension estimation algorithm for analyzing high dimensional data sets
An extension of the pointwise dimension estimation algorithm which draws from the principles found in persistent homology