Principal Component Analysis (PCA) is a powerful tool for reducing high-dimensional data to its essential features. However, PCA is fundamentally a global algorithm and does not take localized structure of the data into account. The idea of this project is to estimate the local dimension of high-dimensional data sets using a modified version of PCA. Note: For a more complete background on PCA, please refer to the Dimensional Analysis page.
Suppose we have data points x1,x2,...,xn sitting in some p-dimensional ambient space. Rather than considering these points separately, we can store each data point as a row of a single n-by-p matrix X in the following way:
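For concreteness, here is a minimal sketch of this setup in NumPy (the point values are purely illustrative): each data point becomes one row of the n-by-p matrix X.

```python
import numpy as np

# Hypothetical example: n = 4 data points in a p = 3 dimensional ambient space.
x1 = np.array([0.0, 1.0, 2.0])
x2 = np.array([1.0, 0.0, 2.0])
x3 = np.array([2.0, 1.0, 0.0])
x4 = np.array([1.0, 2.0, 0.0])

# Stack the points as rows of a single n-by-p matrix X.
X = np.vstack([x1, x2, x3, x4])
print(X.shape)  # (4, 3)
```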
The idea here is that we think of these data points as being noisily sampled from a collection of high dimensional objects and we want to be able to estimate the dimension of the object from which each point is taken. More generally, we can estimate the dimension associated with any specific point in our ambient space by looking at the "influence" various points in our data would have on it.
To clarify this idea, suppose our specified point in p-dimensional space is denoted x0 = (x0,1,...,x0,p). Again, in theory x0 can be any point in the ambient space, but in practice we will iteratively take it to be each point of the data set itself. We can then compute the distance from each point in our data to x0 using the standard L2-norm:
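In NumPy this distance computation is a one-liner over the rows of X (the data values below are illustrative):

```python
import numpy as np

# n = 3 data points in p = 2 dimensions (values are illustrative).
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [1.0, 1.0]])
x0 = np.array([0.0, 0.0])  # the specified point x0

# L2 (Euclidean) distance from each data point (row of X) to x0.
distances = np.linalg.norm(X - x0, axis=1)
print(distances)  # [0., 5., ~1.414]
```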
We can determine how much influence each point has on x0 using the softmax function. More specifically, for a given "softmax distance" parameter d0 we can compute the weights w1,w2,...,wn and associated n-by-n weight matrix W as follows:
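The text does not spell out the exact softmax formula; a natural reading is a softmax over the negative distances scaled by d0, i.e. wi proportional to exp(-di/d0), with W the diagonal matrix of the wi. A sketch under that assumption:

```python
import numpy as np

def softmax_weights(X, x0, d0):
    """Influence weight of each row of X on the point x0.

    Assumed form: softmax of -d_i / d0, where d_i is the L2 distance
    from x_i to x0 and d0 is the softmax distance parameter.
    """
    d = np.linalg.norm(X - x0, axis=1)
    logits = -d / d0
    logits -= logits.max()   # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()       # weights sum to 1

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
w = softmax_weights(X, x0=np.array([0.0, 0.0]), d0=0.8)
W = np.diag(w)  # associated n-by-n weight matrix
```

Closer points receive larger weights, and shrinking d0 concentrates the influence more tightly around x0.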
From here we simply perform weighted principal component analysis on the data. However, because we are shifting the data by x0 rather than the center of mass of the data set, it is important that we use the Singular Value Decomposition (SVD) method instead of the more standard covariance method. The "Thin SVD" algorithm allows us to find (1) an n-by-p semi-orthonormal matrix U, (2) a p-by-p diagonal matrix Σ of decreasing singular values, and (3) a p-by-p orthonormal matrix V satisfying the following:
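One common way to realize this weighted PCA (assumed here, since the exact weighting scheme is not written out): shift every row of X by x0, scale each shifted row by the square root of its weight, and take the thin SVD of the result.

```python
import numpy as np

def weighted_pca_svd(X, x0, w):
    """Thin SVD of the weighted data shifted by x0 (not the centroid).

    Assumed scheme: each shifted row is scaled by sqrt(w_i), so the
    singular values reflect the weighted spread about x0.
    """
    A = np.sqrt(w)[:, None] * (X - x0)          # n-by-p weighted, shifted data
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U, s, Vt.T                           # U: n-by-p, s: length p, V: p-by-p

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # illustrative data, n = 100, p = 5
x0 = X[0]
w = np.full(100, 1 / 100)                       # uniform weights for illustration
U, s, V = weighted_pca_svd(X, x0, w)
```

With `full_matrices=False`, NumPy returns exactly the thin factorization described above: U is n-by-p with orthonormal columns, the singular values in `s` are in decreasing order, and V is p-by-p orthonormal.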
Given that the diagonal entries of Σ satisfy the ordering σ1 ≥ σ2 ≥ … ≥ σp, the marginal (resp. cumulative) explained variance mj (resp. cj) of each PCA basis vector vj is given in the following way:
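In code, with the standard convention that mj = σj² / Σk σk² and cj = m1 + … + mj, these quantities follow directly from the singular values:

```python
import numpy as np

# Singular values in decreasing order (illustrative values).
s = np.array([3.0, 2.0, 1.0])

m = s**2 / np.sum(s**2)   # marginal explained variance of each basis vector v_j
c = np.cumsum(m)          # cumulative explained variance c_j = m_1 + ... + m_j

print(m)  # [9/14, 4/14, 1/14]
print(c)  # [9/14, 13/14, 1.0]
```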
Artificially setting c0 = 0 allows us to define a piecewise linear function D:[0, 1] → [0, p] taking a cumulative explained variance as an input and giving an estimated dimension as an output. More specifically:
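Since c0 = 0 and cp = 1, the p + 1 knot pairs (cj, j) determine D by linear interpolation, which `np.interp` implements directly; in particular D(cj) = j exactly. A sketch:

```python
import numpy as np

def estimated_dimension(t, c):
    """Piecewise linear D: [0, 1] -> [0, p].

    c is the length-p array of cumulative explained variances
    (with c[p-1] == 1); prepending c_0 = 0 gives p + 1 knots
    (c_j, j) that are interpolated linearly.
    """
    knots = np.concatenate(([0.0], c))          # c_0 = 0, c_1, ..., c_p
    return np.interp(t, knots, np.arange(len(knots)))

c = np.array([0.6, 0.9, 1.0])                   # p = 3 (illustrative values)
print(estimated_dimension(0.75, c))             # 1.5: halfway between c_1 and c_2
```

A cumulative explained variance target that falls between two knots yields a fractional dimension estimate, which is exactly what allows the method to report values like 1.5 near ambiguous regions of the data.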
The goal of the following examples is to demonstrate that it is possible to recover the correct dimension of artificially created data sets. In this case, 1- and 2-dimensional objects were embedded in 5-dimensional space and then perturbed with uniformly sampled noise at a maximum scale of ±0.05 in all dimensions. At least in the examples shown below, a softmax distance of 0.8 and a cumulative explained variance of 75% tend to correctly estimate the dimension of the underlying object; further study is needed to see how well this holds for a wider variety of data.
This unit circle consists of 500 randomly generated points
A sufficiently small neighborhood of each point could be considered a 2-dimensional region due to the presence of noise in the data
Similarly, the traditional PCA algorithm would indicate that the vast majority of the variance is explained by a 2-dimensional model
However, this method using a softmax distance of 0.8 does a good job of verifying that a circle can be parametrized by only a single parameter
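An end-to-end sketch of the circle experiment, under the formulas assumed in this write-up (softmax weights wi proportional to exp(-di/d0), rows scaled by sqrt(wi) before the SVD), since the demo scripts themselves are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(42)

# 500 noisy points on a unit circle embedded in 5-dimensional space,
# perturbed by uniform noise of at most +/-0.05 in every coordinate.
theta = rng.uniform(0, 2 * np.pi, size=500)
X = np.zeros((500, 5))
X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)
X += rng.uniform(-0.05, 0.05, size=X.shape)

def local_dimension(X, x0, d0=0.8, target=0.75):
    """Estimated local dimension at x0 (assumed softmax/weighting forms)."""
    d = np.linalg.norm(X - x0, axis=1)
    w = np.exp(-d / d0)
    w /= w.sum()                                # softmax influence weights
    A = np.sqrt(w)[:, None] * (X - x0)          # weighted data shifted by x0
    s = np.linalg.svd(A, compute_uv=False)      # thin SVD singular values
    c = np.cumsum(s**2) / np.sum(s**2)          # cumulative explained variance
    knots = np.concatenate(([0.0], c))
    return np.interp(target, knots, np.arange(len(knots)))

# Estimate the dimension at every data point, as described in the text.
dims = np.array([local_dimension(X, x) for x in X])
print(dims.mean())  # mean estimated dimension across the circle
```

With these assumed formulas the per-point estimates come out well below 2, consistent with the circle being a 1-dimensional object; the exact values depend on the precise weighting formula, which is a modeling choice here.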
This unit disk consists of 500 randomly generated points
Generally points in the interior of a disk are surrounded locally by a small 2-dimensional neighborhood
However, it is interesting that a disk (being a manifold with boundary) ends up with a lower estimated dimension near said boundary and a higher estimated dimension toward the middle
Note: Overall the estimated dimensions are too low for this object; in practice one would likely want to adjust the parameters to extract the correct values
This unit sphere consists of 500 randomly generated points
Despite the lower overall density of the point cloud, the same softmax distance and explained variance values as the other examples still manage to estimate the correct dimension for this data set
Note: The topic of sample density is covered in more depth on the page Dimension Demo: Part 2
These merged unit circle and sphere data sets consist of a total of 1000 randomly generated points
The main idea here is that the algorithm is able to distinguish between points contained in the lower dimensional circle and the higher dimensional sphere
Note: The region of the circle near the intersection of the two shapes does end up having an overestimated dimension; this is an artifact of using a value as large as 0.8 for the softmax distance
Main algorithms from the dimensional_analysis folder of the infrastructure repository
PCA algorithm from dimension_reduction.py
Dimension estimation from persistent_dimension.py
Specific demonstration figures from the dimension_demo folder of the active_projects repository
Circle demo from circle_data_demo.py
Disk demo from disk_data_demo.py
Sphere demo from sphere_data_demo.py
Merged demo from merged_data_demo.py
Additional basic functionality from the common_needs folder of the infrastructure repository
Custom RGB spectrum from color_helper.py
SQL functionality from sqlite3_helper.py
File dialogs from tkinter_helper.py
Object type checks from type_helper.py