Principal Component Analysis (PCA) is a powerful tool for reducing high-dimensional data to its essential features. However, PCA is fundamentally a global algorithm that takes no account of the more localized structure of the data. The idea of this project is to estimate the local dimension of high-dimensional data sets using a modified version of PCA.
Suppose we have data points x1,x2,...,xn sitting in some p-dimensional ambient space. Rather than considering these points separately, we can store each data point as a row of a single n-by-p matrix X in the following way:
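The display that followed appears to be missing; a reconstruction consistent with the description, with the coordinates of each data point xi = (x_{i,1}, ..., x_{i,p}) stored as row i, is:

```latex
X =
\begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_{n,1} & x_{n,2} & \cdots & x_{n,p}
\end{pmatrix}
```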
The idea here is that we think of these data points as being noisily sampled from a collection of high-dimensional objects and we want to be able to estimate the dimension of the object from which each point is taken. More generally, we can estimate the dimension associated with any specific point in our ambient space by looking at the "influence" various points in our data would have on it.
To clarify this idea, suppose our specified point in p-dimensional space is denoted as x0 = (x0,1,...,x0,p) (note: in theory x0 can be any point in the ambient space but in practice we iterate it over each of the points in the data set). We can then compute the distance from each point in our data to this point x0 using the standard L2 norm in the following way:
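Reconstructing the missing display from the description (the standard L2 norm applied to each data point and x0):

```latex
d_i = \lVert x_i - x_0 \rVert_2
    = \sqrt{\sum_{j=1}^{p} \left( x_{i,j} - x_{0,j} \right)^2},
\qquad i = 1, 2, \dots, n
```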
We can then determine how much influence each point has on x0 using softmax. More specifically, for a given "softmax distance" parameter d0 we can compute the weights w1,w2,...,wn and associated n-by-n weight matrix W as follows:
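A reconstruction consistent with the text, where the negative sign in the exponent makes influence decay with distance; taking W to be the diagonal matrix of the weights themselves is an assumption, since some weighted-PCA implementations place the square roots of the weights on the diagonal instead:

```latex
w_i = \frac{e^{-d_i / d_0}}{\sum_{j=1}^{n} e^{-d_j / d_0}},
\qquad
W = \operatorname{diag}(w_1, w_2, \dots, w_n)
```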
From here we simply perform weighted principal component analysis on the data. However, because we are shifting the data by x0 rather than the center of mass of the data set, it is important that we use the Singular Value Decomposition (SVD) method instead of the more standard covariance method. The "Thin SVD" algorithm allows us to find (1) an n-by-p semi-orthonormal matrix U, (2) a p-by-p diagonal matrix Σ of non-increasing singular values, and (3) a p-by-p orthogonal matrix V satisfying the following:
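Reading the text literally, the decomposition is applied to the weighted, x0-shifted data; here 1_n denotes the all-ones column vector, so the term 1_n x_0^T subtracts x0 from every row of X:

```latex
W \left( X - \mathbf{1}_n x_0^{\top} \right) = U \Sigma V^{\top},
\qquad
U^{\top} U = I_p,
\qquad
V^{\top} V = V V^{\top} = I_p
```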
Given that the diagonal entries of Σ satisfy σ1 ≥ σ2 ≥ … ≥ σp, the marginal (resp. cumulative) explained variance mi (resp. ci) of each PCA basis vector vi is given by
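The standard formulas, reconstructed from the definitions above:

```latex
m_i = \frac{\sigma_i^2}{\sum_{j=1}^{p} \sigma_j^2},
\qquad
c_i = \sum_{j=1}^{i} m_j = m_1 + m_2 + \cdots + m_i
```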
Artificially setting c0 = 0 allows us to define a piecewise linear function D:[0, 1] → [0, p] taking a cumulative explained variance as an input and giving an estimated dimension as an output. More specifically:
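A natural reconstruction of this piecewise linear function is the one that interpolates the points (c_i, i) for i = 0, 1, ..., p, so that D(c_i) = i exactly and intermediate values are interpolated linearly:

```latex
D(v) = (i - 1) + \frac{v - c_{i-1}}{c_i - c_{i-1}}
\qquad \text{for } c_{i-1} \le v \le c_i
```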
The goal of the following examples is to demonstrate that it is possible to recover the correct dimensionality of artificially created data sets. In this case, 1 and 2-dimensional objects were embedded in 5-dimensional space and then perturbed with uniformly sampled noise at a scale of ±0.05 in all dimensions. At least in the examples shown below, a cumulative percent variance of 75% tends to correctly estimate the dimension of the underlying object; further study is needed to see how well this holds for a wider variety of data.
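The full pipeline can be sketched in a few lines of NumPy. This is a minimal reimplementation of the method as described above, not the project's own dimension_estimation.py; the function and variable names are mine, as is the choice to scale rows by √wi so that the squared singular values equal the w-weighted variances:

```python
import numpy as np

def local_dimension(X, x0, d0, target_variance=0.75):
    """Estimate the local dimension of the data X at the point x0.

    X  : (n, p) data matrix, one data point per row
    x0 : (p,) point at which to estimate the dimension
    d0 : softmax distance parameter
    """
    # L2 distance from each data point to x0
    d = np.linalg.norm(X - x0, axis=1)
    # Softmax weights: nearby points exert more influence
    w = np.exp(-d / d0)
    w /= w.sum()
    # Weighted PCA via thin SVD of the x0-shifted (not mean-centered) data;
    # scaling rows by sqrt(w) makes squared singular values w-weighted variances
    A = np.sqrt(w)[:, None] * (X - x0)
    s = np.linalg.svd(A, compute_uv=False)
    # Marginal and cumulative explained variance, with c_0 = 0 prepended
    m = s**2 / np.sum(s**2)
    c = np.concatenate(([0.0], np.cumsum(m)))
    # Piecewise linear map D with D(c_i) = i
    return float(np.interp(target_variance, c, np.arange(len(c))))

# Noisy unit circle embedded in 5-dimensional space, as in the examples below
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 500)
circle = np.zeros((500, 5))
circle[:, 0], circle[:, 1] = np.cos(t), np.sin(t)
circle += rng.uniform(-0.05, 0.05, circle.shape)

# Should indicate a roughly 1-dimensional object at this point
est = local_dimension(circle, circle[0], d0=0.8)
```

The square-root weighting is one common convention for weighted PCA; if the project's code multiplies rows by the weights directly, the singular values differ but the qualitative behavior of the dimension estimate is similar.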
This unit circle consists of 500 randomly generated points
A sufficiently small neighborhood of each point could be considered as a 2-dimensional region due to the presence of noise in the data
Similarly, traditional PCA would indicate that the vast majority of the variance is explained by a 2-dimensional model
However, this method, using a softmax distance of 0.8, does a good job of extracting the fact that a circle can be parametrized by only a single parameter
This unit disk consists of 500 randomly generated points
Generally points in an open disk are surrounded by a 2-dimensional neighborhood
However, it is interesting that a disk (being a manifold with boundary) ends up with a lower estimated dimension near said boundary
This unit sphere consists of 500 randomly generated points
Despite the lower overall density of the point cloud, the same softmax distance and percent variance as the other examples still manages to estimate the correct dimension
This merged unit circle and sphere consists of 1000 randomly generated points
The main point here is that the algorithm is able to distinguish points contained in the lower dimensional circle vs the higher dimensional sphere
The region of the circle near the intersection of the two shapes does have an overestimated dimension; this is an artifact of using 0.8 as the softmax distance
Main algorithms from the dimensional_analysis folder of the infrastructure repository
PCA algorithm from dimension_reduction.py
Dimension estimation from dimension_estimation.py
Specific demonstration figures from the dimension_demo folder of the active_projects repository
Circle demo from circle_data_demo.py
Disk demo from disk_data_demo.py
Sphere demo from sphere_data_demo.py
Merged demo from merged_data_demo.py
Additional basic functionality from the common_needs folder of the infrastructure repository
Custom RGB spectrum from color_helper.py
Linear spline from spline_helper.py
SQL functionality from sqlite3_helper.py
File dialogs from tkinter_helper.py
Object type checks from type_helper.py