The dimension estimation algorithm on the Dimension Demo: Part 1 page is a special case of the algorithm where the "softmax distance" parameter d0 is held constant. While this can be useful for extracting dimensional information about features of a given size in the data set, at times it will be unclear what (if any) single neighborhood size is the correct one for the data. In this context, the general idea of "persistence" is that we can examine the range of softmax distances over which a given point is associated with a particular estimated dimension; the wider the range, the more likely the estimate is to be valid. Note: For a more complete background on PCA, please refer to the Dimensional Analysis page.
The idea behind this generalization is simply to allow the softmax parameter d0 to be a continuous input variable. In practice we will select a finite discrete sampling of distances at which the computations are performed and then linearly interpolate to fill in the gaps; this only approximates the correct dimension estimate for intermediate values of d0, but as we will see, the resulting error is fairly small. From this point on, the following notational conventions are used in order to maintain clarity:
Numbers depending on d0 will be denoted using function notation, e.g. the i-th weight would be written as w_i(d0)
Matrices depending on d0 can be thought of as a 1-parameter family of matrices indexed by the softmax distance (see below)
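To make the dependence on d0 concrete, here is a minimal sketch of the weight computation. The exact weighting formula used by the algorithm is not reproduced on this page, so this sketch assumes a common Gaussian-style choice, w_i(d0) ∝ exp(-(||x_i - x0|| / d0)^2), normalized to sum to 1; treat the formula itself as an illustrative assumption.

```python
import numpy as np

def softmax_weights(X, x0, d0):
    """Neighborhood weights w_i(d0) for the point x0 at softmax distance d0.

    ASSUMPTION: uses a Gaussian-style decay exp(-(d_i / d0)**2); the page's
    exact weight definition may differ.
    """
    dists = np.linalg.norm(X - x0, axis=1)   # distances d_i = ||x_i - x0||
    w = np.exp(-((dists / d0) ** 2))         # decay controlled by d0
    return w / w.sum()                        # normalize to a weight vector
```

Viewing d0 as the free argument, each w_i(d0) is then a genuine function of the softmax distance, which is exactly the notational convention above.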
From here, the thin SVD algorithm allows us to solve for the singular value functions σ_1(d0) ≥ σ_2(d0) ≥ … ≥ σ_p(d0) satisfying the following equation:
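This step can be sketched in code for a single value of d0. The sketch assumes the Gaussian-style weights described earlier, a weighted mean-centering, and rows scaled by the square roots of the weights so that a thin SVD of the resulting matrix yields the weighted-PCA singular values; the helper name is hypothetical.

```python
import numpy as np

def singular_value_functions(X, x0, d0):
    """sigma_1(d0) >= ... >= sigma_p(d0) at one softmax distance d0.

    ASSUMPTIONS: Gaussian-style weights w_i = exp(-(d_i/d0)**2) (normalized),
    weighted mean-centering, and sqrt-weight row scaling before the thin SVD.
    """
    dists = np.linalg.norm(X - x0, axis=1)
    w = np.exp(-((dists / d0) ** 2))
    w /= w.sum()
    mu = w @ X                                 # weighted mean of the data
    A = np.sqrt(w)[:, None] * (X - mu)         # sqrt-weighted, centered rows
    return np.linalg.svd(A, compute_uv=False)  # singular values, descending
```

Sweeping d0 over a grid and calling this function at each grid point traces out the singular value functions σ_k(d0).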
This in turn allows us to define the marginal and cumulative explained variances as functions of the softmax distance:
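The variance bookkeeping here is standard PCA: the variance captured along component k is σ_k(d0)^2, so the singular value functions immediately induce marginal and cumulative explained-variance functions of d0. A minimal sketch:

```python
import numpy as np

def explained_variances(sigma):
    """Marginal and cumulative explained variance from singular values.

    variance along component k is sigma_k**2; marginal fractions sum to 1
    and the cumulative values are their running totals.
    """
    var = np.asarray(sigma, dtype=float) ** 2
    marginal = var / var.sum()
    cumulative = np.cumsum(marginal)
    return marginal, cumulative
```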
Finally, we can compute the estimated dimension at the point x0 as both a function of softmax distance d0 and cumulative explained variance c using piecewise linear interpolation along the variance direction of the constructed surface:
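One plausible reading of the interpolation step is to treat the cumulative explained variance as a piecewise-linear curve through the knots (0, 0), (c_1, 1), ..., (c_p, p) and invert it: the (possibly fractional) estimated dimension is where that curve reaches the chosen threshold c. This sketch implements that reading; the knot placement is an assumption, not a transcription of the page's formula.

```python
import numpy as np

def estimated_dimension(sigma, c):
    """Estimated dimension at cumulative explained-variance threshold c.

    ASSUMPTION: piecewise-linear interpolation through the knots
    (0, 0), (c_1, 1), ..., (c_p, p), where c_k is the cumulative
    explained variance of the first k components.
    """
    var = np.asarray(sigma, dtype=float) ** 2
    cum = np.cumsum(var) / var.sum()
    knots_c = np.concatenate(([0.0], cum))
    knots_dim = np.arange(len(cum) + 1, dtype=float)
    return float(np.interp(c, knots_c, knots_dim))
```

For example, two equal singular values split the variance evenly, so a 75% threshold lands halfway between dimensions 1 and 2, i.e. an estimate of 1.5.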
Now that the necessary adjustments have been made to the algorithm, we can apply this updated methodology to our example data sets. For example, we have previously examined the noisy unit circle data using a softmax distance of 0.8 and an explained variance of 75%. However, let us now plot the estimated dimension of a given point (at index 0 for this example) as the softmax distance grows from 0.1 to 2.5:
As we can see, the estimated dimension for our selected point can vary significantly with the value of d0. For almost all considered softmax distances, the dimension estimate lies between 0.5 and 1.5 (and would therefore be rounded to the nearest integer, 1). However, for values of d0 less than approximately 0.2, the dimension estimate is closer to 2; this reflects the fact that each point in the circular data set is surrounded by a rather small 2-dimensional neighborhood. Of potentially deeper interest are the local minimum (resp. maximum) estimates occurring at softmax distances of 0.4 (resp. 1.1). One might initially believe these stationary points to be an artifact of the density of the data set; however, as seen in the results section below, this is not the case, and something fundamentally more geometric is likely the cause.
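The d0 sweep just described can be sketched end-to-end. The data generation, noise level, grid resolution, and 75% threshold below are illustrative stand-ins, and the Gaussian-style weight formula is an assumption rather than the page's exact definition:

```python
import numpy as np

def dimension_estimate(X, x0, d0, c=0.75):
    """Estimated dimension at x0 for one softmax distance (sketch only;
    assumes Gaussian-style weights and piecewise-linear interpolation
    along the cumulative-variance axis)."""
    dists = np.linalg.norm(X - x0, axis=1)
    w = np.exp(-((dists / d0) ** 2))
    w /= w.sum()
    A = np.sqrt(w)[:, None] * (X - w @ X)        # sqrt-weighted, centered
    sigma = np.linalg.svd(A, compute_uv=False)
    cum = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
    return float(np.interp(c, np.concatenate(([0.0], cum)),
                           np.arange(len(cum) + 1.0)))

# Noisy unit circle, swept over softmax distances from 0.1 to 2.5.
rng = np.random.default_rng(42)
theta = rng.uniform(0, 2 * np.pi, 500)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X += rng.normal(scale=0.05, size=X.shape)

d0_grid = np.linspace(0.1, 2.5, 25)
estimates = [dimension_estimate(X, X[0], d0) for d0 in d0_grid]
```

Plotting `estimates` against `d0_grid` produces a curve of the kind discussed above; since the ambient space here is 2-dimensional, every estimate necessarily lies between 0 and 2.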
Regarding the choice of a softmax distance of 0.8 and an explained variance of 75% (both here and in Dimension Demo: Part 1), one could certainly be forgiven for believing these parameters were carefully chosen so as to produce the a priori determined dimension values. However, across the fairly sizable ranges of 0.1 to 2.5 for softmax distance and 50% to 90% for explained variance, the following figure (associated with the above circular data set) demonstrates that the dimension estimate rounds to the correct integer value of 1 the vast majority of the time:
The goal of the following examples is to revisit our artificially created data sets and examine how the dimension estimates vary as functions of both softmax distance and explained variance. All of the data sets used here come from a 5-dimensional ambient space. In addition, all examples use the ranges of 0.1 to 2.5 and 50% to 90% for softmax distance and explained variance respectively. Finally, each example below contains three different versions of the data set in question: one with low sample density, another with medium sample density, and a final one with high sample density. Please use the left and right arrows to iterate through the images in each carousel.
Centrally located data points seem to have an estimated dimension a little above 3
Data points more on the periphery of the cluster have an estimated dimension significantly closer to a value of 2
This phenomenon of lower dimension near the boundary is similar to the disk example from Dimension Demo: Part 1
Notice how the estimated dimension for each point doesn't really vary as the density of the data changes
It is interesting that, for all considered softmax distances, the estimated dimension increases roughly linearly as a function of explained variance, and that the slope of this linear trend doesn't really depend on the softmax distance used
Moving left to right, it is also interesting how the dimension estimate quickly increases up to a softmax distance of 0.5 and then subsequently plateaus (and if anything slightly decreases further to the right)
Notice how the softmax distance at which the dimension estimate reaches its maximum value doesn't really vary as the density of the data changes
Notice how the estimated dimension for each point varies only slightly as the density of the data changes
For example, look at how relatively isolated points in the low density figure still register an estimated dimension of 1; this is largely due to the threshold for "isolation" being fairly small relative to the chosen softmax distance value of 0.8
Note that this fact will end up changing for very small values of d0 (as can be seen in the surface plots below)
As noted above, moving left to right across the surface gives a local minimum and maximum for the estimated dimension; interestingly, the softmax distances at which these extrema occur do not appear to depend on the explained variance
In addition, like the other examples shown on this page, the surface plots appear very similar regardless of the density used
Notice how the estimated dimension for each point doesn't really vary as the density of the data changes
In fact, even at significantly lower or higher sample densities the algorithm is able to distinguish between the 1-dimensional circle and the 2-dimensional sphere
Main algorithms from the dimensional_analysis folder of the infrastructure repository
PCA algorithm from dimension_reduction.py
Dimension estimation from persistent_dimension.py
Specific demonstration figures from the dimension_demo folder of the active_projects repository
Random demo from random_data_demo.py
Circle demo from circle_data_demo.py
Merged demo from merged_data_demo.py
Additional basic functionality from the common_needs folder of the infrastructure repository
Custom RGB spectrum from color_helper.py
SQL functionality from sqlite3_helper.py
File dialogs from tkinter_helper.py
Object type checks from type_helper.py