This presentation shows how persistent homology, alongside principal component analysis, extracts meaningful information from high-dimensional point cloud data for visualization and cluster identification. PCA's sensitivity to scale motivates topological data analysis, a scale-insensitive approach that studies the shape of the data: simplicial complexes build higher-dimensional structure on the point cloud, Betti numbers summarize its shape, and landmarking selects a manageable subset of the data for efficient analysis.
Persistent Homology in Topological Data Analysis
Ben Fraser, May 27, 2015
Data Analysis
• Suppose we start with some point cloud data and want to extract meaningful information from it
• One way to do so is to visualize the data by plotting it on a graph
• However, in higher dimensions, visualization becomes difficult
• A possible solution: dimensionality reduction
Principal Component Analysis
• Essentially, PCA fits an ellipsoid to the data, where each axis corresponds to a principal component
• The smaller axes are those along which the data has less variance
• We can discard these less important principal components to reduce the dimensionality of the data while retaining as much of the variance as possible
• The reduced data may then be easier to graph, for example to identify clusters
Principal Component Analysis
• This is done by computing the singular value decomposition of X (each row is a point, each column a dimension): X = U Σ Vᵀ
• We then form the truncated score matrix, where L is the number of principal components we retain: T_L = U_L Σ_L
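As a minimal sketch of this step (in Python/NumPy; the function name and interface are mine, not from the slides), assuming X is laid out as described:

```python
import numpy as np

def pca_scores(X, L):
    """Truncated score matrix T_L = U_L Σ_L computed via the SVD.
    X: rows are points, columns are dimensions. A minimal sketch
    (centring each column first, as PCA requires)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :L] * s[:L]      # keep the first L principal components
```

For instance, `pca_scores(X, 2)` projects the data to two dimensions for plotting.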
Principal Component Analysis
• Example: 8-dimensional data reduced to 2 dimensions to locate clusters (figure)
• Example: reducing 3 dimensions to 2 collapses a cylinder to a circle (figure)
• PCA is scale sensitive: the same transformation produces a poor result on data with the same shape at a different scale (figure)
Data Analysis
• One weakness of PCA is its sensitivity to the scale of the data
• Also, it provides no information about the shape of our data
• We want something insensitive to scale which can identify shape (why?)
• Because "data has shape, and shape has meaning" - Ayasdi (Gunnar Carlsson)
Topological Data Analysis
• Construct higher-dimensional structure on our point cloud via simplicial complexes
• Then analyze this family of nested complexes with persistent homology
• Display the Betti numbers in graph form
• Essentially, we approximate the shape of the data by building a graph on it, treating cliques as higher-dimensional objects, and counting the cycles of such objects
Algorithm
• Since scale doesn't matter in this analysis, we can normalize the data
• Also, since we don't want to work with the entire data set (especially if it is very large), we want to choose a subset of the data to work with
• We would ideally like this subset to be representative of the original data (but how?)
• This selection process is called landmarking
Landmarking
• The method used here is minMax
• Start by computing a distance matrix D
• Then choose a random point l₁ to add to the subset of landmarks L
• Then choose each subsequent i-th point as the one with maximum distance from the landmark it is closest to (a sketch follows):
  lᵢ = p such that dist(p, L) = max{dist(x, L) : x ∈ X}, where dist(x, L) = min{dist(x, l) : l ∈ L}
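A minimal Python/NumPy sketch of this minMax selection, assuming a precomputed full distance matrix D (function and argument names are mine):

```python
import numpy as np

def min_max_landmarks(D, n_landmarks, seed=None):
    """minMax landmark selection from a precomputed distance matrix D.
    A sketch of the procedure above, not the author's code."""
    rng = np.random.default_rng(seed)
    picked = [int(rng.integers(D.shape[0]))]   # random first landmark l1
    dist_to_L = D[picked[0]].copy()            # dist(x, L) for every point x
    for _ in range(n_landmarks - 1):
        nxt = int(np.argmax(dist_to_L))        # point farthest from L
        picked.append(nxt)
        dist_to_L = np.minimum(dist_to_L, D[nxt])
    return picked
```

Updating dist_to_L with a running minimum keeps each iteration linear in the number of points rather than rescanning all landmark distances.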
Landmarking
• Landmarking is not an exact science, however: on certain types of data, the method just used may produce a subset that is very unrepresentative of the original data (for example, when the data contains outliers, which minMax picks first)
Algorithm
• As long as outliers are ignored, the method works well at picking points as spread out as possible among the data
• Next we keep only the distance matrix between the landmark points, and normalize it (see the sketch below)
• This is all the information we need from the data: the actual positions of the points are irrelevant; all we need are the distances between the landmarks, on which we will construct a neighbourhood graph
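A one-step sketch of that reduction (assuming D and the landmark indices from the minMax step above):

```python
import numpy as np

def landmark_distances(D, landmarks):
    """Keep only the landmark-to-landmark distances and scale them
    into [0, 1]. A sketch; assumes D has at least one nonzero entry."""
    idx = np.asarray(landmarks)
    DL = D[np.ix_(idx, idx)].astype(float)
    return DL / DL.max()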
Neighbourhood Graph
• Our goal is to create a nested sequence of graphs: we add a single edge at a time, between the points x, y ∈ L for which dist(x, y) is the smallest remaining value in D, and then replace that distance in D with 1 so it is never selected again
• At each iteration of adding an edge, we keep track of r = dist(x, y), r ∈ [0, 1]: this is our proximity parameter, and it will be important when we graph the Betti numbers later (a sketch follows)
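One way to realize this in Python/NumPy, assuming DL is the normalized landmark distance matrix. Rather than repeatedly overwriting used entries of D with 1, this sketch sorts all pairs once, which yields the same edge order:

```python
import numpy as np

def edge_filtration(D):
    """Yield edges (x, y, r) in order of increasing distance, where
    D is the normalized landmark distance matrix and r in [0, 1] is
    the proximity parameter."""
    i, j = np.triu_indices(D.shape[0], k=1)   # each unordered pair once
    for k in np.argsort(D[i, j]):
        yield int(i[k]), int(j[k]), float(D[i[k], j[k]])
```

A caller would then build the nested graphs with `for x, y, r in edge_filtration(DL): ...`, recording r as each edge appears.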
Witness Complex
• Def: A point x is a weak witness to a p-simplex (a₀, a₁, ..., aₚ) in A if |x − a| < |x − b| ∀ a ∈ {a₀, a₁, ..., aₚ} and ∀ b ∈ A \ {a₀, a₁, ..., aₚ}
• Def: A point x is a strong witness to a p-simplex (a₀, a₁, ..., aₚ) in A if x is a weak witness and, additionally, |x − a₀| = |x − a₁| = … = |x − aₚ|
• The requirement may be added that an edge is only added between two points if there exists a weak witness to that edge (see the sketch below)
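For edges, the weak-witness test has a simple form: a data point witnesses a 1-simplex exactly when its two endpoints are the point's two nearest landmarks. A sketch under an assumed data layout (the matrix W and function name are mine):

```python
import numpy as np

def witnessed_edges(W):
    """Landmark pairs that have a weak witness among the data points.
    W[x, l] = distance from data point x to landmark l (an assumed
    layout). For a 1-simplex, the weak-witness condition says both
    endpoints are closer to x than every other landmark, i.e. they
    are x's two nearest landmarks."""
    nearest_two = np.argsort(W, axis=1)[:, :2]
    return {frozenset(pair) for pair in nearest_two}
```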
Simplicial Complexes
• Next we want to construct higher-dimensional structure on the neighbourhood graph: a simplicial complex
• A simplex is a point, edge, triangle, tetrahedron, etc. (a k-simplex is a (k+1)-clique in the graph)
• A face of a simplex is a sub-simplex of it
• A simplicial k-complex is a set S of simplices, each of dimension ≤ k, such that every face of a simplex in S is also in S, and the intersection of any two simplices in S is a face of both of them
Simplicial Complexes
• At each iteration we add one edge, so all we need to do is check whether it creates any new k-simplices
• The edge itself adds a single 1-simplex to the complex
• A k-simplex is formed if the intersection of the neighbourhoods of a (k−2)-simplex contains the two points of the added edge
• In other words, if every point of a (k−2)-simplex is joined to both endpoints of the edge, then together they form a k-simplex (see the sketch below)
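A sketch of this check in Python; the data structures (an adjacency dict and the current simplex list as frozensets) are my assumption, not the author's:

```python
def new_simplices(adj, simplices, u, v, max_dim):
    """Simplices created when edge (u, v) is added to the graph.
    adj: dict mapping each vertex to its neighbour set (with the new
    edge already inserted); simplices: current simplices as frozensets
    (including the 0-simplices, i.e. single vertices)."""
    common = (adj[u] & adj[v]) - {u, v}       # vertices joined to both u and v
    created = [frozenset((u, v))]             # the new 1-simplex itself
    for s in simplices:
        # a (k-2)-simplex s whose vertices all neighbour u and v
        # combines with the edge to form a new k-simplex
        if len(s) + 2 <= max_dim + 1 and s <= common:
            created.append(s | {u, v})
    return created
```

For example, each common neighbour w of u and v is a 0-simplex contained in `common`, so the triangle {u, v, w} appears among the created 2-simplices.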
Boundary Matrices
• Next we compute boundary matrices. Essentially, these store the information that (k−1)-simplices are faces of certain k-simplices
• For instance, in a simplicial complex with 100 triangles and 50 tetrahedra, the boundary matrix from tetrahedra to triangles has 100 rows and 50 columns, with zeros everywhere except where the given triangle is a face of the given tetrahedron, where the entry is 1
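A sketch of building one such 0/1 boundary matrix in Python/NumPy, with simplices represented as frozensets of vertex indices (an assumed encoding):

```python
import numpy as np

def boundary_matrix(faces, simplices):
    """0/1 boundary matrix over Z/2: rows index the (k-1)-simplices
    in `faces`, columns index the k-simplices in `simplices`; entry
    (i, j) is 1 exactly when face i lies in simplex j. Assumes every
    face of every simplex appears in `faces`."""
    row = {f: i for i, f in enumerate(faces)}
    B = np.zeros((len(faces), len(simplices)), dtype=np.uint8)
    for j, s in enumerate(simplices):
        for v in s:                 # dropping one vertex gives one face
            B[row[s - {v}], j] = 1
    return B
```

For the example above, with 100 triangles as `faces` and 50 tetrahedra as `simplices`, this returns a 100 × 50 matrix with four ones per column, one for each triangular face of a tetrahedron.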