This presentation shows how persistent homology, alongside principal component analysis, extracts meaningful information from high-dimensional point cloud data for visualization and cluster identification. PCA's sensitivity to scale motivates topological data analysis, a scale-insensitive approach that studies the shape of the data: simplicial complexes build higher-dimensional structure on the point cloud, Betti numbers summarize its shape, and landmarking selects a manageable subset of the data for efficient analysis.
Persistent Homology in Topological Data Analysis
Ben Fraser, May 27, 2015
Data Analysis
• Suppose we start with some point cloud data and want to extract meaningful information from it
• One way to do so is to visualize the data by plotting it on a graph
• However, in higher dimensions, visualization becomes difficult
• A possible solution: dimensionality reduction
Principal Component Analysis
• Essentially, PCA fits an ellipsoid to the data, where each axis corresponds to a principal component
• The smaller axes are those along which the data has less variance
• We can discard these less important principal components to reduce the dimensionality of the data while retaining as much of the variance as possible
• The reduced data may then be easier to graph, for example to identify clusters
Principal Component Analysis
• This is done by computing the singular value decomposition of X (each row is a point, each column a dimension): X = U Σ Vᵀ
• We then form the truncated score matrix, where L is the number of principal components we retain: T_L = U_L Σ_L
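As a minimal sketch of this step (in Python/NumPy; the function name and interface are mine, not from the slides), assuming X is laid out as described:

```python
import numpy as np

def pca_scores(X, L):
    """Truncated score matrix T_L = U_L Σ_L computed via the SVD.
    X: rows are points, columns are dimensions. A minimal sketch
    (centring each column first, as PCA requires)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :L] * s[:L]      # keep the first L principal components
```

For instance, `pca_scores(X, 2)` projects the data to two dimensions for plotting.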
Principal Component Analysis
• Example: 8-dimensional data reduced to 2 dimensions to locate clusters (figure)
• Example: reducing 3 dimensions to 2 collapses a cylinder to a circle (figure)
• PCA is scale sensitive: the same transformation produces a poor result on data with the same shape at a different scale (figure)
Data Analysis
• One weakness of PCA is its sensitivity to the scale of the data
• Also, it provides no information about the shape of our data
• We want something insensitive to scale which can identify shape (why?)
• Because "data has shape, and shape has meaning" - Ayasdi (Gunnar Carlsson)
Topological Data Analysis
• Construct higher-dimensional structure on our point cloud via simplicial complexes
• Then analyze this family of nested complexes with persistent homology
• Display the Betti numbers in graph form
• Essentially, we approximate the shape of the data by building a graph on it, treating cliques as higher-dimensional objects, and counting the cycles of such objects
Algorithm
• Since scale doesn't matter in this analysis, we can normalize the data
• Also, since we don't want to work with the entire data set (especially if it is very large), we want to choose a subset of the data to work with
• We would ideally like this subset to be representative of the original data (but how?)
• This selection process is called landmarking
Landmarking
• The method used here is minMax
• Start by computing a distance matrix D
• Then choose a random point l₁ to add to the subset of landmarks L
• Then choose each subsequent i-th point as the one with maximum distance from the landmark it is closest to (a sketch follows):
  lᵢ = p such that dist(p, L) = max{dist(x, L) : x ∈ X}, where dist(x, L) = min{dist(x, l) : l ∈ L}
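A minimal Python/NumPy sketch of this minMax selection, assuming a precomputed full distance matrix D (function and argument names are mine):

```python
import numpy as np

def min_max_landmarks(D, n_landmarks, seed=None):
    """minMax landmark selection from a precomputed distance matrix D.
    A sketch of the procedure above, not the author's code."""
    rng = np.random.default_rng(seed)
    picked = [int(rng.integers(D.shape[0]))]   # random first landmark l1
    dist_to_L = D[picked[0]].copy()            # dist(x, L) for every point x
    for _ in range(n_landmarks - 1):
        nxt = int(np.argmax(dist_to_L))        # point farthest from L
        picked.append(nxt)
        dist_to_L = np.minimum(dist_to_L, D[nxt])
    return picked
```

Updating dist_to_L with a running minimum keeps each iteration linear in the number of points rather than rescanning all landmark distances.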
Landmarking
• Landmarking is not an exact science, however: on certain types of data, the method just used may produce a subset that is very unrepresentative of the original data (for example, when the data contains outliers, which minMax picks first)
Algorithm
• As long as outliers are ignored, the method works well at picking points as spread out as possible among the data
• Next we keep only the distance matrix between the landmark points, and normalize it (see the sketch below)
• This is all the information we need from the data: the actual positions of the points are irrelevant; all we need are the distances between the landmarks, on which we will construct a neighbourhood graph
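A one-step sketch of that reduction (assuming D and the landmark indices from the minMax step above):

```python
import numpy as np

def landmark_distances(D, landmarks):
    """Keep only the landmark-to-landmark distances and scale them
    into [0, 1]. A sketch; assumes D has at least one nonzero entry."""
    idx = np.asarray(landmarks)
    DL = D[np.ix_(idx, idx)].astype(float)
    return DL / DL.max()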
Neighbourhood Graph
• Our goal is to create a nested sequence of graphs: we add a single edge at a time, between the points x, y ∈ L for which dist(x, y) is the smallest remaining value in D, and then replace that distance in D with 1 so it is never selected again
• At each iteration of adding an edge, we keep track of r = dist(x, y), r ∈ [0, 1]: this is our proximity parameter, and it will be important when we graph the Betti numbers later (a sketch follows)
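One way to realize this in Python/NumPy, assuming DL is the normalized landmark distance matrix. Rather than repeatedly overwriting used entries of D with 1, this sketch sorts all pairs once, which yields the same edge order:

```python
import numpy as np

def edge_filtration(D):
    """Yield edges (x, y, r) in order of increasing distance, where
    D is the normalized landmark distance matrix and r in [0, 1] is
    the proximity parameter."""
    i, j = np.triu_indices(D.shape[0], k=1)   # each unordered pair once
    for k in np.argsort(D[i, j]):
        yield int(i[k]), int(j[k]), float(D[i[k], j[k]])
```

A caller would then build the nested graphs with `for x, y, r in edge_filtration(DL): ...`, recording r as each edge appears.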
Witness Complex
• Def: A point x is a weak witness to a p-simplex (a₀, a₁, ..., aₚ) in A if |x − a| < |x − b| ∀ a ∈ {a₀, a₁, ..., aₚ} and ∀ b ∈ A \ {a₀, a₁, ..., aₚ}
• Def: A point x is a strong witness to a p-simplex (a₀, a₁, ..., aₚ) in A if x is a weak witness and, additionally, |x − a₀| = |x − a₁| = … = |x − aₚ|
• The requirement may be added that an edge is only added between two points if there exists a weak witness to that edge (see the sketch below)
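For edges, the weak-witness test has a simple form: a data point witnesses a 1-simplex exactly when its two endpoints are the point's two nearest landmarks. A sketch under an assumed data layout (the matrix W and function name are mine):

```python
import numpy as np

def witnessed_edges(W):
    """Landmark pairs that have a weak witness among the data points.
    W[x, l] = distance from data point x to landmark l (an assumed
    layout). For a 1-simplex, the weak-witness condition says both
    endpoints are closer to x than every other landmark, i.e. they
    are x's two nearest landmarks."""
    nearest_two = np.argsort(W, axis=1)[:, :2]
    return {frozenset(pair) for pair in nearest_two}
```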
Simplicial Complexes
• Next we want to construct higher-dimensional structure on the neighbourhood graph: a simplicial complex
• A simplex is a point, edge, triangle, tetrahedron, etc. (a k-simplex is a (k+1)-clique in the graph)
• A face of a simplex is a sub-simplex of it
• A simplicial k-complex is a set S of simplices, each of dimension ≤ k, such that every face of a simplex in S is also in S, and the intersection of any two simplices in S is a face of both of them
Simplicial Complexes
• At each iteration we add one edge, so all we need to do is check whether it creates any new k-simplices
• The edge itself adds a single 1-simplex to the complex
• A k-simplex is formed if the intersection of the neighbourhoods of a (k−2)-simplex contains the two points of the added edge
• In other words, if every point of a (k−2)-simplex is joined to both endpoints of the edge, then together they form a k-simplex (see the sketch below)
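A sketch of this check in Python; the data structures (an adjacency dict and the current simplex list as frozensets) are my assumption, not the author's:

```python
def new_simplices(adj, simplices, u, v, max_dim):
    """Simplices created when edge (u, v) is added to the graph.
    adj: dict mapping each vertex to its neighbour set (with the new
    edge already inserted); simplices: current simplices as frozensets
    (including the 0-simplices, i.e. single vertices)."""
    common = (adj[u] & adj[v]) - {u, v}       # vertices joined to both u and v
    created = [frozenset((u, v))]             # the new 1-simplex itself
    for s in simplices:
        # a (k-2)-simplex s whose vertices all neighbour u and v
        # combines with the edge to form a new k-simplex
        if len(s) + 2 <= max_dim + 1 and s <= common:
            created.append(s | {u, v})
    return created
```

For example, each common neighbour w of u and v is a 0-simplex contained in `common`, so the triangle {u, v, w} appears among the created 2-simplices.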
Boundary Matrices
• Next we compute boundary matrices. Essentially, these store the information that (k−1)-simplices are faces of certain k-simplices
• For instance, in a simplicial complex with 100 triangles and 50 tetrahedra, the boundary matrix from tetrahedra to triangles has 100 rows and 50 columns, with zeros everywhere except where the given triangle is a face of the given tetrahedron, where the entry is 1
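A sketch of building one such 0/1 boundary matrix in Python/NumPy, with simplices represented as frozensets of vertex indices (an assumed encoding):

```python
import numpy as np

def boundary_matrix(faces, simplices):
    """0/1 boundary matrix over Z/2: rows index the (k-1)-simplices
    in `faces`, columns index the k-simplices in `simplices`; entry
    (i, j) is 1 exactly when face i lies in simplex j. Assumes every
    face of every simplex appears in `faces`."""
    row = {f: i for i, f in enumerate(faces)}
    B = np.zeros((len(faces), len(simplices)), dtype=np.uint8)
    for j, s in enumerate(simplices):
        for v in s:                 # dropping one vertex gives one face
            B[row[s - {v}], j] = 1
    return B
```

For the example above, with 100 triangles as `faces` and 50 tetrahedra as `simplices`, this returns a 100 × 50 matrix with four ones per column, one for each triangular face of a tetrahedron.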