SEL3053: Analyzing Geordie Lecture 12. Hierarchical cluster analysis - Introduction and distance table creation.
This and the following lecture describe a particular kind of cluster analysis, hierarchical analysis, and its application to the data matrix M abstracted from the TLS / DECTE phonetic transcriptions.
Lecture 11 developed the motivation for using cluster analysis in general: to discover regularities in data of interest with respect to some research question, to represent those regularities in an intuitively accessible way, and to use that representation as the basis for hypothesis generation. In the light of the foregoing discussion of data creation and transformation, this motivation can now be reformulated as the search for and representation of nonrandomness in the distribution of vectors in an n-dimensional data space for hypothesis generation.
Consider, for example, the plot of 100 three-dimensional randomly-generated vectors opposite. The vectors are not uniformly distributed in the data space, which is what one would expect for so small a number of random trials. Visual inspection suggests some weak regularities, but these are hard to pin down, and we know from the way the data was generated that any such structure is an accidental byproduct of randomness.
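A data set of this kind can be reproduced with a few lines of Python. The sketch below is only an illustration, not the procedure used for the plot: the seed, the unit-cube range and the division of the space into eight cells are all assumptions made for the example.

```python
import random
from collections import Counter

random.seed(12)  # arbitrary seed, chosen only to make the run repeatable

# 100 vectors, each with three components drawn uniformly from [0, 1)
vectors = [tuple(random.random() for _ in range(3)) for _ in range(100)]

# Divide the unit cube into 8 equal cells and count the vectors in each.
# A perfectly uniform spread would put 12.5 vectors in every cell.
cells = Counter(tuple(int(x >= 0.5) for x in v) for v in vectors)

for cell, count in sorted(cells.items()):
    print(cell, count)
```

The counts will typically differ from cell to cell, which is the accidental unevenness referred to above.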
Contrast the following plot of a known, nonrandom 3-dimensional data set. Visual inspection makes it immediately apparent that the distribution of points is nonrandom: there are three clearly defined groups of vectors such that intra-group distance is small relative to the dimensions of the data space, and inter-group distance relatively large. Cluster analysis is a collection of methods whose aim is to detect such groups in data and to display them graphically in an intuitively accessible way.
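The contrast between intra-group and inter-group distance can be made concrete with a small calculation. The three groups below are invented for illustration and are not the data shown in the plot; the sketch assumes Python 3.8 or later for `math.dist`.

```python
import math
from itertools import combinations, product
from statistics import mean

# Hypothetical 3-dimensional data: three tight, well-separated groups
groups = {
    "A": [(1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.8, 1.1, 1.0)],
    "B": [(5.0, 5.0, 1.0), (5.1, 4.8, 1.2), (4.9, 5.2, 0.9)],
    "C": [(1.0, 5.0, 5.0), (1.1, 5.2, 4.9), (0.9, 4.8, 5.1)],
}

# Mean distance between pairs of vectors inside the same group
intra = mean(
    math.dist(p, q)
    for members in groups.values()
    for p, q in combinations(members, 2)
)

# Mean distance between pairs of vectors drawn from two different groups
inter = mean(
    math.dist(p, q)
    for g, h in combinations(groups.values(), 2)
    for p, q in product(g, h)
)

print("mean intra-group distance:", round(intra, 2))
print("mean inter-group distance:", round(inter, 2))
```

For data like this the intra-group figure is a small fraction of the inter-group one, which is the numerical counterpart of what the eye sees in the plot.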
If that's all there is to cluster analysis, what need is there for the method about to be presented? Why not simply plot the points in the data space? The answer is that this works well for data dimensionalities up to 3, since they can be visually represented, and this facility has in fact long been extensively exploited in low-dimensional statistics and data analysis. It is even possible to represent a four-dimensional data set, if the fourth dimension happens to be time, by animating the movement of points in a three-dimensional plot.
For higher dimensionality, however, the straightforward diagrammatic approach breaks down, as we have already seen: how does one represent a 5-dimensional space graphically, not to speak of a 100-dimensional or 1000-dimensional one? The various methods about to be presented address this problem. Specifically, they address the problem of finding clusters in arbitrarily high-dimensional spaces and of representing these clusters in a low-dimensional, that is, two- or three-dimensional, space where they can be plotted and intuitively interpreted.
There is an extensive range of cluster analysis methods, but the one presented here, hierarchical cluster analysis, is both the most widely used and the easiest to understand. It represents the relative distances among vectors in the data manifold as a constituency tree. This lecture first looks at a simple example, and then goes into the detail of how such a tree is constructed.
The figure shows a 30 x 2 data matrix, and the aim is to discover whether there is any interesting cluster structure among its 30 row vectors. Because these vectors are two-dimensional they can be directly plotted. This is shown in the upper part of the figure, and there is a clear 3-cluster structure, with clusters labelled A, B and C. The corresponding hierarchical cluster tree is shown in the lower part.
Linguists use such trees as representations of sentence phrase structure, but cluster trees differ from linguistic ones in the following respects. Their leaves are not lexical tokens but labels for the data items: the numbers at the leaves correspond to the numerical labels of the row vectors in the data matrix. They represent not grammatical constituency but relativities of distance between clusters. The lengths of the branches linking the clusters represent degrees of closeness: the shorter the branch, the more similar the clusters.
Knowing this, the tree can be interpreted as follows. There are three clusters labelled A, B, and C in each of which the distances among vectors are quite small. These three clusters are relatively far from one another, though A and B are closer to one another than either of them is to C.
Comparison with the plot shows that the hierarchical analysis accurately represents the distance relations among the 30 vectors in 2-dimensional space shown in the plot. Thus:
The longest horizontal lines at the right of the tree separate the vectors into three main groups. Comparison of these groups with the clusters in the plot shows them to be identical, and these are therefore labelled A-C. Moreover, in the plot, A and B are slightly closer to one another than to C, and C is equidistant from A and B. This is reflected in the somewhat greater length of the line linking C to A and B in the diagram relative to those linking A and B.
Within the clusters there is a systematic correspondence between the lengths of the lines linking clusters and subclusters on the one hand, and the distances between and among vectors in the plot on the other. In cluster A, for example, vectors 4 and 19 are very close together in the data space, as are 2 and 3, and both of these pairs are close to vector 1. This is reflected in the cluster tree by the relative lengths of the horizontal lines and by the constituency structure of the distance relations among these five vectors.
From the foregoing comments, it's clear that the plot and the tree are just alternative representations of the cluster structure of the data, and provide the same information. Given that the tree tells us nothing more than what the plot tells us, what is gained? In the present case, nothing. The real power of hierarchical analysis lies in its independence of vector space dimensionality.
We have seen that direct plotting is limited to three or fewer dimensions. But there is no dimensionality limit on hierarchical analysis: it can determine relative distances in vector spaces of any dimensionality and represent those distance relativities as a tree like the one above.
How hierarchical cluster analysis works
Given a data matrix, constructing a hierarchical cluster tree for it is a two-step process. The first step is to construct a table of distances between all the row vectors of the matrix, and the second is to use that table as the basis for tree construction. We will look at these two steps separately.
1. Construction of a distance table
So far, the discussion has relied on visual intuitions about distance in vector space. That intuition now has to be made more specific with a few additional concepts related to vector space. The length of a vector is the distance between itself and some reference point in the space's coordinate system; for present purposes that reference point is taken to be the origin of the coordinate axes. Where, moreover, there is more than one vector in a space, their lengths can be compared: in the figure opposite, the length of vector A is greater than the length of B.
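As a minimal sketch of this definition, the length of a vector measured from the origin can be computed as the square root of the sum of its squared components. The coordinates of A and B below are hypothetical; the figure's actual values are not given in the text.

```python
import math

def vector_length(v):
    """Length of vector v, measured from the origin of the coordinate axes."""
    return math.sqrt(sum(x * x for x in v))

# Hypothetical coordinates, chosen only for illustration
A = (3.0, 4.0)
B = (1.0, 2.0)

print(vector_length(A))                      # 5.0
print(vector_length(A) > vector_length(B))   # True
```

The same function works unchanged for vectors with any number of components.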
There is an angle between any two vectors in a space. The one between A and B is shown as θ in the figure opposite.
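One standard way to obtain this angle, offered here as an illustration rather than as part of the lecture's own procedure, is from the dot product: cos θ = (A · B) / (|A| |B|). The vectors used are hypothetical.

```python
import math

def angle_between(a, b):
    """Angle in degrees between two non-zero vectors, via the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    len_a = math.sqrt(sum(x * x for x in a))
    len_b = math.sqrt(sum(y * y for y in b))
    cos_theta = dot / (len_a * len_b)
    # Guard against tiny floating-point overshoot outside [-1, 1]
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

# Hypothetical vectors, chosen only for illustration
A = (1.0, 0.0)
B = (0.0, 1.0)
C = (1.0, 1.0)

print(angle_between(A, B))  # approximately 90 degrees
print(angle_between(A, C))  # approximately 45 degrees
```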
The distance between two vectors can be measured and relative distances between pairs of vectors compared. Distance(AB) in the figure opposite is greater than distance(AC); this is the basis for several types of clustering method.
The distance between any two vectors in a space is jointly determined by the size of the angle between the lines joining them to the origin of the space's coordinate system, and by the lengths of those lines. Assume two vectors A and B having identical lengths and separated by an angle θ.
If the angle is kept constant and the lengths of the vectors are made unequal by lengthening or shortening one of them, then the distance between them changes (a) and (b); lengthening one vector always increases it. If the lengths are kept equal but the angle is increased, the distance between them increases (c), and if the angle is decreased, the distance decreases (d).
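The joint dependence of distance on angle and length can be checked numerically. The sketch below is an illustration only: it uses the law of cosines, d² = |A|² + |B|² − 2|A||B|cos θ, and the lengths and angles are made-up values, not those of the figure.

```python
import math

def distance_from_lengths_and_angle(len_a, len_b, theta_deg):
    """Distance between two vectors given their lengths and the angle
    between them, using the law of cosines."""
    theta = math.radians(theta_deg)
    return math.sqrt(len_a**2 + len_b**2 - 2 * len_a * len_b * math.cos(theta))

# Starting point: equal lengths, separated by 30 degrees (hypothetical values)
base = distance_from_lengths_and_angle(1.0, 1.0, 30)

longer = distance_from_lengths_and_angle(1.0, 2.0, 30)    # one vector lengthened
shorter = distance_from_lengths_and_angle(1.0, 0.5, 30)   # one vector halved
wider = distance_from_lengths_and_angle(1.0, 1.0, 60)     # angle increased
narrower = distance_from_lengths_and_angle(1.0, 1.0, 10)  # angle decreased

print("lengthened > base:", longer > base)
print("halved     > base:", shorter > base)
print("wider      > base:", wider > base)
print("narrower   < base:", narrower < base)
```

With these particular values all four comparisons hold. Note, though, that a slight shortening (for instance to 0.9 at 30 degrees) brings the vectors a little closer together before further shortening moves them apart, so the effect of shortening depends on the angle and on how far the vector is shortened.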
How is distance between any pair of vectors in a space measured? The cluster analysis literature provides a variety of ways, but for present purposes the one that is most often used and easiest to understand will be sufficient. This measure is the Euclidean distance measure. It rests on Pythagoras' theorem, which most people have encountered at some point in their school careers: in a right-angled triangle, the square of the hypotenuse is equal to the sum of the squares of the other two sides.
Thus, given a pair of two-dimensional vectors v1 = (2,1) and v2 = (5,6)… The length of the hypotenuse is the Euclidean distance between v1 and v2.
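For these two vectors the calculation can be written out step by step. The sketch below simply applies Pythagoras' theorem to the horizontal and vertical differences between v1 and v2; the function name is an illustrative choice.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

v1 = (2, 1)
v2 = (5, 6)

# The two shorter sides of the right-angled triangle
dx = v2[0] - v1[0]   # 5 - 2 = 3
dy = v2[1] - v1[1]   # 6 - 1 = 5

# Hypotenuse squared = 3^2 + 5^2 = 9 + 25 = 34
print(dx ** 2 + dy ** 2)                      # 34
print(round(euclidean_distance(v1, v2), 3))   # 5.831
```

Because the function sums over all components, it applies equally to vectors of any dimensionality, not just two.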
To construct the distance table for a data matrix, the Euclidean distance between every possible pair of row vectors is calculated and stored in a matrix.
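A minimal sketch of this step is given below, assuming the data matrix is held as a list of row vectors and that Python 3.8 or later is available for `math.dist`. The three-row matrix is a toy example: its first two rows are the vectors v1 and v2 used earlier, and the third is invented.

```python
import math

def distance_table(matrix):
    """Return an n x n table whose entry [i][j] is the Euclidean distance
    between row vectors i and j of the data matrix."""
    n = len(matrix)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(matrix[i], matrix[j])
            table[i][j] = d   # the table is symmetric, so each distance
            table[j][i] = d   # is computed once and stored twice
    return table

# Toy data matrix: rows 0 and 1 are v1 and v2; row 2 is hypothetical
data = [(2, 1), (5, 6), (2, 2)]

for row in distance_table(data):
    print([round(d, 3) for d in row])
```

The diagonal is zero because every vector is at distance zero from itself, and only half of the off-diagonal entries need to be calculated since distance(i, j) equals distance(j, i).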
Comparing the matrix values to the earlier data plot quickly shows that these values accurately capture visual intuitions about distances in two-dimensional space.