SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Lecture 9 introduced some geometrical concepts and their relationship to vectors and matrices. The present lecture applies these concepts to understanding the problem of data sparsity, and then proposes some solutions.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
1. Data sparsity: the nature of the problem
Assume a research domain in which the objects of interest are described by three variables, and a vector space representation of the data abstracted from the domain. If only two objects are selected there are only two three-dimensional vectors in the space, and the only reasonable manifold to propose is a straight line, as in the figure below. These vectors might belong to a more complex manifold constituted by data describing more objects in the domain, but with only two data points there is no justification for positing such a manifold.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
1. Data sparsity: the nature of the problem
Where there are three vectors, the manifold can reasonably be interpreted as a curved line, but nothing more complex is warranted, for the same reason as the one just given.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
1. Data sparsity: the nature of the problem
It is only when a sufficiently large number of objects is represented in the space that the shape of the manifold representing the domain emerges. In the present case this happens to be a torus incorporating the vectors in the foregoing slides. The moral for present purposes is: to discern the shape of a manifold that satisfactorily describes the domain of interest, there must be enough data vectors to give it adequate definition.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
Getting enough data is usually difficult or even intractable as its dimensionality grows, however. The problem is that the space in which the manifold is embedded grows very quickly with dimensionality and, to retain a reasonable manifold definition, more and more data is required until, equally quickly, getting enough becomes impossible.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
What does it mean to say that 'the space in which the manifold is embedded grows very quickly with dimensionality'? Assume a two-dimensional space with horizontal and vertical axes in the range 0..9, and data vectors which can take integer, that is, whole-number, values only, such as [1, 9], [7, 4], [3, 5] and so on. Since there are 10 x 10 = 100 such whole-number locations, there can be a maximum of 100 vectors in this space.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
For a three-dimensional space with all three axes in the same range 0..9, the number of possible vectors like [0, 9, 2] and [3, 4, 7] in the space is 10 x 10 x 10 = 1000, as in figure 3.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
For a four-dimensional space the maximum number of vectors is 10 x 10 x 10 x 10 = 10000, and so on. In general, assuming integer data, the number of possible vectors is r^d, where r is the measurement range (here 0..9 = 10) and d the dimensionality. The r^d function generates an extremely rapid increase in data space size with dimensionality: even a modest d = 8 for a 0..9 range allows for 10^8 = 100,000,000 vectors.
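To make the growth concrete, here is a minimal Python sketch that evaluates r^d for the 0..9 range used above; the dimensionalities shown are the ones mentioned in the text.

```python
# Growth of the data space with dimensionality: r^d possible integer-valued vectors,
# where r is the number of distinct values per variable (0..9, so r = 10).
r = 10

for d in (2, 3, 4, 8):
    print(f"d = {d}: {r ** d:,} possible vectors")

# d = 2: 100 possible vectors
# d = 3: 1,000 possible vectors
# d = 4: 10,000 possible vectors
# d = 8: 100,000,000 possible vectors
```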
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
Why is this rapid growth of data space size with dimensionality a problem? Because the larger the dimensionality, the more difficult it becomes to define the manifold sufficiently well to achieve reliable analytical results. Assume that we want to analyse, say, 24 speakers in terms of their usage frequency of 2 phonetic segments; assume also that these segments are rare, so a range of 0..9 is sufficient. The ratio of actual to possible vectors in the space is 24/100 = 0.24, that is, the vectors occupy 24% of the data space.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
If one analyses the 24 speakers in terms of 3 phonetic segments, the ratio of actual to possible vectors is 24/1000 = 0.024, or 2.4% of the data space. In the 8-dimensional case it is 24/100,000,000 = 0.00000024, or 0.000024% of the data space. A fixed number of vectors occupies proportionately less and less of the data space with increasing dimensionality. In other words, the data space becomes so sparsely inhabited by vectors that the shape of the manifold is increasingly poorly defined.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
What about using more data? Let's say that 24% occupancy of the data space is judged to be adequate for manifold resolution. To achieve that for the 3-dimensional case one would need 240 vectors, 2400 for the 4-dimensional case, and 24,000,000 for the 8-dimensional one. This may or may not be possible. And what are the prospects for dimensionalities higher than 8?
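The occupancy percentages and the amount of data needed to hold occupancy at 24% can be reproduced with a short Python sketch; the 24 speakers, the 0..9 range, and the 24% target are the figures used in the example above.

```python
# Occupancy of the data space by a fixed number of vectors, and the number of
# vectors needed to keep occupancy at the 24% target, for increasing dimensionality.
r = 10                  # 0..9 range, so 10 distinct values per variable
n_vectors = 24          # number of speakers in the example
target_occupancy = 0.24 # occupancy judged adequate for manifold resolution

for d in (2, 3, 4, 8):
    space_size = r ** d
    occupancy = n_vectors / space_size
    required = target_occupancy * space_size
    print(f"d = {d}: occupancy = {occupancy:.6%}, "
          f"vectors needed for 24% occupancy = {required:,.0f}")

# d = 2: occupancy = 24.000000%, vectors needed for 24% occupancy = 24
# d = 3: occupancy = 2.400000%,  vectors needed for 24% occupancy = 240
# d = 4: occupancy = 0.240000%,  vectors needed for 24% occupancy = 2,400
# d = 8: occupancy = 0.000024%,  vectors needed for 24% occupancy = 24,000,000
```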
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
Data sparsity: the nature of the problem
Because provision of additional data to improve the definition of a sparse manifold is not always possible, the alternatives are either to use the data as is and to live with the consequent analytical unreliability, or to attempt to reduce the sparsity. This module addresses the latter alternative, and presents some ways of achieving it.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
There are numerous ways of reducing data dimensionality, and we cannot go into, or indeed refer to, all of them here. The remainder of the lecture therefore concentrates on an intuitively accessible criterion: the amount of variation in the values that a variable takes.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
Identical objects cannot be distinguished from one another, and cannot therefore be usefully cluster analyzed, because clustering depends on being able to group objects by their similarities and differences from one another. How would one group 1000 cars all of which have the same manufacturer, are the same colour, have the same engine, and so on? When, therefore, the objects of interest in a cluster analysis are described by variables, a variable is useful only if there is significant variation in the values that it takes.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
If, for example, a large random collection of people was described by variables like 'height', 'weight', and 'income', there would be substantial variation in values for each of them, and they could legitimately be used to cluster the people in the sample. On the other hand, a variable like 'number of limbs' would be effectively useless, since, with very few exceptions, everyone has a full complement of limbs; there would be almost no variation in the value 4 for this variable. In any clustering exercise, therefore, one is looking for variables with substantial variation in their values, and can disregard variables with little or no variation. The degree of variation in the values of a variable is described by its variance, the nature and calculation of which is described in what follows.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
We begin with the mean or 'average' of variable values. Given a variable x whose values are represented as a vector of n numbers distributed across some range, the mean of those values is the value at the centre of the distribution. In figure 1 the values have been sorted by magnitude for ease of interpretation, and direct inspection suggests that the value at the centre of this distribution is around 10 or 12. A more precise indication is given by the formula

µ = Σi=1..n xi / n

where µ is the conventional symbol for 'mean', Σ denotes summation, and n is the number of values in x: the mean of a set of n values is their sum divided by n. In the case of figure 1 this is (2 + 4 + ... + 20) / 10 = 110 / 10 = 11.
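As a check on the arithmetic, here is a minimal Python sketch of the mean calculation. Only the smallest and largest values (2 and 20), the sum (110), and the mean (11) are given in the text; the intermediate values below are hypothetical, chosen merely to be consistent with those figures.

```python
# Hypothetical sorted values consistent with the slide's stated figures:
# smallest 2, largest 20, sum 110, mean 11.
x = [2, 4, 7, 9, 10, 11, 13, 15, 19, 20]

n = len(x)
mu = sum(x) / n   # the mean: the sum of the values divided by n
print(mu)         # 11.0
```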
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
The mean hides important information about the distribution of values in a vector. Consider, for example, two (fictitious) runs of student marks A and B on a percentage scale. The means are identical, but the variations across the mark scale differ strikingly: student A is capable of both failure and excellence, and student B is remarkably consistent. Knowing only the averages one could not make the distinction. Both the average and an indication of the spread of marks across the range are required in order to do proper justice to these students.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
Assessing the spread can be problematic in practice, however. Where the number of marks is few, as in the above example, visual inspection is sufficient, but what about longer runs? Visual inspection quickly fails; some quantitative measure that summarizes the spread of marks is required. That measure is variance.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
Given a variable x whose values are represented as a vector of n values [x1, x2 ... xn], variance is calculated as follows:
i. The mean of these values, µ, is (x1 + x2 + ... + xn) / n.
ii. The amount by which any given value xi differs from µ is then xi - µ.
iii. The average difference from µ across all values is therefore Σi=1..n (xi - µ) / n.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
iv. This average difference of variable values from their mean almost but not quite corresponds to the definition of variance. One more step is necessary, and it is technical rather than conceptual. Because µ is an average, some of the variable values will be greater than µ and some will be less. Consequently, some of the differences (xi - µ) will be positive and some negative. When all the (xi - µ) are added up, as above, they will cancel each other out. To prevent this, the (xi - µ) are squared.
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
v. The definition of variance for n values x = [x1, x2 ... xn], therefore, is:

variance = Σi=1..n (xi - µ)² / n

Thus, the variance for the A run of marks above is ((40 - 58)² + (30 - 58)² + ... + (30 - 58)²) / 10 = 594.44. Doing the same calculation for student B, the variance works out as 8.00. Comparing the two variances, it is clear that the variability in A's run of marks is much greater than B's.
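The calculation can be expressed as a short Python function. The two runs of marks below are hypothetical, since the original table of marks for students A and B is not reproduced here; they are chosen only to illustrate a wide versus a narrow spread around the same mean, so the printed variances will not match the 594.44 and 8.00 quoted above.

```python
def variance(x):
    n = len(x)
    mu = sum(x) / n                              # step i: the mean
    return sum((xi - mu) ** 2 for xi in x) / n   # steps ii-v: mean of squared differences

# Hypothetical mark runs, both with mean 58: A varies widely, B hardly at all.
marks_A = [40, 30, 85, 90, 35, 55, 60, 70, 45, 70]
marks_B = [58, 56, 60, 59, 57, 58, 61, 55, 58, 58]

print(variance(marks_A), variance(marks_B))      # 386.0 2.8 for these hypothetical runs
```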
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
Given a data matrix in which the rows are the objects of interest and the columns are variables describing them, and also that the aim is to cluster the objects on the basis of the differences among them, the application of variance to dimensionality reduction is straightforward: sort the column vectors in descending order of variance and use a plot of the variances to decide on a suitable threshold below which all columns are eliminated. This was done for our DECTE data matrix M, and the result is shown opposite.
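A minimal Python/NumPy sketch of the procedure is given below. The real DECTE matrix M is not available here, so a random stand-in is used; its 156 columns match the variable count mentioned on the next slide, the 24 rows simply reuse the speaker count from the earlier example, and the threshold of 50 retained columns follows the discussion that comes next.

```python
# Variance-based dimensionality reduction: keep only the highest-variance columns.
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(2.0, size=(24, 156)).astype(float)  # stand-in for the DECTE matrix M

col_var = M.var(axis=0)             # variance of each column (variable)
order = np.argsort(col_var)[::-1]   # column indices, highest variance first

k = 50                              # threshold chosen by inspecting the variance plot
M_reduced = M[:, order[:k]]         # retain only the 50 highest-variance variables

print(M.shape, M_reduced.shape)     # (24, 156) (24, 50)
```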
SEL3053: Analyzing Geordie Lecture 10. Dimensionality reduction 2
2. Dimensionality reduction methods
There are a few high-variance variables, a large number of low-variance variables, and a moderate number of intermediate-variance ones in between. Intuitively, the variances of the variables to the right of the 50th seem so low relative to the high- and intermediate-variance ones that they can be eliminated, thereby achieving a very substantial dimensionality reduction from 156 variables to 50. Why 50 and not, say, 45 or 70? There is no definitive answer; the researcher must use his or her judgment.