SEL3053: Analyzing Geordie Lecture 9. Dimensionality reduction 1
Lecture 8 noted that, once a data matrix has been constructed, it may have characteristics that can adversely affect the validity of any analysis based on it. That lecture looked at one of these characteristics, variation in document length, and proposed a solution. The present lecture and the one following look at another characteristic: data sparsity. Some relevant geometrical concepts are first introduced, then these are used to explain the nature of the problem, and finally a solution is proposed.
1. Geometrical concepts
Data sparsity is a major issue in data analysis generally and in natural language text analysis more particularly. An intuition for what is meant by 'data sparsity' and why it is problematic can be gained via the concept of the manifold in Euclidean space.
1.1 The nature of geometry
Geometry is based on human intuitions about the world around us: that we exist in a space, that there are directions in that space, that distances along those directions can be measured, that relative distances between and among objects in the space can be compared, and that objects in the space themselves have size and shape which can be measured and described. The earliest geometries were attempts to define these intuitive notions of space, direction, distance, size, and shape in terms of abstract principles which could, on the one hand, be applied to scientific understanding of physical reality, and on the other to practical problems like construction and navigation.
Basing their ideas on the first attempts by the early Mesopotamians and Egyptians, Greek philosophers from the seventh century BC onwards developed such abstract principles systematically, and their work culminated in the geometrical system attributed to Euclid (floruit c. 300 BC). This Euclidean geometry was the unquestioned framework for understanding physical reality until the eighteenth century AD.
By the nineteenth century AD, however, Euclidean geometry was seriously questioned for the first time, both intrinsically and as a description of physical reality. It was realized that the Euclidean was not the only possible geometry, and alternative ones were proposed in which, for example, there are no parallel lines and the angles inside a triangle always sum to more than 180 degrees. Einstein eventually used such a non-Euclidean geometry as a more accurate description of curved spacetime than was possible with Euclidean geometry.
Since the nineteenth century these alternative geometries have, moreover, continued to be developed without reference to their utility as descriptions of physical reality, and as part of this development 'space' has come to have an entirely abstract meaning which has nothing obvious to do with the one rooted in our intuitions about physical reality. A space is a set on which one or more mathematical structures are defined, and is thus a mathematical object rather than a humanly-perceived physical phenomenon. The present discussion uses 'space' in the abstract sense; the physical meaning is often useful as a metaphor for conceptualizing the abstract one, though as we shall see it can easily lead one astray, and failure to keep the two meanings separate leads to confusion.
1.2 Euclidean space
1.2.1 Sets
A set is any collection of objects: trains, teeth, policemen, numbers. Indeed, so general is the concept that the members of the set do not even have to have any obvious connection with one another: ships and snails and sealing wax, cabbages and kings. Intuitively, the notion of a set is simple and straightforward; it is also the foundation on which mathematics is built. A well developed set theory exists. For present purposes, however, only the following observations need to be made:
a) Sets are given labels for ease of reference in discussion. By convention, these labels are uppercase letters: A, B, C...
b) There are two ways to define a set, that is, to specify which objects are included in a set:
i. The objects belonging to a set may be listed. To make it clear that such a list is meant to constitute a set, it is enclosed in curly brackets. For example, the days of the week may be specified as:
{Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday}
ii. Some criterion or criteria may be specified which can be used to determine which objects belong to the set. This is the only practical course when the number of objects in a set is very large. It would be effectively impossible to list all the green objects in the world, for example, but one could easily state a rule for set membership: that any green object belongs to the set. Then, for any object, it would be possible to decide whether or not it belonged to the set. The usual way of writing such rules in mathematics is on the following pattern:
G = {x | x is green}
Note that this set is given an uppercase letter label, and that the definition is enclosed in curly brackets as before. The bar | is new: it is read as 'such that', and the above definition as a whole reads: 'G is the set of all x such that x is green'. The character x here is a variable, a placeholder, which can be used to stand for any object we might want to test for set membership. Thus, if we let x = 'tree', then x is a member of the set, but not if x = 'sky'.
This style of set definition becomes indispensable when an infinite number of objects belongs to a set. The set of all green objects in the world is very large, but given enough time, manpower, and lunatic motivation a list could be made; the number of green objects is finite. But there are infinite sets, that is, sets whose members could never be listed, even given world enough and time. A familiar example is the set of positive integers {1, 2, 3...}. No matter how large the last number in this sequence, it is always possible to add 1, thus creating a yet larger number, and the sequence consequently never ends. In such a situation it is not only impractical to make a list, but impossible, and as such the 'set definition by rule' approach is the only one available.
c) To indicate that a given object belongs to a set, the symbol ∈, which is read as 'belongs to', is used: for a set A = {x | x is an animal}, cat ∈ A.
d) It is often necessary to refer not to a whole set, but to some part of it. A part of a set is called a subset. If B is a subset of A, then every object in B is necessarily in A, but not vice versa. The situation can be visualised as a region B lying wholly inside a larger region A. The notion of a subset is precise. There is no such thing as 'more or less a subset': if there is even one object in B that is not also in A, then B is not a subset of A.
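The observations on sets can be tried out with Python's built-in set type. The following is only a minimal sketch: the names `days`, `colour_of`, `G`, `A`, and `B` and the small table of colours are illustrative assumptions, not part of the lecture. Definition by rule is applied to a small finite collection of candidate objects, since a program cannot enumerate every object in the world.

```python
# Definition by listing: the members are written out inside curly brackets.
days = {"Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"}

# Definition by rule, G = {x | x is green}, tested over a small
# hypothetical collection of objects and their colours.
colour_of = {"tree": "green", "sky": "blue", "frog": "green", "snow": "white"}
G = {x for x in colour_of if colour_of[x] == "green"}

# Membership: 'tree' belongs to G, 'sky' does not.
tree_in_G = "tree" in G
sky_in_G = "sky" in G

# Subset: every member of B must also be a member of A.
A = {"cat", "dog", "horse"}
B = {"cat", "dog"}
B_subset_of_A = B <= A   # everything in B is also in A
A_subset_of_B = A <= B   # fails: 'horse' is in A but not in B
```

The set comprehension mirrors the written pattern closely: the part before `for` plays the role of x, and the `if` clause plays the role of the rule after the bar.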
1.2.2 Cartesian product
Given two sets A and B, the Cartesian product of A and B is the set of all possible unique ordered pairings of members of A with members of B, that is, A × B = {(a,b) | a ∈ A and b ∈ B}. If, for example, A = {v,w} and B = {x,y,z}, then
A × B = {(v,x), (v,y), (v,z), (w,x), (w,y), (w,z)}
Note that:
i. The result of set multiplication is itself a set.
ii. Multiplication has yielded a set of pairs, such that each element from the first set is paired with each element from the second.
iii. The pairs are ordered pairs. That is, the order of the components of each pair matters: (v,x) is not the same as (x,v). The order within each pair is determined by the order in which the sets are multiplied. Thus {v,w} × {x,y,z} yields the above set of pairs, while {x,y,z} × {v,w} yields {(x,v), (x,w), (y,v), ...}.
Multiplication of sets is not limited to two. Any number of sets may be multiplied. The Cartesian product of three sets A × B × C is the set of all possible ordered triples of members of A, B, and C, that is, A × B × C = {(a,b,c) | a ∈ A and b ∈ B and c ∈ C}. For example, {a,b} × {i,j} × {x,y} yields:
{(a,i,x), (a,i,y), (a,j,x), (a,j,y), (b,i,x), (b,i,y), (b,j,x), (b,j,y)}
The Cartesian product A × B × C × D is the set of all possible quadruples, the product of five sets is the set of all possible 5-tuples, and so on for any number n of sets.
The sets multiplied by Cartesian product need not be different; the same set can be multiplied by itself any number of times. N-fold multiplication of a set A by itself generates the set of all possible unique n-tuples of the members of A. For A = {a,b}, for example, A × A generates the set of pairs {(a,a), (a,b), (b,a), (b,b)}, A × A × A generates the set of triples {(a,a,a), (a,a,b), (a,b,a), (a,b,b), (b,a,a), (b,a,b), (b,b,a), (b,b,b)}, and so on. In general, given n sets A1...An, the set of all n-tuples whose i-th component is a member of Ai is called the Cartesian product A1 × ... × An.
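The products described in this section can be generated with `itertools.product` from the Python standard library. This is a sketch for illustration only: the variable names are assumptions, and lists are used instead of sets purely so that the tuples come out in a stable, readable order.

```python
from itertools import product

A = ["v", "w"]
B = ["x", "y", "z"]

# A x B: each member of A paired with each member of B, in that order.
AxB = list(product(A, B))

# B x A holds the same pairings with the components reversed,
# so it is a different set of ordered pairs.
BxA = list(product(B, A))

# Product of three sets: ordered triples.
triples = list(product(["a", "b"], ["i", "j"], ["x", "y"]))

# n-fold product of a set with itself, here n = 3 over {a, b}.
cube = list(product(["a", "b"], repeat=3))
```

The number of tuples is the product of the set sizes: 2 × 3 = 6 pairs for A × B, and 2 × 2 × 2 = 8 triples in each of the last two cases.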
1.2.3 Vector space
If the mathematical structures of vector addition and multiplication by a scalar are defined on the n-tuples of a Cartesian product X, then X together with these two structures is a vector space V.
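As a rough illustration, the two structures can be written as functions on tuples. This sketch assumes the standard reading, in which addition is componentwise and multiplication is by a single number (a scalar); the function names `add` and `scale` are illustrative and not from the lecture.

```python
def add(u, v):
    """Componentwise sum of two n-tuples of the same length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    return tuple(a + b for a, b in zip(u, v))


def scale(c, v):
    """Multiply every component of the n-tuple v by the number c."""
    return tuple(c * a for a in v)


u = (2.0, 1.5)
v = (1.0, 0.5)

w = add(u, v)      # (3.0, 2.0)
s = scale(2.0, u)  # (4.0, 3.0)
```

Both operations return another tuple of the same length, so the results stay inside the same Cartesian product.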
Cartesian coordinates provide a geometrical interpretation of vector space. Given a vector space V defined on R, the set of real numbers, each successive multiplication of R in the n-fold Cartesian product underlying V corresponds to an axis which represents the real numbers and which is orthogonal to any already-existing axes. The tuples of V are then coordinates relative to the axes, and the vectors comprising V are points at those coordinates.
For n = 2, that is, R × R, there are two orthogonal axes which represent a two-dimensional vector space, and each two-tuple or pair in the underlying Cartesian product contains the coordinates of a point in the space such that the first component of the pair indexes a point on the first axis and the second component a point on the second axis. The pair (2.3, 1.5), for example, is the point 2.3 along the first axis and 1.5 along the second.
For n = 3, that is, R × R × R, there are three orthogonal axes which represent a three-dimensional vector space, and each three-tuple or triple in the underlying Cartesian product contains the coordinates of a vector in the space such that the first component of the triple indexes a point on the first axis, the second component a point on the second axis, and the third component a point on the third axis. The triple (2.3, 1.5, 1.0) is an example.
In general, the dimension n of a vector space is the number of its basis vectors or, put another way, the minimum number of coordinates needed to locate a vector within it.
The obvious objection to all this is that it is impossible to think about spaces of dimension higher than 3 or to represent them graphically, and thus that there is something strangely wrong with n-dimensional spaces. This objection is based on an ambiguity with respect to senses of the word 'space': the Greeks assumed a direct correspondence between physical and geometric space and thus understood 'space' physically, but that assumption has been abandoned in contemporary geometry except as a special case, and 'space' is an abstract mathematical concept.
For n >= 4 the analogy between mathematical and physical space breaks down, in that four and higher dimensional spaces can neither be conceptualized nor represented in terms of physical space except in the special case where the fourth dimension is time, where the physical representation can be animated to show the evolution of the space over time. Mathematically and geometrically, higher-dimensional spaces are unproblematical in the sense that they are defined in the same way as the foregoing lower-dimensional ones, though their properties are often surprising, as subsequent discussion shows.
2. Euclidean space and data
Why all this talk about geometry? Because there is a fundamental relationship between geometrical space on the one hand, and the vectors and matrices which are standardly used to represent data, as described in a previous lecture, on the other. As we have seen, a vector is a sequence of n numbers, and the sequence is conventionally represented as comma-separated numerals between square brackets. The vector below shows n = 4 real-valued numbers, where the first number v1 is 1.6, the second v2 is 2.4, and so on.
v = [1.6, 2.4, 7.5, 0.6]
A vector has a Euclidean geometrical interpretation:
The dimensionality of the vector, that is, the number of its components n, defines an n-dimensional Euclidean space.
The sequence of n numbers comprising the vector specifies the coordinates of the vector in the space.
The vector itself is a point at the specified coordinates.
For example, the components of the 2-dimensional vector v = [36, 160] are its coordinates in a 2-dimensional vector space, counting 36 along the horizontal axis and 160 along the vertical.
The components of the 3-dimensional vector v = [36, 160, 30] are its coordinates in a 3-dimensional vector space, counting 36 along the horizontal axis, 160 along the vertical, and 30 along the third axis, which is drawn as a diagonal for perspective.
More than one vector can exist in a given Euclidean space. Where there is more than one vector in a data set, as is usual, they are standardly collected so as to constitute a matrix in which each row is a vector, as we have seen. For example, three 2-dimensional row vectors form a 3 x 2 matrix and appear as three points in 2-dimensional space.
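The correspondence between a data matrix and points in space can be sketched in a few lines of Python. The matrix values below are made up for illustration, and the helper `distance` is an assumed name; it computes the ordinary straight-line (Euclidean) distance between two points.

```python
import math

# A data matrix: each row is one vector, that is, one point in the space.
# The numbers are invented for illustration.
M = [
    [36, 160],
    [39, 164],
    [36, 172],
]

n_vectors = len(M)   # number of points: 3
n_dims = len(M[0])   # dimensionality of the space: 2

# Every row must have the same number of components to live in one space.
assert all(len(row) == n_dims for row in M)


def distance(u, v):
    """Straight-line distance between two points with the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d01 = distance(M[0], M[1])  # sqrt(3**2 + 4**2) = 5.0
```

Nothing in `distance` depends on there being two components, so the same function works unchanged for vectors of any dimensionality.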
This principle applies to any number of vectors and any dimensionality. Let's say we had a 1000 x 3 matrix. Plotting the 1000 3-dimensional vectors in 3-dimensional space will give some shape; in the example considered here it is a doughnut shape, or torus. The torus is, literally, the shape of the data. In other words, data has a shape. That shape is known as a manifold. For the purposes of this discussion a data manifold is a set of vectors in n-dimensional space.
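A 1000 x 3 matrix whose rows lie on a torus can be generated as follows. This is a sketch under stated assumptions, not the data used in the lecture: the function name `sample_torus`, the radii `R` and `r`, and the random seed are all illustrative choices, and the standard parametric equations of a torus are used to place each point.

```python
import math
import random


def sample_torus(n_points, R=3.0, r=1.0, seed=0):
    """Return n_points rows [x, y, z] lying on the surface of a torus.

    R is the distance from the centre of the hole to the centre of the
    tube, and r is the radius of the tube (assumed values, with R > r).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_points):
        theta = rng.uniform(0.0, 2.0 * math.pi)  # angle around the hole
        phi = rng.uniform(0.0, 2.0 * math.pi)    # angle around the tube
        x = (R + r * math.cos(phi)) * math.cos(theta)
        y = (R + r * math.cos(phi)) * math.sin(theta)
        z = r * math.sin(phi)
        rows.append([x, y, z])
    return rows


data = sample_torus(1000)  # a 1000 x 3 data matrix
```

Each row needs three coordinates to be written down, yet every point is fixed by just two angles. Drawing the angles uniformly does not spread the points perfectly evenly over the surface, which is acceptable for a picture of the shape but is worth knowing if the sample is used for anything more.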