280 likes | 304 Views
Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets. Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo 20 th International Conference on Computer and Their Applications March 16-18, 2005 Louisiana, New Orleans. Outline:.
E N D
Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo 20th International Conference on Computer and Their Applications March 16-18, 2005 Louisiana, New Orleans
20th International Conference on CATA 2005 Outline: • Introduction • Motivation • Contribution • P-tree Vertical Data Structure • Vertical Set Square Distance (VSSD) • Experimental Results • Conclusion and Future Works
20th International Conference on CATA 2005 Introduction • The determination of closeness (distance) of a point to another point or to a set of points (average distance) is required in data mining tasks. • For example: • Distance-based clustering. Distance is used to determine in which cluster a certain point should be placed, i.e. K-means. • Near nbr or case-based classification The assignment of class label to an unclassified point is based on the majority of class label of the nearest neighbor points. “Nearest” here implies the closest distance of points.
20th International Conference on CATA 2005 Introduction (cont) • One way to measure the distance from a certain point to a set of points is by examining the actual total separation of the set of points from the point in question • However, if the examination is done point by point when the set is very large, scanning the entire space causes the approach to be very costly, slow and non-scalable. • In this paper, we introduce a new technique called the Vertical Set Square Distance(VSSD) that can scalably, quickly and accurately compute the total length of separation (total variation) of a set of points about a point.
20th International Conference on CATA 2005 Motivation • In data mining, efforts have focused on finding techniques that can scalably deal with large cardinality of dataset due to the explosive growth of data stored in digital form. • Most existing approaches for measuring closeness of points are slow and computationally intensive (often relying upon sampling to become useable). • The availability of P-tree technology (vertical data structure) is an appropriate data structure to solve this curse of cardinality.
20th International Conference on CATA 2005 Our Contributions • We introduce a new vertical technique that scalably, quickly and accurately computes the total length of separation of a set of points about a fixed point. • We present empirical results based on real and synthetic datasets and show the one-to-one comparison in terms of scalability and speed (time) between our new method that uses the vertical approach (P-tree vertical data structure) and other method that uses the horizontal approach (record-based).
20th International Conference on CATA 2005 P-Tree Vertical Data Structure • P-tree vertical data representation consists of set structures essentially representing the data column-by-column rather than row-by-row (i.e. relational data). • The construction of P-trees from an existing relational table is typically by decomposing each attribute in the table into separate bit vectors (e.g., one for each bit position of a numeric attribute or one bitmap for each category in a categorical attribute). The construction for raw data coming from a sensor platform is done directly (without creating a horizontal relational table first. Note that many sensor platforms (e.g., RSI sensors), product raw vertical data.
010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 101 010 001 100 010 010 001 101 111 000 001 100 111 000 001 100 R[A1] R[A2] R[A3] R[A4] R(A1 A2 A3 A4) = 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 101 010 001 100 010 010 001 101 111 000 001 100 111 000 001 100 2 7 6 1 6 7 6 0 2 7 5 1 2 7 5 7 5 2 1 4 2 2 1 5 7 0 1 4 7 0 1 4 R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 01 0 1 0 0 1 01 P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43 0 0 0 0 1 01 0 0 0 0 1 0 0 10 01 0 0 0 1 0 0 0 0 0 0 0 1 01 10 0 0 0 0 1 10 0 1 0 ^ ^ ^ ^ ^ ^ ^ ^ ^ 0 0 1 0 1 0 0 1 0 1 01 0 1 0 20th International Conference on CATA 2005 The Construction of P-Tree The construction of the P- Tree: 1. Vertically project each attribute. 2. Vertically project each bit position. 3. Compress each piece of bit slice into a P-tree. Logical operations are AND (), OR () and complement ('). Root count operation is the count of 1-bits from the resulting P-trees logical operations
20th International Conference on CATA 2005 Vertical Set Square Distance Binary Representation Let x be a numerical domain of attribute Ai of relation R(A1, A2, …, Ad), then x written in b bits binary number is shown in the left-hand side term of the equation below. The first subscript of x is the index of the attribute to which x belong. The second subscript of x indicates the bit position. The summation on the right-hand side of the equation is equal to the actual value of x as a decimal number.
20th International Conference on CATA 2005 Vertical Set Square Dist. (Cont) Let X be a set of vectors in a relation R(A1, A2, …, Ad), let PX be the P-tree class mask of X, x is a vector : and a is a target vector (also in d-dimensional space), then the vectors x and a in binary can be written as: The total variation of a set of vectors in X about a vector a can be computed vertically using Vertical Set Square Distance(VSSD) defined as follows:
20th International Conference on CATA 2005 Vertical Set Square Dist. (Cont) Term T1, T2 and T3 can be computed separately and their summation is actually the total variation (the sum of the square length of separation of the vectors connecting X and a.
20th International Conference on CATA 2005 Vertical Set Square Dist. (Cont) Or we can write T1, expressing the diagonal terms (j=k) separately (noting also that xi,j2 = xi,j as they are bits)
20th International Conference on CATA 2005 Vertical Set Square Dist. (Cont) And T3 is simply:
20th International Conference on CATA 2005 Vertical Set Square Dist. (Cont) The root count operation is obviously independent of the vector a. These include the root count of single P-tree operand, PX, and the root count of two basic P-tree operands. and the root count of three P-tree operands which appear in T1, T2 and T3.
20th International Conference on CATA 2005 Vertical Set Square Dist. (Cont) • The independency allows us to pre-compute the root counts once in advance, during the construction of the P-tree, and can use them repeatedly as needed, regardless of the number of target vectors a. • This amortizes the cost of P-tree ANDing for high value datasets, e.g. cancer analysis, RSI, etc.
20th International Conference on CATA 2005 Horizontal Set Square Distance(HSSD) as a comparison method Let X, a set of vectors in R(A1,A2 … Ad) and x = (x1, x2, …, xd) is a vector belong to X and a = (a1, a2, …, ad) is a target vector Then the horizontal set square distance (HSSD) is defined as follows:
20th International Conference on CATA 2005 Experimental Results • The experiments were conducted using both real and synthetic datasets. • The goals are to compare the execution time (speed) and scalability of our algorithm employing a vertical approach (vertical data structure and horizontal bitwise AND operation) with a horizontal approach (horizontal data structure and vertical scan operation). • Performance of both algorithms was observed under different machine specifications, including an SGI Altix machine.
Machine Specification AMD1GB AMD Athlon K7 1.4GHz, 1GB RAM P42GB Intel P4 2.4GHz processor, 2GB RAM SGI Altix SGI Altix CC-NUMA 12 processor shared memory (12 x 4 GB RAM) 20th International Conference on CATA 2005 Experimental Results (Cont.) The specification of machines used in the experiments.
20th International Conference on CATA 2005 Datasets • A set of aerial photographs from the Best Management Plot (BMP) of Oakes Irrigation Test Area (OITA) near Oakes, ND. • The original image size (dataset) is 1024x1024 pixels (cardinality of 1,048,576), contains 3 bands (RGB) plus synchronized data for soil moisture, soil nitrate and crop yield (total 6 attributes).
20th International Conference on CATA 2005 Datasets (Cont.) • The datasets contain 4 different yield classes: • Low Yield ( 0 < intensity <= 63) • Medium Low Yield ( 63 < intensity <= 127) • Medium High Yield (127< intensity <= 191) • High Yield (191< intensity <= 255) • With super sampling five synthetic datasets are created: • 2,097,152 rows • 4,194,304 rows (2048x2048 pixels) • 8,388,608 rows • 16,777,216 rows (4096x4096 pixels) • 25,160,256 rows (5016x5016 pixels)
20th International Conference on CATA 2005 Timing and Scalability Results Observation 1. • The first performance evaluation was done on a P4 2 GB RAM machine using synthetic datasets having 4.1 and 8.3 million data tuples. • We use 100 arbitrary unclassified target vectors (resulting in 400 classification computations as to predicted yeild). • Datasets of sizes greater than 8.3 million rows cannot be executed in this machine due to out of memory problems when running Horizontal Set Square Distance (HSSD).
20th International Conference on CATA 2005 Timing and Scalability Results Observation 1 (Continue) This table shows the comparison of time required for loading the vertical data structure to memory and one time root count operations for VSSD, and loading horizontal records to memory needed for HSSD.
20th International Conference on CATA 2005 Timing and Scalability Results Observation 2 This table shows the algorithm’s timing and scalability performances when they are executed on different machines.
20th International Conference on CATA 2005 Timing and Scalability Results Observation 2 (Continue)
20th International Conference on CATA 2005 Conclusion • Vertical Set Square Distance (VSSD) is fast and accurate way to compute total variation for classification, clustering and rule mining, and scale well to very large datasets compare to the traditional horizontal approach (HSSD). • The complexity of the VSSD is O(d * b2) where d is the number of dimensions and b is the maximum bit-width of the attributes (depends on the width of the data set only). • The VSSD is very fast because of the independency of root counts of P-tree operands to target vector a which allows the pre-computation of counts once in advance, during the construction of the P-tree.
20th International Conference on CATA 2005 Future Work • Comprehensive study on the Vertical Set Square Distance (VSSD) in many data mining tasks, i.e. classification, clustering or outlier detection. • For classification using VSSD in the voting phase has already been shown to greatly accelerate the assignment of class since the calculation of class votes can be done entirely in one computation without having to visit each individual point, as is the case of the horizontal-based approach.
20th International Conference on CATA 2005 Thank You…