PARSIMONY: Scalable Parallel Data Mining. Alok N. Choudhary, Northwestern University (ACK: Harsha Nagesh (Bell Labs) and Sanjay Goil (Sun)). Outline: Overview of PARSIMONY; MAFIA (subspace clustering); pMAFIA (parallel subspace clustering); Performance Results; PARSIMONY
PARSIMONY: Scalable Parallel Data Mining
Alok N. Choudhary, Northwestern University
(ACK: Harsha Nagesh (Bell Labs) and Sanjay Goil (Sun))
choudhar@ece.nwu.edu
Outline
• Overview of PARSIMONY
• MAFIA (subspace clustering)
• pMAFIA (parallel subspace clustering)
• Performance results
• PARSIMONY
  • multidimensional data analysis
  • parallel classification
• Summary
Overview of Knowledge Discovery Process
PARSIMONY Overview
MAFIA: Subspace Clustering for High-Dimensional Data Sets
• Clustering and subspace clustering
• Base MAFIA algorithm
• pMAFIA (parallelization)
• Performance results
Clustering Multi-Dimensional Data Sets
• Determine the range of attributes in each dimension of the cluster(s)
Issues to be addressed
• Computational optimizations of the basic algorithm
• Scalability with database size (out-of-core data sets)
• Scalability with the dimensionality of the data
• Efficient parallelization
• Recognition of arbitrarily shaped clusters
Related Work
• Partition-based clustering:
  • User-specified k representative points are taken as cluster centers, and points are assigned to these centers
  • k-means, k-medoids, CLARANS (VLDB 94), BIRCH (SIGMOD 96), ...
  • Treats clustering as a partitioning of the points
• Hierarchical clustering:
  • Each point starts as a cluster. Similar points are merged together gradually.
  • CURE (SIGMOD 98) (uses sampling)
• Categorical clustering:
  • Clustering of categorical data, e.g. automobile sales data: color, year, model, price, etc.
  • Best suited for non-numerical data
  • CACTUS (KDD 99), STIRR (VLDB 98)
Related Work
• Density and grid based clustering:
  • Clusters are regions of higher density than their surroundings
  • WaveCluster (VLDB 98), DBSCAN, CLIQUE (SIGMOD 98)
  • Number of subspaces is exponential in the data dimensionality
  • The multidimensional space is divided into grids. The histogram in each hyper-rectangle is found. Grid regions with a significant histogram value are cluster regions.
  • Post-processing is done to grow the connected cluster regions.
  • A fine grid size results in an explosion in the number of hyper-rectangles; coarser grids fail to detect clusters.
  • Correct grid size is very critical!
Subspace Clustering
• Observation: “If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of the space”
• Algorithm (growing clusters): candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.
• Ex: ( {a1,b7,d9}, {b7,c8,d9} ) --> {a1,b7,c8,d9}
• Candidate dense units are populated by a pass over the data set, and the dense units are identified in each dimension.
• The dense units found are combined to form candidate dense units for the next dimension.
• The algorithm terminates when no more candidate dense units are found.
Adaptive Grids (reducing computation in practice)
• Automatic grid fitting based on the data distribution
• MAFIA: Merging of Adaptive Finite Intervals
• Optimal bins in each dimension lead to very few units in the grid (candidate dense units)
[Figure: grid partitioning under (a) CLIQUE and (b) MAFIA]
Base MAFIA Algorithm
• Divide each dimension into very fine regions.
• Compute the histogram in these regions along every dimension.
• Set the value of a sliding window to the maximum histogram value in the window.
• Adjacent units which have nearly the same histogram values are merged together to form larger bins.
• The threshold of each bin formed is computed automatically.
• A bin whose histogram value is much greater (by a factor of 2 to 3) than that of an equi-distribution of the data is DENSE.
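The steps above can be sketched for a single dimension as follows. This is a minimal illustration, not the authors' implementation: the function name `adaptive_bins` and the parameters `fine_bins`, `window`, `beta` (how close two window values must be to count as "nearly the same") and `alpha` (the density factor, here 2) are assumptions chosen for the example.

```python
def adaptive_bins(values, lo, hi, fine_bins=100, window=5, beta=0.2, alpha=2.0):
    """Return (left, right, count, is_dense) adaptive bins for one dimension.

    All parameter names and defaults are illustrative assumptions.
    """
    n = len(values)
    width = (hi - lo) / fine_bins

    # Step 1: histogram over very fine, equal-width cells.
    hist = [0] * fine_bins
    for v in values:
        idx = min(int((v - lo) / width), fine_bins - 1)
        hist[idx] += 1

    # Step 2: each window of fine cells is represented by its maximum count.
    windows = []
    for start in range(0, fine_bins, window):
        cells = hist[start:start + window]
        windows.append((start, start + len(cells), max(cells), sum(cells)))

    # Step 3: merge adjacent windows whose values differ by at most beta (relative).
    bins = [list(windows[0])]
    for start, end, peak, count in windows[1:]:
        last = bins[-1]
        ref = max(last[2], peak)
        if ref == 0 or abs(last[2] - peak) <= beta * ref:
            last[1] = end          # extend the current bin
            last[2] = ref          # keep the larger window value
            last[3] += count       # accumulate the true record count
        else:
            bins.append([start, end, peak, count])

    # Step 4: a bin is dense if it holds more than alpha times the records
    # that a uniform (equi-distributed) data set would place in it.
    result = []
    for start, end, _peak, count in bins:
        expected = n * (end - start) / fine_bins
        result.append((lo + start * width, lo + end * width,
                       count, count > alpha * expected))
    return result
```

With 900 points concentrated in [40, 50) and 100 points spread evenly over [0, 100), this sketch produces three bins, of which only the middle one is dense, instead of the many equal-width cells a fixed grid would need.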
MAFIA Algorithm (merging dense units)
• Algorithm: candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.
• Ex: ( {a1,c7,b8}, {c7,b8,d9} ) --> {a1,c7,b8,d9}
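The merging rule can be sketched as below. The representation is an assumption made for the example: a dense unit is a tuple of `(dimension, bin)` pairs, so `a1` becomes `("a", 1)`, and `build_cdus` is a hypothetical name. Collecting results in a set also discards identical candidates produced by different pairs.

```python
from itertools import combinations

def build_cdus(dense_units):
    """Join (k-1)-dimensional dense units that share any k-2 (dimension, bin) pairs.

    All input units are assumed to have the same number of dimensions.
    """
    candidates = set()
    for u, v in combinations(dense_units, 2):
        shared = set(u) & set(v)
        union = set(u) | set(v)
        dims = {dim for dim, _ in union}
        # Sharing k-2 pairs gives a union of k pairs; these must also cover
        # k distinct dimensions (two different bins of one dimension cannot
        # belong to the same unit).
        if len(shared) == len(u) - 1 and len(dims) == len(union):
            candidates.add(tuple(sorted(union)))
    return sorted(candidates)

# The example from the slide: ({a1,c7,b8}, {c7,b8,d9}) --> {a1,c7,b8,d9}
units = [
    (("a", 1), ("c", 7), ("b", 8)),
    (("c", 7), ("b", 8), ("d", 9)),
]
cdus = build_cdus(units)
```

The pairwise loop is quadratic in the number of dense units, which is the cost that the parallel formulation later distributes across processors.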
CLIQUE: CDUs in any dimension k are formed by combining dense units of dimension (k-1) which share the first (k-2) dimensions.
MAFIA: CDUs in any dimension k are formed by combining dense units of dimension (k-1) which share any (k-2) dimensions.
[Diagram: the same data set processed by both algorithms]
• CLIQUE: fixed grids; first (k-2) algorithm; huge CDU set; huge search space; non-cluster dimensions reported
• MAFIA: adaptive grids; any (k-2) algorithm; much reduced CDU set; reduced search space; correct cluster dimensions reported; dimensions aware of data distribution
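For contrast, the "first (k-2)" rule can be sketched as a prefix join. This is an illustration under assumptions: units are tuples of `(dimension, bin)` pairs kept sorted by dimension name, and `prefix_join` is a hypothetical helper, not code from CLIQUE.

```python
from itertools import combinations

def prefix_join(dense_units):
    """Join (k-1)-dimensional units that agree on their first k-2 pairs.

    Each unit must be a tuple of (dimension, bin) pairs sorted by dimension.
    """
    candidates = set()
    for u, v in combinations(sorted(dense_units), 2):
        # Same prefix, and the last pairs lie in different dimensions.
        if u[:-1] == v[:-1] and u[-1][0] != v[-1][0]:
            candidates.add(u + (v[-1],))
    return sorted(candidates)

# The pair {a1,b7,d9}, {b7,c8,d9} shares two pairs (b7 and d9),
# but not as a common prefix, so this rule produces no candidate from it.
no_candidate = prefix_join([
    (("a", 1), ("b", 7), ("d", 9)),
    (("b", 7), ("c", 8), ("d", 9)),
])

# Units that do share their first two pairs are joined.
one_candidate = prefix_join([
    (("a", 1), ("b", 7), ("c", 8)),
    (("a", 1), ("b", 7), ("d", 9)),
])
```

Under the "any (k-2)" rule the first pair would yield {a1,b7,c8,d9}, which is the difference the two statements above describe.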
Parallel MAFIA
• pMAFIA: Parallel Subspace Clustering
• Scalable in data size and number of dimensions
• Grid and density based clustering algorithm
• Parallelization provides speedup for the subspace clustering algorithm.
pMAFIA Algorithm
• Each processor reads part of the data in a data-parallel fashion and constructs a histogram in every dimension.
  // Data is read in chunks (out of core)
• A Reduce communication obtains the global histogram. All processors build adaptive grids using the histogram.
  // Each bin formed is a candidate dense unit (CDU)
• Current dimension k = 1
• while (dense units are still being found)
    if (k > 1)
      Build candidate dense units();
    Populate the candidate dense units in a data-parallel fashion, in chunks (out of core) of B records.
    Reduce communication to obtain the global CDU population.
    Identify the dense units();
    Build dense unit data structures(); // for the next higher dimension
    k = k + 1
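The data-parallel histogram step and the Reduce that follows it can be sketched as below. This is a single-process simulation under assumptions: the "processors" are slices of one list, the Reduce is an element-wise sum, and all function names are hypothetical. A real implementation would use the reduction primitive of a message-passing library instead.

```python
def local_histogram(records, dim, lo, hi, bins, chunk_size):
    """Histogram of one dimension, reading the local records chunk by chunk."""
    hist = [0] * bins
    width = (hi - lo) / bins
    for start in range(0, len(records), chunk_size):
        # Only one chunk of B = chunk_size records is processed at a time,
        # mimicking out-of-core access.
        for rec in records[start:start + chunk_size]:
            idx = min(int((rec[dim] - lo) / width), bins - 1)
            hist[idx] += 1
    return hist

def reduce_sum(local_hists):
    """Element-wise sum of the per-processor histograms (simulated Reduce)."""
    return [sum(column) for column in zip(*local_hists)]

def global_histogram(records, p, dim, lo, hi, bins, chunk_size):
    """Split the records over p simulated processors and combine their counts."""
    share = (len(records) + p - 1) // p
    parts = [records[r * share:(r + 1) * share] for r in range(p)]
    return reduce_sum([
        local_histogram(part, dim, lo, hi, bins, chunk_size) for part in parts
    ])

records = [(i % 10 + 0.5,) for i in range(1000)]
hist = global_histogram(records, p=4, dim=0, lo=0.0, hi=10.0, bins=10, chunk_size=64)
```

Because the counts are additive, the result does not depend on the number of processors, and the message exchanged has the size of the histogram rather than the size of the data. The same pattern applies to populating the CDUs in each pass.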
Current Dimension (k) = 3. Build-Candidate-Dense-Units
Build Candidate Dense Units
• Each dense unit is compared with every other dense unit to form CDUs, resulting in an O(Ndu^2) algorithm (Ndu: number of dense units).
• For large values of Ndu, the CDUs are built in parallel.
• Processors 0, ..., (p-1) work in parallel on parts of the total Ndu dense units. Processor i compares the dense units between N_i and N_(i+1) with all the other dense units; for optimal task partitioning we have [equation not preserved in this transcript]
• Identical CDUs generated during the process need to be discarded.
• Each generated CDU is compared with every other CDU to identify identical ones, resulting in an O(Ncdu^2) algorithm.
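One way to balance the pairwise comparisons is sketched below. It is not the partitioning formula from the slide, which did not survive in this transcript. The sketch assumes that dense unit j is compared only with units of higher index, so unit j costs Ndu - 1 - j comparisons, and it picks boundaries so that every processor receives roughly the same number of comparisons. The function names are hypothetical.

```python
def partition_boundaries(n_du, p):
    """Return p + 1 boundaries; processor i handles units bounds[i]..bounds[i+1]-1.

    Assumes unit j is compared with units j+1 .. n_du-1 only.
    """
    total = n_du * (n_du - 1) // 2      # all pairwise comparisons
    bounds = [0]
    done = 0                            # comparisons assigned so far
    j = 0
    for i in range(1, p):
        target = total * i / p          # cumulative share for processors 0..i-1
        while j < n_du and done < target:
            done += n_du - 1 - j
            j += 1
        bounds.append(j)
    bounds.append(n_du)
    return bounds

def comparisons_per_processor(bounds, n_du):
    """Number of comparisons each processor performs under the same assumption."""
    return [
        sum(n_du - 1 - j for j in range(bounds[i], bounds[i + 1]))
        for i in range(len(bounds) - 1)
    ]

bounds = partition_boundaries(1000, 4)
work = comparisons_per_processor(bounds, 1000)
```

Under this assumption the early processors receive fewer dense units than the later ones, because low-index units take part in more comparisons.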
Parallelization (Building CDUs)
[Figure: total work done by processor P_i]
pMAFIA: Analysis
• Data parallelism in populating the CDUs in every dimension: effective for massive data sets.
• Gains of task parallelism are realized when the data contains a large number of dense units (clusters).
• A k-dimensional dense unit is allocated just 2k bytes of memory: k bytes for dimensions and k bytes for bin indices.
• Data structures take the form of linear arrays of bytes: very small message buffers are communicated, and space is optimized.
• Although the bottom-up algorithm is exponential in the data dimension, adaptive grids and the parallel formulation give very promising results for low subspace coverage.
• If k is the dimensionality of the highest-dimensional dense unit, all possible subspaces of these k dimensions are explored ==> O(c^k)
• The k passes over the data set cost O((N/(pB)) * k * T_io)
  • N: total number of records, p: number of processors
  • B: records per chunk, T_io: I/O access time for a block of B records
• Communication overhead is O(T_comm * S * p * k)
  • T_comm: constant for communication, S: size of message exchanged
• Total: O(c^k + (N/(pB)) * k * T_io + T_comm * S * p * k)
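The 2k-byte layout can be illustrated with Python `bytes`. The exact layout is an assumption for the example: the k dimension ids come first, followed by the k bin indices, and both are assumed to fit in one byte (values 0 to 255). The helper names are hypothetical.

```python
def encode_unit(unit):
    """Pack a k-dimensional unit [(dim, bin), ...] into 2k bytes."""
    dims = bytes(dim for dim, _ in unit)       # k bytes: dimension ids
    bins = bytes(bin_idx for _, bin_idx in unit)  # k bytes: bin indices
    return dims + bins

def decode_unit(buf):
    """Inverse of encode_unit: split the buffer into its two halves."""
    k = len(buf) // 2
    return list(zip(buf[:k], buf[k:]))

def pack_units(units):
    """Concatenate units of equal dimensionality into one flat message buffer."""
    return b"".join(encode_unit(u) for u in units)

unit = [(0, 1), (7, 12), (9, 3)]   # dimensions 0, 7, 9 with bins 1, 12, 3
buf = encode_unit(unit)            # 6 bytes for a 3-dimensional unit
```

A flat buffer of this kind can be sent as a single message without any serialization step, which is consistent with the small message sizes noted above.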
pMAFIA Outline
• Clustering and subspace clustering
• Base MAFIA algorithm
• pMAFIA (parallelization)
• Performance results
  • Adaptivity performance
  • Scalability with data set size
  • Scalability with data dimensionality
  • Scalability with cluster dimensionality
  • Some real-world data sets
Advantage of Adaptive Grids
• 300,000 records in a 15-dimensional space with 1 cluster of 5 dimensions.
• A speedup of 80 was obtained over CLIQUE.
• CLIQUE failed to produce results with our modified CDU generation algorithm even in 2 hours on 16 processors.
• This relatively small data set was mined in 32 seconds on 1 processor.
Scalability with Data Set Size
• 20-dimensional data with 5 clusters in 5 different subspaces
• Data sets: up to 11.8 million records
• Clusters detected in just about 3 minutes on 16 processors!
• Running time is almost linear in the data set size (most of the time is spent scanning the data)
Parallelization (on IBM SP2)
• 30-dimensional data with 8.3 million records, 5 clusters each in a 6-dimensional subspace.
• Near-linear speedups
• Negligible communication overheads (<1%)
Data Dimensionality
• 250,000 records, 3 clusters in different 5-dimensional subspaces.
• Near-linear behavior with data dimensionality: the algorithm depends on the maximum number of dimensions in a cluster and not on the data dimensionality.
Cluster Dimensionality
• 50-dimensional data with 1 cluster, 650,000 records; cluster dimension from 3 to 10.
• Behavior is in line with the order of the algorithm: running time increases with the subspace coverage of the cluster.
Scalability on Movie Data
• 72,916 users rated 1628 movies in 18 months: 2.8 million ratings
• 4D data: {user-id, movie-id, weight, score}
• Discovered seven interesting 2D clusters!
Other Data Sets
• One-day-ahead prediction of DAX (German Stock Exchange)
  • The DAX prediction data set was based on 12 input time series such as stock indices, bond indices, gold prices, etc.
  • 22 dimensions with 2757 records: major gains from task parallelism
  • Mined clusters in 8.16 seconds on 8 processors.
  • Unique clusters discovered in 3-, 4-, 5- and 6-dimensional subspaces.
• Ionosphere data (UCI repository)
  • Radar data collected in Goose Bay, Labrador
  • 34-dimensional data, 351 records.
  • Discovered unique clusters only in 3- and 4-dimensional subspaces.
Summary and Conclusions for pMAFIA
• MAFIA: unsupervised subspace clustering algorithm
• Introduced the adaptive grids formulation
• First parallel subspace clustering algorithm for massive data sets
• Incorporates both task and data parallelism.
• pMAFIA: scalable and parallel implementation in data size and number of dimensions
PARSIMONY Overview
Sparse Data Structures and Representations
Using OLAP Framework for