130 likes | 165 Views
Lionel F. Lovett, II Advisors: George Ostrouchov and Houssain Kettani Computer Science and Mathematics Division Oak Ridge National Laboratory. Summer 2005. RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering.
E N D
Lionel F. Lovett, II Advisors: George Ostrouchov and Houssain Kettani Computer Science and Mathematics Division Oak Ridge National Laboratory Summer 2005 RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Data Mining Large Databases Number of Items Number of Attributes (high-dimensionality) with items Visualization Requires low dimensional views (2 or 3) Structure discovery Patterns Clusters Fast similarity searching Images, video, documents, character recognition, face recognition, DNA sequences Data Reduction Why Dimension Reduction?
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering RobustMap Uses Distances and Mimics PCA Like FastMap Faloutsos and Lin (1995) (FastMap) • Choose two very distant points as principal axis • Project onto orthogonal hyperplane • Repeat • Each axis O(n), given distances • Distances updated using cosine law as needed • Result is a mapping as well as the transformation to map new items
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Projection to Pivot Axis and to Orthogonal Hyperplane Given pivot axis ab FastMap computes coordinates along axis and projections onto the orthogonal hyperplane. b b db,y z da,b y y cy a da,y y’ a z’ dy’,z’
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering OUTLIERS! Outliers are points that are not closest on average to the other members of their cluster. When Selecting points based on distance, FastMap considers all the points of a dataset. By including outliers, FastMap isn’t robust. Problems with FastMap?
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering FastMap Pivot Pair: Choosing Outliers Axis does not represent majority of data
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering RobustMap: Clustering and Excluding Outliers Ratio Function • Uses only distances from pivots (2nd and 3rd) • Computes ratios: data fraction / probability of data • Looks for splits according to ratio threshold • Discards smaller portion. • Compute n distances from random object • Take point of largest distance • Repeat • Clustering • Estimate distance distribution from two extreme points • Find probability of extreme points • Exclude most extreme cluster of low probability points • Finish projection using remaining points • Diagnostic histogram and cluster plots
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Dataset Generator • Generates clustered data from a mixture of multivariate normal densities • There are five parameters • Number of dimensions • Number of clusters • Cluster variability • Cluster mixing proportions • Seed for random number generator • Other RobustMap parameters • Number of dimensions to extract • Quantile of trimmed max • Ratio threshold for outlying cluster extraction
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering RobustMap identifies and excludes outlying clusters RobustMap performs dimension reduction RobustMap exploits robust statistics RobustMap exploits fast machine learning algorithms (runtime O(nk)) Results
Decomposition of Climate Model Run Datawith PCA (EOF) Amplitude-in-time plots Amplitude-in-time plots RobustMap 135 year CCM3 run at T42 resolution CO2 increase to 3x Average Monthly Surface Temperature 1620 x 2500 matrix (Putman, Drake, Ostrouchov, 2000) PC 1 RM 1 PC 2 RM 2 . . . + + + 1000 x FASTER ! + + + PC 1 PC 2 PC 3 PC 4 PC 1620 RM 1 RM 2 RM 3 RM 4 PC 3 RM 3 Concise 4-d summary of 135 year run Concise 4-d summary of 135 year run PC 4 RM 4 Winter warming more severe than summer warming Winter warming more severe than summer warming Image vector
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Ratio Compute threshold from probability theory Create loop for remaining clusters Develop better probability theory for RobustMap Add application context visualization Future Plans
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Searching Images and multimedia databases String databases (spelling, typing and OCR error correction) Medical databases Data Mining and Visualization Medical databases (ECGs, X-rays, MRI brain scans) Demographic Data Time Series Business, Commerce, and Financial Data Climate, Astrophysics, Chemistry, and Biology Data Applications