Addressing bottlenecks in data management for large scientific datasets through ASPECT, a multi-level toolkit for dynamic data analysis. The strategy: leverage existing work, integrate with other SDM tasks, collaborate closely with application scientists, and develop iteratively.
Multi-agent based High-Dimensional Cluster Analysis
SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
Nagiza Samatova & George Ostrouchov, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Science-driven Bottlenecks
• Data management and data mining algorithms: not scalable to petabytes of scientific data
• Retrieving data subsets from storage systems: too slow, especially for tertiary storage
• Transferring large datasets between sites: inefficient
• Navigating between heterogeneous, distributed data sources: very user-intensive
• I/O techniques: access rates too low
Major focus: improving the transfer of large datasets
Approaches:
• Implement effective high-bandwidth transfers (Randy Burris)
• Minimize the amount of data transferred
Minimizing the Amount of Scientific Simulation Data Transferred: State of the Art
• Data compression utilities (zip, compress, etc.):
  • large overheads
  • modest compression rates
• Post-processing data analysis tools (e.g., PCMDI):
  • scientists must wait until the simulation completes
  • can consume many CPU cycles on long-running simulations
  • can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
• Simulation monitoring tools:
  • interfere with simulations
  • lack flexibility
Improvements: Multi-level Data Minimization Mechanisms
• Simulation level: data-stream (rather than simulation) monitoring tools for:
  • "any-time" feedback to decide whether to terminate a simulation, restart it with new parameters, or continue
  • filtering runs to decide whether to transfer to a central archive, keep locally, or delete
• Comparative-analysis level: application-specific search engines for:
  • simulation data comparison, especially against archived databases
  • distributed simulation data query, search, and retrieval
• In-depth analysis level: application-specific inference engines for:
  • inferring rules relating fragments in two or more simulation outputs
  • new scientific discoveries
How We Will Address These Needs
• Our approach: develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit), which includes:
  • a dynamic first-look multivariate time-series miner (Level I)
  • a distributed time-series query, search, and retrieval engine (Level II)
  • a time-series-based rules inference engine (Level III)
• Our strategy:
  • leverage existing work
  • expand our prior work
  • integrate with other SDM tasks
  • work closely with application scientists
  • develop ASPECT in an iterative fashion
Our Work Will Leverage
• Distributed Scientific Data Mining Research (Probe/MICS) [SOA+01a, SOA+01b]
• Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL+96, DFL+00]
• Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]
Distributed Scientific Data Mining Research (funded under Probe/MICS)
• Motivation
• Big picture
• SDM-ETC related effort
• Relevance to our task: Levels II and III
• Limitations w.r.t. our task: enabling-technology research, not application-specific
Motivation for Scientific Data Mining Research under Probe
Existing data mining tools have limited applicability to emerging scientific data sets, which are:
• Massive (terabytes to petabytes)
  • Existing methods do not scale in time, storage, or number of dimensions.
  • Need: scalable data analysis algorithms.
• Distributed (e.g., across computational grids, multiple files, disks, tapes)
  • Existing methods work on a single, centralized dataset; data transfer is prohibitive (bandwidth, security/privacy concerns).
  • Need: distributed data analysis algorithms.
• Dynamic
  • Existing methods work with static datasets; any change requires complete re-computation.
  • Need: dynamic (updating & downdating) techniques.
• High-dimensional
  • The usual assumptions of homogeneity or ergodicity cannot be made.
  • Need: segmented dimension-reduction methods.
Our Approach: Distributed Agents and Peer-to-Peer Negotiation
• Strategy
  • perform data mining in a distributed and recursive fashion
  • with reasonable data-transfer overheads
• Key idea
  • generate local components using distributed agents
  • merge these components into a global system via peer-to-peer agent collaboration and negotiation
• Requirements for the resulting system
  • qualitative comparability
  • computational-complexity reduction
  • scalability
  • communication acceptability
  • flexibility (in the choice of a local algorithm)
  • visual-representation sufficiency
Background: Hierarchical Clustering
[Figure: a pairwise distance matrix over items A-E, the corresponding spanning tree with dissimilarity measures, and the resulting dendrogram.]
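The figure's pipeline is easy to reproduce with off-the-shelf tools. Below is a minimal sketch in Python using SciPy; the dissimilarity values are illustrative stand-ins for the figure's matrix, not data from the talk.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Illustrative pairwise dissimilarities for five items A..E.
labels = ["A", "B", "C", "D", "E"]
D = np.array([
    [0.00, 0.25, 0.60, 0.70, 0.75],
    [0.25, 0.00, 0.50, 0.60, 0.80],
    [0.60, 0.50, 0.00, 0.40, 0.70],
    [0.70, 0.60, 0.40, 0.00, 0.60],
    [0.75, 0.80, 0.70, 0.60, 0.00],
])

# linkage() expects a condensed distance vector, not a square matrix;
# single linkage corresponds to growing a minimum spanning tree.
Z = linkage(squareform(D), method="single")
dendrogram(Z, labels=labels)   # plot the cluster hierarchy
plt.show()
```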
SDM-ETC Tie-in: Distributed Hierarchical Clustering
Problem description:
• Given: a dataset of N d-dimensional data items distributed across multiple data sites
• Task: determine a hierarchical decomposition of this dataset
• Applications of clustering:
  • database management
  • multi-dimensional indexing
  • data mining
  • and more
RACHET: Distributed Clustering Algorithm
Control flow of RACHET:
1. Generate local dendrograms at each site (distributed).
2. Transmit the local dendrograms to a central site.
3. Merge the local dendrograms into a global dendrogram (centralized).
4. If the quality is not comparable to centralized clustering, increase k and repeat.
5. Optionally, reconstruct geometry for visualization of the global dendrogram.
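As a sketch of this control flow (not the authors' implementation), the loop might look like the following; `build_local_dendrogram`, `merge_dendrograms`, `quality`, and `reconstruct_geometry` are hypothetical stand-ins for the algorithm's components.

```python
def rachet(sites, k, quality_threshold):
    while True:
        # Each site clusters its own data and ships only a compact dendrogram.
        local = [build_local_dendrogram(site, k) for site in sites]
        # The local dendrograms are merged into a single global hierarchy.
        global_dendrogram = merge_dendrograms(local)
        if quality(global_dendrogram) >= quality_threshold:
            break
        k += 1  # refine: redo with more clusters per site
    reconstruct_geometry(global_dendrogram)  # optional, for visualization
    return global_dendrogram
```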
Centroid Descriptive Statistics: Summarized Cluster Representation
Question: how many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
For the centroid c of a cluster of Nc points, the descriptive statistics are:
• Nc: number of data points in the cluster
• ||c||²: squared norm of the centroid
• R: radius of the cluster
• s: sum of the centroid's components
• minimum centroid component
• maximum centroid component
Features:
• O(1) space cost per cluster vs. O(d) for the full centroid
• sufficient for efficiently calculating all measurements involved in making clustering decisions
• sufficient for visualization
Updating Descriptive Statistics
Merging Theorem: let DS(C1) and DS(C2) be the descriptive statistics of two clusters C1 and C2. Then the descriptive statistics DS(C) of the cluster C formed by merging C1 and C2 can be computed (exactly, or as tight bounds) from DS(C1) and DS(C2) alone, without access to the underlying data points.
[Figure: two spheres S1 and S2 around centroids C1 and C2, with origin O, illustrating the merge.]
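A minimal sketch of these statistics and their merge-time update, assuming the radius is the root-mean-square distance of a cluster's points from its centroid. The class and the particular bound derivations (weighted averages plus Cauchy-Schwarz) are our own illustration, not the paper's exact theorem.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DescriptiveStats:
    n: int            # Nc: number of points in the cluster
    sq_norm: float    # ||c||^2: squared norm of the centroid
    radius: float     # R: RMS distance of the points from the centroid
    comp_sum: float   # s: sum of the centroid's components
    comp_min: float   # minimum centroid component
    comp_max: float   # maximum centroid component

def stats_from_points(X: np.ndarray) -> DescriptiveStats:
    """Compute the six scalars from raw points (done once, locally)."""
    c = X.mean(axis=0)
    return DescriptiveStats(
        n=len(X),
        sq_norm=float(c @ c),
        radius=float(np.sqrt(((X - c) ** 2).sum(axis=1).mean())),
        comp_sum=float(c.sum()),
        comp_min=float(c.min()),
        comp_max=float(c.max()),
    )

def merge(a: DescriptiveStats, b: DescriptiveStats) -> DescriptiveStats:
    """Statistics of the merged cluster, from the statistics alone.

    The merged centroid is the weighted average of the two centroids, so
    the count and component sum update exactly; the min/max become valid
    one-sided bounds, and the squared norm is bracketed via Cauchy-Schwarz.
    The radius update needs ||c||^2, so it too becomes an upper bound.
    """
    n = a.n + b.n
    wa, wb = a.n / n, b.n / n
    na, nb = np.sqrt(a.sq_norm), np.sqrt(b.sq_norm)
    sq_norm_hi = (wa * na + wb * nb) ** 2            # >= true ||c||^2
    sq_norm_lo = (wa * na - wb * nb) ** 2            # <= true ||c||^2
    mean_sq = (wa * (a.radius ** 2 + a.sq_norm)
               + wb * (b.radius ** 2 + b.sq_norm))   # exact mean of ||x||^2
    return DescriptiveStats(
        n=n,                                         # exact
        sq_norm=sq_norm_hi,                          # upper bound
        radius=float(np.sqrt(max(mean_sq - sq_norm_lo, 0.0))),  # upper bound
        comp_sum=wa * a.comp_sum + wb * b.comp_sum,  # exact
        comp_min=wa * a.comp_min + wb * b.comp_min,  # lower bound
        comp_max=wa * a.comp_max + wb * b.comp_max,  # upper bound
    )
```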
Euclidean Distance Approximation
• Exact squared Euclidean distance between centroids, d²(c1, c2) = ||c1||² + ||c2||² − 2⟨c1, c2⟩: O(d) transmission cost
• Lower and upper bounds computed from the descriptive statistics: O(1) transmission cost
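To make the O(1) bounds concrete, here is a small Python sketch: exchanging only the centroid norms brackets the squared Euclidean distance via Cauchy-Schwarz, while the exact value requires all d components.

```python
import numpy as np

def distance_bounds(sq_norm1: float, sq_norm2: float) -> tuple[float, float]:
    """(||c1|| - ||c2||)^2 <= ||c1 - c2||^2 <= (||c1|| + ||c2||)^2,
    computable from two scalars (O(1) transmission)."""
    n1, n2 = np.sqrt(sq_norm1), np.sqrt(sq_norm2)
    return (n1 - n2) ** 2, (n1 + n2) ** 2

def squared_distance(c1: np.ndarray, c2: np.ndarray) -> float:
    """Exact counterpart: needs the full centroids (O(d) transmission)."""
    return float(c1 @ c1 + c2 @ c2 - 2.0 * (c1 @ c2))
```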
RACHET Performance Analysis
Linear, i.e. O(N), in time, space, and transmission cost, provided |S| ≪ N and k ≪ N.
Analysis of Large Scientific Datasets
• Focus: univariate time-series data
• Applications: ARM, EEG
• Relevance to our task: Level III
• Limitations w.r.t. our task:
  • no support for dynamic & distributed time series
  • no support for multivariate time series
Local Models for Global Analysis and Comparison of Data Series
• Strategy
  • segment the series
  • model the usual to find the unusual
• Key ideas
  • fit simple local models to segments
  • use the model parameters for global analysis and monitoring
• Resulting system
  • detects specific events (targeted or unusual)
  • provides a global description of one or several data series
  • provides data reduction to the parameters of the local models
From Local Models to Annotated Time Series
1. Segment the series (100 observations per segment).
2. Fit a simple local model to each segment (c0, c1, c2, ||e||, ||e||²).
3. Select the extreme segments (10%).
4. Cluster the extreme segments (4 clusters).
5. Map the clusters back to the series.
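A minimal Python sketch of this pipeline; the segment length, quadratic local model, 10% selection, and 4 clusters follow the slide, while the synthetic series and the use of k-means for the clustering step are our own assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def local_model_features(series: np.ndarray, seg_len: int = 100) -> np.ndarray:
    """Fit a quadratic to each segment; return (c0, c1, c2, ||e||, ||e||^2)."""
    feats = []
    t = np.arange(seg_len)
    for start in range(0, len(series) - seg_len + 1, seg_len):
        seg = series[start:start + seg_len]
        c2, c1, c0 = np.polyfit(t, seg, deg=2)   # simple local model
        e = seg - np.polyval([c2, c1, c0], t)    # residuals
        feats.append([c0, c1, c2, np.linalg.norm(e), float(e @ e)])
    return np.array(feats)

series = np.random.default_rng(0).standard_normal(10_000)  # toy data
feats = local_model_features(series)

# Select the 10% most extreme segments by residual norm, then cluster
# their model parameters into 4 groups to annotate the series.
cutoff = np.quantile(feats[:, 3], 0.90)
extreme = feats[feats[:, 3] >= cutoff]
_, labels = kmeans2(extreme[:, :3], 4, seed=0)
```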
Statistical Downscaling for Climate
• Focus: image time series
• Application: climate
• Relevance to our task: Levels I and II
• Limitation w.r.t. our task: works only as a post-processing tool
Trend and Periodic Components Provide a Concise Description of a Model Run
1. Filter out periodic and trend components.
2. Compute EOFs (empirical orthogonal functions).
3. Monitor the model run.
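The EOF step is a standard principal-component decomposition of the space-time field. A minimal sketch via SVD, assuming the field has already had its trend and periodic components filtered as in the slide; the removal of the time mean here is a placeholder for that filtering.

```python
import numpy as np

def eofs(field: np.ndarray, n_modes: int = 3):
    """field: (time, space) array of a filtered model-run variable."""
    anomalies = field - field.mean(axis=0)           # remove time mean
    U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
    patterns = Vt[:n_modes]                          # spatial EOF patterns
    pcs = U[:, :n_modes] * s[:n_modes]               # principal-component series
    explained = (s[:n_modes] ** 2) / (s ** 2).sum()  # variance fractions
    return patterns, pcs, explained
```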
Summary of Where Efforts Are Needed
• Research:
  • multivariate time-series datasets
  • dynamic versions of time-series processing & analysis tools
  • application-specific distributed & dynamic clustering
  • application-specific rules-inference algorithms
• Implementation:
  • ASPECT's framework
  • simulation data monitoring engine:
    • with pluggable, user-driven data analysis modules
    • with "any-time"/"real-time" analysis rather than post-processing
    • with little or no interference with the simulation
  • simulation data query, search, & retrieval engine
  • simulation data rules-inference engine
  • a lot of integration work…
Integration with Other SDM-ETC Tasks
Task areas: 1) storage and retrieval of very large datasets; 2) access optimization; 3) data mining and discovery of access patterns of distributed data; 4) distributed, heterogeneous data access; 5) agent technology.
d) Dataset Federation Level
• Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
• Knowledge-based federation of heterogeneous databases (SDSC)
c) Dataset Level
• Analysis of application-level query patterns (LLNL, NWU)
• High-dimensional indexing techniques (LBNL)
• Optimizing shared access to tertiary storage (LBNL, ORNL)
• Multi-agent high-dimensional cluster analysis (ORNL)
• Dimension reduction and sampling (LLNL, LBNL)
b) File Level
• MPI I/O: implementation based on file-level hints (ANL, NWU)
• Low-level API for grid I/O (ANL)
• Parallel I/O: improving parallel access from clusters (ANL, NWU)
a) Storage Level [Grid Enabling Technology]
• Adaptive file caching in a distributed system (LBNL)
• Optimization of low-level data storage, retrieval, and transport (ORNL)
5) Agent technology
• Enabling communication among tools and data (ORNL, NCSU)