
Enhancing Scientific Data Mining with Multi-level Cluster Analysis Toolkit

Addressing bottlenecks in data management for large scientific datasets through ASPECT, a multi-level toolkit for dynamic data analysis. Leverage existing work, integrate with SDM tasks, collaborate with scientists, and iterate development.





Presentation Transcript


  1. Multi-agent based High-Dimensional Cluster Analysis. SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001. Nagiza Samatova & George Ostrouchov, Computer Science and Mathematics Division, Oak Ridge National Laboratory

  2. Science-driven Bottlenecks
• Data management and data mining algorithms: not scalable to petabytes of scientific data
• Retrieving data subsets from storage systems: too slow, especially for tertiary storage
• Transferring large datasets between sites: inefficient
• Navigating between heterogeneous, distributed data sources: very user intensive
• I/O techniques: access rates too low
Major Focus: to improve the transfer of large datasets
Approaches:
• To implement effective high-bandwidth transfers (Randy Burris)
• To minimize the amount of data transferred

  3. Minimizing the amount of scientific simulation data transferred – State of the Art
• Data compression utilities (zip, compress, etc.):
  • large overheads
  • modest compression rates
• Post-processing data analysis tools (like PCMDI):
  • scientists must wait for simulation completion
  • can use lots of CPU cycles on long-running simulations
  • can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
• Simulation monitoring tools:
  • interference with simulations
  • lack of flexibility

  4. Improvements through multi-level data minimization mechanisms
• Simulation level: data-stream (not simulation) monitoring tools for:
  • “Any-time” feedback to decide whether to terminate a simulation, restart with new parameters, or continue
  • Filtering runs to decide whether to transfer to a central archive, keep locally, or delete
• Comparative analysis level: application-specific search engines for:
  • Simulation data comparison, esp. against archived databases
  • Distributed simulation data query, search, and retrieval
• In-depth analysis level: application-specific inference engines for:
  • Inferring rules relating fragments in two or more simulation outputs
  • New scientific discoveries
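The simulation-level idea above can be sketched in a few lines: watch the output stream, emit "any-time" feedback, and make a per-run filtering decision. This is a minimal illustration only; the function names, thresholds, and the divergence/stall tests are hypothetical placeholders, not ASPECT's actual logic.

```python
# Minimal sketch of Level-I stream monitoring. The blowup/stall
# criteria are illustrative stand-ins for user-supplied analysis modules.
def monitor_stream(values, blowup=1e6, stall_window=50, stall_eps=1e-9):
    """Scan a simulation output stream; return (feedback, step)."""
    recent = []
    for step, v in enumerate(values):
        if abs(v) > blowup:                  # solution diverging: stop the run
            return ('terminate', step)
        recent.append(v)
        if len(recent) > stall_window:
            recent.pop(0)
            if max(recent) - min(recent) < stall_eps:  # no progress in window
                return ('restart-with-new-parameters', step)
    return ('continue', None)

def filter_run(decision, interesting):
    """Filtering: decide where this run's output should go."""
    if decision == 'terminate':
        return 'delete'
    return 'transfer-to-archive' if interesting else 'keep-locally'

print(monitor_stream([0.1 * i for i in range(100)]))  # ('continue', None)
```

A real monitoring engine would plug application-specific analysis modules into the loop in place of these fixed thresholds.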

  5. How we will address these needs
• Our Approach: develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit), which includes:
  • Dynamic first-look multivariate time series miner (Level I)
  • Distributed time-series query, search, and retrieval engine (Level II)
  • Time-series-based rules inference engine (Level III)
• Our Strategy:
  • Leverage existing work
  • Expand our prior work
  • Integrate with other SDM tasks
  • Work closely with application scientists
  • Develop ASPECT in an iterative fashion

  6. Our work will be leveraging
• Distributed Scientific Data Mining Research (Probe/MICS) [SOA+01a, SOA+01b]
• Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL+96, DFL+00]
• Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]

  7. Distributed Scientific Data Mining Research (funded under Probe/MICS)
• Motivation
• Big picture
• SDM-ETC related effort
• Relevance to our task: Levels II and III
• Limitations w.r.t. our task: enabling-technology research, not application-specific

  8. Motivation for Scientific Data Mining Research under Probe
• Existing data mining tools have limited applicability to the emerging scientific data sets that are:
• Massive (terabytes to petabytes)
  • Existing methods do not scale in terms of time, storage, or number of dimensions.
  • Need scalable data analysis algorithms.
• Distributed (e.g., across computational grids, multiple files, disks, tapes)
  • Existing methods work on a single, centralized dataset. Data transfer is prohibitive (high bandwidth, security/privacy concerns).
  • Need distributed data analysis algorithms.
• Dynamic
  • Existing methods work with static datasets. Any changes require complete re-computation.
  • Need dynamic (updating & downdating) techniques.
• High-dimensional
  • Usual assumptions about homogeneity or ergodicity cannot be made.
  • Need segmented dimension reduction methods.

  9. Our Approach – Distributed agents and peer-to-peer negotiation
• Strategy
  • Perform data mining in a distributed and recursive fashion
  • with reasonable data transfer overheads
• Key idea
  • Generate local components using distributed agents
  • Merge these components into a global system via peer-to-peer agents’ collaboration and negotiation
• Requirements for the Resulting System
  • Qualitative comparability
  • Computational complexity reduction
  • Scalability
  • Communication acceptability
  • Flexibility (in the choice of a local algorithm)
  • Visual representation sufficiency

  10. Background: Hierarchical Clustering
[Figure: a distance matrix over points A–E, the corresponding spanning tree annotated with dissimilarity measures, and the resulting dendrogram]

  11. SDM-ETC Tie-in: Distributed Hierarchical Clustering
• Problem Description:
  • Given: a data set with N d-dimensional data items distributed across multiple data sites
  • Task: determine a hierarchical decomposition of this dataset
• Applications of Clustering:
  • Database management
  • Multi-dimensional indexing
  • Data mining
  • …

  12. RACHET: Distributed Clustering Algorithm
Control flow of RACHET:
  1. Generate local dendrograms at each data site (distributed dendrograms)
  2. Transmit the local dendrograms to a central site (centralized dendrograms)
  3. Merge the local dendrograms into a global dendrogram
  4. Quality comparable to a centralized clustering? If not, increase k and repeat
  5. Reconstruct geometry for visualization (optional)

  13. Centroid Descriptive Statistics – summarized cluster representation
Question: how many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
A cluster of Nc points is summarized by the descriptive statistics of its centroid:
• Nc – number of data points in the cluster
• square norm of the centroid
• radius of the cluster
• sum of centroid components
• minimum centroid component
• maximum centroid component
Features:
• far smaller space cost than storing the raw points
• sufficient for efficiently calculating all measurements involved in making clustering decisions
• sufficient for visualization
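The six statistics above can be computed in one pass over a cluster's points. A sketch follows; the field names and the radius definition (RMS distance of points to the centroid) are my assumptions, since the slide's formulas did not survive extraction.

```python
# Sketch of the slide's centroid descriptive statistics for one cluster.
# Field names are mine; the slide lists the same six quantities.
import math

def descriptive_stats(points):
    """points: list of d-dimensional tuples belonging to one cluster."""
    n = len(points)
    d = len(points[0])
    centroid = [sum(p[j] for p in points) / n for j in range(d)]
    sq_norm = sum(c * c for c in centroid)          # square norm of centroid
    # Radius taken here as RMS distance of the points to the centroid
    radius = math.sqrt(sum(sum((p[j] - centroid[j]) ** 2 for j in range(d))
                           for p in points) / n)
    return {
        'Nc': n,                    # number of data points in the cluster
        'sq_norm': sq_norm,         # square norm of the centroid
        'radius': radius,           # radius of the cluster
        'comp_sum': sum(centroid),  # sum of centroid components
        'comp_min': min(centroid),  # minimum centroid component
        'comp_max': max(centroid),  # maximum centroid component
    }

stats = descriptive_stats([(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)])
print(stats['Nc'], stats['comp_sum'])  # 3 2.0
```

The point of the summary is that its size is constant, independent of Nc and only weakly tied to the dimension, which is what makes it cheap to transmit between sites.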

  14. Updating Descriptive Statistics
Merging Theorem: let S1 and S2 be the descriptive statistics of two clusters C1 and C2. Then the descriptive statistics S of the cluster formed by merging C1 and C2 can be computed directly from S1 and S2, without revisiting the raw points.
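A hedged sketch of a merge rule in the spirit of this theorem is below: combining two clusters' summaries without the raw points. Exact updates are shown where the algebra is exact (Nc and the component sum); min, max, and square norm become one-sided bounds here. The actual RACHET statements may differ in detail, and the radius update (which needs the inter-centroid distance) is omitted.

```python
# Hedged sketch: merge two clusters' descriptive statistics.
# Merged centroid = w1*c1 + w2*c2 with w_i = Nc_i / (Nc_1 + Nc_2).
def merge_stats(s1, s2):
    n = s1['Nc'] + s2['Nc']
    w1, w2 = s1['Nc'] / n, s2['Nc'] / n
    return {
        'Nc': n,
        # Exact: each merged component is w1*c1j + w2*c2j, so sums combine linearly
        'comp_sum': w1 * s1['comp_sum'] + w2 * s2['comp_sum'],
        # Convex combination bounds on each merged component
        'comp_min': w1 * s1['comp_min'] + w2 * s2['comp_min'],  # lower bound
        'comp_max': w1 * s1['comp_max'] + w2 * s2['comp_max'],  # upper bound
        # Jensen: ||w1*c1 + w2*c2||^2 <= w1*||c1||^2 + w2*||c2||^2
        'sq_norm': w1 * s1['sq_norm'] + w2 * s2['sq_norm'],     # upper bound
    }

# Cluster of 3 points with centroid (1, 1) merged with the single point (3, 3)
s1 = {'Nc': 3, 'comp_sum': 2.0, 'comp_min': 1.0, 'comp_max': 1.0, 'sq_norm': 2.0}
s2 = {'Nc': 1, 'comp_sum': 6.0, 'comp_min': 3.0, 'comp_max': 3.0, 'sq_norm': 18.0}
print(merge_stats(s1, s2))
```

For this example the merged centroid is (1.5, 1.5), and the exact fields agree with it while the bounded fields bracket it.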

  15. Euclidean Distance Approximation
• Exact squared Euclidean distance between centroids: requires transmitting the full centroids, so the transmission cost grows with the dimension
• Lower and upper bounds on the distance: computable from the constant-size descriptive statistics, at much lower transmission cost
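One way such bounds can be obtained from the descriptive statistics is via the identity ||c1 − c2||² = ||c1||² + ||c2||² − 2·(c1 · c2) and bounding the dot product from the component sum/min/max. The sketch below assumes nonnegative centroid components; the actual RACHET bounds may be tighter and differ in form.

```python
# Hedged sketch: bound the squared Euclidean distance between two
# centroids from constant-size statistics, without sending the centroids.
# Assumes nonnegative components (so min1*sum2 <= c1.c2 <= max1*sum2).
def sq_dist_bounds(s1, s2):
    """s_i: dict with 'sq_norm', 'comp_sum', 'comp_min', 'comp_max'."""
    dot_lo = max(s1['comp_min'] * s2['comp_sum'],
                 s2['comp_min'] * s1['comp_sum'])
    dot_hi = min(s1['comp_max'] * s2['comp_sum'],
                 s2['comp_max'] * s1['comp_sum'])
    base = s1['sq_norm'] + s2['sq_norm']
    # ||c1 - c2||^2 = base - 2*(c1 . c2); distance is never negative
    return (max(base - 2 * dot_hi, 0.0), base - 2 * dot_lo)

# Two 2-d centroids: c1 = (1, 1), c2 = (3, 3); true squared distance = 8
s1 = {'sq_norm': 2.0, 'comp_sum': 2.0, 'comp_min': 1.0, 'comp_max': 1.0}
s2 = {'sq_norm': 18.0, 'comp_sum': 6.0, 'comp_min': 3.0, 'comp_max': 3.0}
lo, hi = sq_dist_bounds(s1, s2)
print(lo, hi)  # the interval brackets 8.0
```

Only a handful of scalars per cluster cross the wire, which is what makes distributed merging decisions affordable.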

  16. RACHET Performance Analysis: linear in time, space, and transmission cost, i.e. O(N), given |S| << N and k << N

  17. Analysis of Large Scientific Datasets
• Focus: univariate time series data
• Applications: ARM, EEG
• Relevance to our task: Level III
• Limitations w.r.t. our task:
  • No support for dynamic & distributed time series
  • No support for multivariate time series

  18. Local Models for Global Analysis and Comparison of Data Series
• Strategy
  • Segment the series
  • Model the usual to find the unusual
• Key ideas
  • Fit simple local models to segments
  • Use the parameters for global analysis and monitoring
• Resulting system
  • Detects specific events (targeted or unusual)
  • Provides a global description of one or several data series
  • Provides data reduction to the parameters of the local model

  19. From Local Models to Annotated Time Series
Segment the series (100 observations per segment) → fit a simple local model (c0, c1, c2, ||e||, ||e||²) → select the extreme segments (10%) → cluster the extremes (4 clusters) → map back to the series
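The first three steps of this pipeline can be sketched directly: segment, fit a local quadratic (c0 + c1·t + c2·t²) per segment, keep the coefficients and residual norm as the segment's summary, and flag the worst-fit segments as "extreme". The segment length and 10% threshold follow the slide; the hand-rolled least-squares solver is only there to keep the sketch self-contained, and "extreme = largest residual" is my reading of the slide.

```python
# Sketch: segment a series, fit local quadratic models, flag extremes.
import math

def fit_quadratic(y):
    """Least squares y[t] ~ c0 + c1*t + c2*t^2; returns (c0, c1, c2, resid_norm)."""
    n, ts = len(y), range(len(y))
    # Normal equations A c = b for the 3 coefficients
    A = [[float(sum(t ** (i + j) for t in ts)) for j in range(3)] for i in range(3)]
    b = [float(sum(yv * t ** i for t, yv in zip(ts, y))) for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * ac for a, ac in zip(A[r], A[col])]
            b[r] -= f * b[col]
    c = [0.0] * 3
    for i in (2, 1, 0):
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, 3))) / A[i][i]
    resid = math.sqrt(sum((yv - (c[0] + c[1] * t + c[2] * t * t)) ** 2
                          for t, yv in zip(ts, y)))
    return c[0], c[1], c[2], resid

def extreme_segments(series, seg_len=100, frac=0.10):
    """Return indices of the worst-fit (most 'unusual') segments."""
    segs = [series[i:i + seg_len]
            for i in range(0, len(series) - seg_len + 1, seg_len)]
    fits = [fit_quadratic(s) for s in segs]
    k = max(1, int(frac * len(fits)))
    return sorted(range(len(fits)), key=lambda i: -fits[i][3])[:k]

series = [0.0] * 300
series[150] = 50.0            # a spike makes the middle segment "unusual"
print(extreme_segments(series))  # segment 1 is flagged
```

The remaining steps (clustering the extreme segments' parameter vectors and mapping back) operate on the small (c0, c1, c2, ||e||) summaries rather than the raw series, which is the data-reduction point of the slide.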

  20. Statistical Downscaling for Climate
• Focus: image time series
• Application: climate
• Relevance to our task: Levels I and II
• Limitation w.r.t. our task: works as a post-processing tool

  21. Climate Downscaling Contains Several Post-Processing Tools

  22. Trend and Periodic Components Provide a Concise Description of a Model Run
Filter periodic and trend components → compute EOFs → monitor the model run
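The EOF step can be illustrated concisely: EOFs (empirical orthogonal functions) are the principal spatial patterns of a time-by-space anomaly matrix. The sketch below extracts only the leading EOF via power iteration on the spatial covariance matrix, and assumes the periodic/trend filtering has already happened; a real implementation would use a library SVD and return several EOFs.

```python
# Hedged sketch: leading EOF of a time-by-space field via power iteration.
def leading_eof(X, iters=200):
    """X: list of time samples, each a list over spatial points."""
    nt, ns = len(X), len(X[0])
    means = [sum(row[j] for row in X) / nt for j in range(ns)]
    A = [[row[j] - means[j] for j in range(ns)] for row in X]   # anomalies
    # Spatial covariance C = A^T A / nt
    C = [[sum(A[t][i] * A[t][j] for t in range(nt)) / nt for j in range(ns)]
         for i in range(ns)]
    v = [1.0] * ns
    for _ in range(iters):      # power iteration -> dominant eigenvector
        w = [sum(C[i][j] * v[j] for j in range(ns)) for i in range(ns)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v                    # leading spatial pattern (EOF 1)

# Toy field whose variance is dominated by the (1, 1) spatial pattern
X = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0], [0.1, -0.1]]
print(leading_eof(X))
```

Monitoring a model run then amounts to projecting each new time step onto a few such patterns, a far more concise description than the raw fields.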

  23. Summary of where efforts are needed
• Research:
  • Multivariate time series datasets
  • Dynamic versions of time series processing & analysis tools
  • Application-specific distributed & dynamic clustering
  • Application-specific rules inference algorithms
• Implementations:
  • ASPECT’s framework
  • Simulation data monitoring engine:
    • with pluggable user-driven data analysis modules
    • with “any-time”/“real-time” operation, not post-processing
    • with little or no interference with the simulation
  • Simulation data query, search, & retrieval engine
  • Simulation data rules inference engine
  • A lot of integration work…

  24. Integration with other SDM-ETC tasks
[Figure: a layered diagram spanning a) Storage Level, b) File Level, c) Dataset Level, and d) Dataset Federation Level, with Grid Enabling Technology underlying all layers]
1) Storage and retrieval of very large datasets:
  • MPI I/O: implementation based on file-level hints (ANL, NWU)
  • Low-level API for grid I/O (ANL)
  • Parallel I/O: improving parallel access from clusters (ANL, NWU)
  • Adaptive file caching in a distributed system (LBNL)
  • Optimization of low-level data storage, retrieval and transport (ORNL)
2) Access optimization:
  • Analysis of application-level query patterns (LLNL, NWU)
  • Optimizing shared access to tertiary storage (LBNL, ORNL)
3) Data mining and discovery of access patterns of distributed data:
  • High-dimensional indexing techniques (LBNL)
  • Multi-agent high-dimensional cluster analysis (ORNL)
  • Dimension reduction and sampling (LLNL, LBNL)
4) Distributed, heterogeneous data access:
  • Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
  • Knowledge-based federation of heterogeneous databases (SDSC)
5) Agent technology:
  • Enabling communication among tools and data (ORNL, NCSU)
