Explore FREERIDE, a robust system for rapid implementation of high-performance data mining algorithms, designed for large scientific datasets. Enhance scalability and accelerate the scientific data mining process efficiently.
System Support for High Performance Scientific Data Mining
Gagan Agrawal, Ruoming Jin, Raghu Machiraju, S. Parthasarathy
Department of Computer and Information Sciences, Ohio State University
Scientific Data Mining Problem
• Datasets used for scientific data mining are large, particularly those from simulations
• Our understanding of which algorithms and parameters will give the desired insights is limited
• The time required to implement different algorithms and run them with different parameters on large datasets slows down the scientific data mining process
Project Overview
• FREERIDE (Framework for Rapid Implementation of datamining engines) as the base system
• Already demonstrated for a variety of standard mining algorithms
• Currently being applied to feature analysis and mining of simulation data
FREERIDE offers:
• The ability to rapidly prototype a high-performance mining implementation
• Distributed memory parallelization
• Shared memory parallelization
• Ability to process large and disk-resident datasets
• Only modest modifications to a sequential implementation for the above three
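One common way to get parallelism from a largely unchanged sequential loop is to let each worker accumulate into a private copy of the result and merge the copies at the end. The Python sketch below illustrates that idea only; it is not FREERIDE's API, and the names `local_reduction`, `parallel_reduction`, `num_workers`, and `chunk_size` are hypothetical. Python threads here show the structure rather than real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_reduction(chunk, process):
    """Sequential kernel: count how many instances map to each key."""
    local = Counter()
    for d in chunk:
        local[process(d)] += 1
    return local


def parallel_reduction(data, process, num_workers=4, chunk_size=1000):
    """Run the sequential kernel on chunks, then merge the partial results.

    Processing one chunk at a time is also how a disk-resident dataset
    could be streamed, although this sketch keeps everything in memory.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker fills its own private Counter, so no locking is needed.
        for partial in pool.map(lambda c: local_reduction(c, process), chunks):
            total.update(partial)  # merge step: adds counts key by key
    return total
```

Only `parallel_reduction` knows about workers and chunks; `local_reduction` is the same loop a sequential implementation would use, which is the sense in which the required changes stay modest.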
Key Observation from Mining Algorithms
• Popular algorithms have a common canonical loop
• Can be used as the basis for supporting a common middleware

    While( ) {
        forall( data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
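The canonical loop (for each data instance d, compute I = process(d), then update R(I) = R(I) op d) can be sketched in Python as follows. This is an illustration under assumptions, not FREERIDE code: the function names, the one-dimensional points, and the convergence test are all hypothetical choices, with a k-means update used as the example reduction.

```python
from collections import defaultdict


def generalized_reduction(data, process, op, init):
    """Inner loop: R(I) = R(I) op d for every data instance d."""
    R = defaultdict(init)
    for d in data:
        I = process(d)      # which element of the reduction object d updates
        R[I] = op(R[I], d)  # op is assumed associative and commutative
    return dict(R)


def kmeans_step(points, centers):
    """One k-means pass over 1-D points, written as a generalized reduction."""
    def nearest(p):
        return min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)

    def accumulate(acc, p):
        return (acc[0] + p, acc[1] + 1)  # running (sum, count) per cluster

    R = generalized_reduction(points, nearest, accumulate, lambda: (0.0, 0))
    # Clusters that received no points keep their previous center.
    return [R[j][0] / R[j][1] if j in R else centers[j]
            for j in range(len(centers))]


def kmeans(points, centers, max_iters=100):
    """Outer loop: repeat the reduction until the centers stop changing."""
    for _ in range(max_iters):
        new_centers = kmeans_step(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

For example, `kmeans([1.0, 2.0, 9.0, 11.0], [0.0, 10.0])` returns `[1.5, 10.0]`. Because each update touches only one element of `R` and `op` does not depend on order, the same loop body could be applied to separate partitions of the data and the partial results combined afterwards.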
Performance of Shared Memory Parallelization: K-means clustering
Performance on Cluster of SMPs: Apriori Association Mining
SPIES On (a) FREERIDE
• Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
• Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume
• Does not require sorting of data, or partitioning and writing-back of records
Applying FREERIDE for Scientific Data Mining
• Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
• A feature is a region of interest in a dataset
• The approach provides a suite of algorithms for extracting and tracking features
A Feature Analysis Algorithm
(diagram labels: Data Transform Tour Grid Operator Aggregate Classify Points Denoise Track Rank Catalog ROIs Classify-Aggregate)
Ongoing Work: Parallelization Using FREERIDE
• Most of the steps involve generalized reductions, which FREERIDE supports well
• Extensions to FREERIDE are required for the aggregation and tracking steps
• Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm