
Scientific Data Mining

Scientific Data Mining. Chandrika Kamath October 7, 2008 Lawrence Livermore National Laboratory. Goal: solving the problem of data overload. Use scientific data mining techniques to analyze data from various SciDAC applications


Presentation Transcript


  1. Scientific Data Mining Chandrika Kamath October 7, 2008 Lawrence Livermore National Laboratory

  2. Goal: solving the problem of data overload • Use scientific data mining techniques to analyze data from various SciDAC applications • Techniques borrowed from image and video processing, machine learning, statistics, pattern recognition, … • Leveraging the Sapphire scientific data mining software, with functions added as required • Contributors to the SciDAC part: Erick Cantú-Paz, Imola K. Fodor, Siddharth Manay, Nicole S. Love

  3. Overview of Sapphire

  4. Sapphire: scientific data mining (1998-2008) • We analyze science data from experiments, observations, and simulations: massive *and* complex • Sapphire has a three-fold focus • research in robust, accurate, scalable algorithms • modular, extensible software • analysis of data from practical problems • Funded through DOE NNSA, LLNL LDRD, SDM SciDAC Center, GSEP SciDAC project https://computation.llnl.gov/casc/sapphire

  5. Scientific data mining - from a Terabyte to a Megabyte • Data flow: Raw Data → Target Data → Preprocessed Data → Transformed Data → Patterns → Knowledge • Data Preprocessing: de-noising, object identification, feature extraction, normalization, dimension reduction, data fusion, sampling, multi-resolution analysis • Pattern Recognition: classification, clustering, regression • Interpreting Results: visualization, validation • An iterative and interactive process

  6. The Sapphire system architecture: flexible, portable, scalable • Data items: FITS, BSQ, PNM, View, ... • Processing modules: sample data, fuse data, multi-resolution analysis, de-noise data, background subtraction, identify objects, extract features, normalization, dimension reduction • Features • RDB: Data Store • Pattern recognition: decision trees, neural networks, SVMs, k-nearest neighbors, clustering, evolutionary algorithms, tracking, … • Display patterns • Legend: Sapphire software; public domain software; Sapphire & domain software • Components linked by Python • User input & feedback • US Patents 6675164 (1/04), 6859804 (2/05), 6879729 (4/05), 6938049 (8/05), 7007035 (2/06), 7062504 (6/06)

  7. The modular software is used to meet the needs of different applications • Interfaces: command-line interface, graphical interface • Sapphire software: drivers and support functions; Sapphire libraries (scientific data processing, dimension reduction, pattern recognition) • Applications: remote sensing; fragmentation of materials; plasma physics; astronomy; video surveillance; climate simulations; sim/expt comparison; fluid mix, turbulence; … • In this talk, I focus only on SciDAC applications

  8. SciDAC achievements

  9. Application 1: Separating signals in climate data • We used independent component analysis to separate El Niño and volcano signals in climate simulations • Showed that the technique can be used to enable better comparisons of simulations • Collaboration with Ben Santer (LLNL)

  10. Application 2: Identifying key features for EHOs in DIII-D • We used dimension reduction techniques from statistics and machine learning to identify key features associated with edge harmonic oscillations (EHOs) in the DIII-D tokamak • H-mode is the preferred mode of operation, but it is associated with ELMs, which can damage components of the tokamak • A quiescent H-mode has been observed; it is associated with EHOs, so we need to understand EHOs better • The key variables identified are being used to understand the cause of EHOs; the software has been licensed to GAT • Collaboration with Keith Burrell and Mike Walker (GAT)

  11. The data is from sensors in DIII-D • 700 experiments, each lasting 6 seconds • Each 50ms window of an experiment is assigned a low or high EHO-ness label • Each window is described by 37 sensor measurements • Data cleanup • discard windows with at least one missing sensor value • use median value of variable in window • discard windows with at least one variable in the top or bottom percentile of its range • resulted in 41818 instances
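
The cleanup steps on this slide can be sketched as follows. This is a minimal illustration, not the Sapphire implementation: the function names, the dictionary layout of a window, and the reading of "top or bottom percentile" as the lowest and highest `pct` percent of sorted values are all assumptions.

```python
from statistics import median

def summarize_window(samples):
    """Collapse the raw readings of each sensor in one window to their median.

    `samples` maps a (hypothetical) sensor name to the readings in the window;
    a sensor with no readings is recorded as None, i.e. missing.
    """
    return {name: (median(vals) if vals else None)
            for name, vals in samples.items()}

def percentile_bounds(values, pct=1.0):
    """Values at the bottom and top `pct` percent positions of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * pct / 100.0)
    return ordered[k], ordered[len(ordered) - 1 - k]

def clean(windows, pct=1.0):
    """Drop windows with a missing value, then windows with any extreme value."""
    complete = [w for w in windows if all(v is not None for v in w.values())]
    if not complete:
        return []
    bounds = {name: percentile_bounds([w[name] for w in complete], pct)
              for name in complete[0]}
    return [w for w in complete
            if all(bounds[n][0] <= w[n] <= bounds[n][1] for n in w)]
```

Dropping a whole window when any one variable is extreme is conservative, which is consistent with the slide's stated aim of keeping outliers from influencing the results.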

  12. Challenge: no preconceived notion of which sensor values are important • Data cleanup: prevents outliers from influencing results • Use different feature selection methods to gain confidence • PCA filter – use magnitude of coefficients • Distance filter – Kullback-Leibler distance between histograms • Stump filter • Chi-square filter • Boosting approach • Introduce a “noise” feature
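
As one illustration of the filters listed above, the distance filter can be sketched by histogramming each feature separately for the low and high classes and scoring the feature by the divergence between the two histograms. The symmetrized form of the Kullback-Leibler divergence, the bin count, the 0/1 label encoding, and the helper names are assumptions made for this sketch. The injected random feature gives a baseline: a real feature that ranks below it is probably uninformative.

```python
import math
import random

def histogram(values, lo, hi, bins=10, eps=1e-9):
    """Normalized histogram over [lo, hi], floored at eps to avoid log(0)."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    probs = [c / len(values) + eps for c in counts]
    norm = sum(probs)
    return [p / norm for p in probs]

def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) for two histograms with the same bins."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

def with_noise(rows, seed=0):
    """Return copies of the rows with an extra uniformly random 'noise' feature."""
    rng = random.Random(seed)
    return [dict(r, noise=rng.random()) for r in rows]

def rank_features(rows, labels, bins=10):
    """Rank feature names by the distance between their per-class histograms.

    Assumes labels are 0 (low) or 1 (high) and both classes are non-empty.
    """
    scores = {}
    for name in rows[0]:
        col = [r[name] for r in rows]
        lo, hi = min(col), max(col)
        low = [v for v, y in zip(col, labels) if y == 0]
        high = [v for v, y in zip(col, labels) if y == 1]
        scores[name] = symmetric_kl(histogram(low, lo, hi, bins),
                                    histogram(high, lo, hi, bins))
    return sorted(scores, key=scores.get, reverse=True)
```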

  13. We evaluated the features using a naïve Bayes classifier
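
A rough sketch of how a ranked feature subset might be scored with a naïve Bayes classifier is given below. The slides do not say how continuous sensor values were modeled, so the Gaussian likelihood here is an assumption, as are the function names and the use of a held-out test split.

```python
import math
from statistics import mean, pvariance

def fit(rows, labels):
    """Estimate a class prior and per-feature mean/variance for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        # Small variance floor keeps constant features from dividing by zero.
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*members)]
        model[cls] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one instance."""
    def log_posterior(cls):
        prior, stats = model[cls]
        total = math.log(prior)
        for x, (mu, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

def error_rate(train, train_y, test, test_y, columns):
    """Test error of naive Bayes restricted to the feature indices in `columns`."""
    def pick(rows):
        return [[r[c] for c in columns] for r in rows]
    model = fit(pick(train), train_y)
    wrong = sum(predict(model, r) != y for r, y in zip(pick(test), test_y))
    return wrong / len(test)
```

Calling `error_rate` with the top 1, 2, 3, ... features from each ranking gives one error curve per selection method, which is one way to compare the methods.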

  14. We also considered the top ten features selected by the methods

  15. Several features are common across different methods • Multiple methods provide confidence in results
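
One simple way to find the features that several methods agree on is to count how often each feature appears in the methods' top-k lists. The sketch below assumes each method returns an ordered list of feature names; the `min_votes` threshold and the names in the usage note are hypothetical.

```python
from collections import Counter

def consensus(rankings, top_k=10, min_votes=3):
    """Features in the top_k list of at least min_votes methods, most votes first."""
    votes = Counter(f for ranked in rankings.values() for f in ranked[:top_k])
    return sorted((f for f, n in votes.items() if n >= min_votes),
                  key=lambda f: (-votes[f], f))
```

For example, with three methods that rank hypothetical sensors as `["s1", "s2", "s3"]`, `["s2", "s1", "s4"]`, and `["s2", "s5", "s1"]`, taking `top_k=2` and `min_votes=2` keeps `s2` (three votes) and `s1` (two votes).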

  16. Application 3: Classifying and characterizing orbits in Poincaré plots • I am using techniques from scientific data mining to assign one of four labels to an orbit and extract characteristics of separatrix and island chain orbits. • Collaboration with J. Breslau, N. Pomphrey, D. Monticello (PPPL), S. Klasky (ORNL)

  17. There are four classes of orbits – based on the location of the initial point • Island chain • Quasi-periodic • Stochastic • Separatrix

  18. Challenge: There is a large variation in the orbits of any one class [Figure: quasiperiodic orbits]

  19. Variation in island-chain orbits

  20. Variation in separatrix orbits [Figure panels: 1000 points, 5000 points]

  21. How do we extract representative features for an orbit? • Variation in the data makes it difficult to identify good features and extract them in a robust way • Issues with labels assigned to orbits • Next steps: characterizing island chains and separatrix orbits • Identifying missing orbits

  22. Application 4: Tracking blobs in fusion plasma • We are using image and video processing techniques to identify and track blobs in experimental data from NSTX to validate and refine theories of edge turbulence [Figure: frames at t, t+1, t+2; panels: denoised original, after removal of background, detection of blobs] • Collaboration with S. Zweben, R. Maqueda, and D. Stotler (PPPL)
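
The two later stages named on this slide, background removal and blob detection, can be sketched as below. The slides do not state which background model or detector was used, so the per-pixel temporal median and the fixed threshold with 4-connected labeling are stand-in assumptions, and the function names are hypothetical. Frames are plain nested lists of pixel values.

```python
from statistics import median

def remove_background(frames):
    """Subtract the per-pixel temporal median from every frame."""
    rows, cols = len(frames[0]), len(frames[0][0])
    bg = [[median(f[r][c] for f in frames) for c in range(cols)]
          for r in range(rows)]
    return [[[f[r][c] - bg[r][c] for c in range(cols)] for r in range(rows)]
            for f in frames]

def find_blobs(frame, threshold):
    """Centroids (row, col) of 4-connected regions of pixels above threshold."""
    rows, cols = len(frame), len(frame[0])
    seen = set()
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if frame[r][c] <= threshold or (r, c) in seen:
                continue
            stack, pixels = [(r, c)], []
            seen.add((r, c))
            while stack:  # flood fill one connected region
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and frame[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            centroids.append((sum(p[0] for p in pixels) / len(pixels),
                              sum(p[1] for p in pixels) / len(pixels)))
    return centroids
```

Tracking then amounts to matching centroids between consecutive frames, for instance by nearest distance, which this sketch leaves out.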

  23. Goal: understand the turbulence which causes leakage of the plasma • Requirements for fusion – high temperature and confined plasma • Fine-scale turbulence at the edge causes leakage of plasma from the center to the edge • Loss of confinement • Heat loss of plasma • Erosion or vaporization of the containment wall

  24. The Gas-Puff Imaging diagnostic is used to view the coherent structures • Turbulence in the form of density filaments highly elongated in the direction of the magnetic field • Inject a gas cloud in the torus, and capture the intersection of the cloud with the filament using a camera which views the filament along the magnetic field [Figure: GPI view, 16x32 cm]

  25. Data from GPI in NSTX • PSI-5 camera captures GPI images • 300-frame sequences taken at 250,000 frames/sec • 16-bit images with 64x64 pixels
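
The figures on this slide imply a very short, small sequence, as the arithmetic below shows. The size estimate assumes uncompressed pixels with no file headers.

```python
FRAMES = 300
FRAME_RATE_HZ = 250_000
WIDTH = HEIGHT = 64
BYTES_PER_PIXEL = 2  # 16-bit pixels

duration_ms = FRAMES / FRAME_RATE_HZ * 1000      # about 1.2 ms per sequence
frame_interval_us = 1_000_000 / FRAME_RATE_HZ    # 4 microseconds between frames
sequence_bytes = FRAMES * WIDTH * HEIGHT * BYTES_PER_PIXEL  # 2,457,600 bytes
```

Each sequence therefore covers roughly 1.2 ms of plasma evolution in about 2.4 MB of raw pixels, so the difficulty lies in the noise and variability of the images and not in their volume.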

  26. Why is this difficult? • coherent structures are poorly understood empirically and not understood theoretically • no known ground-truth • noisy images • variation within a sequence

  27. Example frames to segment (sequence 113734: frames 1-50)

  28. We are investigating several image segmentation methods • Immersion-Based: basic immersion, constrained watershed, watershed merging • Region Growing: seeded region growing, seed competition • Model-Based: 2-D Gaussian fit • Challenges: how do we select the parameters in an algorithm, how do we handle the variability in the data especially for longer sequences, how do the choices of algorithms and parameters influence the “science”, … • Ongoing work: see AHM 2007 slides
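
Of the methods listed, seeded region growing is the simplest to sketch. The version below grows a 4-connected region from a single seed and accepts a neighbor when its intensity is within `tol` of the current region mean. The acceptance rule, the tolerance parameter, and the function name are assumptions, and seed competition between several seeds is not shown.

```python
from collections import deque

def region_grow(image, seed, tol):
    """Return the set of (row, col) pixels grown from `seed`.

    A neighboring pixel joins the region when its value differs from the
    running mean of the region by at most `tol`.
    """
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if not (0 <= ny < rows and 0 <= nx < cols) or (ny, nx) in region:
                continue
            if abs(image[ny][nx] - total / len(region)) <= tol:
                region.add((ny, nx))
                total += image[ny][nx]
                queue.append((ny, nx))
    return region
```

The result depends on both the seed and `tol`, which is a small instance of the parameter-selection challenge raised on this slide.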

  29. Vision for the future • Meeting algorithm requirements of current applications • Robust extraction of feature vectors (orbit characterization) • Improved algorithms for image analysis (blob characterization) • Uncertainty quantification (how much can we trust the result?) • Meeting the science goals • Classification and characterization of Poincaré plots • Tracking the blobs in NSTX • Extraction of coherent structures in fluid and particle data and their non-linear interactions (GSEP) • Addressing requests from new applications – SNS, materials science, combustion, power grid, … • Deploy as requested
