Grid-enabled Collaborative Research Applications Internet2 Member Meeting Spring, 2003. Sara J. Graves Director, Information Technology and Systems Center University Professor, Computer Science Department University of Alabama in Huntsville Director, Information Technology Research Center
Grid-enabled Collaborative Research Applications Internet2 Member Meeting Spring, 2003 Sara J. Graves Director, Information Technology and Systems Center University Professor, Computer Science Department University of Alabama in Huntsville Director, Information Technology Research Center National Space Science and Technology Center 256-824-6064 sgraves@itsc.uah.edu http://www.itsc.uah.edu
“…drowning in data but starving for knowledge” • Data glut affects business, medicine, military, science • How do we leverage data to make BETTER decisions? • Figure labels: User Community; Information
Collaborative Research Applications • Enabling Technologies for Collaborative Research • Grid-Enabled Data Mining Services • Interchange Technology Mark-ups • Collaboration Tools • Collaborative Research Applications on the Grid • TeraGrid Expeditions • Linked Environments for Atmospheric Discovery • Propulsion Research: Rocket Engine Advancement Program 2
Data Mining • Automated discovery of patterns, anomalies from vast observational data sets • Derived knowledge for decision making, predictions and disaster response • ADaM – Algorithm Development and Mining System http://datamining.itsc.uah.edu
Mining Environment: When, Where, Who and Why? • WHERE: User Workstation; Data Mining Center; GRID • WHEN: Real Time; On-Ingest; On-Demand; Repeatedly • WHO: End Users; Domain Experts; Mining Experts • WHY: Event; Relationship; Association; Corroboration; Collaboration • Figure center label: Data Mining
Iterative Nature of the Data Mining Process • Figure labels, from bottom (data) to top (knowledge): DATA; PREPROCESSING; CLEANING and INTEGRATION; SELECTION and TRANSFORMATION; MINING; DISCOVERY; EVALUATION and PRESENTATION; KNOWLEDGE
ADaM Engine Architecture
• Data flow: Data; Translated Data; Preprocessed Data; Patterns/Models; Results
• Input: HDF; HDF-EOS; GIF; PIP-2; SSM/I Pathfinder; SSM/I TDR; SSM/I NESDIS Lvl 1B; SSM/I MSFC Brightness Temp; US Rain; Landsat; ASCII Grass Vectors (ASCII Text); Intergraph Raster; Others...
• Output: GIF Images; HDF-EOS; HDF Raster Images; HDF SDS; Polygons (ASCII, DXF); SSM/I MSFC Brightness Temp; TIFF Images; Others...
• Processing: Preprocessing
  • Selection and Sampling: Subsetting; Subsampling; Select by Value; Coincidence Search
  • Grid Manipulation: Grid Creation; Bin Aggregate; Bin Select; Grid Aggregate; Grid Select; Find Holes
  • Image Processing: Cropping; Inversion; Thresholding
  • Others...
• Processing: Analysis
  • Clustering: K Means; Isodata; Maximum
  • Pattern Recognition: Bayes Classifier; Min. Dist. Classifier
  • Image Analysis: Boundary Detection; Cooccurrence Matrix; Dilation and Erosion; Histogram Operations; Polygon Circumscript; Spatial Filtering; Texture Operations
  • Genetic Algorithms
  • Neural Networks
  • Others...
Mining Environments Multilevel Mining (ADaM) • Complete System (Client and Engine) • Mining Engine (User provides its own client) • Application Specific Mining Systems • Operations Tool Kit • Stand Alone Mining Algorithms • Data Fusion Distributed/Federated Mining • Distributed services • Distributed data • Chaining using Interchange Technologies On-board Mining (EVE) • Real time and distributed mining • Processing environment constraints
Grid-Enabled Data Mining Services • Distributed researchers, data sources, storage and computational resources in a secure environment • ADaM data mining modules as Open Grid Services Architecture (OGSA) services
Data Mining / Earth Science Collaboration: Tropical Cyclone Detection
• Input: Advanced Microwave Sounding Unit (AMSU-A) data
• Mining Plan:
  • Water cover mask to eliminate land
  • Laplacian filter to compute temperature gradients
  • Science algorithm to estimate wind speed
  • Contiguous regions with wind speeds above a desired threshold identified
  • Additional test to eliminate false positives
  • Maximum wind speed and location produced
• Results are placed on the web, made available to the National Hurricane Center and the Joint Typhoon Warning Center, and stored for further analysis http://pm-esip.msfc.nasa.gov/cyclone
• Figure labels: Data Archive; Calibration / Limb Correction / Converted to Tb; Mining Environment; Result; Hurricane Floyd; Knowledge Base; Further Analysis
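The mining plan on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the ADaM implementation: `estimate_wind` is a hypothetical stand-in for the science algorithm (the slide does not give it), `min_cells` is an assumed placeholder for the "additional test" against false positives, and the grid is a small list of brightness temperatures rather than real AMSU-A data.

```python
from collections import deque


def laplacian(grid):
    """4-neighbour discrete Laplacian; border cells are left at 0.0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (grid[r - 1][c] + grid[r + 1][c]
                         + grid[r][c - 1] + grid[r][c + 1]
                         - 4.0 * grid[r][c])
    return out


def estimate_wind(gradient):
    """Hypothetical stand-in for the science algorithm (assumption)."""
    return 10.0 * abs(gradient)


def detect_cyclones(tb, water_mask, wind_threshold, min_cells=2):
    """Return one record per contiguous over-water region above the threshold."""
    rows, cols = len(tb), len(tb[0])
    lap = laplacian(tb)
    # Land cells (mask False) are eliminated by forcing their wind to 0.
    wind = [[estimate_wind(lap[r][c]) if water_mask[r][c] else 0.0
             for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    events = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or wind[r][c] <= wind_threshold:
                continue
            # Breadth-first search collects one 4-connected region.
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and wind[ny][nx] > wind_threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_cells:
                continue  # assumed false-positive test: drop tiny regions
            peak = max(region, key=lambda p: wind[p[0]][p[1]])
            events.append({"max_wind": wind[peak[0]][peak[1]],
                           "location": peak,
                           "cells": len(region)})
    return events
```

Calling `detect_cyclones(tb, water_mask, wind_threshold)` returns a list of dictionaries, each holding the maximum estimated wind speed, its grid location, and the region size.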
Data Mining / Earth Science Collaboration: Classification Based on Texture Features Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery • Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds • Comparison based on • Accuracy of detection • Amount of time required to classify
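As a rough illustration of what a texture feature is, the sketch below computes a grey-level co-occurrence matrix and three common statistics from it. The slide does not say which texture features were used, so the choice of co-occurrence statistics is an assumption (the ADaM module list does include a Cooccurrence Matrix operation), and the function names are hypothetical.

```python
def cooccurrence(image, levels, dy=0, dx=1):
    """Normalised grey-level co-occurrence matrix for one pixel offset.

    image holds integer grey levels in range(levels); (dy, dx) is the
    offset between the two pixels of each counted pair.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def texture_features(p):
    """Contrast, energy and homogeneity of a normalised co-occurrence matrix."""
    size = range(len(p))
    return {
        "contrast": sum(p[i][j] * (i - j) ** 2 for i in size for j in size),
        "energy": sum(p[i][j] ** 2 for i in size for j in size),
        "homogeneity": sum(p[i][j] / (1 + abs(i - j)) for i in size for j in size),
    }
```

A smooth patch gives zero contrast, while a rapidly alternating patch gives high contrast; feature vectors of this kind are what a classifier would then compare for accuracy and run time.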
Parallel Version of Cloud Extraction
• GOES images can be used to recognize cumulus cloud fields
• Cumulus clouds are small and do not show up well in 4 km resolution IR channels
• Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
• Three edge detection filters are used together to detect cumulus clouds, an approach that lends itself to implementation on a parallel cluster
• Figure labels: GOES Image; Master; Slave 1, Slave 2, Slave 3; Laplacian Filter; Sobel Horizontal Filter; Sobel Vertical Filter; Energy Computation (four boxes); Classifier; Cloud Image; Cumulus Cloud Mask
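The master/worker layout in the figure can be sketched with Python's standard `concurrent.futures`. This is a toy version under assumptions: threads stand in for cluster nodes, "energy" is taken to be the squared filter response per pixel, and the classifier is a simple hypothetical threshold on the summed energies. None of these details are given on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Standard 3x3 kernels; each worker gets one of them.
KERNELS = {
    "laplacian": [[0, 1, 0], [1, -4, 1], [0, 1, 0]],
    # Responds to horizontal edges (intensity change from top to bottom).
    "sobel_horizontal": [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
    # Responds to vertical edges (intensity change from left to right).
    "sobel_vertical": [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
}


def filter_energy(image, kernel):
    """Worker task: apply a 3x3 kernel and return squared response per pixel.

    Border pixels are left at 0.0 because the kernel does not fit there.
    """
    rows, cols = len(image), len(image[0])
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            resp = sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
                       for i in range(3) for j in range(3))
            energy[r][c] = float(resp * resp)
    return energy


def cumulus_mask(image, threshold):
    """Master task: fan out the three filters, then classify each pixel."""
    with ThreadPoolExecutor(max_workers=len(KERNELS)) as pool:
        futures = [pool.submit(filter_energy, image, k)
                   for k in KERNELS.values()]
        energies = [f.result() for f in futures]
    rows, cols = len(image), len(image[0])
    # Hypothetical classifier: a pixel is "cloud" if total energy is high.
    return [[sum(e[r][c] for e in energies) > threshold
             for c in range(cols)] for r in range(rows)]
```

On a real cluster the same fan-out would more likely use separate processes or a message-passing library, since the three filters are independent and only the final classification needs all results.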
Data Mining / Earth Science Collaboration: Detecting Signatures • Detecting mesocyclone signatures from Radar data • Science Rationale: Mesocyclone is an indicator of Tornadic activity • Developing an algorithm based on wind velocity shear signatures • Improve accuracy and reduce false alarm rates
Data Mining / Space Science Collaboration: Boundary Detection and Quantification • Analysis of polar cap auroras in large volumes of spacecraft UV images • Scientific Rationale: Indicators to predict geomagnetic storms, which can damage satellites and disrupt radio connections • Developing different mining algorithms to detect and quantify the polar cap boundary • Figure label: Polar Cap Boundary
Data Mining / BioInformatics Collaboration: Genome Patterns • Text Pattern Recognition: Used to search for text patterns in bioscience data as well as other text documents • Figure labels: Genome DB; Mining Engine (Input Modules, Analysis Modules, Output Modules); Event/Relationship Search System; Knowledge base; Mining Results: MCSs; Scientists
Sensor Data Characteristics • Many different formats, types and structures • Different states of processing (raw, calibrated, derived, modeled or interpreted) • Enormous volumes • Heterogeneity leads to data usability problems
Interchange Technologies: Accessing Heterogeneous Data
• Earth science data comes in: different formats, types and structures; different states of processing (raw, calibrated, derived, modeled or interpreted); enormous volumes
• Heterogeneity leads to data usability problems
• One approach: Standard data formats
  • Difficult to implement and enforce
  • Can’t anticipate all needs
  • Some data can’t be modeled or is lost in translation
  • The cost of converting legacy data
• A better approach: Interchange Technologies (Earth Science Markup Language)
• The Problem (figure labels): DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3; FORMAT CONVERTER; READER 1; READER 2; APPLICATION
• The Solution (figure labels): DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3; one ESML FILE per format; ESML LIBRARY; APPLICATION
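The idea of an external description file driving a generic reader can be sketched as follows. The XML vocabulary here (`dataDescription`, `field`, `byteOrder`, `scale`) is invented for illustration and is not the actual ESML schema; the point is only that the application calls one library routine and the description, not the application code, knows the binary layout.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description file for a simple binary record format.
# This is NOT real ESML markup; element and attribute names are assumptions.
DESCRIPTION = """
<dataDescription byteOrder="little">
  <field name="latitude" type="float32"/>
  <field name="longitude" type="float32"/>
  <field name="brightness_temp" type="int16" scale="0.01"/>
</dataDescription>
"""

# Map the description's type names to struct format codes.
STRUCT_CODES = {"float32": "f", "int16": "h", "int32": "i", "float64": "d"}


def build_reader(description_xml):
    """Turn a description into a function that decodes raw bytes."""
    root = ET.fromstring(description_xml.strip())
    prefix = "<" if root.get("byteOrder") == "little" else ">"
    fields = [(f.get("name"),
               STRUCT_CODES[f.get("type")],
               float(f.get("scale", "1")))
              for f in root.findall("field")]
    record = struct.Struct(prefix + "".join(code for _, code, _ in fields))

    def read(raw):
        # Each fixed-size record becomes a dict of scaled values.
        for values in record.iter_unpack(raw):
            yield {name: value * scale
                   for (name, _, scale), value in zip(fields, values)}

    return read
```

Supporting a second format then means writing a second description, not a second reader, which is the contrast the Problem and Solution figures draw.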
What is ESML? • It is a specialized markup language for Earth Science metadata based on XML - NOT another data format. • It is a machine-readable and -interpretable representation of the structure, semantics and content of any data file, regardless of data format • ESML description files contain external metadata that can be generated by either data producer or data consumer (at collection, data set, and/or granule level) • ESML provides the benefits of a standard, self-describing data format (like HDF, HDF-EOS, netCDF, geoTIFF, …) without the cost of data conversion • ESML is the basis for core Interchange Technology that allows data/application interoperability • ESML complements and extends data catalogs such as FGDC and GCMD by providing the use/access information those directories lack. http://esml.itsc.uah.edu
Components of the ESML Interchange Technology
ESML consists of:
• (1) Markups: ESML FILE, an external description file for datasets or formats
• (2) Rules for the markups: ESML SCHEMA, the rules that govern the description of the data files
• (3) Middleware for automation: ESML LIBRARY, which parses and interprets the description file and figures out how to read the data
• Figure labels: DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3, OTHER FORMATS; ESML EDITOR; ESML FILE; ESML SCHEMA; ESML LIBRARY; ESML DATA BROWSER; ADaM DATA MINING SYSTEM; OTHER APPLICATIONS
ESML in Numerical Modeling Insolation Products GOES Skin Temp Soundings, Others ESML file ESML file ESML file Network ESML Library NUMERICAL WEATHERMODELS (MM5, ETA, RAMS) • Purpose: • Use ESML to incorporate observational data into the numerical models for simulation • Scientists can: • Select remote files across the network • Select different observational data to increase the model prediction accuracy Prediction
Collaboration Tools Technologies to coordinate complex projects • CAMEX-4 campaign • Data acquisition and integration from multiple platforms and instruments for quick exploitation • Intra-project communications before, during, and after CAMEX campaigns • Collaborators included NASA, NOAA, USAF, and multiple universities http://camex.msfc.nasa.gov
CAMEX-4 Distributed Mission Coordination • Figure labels: Coordination Clearinghouse (RDBMS, Web-based interface); NASA managers review status; Experiment PI; Data management; Mission Managers; Forecasters; Radars; NASA Aircraft; USAF Aircraft; NOAA Aircraft; Aircraft Crew: maintenance and report status
Modeling Environment for Atmospheric Discovery (MEAD): Use of the TeraGrid Infrastructure • MEAD will develop/adapt a cyberinfrastructure that will enable simulation, datamining, and visualization of hurricanes and storms • MEAD will integrate model and grid workflow management, data management, model coupling, and analysis/mining of large, ensemble datasets • Participants: Argonne National Lab; Georgia Tech University; Indiana University; Lawrence Berkeley National Lab; NCSA; NOAA/FSL; NOAA/NSSL; Northwestern University; Ohio State University; Oklahoma University; Portland State University; Rice University; Rutgers; UAH; UCAR; University of Wisconsin; University of Minnesota
Primary MEAD Software Components • WRF Model (Weather Research and Forecasting) • ROMS Model (Regional Ocean Modeling System) • Coupled WRF/ROMS Model • D2K (Data to Knowledge) • ADaM (Algorithm Development and Mining System) • Visualization Engines (NCAR Graphics, Vis5D, IDV-VisAD, HVR, VTK) • netCDF, HDF5, ESML • Middleware (Globus, JavaCog, GridFTP) • Metadata Catalogue Service
Example MEAD Workflow • Initial Setup: Initial Data and Parameters for each set of models • Model Execution: Multiple WRF Models (Weather) and Multiple ROMS Models (Ocean), with inter-model communications • Post Run Analysis: Model Results go to Data Mining (ADaM) and Visualization • Need the Grid to support the huge computational, data storage and post analysis requirements
Linked Environments for Atmospheric Discovery (LEAD) Create for the university community an integrated, scalable framework for use in accessing, preparing, assimilating, predicting, managing, mining/analyzing, and displaying a broad array of meteorological and related information independent of format and physical location. Collaborators: • University of Oklahoma • University of Alabama in Huntsville • UCAR/Unidata • Indiana University • University of Illinois/NCSA • Millersville University • Howard University • Colorado State University
LEAD Architecture
• MyLEAD Portal
• MyLEAD Virtual Environment: Interchange Technologies; Workflow Orchestration; Semantics for data and services; Personal Data Space
• Application Services: Visualization tools; Data Mining; Models; Others…
• Middleware: Data Management; Workflow Management; Monitoring
• Grid and Web infrastructure: Resource Allocation; Scheduling; Security; Others…
• Distributed Resources: national supercomputer facilities; pools of workstations; scientific instruments; tertiary storage; clusters
Collaborative Environment for Propulsion Research: Rocket Engine Advancement Program 2 • Consortium of propulsion research centers: Auburn University; Purdue University; Pennsylvania State University; Tuskegee University; University of Alabama in Huntsville; University of Tennessee; NASA Marshall Space Flight Center; NASA Glenn Research Center • Grid configuration will make distributed computational and data resources available to researchers without having to negotiate separate access to each resource. • Linking or integration of multiple distributed experiment steps into a single investigation for more timely results and analysis. • Will rely on the security capabilities of the Grid due to the sensitive nature of the propulsion research.
Collaborative Environment for Propulsion Research: Rocket Engine Advancement Program 2 • Figure labels: REAP2 User Portal; REAP2 Grid Portal; Cluster(s); Supercomputer; Test Equipment; Data and Results
Evolution of Frameworks for Advanced Applications • Changing Computational Landscape • GRIDS • Clusters • Web Services • Pervasive Computing • On-Board Processing • Middleware for applications on GRID/Clusters • Automate parallelization of mining tasks • Estimate resource requirements from the computational complexity of the algorithms • Federated Model for Mining • Individual components that can be distributed and can execute across different platforms
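One way to read "estimate resource requirements from computational complexity" is as a simple operation-count model, sketched below. Everything here is an assumption for illustration: the cost formulas are the textbook ones for K-means and a minimum-distance classifier, `ops_per_second` is a made-up calibration figure, and the model assumes ideal speedup with no communication cost.

```python
import math

# Approximate operation counts: n samples, k clusters/classes,
# d features, iters iterations.
COMPLEXITY = {
    "kmeans": lambda n, k, d, iters: n * k * d * iters,
    "min_distance_classifier": lambda n, k, d, iters: n * k * d,
}


def estimate_seconds(algorithm, n, k, d, iters, ops_per_second, workers=1):
    """Estimated wall-clock time, assuming perfect parallel speedup."""
    ops = COMPLEXITY[algorithm](n, k, d, iters)
    return ops / (ops_per_second * workers)


def workers_needed(algorithm, n, k, d, iters, ops_per_second, deadline_seconds):
    """Smallest worker count that meets the deadline under the same model."""
    ops = COMPLEXITY[algorithm](n, k, d, iters)
    return max(1, math.ceil(ops / (ops_per_second * deadline_seconds)))
```

Middleware could use an estimate like this to decide how many cluster or Grid nodes to request before splitting a mining task, then refine the calibration from measured run times.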