Cyberinfrastructure for Coastal Forecasting and Change Analysis. Gagan Agrawal Hakan Ferhatosmanoglu Xutong Niu Ron Li Keith Bedford The Ohio State University. Context. New Award from Office of Cyberinfrastructure (OCI) Under Cyberinfrastructure for Environmental Observatories Program
Context • New Award from Office of Cyberinfrastructure (OCI) • Under Cyberinfrastructure for Environmental Observatories Program • September 2006 – August 2009, total amount $1,400,000 • Involves 2 Computer Scientists and 2 Environmental Scientists • G. Agrawal (PI) – Grid Middleware • H. Ferhatosmanoglu – Databases • K. Bedford: Great Lakes Now/Forecasting • R. Li: Coastal Erosion Analysis
Project Premise • Limitations of Current Environmental Observation Systems • Tightly coupled systems • No reuse of algorithms • Very hard to experiment with new algorithms • Closely tied to existing resources • Our claim • Emerging trends towards web services and grid services can help
Challenges • Existing Grid Middleware Systems have not considered • Processing of Streaming Data • Data Integration Issues • The applications involved need techniques for multi-modal data fusion, query planning, and data mining • Need to implement them as grid or web services
Application Details: Great Lakes Now/Forecasting • GLOS: Great Lakes Observing System • Co-designer/project manager: K. Bedford, a co-PI on this project • Collaboration with NOAA • Limitations: Hard-wired • Cannot incorporate new streams or algorithms • Goal: Create an Implementation using our Middleware for Streaming Data
Application Details: Coastal Erosion Prediction and Analysis • Focus: Erosion along Lake Erie Shore • Serious problem • Substantial Economic Losses • Prediction requires data from • Variety of Satellites • In-situ sensors • Historical Records • Challenges • Analyzing distributed data • Data Integration/Fusion
Middleware Developed at Ohio State • Automatic Data Virtualization Framework • Enabling processing and integration of data in low-level formats • GATES (Grid-based AdapTive Execution on Streams) • Processing of distributed data streams • FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid) • Supporting scalable data analysis on remote data
Automatic Data Virtualization: Motivation • Access mechanisms for remote repositories • Complex low-level formats make accessing and processing data difficult • Main desired functionality • Ability to select, download, and process a subset of data • Sensor Data • Again, low-level data • Need to convert formats • Need a flexible architecture
Data Virtualization • [Figure: a dataset is exposed through a Data Virtualization (an abstract view of data), which is accessed through a Data Service] • Definitions from the Global Grid Forum’s DAIS working group: • A Data Virtualization describes an abstract view of data. • A Data Service implements the mechanism to access and process data through the Data Virtualization
Our Approach: Automatic Data Virtualization • Automatically create data services • A new application of compiler technology • A metadata descriptor describes the layout of data on a repository • An abstract view is exposed to the users • Two implementations: • Relational/SQL-based • XML/XQuery-based
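To make the descriptor idea concrete, here is a minimal sketch in Python. It assumes a hypothetical descriptor that lists the fields of a fixed-size binary record, and it generates a small data service that exposes the records as rows that can be filtered and projected. The descriptor format, field names, and the `make_data_service` function are illustrative assumptions, not the project's actual descriptor language or generated code.

```python
import struct
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical metadata descriptor: (field name, struct format code) for one
# fixed-size binary record. This is an assumption made for illustration.
DESCRIPTOR: List[Tuple[str, str]] = [
    ("station_id", "i"),   # 4-byte integer
    ("time", "d"),         # 8-byte float
    ("water_level", "f"),  # 4-byte float
]

Row = Dict[str, float]


def make_data_service(descriptor: List[Tuple[str, str]]):
    """Build a select() function for the record layout the descriptor gives."""
    names = [name for name, _ in descriptor]
    # "<" means little-endian with no padding between fields.
    record = struct.Struct("<" + "".join(code for _, code in descriptor))

    def select(
        raw: bytes,
        where: Callable[[Row], bool] = lambda row: True,
        columns: Optional[List[str]] = None,
    ) -> Iterator[Row]:
        cols = columns or names
        # Walk the low-level bytes one record at a time.
        for offset in range(0, len(raw) - record.size + 1, record.size):
            row = dict(zip(names, record.unpack_from(raw, offset)))
            if where(row):
                # Expose only the requested columns of the abstract view.
                yield {c: row[c] for c in cols}

    return select
```

A caller would then work against the abstract view only, for example `select(raw, where=lambda r: r["water_level"] > 2.0, columns=["station_id"])`, without knowing the byte layout.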
Streaming Data Model • Continuous data arrival and processing • Emerging model for data processing • Sources that produce data continuously: sensors, long-running simulations • Critical in Environmental Observatories • Active topic in many computer science communities • Databases • Data Mining • Networking ...
Need for a Grid-Based Stream Processing Middleware • Application developers interested in data stream processing • Would like to have abstracted away • Grid standards and interfaces • Adaptation function • Would like to focus on algorithms only • GATES is a middleware for • Grid-based • Self-adapting Data Stream Processing
Adaptation for Real-time Processing • Analysis on streaming data is approximate • The trade-off between accuracy and execution rate can be captured by certain parameters (adaptation parameters) • Sampling Rate • Size of summary structure • Application developers can expose these parameters and a range of values
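The idea of exposing a parameter with a range of values can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `AdaptationParameter` class, the `load` signal, and the fixed-step adjustment rule are hypothetical and are not the GATES API or its adaptation algorithm.

```python
from dataclasses import dataclass


@dataclass
class AdaptationParameter:
    """A tunable value the developer exposes together with its legal range."""
    name: str
    low: float
    high: float
    value: float

    def adjust(self, load: float, step: float = 0.25) -> float:
        """Move the value one step based on load, staying within [low, high].

        Assumption: load is arrival rate divided by processing rate, so
        load > 1.0 means the stage is falling behind.
        """
        if load > 1.0:
            self.value -= step   # trade accuracy for speed
        elif load < 1.0:
            self.value += step   # spare capacity: regain accuracy
        self.value = min(self.high, max(self.low, self.value))
        return self.value
```

For example, a sampling rate declared as `AdaptationParameter("sampling_rate", 0.25, 1.0, 0.5)` would drop to 0.25 under overload and never go lower, because the developer fixed that bound.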
FREERIDE-G: Supporting Distributed Data-Intensive Science • [Figure: a User, a Data Repository Cluster, and a Compute Cluster, with a "?" marking the open question of how they are connected]
Challenges for Application Development • Analysis of large amounts of disk resident data • Incorporating parallel processing into analysis • Processing needs to be independent of other elements and easy to specify • Coordination of storage, network and computing resources required • Transparency of data retrieval, staging and caching is desired
FREERIDE-G Goals • Support High-End Processing • Enable efficient processing of large-scale data mining computations • Ease Use of Parallel Configurations • Support shared and distributed memory parallelization starting from a common high-level interface • Hide Details of Data Movement and Caching • Data staging and caching (when feasible/appropriate) need to be transparent to the application developer
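One way a single high-level interface can serve different parallel configurations is a reduction pattern: the developer writes a per-chunk function and a function that merges partial results, and the runtime decides where chunks run. The Python sketch below assumes that pattern; the names `local_reduce`, `combine`, and `run` are hypothetical and are not the FREERIDE-G interface, and a thread pool stands in for the compute cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Sequence, Tuple

Partial = Tuple[int, float]  # (count, sum) summary of the data seen so far


def local_reduce(chunk: Sequence[float]) -> Partial:
    """Developer-supplied: summarize one chunk independently of the others."""
    return (len(chunk), sum(chunk))


def combine(a: Partial, b: Partial) -> Partial:
    """Developer-supplied: merge two partial summaries."""
    return (a[0] + b[0], a[1] + b[1])


def run(chunks: List[Sequence[float]], workers: int = 2) -> Partial:
    """Runtime side (illustrative): process chunks in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_reduce, chunks))
    return reduce(combine, partials, (0, 0))
```

Because `local_reduce` never looks at other chunks, the same two functions could be scheduled on threads of one node or on separate nodes; only `run` would change.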
Data Analysis Services • Multi-modal Multi-Sensor Data Integration • Built on our Data Virtualization Framework • Query Planning Service • Feature Extraction: Integration with Grid Metadata Catalogs • Remote Mining of Spatio-Temporal Data • Built using FREERIDE-G • Mining algorithms for Data Streams • Built using GATES
Looking For • Feedback on our approach • Synergy with other efforts • Lessons learnt by others