The Future of MOCHA Nick Roussopoulos October 5, 2001
The Problem
Distributed and heterogeneous data sources
• Data sources for an enterprise are:
  • Distributed
    • Internet, intranets, extranets
  • Heterogeneous
    • Web servers, relational databases, file systems
  • Mission-critical
    • Weather service, ocean temperature, stock status, …
  • Costly to replace or upgrade
    • Risk of breakage and loss of investment
The Problem
High volume access from everywhere
[Figure: many clients reaching Oracle 8i, Informix, XML data, and text data sources over the Internet]
Client-Server: 2-tier architecture
Complex FAT clients: bad idea
[Figure: clients connecting over the Internet directly to Oracle 8i, Informix, XML data, and text data]
Middleware: 3-tier architecture
Thin & fit clients
[Figure: clients connecting over the Internet to an Integration Server with a Catalog, which reaches Oracle 8i, Informix, XML data, and text data through Translators]
Nice but…
• Most middleware solutions are static
  • Not flexible for dynamic environments
  • Not scalable to hundreds of client and server sites
• Development cost is high
  • One-site-at-a-time at a fixed cost
• Maintenance cost is high
  • Upgrades are practically redevelopments
A dynamic world needs code extensibility & auto-deployment
• Need for user-defined types and functions
  • Polygon
  • Composite() – image aggregation
• Porting and manual installation of code (C/C++)
  • Operating system
  • Hardware platform
• High cost of code maintenance
  • Updates on all platforms
  • Version management
• Security on hostile platforms
Code Deployment Problem
Not scalable
[Figure: clients, the Integration Server with its Catalog, and the Translators in front of Oracle 8i, Informix, XML data, and text data, connected over the Internet]
Query Processing
• Query execution options
  • Limited by site-dependent software
  • Composite() – must be ported before use
• Most processing done at the Integration Server
  • Powerful Data Servers are under-utilized
    • I/O nodes
• Excessive data movement over the network
  • Network bottleneck
  • Slow Internet access
Query Processing Problem
Inefficient & not scalable
[Figure: clients, the Integration Server with its Catalog, and the Translators in front of Oracle 8i, Informix, XML data, and text data; transfers across the Internet are labeled 100MB and 200MB]
Solution: MOCHA
Middleware Based On a Code SHipping Architecture
MOCHA Solution: Ship Java Code
No code porting & no maintenance
[Figure: a client, the QPC with its Code Repository and Catalog, and DAPs at the Informix and Oracle sites, connected over the Internet; sites are labeled Maryland, Texas, and Virginia; the shipped code is labeled "Mochlets" and the query is marked Q along its path]
Query:
    Select location, Composite(image)
    From Rasters
    Where week BETWEEN t1 and t2
    Group By location
MOCHA Solution: Filter Data @ Source
No bandwidth waste
[Figure: a client, the QPC with its Code Repository and Catalog, and DAPs at the Informix and Oracle sites, connected over the Internet; sites are labeled Maryland, Texas, and Virginia; tuple volumes at the data sources are labeled 100MB and 200MB, and result transfers are labeled 150KB, 200KB, and 350KB]
Query:
    Select location, Composite(image)
    From Rasters
    Where week BETWEEN t1 and t2
    Group By location
Software architecture
[Figure: Client, QPC with Code Repository and Catalog, and two DAPs, one in front of an OS file system and one in front of a DBMS]
QPC: The Query Processing Coordinator
QPC controls and coordinates query execution
[Figure: QPC components]
• Client API
• Query Parser
• Query Optimizer
• Catalog Manager
• Execution Engine
• SQL & XML Proc. Interface
• Code Loader
• DAP Access API
[Figure: external elements connected to the QPC: Code Repository, XML Catalog, DAP]
DAP: The Data Access Provider
DAP provides QPC with remote access to the data
[Figure: DAP components]
• DAP Access API
• Control Module
• Execution Engine
• SQL & XML Proc. Interface
• Code Loader
• Data Source Access Layer
[Figure: interfaces shown between the DAP and the Data Source: JDBC, I/O API, DOM, JNI]
Data Server: Storage System
• Stores and manages the data sets
  • Database, web server, file system, XML repository
Processing a Query in MOCHA
Table Rasters: location, image, week, band
Query:
    Select location, Composite(image)
    From Rasters
    Where week BETWEEN t1 and t2
    Group By location
• Query Parsing
• Resource Discovery
• Query Optimization
• Metadata and Control Exchange
• Code Deployment Phase
• Query Execution
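To make the running example concrete, here is a minimal Python sketch of what the query computes. It is not MOCHA code. The slides do not define Composite(), so it is assumed here to be a pixel-wise maximum over equally sized images; the row layout and the names `composite` and `run_query` are hypothetical.

```python
from collections import defaultdict

def composite(images):
    """Assumed stand-in for Composite(): pixel-wise maximum of equal-length images."""
    return [max(pixels) for pixels in zip(*images)]

def run_query(rasters, t1, t2):
    """Evaluate: Select location, Composite(image) From Rasters
    Where week BETWEEN t1 and t2 Group By location."""
    groups = defaultdict(list)
    for row in rasters:
        if t1 <= row["week"] <= t2:  # BETWEEN is inclusive on both ends
            groups[row["location"]].append(row["image"])
    return {location: composite(images) for location, images in groups.items()}

# Made-up rows with the columns of table Rasters; images are tiny pixel lists.
rasters = [
    {"location": "Maryland", "week": 1, "band": 1, "image": [1, 5, 2]},
    {"location": "Maryland", "week": 2, "band": 1, "image": [4, 0, 3]},
    {"location": "Texas", "week": 2, "band": 1, "image": [7, 7, 7]},
    {"location": "Texas", "week": 9, "band": 1, "image": [9, 9, 9]},
]
result = run_query(rasters, 1, 3)
```

Each location yields one composite image regardless of how many input images matched, which is why this aggregate is a natural candidate for evaluation close to the data.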
Plan Generation
[Figure: clients, the QPC with its Code Repository and Catalog, a coordination thread and execution threads, and two DAPs in front of Informix and Oracle; running query: Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location]
Automatic Code Deployment
[Figure: same components and query as on the Plan Generation slide]
Data Processing
[Figure: same components and query as on the Plan Generation slide]
Features of MOCHA
• Automatic code deployment
  • “Plug-N-Play”
  • No system-wide installations
• Metadata and Schema Mapping framework
  • XML, RDF
  • Easy to exchange and map schemas
  • Semi-automatic mapping
• Query optimization based on code shipping
  • Reduce data movement overhead
    • Filters at the source
    • Expands at the client
  • Metrics for code (operator) placement
  • Optimization for selection, union and join plans
MOCHA Demo: Global Land Cover Facility
• Integrates the following DAP sites
  • University of New Hampshire (Webster), NASA GSFC, UMD-CS, UMD-Geography, UMD-UMIACS SP-2 HPSS
• GLCF hosts the QPC
• Operations supported:
  • Coverage queries
  • Visualization of preview images for
    • Data sets MODIS, TM, AVHRR
    • GIS Features
  • Dynamic sub-setting of TM scenes
  • Composites of GIS Features and AVHRR images
Multi-Sensor Analysis of the Los Alamos Fire Event Using MOCHA
• Data synergy and multi-resolution instrument analysis using MOCHA
  • Access data residing at various data sources
  • Utilize image processing tools
• Fire analysis required a multi-resolution approach
  • MOCHA is independent of instrument or resolution specifics
  • High resolution: IKONOS and TM data
  • Moderate resolution: 250m MODIS
  • Coarse resolution: AVHRR and DMSP
MOCHA Search Utility
MOCHA Search Utility (cont’d)
MOCHA Search Utility (cont’d)
MOCHA Query Results
MOCHA ETM+ Subsetting Utility
May 9, 2000 Los Alamos (Bands 1,2,3)
May 9, 2000 Los Alamos (Bands 7,5,4)
Multi-Sensor Query
Tabular Query Results
MODIS: May 11, 2000: During Fire
MODIS: May 24, 2000: After Fire
DMSP: Night Visibility of Fire
IKONOS 4m resolution
IKONOS 4m Subset
IKONOS 1m resolution
IKONOS 1m Subset
MOCHA Metadata Publishing Framework
• Provides information about system resources
  • Data sources
  • Schemas and mappings
  • User-defined types and functions
• Automates operation of MOCHA
• Incremental system growth
  • Neither fixed nor hardwired parameters
  • No extension by re-compilation
• Share metadata with others (Internet)
  • Machine-readable form
MOCHA Catalog Organization
• Metadata about “resources”
  • Local and global tables
  • UDF data types and operators
  • Schema mapping rules
  • DAPs
• Each resource has a Uniform Resource Identifier (URI)
  • Global namespace
  • e.g.: mocha://cs1.umd.edu/EarthSci/Polygon
• Modeled with RDF, serialized with XML
  • Easy to understand, use and exchange
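A resource URI of this form can be taken apart with a standard URL parser. The sketch below uses Python's `urllib.parse`; the split into host, collection, and resource name is an assumed reading of the example URI, not a documented MOCHA convention, and `parse_mocha_uri` is a hypothetical helper.

```python
from urllib.parse import urlparse

def parse_mocha_uri(uri):
    """Split a mocha:// URI into host, collection path, and resource name.

    The three-part interpretation is an assumption made for illustration.
    """
    parts = urlparse(uri)
    if parts.scheme != "mocha":
        raise ValueError(f"not a mocha URI: {uri!r}")
    segments = [s for s in parts.path.split("/") if s]
    if not parts.netloc or not segments:
        raise ValueError(f"URI needs a host and a resource name: {uri!r}")
    return {
        "host": parts.netloc,
        "collection": "/".join(segments[:-1]),
        "name": segments[-1],
    }

polygon = parse_mocha_uri("mocha://cs1.umd.edu/EarthSci/Polygon")
```

Because the host is part of the identifier, two sites can publish a type with the same short name without colliding in the global namespace.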
RDF Model: Data Types
[Figure: RDF graph for the resource mocha://cs1.umd.edu/EarthSci/Raster, with these properties]
• mocha:Type: Raster
• mocha:Class: Raster.class
• mocha:Repository: cs1.umd.edu/EarthSci
• mocha:Size: 1 megabyte
• mocha:Creator: user1@cs.umd.edu
XML Serialization: Data Types
    <rdf:Description about="mocha://cs1.umd.edu/EarthSci/Raster">
      <mocha:Type>Raster</mocha:Type>
      <mocha:Class>Raster.class</mocha:Class>
      <mocha:Repository>cs1.umd.edu/EarthSci</mocha:Repository>
      <mocha:Size>1 MB</mocha:Size>
      <mocha:Creator>user1@cs1.umd.edu</mocha:Creator>
    </rdf:Description>
• W3C Standards
• Easy to specify using GUI tools
• Easy to exchange
• Crawlers can harvest it
• Stored in
  • DB
  • File System
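Since the catalog entry is plain XML, any XML parser can read it. The following sketch uses Python's `xml.etree.ElementTree`. The slide shows only the `rdf:Description` element, so the enclosing `rdf:RDF` element and the namespace declarations are added here as assumptions; the `mocha` namespace URI is a placeholder, and `read_types` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MOCHA_NS = "http://example.org/mocha#"  # placeholder; the slides give no namespace URI

# The rdf:RDF wrapper and xmlns declarations are assumed so the fragment parses.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:mocha="{MOCHA_NS}">
  <rdf:Description about="mocha://cs1.umd.edu/EarthSci/Raster">
    <mocha:Type>Raster</mocha:Type>
    <mocha:Class>Raster.class</mocha:Class>
    <mocha:Repository>cs1.umd.edu/EarthSci</mocha:Repository>
    <mocha:Size>1 MB</mocha:Size>
    <mocha:Creator>user1@cs1.umd.edu</mocha:Creator>
  </rdf:Description>
</rdf:RDF>"""

def read_types(xml_text):
    """Return {resource URI: {property name: value}} for each rdf:Description."""
    root = ET.fromstring(xml_text)
    entries = {}
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        props = {}
        for child in desc:
            local_name = child.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            props[local_name] = (child.text or "").strip()
        entries[desc.get("about")] = props
    return entries

catalog = read_types(DOC)
```

A component that needs to load the Raster type could look up `Class` and `Repository` in the resulting dictionary to decide which class file to fetch and from where.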
Other Resources in MOCHA
• Local and global tables
  • Data sources + columns + types
• User-defined functions (UDFs)
  • Argument types + return type
  • Code repository
• Schema mapping rules
• DAPs
  • URL
  • Login information
Schema Mapping in MOCHA
• Direct column mappings
• Complex expressions
[Figure: columns of RastersMD (point1, point2, photo, date, band) mapped to columns of Rasters (location, image, week, band), some directly and some through rect() and week()]
MOCHA Schema Mapping Rules
• Use XML to encode mapping rules
• Schema mapping sub-plans (SMP)
  • Leaf nodes
[Figure: plan tree with SMP leaf nodes]
    <MapList>
      <mi mapped="direct">
        <mocha:Column>image</mocha:Column>
        <mocha:Expr>photo</mocha:Expr>
      </mi>
      <mi mapped="expression">
        <mocha:Column>location</mocha:Column>
        <mocha:Expr>rect(point1, point2)</mocha:Expr>
      </mi>
      …
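One way to picture how such rules could be applied is to load them into a lookup table and rewrite global column names into source expressions. The Python sketch below does that for the two rules shown on the slide. The closing `</MapList>` tag and the `mocha` namespace declaration are added so the fragment parses; the namespace URI, `load_rules`, and `local_select_list` are hypothetical, and the slides do not say how MOCHA itself evaluates the rules.

```python
import xml.etree.ElementTree as ET

MOCHA_NS = "http://example.org/mocha#"  # placeholder; the slides give no namespace URI

# Only the two rules shown on the slide; the elided rules are not reconstructed.
RULES = f"""<MapList xmlns:mocha="{MOCHA_NS}">
  <mi mapped="direct">
    <mocha:Column>image</mocha:Column>
    <mocha:Expr>photo</mocha:Expr>
  </mi>
  <mi mapped="expression">
    <mocha:Column> location </mocha:Column>
    <mocha:Expr> rect(point1, point2) </mocha:Expr>
  </mi>
</MapList>"""

def load_rules(xml_text):
    """Return {global column: (mapping kind, source expression)}."""
    ns = {"mocha": MOCHA_NS}
    rules = {}
    for mi in ET.fromstring(xml_text).findall("mi"):
        column = mi.findtext("mocha:Column", namespaces=ns).strip()
        expr = mi.findtext("mocha:Expr", namespaces=ns).strip()
        rules[column] = (mi.get("mapped"), expr)
    return rules

def local_select_list(global_columns, rules):
    """Rewrite global column names into expressions over the source table."""
    return ", ".join(f"{rules[c][1]} AS {c}" for c in global_columns)

rules = load_rules(RULES)
select_list = local_select_list(["location", "image"], rules)
```

Here `select_list` is the text that would replace `location, image` when the query is sent to a source whose table uses the RastersMD column names.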
MOCHA Optimization Framework
• Query optimization based on heuristics
  • cost = network + CPU + I/O
• Network is the dominant factor (WAN)
  • Optimize for it first
• CPU and I/O are cheaper
  • Optimize for them later
• Operator placement: Enhanced Hybrid Shipping
  • Code
  • Data
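The "network first, CPU and I/O later" heuristic can be read as a two-level comparison between candidate plans. The Python sketch below is one such reading, not the MOCHA optimizer: plans are ranked by network cost, and CPU plus I/O only breaks ties. The plan names and all cost figures are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanCost:
    name: str
    network: float  # all three components in the same arbitrary cost unit
    cpu: float
    io: float

    @property
    def total(self):
        # cost = network + CPU + I/O, as on the slide
        return self.network + self.cpu + self.io

def pick_plan(plans):
    """Prefer the lowest network cost; use CPU + I/O only to break ties."""
    return min(plans, key=lambda p: (p.network, p.cpu + p.io))

# Hypothetical candidates for the Composite() query.
plans = [
    PlanCost("ship data to QPC", network=300.0, cpu=5.0, io=10.0),
    PlanCost("ship code to DAPs", network=0.7, cpu=20.0, io=10.0),
    PlanCost("ship code to DAPs, no index", network=0.7, cpu=20.0, io=25.0),
]
best = pick_plan(plans)
```

With these numbers the code-shipping plan wins even though it spends more CPU at the data sources, which matches the claim that the network dominates in a wide-area setting.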
Operator Placement in MOCHA
[Figure label: Composite()]
• Data-Reducing Operators
  • “Filter” the data
  • Aggregates, predicates, projections, semi-joins
  • Composite(), Overlaps(), AvgEnergy()
• Push to the DAPs
  • Return distilled results
  • Less data movement
Operator Placement in MOCHA
[Figure label: DoubleRes()]
• Data-Inflating Operators
  • “Expand” the data
  • Projections, image processing, some joins …
  • DoubleResolution(), RotateSolid()
• Pull to the QPC
  • Data Shipping policy [FJK96]
  • Only send back raw arguments
  • Less data movement
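The two placement slides can be summarized as a single rule: compare how much data an operator emits with how much it consumes. The Python sketch below applies that rule using a simple output-to-input size ratio. The ratio test, the `place_operator` name, and the byte counts are assumptions for illustration; the slides mention placement metrics but do not define them.

```python
def place_operator(bytes_in, bytes_out):
    """Return "DAP" for a data-reducing operator and "QPC" for a data-inflating one.

    An operator that leaves the volume unchanged is placed at the QPC here,
    which is an arbitrary choice made for this sketch.
    """
    if bytes_in <= 0:
        raise ValueError("bytes_in must be positive")
    ratio = bytes_out / bytes_in
    return "DAP" if ratio < 1 else "QPC"

MB = 1024 * 1024
KB = 1024

# Composite(): many images in, one small composite per location out.
composite_site = place_operator(200 * MB, 350 * KB)

# DoubleResolution(): assumed to quadruple the pixel count of each image.
double_res_site = place_operator(1 * MB, 4 * MB)
```

In both cases the smaller of the two volumes is what crosses the network: results for the reducing operator, raw arguments for the inflating one.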