MOCHA : A Self-Extensible Database Middleware System for Distributed Data Sources

Presentation Transcript


  1. MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources. Manuel Rodriguez-Martinez, Nick Roussopoulos

  2. Motivation • Data Sources are distributed and heterogeneous: Fact of Life ... [Figure labels: Client, Client, Internet, Oracle 8i, Informix, XML Data, Text Data] M. Rodriguez-Martinez – N. Roussopoulos

  3. Client-Server Connectivity • Not a Good Idea • 2-tier architecture means FAT Clients [Figure labels: Client, Client, Internet, Oracle 8i, Informix, XML Data, Text Data]

  4. Middleware Integration Service • Middleware is a 3-tier connectivity solution – Thin Clients [Figure labels: Client, Client, Integration Server, Catalog, Translator (four), Internet, Oracle 8i, Informix, XML Data, Text Data]

  5. Problem 1: Code Deployment • User-defined types and functions • Polygon • Composite() – image aggregation • Porting and manual installation of code • Operating system • Hardware platform • Expensive Software Maintenance • Updates • Version management • Security • Software certification

  6. Problem 1: Code Deployment • Not Scalable – Expensive System Growth [Figure labels: Client, Client, Integration Server, Catalog, Translator (four), Internet, Oracle 8i, Informix, XML Data, Text Data]

  7. Problem 2: Query Processing • Operator placement options • Limited by site-dependent software • Composite() – got to have it before using it! • Most processing at Integration Server • Powerful Data Servers are under-utilized • I/O Nodes • Excessive data movement over the network • Network bottleneck • Unfeasible in WANs, Internet

  8. Problem 2: Query Processing • Not Scalable – Inefficient evaluation of queries [Figure labels: Client, Client, Integration Server, Catalog, Translator (four), 100MB (three times), Internet, Oracle 8i, Informix, XML Data, Text Data]

  9. MOCHA Solution: Ship Code! • Query: Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location [Figure labels: Client, QPC, Code Repository, Catalog, DAP, DAP, Informix, Oracle, Maryland, Texas, Virginia, Internet]

  10. MOCHA Solution: Filter Data! • Query: Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location [Figure labels: Client, QPC, Code Repository, Catalog, DAP, DAP, Informix, Oracle, Maryland, Texas, Virginia, Internet; data volumes shown: 100MB and 200MB of tuples, results of 150KB, 200KB, and 350KB]

  11. MOCHA Goals • Automatic Deployment of Code (self-extensible) • QPC ships compiled Java classes • User-defined types and functions • XML for their metadata (easy exchange) • Data processing at data source sites • Utilize powerful machines • On-site data distillation • Processing based on data movement reduction • “Filter” data at the data sources • “Expand” data near the clients

  12. The MOCHA Architecture • Multi-threaded • Distributed Objects [Figure labels: Client, Client, QPC, Coordination Thread, Execution Thread, Execution Thread, Code Repository, Catalog, DAP, DAP, Informix, Oracle]

  13. QPC: The Integration Server • QPC Controls and Coordinates Query Execution [Figure labels: Client API, Query Parser, Code Repository, XML Catalog, Query Optimizer, Catalog Manager, Execution Engine, SQL & XML Proc. Interface, Code Loader, DAP Access API, DAP]

  14. DAP: The Facilitator of Data • DAP Provides QPC with Remote Access to the Data [Figure labels: DAP Access API, Control Module, Execution Engine, SQL & XML Proc. Interface, Code Loader, Data Source Access Layer, Data Source, JDBC, I/O API, DOM, JNI; data volumes shown: 100MB of tuples, 150KB of results]

  15. Road Map • Introduction • Problem Definition • MOCHA Architecture • Query Processing • Experiments • Summary

  16. Processing The Queries • Issue 1: Placement and deployment of operators • Which operators go to QPC, and which go to the DAPs? • Issue 2: How to determine this placement? • Dynamic programming [SAC+79], [ML86] • But search space is enormous • Placement of UDFs, joins, execution sites … • Plenty of “bad” plans • In MOCHA: Query Optimization based on heuristics • Network usually is the critical factor → optimize for it first • CPU and I/O are cheaper → optimize for them later • Quickly converge to a “good” plan
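One way to read the "network first, CPU and I/O later" heuristic is as a lexicographic ordering over candidate plans. The sketch below is only an illustration under that assumption; the slides do not describe how the MOCHA optimizer actually ranks plans. The `PlanCost` class, the plan names, and all cost figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    """Hypothetical cost summary for one candidate plan."""
    name: str
    network_bytes: int   # bytes moved between the DAPs and the QPC
    cpu_io_cost: float   # combined CPU and I/O estimate, arbitrary units


def pick_plan(candidates):
    """Prefer the plan that moves the least data; break ties on CPU and I/O."""
    return min(candidates, key=lambda p: (p.network_bytes, p.cpu_io_cost))


# Made-up candidates: one plan ships raw tuples, two filter at the DAPs.
plans = [
    PlanCost("all-at-QPC", 200_000_000, 5.0),
    PlanCost("composite-at-DAP", 350_000, 9.0),
    PlanCost("composite-at-DAP-alt", 350_000, 7.5),
]

best = pick_plan(plans)
print(best.name)
```

With these made-up numbers the two DAP plans tie on network volume, so the cheaper CPU and I/O estimate decides between them.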

  17. Operator Placement • Data-Reducing Operators • “Filter” the data • Aggregates, predicates, projections, semi-joins • Composite(), Overlaps(), AvgEnergy() • Push to the DAPs • Code Shipping policy (Unique to MOCHA) • Only send back distilled results • Less data movement • Cost: • Computation cost • Transfer of filtered results

  18. Operator Placement • Data-Inflating Operators • “Expand” the data • projections, image processing, some joins … • DoubleResolution(), RotateSolid() • Pull to the QPC • Data Shipping policy [FJK96] • Only send back raw arguments • Less data movement • Cost: • Computation cost • Transfer of raw argument values

  19. Placement Metric: VRF • Volume Reduction Factor: given operator $\theta$ and relation $R$, $\mathrm{VRF}(\theta) = \mathrm{VDT} / \mathrm{VDA}$, where • VDT: volume of data transmitted after applying $\theta$ to $R$ • VDA: volume of data originally present in $R$ • $\theta$ is Data-Reducing ⇔ VRF < 1 (e.g. Composite()) • $\theta$ is Data-Inflating ⇔ VRF ≥ 1 (e.g. DoubleRes())
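As a rough sketch of how the metric could drive placement, the code below takes VRF to be the ratio VDT / VDA and applies the rule from the operator-placement slides: data-reducing operators are pushed to the DAPs, data-inflating ones are pulled to the QPC. The function names and byte counts are hypothetical and do not come from MOCHA itself.

```python
def vrf(vdt_bytes, vda_bytes):
    """Volume Reduction Factor: data transmitted after the operator / data originally present."""
    if vda_bytes <= 0:
        raise ValueError("VDA must be positive")
    return vdt_bytes / vda_bytes


def placement(vrf_value):
    """Data-reducing operators (VRF < 1) run at the DAP; the rest run at the QPC."""
    return "DAP" if vrf_value < 1 else "QPC"


# Hypothetical volumes: an aggregate that distills 100 MB of images to 150 KB,
# and a resolution-doubling function that turns 1 MB into 4 MB.
composite_vrf = vrf(150_000, 100_000_000)
double_res_vrf = vrf(4_000_000, 1_000_000)

print(placement(composite_vrf), placement(double_res_vrf))
```

An operator with VRF exactly 1 neither reduces nor inflates the data; this sketch follows the slide's VRF ≥ 1 boundary and leaves it at the QPC.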

  20. Goal: Plans with small CVRF • Cumulative Volume Reduction Factor: given a plan $P$ to solve query $Q$ over relations $R_1, \ldots, R_n$, $\mathrm{CVRF}(P) = \mathrm{CVDT} / \mathrm{CVDA}$, where • CVDT: volume of data transmitted by applying all operators in $P$ to $R_1, \ldots, R_n$ • CVDA: volume of data originally present in $R_1, \ldots, R_n$ • Optimizer searches for plans that move a minimal amount of data [Figure: Search Space, CVRF(Plan) ∈ [0,1]]
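The cumulative metric can be illustrated the same way, assuming CVRF is the ratio CVDT / CVDA summed over all relations in the plan. The sketch compares two hypothetical plans over two 100 MB relations; the function name and the volumes are illustrative only.

```python
def cvrf(transmitted_bytes, relation_bytes):
    """Cumulative Volume Reduction Factor for one plan.

    transmitted_bytes: bytes the plan ships over the network, one entry per relation
    relation_bytes:    bytes originally stored in each relation
    """
    cvda = sum(relation_bytes)
    if cvda <= 0:
        raise ValueError("CVDA must be positive")
    return sum(transmitted_bytes) / cvda


relations = [100_000_000, 100_000_000]           # two hypothetical 100 MB relations

ship_raw = cvrf([100_000_000, 100_000_000], relations)   # move every tuple to the QPC
filter_at_daps = cvrf([150_000, 200_000], relations)     # send back distilled results only

# The optimizer's goal, as stated on the slide, is the plan with the smaller CVRF.
best = min(("ship_raw", ship_raw), ("filter_at_daps", filter_at_daps), key=lambda t: t[1])
print(best)
```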

  21. Performance Evaluation • Goals of this study: • Measure how good code shipping can be • Validate heuristics being proposed • VRF • CVRF • Guide implementation of the optimizer • Configured MOCHA with plans that place operators based on heuristics.

  22. Experimental Environment • Sequoia 2000 Benchmark • scientific data - points, polygons, satellite images • Distributed applications • Software and Hardware: • JDK 1.2 • QPC - Sun Ultra 60, Solaris 2.6 • DAPs - Sun Ultra 1, Sun Ultra 5, Solaris 2.6 • Data Sources • 2 Informix IUS 9.12 Servers • 10 Mbps Ethernet

  23. Reducing vs. Inflating • Query classes • Composite of all images • Clipping and sub-setting • Double resolution of images • Performance gains • composites • 99% data reduction • 4:1 better performance • clipping and expansion • 80% data reduction • 3:1 better performance • Validates heuristics [Figure: Running Time (secs) by Query Class (Q1, Q2, Q3), QPC vs. DAP placement]

  24. VRF vs Selectivity • Select graph identifiers based on number of vertices and arc length • Selectivity [HS93] and cardinality [HKWY97] are not enough for distributed predicate placement • Need to also consider size of arguments for predicates! • Consider 50% selectivity • DAP → CVRF = 0.01 • QPC → CVRF = 1 • VRF is a better metric [Figure: Running Time (secs) vs. Selectivity (0, .25, .50, .75, 1), QPC vs. DAP placement]
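The 50% selectivity case can be worked through with a small calculation. The sketch assumes the query returns only an identifier per qualifying tuple, while the predicate needs the full graph argument. The 8-byte identifier and 392-byte graph are hypothetical sizes, chosen so that the result lands on the slide's figures; the slides do not give the real sizes.

```python
def predicate_cvrf(selectivity, id_bytes, arg_bytes, at_dap):
    """Fraction of the stored data that crosses the network for one predicate.

    At the DAP, only identifiers of qualifying tuples are sent back.
    At the QPC, every tuple (identifier plus argument) must be shipped first.
    """
    tuple_bytes = id_bytes + arg_bytes
    if at_dap:
        sent_per_tuple = selectivity * id_bytes
    else:
        sent_per_tuple = tuple_bytes
    return sent_per_tuple / tuple_bytes


# Hypothetical tuple layout: 8-byte identifier, 392-byte graph argument.
dap = predicate_cvrf(0.5, 8, 392, at_dap=True)
qpc = predicate_cvrf(0.5, 8, 392, at_dap=False)
print(dap, qpc)
```

Selectivity is the same in both runs, yet the data moved differs by two orders of magnitude, because the argument size dominates. This is the effect the slide attributes to considering argument sizes in addition to selectivity and cardinality.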

  25. Implementation Status • Operational System • SIGMOD 2000 Demo • Experimental deployment of MOCHA • NASA Earth Scientists (ESIP Federation) • Goddard Space Flight Center • NCSA • Land Cover Visualization Tool

  26. Summary and Conclusions • Proposed a new Middleware Architecture: MOCHA • Automatic Code Deployment (self-extensible) • Shipping Java classes • Query processing based on data movement reduction • Proposed VRF metric for placement of functions • Better than selectivity and result cardinality • Future work • Deployment of MOCHA for NASA ESIP Federation • Full implementation of MOCHA Optimizer • More Info: • http://mocha.umiacs.umd.edu/

  27. Problem 2: Query Processing • Not Scalable – Inefficient evaluation of queries [Figure labels: Client, Client, Integration Server, Catalog, Translator (four), 100MB (three times), 200MB (three times), Internet, Oracle 8i, Informix, XML Data, Text Data]
