MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources
Manuel Rodriguez-Martinez, Nick Roussopoulos

Motivation
Data sources are distributed and heterogeneous: fact of life
[Diagram labels: Client, Client, Internet, Oracle 8i, Informix, XML Data, Text Data]
M. Rodriguez-Martinez – N. Roussopoulos

Client-Server Connectivity
2-tier architecture means FAT clients: not a good idea
[Diagram labels: Client, Client, Internet, Oracle 8i, Informix, XML Data, Text Data]

Middleware Integration Service
Middleware is a 3-tier connectivity solution: thin clients
[Diagram labels: Client, Client, Integration Server, Catalog, Translator (x4), Internet, Oracle 8i, Informix, XML Data, Text Data]

Problem 1: Code Deployment
• User-defined types and functions
   • Polygon
   • Composite() – image aggregation
• Porting and manual installation of code
   • Operating system
   • Hardware platform
• Expensive software maintenance
   • Updates
   • Version management
• Security
   • Software certification

Problem 1: Code Deployment
Not scalable: expensive system growth
[Diagram labels: Client, Client, Integration Server, Catalog, Translator (x4), Internet, Oracle 8i, Informix, XML Data, Text Data]

Problem 2: Query Processing
• Operator placement options
   • Limited by site-dependent software
   • Composite() – got to have it before using it!
• Most processing at Integration Server
   • Powerful data servers are under-utilized
   • I/O nodes
• Excessive data movement over the network
   • Network bottleneck
   • Infeasible in WANs, Internet

Problem 2: Query Processing
Not scalable: inefficient evaluation of queries
[Diagram labels: Client, Client, Integration Server, Catalog, Translator (x4), Internet, Oracle 8i, Informix, XML Data, Text Data; data-volume annotations: 100MB (x3)]

MOCHA Solution: Ship Code!
Query:
   Select location, Composite(image)
   From Rasters
   Where week BETWEEN t1 and t2
   Group By location
[Diagram labels: Client, QPC, Code Repository, Catalog, DAP, DAP, Informix, Oracle, Internet; sites: Maryland, Texas, Virginia]

MOCHA Solution: Filter Data!
Query:
   Select location, Composite(image)
   From Rasters
   Where week BETWEEN t1 and t2
   Group By location
[Diagram labels: Client, QPC, Code Repository, Catalog, DAP, DAP, Informix, Oracle, Internet; sites: Maryland, Texas, Virginia; data-volume annotations: tuples 100MB and 200MB; results 150KB, 200KB, 350KB]

MOCHA Goals
• Automatic deployment of code (self-extensible)
   • QPC ships compiled Java classes
   • User-defined types and functions
   • XML for their metadata (easy exchange)
• Data processing at data source sites
   • Utilize powerful machines
   • On-site data distillation
• Processing based on data movement reduction
   • “Filter” data at the data sources
   • “Expand” data near the clients

The MOCHA Architecture
• Multi-threaded
• Distributed objects
[Diagram labels: Client, Client, QPC, Coordination Thread, Execution Thread (x2), Code Repository, Catalog, DAP, DAP, Informix, Oracle]

QPC: The Integration Server
QPC controls and coordinates query execution
[Diagram labels: Client API, Query Parser, Code Repository, XML Catalog, Query Optimizer, Catalog Manager, Execution Engine, SQL & XML Proc. Interface, Code Loader, DAP Access API, DAP]

DAP: The Facilitator of Data
DAP provides QPC with remote access to the data
[Diagram labels: DAP Access API, Control Module, Execution Engine, SQL & XML Proc. Interface, Code Loader, Data Source Access Layer, Data Source; access interfaces: JDBC, I/O API, DOM, JNI; data-volume annotations: tuples 100MB, results 150KB]

Road Map
• Introduction
• Problem Definition
• MOCHA Architecture
• Query Processing
• Experiments
• Summary

Processing The Queries
• Issue 1: Placement and deployment of operators
   • Which operators go to QPC, and which go to the DAPs?
• Issue 2: How to determine this placement?
   • Dynamic programming [SAC+79], [ML86]
   • But search space is enormous
      • Placement of UDFs, joins, execution sites …
      • Plenty of “bad” plans
• In MOCHA: Query optimization based on heuristics
   • Network is usually the critical factor, so optimize for it first
   • CPU and I/O are cheaper, so optimize for them later
   • Quickly converge to a “good” plan

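The "network first, CPU and I/O later" heuristic can be read as a lexicographic comparison of candidate plans. The following is a minimal Python sketch under that reading; the `Plan` record, its fields, and all cost figures are hypothetical illustrations and are not taken from the MOCHA optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan summary; not a MOCHA data structure."""
    name: str
    network_bytes: int   # estimated bytes moved over the network
    local_cost: float    # estimated CPU + I/O cost, arbitrary units


def pick_plan(plans):
    """Prefer the least network traffic; break ties on CPU and I/O cost."""
    return min(plans, key=lambda p: (p.network_bytes, p.local_cost))


# Made-up candidates for illustration only.
candidates = [
    Plan("all operators at QPC", 100_000_000, 5.0),
    Plan("aggregate at DAP, slow scan", 150_000, 9.0),
    Plan("aggregate at DAP, fast scan", 150_000, 7.0),
]

best = pick_plan(candidates)
print(best.name)
```

Sorting on the tuple `(network_bytes, local_cost)` means local cost only matters between plans that move the same amount of data, which is one simple way to express "optimize for the network first".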
Operator Placement
• Data-Reducing Operators
   • “Filter” the data
   • Aggregates, predicates, projections, semi-joins
   • Composite(), Overlaps(), AvgEnergy()
• Push to the DAPs
   • Code Shipping policy (unique to MOCHA)
   • Only send back distilled results
   • Less data movement
• Cost:
   • Computation cost
   • Transfer of filtered results

Operator Placement
• Data-Inflating Operators
   • “Expand” the data
   • Projections, image processing, some joins …
   • DoubleResolution(), RotateSolid()
• Pull to the QPC
   • Data Shipping policy [FJK96]
   • Only send back raw arguments
   • Less data movement
• Cost:
   • Computation cost
   • Transfer of raw argument values

Placement Metric: VRF
Volume Reduction Factor: given an operator $\omega$ and a relation $R$,
$$\mathrm{VRF}(\omega) = \frac{\mathrm{VDT}}{\mathrm{VDA}}$$
• VDT - volume of data transmitted after applying $\omega$ to $R$
• VDA - volume of data originally present in $R$
• $\omega$ is data-reducing if $\mathrm{VRF}(\omega) < 1$, e.g. Composite()
• $\omega$ is data-inflating if $\mathrm{VRF}(\omega) \geq 1$, e.g. DoubleRes()

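As a rough illustration of the metric, the sketch below computes VRF as the ratio of transmitted to original data volume and maps it to a placement site. The function names and the byte counts are hypothetical, and treating a VRF of exactly 1 as data-inflating is an assumption of this sketch.

```python
def vrf(vdt_bytes, vda_bytes):
    """Volume Reduction Factor: data transmitted / data originally present."""
    if vda_bytes <= 0:
        raise ValueError("VDA must be positive")
    return vdt_bytes / vda_bytes


def placement(vrf_value):
    """Data-reducing operators go to the DAP, the rest to the QPC.

    Assumption: VRF == 1 is treated as data-inflating.
    """
    return "DAP" if vrf_value < 1 else "QPC"


# Illustrative sizes only: an aggregate that distills 100 MB into 150 KB,
# and an image operator whose output is four times its input.
composite_vrf = vrf(150_000, 100_000_000)
double_res_vrf = vrf(4_000_000, 1_000_000)

print(placement(composite_vrf))   # data-reducing
print(placement(double_res_vrf))  # data-inflating
```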
Goal: Plans with small CVRF
Cumulative Volume Reduction Factor: given a plan $P$ to solve query $Q$ over relations $R_1, \dots, R_n$,
$$\mathrm{CVRF}(P) = \frac{\mathrm{CVDT}}{\mathrm{CVDA}}, \qquad \mathrm{CVRF}(P) \in [0, 1]$$
• CVDT - volume of data transmitted by applying all operators in $P$ to $R_1, \dots, R_n$
• CVDA - volume of data originally present in $R_1, \dots, R_n$
• Optimizer searches for plans that move a minimal amount of data.
[Figure: search space of plans]

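The CVRF definition can be sketched as a ratio of two sums, with the optimizer preferring the plan with the smaller value. This is a minimal Python sketch; the plan names, relation sizes, and transfer volumes are made up for illustration and are not measurements from MOCHA.

```python
def cvrf(transmitted_bytes, relation_bytes):
    """Cumulative VRF: total data transmitted by a plan / total data in its relations."""
    cvda = sum(relation_bytes)
    if cvda <= 0:
        raise ValueError("CVDA must be positive")
    return sum(transmitted_bytes) / cvda


# Hypothetical query over two 100 MB relations.
relations = [100_000_000, 100_000_000]

plans = {
    # Ship every tuple to the integration server.
    "data shipping": [100_000_000, 100_000_000],
    # Run the aggregates at the data sources and ship only the results.
    "code shipping": [150_000, 200_000],
}

scores = {name: cvrf(sent, relations) for name, sent in plans.items()}
best_plan = min(scores, key=scores.get)
print(best_plan, scores[best_plan])
```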
Performance Evaluation
• Goals of this study:
   • Measure how good code shipping can be
   • Validate heuristics being proposed
      • VRF
      • CVRF
   • Guide implementation of the optimizer
• Configured MOCHA with plans that place operators based on heuristics.

Experimental Environment
• Sequoia 2000 Benchmark
   • Scientific data - points, polygons, satellite images
   • Distributed applications
• Software and hardware:
   • JDK 1.2
   • QPC - Sun Ultra 60, Solaris 2.6
   • DAPs - Sun Ultra 1, Sun Ultra 5, Solaris 2.6
   • Data sources - 2 Informix IUS 9.12 servers
   • 10 Mbps Ethernet

Reducing vs. Inflating
• Query classes
   • Composite of all images
   • Clipping and sub-setting
   • Double resolution of images
• Performance gains
   • Composites: 99% data reduction, 4-to-1 better performance
   • Clipping and expansion: 80% data reduction, 3-to-1 better performance
• Validates heuristics
[Figure: running time (secs) by query class (Q1, Q2, Q3), comparing QPC and DAP placement]

VRF vs Selectivity
• Select graph identifiers based on number of vertices and arc length
• Selectivity [HS93] and cardinality [HKWY97] are not enough for distributed predicate placement
   • Need to also consider the size of the predicate arguments!
• Consider 50% selectivity
   • DAP: CVRF = 0.01
   • QPC: CVRF = 1
• VRF is a better metric
[Figure: running time (secs) vs. selectivity (0, .25, .50, .75, 1), comparing QPC and DAP placement]

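To see why selectivity alone can mislead, consider a predicate whose arguments are large but whose result is only a small identifier. The Python sketch below uses hypothetical row counts and byte sizes, chosen only so that the arithmetic lands on the 0.01 and 1 figures quoted on the slide; they are not the experimental values.

```python
def cvrf_predicate_at_dap(rows, selectivity, id_bytes, arg_bytes):
    """Predicate runs at the DAP: only identifiers of qualifying rows are shipped."""
    transmitted = rows * selectivity * id_bytes
    original = rows * (id_bytes + arg_bytes)
    return transmitted / original


def cvrf_predicate_at_qpc(rows, id_bytes, arg_bytes):
    """Predicate runs at the QPC: every row, including its large argument, is shipped."""
    transmitted = rows * (id_bytes + arg_bytes)
    original = rows * (id_bytes + arg_bytes)
    return transmitted / original


# Assumed sizes: a 40-byte identifier and a 1960-byte graph argument per row.
ROWS, ID_BYTES, ARG_BYTES = 10_000, 40, 1960

dap = cvrf_predicate_at_dap(ROWS, 0.5, ID_BYTES, ARG_BYTES)
qpc = cvrf_predicate_at_qpc(ROWS, ID_BYTES, ARG_BYTES)
print(dap, qpc)
```

Under these assumed sizes, a predicate that keeps half of the rows still moves about a hundred times less data at the DAP than at the QPC, which selectivity by itself does not reveal.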
Implementation Status
• Operational system
   • SIGMOD 2000 demo
• Experimental deployment of MOCHA
   • NASA Earth Scientists (ESIP Federation)
   • Goddard Space Flight Center
   • NCSA
   • Land Cover Visualization Tool

Summary and Conclusions
• Proposed a new middleware architecture: MOCHA
   • Automatic code deployment (self-extensible)
      • Shipping Java classes
   • Query processing based on data movement reduction
• Proposed VRF metric for placement of functions
   • Better than selectivity and result cardinality
• Future work
   • Deployment of MOCHA for NASA ESIP Federation
   • Full implementation of MOCHA Optimizer
• More info:
   • http://mocha.umiacs.umd.edu/

Problem 2: Query Processing
Not scalable: inefficient evaluation of queries
[Diagram labels: Client, Client, Integration Server, Catalog, Translator (x4), Internet, Oracle 8i, Informix, XML Data, Text Data; data-volume annotations: 100MB (x3), 200MB (x3)]