Learn about Mariposa, a wide-area distributed database system that addresses the challenges of managing databases in a WAN environment. Discover its architecture, scalability, and economic paradigm of bidding.
Mariposa: a wide-area distributed database system Kumar Ramdurgkar. CIS 661
Mariposa Distributed Database Management System Principal Investigator: Prof. Michael Stonebraker
SECTION 1 • Introduction to Mariposa
LAN vs. WAN databases • LAN database management is common and is most often used in industries where the data is local to the installation. • A LAN has a single RDBMS source. • A LAN database is maintained under a well-defined set of rules, data types, and services. The difference?
WAN Databases • Many databases interconnected over a WAN. • In a WAN, many sites participate in the DBMS. • Different site administrators. • Different data types, extensions and service handling times. • How do we interconnect? • What are the issues?
Issues and problems • Network connections and traffic. • Different ‘load’ handling capabilities and service times. • Different data types and extensions. • A single program acting as a query optimizer will NOT work. continued…
Issues and problems • Cost-based optimization does not respond well to site-specific type extensions, access constraints, charging algorithms and time-of-day constraints. • LAN algorithms do not scale properly to suit a WAN DBMS. The Solution…
An excellent idea! MARIPOSA • UBID!! Have you been there? • Mariposa is a distributed DBMS built on the economic paradigm of bidding. Mariposa was proposed by: Michael Stonebraker, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, Andrew Yu. Proposed: Nov 1994. Accepted: Sept 1995.
Mariposa…vision • Standard approach for distributed data. • A set of standard guidelines for WAN databases. • Application of query storage and optimization using a different perspective. • Scalability and data explosion handling. • A query optimizer for the WWW ?? • Need to formalize
WAN Guidelines for Mariposa • Scalability to a large number of cooperating sites. • Data mobility. • No global synchronization of data. • Total local autonomy and complete control. • Easily configurable policies for changing the behavior of Mariposa.
Mariposa System architecture • Microeconomic mechanisms. • All Mariposa clients and servers have an account with a network bank. • A user allocates a budget in the currency of this bank to each query. • The goal of the query processing system is to solve the query within the allotted budget by contracting with various Mariposa sites.
Mariposa Broker mechanism • Obtains bids for the pieces of a query from sites. • Uses a distributed advertising system rather than the usual metadata mechanisms used in a LAN. • The server that has advertised the best time for the given query wins.
Scalability • A site can join Mariposa by buying ‘objects’ and advertising services. • A site can leave Mariposa by selling its objects and ceasing to bid. • Hence a highly scalable system. • In fact, the success of Mariposa depends on a large number of sites participating in the system.
Storage decisions • Objects have no notion of home. • All secondary indices are moved with the objects. • Avoidance of global sync is simplified because of the economic paradigm. • Mariposa fosters data mobility and free trade of objects • Object here means ‘data’
Total local control • Since each Mariposa site is free to bid on any business of interest, it has total local autonomy. • Each site is expected to maximize its individual profit per unit of operating time and to bid on those queries that it feels will accomplish this goal.
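The "maximize profit per unit of operating time" rule on this slide can be sketched as a simple local policy. This is only an illustration, not Mariposa's bidder: the `QueryOffer` fields, the price units and the greedy selection are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class QueryOffer:
    """A piece of work a site could bid on (all fields are hypothetical)."""
    name: str
    price: float      # what the site expects to be paid
    cost: float       # the site's own resource cost
    run_time: float   # operating time the work would occupy, in seconds


def profit_rate(offer: QueryOffer) -> float:
    """Profit earned per second of operating time."""
    return (offer.price - offer.cost) / offer.run_time


def choose_queries(offers: list[QueryOffer], available_time: float) -> list[str]:
    """Pick offers by descending profit rate until the time runs out."""
    chosen = []
    remaining = available_time
    for offer in sorted(offers, key=profit_rate, reverse=True):
        if profit_rate(offer) <= 0:
            break  # everything after this point loses money
        if offer.run_time <= remaining:
            chosen.append(offer.name)
            remaining -= offer.run_time
    return chosen
```

A site using a policy like this simply declines unprofitable work, which is one way the "nobody will bid" situation on the next slide can arise.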
Sounds good… any drawbacks? • Some queries may not be solvable, either because nobody will bid on them or because the minimum bid exceeds what the client is willing to pay. • A site can refuse to give up objects. • A site may not find buyers for objects that it wants to sell.
SECTION 2 • Mariposa architecture
Mariposa Architectural details • Hardware Flow chart • Processes (bidding, bid protocols, acceptance, finding bidders, sub–query bidding, network bidding, splitting and combining) • Code languages (RUSH) • Mariposa experiments and results • Conclusions
Architecture overview • Client query in SQL3. • Middleware consists of several query separators and query brokers. • Broker and Bidder coded in RUSH. • Local execution at the site that wins the bid. • Details…
Processes : Bidding • Each query Q has a budget B(t) that can be used to solve the query. • The budget is the value the user is willing to pay to have this query solved. • The broker receives a query plan for Q and tries to get each fragment solved using either the expensive bid protocol or a cheaper purchase order protocol.
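One way to read B(t) is as a function of the time t the answer takes, worth less the longer the user waits. The sketch below assumes a piecewise-linear, non-increasing shape; the function names, the shape and the numbers are assumptions for illustration, not taken from the slides.

```python
from typing import Callable


def make_budget(max_price: float, full_price_until: float,
                worthless_after: float) -> Callable[[float], float]:
    """Build a non-increasing budget curve B(t) (assumed shape).

    The user pays max_price for answers delivered within full_price_until
    seconds, nothing after worthless_after seconds, and a linearly
    decreasing amount in between.
    """
    if worthless_after <= full_price_until:
        raise ValueError("worthless_after must exceed full_price_until")

    def budget(t: float) -> float:
        if t <= full_price_until:
            return max_price
        if t >= worthless_after:
            return 0.0
        remaining = (worthless_after - t) / (worthless_after - full_price_until)
        return max_price * remaining

    return budget


def is_affordable(budget: Callable[[float], float],
                  cost: float, delay: float) -> bool:
    """A plan is acceptable if its cost fits under the curve at its delay."""
    return cost <= budget(delay)
```

With `b = make_budget(100.0, 10.0, 20.0)`, a plan costing 60 is acceptable at a delay of 12 seconds but not at 18 seconds, because the curve has dropped below 60 by then.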
Processes : Bidding • Brokers split each query into subqueries and solicit bids for each subquery. • There is a set sequence of subquery execution. • Finding the right winners is implemented as a greedy algorithm at the broker.
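The slide does not spell out the greedy algorithm, so the following is only a sketch of one plausible greedy strategy, not the broker's actual code. It assumes subqueries run one after another (so delays add up), that every subquery has at least one bid, and that each bid is a hypothetical `(site, cost, delay)` tuple.

```python
def pick_winners(bids, deadline):
    """Greedily choose one bid per subquery so the total delay meets deadline.

    bids maps a subquery name to a non-empty list of (site, cost, delay).
    Returns a dict subquery -> winning bid, or None if no combination
    reachable by this greedy search meets the deadline.
    """
    # Start from the cheapest bid for every subquery.
    chosen = {sq: min(options, key=lambda b: b[1]) for sq, options in bids.items()}

    def total_delay():
        return sum(bid[2] for bid in chosen.values())

    while total_delay() > deadline:
        best = None  # (seconds saved per unit of extra cost, subquery, bid)
        for sq, options in bids.items():
            current = chosen[sq]
            for bid in options:
                saved = current[2] - bid[2]
                if saved <= 0:
                    continue  # this bid is not faster than the current choice
                extra = bid[1] - current[1]
                ratio = saved / max(extra, 1e-9)
                if best is None or ratio > best[0]:
                    best = (ratio, sq, bid)
        if best is None:
            return None  # no faster bid is left to swap in
        chosen[best[1]] = best[2]
    return chosen
```

Each swap strictly lowers the total delay, so the loop always terminates. Being greedy, it can miss a cheaper combination that an exhaustive search would find.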
Processes : Bid Protocols • The expensive bid protocol has 2 phases: • The Broker sends requests, and each Bidder sends back a triple (Ci, Di, Ei) for subquery Qi, giving the cost Ci, the delay Di and the expiration time Ei of the bid. • The Broker notifies winners (and losers). • The purchase order protocol is faster: the Broker sends the query directly to the site where it is most likely to be processed. There is a risk that the query might not be processed in the given time.
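The two phases of the bid protocol can be sketched for a single subquery as follows. The `Bid` class mirrors the (Ci, Di, Ei) triple from the slide; the selection rule (fastest affordable, unexpired bid wins) and the function name are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    site: str
    cost: float        # Ci
    delay: float       # Di
    expires_at: float  # Ei, as an absolute time


def run_bid_round(bids, now, max_cost):
    """Phase 2 of the sketch: pick a winner and list everyone to be told no.

    Phase 1 (request and reply) is represented by the bids list itself.
    Returns (winning_site or None, list_of_losing_sites).
    """
    live = [b for b in bids if b.expires_at > now and b.cost <= max_cost]
    if not live:
        # Nobody bid in time, or every bid exceeds what the client will pay.
        return None, [b.site for b in bids]
    winner = min(live, key=lambda b: (b.delay, b.cost))
    losers = [b.site for b in bids if b is not winner]
    return winner.site, losers
```

The `None` result corresponds to the unsolvable-query drawback mentioned earlier: no site bids, or the minimum bid exceeds the budget.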
Finding Bidders • Brokers examine ‘Ad Tables’ to find out the servers that are willing to perform the task at hand. • Using records in an Ad Table the server posts its bids. • Ad tables typically have the bidding information for the sample query structures run on that server.
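A broker's ad table lookup might look like the sketch below. The slide does not list the real ad table fields, so the `Ad` columns here (site, table, validity window) are hypothetical stand-ins.

```python
from typing import NamedTuple


class Ad(NamedTuple):
    """One advertisement row (hypothetical columns)."""
    site: str
    table: str         # the table the site offers to work on
    valid_from: float  # start of the period the ad applies to
    valid_until: float # end of that period


def find_bidders(ad_table, table, now):
    """Return the sites whose ads cover the given table at time now."""
    sites = []
    for ad in ad_table:
        if ad.table == table and ad.valid_from <= now < ad.valid_until:
            if ad.site not in sites:
                sites.append(ad.site)
    return sites
```

The broker would then send bid requests only to the returned sites instead of polling every site in the system.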
Sample Ad Table design • Not all fields might be used
Bidding strategies • Bulk purchase contracts allowing lower than normal bids (wholesale) • Coupons • Sale • Broker intelligence (remember last successful bid history and try that site query combination again)
Processes: Network Bidding • Accounts for network bandwidth. • Data size comes into consideration. • The minimum available bandwidth is calculated from node to node. • This bandwidth must be reserved to achieve the desired performance. • Mariposa uses the Tenet protocols RTIP and RCAP for network bidding.
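The "minimum available bandwidth from node to node" idea is the usual bottleneck calculation: a path is only as fast as its slowest link. The sketch below shows that arithmetic only; it does not use RTIP or RCAP, and the site names and bandwidth figures in any call are made up.

```python
def bottleneck_bandwidth(path, link_bandwidth):
    """Smallest available bandwidth (bits/s) over consecutive hops of path.

    path is a list of node names; link_bandwidth maps (from, to) pairs
    to the bandwidth currently available on that link.
    """
    if len(path) < 2:
        raise ValueError("a path needs at least two nodes")
    return min(link_bandwidth[(a, b)] for a, b in zip(path, path[1:]))


def transfer_seconds(size_bits, path, link_bandwidth):
    """Estimated time to ship size_bits along path at the bottleneck rate."""
    return size_bits / bottleneck_bandwidth(path, link_bandwidth)
```

For example, with links of 1.5 Mb/s and 1.0 Mb/s in sequence, an 8 Mb result is limited by the slower link and takes about 8 seconds, which is the figure a network bid would have to reserve for.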
Coding (RUSH language) • Mariposa provides a low level, very efficient embedded scripting language and rule system called Rush • Using Rush, it is straightforward to change policy decisions; one simply modifies the rules by which these modules are implemented. • The Mariposa architecture is primarily coded in Rush.
SECTION 3 • Mariposa experiments and results
Operational system • Mariposa is operational on Digital Equipment Corp. Alpha AXP workstations at UC Berkeley. • The basic server engine is that of POSTGRES. • Implementation of the Rush language itself has required careful design and performance engineering. • Requires a multithreaded network communication package.
Experiment setup • Workstations connected by 10 Mb/s Ethernet. • WAN experiments conducted at night. • The benchmark database consists of three tables, R1, R2 and R3. • The workload query is an equijoin of all three tables: SELECT * FROM R1, R2, R3 WHERE R1.u1 = R2.u1 AND R2.u1 = R3.u1
In the wide-area case, the query originates at Berkeley and performs the join over the WAN connecting UC Berkeley, UC Santa Barbara and UC San Diego.
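To see what the workload query computes, it can be run on a single machine with Python's built-in `sqlite3` module. This is only a local illustration of the three-way equijoin, not the distributed experiment: the `payload` column and the sample rows are invented, and only the column `u1` comes from the slide.

```python
import sqlite3


def run_benchmark_query(rows_r1, rows_r2, rows_r3):
    """Load three tiny tables and run the three-way equijoin on u1."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("R1", rows_r1), ("R2", rows_r2), ("R3", rows_r3)):
        # payload is a made-up column so each row carries some data
        conn.execute(f"CREATE TABLE {name} (u1 INTEGER, payload TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT * FROM R1, R2, R3 "
        "WHERE R1.u1 = R2.u1 AND R2.u1 = R3.u1 "
        "ORDER BY R1.u1"
    ).fetchall()
    conn.close()
    return result
```

Only rows whose `u1` value appears in all three tables survive the join. In the wide-area setup the same matching would have to happen across sites, which is where the bidding and network costs come in.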
Conclusions • Mariposa is a prototype data management system that unifies the best features of distributed operating system and distributed database management system research. • Distributed query optimization has been identified as an area that will receive strong emphasis, and we will also examine how to build a system that has a rule system at its core.
Conclusions • Future work remains in the areas of system robustness, distributed failure recovery, and performance assessment.
References • Mariposa home http://s2k-ftp.cs.berkeley.edu:8000/mariposa/index.html