320 likes | 540 Views
Interoperability of a Scalable Distributed Data Manager with an Object-relational DBMS. Thesis presentation Yakham NDIAYE November, 13 the 2001. Objective. Develop techniques for the interoperability of a DBMS with an external SDDS file.
E N D
Interoperability of a Scalable Distributed Data Manager with an Object-relational DBMS Thesis presentation Yakham NDIAYE November, 13the 2001
Objective • Develop techniques for the interoperability of a DBMS with an external SDDS file. • Examine various architectural issues, making such a coupling the most efficient. • Validate our technical choices by the prototyping and the experimental performances analysis. • Our approach is at the crossing the main memory DBMS, the object-relational-DBMS with the foreign functions, and the distributed/parallel DBMS.
Plan • Multicomputers • SDDSs • AMOS-II & DB2 DBMSs • Coupling SDDS and AMOS-II • Coupling SDDS and DB2 • Experimental analysis • Conclusion
Multicomputers • A collection of loosely coupled computers • Computers inter-connected by high-speed local area networks. • Cost/Performance • offers potentially storage and processing capabilities rivaling a supercomputer at a fraction of the cost. • New architectural concepts • offer to applications the cumulated CPU and storage capabilities of a large number of inter-connected computers.
SDDS • New data structures specifically for Multicomputers • Data are structured - records with keys • parallel scans & function shipping • Data are onservers -waiting for access • Overflowing serverssplitinto new servers -appended to the file without informing the clients • Queries come from multiple autonomousclients -Access initiators - Not using anycentralizeddirectory for access computations • See for more:http://ceria.dauphine.fr
AMOS-II DBMS • AMOS-II :Active Mediating Object System • A main memory database system. • Declarative query language : AMOSQL. • External data sources capability. • External program interfaces AMOS-II using: -Call-level interface (call-in) -Foreign functions (call-out) • See the AMOS-II page for more: http://www.dis.uu.se/~udbl/
DB2 Universal Database • IBM object-relational DBMS « DB2 Universal Database ». • Typical representative of a commercial relational-object DBMS. • Capabilities to handle external data through the user-defined functions (UDF).
Coupling Strategies • AMOS-SDDS Strategy : - for a scalable RAM file supporting database queries - Use a DBMS for manipulations best handled through by the query language ; -Direct fast data access for manipulations not supported well, or at all, by a DBMS ; -Distributed queries processing with functions shipping.
AMOS-SDDS System AMOS-SDDS scalable parallel query processing
Coupling Strategies • SD-AMOS Strategy : -Uses AMOS-II as the memory manager at each SDDS storage site ; -Scalable generalization of a parallel DBMS ; - Data partitioning becomes dynamic.
SD-AMOS System SD-AMOS scalable parallel query processing
Couplage SDDS & DB2 • DB2-SDDS Strategy : - Coupling of a DBMS with an external data repository with direct fast data access. - Use of a SDDS file by a DBMS like an external data repository. -Offer to the user an interface more elaborate than that of SDDS manager, in particular by his query language .
Coupling SDDS & DB2 DB2-SDDS Overall Architecture Register a user-defined external table function : CREATE FUNCTION scan(Varchar(20)) RETURNS TABLE (ssn integer, name Varchar(20),city Varchar(20)) EXTERNAL NAME ‘interface!fullscan'
Coupling SDDS & DB2 Foreign functions to access SDDS records from DB2: range(cleMin, cleMax) -> liste enregistrements dont cleMin < clé < cleMax scan(nom_fichier)-> liste de tous les enregistrements du fichier Sample queries : - Parallel scan All SDDS records. select * from table( scan(‘fichier’) ) as table_sdds(SSN, NAME,CITY) - Range query SDDS records where key between 1 and 100. select * from table( range(1, 100) ) as table_sdds(SSN, NAME,CITY) order by Name
The Hardware • Six Pentium III 700 MHz with 256 MB of RAM running Windows 2000 • On a 100Mbit/s Ethernet network. • One site is used as Client and the five other as Servers • We run many servers at the same machine (up to 3 per machine). • File scaled from 1 to 15 servers.
Benchmark queries • Benchmark data : • Table Person (SS#, Name, City). • Size 20,000 to 300,000 tuples of 25 bytes. • 50 Cities. • Random distribution. • Benchmark query : « couples of persons inthe same city » • Query 1, the file resides at a single AMOS-II. • Query 2, the file resides at AMOS-SDDS. • Join evaluation : Two strategies. • Measures : - Speed-up & Scale-up • Processing time of aggregate functions
Server Query Processing • E-strategy • Data stay external to AMOS »within the SDDS bucket • Custom foreign functions perform the query • I-strategy • Data are dynamically imported into AMOS-II • » Possibly with the local index creation • »Deleted after the processing • » Good for joins • AMOS performs the query
Serveur nodes Server nodes 1 1 2 2 3 3 4 5 4 5 Nested-loop(s) Elapsed time(s) 128 1,344 78 681 64 468 55 48 358 288 Time per tuple (ms) 67.2 34 23.4 17.9 14.4 Index lookup(s) 60 39 37 36 32 Speed-up Elapsed time of Query 2 according to the strategy for a file of 20,000 records, distributed over 1 to 5 servers. E-Strategy for Query 2: elapsed time I-Strategy for Query 2: elapsed time Elapsed time per tuple of Query 2 according to the strategy
Discussion • The results showed an important advantage of I-Strategy on E-Strategy for the evaluation of the join query. • For 5 servers, the rate is 6 times for the nested loop, and 9 times if an index is creates. • The favorable result makes us study the scale-up characteristics of AMOS-SDDS on a file that scales up to 300,000 tuples.
File size 20,000 60,000 100,000 160,000 200,000 240,000 300,000 # SDDS servers 1 3 5 8 10 12 15 Q1 (ms) 3.05 5.02 6.84 11.36 12.77 16.25 18.55 Q2 (ms) 2.55 3.08 3.35 6.16 6.39 8.43 8.75 Q1 w. extrap. (ms) 3.05 5.02 6.84 8.28 9.6 10.64 12.72 Q2 w. extrap. (ms) 2.55 3.08 3.35 3.11 3.2 2.84 2.94 AMOS-II (ms) 2.30 7.17 12.01 19.41 24.12 29.08 36.44 Scaling the number of servers Q1 = AMOS-SDDS join; Q2 = AMOS-SDDS join with count. Time per tuple (extrapolated for AMOS-SDDS) Elapsed time of join queries to AMOS-SDDS
Scaling the number of servers • Results are extrapolated to 1 server per machine. - Basically, the CPU component of the elapsed time is divided by 3 • The extrapolation of the processing time of the join query with count shows a linear scalability of the system. • Processing time per tuple remains constant(2.94ms) when the file size and the number of servers increase by the same factor. Expected time per tuple of join queries to AMOS-SDDS
# servers 1 2 3 4 5 E-Stratégie (ms) 10 10 10 10 10 I-Stratégie (ms) 1,462 761 511 440 341 Aggregate Function count Elapsed time of aggregate functions Count under AMOS-SDDS Elapsed time over 100,000-tuple file on AMOS-SDDS Elapsed times for AMOS-II = 280ms Elapsed time of aggregate function Count
#servers 1 2 3 4 5 I-Stratégie (ms) 420 210 140 110 90 I-Stratégie (ms) 1,663 831 561 491 390 Aggregate Function max Elapsed time of aggregate functions Max under AMOS-SDDS Elapsed time over 100,000-tuple file on AMOS-SDDS Elapsed times for AMOS-II = 471ms Elapsed time of aggregate function Max
Discussion • Contrary to the join query, the external strategy is gaining for the evaluation of aggregate functions. • For count function, improvement is about 34 times. • For max function, improvement is about 4 times. • Due to the importation cost and to a SDDS property : the current number of records is a parameter of a bucket. • Linear Speed-up : processing time decreases with the number of servers. • The use of the external functions can thus be very advantageous for certain kind of operations.
SD-AMOS performance measurements Creation time of 3,000,000 records file. The bucket size is 750,000 records of 100 bytes Global and moving average insertion time of a record
SD-AMOS performance measurements Elapsed time of range query Average time per tuple
Discussion • The average insertion time of a record with the splits is of 0.15ms. • The average access time to a record on a distributed file is of 0.12ms. - It is 100 times faster than that with a traditional file on disc. • Linear scalability : The insertion time and the access time per tuple remains constant when the file size and the number of servers increase.
DB2-SDDS performance measurements (i) access time to the data in a DB2 table, (ii) access time to SDDS file from the DB2 external functions (DB2-SDDS) and (iii) direct access time to SDDS file from a SDDS client. Elapsed time of range query Time per tuple
Discussion • Access time to SDDS file is much faster than the access time to a DB2 table: 0.02ms versus 0.07ms. • Access time to external data from DB2 (0.08ms), is less fast than the access to the internal data (0.07ms). Coupling cost • An application has : - fast direct access to the data - through the DBMS, access by the query language
Conclusion • We have coupled a SDDS manager with a main-memory DBMS AMOS-II and DB2 to improve the current technologies for high-performance databases and for the coupling with external data repositories. • The experiments we have reported in the Thesis prove the efficiency of the system. • AMOS-SDDS et DB2-SDDS : use of a SDDS file by a DBMS and the parallel query processing on the server sites. • SD-AMOS :appears as a scalable generalisation of a parallel main-memory DBMS where the data partitioning becomes automatic.
Future Work • Other types of DBMS queries. • Client's scalable distributed query decomposer. • challenging appears the design of a scalable distributed query optimizer handling the dynamic data partitioning.
End Thank You for Your Attention CERIA Université Paris IX Dauphine Yakham.Ndiaye@dauphine.fr