GORDA Kickoff meeting INRIA – Sardes project

GORDA Kickoff meetingINRIA – Sardes project Emmanuel Cecchet Sara Bouchenak

Outline • INRIA, ObjectWeb & Sardes • GORDA

INRIA key figures • A public scientific and technological research institute • in computer science and control • under the dual authority of the Ministry of Research and the Ministry of Industry Jan. 2003 A scientific force of 3,000 • 900 permanent staff • 400 researchers • 500 engineers, technical and administrative • 450 researchers from other organizations • 700 Ph.D students • 200 external collaborators • 750 trainees, post-doctoral students, visiting researchers from abroad • (universities or industry)) INRIA Rhône-Alpes 6 Research Units Budget: 120 M€ (tax not incl.)

iCluster 2 • Itanium-2 processors • 104 nodes (Dual 64 bits 900 MHz processors, 3 GB memory, 72 GB local disk) connected through a Myrinet network • 208 processors, 312 GB memory, 7.5 TB disk • Connected to the GRID • Linux OS (RedHat Advanced Server) • First Linpack experiments at INRIA (Aug. 03) have reached 560 GFlop/s • Applications : Grid computing,classical scientific computing, high performance Internet servers, …

ObjecWeb key figures • Open source middleware development • Based on open standard • J2EE, CORBA, OSGi… • International consortium • Founded by INRIA, Bull and FT R&D in 2001 • Academic partners • European universities and research centers • Industrial partners • RedHat, Suse, MySQL, … • NEC, Bull, France Telecom, Dassault, Cap Gemini, …

Applications OSGi J2EE CCM ... Application Servers Dedicated Middleware Platform Interoperability Transactions Deployment Persistence Tools Components Frameworks ... Database Connectivity Composition JVM ... ... Hardware & OS Networked Resources Common Software & Architecture for Component Based Development JMOB JOnAS OSCAR RUBiS OpenCCM ProActive Speedo JORAM DotNetJ CAROL Enhydra XMLC JORM/MEDOR JOTM OSCAR Kilim Zeus C-JDBC Fractal Jonathan Bonita RmiJdbc Think JAWE Octopus

Sardes project • Distributed Systems group • Main research themes • Reflective component technology • Autonomous systems management • Applications areas • high-availability J2EE servers • dynamic monitoring, configuration and resource management in large scale distributed systems • (embedded system networks, ubiquitous computing) • Result dissemination by ObjectWeb

Outline • INRIA, ObjectWeb & Sardes • GORDA

Sardes experiences • Component-based open source middleware • ObjectWeb (http://www.objectweb.org) • J2EE application servers • JOnAS clustering (http://jonas.objectweb.org) • Database replication middleware • C-JDBC (http://c-jdbc.objectweb.org) • Benchmarking • RUBiS (http://rubis.objectweb.org) • TPC-W (http://jmob.objectweb.org) • CLIF (http://clif.objectweb.org) • Monitoring • LeWYS (http://lewys.objectweb.org)

Database Well-known database vendor here Well-known hardware + database vendors here Common scalability practice • Cons • Cost • Scalability limit App. server Web frontend Internet

Replication with shared disks • Cons • still expensive hardware • availability App. server Disks Database Web frontend Internet Another well-known database vendor

Master/Slave replication • Cons • consistency • failover time on master failure • scalability App. server Master Web frontend Internet

Atomic broadcast-based replication • Database tier should be • scalable • highly available • without modifying the client application • database vendor independent • on commodity hardware Internet Atomic broadcast

JDBC JDBC C-JDBC • JDBC compliant (no client application modification) • database vendor independent • JDBC driver required • heterogeneity support • no 2PC, no group communication between databases • group communication for controller replication only JDBC Internet

RAIDb - Definition • Redundant Array of Inexpensive Databases • better performance and fault tolerance than a single database, at a low cost, by combining multiple database instances into an array of databases • RAIDb levels offers various tradeoff of performance and fault tolerance

RAIDb • Redundant Array of Inexpensive Databases • better performance and fault tolerance than a single database, at a low cost, by combining multiple database instances into an array of databases • RAIDb controller • gives the view of a single database to the client • balance the load on the database backends • RAIDb levels • RAIDb-0: full partitioning • RAIDb-1: full mirroring • RAIDb-2: partial replication • composition possible

C-JDBC – Key ideas • Middleware implementing RAIDb • Two components • generic JDBC 2.0 driver (C-JDBC driver) • C-JDBC Controller • C-JDBC Controller provides • performance scalability • high availability • failover • caching, logging, monitoring, … • Supports heterogeneous databases

C-JDBC – Overview

Heterogeneity support • unload a singleOracle DB withseveral MySQL • RAIDb-2 forpartial replication

Inside the C-JDBC Controller Sockets Sockets JMX

C-JDBC features • unified authentication management • tunable concurrency control • automatic schema detection • tunable replication: full partitioning, partial replication, full replication • caching: metadata, parsing, result with various invalidation granularities • various load balancing strategies • on-the-fly query rewriting for macros and heterogeneity support • recovery log for dynamic backend adding and failure recovery • database backup/restore using Octopus • JMX based monitoring and administration • graphical administration console

connectmyDB connectlogin, password executeSELECT * FROM t Functional overview

executeINSERT INTO t … Functional overview

Failures • No 2 phase-commit • parallel transactions • failed nodes are automatically disabled executeINSERT INTO t …

Controller replication jdbc:c-jdbc://node1:25322,node2:12345/myDB • Prevent the controller from being a single point of failure • Group communication for controller synchronization • C-JDBC driver supports multiple controllers with automatic failover

jdbc:cjdbc://node1,node2/myDB Uniform total order reliable multicast Controller replication

Mixing horizontal & vertical scalability

Lessons learned • SQL parsing cannot be generic • many discrepancies in JDBC implementations • minimize the use of group communications • IP multicast does not scale • notification infrastructure needed • users want • no single point of failure • control (monitoring, plug-able recovery policies, …) • no database vendor locking • no database modification • need for an exhaustive test suite • benchmarking accurately is very difficult • load injection requires resources • monitoring and exploiting results is tricky

Sardes role in GORDA • provide input • GORDA APIs • group communication requirements • monitoring and management requirements • middleware implementation based on C-JDBC • dissemination effort • ObjectWeb • possible participation to JCP for JDBC extensions • hardware resources for experiments • eCommerce benchmarks

Other interests • LeWYS (http://lewys.objectweb.org) • monitoring infrastructure • generic hardware/kernel probes for Linux/Windows • software probes: JMX, SNMP, … • monitoring repository • autonomic behavior • building supervision loops • self-healing clusters • self-sizing (expand or shrink) • SLAs

Q without A • do we consider distributed query execution? • XA support? • cluster size targeted? • do we target grids or cluster of clusters? • reconciliation • consistency/caching • network architecture considered? • are relaxed or loose consistency options? • what will really cover the GRI? • do we impose a specific way of doing replication? • access to read-set/write-set difficult to implement with legacy databases • which workloads are considered? • which WP deals with backup/recovery? • licensing issues?

Q&A_________Thanks to all users and contributors ...

Bonus slides

INTERNALS

Virtual Database • gives the view of a single database • establishes the mapping between the database name used by the application and the backend specific settings • backends can be added and removed dynamically • configured using an XML configuration file

Authentication Manager • Matches real login/password used by the application with backend specific login/ password • Administrator login to manage the virtual database

Scheduler • Manages concurrency control • Specific implementations for Single DB, RAIDb 0, 1 and 2 • Query-level • Optimistic and pessimistic transaction level • uses the database schema that is automatically fetched from backends

Request cache • caches results from SQL requests • improved SQL statement analysis to limit cache invalidations • table based invalidations • column based invalidations • single-row SELECT optimization • request parsing possible in theC-JDBC driver • offload the controller • parsing caching in the driver

Load balancer 1/2 • RAIDb-0 • query directed to the backend having the needed tables • RAIDb-1 • read executed by current thread • write executed in parallel by a dedicated thread per backend • result returned if one, majority or all commit • if one node fails but others succeed, failing node is disabled • RAIDb-2 • same as RAIDb-1 except that writes are sent only to nodes owning the written table

Load balancer 2/2 • Static load balancing policies • Round-Robin (RR) • Weighted Round-Robin (WRR) • Least Pending Requests First (LPRF) • request sent to the node that has the shortest pending request queue • efficient if backends are homogeneous in terms of performance

Connection Manager • Connection pooling for a backend • Simple : no pooling • RandomWait : blocking pool • FailFast : non-blocking pool • VariablePool : dynamic pool • Connection pools defined on a per login basis • resource management per login • dedicated connections for admin

Recovery Log • Checkpoints are associated with database dumps • Record all updates and transaction markers since a checkpoint • Used to resynchronize a database from a checkpoint • JDBCRecoveryLog • store information in a database • can be re-injected in a C-JDBC cluster for fault tolerance

SCALABILITY

C-JDBC scalability • Horizontal scalability • prevents the controller to be a Single Point Of Failure (SPOF) • distributes the load among several controllers • uses group communications for synchronization • C-JDBC Driver • multiple controllers automatic failover • jdbc:c-jdbc://node1:25322,node2:12345/myDB • connection caching • URL parsing/controller lookup caching

C-JDBC scalability • Vertical scalability • allows nested RAIDb levels • allows tree architecture for scalable write broadcast • necessary with large number of backends • C-JDBC driver re-injected in C-JDBC controller

C-JDBC vertical scalability • RAIDb-1-1with C-JDBC • no limit tocompositiondeepness

C-JDBC vertical scalability • RAIDb-0-1with C-JDBC

CHECKPOINTING

Fault tolerant recovery log UPDATE statement

Checkpointing • Octopus is an ETL tool • Use Octopus to store a dump of the initial database state • Currently done by the user using the database specific dump tool

GORDA Kickoff meeting INRIA – Sardes project