Espresso - a Feasibility Study of a Scalable, Performant ODBMS • Dirk Duellmann, CERN IT/DB and RD45 • Aim of this Study • Architectural Overview • Espresso Components • Prototype Status & Plans
Why Espresso? • RD45 Risk Analysis Milestone • Understand the effort needed to develop an ODBMS suitable as a fallback solution for LHC data stores • Testbed that allows us to test novel solutions for remaining problems • e.g. VLDB issues, asynchronous I/O, user schema & data, modern C++ binding, ... • NO plans to stop the Objectivity production service!
Could a home-grown ODBMS be feasible? • Most database kernels have been developed in “C” in the late 80s and before • Today all main design choices are extensively studied in the computer science literature • The C++ language and library provide a much better development platform than C • Our specific requirements are better understood • We know much better what we need (and what we do not need) • We could reuse HEP developments in many areas such as the mass storage interface and security • Building an ODBMS for HEP is an engineering task, not a research task • We don’t need to spend the O(150) person years which went into the first ODBMSs!
System Requirements • Scalability • in data volume and number of client connections • Navigational Access • with performance close to network and disk limits • Heterogeneous Access • from multiple platforms and languages • Transactional Safety & Crash Recovery • automatic consistency after software/hardware failures
A Clean Sheet Approach - What should/could be done differently? • No need for big architectural changes • Objectivity/DB largely fulfils our functional requirements • Migration would be easier if the access model is similar (e.g. ODMG-like) • Focus on remaining problems • Improved Scalability & Concurrency of the Storage Hierarchy • Larger address space (VLDB) • Segmented and more scalable schema & catalogue • Improved Support for the HEP environment • parallel development - a concept of a user/developer sandbox within the store is needed • Simplify Partial Distribution of the Data Store • import/export consistent subsets of the store
Flexible Storage Hierarchy • File - Group of physically clustered objects • Smallest possible Espresso store • Contains data and optionally schema • Fast navigation within the file using physical OIDs • Domain - Group of files with tightly coupled objects • Contains domain catalogue, data and additional schema • Navigation between all objects within the domain using physical OIDs • Federation - Group of weakly coupled domains • Domain catalogue (very few updates!) • Shared schema (very few updates!)
[Diagram: example federation layout - per-period domains (Period 1 ... Period N) each with a domain catalogue and RAW, REC and AOD files (read-only domains, no locking required); a production server holding the FD catalogue; a calibration server domain with ECAL, HCAL and TPC calibrations, the production schema and its domain catalogue; and per-user “sandbox” domains with user tags, histograms, private schema (e.g. MyTracks) and their own domain catalogues.]
Espresso OID Layout • Federation • set of weakly coupled domains • Domain# 32bit • set of tightly coupled objects • e.g. a run or run period, an end-user workspace • File# 16bit • a single file within a domain • Page# 32bit • a single logical page in the file • Object# 16bit • a single data record on a page, e.g. an object or varray
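As a rough illustration of the 96-bit OID layout above, a minimal C++ sketch is shown below; the struct and member names are illustrative assumptions, not the actual Espresso types.

```cpp
// Minimal sketch of the Espresso OID layout described above:
// domain 32 bit, file 16 bit, page 32 bit, object 16 bit (96 bits total).
// Type and member names are illustrative, not the actual Espresso API.
#include <cstdint>

namespace espresso {

struct OID {
    std::uint32_t domain;   // weakly coupled domain within the federation
    std::uint16_t file;     // a single file within that domain
    std::uint32_t page;     // a single logical page in the file
    std::uint16_t object;   // a single data record (object or varray) on the page

    // Physical navigation via OIDs is valid between objects of the same domain.
    bool sameDomain(const OID& other) const { return domain == other.domain; }
};

} // namespace espresso
```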
Prototype Implementation • Espresso is implemented in standard C++ • no other dependencies • (for now we use portable network I/O from ObjectSpace) • Expects a full standard C++ compiler • STL containers • in fact all containers in the current implementation are STL containers • Exceptions • C++ binding uses exceptions to signal error conditions (conforming to the ODMG standard) • Namespaces • All of the implementation is contained in namespace “espresso” • C++ binding is in namespace “odmg” • Development Platform: RedHat Linux & g++
Component Approach • Espresso is split into a small set of replaceable components with well defined • task • interface • dependency on other components • Common Services • Storage Manager • Schema Manager • Catalogue Manager • Data Server • Lock Server • C++ & Python Binding, (JAVA)
[Diagram: top-level components and their dependencies - User API layer (C++, Python and JAVA bindings), tool interface (SchemaMgr, StorageMgr, CatalogMgr, TransMgr), storage-level interface (PageServer, LockServer, lock table, page/net/file I/O), all resting on the OS & network abstraction; arrows show “depends on” relations and a distribution boundary.]
Components: Physical Model • Each top-level component corresponds to one shared library and namespace • shared lib dependencies follow the category diagram • components are isolated in their namespace • from other components • from user classes • Each shared lib provides an IComponent interface • Factory for the main provided interfaces • Version and configuration control on component level • implementation version, date and compiler version • boolean flags for optimised, debug, profiling
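A minimal sketch of what such a per-library IComponent interface might look like is given below, covering the factory and version/configuration queries listed above; all method names are assumptions.

```cpp
// Hedged sketch of an IComponent-style interface exposed by each shared
// library, as outlined above. Method names are assumptions, not the actual
// Espresso interface.
#include <string>

namespace espresso {

class IComponent {
public:
    virtual ~IComponent() {}

    // Factory for the main interfaces provided by this component
    // (returned as void* only to keep the sketch self-contained).
    virtual void* createInterface(const std::string& interfaceName) = 0;

    // Version and configuration control on component level.
    virtual std::string implementationVersion() const = 0;
    virtual std::string buildDate() const = 0;
    virtual std::string compilerVersion() const = 0;

    // Boolean build flags: optimised, debug, profiling.
    virtual bool isOptimised() const = 0;
    virtual bool isDebug() const = 0;
    virtual bool isProfiling() const = 0;
};

} // namespace espresso
```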
Client Side Components • Storage Manager • store and retrieve variable-length opaque data objects • maintains OIDs for data objects • implements transactional safety • language and platform independent • current implementation uses “shadow paging” to implement transactions • Schema Manager • describes the layout of data types • data member position, size and type, byte ordering for primitive types • used for: • Platform Conversion, Generic Browsing, Schema Consistency • current implementation extracts the schema from the debug information provided directly by the compiler • no schema pre-processor required
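To make the storage manager's role concrete, here is a minimal sketch of a storage-manager interface for variable-length opaque objects addressed by OID; the method names and the stand-in OID type are assumptions, not the actual Espresso API.

```cpp
// Hedged sketch of a storage-manager style interface as described above.
#include <cstddef>
#include <vector>

namespace espresso {

// Minimal stand-in for the 96-bit OID (see the OID layout sketch above).
struct OID { /* domain, file, page, object fields */ };

class IStorageMgr {
public:
    virtual ~IStorageMgr() {}

    // Store a variable-length opaque object; the storage manager assigns its OID.
    virtual OID store(const void* data, std::size_t length) = 0;

    // Retrieve the opaque data behind an OID.
    virtual std::vector<char> retrieve(const OID& oid) = 0;

    // Transactional safety: changes become visible atomically at commit
    // (the current implementation uses shadow paging, see below).
    virtual void begin()  = 0;
    virtual void commit() = 0;
    virtual void abort()  = 0;
};

} // namespace espresso
```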
Server Side Components • Data Server • transfer data pages from persistent storage (disk/tape) to memory • file-system-like interface • trivial implementation for local I/O • multi-threaded server daemon for remote I/O • Lock Server • keeps a central table of resource locks • getLock(oid) • implements lock waiting and upgrading • very similar approach to most DBMSs • Hash Table of resource locks (resource specified as OID) • Queue of waiters per locked resource • moderate complexity: the storage manager implements the “real” transaction logic
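The lock-server structure described above (a hash table of resource locks keyed by OID, with a queue of waiters per locked resource) could look roughly like the sketch below; class and member names are illustrative assumptions.

```cpp
// Hedged sketch of the lock-server data structures described above.
// Names are assumptions; lock upgrading and waiter wake-up are omitted.
#include <cstdint>
#include <deque>
#include <unordered_map>

namespace espresso {

using OidKey   = std::uint64_t;   // hashed/packed OID used as lock-table key
using ClientId = int;

enum class LockMode { Read, Write };

struct ResourceLock {
    LockMode mode;                   // currently granted mode
    std::deque<ClientId> holders;    // clients holding the lock
    std::deque<ClientId> waiters;    // clients queued for the resource
};

class LockServer {
public:
    // getLock(oid): grant immediately if compatible, otherwise queue the client.
    bool getLock(OidKey oid, ClientId client, LockMode mode) {
        auto it = table_.find(oid);
        if (it == table_.end()) {
            table_[oid] = ResourceLock{mode, {client}, {}};
            return true;                        // resource was free: granted
        }
        ResourceLock& lock = it->second;
        if (mode == LockMode::Read && lock.mode == LockMode::Read) {
            lock.holders.push_back(client);     // shared read locks are compatible
            return true;
        }
        lock.waiters.push_back(client);         // incompatible: client must wait
        return false;
    }

private:
    std::unordered_map<OidKey, ResourceLock> table_;   // hash table of resource locks
};

} // namespace espresso
```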
C++ Language Binding • Support all main language features • Including polymorphic access and templates • No language extensions, no generated code • ODMG 2.0 compliant C++ Binding • Ref templates can be sub-classed to extend their behavior • e.g. d_Ref could be extended to monitor object access counts • a large fraction of the binding has already been implemented • smart pointers can point to transient objects • persistent-capable classes may be embedded into other persistent classes • d_activate and d_deactivate are implemented • design supports multiple DB contexts per process • e.g. for multi-threaded applications and multiple federations • Work in progress: • B-Tree indices, bi-directional links, installable adapters for persistent objects
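As an example of the extensibility mentioned above, the sketch below subclasses a d_Ref-style smart pointer to count object accesses; the minimal d_Ref shown is only a stand-in so the sketch compiles on its own, the real template comes from the ODMG binding.

```cpp
// Hedged sketch: extending a d_Ref<T>-style smart pointer to count accesses.
#include <cstddef>

// Stand-in for the ODMG d_Ref<T> provided by the C++ binding (illustrative only).
template <class T>
class d_Ref {
public:
    explicit d_Ref(T* obj = 0) : obj_(obj) {}
    T* operator->() { return obj_; }   // the real binding would activate the object here
private:
    T* obj_;
};

// Extension that monitors how often the referenced object is accessed.
template <class T>
class CountedRef : public d_Ref<T> {
public:
    explicit CountedRef(T* obj = 0) : d_Ref<T>(obj), accesses_(0) {}

    T* operator->() {
        ++accesses_;                       // count each dereference
        return d_Ref<T>::operator->();     // then delegate to the base class
    }

    std::size_t accessCount() const { return accesses_; }

private:
    std::size_t accesses_;
};
```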
First Scalability & Performance Tests • Page Server • up to 70 concurrent clients • Lock Server • up to 150 concurrent clients, up to 3000 locks • Storage Manager • Files up to 2 GB (ext2 file system limit under LINUX) • 100 million objects per file • stress tested with “random” bit-patterns • Objects up to 10 MB size • Write Performance: > 40MB/s at 30% CPU • 450MHz dual PIII with 4 stripe RAID 0 on RedHat 6.1 • C++ Binding and Schema Handling • successfully ported several non-trivial applications • HTL histogram examples, simple object browser using python • tagDb and naming examples from HepODBMS
Next Steps • Start detailed requirement discussion with experiments and other interested institutes • Continue Scalability & Performance Test • Storage Manager: larger files (>100GB) • Page Server: connections > 500 • Lock Server: number of locks > 20k • C++ Binding & Schema Manager: port Geant4 persistency examples and Conditions-DB • By summer this year • Written Architectural Overview of the Prototype • Development Plan with detailed estimate of required manpower • Single user toy-system
Summary & Conclusions • We identified solutions for the most critical components of a scalable and performant ODBMS • The prototype implementation shows promising performance and scalability • Using a strict component approach allows the effort to be split into independently developed, replaceable modules • The development of an Open Source ODBMS seems possible within the HEP or general science community • A collaborative effort of the order of 15 person years seems sufficient to produce such a system at production quality
Exploit Read-Only Data • Most of our data volume follows the pattern • (private) write-once, • share read-only • e.g. raw data is never updated, reconstructed data is not updated but replaced • Current ODBMS implementations do not really take advantage of this fact • read-only files • no need to obtain any locks for this data • no need to ever update cache content • simple backup strategy • Using the concept of read-only files • e.g. in the catalogue • should significantly reduce the locking overhead and improve the scalability of the system with many concurrent clients
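A minimal sketch of how a read-only flag in the catalogue could short-circuit locking is shown below; the structure and function names are assumptions, not the actual Espresso catalogue.

```cpp
// Hedged sketch: pages in files flagged read-only in the catalogue need no
// lock-server round trip and their cached copies never need refreshing.
#include <cstdint>

namespace espresso {

struct FileCatalogueEntry {
    std::uint16_t fileId;
    bool readOnly;       // write-once data marked read-only in the catalogue
};

// Decide whether a page access needs a lock-server call at all.
inline bool needsLock(const FileCatalogueEntry& file, bool forUpdate) {
    if (file.readOnly && !forUpdate)
        return false;    // shared read-only data: no lock, cache never invalidated
    return true;         // mutable data: go through the central lock server
}

} // namespace espresso
```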
Transactions and Recovery • Shadow Paging • Physical pages on disk are accessed indirectly through a translation table (page map). • Copy-on-Write: page modifications are always written to a new, free physical page • Changed physical pages are made visible to other transactions by updating the page map at commit time. [Diagram: a master page map and a shadow page map pointing to data pages; modified pages occupy new physical page slots until the map is switched at commit.]
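The shadow-paging scheme above can be sketched as follows, assuming a simple in-memory page map; all names are illustrative and the real implementation of course keeps the maps and pages on disk.

```cpp
// Hedged sketch of shadow paging: logical pages are mapped to physical pages
// through a page map; modifications go to fresh physical pages (copy-on-write)
// and become visible only when the master page map is replaced at commit.
#include <cstdint>
#include <map>

namespace espresso {

using LogicalPage  = std::uint32_t;
using PhysicalPage = std::uint32_t;
using PageMap      = std::map<LogicalPage, PhysicalPage>;

class ShadowPagedFile {
public:
    void begin() { shadow_ = master_; }   // work on a private copy of the map

    // Copy-on-write: a modified page is written to a new physical page and
    // only the shadow map is updated; the master map still points to old data.
    void writePage(LogicalPage page) {
        shadow_[page] = allocateFreePhysicalPage();
        // ... write the new page contents to shadow_[page] on disk ...
    }

    // Commit: atomically replace the master page map so other transactions see
    // the new physical pages. Abort simply discards the shadow map.
    void commit() { master_ = shadow_; }
    void abort()  { shadow_ = master_; }

private:
    PhysicalPage allocateFreePhysicalPage() { return nextFree_++; }

    PageMap master_;              // visible to all transactions
    PageMap shadow_;              // private to the running transaction
    PhysicalPage nextFree_ = 1;   // next free physical page slot
};

} // namespace espresso
```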
Advantages of this Approach • Single files or complete domains can be used stand-alone without modification • e.g. a set of user files containing tags and histograms • Local OIDs could be stored in a more compact form • transparent expansion into a full OID as they are read into memory • “Attaching” or direct sharing of files or complete domains does not need any special treatment • no OID translation needed • read-only files/domains can be shared directly by multiple federations • Domains allow the store to be segmented into “coherent regions” of associated objects • Efficient distribution, backup and replication of subsets of the data (e.g. a run period, a set of user tracks) • Consistency checks can be constrained to a single domain
Common Services • Services and Interfaces of global visibility • OID, IStorageMgr, IPageServer, ILockServer, ISchemaMgr • Platform & OS abstraction • fixed range types, I/O primitives, process control • component interface • version & configuration control • component factory • extendible diagnostics • named counters, timers to instrument the code • each component may have a sub-tree of diagnostic items • error & debug message handler • syslog-like: component, level, message • exception base class
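A rough sketch of the diagnostics and message-handler services listed above is given below; the Diagnostics class and its methods are assumptions made for illustration.

```cpp
// Hedged sketch of the "extendible diagnostics" service: named counters kept
// per component, and a syslog-like message handler taking component, level and
// message. All names are illustrative assumptions.
#include <iostream>
#include <map>
#include <string>

namespace espresso {

enum class MsgLevel { Debug, Info, Warning, Error };

class Diagnostics {
public:
    // Each component may keep its own sub-tree of named counters,
    // e.g. "StorageMgr/pagesWritten".
    void count(const std::string& name, long increment = 1) {
        counters_[name] += increment;
    }

    // Syslog-like message handler: component, level, message.
    void message(const std::string& component, MsgLevel level,
                 const std::string& text) {
        std::clog << component << " [" << static_cast<int>(level) << "] "
                  << text << '\n';
    }

private:
    std::map<std::string, long> counters_;
};

} // namespace espresso
```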
Espresso Schema Extraction • Currently implemented • extraction based on the “stabs” standard format for debugging information (used by egcs and Sun CC) • based on the GNU “BFD” library and “objdump” source code • The prototype provides full runtime reflection for C++ data • describes classes and structs with their fields and inheritance • supports namespaces, typedefs, enums and templates • location and value of virtual function and virtual base class pointers • sufficient to allow a runtime field-by-field consistency check against the persistent schema • Starting from a modified egcs front-end as schema extractor would be an alternative
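To illustrate how the runtime reflection could be used for the field-by-field consistency check mentioned above, here is a hedged sketch with assumed description classes; the real interface is part of the Espresso schema manager.

```cpp
// Hedged sketch: iterating over a class description for a field-by-field
// consistency check against the persistent schema. Names are assumptions.
#include <cstddef>
#include <string>
#include <vector>

namespace espresso {

struct FieldDescription {
    std::string name;       // data member name
    std::string typeName;   // primitive or class type
    std::size_t offset;     // position within the object
    std::size_t size;       // size in bytes
};

struct ClassDescription {
    std::string name;
    std::vector<std::string> baseClasses;
    std::vector<FieldDescription> fields;
};

// Field-by-field comparison of the in-memory class layout (extracted from the
// compiler's debug information) against the schema stored in the federation.
inline bool isConsistent(const ClassDescription& compiled,
                         const ClassDescription& persistent) {
    if (compiled.fields.size() != persistent.fields.size())
        return false;
    for (std::size_t i = 0; i < compiled.fields.size(); ++i) {
        const FieldDescription& a = compiled.fields[i];
        const FieldDescription& b = persistent.fields[i];
        if (a.name != b.name || a.typeName != b.typeName ||
            a.offset != b.offset || a.size != b.size)
            return false;
    }
    return true;
}

} // namespace espresso
```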