
Espresso - a Feasibility Study of a Scalable, Performant ODBMS
Dirk Duellmann, CERN IT/DB and RD45






Presentation Transcript


  1. Espresso - a Feasibility Study of a Scalable, Performant ODBMS • Dirk Duellmann, CERN IT/DB and RD45 • Aim of this Study • Architectural Overview • Espresso Components • Prototype Status & Plans

  2. Why Espresso? • RD45 Risk Analysis Milestone • Understand the effort needed to develop an ODBMS suitable as a fallback solution for LHC data stores • Testbed that allows us to test novel solutions for remaining problems • e.g. VLDB issues, asynchronous I/O, user schema & data, modern C++ binding, ... • NO plans to stop the Objectivity production service!

  3. Could a home-grown ODBMS be feasible? • Most database kernels were developed in C in the late '80s and before • Today all main design choices are extensively studied in the computer science literature • The C++ language and library provide a much better development platform than C • Our specific requirements are better understood • We know much better what we need (and don't need) • We could reuse HEP developments in many areas, like the mass storage interface and security • Building an ODBMS for HEP is an engineering task, not a research task • We don't need to spend the O(150) person years which went into the first ODBMSs!

  4. System Requirements • Scalability • in data volume and number of client connections • Navigational Access • with performance close to network and disk limits • Heterogeneous Access • from multiple platforms and languages • Transactional Safety & Crash Recovery • automatic consistency after software/hardware failures

  5. A Clean Sheet Approach - What should/could be done differently? • No need for big architectural changes • Objectivity/DB largely fulfils our functional requirements • Migration would be easier if the access model is similar (e.g. ODMG-like) • Focus on remaining problems • Improved Scalability & Concurrency of the Storage Hierarchy • Larger address space (VLDB) • Segmented and more scalable schema & catalogue • Improved Support for the HEP Environment • parallel development - a concept of a user/developer sandbox within the store is needed • Simplify Partial Distribution of the Data Store • import/export consistent subsets of the store

  6. Flexible Storage Hierarchy • File - Group of physically clustered objects • Smallest possible Espresso store • Contains data and optionally schema • Fast navigation within the file using physical OIDs • Domain - Group of files with tightly coupled objects • Contains domain catalogue, data and additional schema • Navigation between all objects within the domain using physical OIDs • Federation - Group of weakly coupled domains • Domain catalogue (very few updates!) • Shared schema (very few updates!)

  7. [Diagram: example federation layout. A production server holds the FD catalogue. Read-only domains (no locking required) Period 1 … Period N each contain their own domain catalogue and the files P1…Pn RAW, P1…Pn REC and P1…Pn AOD. A calib server domain holds its domain catalogue with Calib ECAL, Calib HCAL and Calib TPC, plus the shared production schema. A User 1 “sandbox” domain contains the User 1 domain catalogue, User 1 tags, User 1 histos and User 1 MyTracks with its own schema (class myTrack).]

  8. Espresso OID Layout [diagram: the OID packs Domain#, File#, Page# and Object# fields, addressing Federation → Domain → File → Page → Object] • Federation • set of weakly coupled domains • Domain# 32bit • a set of tightly coupled objects • e.g. a run or run period, an end-user workspace • File# 16bit • a single file within a domain • Page# 32bit • a single logical page in the file • Object# 16bit • a single data record on a page, e.g. an object or varray
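A minimal C++ sketch of this 96-bit layout. The field names and their order are illustrative assumptions, not Espresso's actual declarations; fields are ordered by size here so the struct packs into 96 bits without padding:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical sketch of the 96-bit OID described on this slide
    // (32-bit domain, 16-bit file, 32-bit page, 16-bit object slot).
    struct OID {
        std::uint32_t domain;  // 32 bit: domain within the federation
        std::uint32_t page;    // 32 bit: logical page within the file
        std::uint16_t file;    // 16 bit: file within the domain
        std::uint16_t object;  // 16 bit: data record (object/varray) on the page
    };

    static_assert(sizeof(OID) == 12, "OID should pack into 96 bits");

    int main() {
        OID oid{7, 1042, 3, 12};  // domain 7, page 1042, file 3, slot 12
        std::printf("%u-%u-%u-%u\n", oid.domain, oid.file, oid.page, oid.object);
    }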

  9. Prototype Implementation • Espresso is implemented in standard C++ • no other dependencies • (for now we use portable network I/O from ObjectSpace) • Expects a full C++ compiler • STL containers • in fact all containers in the current implementation are STL containers • Exceptions • the C++ binding uses exceptions to signal error conditions (conforming to the ODMG standard) • Namespaces • all of the implementation is contained in namespace “espresso” • the C++ binding is in namespace “odmg” • Development Platform: RedHat Linux & g++

  10. Component Approach • Espresso is split into a small set of replaceable components, each with a well defined • task • interface • dependency on other components • Common Services • Storage Manager • Schema Manager • Catalogue Manager • Data Server • Lock Server • C++ & Python Bindings (Java)

  11. [Diagram: top-level components and their “depends on” relations. The user API (C++ binding, Python binding, JAVA binding) sits on the tool interface (SchemaMgr, StorageMgr, CatalogMgr, TransMgr); below the storage-level interface sit the PageServer and the LockServer with its lock table, together with page, net and file I/O, all resting on the OS & network abstraction. The diagram marks the distribution boundary at the storage-level interface.]

  12. Components: Physical Model • Each top-level component corresponds to one shared library and namespace • shared lib dependencies follow category diagram • components are isolated in their namespace • from other components • from user classes • Each shared lib provides IComponent interface • Factory for main provided interfaces • Version and configuration control on component level • implementation version, date and compiler version • boolean flags for optimised, debug, profiling
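A minimal sketch of what the per-shared-library IComponent interface might look like, following the bullets above: a factory for the component's main interfaces plus version and build-configuration queries. All method names here are assumptions, not the actual Espresso API:

    #include <string>

    namespace espresso {

    class IComponent {
    public:
        virtual ~IComponent() {}

        // Factory for the main interfaces the component provides.
        virtual void* createInterface(const std::string& name) = 0;

        // Version and configuration control on component level.
        virtual std::string implementationVersion() const = 0;
        virtual std::string buildDate() const = 0;
        virtual std::string compilerVersion() const = 0;

        // Boolean build flags: optimised, debug, profiling.
        virtual bool isOptimised() const = 0;
        virtual bool isDebug() const = 0;
        virtual bool isProfiling() const = 0;
    };

    } // namespace espresso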

  13. Client Side Components • Storage Manager • store and retrieve variable length opaque data objects • maintains OIDs for data objects • implements transactional safety • language and platform independent • current implementation uses “shadow-paging” to implement transactions • Schema Manager • describe the layout of data types • data member position, size and type, byte ordering for primitive types • used for: • Platform Conversion, Generic Browsing, Schema Consistency • current implementation extracts schema from the debug information provided directly by the compiler • no schema pre-processor required
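An illustrative sketch of a storage-manager interface for variable-length opaque objects, as described on this slide. The method names and the OID stand-in are assumptions, not Espresso's real declarations:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    namespace espresso {

    using OID = std::uint64_t;  // stand-in for the 96-bit physical OID

    class IStorageMgr {
    public:
        virtual ~IStorageMgr() {}

        // Store an opaque, variable-length byte string; the storage
        // manager assigns and maintains the OID for later retrieval.
        virtual OID store(const void* data, std::size_t size) = 0;

        // Retrieve the bytes previously stored under this OID.
        virtual std::vector<char> retrieve(OID oid) = 0;

        // Transactional safety: with shadow paging, commit atomically
        // publishes the new page map; abort simply discards it.
        virtual void begin() = 0;
        virtual void commit() = 0;
        virtual void abort() = 0;
    };

    } // namespace espresso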

  14. Server Side Components • Data Server • transfer data pages from persistent storage (disk/tape) to memory • file system like interface • trivial implementation for local I/O • multi-threaded server daemon for remote I/O • Lock Server • keep a central table of resource locks • getLock (oid) • implements lock waiting and upgrading • very similar approach to most DBMS • Hash Table of resource locks (resource specified as OID) • Queue of waiters per locked resource • moderate complexity: storage manager implements “real” transaction logic
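A toy sketch of the lock-server data structure described above: a table of resource locks keyed by OID, with waiters queuing per locked resource. The real server presumably distinguishes lock modes and supports upgrading; this sketch grants only exclusive locks:

    #include <condition_variable>
    #include <cstdint>
    #include <map>
    #include <mutex>

    class LockTable {
    public:
        using Oid = std::uint64_t;  // stand-in for the 96-bit OID

        // getLock(oid): wait in the queue until the resource is free.
        void getLock(Oid oid) {
            std::unique_lock<std::mutex> guard(mutex_);
            waiters_.wait(guard, [&] { return locks_.count(oid) == 0; });
            locks_[oid] = true;
        }

        void releaseLock(Oid oid) {
            {
                std::lock_guard<std::mutex> guard(mutex_);
                locks_.erase(oid);
            }
            waiters_.notify_all();  // wake clients waiting on this resource
        }

    private:
        std::mutex mutex_;
        std::condition_variable waiters_;
        std::map<Oid, bool> locks_;  // locked resources, keyed by OID
    };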

  15. C++ Language Binding • Support all main language features • including polymorphic access and templates • No language extensions, no generated code • ODMG 2.0 compliant C++ Binding • Ref templates can be sub-classed to extend their behavior • e.g. d_Ref could be extended to monitor object access counts • a large fraction of the binding has already been implemented • smart pointers can point to transient objects • persistent-capable classes may be embedded into other persistent classes • d_activate and d_deactivate are implemented • the design supports multiple DB contexts per process • e.g. for multi-threaded applications and multiple federations • Work in progress: • B-Tree indices, bi-directional links, installable adapters for persistent objects
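A toy stand-in for the ODMG d_Ref<T> smart pointer, reduced to the bare minimum needed to show the subclassing idea from this slide (the real binding also handles OID resolution, activation and transactions):

    #include <cstddef>

    template <class T>
    class d_Ref {
    public:
        explicit d_Ref(T* obj = 0) : obj_(obj) {}
        T* operator->() const { return obj_; }  // would fault the object in
    protected:
        T* obj_;
    };

    // Extending d_Ref to monitor object access counts, as suggested above.
    template <class T>
    class CountingRef : public d_Ref<T> {
    public:
        explicit CountingRef(T* obj = 0) : d_Ref<T>(obj), accesses_(0) {}
        T* operator->() const {
            ++accesses_;  // count every dereference of the persistent object
            return d_Ref<T>::operator->();
        }
        std::size_t accessCount() const { return accesses_; }
    private:
        mutable std::size_t accesses_;
    };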

  16. First Scalability & Performance Tests • Page Server • up to 70 concurrent clients • Lock Server • up to 150 concurrent clients, up to 3000 locks • Storage Manager • files up to 2 GB (ext2 file system limit under Linux) • 100 million objects per file • stress tested with “random” bit-patterns • objects up to 10 MB in size • Write performance: > 40 MB/s at 30% CPU • 450 MHz dual PIII with 4-stripe RAID 0 on RedHat 6.1 • C++ Binding and Schema Handling • successfully ported several non-trivial applications • HTL histogram examples, simple object browser using Python • tagDb and naming examples from HepODBMS

  17. Next Steps • Start a detailed requirement discussion with experiments and other interested institutes • Continue scalability & performance tests • Storage Manager: larger files (>100 GB) • Page Server: connections > 500 • Lock Server: number of locks > 20k • C++ Binding & Schema Manager: port Geant4 persistency examples and Conditions-DB • By summer this year • Written Architectural Overview of the Prototype • Development Plan with detailed estimate of required manpower • Single-user toy system

  18. Summary & Conclusions • We identified solutions for the most critical components of a scalable and performant ODBMS • The prototype implementation shows promising performance and scalability • Using a strict component approach allows the effort to be split into independently developed, replaceable modules • The development of an Open Source ODBMS seems possible within the HEP or general science community • A collaborative effort of the order of 15 person years seems sufficient to produce such a system at production quality

  19. The End

  20. Exploit Read-Only Data • Most of our data volume follows the pattern • (private) write-once • share read-only • e.g. raw data is never updated; reconstructed data is not updated but replaced • Current ODBMS implementations do not really take advantage of this fact • Read-only files • no need to obtain any locks for this data • no need to ever update cache content • simple backup strategy • Using the concept of read-only files • e.g. in the catalogue • should significantly reduce the locking overhead and improve the scalability of the system with many concurrent clients (see the sketch below)
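A sketch of the read-only optimisation argued for above: if the domain catalogue flags a file read-only, a client never contacts the lock server for its pages and never revalidates cached content. All names here are assumptions for illustration only:

    #include <cstdint>

    struct FileInfo {
        bool readOnly;  // write-once data, flagged once in the catalogue
    };

    struct Page { /* page payload */ };

    Page* fetchPage(const FileInfo&, std::uint32_t) { return 0; }  // cache/disk stub
    void acquireReadLock(std::uint32_t) {}                         // lock-server RPC stub

    Page* readPage(const FileInfo& file, std::uint32_t pageNo) {
        if (!file.readOnly)
            acquireReadLock(pageNo);  // only mutable data pays the locking overhead
        return fetchPage(file, pageNo);
    }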

  21. Transactions and Recovery • Shadow Paging • Physical pages on disk are accessed indirectly through a translation table (page map) • Copy-on-write: page modifications are always written to a new, free physical page • Changed physical pages are made visible to other transactions by updating the page map at commit time • [Diagram: logical pages 1, 2, 3 translated through the master page map (PageMap 1) to physical data pages; a second map (PageMap 2) holds the copy-on-write translations to new physical pages 6, 7, 8, installed as master at commit]
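A toy model of the shadow-paging scheme on this slide: logical pages are resolved through a page map, copy-on-write sends every modification to a new physical page in a private (shadow) map, and commit publishes the shadow map as the new master. Free-page bookkeeping and the on-disk atomic map switch are omitted; all names are illustrative:

    #include <cstdint>
    #include <vector>

    class ShadowPagedFile {
    public:
        explicit ShadowPagedFile(std::uint32_t pages)
            : master_(pages), shadow_(pages), nextFree_(pages) {
            for (std::uint32_t i = 0; i < pages; ++i)
                master_[i] = shadow_[i] = i;  // identity mapping to start
        }

        // Copy-on-write: redirect the logical page to a fresh physical
        // page; the old physical page stays intact for other readers.
        std::uint32_t writePage(std::uint32_t logical) {
            return shadow_[logical] = nextFree_++;
        }

        // The writing transaction sees its shadow map; others the master.
        std::uint32_t translate(std::uint32_t logical, bool inTransaction) const {
            return inTransaction ? shadow_[logical] : master_[logical];
        }

        void commit() { master_ = shadow_; }  // publish the updated map
        void abort()  { shadow_ = master_; }  // discard; master never changed

    private:
        std::vector<std::uint32_t> master_;  // map seen by committed readers
        std::vector<std::uint32_t> shadow_;  // transaction-private page map
        std::uint32_t nextFree_;             // naive free-page allocator
    };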

  22. Advantages of this Approach • Single files or complete domains can be used stand-alone without modification • e.g. a set of user files containing tags and histograms • Local OIDs could be stored in a more compact form • transparent expansion into a full OID as they are read into memory • “Attaching” or directly sharing files or complete domains does not need any special treatment • no OID translation needed • read-only files/domains can be shared directly by multiple federations • Domains allow segmenting the store into “coherent regions” of associated objects • efficient distribution, backup and replication of subsets of the data (e.g. a run period, a set of user tracks) • consistency checks can be constrained to a single domain

  23. Common Services • Services and interfaces of global visibility • OID, IStorageMgr, IPageServer, ILockServer, ISchemaMgr • Platform & OS abstraction • fixed-range types, I/O primitives, process control • Component interface • version & configuration control • component factory • Extendible diagnostics • named counters and timers to instrument the code • each component may have a sub-tree of diagnostic items • Error & debug message handler • syslog-like: component, level, message • exception base class
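A minimal sketch of the “named counters” diagnostic service listed above, with components registering counters under their own sub-tree, e.g. "StorageMgr/pagesWritten". The API and naming scheme are assumptions:

    #include <map>
    #include <string>

    class Diagnostics {
    public:
        void increment(const std::string& name, long by = 1) {
            counters_[name] += by;
        }
        long value(const std::string& name) const {
            std::map<std::string, long>::const_iterator it = counters_.find(name);
            return it == counters_.end() ? 0 : it->second;
        }
    private:
        std::map<std::string, long> counters_;  // hierarchical, '/'-separated names
    };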

  24. Espresso Schema Extraction • Currently implemented • extraction based on the “stabs” standard format for debugging information (used by egcs and Sun CC) • based on the GNU “BFD” library and “objdump” source code • The prototype provides full runtime reflection for C++ data • describes classes and structs with their fields and inheritance • supports namespaces, typedefs, enums and templates • location and value of virtual function and virtual base class pointers • sufficient to allow a runtime field-by-field consistency check against the persistent schema • Starting from a modified egcs front-end as a schema extractor would be an alternative
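An illustrative model of the runtime reflection such a schema extractor could build from “stabs” debug information: per-class field descriptions (member position, size, type) plus inheritance, enough for a field-by-field consistency check against the persistent schema. These type names are assumptions, not Espresso's actual classes:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct FieldInfo {
        std::string name;      // data member name
        std::string typeName;  // primitive or class type
        std::size_t offset;    // member position within the object
        std::size_t size;      // in bytes, e.g. for byte-order conversion
    };

    struct ClassInfo {
        std::string name;
        std::vector<std::string> baseClasses;  // inheritance
        std::vector<FieldInfo> fields;         // ordered as in the class layout
    };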
