Developing Scalable High Performance Petabyte Distributed Databases CHEP ‘98 Andrew Hanushevsky SLAC Computing Services Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy
BaBar & The B-Factory • High precision investigation of B-meson decays • Cosmic ray tracking starts October 1998 • Experiment starts April 1999 • 500 physicists collaborating from >70 sites in 10 countries • USA, Canada, China, France, Germany, Italy, Norway, Russia, UK, Taiwan • The experiment produces large quantities of data • 200 - 400 TB/year for 10 years • Data stored as objects using Objectivity • Heavy computational load • 5,000 SpecInt95’s • 526 Sun Ultra 10’s or 312 Alpha PW600’s • Work will be distributed across the collaboration
Handling The Data & Computation • [Architecture diagram: a compute farm, a network switch, an AMS farm, HPSS, and external collaborators; hardware shown includes RS/6000-F50’s (AIX 4.2), a Sun ES10000 and Sun ES4500’s (Veritas FS/VM, Solaris 2.5), and Sun Ultra 2’s (Solaris 2.6)]
High Performance Storage System • [Diagram: an HPSS application and movers connected to disk and tape, with separate control and data networks] • HPSS components: Bitfile Server, Name Server, Storage Servers, Physical Volume Library, Physical Volume Repositories, Storage System Manager, Migration/Purge Server, Metadata Manager, Log Daemon, Log Client, Startup Daemon, Encina/SFS, DCE
Advanced Multithreaded Server • [Diagram: the client speaks the ams protocol to the AMS, which accesses its local disk via the ufs protocol] • Client/Server Application • Serves “pages” (512 to 64K byte blocks) • Similar to other remote filesystem interfaces (e.g., NFS) • Objectivity client can read and write database “pages” via AMS • Pages range from 512 bytes to 64K in powers of 2 (e.g., 1K, 2K, 4K, etc.) • Enables Data Replication Option (DRO) • Enables Fault Tolerant Option (FTO)
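As an aside, the page-size rule above (a power of two between 512 bytes and 64K) is easy to express in code. The following minimal C++ sketch is purely illustrative and is not part of the AMS implementation; the helper name is hypothetical.

#include <cstdint>
#include <iostream>

// Hypothetical helper: checks the AMS page-size rule quoted above --
// a page must be a power of two between 512 bytes and 64K.
bool isValidAmsPageSize(std::uint32_t bytes) {
    const std::uint32_t kMin = 512, kMax = 64 * 1024;
    if (bytes < kMin || bytes > kMax) return false;
    return (bytes & (bytes - 1)) == 0;   // power-of-two test
}

int main() {
    for (std::uint32_t sz : {512u, 1024u, 3000u, 65536u, 131072u})
        std::cout << sz << " -> "
                  << (isValidAmsPageSize(sz) ? "valid" : "invalid") << '\n';
    return 0;
}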
Veritas File System & Volume Manager • [Diagram: the file system is layered on the volume manager, which is layered on multiple RAID devices] • Volume Manager • Concatenates disk devices to form very large capacity logical devices • Also s/w RAID-0,1,5 and dynamic I/O multi-pathing • File System • High performance journaled file system for fast recovery • Maximizes device speed/size performance (30+ MB/Sec for h/w RAID-5) • Supports 1TB+ files and file systems
Together Alone …. • Veritas Volume Manager + Veritas File System • Excellent I/O performance (10 - 30 MB/Sec) but • Insufficient capacity (1TB) and online cost too high • AMS • Efficient database protocol and highly flexible but • Limited security, low scalability, tied to local filesystem • HPSS • Highly scalable, excellent I/O performance for large files but • High latency for small block transfers (i.e., Objectivity/DB) • Need to synergistically mate these three systems but • Want to keep them independent so any can be changed
The Extensible AMS • [Diagram: the AMS calls the oofs interface; a system-specific glue layer (ooss) maps oofs onto the underlying storage, e.g. a ufs filesystem via vfs or HPSS, each with its own security component]
An Object Oriented Interface

class oofsDesc {
   // General File System Methods
};

class oofsDir {
   // Directory-Specific Methods
};

class oofsFile {
   // File-Specific Methods
};
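The slide only names the three classes. Below is a minimal sketch of how the file-level class might look as an abstract C++ interface; the method set mirrors the oofs function list on the next slide, but every signature here is an assumption for illustration, not the original SLAC definition.

#include <sys/types.h>   // off_t, mode_t, ssize_t
#include <cstddef>       // size_t

// Illustrative sketch of the file-level oofs interface (assumed signatures).
class oofsFile {
public:
    virtual ~oofsFile() {}

    virtual int     open(const char *path, int flags, mode_t mode)   = 0;
    virtual ssize_t read(void *buf, size_t len, off_t offset)        = 0;
    virtual ssize_t write(const void *buf, size_t len, off_t offset) = 0;
    virtual int     sync()                                           = 0;
    virtual int     truncate(off_t size)                             = 0;
    virtual int     close()                                          = 0;

    virtual off_t   getsize()                                        = 0;
    virtual int     getmode(mode_t &mode)                            = 0;
    virtual int     getsectoken(char *buf, size_t buflen)            = 0;

    virtual int     exists(const char *path)                         = 0;
    virtual int     remove(const char *path)                         = 0;
    virtual int     rename(const char *from, const char *to)         = 0;
};

// oofsDesc (general filesystem methods) and oofsDir (opendir/readdir/closedir)
// would be sketched along the same lines.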
The oofs Interface • Provides a standard interface for AMS to get at a filesystem • Any filesystem that can implement the following functions can be used: • close, closedir, exists, getmode, getsectoken, getsize, open, opendir, read, readdir, remove, rename, sync, truncate, write • Includes all current POSIX-like filesystems • The oofs interface is linked with AMS to create an executable • Normally transparent to client applications • Timing may not be transparent
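Since the slide notes that any POSIX-like filesystem qualifies, here is a hedged sketch of how a plain ufs-style backing for a few of these functions might look. The class and method names are hypothetical and the code is not the original SLAC implementation; it simply maps the oofs file operations onto the corresponding POSIX calls.

#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <sys/stat.h>
#include <sys/types.h>

// Hypothetical POSIX ("ufs") backing for part of the oofs file interface.
class UfsFile {
public:
    UfsFile() : fd_(-1) {}
    ~UfsFile() { if (fd_ >= 0) ::close(fd_); }

    int open(const char *path, int flags, mode_t mode) {
        fd_ = ::open(path, flags, mode);
        return fd_ < 0 ? -errno : 0;
    }
    ssize_t read(void *buf, size_t len, off_t offset) {
        return ::pread(fd_, buf, len, offset);    // positioned read of one AMS page
    }
    ssize_t write(const void *buf, size_t len, off_t offset) {
        return ::pwrite(fd_, buf, len, offset);
    }
    int   sync()               { return ::fsync(fd_); }
    int   truncate(off_t size) { return ::ftruncate(fd_, size); }
    off_t getsize() {
        struct stat st;
        return ::fstat(fd_, &st) == 0 ? st.st_size : -1;
    }
    int close() {
        int rc = ::close(fd_);
        fd_ = -1;
        return rc;
    }
private:
    int fd_;   // underlying POSIX file descriptor
};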
The HPSS Interface • HPSS implements a “POSIX” filesystem • The HPSS API library provides sufficient oofs functionality:
• close() → hpss_Close()
• closedir() → hpss_Closedir()
• exists() → hpss_Stat()
• getmode() → hpss_Stat()
• getsectoken() → not applicable
• getsize() → hpss_Fstat()
• open() → hpss_Open() [+ hpss_Create()]
• opendir() → hpss_Opendir()
• read() → hpss_SetFileOffset() + hpss_Read()
• readdir() → hpss_Readdir()
• remove() → hpss_Unlink()
• rename() → hpss_Rename()
• sync() → not applicable
• truncate() → hpss_Ftruncate()
• write() → hpss_SetFileOffset() + hpss_Write()
Additional Issues • [Diagram: security applies both between the application and AMS and between the ooss/vfs layer and HPSS] • Security • Performance • Access patterns (e.g., random vs sequential) • HPSS staging latency • Scalability
Object Based Security Model • Protocol Independent Client Authentication Model • Public or private key • PGP, RSA, Kerberos, etc. • Can be negotiated at run-time • Provides for server authentication • AMS Client must call a special routine to enable security • oofs_Register_Security() • Supplied routine responsible for creating the oofsSecurity object • Client Objectivity Kernel creates security objects as needed • Security objects supply context-sensitive authentication credentials • Works only with Extensible AMS via oofs interface
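To illustrate the registration flow described above: the client registers a factory routine, and the Objectivity kernel later calls it whenever a security object is needed; that object then supplies context-sensitive credentials. The oofsSecurity shape, the factory signature, and the exact prototype of oofs_Register_Security() are assumptions made for this sketch; only the routine and class names come from the slide.

#include <string>

// Hypothetical credential object; the real oofsSecurity interface is not
// shown on the slide, so this shape is an assumption.
class oofsSecurity {
public:
    virtual ~oofsSecurity() {}
    // Return a context-sensitive authentication token (e.g., a Kerberos
    // ticket or a signed challenge) for the current request.
    virtual std::string credentials(const char *context) = 0;
};

class MyKerberosSecurity : public oofsSecurity {
public:
    std::string credentials(const char *context) {
        // A real implementation would obtain or refresh a Kerberos ticket
        // for 'context'; a fixed placeholder is returned here instead.
        return std::string("krb5-token-for-") + context;
    }
};

// Factory supplied by the client; the Objectivity kernel calls it whenever
// a new security object is needed.
static oofsSecurity *mySecurityFactory() { return new MyKerberosSecurity; }

// Assumed prototype for the registration hook named on the slide.
extern void oofs_Register_Security(oofsSecurity *(*factory)());

// Client start-up code would then simply do:
//   oofs_Register_Security(mySecurityFactory);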
Supplying Performance Hints • Need additional information for optimum performance • Different from Objectivity clustering hints • Database clustering • Processing mode (sequential/random) • Desired service levels • Information is Objectivity independent • Need a mechanism to tunnel opaque information • Client supplies hints via oofs_set_info() call • Information relayed to AMS in a transparent way • AMS relays information to underlying file system via oofs()
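For illustration, a client might encode its hints as an opaque key=value string and hand it to oofs_set_info(); since AMS passes the data through untouched, only the underlying filesystem needs to understand the format. The string layout, helper name, and the call's signature are assumptions; only the oofs_set_info() name appears on the slide.

#include <string>
#include <sstream>

// Assumed prototype for the hint-tunneling call named on the slide.
extern int oofs_set_info(const char *dbname, const char *opaque_hints);

// Hypothetical helper: pack processing hints into one opaque string.
// AMS does not interpret it; only the underlying oofs/filesystem does.
static std::string packHints(const char *mode, int serviceLevel) {
    std::ostringstream s;
    s << "mode=" << mode << ";svclvl=" << serviceLevel;
    return s.str();
}

// A client expecting a long sequential scan at a modest service level
// might then issue:
//   oofs_set_info("bbsim.raw.DB", packHints("sequential", 2).c_str());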
Dealing With Latency • Hierarchical filesystems may have high latency bursts • Mounting a tape file • Need mechanism to notify client of expected delay • Prevents request timeout • Prevents retransmission storms • Also allows server to degrade gracefully • Can delay clients when overloaded • Defer Request Protocol • Certain oofs() requests can tell client of expected delay • For example, open() • Client waits indicated amount of time and tries again
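A minimal sketch of the client side of the Defer Request Protocol: when a reply carries an expected delay (say, while a tape is mounted to satisfy an open()), the client sleeps for that interval and retries instead of timing out and retransmitting. The reply encoding and the wait-time field are assumptions for illustration, not the wire format.

#include <unistd.h>   // sleep()

// Hypothetical reply a server might return for a deferrable request.
struct OofsReply {
    int status;        // 0 = done, DEFERRED = try again later
    int waitSeconds;   // server's estimate of the staging delay
};
static const int DEFERRED = 1;

// Retry loop: honor the server-supplied delay rather than timing out.
template <typename RequestFn>
int issueWithDefer(RequestFn sendRequest, int maxRetries = 10) {
    for (int attempt = 0; attempt < maxRetries; ++attempt) {
        OofsReply r = sendRequest();
        if (r.status != DEFERRED) return r.status;
        ::sleep(r.waitSeconds);   // e.g., tape mount in progress
    }
    return -1;   // give up after too many deferrals
}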
Balancing The Load I • Dynamically distributed databases • Single machine can’t manage over a terabyte of disk cache • No good way to statically partition the database • Dynamically varying database access paths • As load increases, add more copies • Copies accessed in parallel • As load decreases, remove copies to free up disk space • Objectivity catalog independence • Copies managed outside of Objectivity • Minimizes impact on administration
Balancing The Load II • Request Redirect Protocol • oofs() routines supply alternate AMS location • oofs routines responsible for update synchronization • Typically, read/only access provided on copies • Only one read/write copy conveniently supported • Client must declare intention to update prior to access • Lazy synchronization possible • Good mechanism for largely read/only databases • Load balancing provided by an AMS collective • Has one distinguished member recorded in the catalogue
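A hedged sketch of how the distinguished member of an AMS collective might answer a read/only open with a redirect: pick a lightly loaded replica host and tell the client to reissue the request there. The member selection policy, the reply layout, and the load metric are all illustrative assumptions, not the actual protocol.

#include <string>
#include <vector>
#include <limits>

// Hypothetical view of one collective member and its current load.
struct Member {
    std::string host;
    int         openFiles;   // crude load metric for illustration
};

struct RedirectReply {
    bool        redirected;
    std::string targetHost;  // where the client should reissue the open
};

// Distinguished member: send read/only opens to the least-loaded copy.
RedirectReply redirectOpen(const std::vector<Member> &members) {
    RedirectReply r = { false, "" };
    int best = std::numeric_limits<int>::max();
    for (size_t i = 0; i < members.size(); ++i) {
        if (members[i].openFiles < best) {
            best = members[i].openFiles;
            r.targetHost = members[i].host;
            r.redirected = true;
        }
    }
    return r;   // client then reconnects to r.targetHost and retries
}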
The AMS Collective • [Diagram: a client’s request to the distinguished member of an AMS collective is redirected to one of the other members; two collectives, each holding several AMS servers, are shown] • Collective members are effectively interchangeable
Overall Effects • Extensible AMS • Allows use of any type of filesystem via oofs layer • Generic Authentication Protocol • Allows proper client identification • Opaque Information Protocol • Allows passing of hints to improve filesystem performance • Defer Request Protocol • Accommodates hierarchical filesystems • Redirection Protocol • Accommodates terabyte+ filesystems • Provides for dynamic load balancing
Dynamic Load Balancing Hierarchical Secure AMS • [Diagram: the client dynamically selects among several AMS servers, each serving a vfs disk cache, with HPSS and Redwood tape drives providing the hierarchical store behind them]
Summary • AMS is capable of high performance • Ultimate performance limited by disk speeds • Should be able to deliver an average of 20 MB/Sec per disk • The oofs interface + other protocols greatly enhance performance, scalability, usability, and security • SLAC will be using this combination to store physics data • The BaBar experiment will produce a 2+ PB database over 10 years • 2,000,000,000,000,000 = 2×10^15 bytes ≈ 200,000 3590 tapes