Chapter 1.3: Data Models and DBMS Architecture. Title: Anatomy of a Database System. Authors: J. Hellerstein, M. Stonebraker. Pages: 43-95.
Chapter 1.3: Data Models and DBMS Architecture • Title: Anatomy of a Database System • Authors: J. Hellerstein, M. Stonebraker • Pages: 43-95
Anatomy of a Database System • Problem • Problem Statement • Why is this problem important? • Why is this problem hard? • Approaches • Approach description, key concepts • Contributions (novelty, improved) • Assumptions
Problem Statement – DBMS Architecture • Given • A data model • Platform, i.e. operating system, computer hardware architecture • Find • A DBMS architecture • A set of building-block components • Interactions among building blocks • Objectives • Efficiency, Scalability • Extensibility • Constraints • Relational Data Model
Why is this problem important? • Why review Relational DBMS architectural innovations? • Backbone of infrastructure applications • Banking, airline reservation, medical records, CRM, SCM, … • Well-understood point of reference for • New extensions and future revolutions • Architecture allows • Analysis of properties • Availability, fault-tolerance, reliability • Mapping of multiple views • User requirements to components - validation and acceptance tests • Software developers, maintainers, … • Software operational support group
Why is this problem hard? • Complexity • Mid-1970s – Efficient implementation of a Relational DBMS • Declarative Query Language • Logical and physical independence • Changes • Platforms evolve • Computer Hardware, Languages, Operating Systems • Storage: Tapes → Disks (1960s) → RAID (1990s) → SAN … • CPUs: Mainframe → Mini → Desktops → Multi-core CPUs (2000s) • … • Integrate many views • Enterprise – performance level, transaction reliability, … • Data Processing Needs – data warehouses, reports, OLTP, Web, … • …
Contributions, Validation Methodology • Contributions • A simple yet relatively comprehensive RDBMS architecture • Decomposition into 4 components • Identification of dependencies • Validation • Ability to explain academic and commercial RDBMSs • Expert opinion: the authors have architected multiple DBMSs
Proposed Approach • Four Components (Figure 1, p. 44) • A Process Manager • Query Processing Engine • Transactional Storage Subsystem • Shared Utilities, e.g. Disk space management • Interactions among components • Not explicit in Figure 1 • Implicit: • Top-left to lower-right flow
Component 1 – Process Manager • Responsibilities - Organization of processes • Platform: Uni-processor, High-performance OS threads • Two Options • Process per user (connection) • Issues - scalability • Server Process (+ I/O Process per disk) • Dispatcher thread, log manager thread • Pool of worker threads • Shared data (e.g. log, I/O buffer) in common heap space • Issues – asynchronous I/O, protection across threads, … • Client – Server communication • network socket • Q: What is new in this paper relative to the Parallel Database paper by DeWitt et al.?
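The server-process option (a dispatcher feeding a pool of worker threads that share heap data) can be sketched in a few lines. This is only an illustration under assumptions: `run_server`, the `(conn_id, payload)` request shape, and the upper-casing stand-in for query execution are hypothetical and do not come from the paper.

```python
import queue
import threading


def run_server(requests, num_workers=4):
    """Hand each (conn_id, payload) request to a fixed pool of worker threads."""
    work = queue.Queue()
    results = {}                      # shared data living in the common heap
    results_lock = threading.Lock()   # protection across threads

    def worker():
        while True:
            item = work.get()
            if item is None:          # sentinel: this worker should exit
                return
            conn_id, payload = item
            answer = payload.upper()  # hypothetical stand-in for running a query
            with results_lock:
                results[conn_id] = answer

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Dispatcher role: pass incoming requests to whichever worker is free.
    for item in requests:
        work.put(item)
    for _ in threads:
        work.put(None)                # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return results
```

The pool size stays fixed no matter how many connections arrive, which is the scalability argument against one process per connection.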
Component 1 – Issues • Mapping DBMS threads to OS Processes • Absence of OS threads – page 50 • Commercial examples – last para, sec. 2.2.1, page 51 • Parallelism (Figures 5-7, pp. 52-54) • Shared memory – previous architectures port easily • Shared nothing • Query processing parallelizes w/ horizontal data partitioning • Two-phase commit needs communication • Partial failure • Shared disk • Distributed lock manager, cache coherency protocol, … • Admission Control • Avoid thrashing (working set > memory buffers) • Control number of connections, number of queries
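Admission control as described above can be illustrated with a small gatekeeper that refuses a query when either the query count or the estimated memory demand would exceed a budget. This is a minimal sketch: the class name, the page-based estimate, and the reject-rather-than-queue policy are assumptions made for illustration, not the paper's design.

```python
import threading


class AdmissionController:
    """Admit a query only if it fits the query-count and buffer-page budgets."""

    def __init__(self, buffer_pages, max_queries):
        self.buffer_pages = buffer_pages    # total buffer pool size, in pages
        self.max_queries = max_queries      # cap on concurrently running queries
        self.reserved_pages = 0
        self.active_queries = 0
        self._mutex = threading.Lock()

    def try_admit(self, estimated_pages):
        """Return True and reserve memory, or return False to defer the query."""
        with self._mutex:
            too_many = self.active_queries >= self.max_queries
            too_big = self.reserved_pages + estimated_pages > self.buffer_pages
            if too_many or too_big:
                return False
            self.active_queries += 1
            self.reserved_pages += estimated_pages
            return True

    def finish(self, estimated_pages):
        """Give back the reservation made by a successful try_admit call."""
        with self._mutex:
            self.active_queries -= 1
            self.reserved_pages -= estimated_pages
```

A caller would retry or queue a rejected query; keeping the sum of reservations below the buffer size is what keeps the combined working set from exceeding memory.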
Component 2 – Query Processor • Responsibility: • SQL query → execution plan (Fig. 8, p. 64) • Subcomponents • Parsing and Authorization • Catalogs • Query rewrite – views, constant expressions, semantic optimization, sub-query flattening • Optimizer – plan space, selectivity estimation, search, parallelism, extensibility, auto-tuning, … • Executor – iterator model (Figure 9, p. 68) • Q: What is new in the optimizer since Selinger?
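The iterator model used by the executor can be sketched as a tree of operators that each expose the same two calls and pull rows from their child one at a time. The operator classes, the dict-based rows, and the convention of returning `None` at end of input are assumptions chosen for this illustration.

```python
class Scan:
    """Leaf operator: returns the rows of an in-memory table one by one."""

    def __init__(self, rows):
        self.rows = rows

    def init(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None               # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row


class Filter:
    """Passes through only the child rows that satisfy a predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def init(self):
        self.child.init()

    def get_next(self):
        row = self.child.get_next()
        while row is not None and not self.predicate(row):
            row = self.child.get_next()
        return row


class Project:
    """Keeps only the named columns of each child row."""

    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def init(self):
        self.child.init()

    def get_next(self):
        row = self.child.get_next()
        if row is None:
            return None
        return {c: row[c] for c in self.columns}


def run(plan):
    """Drive the plan from the root, pulling one row per get_next() call."""
    plan.init()
    out = []
    row = plan.get_next()
    while row is not None:
        out.append(row)
        row = plan.get_next()
    return out
```

Because every operator has the same interface, a plan such as `Project(Filter(Scan(rows), pred), ["name"])` composes freely, and no operator needs to know what kind of child it reads from.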
Component 2 – Query Processor Issues • Data Modification Statements • Plans are more complex • Ex. Halloween problem (Fig. 10, p. 71) • Access Methods • Unordered files, B+-tree, R-tree and bit-map indexes • API methods – init(), get_next(), … • Search by logical conditions (sarg) or record-id • Interacts with concurrency and recovery sub-components
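The Halloween problem can be reproduced with a toy example: an update that scans an index on the very column it modifies keeps re-encountering rows it has already changed. The sketch below is an assumption-laden illustration (a sorted Python list stands in for the index, a flat bonus stands in for the raise, and both function names are hypothetical); it contrasts the naive plan with one that first materializes the qualifying rows.

```python
import bisect


def naive_index_scan_raise(salaries, limit, bonus):
    """Scan a salary index while updating it; updated rows reappear ahead of the cursor."""
    index = sorted((s, emp) for emp, s in salaries.items())  # toy index on salary
    updates = 0
    pos = 0  # scan cursor
    while pos < len(index) and index[pos][0] < limit:
        salary, emp = index.pop(pos)            # remove the old index entry
        bisect.insort(index, (salary + bonus, emp))  # re-insert under the new key
        updates += 1
        # The cursor stays put: the entry after the removed one now sits at pos,
        # and the re-inserted entry lies at or beyond pos, so it is seen again.
    return {emp: s for s, emp in index}, updates


def two_phase_raise(salaries, limit, bonus):
    """Collect the qualifying rows first, then update each exactly once."""
    targets = [emp for emp, s in salaries.items() if s < limit]
    updated = dict(salaries)
    for emp in targets:
        updated[emp] += bonus
    return updated, len(targets)
```

With two employees below the limit, the naive plan raises each one repeatedly until nobody is left under the limit, while the two-phase plan performs exactly two updates. Separating the read phase from the write phase is one way such plans are made safe, at the cost of materializing the intermediate result.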
Component 3 – Transactional Storage Manager • Responsibilities – ACID properties • Subcomponents • Lock Manager • Serializability, 2PL, Isolation levels (p. 76) • Log Manager • WAL – 3 rules (p. 78), performance tuning • Buffer pool • Access methods • Latches in B+trees (p. 80) – conservative, latch-coupling, right-link • Predicate locks – next-key locking
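The write-ahead logging discipline can be sketched as follows. The slide does not spell the three rules out, so the sketch assumes their usual formulation: a log record reaches stable storage before the data page it describes, log records are flushed in order, and a commit returns only after its commit record is flushed. All class and method names are hypothetical, and "disk" is just a dictionary.

```python
class WriteAheadLog:
    def __init__(self):
        self.records = []       # log records in LSN order
        self.flushed_lsn = 0    # records with LSN <= flushed_lsn count as durable

    def append(self, txn, payload):
        self.records.append((txn, payload))
        return len(self.records)            # LSN = 1-based position in the log

    def flush(self, up_to_lsn):
        # In-order flushing: making LSN n durable also covers every earlier LSN.
        target = min(up_to_lsn, len(self.records))
        self.flushed_lsn = max(self.flushed_lsn, target)


class Store:
    def __init__(self):
        self.log = WriteAheadLog()
        self.pages = {}         # buffer pool: page_id -> (value, page_lsn)
        self.disk = {}          # stand-in for stable data pages

    def update(self, txn, page_id, value):
        lsn = self.log.append(txn, ("update", page_id, value))
        self.pages[page_id] = (value, lsn)  # dirty page remembers its latest LSN

    def flush_page(self, page_id):
        value, page_lsn = self.pages[page_id]
        self.log.flush(page_lsn)            # log first, then the data page
        self.disk[page_id] = value

    def commit(self, txn):
        lsn = self.log.append(txn, ("commit",))
        self.log.flush(lsn)                 # commit record durable before returning
```

Note that `commit` never forces data pages to disk; only the log must be durable at commit time, which is where much of the performance tuning around the log comes from.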
Component 3 – Transactional Storage Manager • Interdependencies among subcomponents • Lock Manager, Log Manager • WAL assumes strict 2PL (p. 82) • Q: What would happen without strict 2PL? • Concurrency control, Access Methods • Methods are unique to index types
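Strict 2PL, which the slide names as an assumption of WAL, means a transaction keeps every lock until it commits or aborts. A minimal sketch of such a lock table follows; it assumes only shared ("S") and exclusive ("X") modes and returns False instead of blocking, and the class and method names are hypothetical.

```python
class LockManager:
    """Toy lock table with shared/exclusive modes and release only at transaction end."""

    def __init__(self):
        self.holders = {}   # item -> {txn: mode}

    def acquire(self, txn, item, mode):
        """Grant the lock if compatible with other holders; never releases early."""
        held = self.holders.setdefault(item, {})
        other_modes = [m for t, m in held.items() if t != txn]
        if mode == "S":
            compatible = all(m == "S" for m in other_modes)
        else:  # "X" is compatible with no other holder
            compatible = not other_modes
        if not compatible:
            return False    # a real lock manager would queue the request
        if held.get(txn) != "X":
            held[txn] = mode    # new lock, or upgrade from S to X
        return True

    def release_all(self, txn):
        """Called once, at commit or abort: the 'strict' part of strict 2PL."""
        for held in self.holders.values():
            held.pop(txn, None)
```

Because there is no per-item unlock, no other transaction can read or overwrite an uncommitted update, so undoing an aborted transaction from the log cannot clobber a later writer's change. That gives one angle on the question the slide poses.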
Component 4 – Shared Utilities • Sub-components • Memory allocator (p. 84) • Disk management subsystem • Map tables to devices or files • New issues with RAIDs (pp. 86-87) • Replication services • Physical, trigger-based, log-based • Batch utilities • Optimizer statistics gathering, backup/export, physical reorganization and index construction
Summary • Paper’s focus • DBMS Architectures – components and dependencies • Insights - Four Components (Figure 1, p. 44) • A Process Manager • Query Processing Engine • Transactional Storage Subsystem • Shared Utilities, e.g. Disk space management • Interactions among components • Not explicit in Figure 1 • Q: List a few discussed in the paper!
Assumptions, Rewrite today • Assumptions • Focus on Relational DBMS • Centralized DBMS (Recall T2.6 on R*) • Four-component architecture reminds one of Ingres! • Lessons translate over to new domains • Rewrite today • Cover a post-relational DBMS, e.g. Stream or XML • Illustrate how lessons translate to web services, e-mail repositories, network monitors, etc.