350 likes | 450 Views
CS 405G: Introduction to Database Systems. 21 Storage Chen Qian University of Kentucky. Homeworks. 5 -> 4 To give you more time for the project. Review. Disk management Basic disk (physical) unit: page Basic DB unit: record (row of a table) How to connect page with record? File!.
E N D
CS 405G: Introduction to Database Systems 21 Storage Chen Qian University of Kentucky
Homeworks • 5 -> 4 • To give you more time for the project Chen Qian @ University of Kentucky
Review • Disk management • Basic disk (physical) unit: page • Basic DB unit: record (row of a table) • How to connect page with record? • File! Chen Qian @ University of Kentucky
A row in a table, when located on disks, is called • A record • Two types of record: • Fixed-length • Variable-length Chen Qian @ University of Kentucky
In an abstract sense, a file is • A set of “records” on a disk • In reality, a file is • A set of disk pages • Each record lives on • A page • Physical Record ID (RID) • A tuple of <page#, slot#> Chen Qian @ University of Kentucky
Files • Higher levels of DBMS operate on records, and files of records. • FILE: A collection of pages, containing a collection of records. • Record = row in a table • Must support: • insert/delete/modify record • fetch a particular record (specified using record id) • scan all records (possibly with some conditions on the records to be retrieved) Chen Qian @ University of Kentucky
Unordered (Heap) Files • Simplest file structure contains records in no particular order. • As file grows and shrinks, disk pages are allocated and de-allocated. • To support record level operations, we must: • keep track of the pages in a file • keep track of free spaceon pages • keep track of the records on a page • There are many alternatives for keeping track of this. • We’ll consider 2 Chen Qian @ University of Kentucky
Data Page Data Page Data Page Full Pages Header Page Data Page Data Page Data Page Pages with Free Space Heap File Implemented as a List • The header page id and Heap file name must be stored someplace. • Database “catalog” • Each page contains 2 `pointers’ plus data. Chen Qian @ University of Kentucky
Data Page 1 Header Page Data Page 2 Data Page N DIRECTORY Heap File Using a Page Directory • DBMS must remember where the first directory page is located. • The entry for a page can include the number of free bytes on the page. • The directory itself is a collection of pages and shown as a linked list. • Much smaller than linked list of all HF pages! Chen Qian @ University of Kentucky
Trade-off • Compared to linked list, the directory based approaches have: • Pros: Fast access to a particular record • Cons: Extra space Chen Qian @ University of Kentucky
Record layout Record = row in a table • Variable-format records • Rare in DBMS—table schema dictates the format • Relevant for semi-structured data such as XML • Focus on fixed-format records • With fixed-length fields only, or • With possible variable-length fields Chen Qian @ University of Kentucky
Record Formats: Fixed Length F3 F4 F1 F2 • All field lengths and offsets are constant • Computed from schema, stored in the system catalog • Finding i’th field done via arithmetic. L3 L4 L1 L2 Address = B+L1+L2 Base address (B) Chen Qian @ University of Kentucky
0 4 24 28 36 142 Bart(padded with space) 10 2.3 Fixed-length fields • Example: CREATE TABLE Student(SID INT, name CHAR(20), age INT, GPA FLOAT); • Watch out for alignment • May need to pad; reorder columns if that helps • What about NULL? • Add a bitmap at the beginning of the record Chen Qian @ University of Kentucky
Record Formats: Variable Length • Two alternative formats (# fields is fixed): F1 F2 F3 F4 $ $ $ $ Fields Delimited by Special Symbols F1 F2 F3 F4 Array of Field Offsets • Second offers direct access to i’th field, efficient storage • of nulls; extra but small directory overhead. Chen Qian @ University of Kentucky
Trade-off • Compared to the 1st formats, the offset based format has: • Pros: Fast access to a particular field; efficient storage of nulls • Cons: Extra space Chen Qian @ University of Kentucky
Large Object (LOB) fields • Example: CREATE TABLE Student(SID INT, name CHAR(20), age INT, GPA FLOAT, picture BLOB(32000)); • Student records get “de-clustered” • Bad because most queries do not involve picture • Decomposition (automatically done by DBMS and transparent to the user) • Student(SID, name, age, GPA) • StudentPicture(SID, picture) Chen Qian @ University of Kentucky
Page layout How do you organize records in a page? • Fixed length records • Variable length records • NSM (N-ary Storage Model) is used in most commercial DBMS Chen Qian @ University of Kentucky
Page Formats: Fixed Length Records • Record id = <page id, slot #>. In first alternative, moving records for free space management changes rid; may not be acceptable. Slot 1 Slot 1 Slot 2 Slot 2 Free Space . . . . . . Slot N Slot N Slot M . . . N 1 1 1 M 0 M ... 3 2 1 number of slots number of records PACKED UNPACKED, BITMAP Chen Qian @ University of Kentucky
Trade-off • Compared to the packed formats, the unpacked format has: • Pros: No need to move records into vacated slots after delete. No need to change the slot number. • Cons: To insert a record, one needs to find an empty slot, by scanning the bitmap. Chen Qian @ University of Kentucky
142 Bart 10 2.3 123 Milhouse 10 3.1 857 Lisa 8 4.3 456 Ralph 8 2.3 Variable-Length Records • Store records from the beginning of each page • Use a directory at the end of each page • To locate records and manage free space • Necessary for variable-length records Why store data and directoryat two different ends? Both can grow easily Chen Qian @ University of Kentucky
Options • Reorganize after every update/delete to avoid fragmentation (gaps between records) • Need to rewrite half of the block on average • What if records are fixed-length? • Reorganize after delete • Only need to move one record • Need a pointer to the beginning of free space • Do not reorganize after update • Need a bitmap indicating which slots are in use Chen Qian @ University of Kentucky
System Catalogs • For each relation: • name, file location, file structure (e.g., Heap file) • attribute name and type, for each attribute • index name, for each index • integrity constraints • For each index: • structure (e.g., B+ tree) and search key fields • For each view: • view name and definition • Plus statistics, authorization, buffer pool size, etc. Catalogs are themselves stored as relations! Chen Qian @ University of Kentucky
Indexes • A Heap file allows us to retrieve records: • by specifying the rid, or • by scanning all records sequentially • Sometimes, we want to retrieve records by specifying the values in one or more fields, e.g., • Find all students in the “CS” department • Find all students with a gpa > 3 • Indexes are file structures that enable us to answer such value-based queries efficiently. Chen Qian @ University of Kentucky
database indexing Search Basics • Given a value, locate the record(s) with this value SELECT * FROM R WHERE A = value; SELECT * FROM R, S WHERE R.A = S.B; • Other search criteria, e.g. • Range search SELECT * FROM R WHERE A > value; • Keyword search Chen Qian @ University of Kentucky
Dense and sparse indexes • Dense: one index entry for each search key value • Sparse: one index entry for each block • Records must be clustered according to the search key Chen Qian @ University of Kentucky
Dense versus sparse indexes • Index size • Sparse index is smaller • Requirement on records • Records must be clustered for sparse index • Lookup • Sparse index is smaller and may fit in memory • Dense index can directly tell if a record exists • Update • Easier for sparse index Chen Qian @ University of Kentucky
Primary and secondary indexes • Primary index • Created for the primary key of a table • Records are usually clustered according to the primary key • Can be sparse • Secondary index • Usually dense • SQL • PRIMARY KEY declaration automatically creates a primary index, UNIQUE key automatically creates a secondary index • Additional secondary index can be created on non-key attribute(s)CREATE INDEX StudentGPAIndex ON Student(GPA); Chen Qian @ University of Kentucky
Tree-Structured Indexes: Introduction • Tree-structured indexing techniques support both range selections and equality selections. • ISAM =IndexedSequentialAccessMethod • static structure; early index technology. • B+ tree: dynamic, adjusts gracefully under inserts and deletes. Chen Qian @ University of Kentucky
index entry P K P P K P K m 2 0 1 m 1 2 Index File kN k2 k1 Motivation for Index • ``Find all students with gpa > 3.0’’ • If data file is sorted, do binary search • Cost of binary search in a database can be quite high, Why? • Simple idea: Create an `index’ file. Data File Page N Page 3 Page 1 Page 2 Can do binary search on (smaller) index file! Chen Qian @ University of Kentucky
100, 200, …, 901 … Index blocks 100, 123, …, 192 200, … 901, …, 996 100, 108,119, 121 123, 129,… 192, 197,… 200, 202,… 901, 907,… 996, 997,… … … … Data blocks ISAM • What if an index is still too big? • Put a another (sparse) index on top of that! • ISAM (Index Sequential Access Method), more or less Example: look up 197 Chen Qian @ University of Kentucky
How many steps? • For a balanced tree of N pages, the num of search steps is O(logN) • Less than a constant times logN Chen Qian @ University of Kentucky
100, 200, …, 901 107 Overflow block … Index blocks 100, 123, …, 192 200, … 901, …, 996 100, 108,119, 121 123, 129,… 192, 197,… 200, 202,… 901, 907,… 996, 997,… … … … Data blocks Updates with ISAM • Overflow chains and empty data blocks degrade performance • Worst case: most records go into one long chain Example: insert 107 Example: delete 129 Chen Qian @ University of Kentucky
How many steps? • For a chain of N pages, the worst-case num of search steps is N • Much larger than O(logN)! Chen Qian @ University of Kentucky
A Note of Caution • ISAM is an old-fashioned idea • B+-trees are usually better, as we’ll see • But, ISAM is a good place to start to understand the idea of indexing • Upshot • Don’t brag about being an ISAM expert on your resume • Do understand how they work, and tradeoffs with B+-trees Chen Qian @ University of Kentucky
Summary (Contd.) • DBMS vs. OS File Support • DBMS needs features not found in many OSes, e.g., forcing a page to disk, controlling the order of page writes to disk, files spanning disks, ability to control pre-fetching and page replacement policy based on predictable access patterns, etc. • Variable length record format with field offset directory offers support for direct access to ith field and null values. Chen Qian @ University of Kentucky