CS 222/CS122C: Principles of Data Management Lecture #4 Catalogs, File Organizations

CS222/CS122C:Principles of Data ManagementLecture #4Catalogs, File Organizations Instructor: Chen Li

Buffer management: DBMS vs. OS File System OS does disk space & buffer management – so why not let the OS manage these tasks…? • Differences in OS support: portability issues • Some limitations, e.g., files can’t span disks. • Buffer management in DBMS requires ability to: • pin a page in buffer pool, force a page to disk (important for implementing CC & recovery), and • adjust replacement policy, and prefetch pages based on access patterns in typical DB operations.

Topic: System Catalogs • For each relation: • name, file name, file structure (e.g., Heap) • name, type, and length (if fixed) for each attribute • index name, target, and kind for each index • also integrity constraints, defaults, nullability, etc. • For each index: • structure (e.g., B+ tree) and search key fields • For each view: • view name and definition (including query) • Plus statistics, authorization, buffer pool size, etc. * Catalogs themselves stored as record-based files too!

Attr_Cat(attr_name, rel_name, type, position)

Files of Records: Basic Summary • Disks provide cheap, non-volatile storage. • Random access, but cost depends on location of page on disk; important to arrange data sequentially to minimize seek and rotation delays. • Buffer manager brings pages into RAM. • Page stays in RAM (at least!) until unpinned by last among concurrent requestors. • Written to disk when frame chosen for replacement (some time after dirtying requestor unpins the page). • Choice of frame to replace based on replacement policy. • Could be worth prefetching several pages at a time.

Summary (Contd.) • DBMS vs. OS File Support • DBMS needs features not found in many OS’s, such as forcing a page to disk, controlling the order of the page writes to disk, letting files span disks, controlling prefetching and page replacement policies based on (predictable) DB access patterns, etc. • Variable length record format with field offset directory offers support for direct access to the i'th field and also supports null values. • Slotted page format supports variable length records and allows records to move in a page.

Summary (Contd.) • File layer keeps track of pages in a file and supports abstraction of a collection of records. • Pages with free space identified using linked list or directory structure (similar to how pages in file itself are tracked; may be integrated with that). • Indexes support efficient retrieval of records by mapping from values in fields to rids. • Catalog relations store information about relations, indexes and views. (Information common to all records in a given collection.)

Next topic: File Organizations Many alternatives exist. Each one is ideal for some situations, but not so good in others: • Heap (random ordered) files:Suitable when typical access is a file scan retrieving all record or access comes through a variety of secondary indexes. • Sorted Files:Best if records must be retrieved in some order, or only a `range’ of records is needed. • Indexes: Data structures to organize records via trees or hashing. • Like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields. • Updates are much faster than in sorted files.

Cost Model We will ignore CPU costs, for simplicity, so: • B: The number of data pages • R: Number of records per page • D: (Average) time to read or write disk page • Counting the number of page I/Os ignores gains of prefetching a sequence of pages; thus, even the real I/O cost is only roughly approximated for now. • Average-case analysis; based on several simplistic assumptions. • * Good enough to convey the overall trends!

Comparison of File Organizations • Heap files (random order; insert at eof) • Sorted files, sorted on <age, sal>

Operations to Compare • Scan: Fetch all records from disk • Equality search • Range selection • Insert a record • Delete a record

Assumptions for Our Analysis • Heap Files: • Equality selection on key; exactly one match. • Sorted Files: • File compacted after a deletion (vs. a deleted bit).

Cost of Operations * Several assumptions underlie these (rough) estimates!

CS 222/CS122C: Principles of Data Management Lecture #4 Catalogs, File Organizations