750 likes | 1.09k Views
Principles of Database Management Systems. Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom). Topic for today. How to represent data on disk and in memory. "Executive Summary":.
E N D
Principles of Database Management Systems Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom) Notes 3: Representing Data
Topic for today • How to represent data on disk and in memory "Executive Summary": • Principles are rather simple, but there are lots of variations in the details Notes 3: Representing Data
Principles of Data Layout • Attributes of relational tuples (or objects) represented by sequences of bytes called fields • Fields grouped together into records • representation of tuples or objects • Records stored in blocks • File: collection of blocks that forms a relation (or the extent of an object class) Notes 3: Representing Data
here Overview Data Items Records Blocks Files Memory Notes 3: Representing Data
What we have available: Bytes 8 bits What are the data items we want to store? • a salary, a name • a date, a picture • … Notes 3: Representing Data
To represent: • Integer (short): 2 bytes ( -32000…+32000) e.g., 35 is 00000000 00100011 • Integer (long): 4 bytes ( -2x109…+ 2x109) • Real, floating point (SQL FLOAT) 4 or 8 bytes • arithmetic interpretation by hardware Notes 3: Representing Data
To represent: • Characters various coding schemes suggested, most popular is ASCII Example (8 bit ASCII): A: 01000001 a: 01100001 5: 00110101 LF: 00001010 Notes 3: Representing Data
Can we use less than 1 byte/code? Yes, but only if desperate... To represent: 1111 1111 • Boolean e.g., TRUE FALSE 00000000 • Application specific • e.g., RED 1 GREEN 2 • BLUE 3 YELLOW 4 … Notes 3: Representing Data
To represent: • Dates, e.g.: • Integer: # days since Jan 1, 1900 • 8 chars: YYYYMMDD • 7 chars: YYYYDDD • 10 chars: YYYY-MM-DD (SQL2) (not YYMMDD! Why?) • Time, e.g. • Integer: seconds since midnight • chars: HH:MM:SS[.FF…] (SQL2) Notes 3: Representing Data
c c a a t t To represent: • String of characters • Null terminated e.g., • Length given e.g., - Fixed length 3 Notes 3: Representing Data
To represent: • Bag of bits Length Bits Notes 3: Representing Data
Key Point • Fixed length items • Variable length items • - usually length given at beginning Notes 3: Representing Data
Also • Type of an item: Tells us how to interpret (plus size if fixed) • normally indicated in the record schema (see in a moment) Notes 3: Representing Data
Record - Collection of related fields E.g.: Employee record: name field, salary field, date-of-hire field, ... Notes 3: Representing Data
Types of records: • Main choices: • FIXED vs VARIABLE FORMAT • FIXED vs VARIABLE LENGTH Notes 3: Representing Data
Fixed format A SCHEMA (not record) contains following information - number of fields - type of each field - order in record - meaning of each field Notes 3: Representing Data
Example: fixed format and length Employee record (1) E#, 2 byte integer (2) E.name, 10 char. Schema (3) Dept, 2 byte code 46 F o r d 02 Records 83 J o n e s 01 Notes 3: Representing Data
Variable format • Record itself contains format; “Self Describing” Notes 3: Representing Data
2 5 I 46 4 S 4 F o r d Example: variable format and length # Fields Code identifying field as E# Integer type Code for Ename String type Length of str. Field name codes could also be strings, i.e. tags ( XML as a data interchange format) Notes 3: Representing Data
Variable format useful for: • “sparse” records • e.g, patient records with thousands of possible tests • repeating fields • information integration from heterogeneous sources Notes 3: Representing Data
EXAMPLE: var format record with repeating fields Employee one or more children 3 E_name: Fred Child: Sally Child: Tom Notes 3: Representing Data
Note: Repeating fields does not imply - variable format, nor - variable size Fred Sally Tom -- • Key is to allocate maximum number of repeating fields (if not used NULL) Notes 3: Representing Data
Many variants between fixed and variable format: Ex. #1: Include record type in record record type record length tells me what to expect (i.e. points to schema) 5 27 . . . . Notes 3: Representing Data
Record header - data at beginning that describes record May contain: - indication of record type - record length - time stamp - other stuff ... Notes 3: Representing Data
E.g.: All employees have E#, name, dept; Other fields vary. 25 Smith Toy 2 Hobby:chess retired # of var fields Ex #2of variant between FIXED/VAR format • Hybrid format • one part is fixed, other variable Notes 3: Representing Data
Also, many variations in internal organization of record Just to show one: length of field * * * 3 10 F1 5 F2 12 F3 total size 3 32 5 15 20 F1 F2 F3 0 1 2 3 4 5 15 20 32 offsets Notes 3: Representing Data
assume fixed length blocks assume a single relation (for now) Next: placing records into blocks blocks ... a file Notes 3: Representing Data
Issues in storing records in blocks: (1) separating records (2) spanned vs. unspanned (3) mixed record types – clustering (4) split records (5) sequencing (6) addressing records Notes 3: Representing Data
(1) Separating records R1 R2 R3 Block (a) fixed size recs. -> no need to separate (b) special marker (c) give record lengths (or offsets) - within each record - in block header (see later) Notes 3: Representing Data
R1 R2 R3 R4 R5 R1 R2 R3 (a) R3 (b) R4 R5 R6 R7 (a) (2) Spanned vs. Unspanned • Unspanned: records are within one block block 1 block 2 ... • Spanned: records span block boundaries block 1 block 2 ... Notes 3: Representing Data
R1 R2 R3 (a) R3 (b) R4 R5 R6 R7 (a) With spanned records: need indication need indication of partial record of continuation “pointer” to rest (+ from where?) Notes 3: Representing Data
Spanned vs. unspanned: • Unspanned is much simpler, but may waste space… • Spanned necessary if record size > block size (e.g., fields containing large "BLOB"s for, say, MPEG video clips) Notes 3: Representing Data
block 1 block 2 2050 bytes wasted 2046 2050 bytes wasted 2046 R1 R2 Example (of unspanned records) 106 records each of size 2,050 bytes (fixed) block size = 4096 bytes • Space used about 4 x 109 B, about half wasted Notes 3: Representing Data
(3) Mixed record types • Mixed - records of different types (e.g. DEPT, EMPLOYEE) allowed in same block e.g., a block: Dep d1 Emp e1 Emp e2 Notes 3: Representing Data
Why do we want to mix? Answer: CLUSTERING Records that are frequently accessed together should be in the same block Notes 3: Representing Data
Example Q1: select DEPT.Name, EMP.Name, … from DEPT, EMP where DEPT. Name = EMP.DeptName a block DEPT,Name=Toy, ... EMP,DeptName=Toy, ... EMP,DeptName=Toy, ... … Notes 3: Representing Data
If Q1 frequent, clustering is good • But consider Q2: SELECT * FROM DEPT If Q2 is frequent, clustering is counter- productive Notes 3: Representing Data
(4) Split records Fixed part in one block Typically for hybrid format Variable part in another block Notes 3: Representing Data
R2 (a) R2 (b) R2 (c) Blocks with variable recs. Block with fixed recs. R1 (a) R1 (b) Notes 3: Representing Data
(5) Sequencing • Ordering records in file (and block) by some key value Sequential file ( sequenced) Notes 3: Representing Data
Why sequencing? Typically to make it possible to efficiently read records in order (e.g., to do a merge-join — discussed later) Notes 3: Representing Data
Sequencing Options (a) Next record physically contiguous ... (b) Linked R1 Next (R1) R1 Next (R1) Notes 3: Representing Data
header R2.1 R1.3 R4.7 Sequencing Options (c) Overflow area Records in sequence R1 R2 R3 R4 R5 Notes 3: Representing Data
Many options: Physical Indirect (6) Addressing records • How does one refer to records? Rx Notes 3: Representing Data
Purely Physical Device ID E.g., Record Cylinder # Address = Track # or ID Block # Offset in block Block ID Notes 3: Representing Data
Fully Indirect E.g., Record ID is arbitrary byte string (say, an object ID) map table rec ID r address a Rec ID Physical addr. Notes 3: Representing Data
Tradeoff Flexibility Cost to move records of indirection (for deletions, insertions) Notes 3: Representing Data
Physical Indirect Many options in between … Notes 3: Representing Data
Ex #1 Indirection in block Header A block: Free space R4 R3 R2 R1 Notes 3: Representing Data
Block header - data at beginning that describes block May contain: - File ID (or RELATION or DB ID); the ID of this block - Record directory; Pointer to free space - Type of block (e.g. contains records of type 4; is overflow, …) - Pointer to other blocks “like it” (say, if part of an index structure) - Timestamp ... Notes 3: Representing Data