1 / 72

Principles of Database Management Systems

Principles of Database Management Systems. Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom). Topic for today. How to represent data on disk and in memory. "Executive Summary":.

ashtyn
Download Presentation

Principles of Database Management Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Principles of Database Management Systems Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom) Notes 3: Representing Data

  2. Topic for today • How to represent data on disk and in memory "Executive Summary": • Principles are rather simple, but there are lots of variations in the details Notes 3: Representing Data

  3. Principles of Data Layout • Attributes of relational tuples (or objects) represented by sequences of bytes called fields • Fields grouped together into records • representation of tuples or objects • Records stored in blocks • File: collection of blocks that forms a relation (or the extent of an object class) Notes 3: Representing Data

  4. here Overview Data Items Records Blocks Files Memory Notes 3: Representing Data

  5. What we have available: Bytes 8 bits What are the data items we want to store? • a salary, a name • a date, a picture • … Notes 3: Representing Data

  6. To represent: • Integer (short): 2 bytes ( -32000…+32000) e.g., 35 is 00000000 00100011 • Integer (long): 4 bytes ( -2x109…+ 2x109) • Real, floating point (SQL FLOAT) 4 or 8 bytes • arithmetic interpretation by hardware Notes 3: Representing Data

  7. To represent: • Characters various coding schemes suggested, most popular is ASCII Example (8 bit ASCII): A: 01000001 a: 01100001 5: 00110101 LF: 00001010 Notes 3: Representing Data

  8. Can we use less than 1 byte/code? Yes, but only if desperate... To represent: 1111 1111 • Boolean e.g., TRUE FALSE 00000000 • Application specific • e.g., RED  1 GREEN  2 • BLUE  3 YELLOW  4 … Notes 3: Representing Data

  9. To represent: • Dates, e.g.: • Integer: # days since Jan 1, 1900 • 8 chars: YYYYMMDD • 7 chars: YYYYDDD • 10 chars: YYYY-MM-DD (SQL2) (not YYMMDD! Why?) • Time, e.g. • Integer: seconds since midnight • chars: HH:MM:SS[.FF…] (SQL2) Notes 3: Representing Data

  10. c c a a t t To represent: • String of characters • Null terminated e.g., • Length given e.g., - Fixed length 3 Notes 3: Representing Data

  11. To represent: • Bag of bits Length Bits Notes 3: Representing Data

  12. Key Point • Fixed length items • Variable length items • - usually length given at beginning Notes 3: Representing Data

  13. Also • Type of an item: Tells us how to interpret (plus size if fixed) • normally indicated in the record schema (see in a moment) Notes 3: Representing Data

  14. Record - Collection of related fields E.g.: Employee record: name field, salary field, date-of-hire field, ... Notes 3: Representing Data

  15. Types of records: • Main choices: • FIXED vs VARIABLE FORMAT • FIXED vs VARIABLE LENGTH Notes 3: Representing Data

  16. Fixed format A SCHEMA (not record) contains following information - number of fields - type of each field - order in record - meaning of each field Notes 3: Representing Data

  17. Example: fixed format and length Employee record (1) E#, 2 byte integer (2) E.name, 10 char. Schema (3) Dept, 2 byte code 46 F o r d 02 Records 83 J o n e s 01 Notes 3: Representing Data

  18. Variable format • Record itself contains format; “Self Describing” Notes 3: Representing Data

  19. 2 5 I 46 4 S 4 F o r d Example: variable format and length # Fields Code identifying field as E# Integer type Code for Ename String type Length of str. Field name codes could also be strings, i.e. tags ( XML as a data interchange format) Notes 3: Representing Data

  20. Variable format useful for: • “sparse” records • e.g, patient records with thousands of possible tests • repeating fields • information integration from heterogeneous sources Notes 3: Representing Data

  21. EXAMPLE: var format record with repeating fields Employee  one or more children 3 E_name: Fred Child: Sally Child: Tom Notes 3: Representing Data

  22. Note: Repeating fields does not imply - variable format, nor - variable size Fred Sally Tom -- • Key is to allocate maximum number of repeating fields (if not used  NULL) Notes 3: Representing Data

  23. Many variants between fixed and variable format: Ex. #1: Include record type in record record type record length tells me what to expect (i.e. points to schema) 5 27 . . . . Notes 3: Representing Data

  24. Record header - data at beginning that describes record May contain: - indication of record type - record length - time stamp - other stuff ... Notes 3: Representing Data

  25. E.g.: All employees have E#, name, dept; Other fields vary. 25 Smith Toy 2 Hobby:chess retired # of var fields Ex #2of variant between FIXED/VAR format • Hybrid format • one part is fixed, other variable Notes 3: Representing Data

  26. Also, many variations in internal organization of record Just to show one: length of field * * * 3 10 F1 5 F2 12 F3 total size 3 32 5 15 20 F1 F2 F3 0 1 2 3 4 5 15 20 32 offsets Notes 3: Representing Data

  27. assume fixed length blocks assume a single relation (for now) Next: placing records into blocks blocks ... a file Notes 3: Representing Data

  28. Issues in storing records in blocks: (1) separating records (2) spanned vs. unspanned (3) mixed record types – clustering (4) split records (5) sequencing (6) addressing records Notes 3: Representing Data

  29. (1) Separating records R1 R2 R3 Block (a) fixed size recs. -> no need to separate (b) special marker (c) give record lengths (or offsets) - within each record - in block header (see later) Notes 3: Representing Data

  30. R1 R2 R3 R4 R5 R1 R2 R3 (a) R3 (b) R4 R5 R6 R7 (a) (2) Spanned vs. Unspanned • Unspanned: records are within one block block 1 block 2 ... • Spanned: records span block boundaries block 1 block 2 ... Notes 3: Representing Data

  31. R1 R2 R3 (a) R3 (b) R4 R5 R6 R7 (a) With spanned records: need indication need indication of partial record of continuation “pointer” to rest (+ from where?) Notes 3: Representing Data

  32. Spanned vs. unspanned: • Unspanned is much simpler, but may waste space… • Spanned necessary if record size > block size (e.g., fields containing large "BLOB"s for, say, MPEG video clips) Notes 3: Representing Data

  33. block 1 block 2 2050 bytes wasted 2046 2050 bytes wasted 2046 R1 R2 Example (of unspanned records) 106 records each of size 2,050 bytes (fixed) block size = 4096 bytes • Space used about 4 x 109 B, about half wasted Notes 3: Representing Data

  34. (3) Mixed record types • Mixed - records of different types (e.g. DEPT, EMPLOYEE) allowed in same block e.g., a block: Dep d1 Emp e1 Emp e2 Notes 3: Representing Data

  35. Why do we want to mix? Answer: CLUSTERING Records that are frequently accessed together should be in the same block Notes 3: Representing Data

  36. Example Q1: select DEPT.Name, EMP.Name, … from DEPT, EMP where DEPT. Name = EMP.DeptName a block DEPT,Name=Toy, ... EMP,DeptName=Toy, ... EMP,DeptName=Toy, ... … Notes 3: Representing Data

  37. If Q1 frequent, clustering is good • But consider Q2: SELECT * FROM DEPT If Q2 is frequent, clustering is counter- productive Notes 3: Representing Data

  38. (4) Split records Fixed part in one block Typically for hybrid format Variable part in another block Notes 3: Representing Data

  39. R2 (a) R2 (b) R2 (c) Blocks with variable recs. Block with fixed recs. R1 (a) R1 (b) Notes 3: Representing Data

  40. (5) Sequencing • Ordering records in file (and block) by some key value Sequential file (  sequenced) Notes 3: Representing Data

  41. Why sequencing? Typically to make it possible to efficiently read records in order (e.g., to do a merge-join — discussed later) Notes 3: Representing Data

  42. Sequencing Options (a) Next record physically contiguous ... (b) Linked R1 Next (R1) R1 Next (R1) Notes 3: Representing Data

  43. header R2.1 R1.3 R4.7 Sequencing Options (c) Overflow area Records in sequence R1 R2 R3 R4 R5 Notes 3: Representing Data

  44. Many options: Physical Indirect (6) Addressing records • How does one refer to records? Rx Notes 3: Representing Data

  45. Purely Physical Device ID E.g., Record Cylinder # Address = Track # or ID Block # Offset in block Block ID Notes 3: Representing Data

  46. Fully Indirect E.g., Record ID is arbitrary byte string (say, an object ID) map table rec ID r address a Rec ID Physical addr. Notes 3: Representing Data

  47. Tradeoff Flexibility Cost to move records of indirection (for deletions, insertions) Notes 3: Representing Data

  48. Physical Indirect Many options in between … Notes 3: Representing Data

  49. Ex #1 Indirection in block Header A block: Free space R4 R3 R2 R1 Notes 3: Representing Data

  50. Block header - data at beginning that describes block May contain: - File ID (or RELATION or DB ID); the ID of this block - Record directory; Pointer to free space - Type of block (e.g. contains records of type 4; is overflow, …) - Pointer to other blocks “like it” (say, if part of an index structure) - Timestamp ... Notes 3: Representing Data

More Related