1 / 91

CS 245: Database System Principles Notes 03: Disk Organization

CS 245: Database System Principles Notes 03: Disk Organization. Hector Garcia-Molina. Topics for today. How to lay out data on disk How to move it to memory. What we have available: Bytes. 8 bits. What are the data items we want to store?. a salary a name a date a picture.

Download Presentation

CS 245: Database System Principles Notes 03: Disk Organization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 245: Database System PrinciplesNotes 03: Disk Organization Hector Garcia-Molina Notes 3

  2. Topics for today • How to lay out data on disk • How to move it to memory Notes 3

  3. What we have available: Bytes 8 bits What are the data items we want to store? • a salary • a name • a date • a picture Notes 3

  4. To represent: • Integer (short): 2 bytes e.g., 35 is 00000000 00100011 • Real, floating point • n bits for mantissa, m for exponent…. Notes 3

  5. To represent: • Characters  various coding schemes suggested, most popular is ascii Example: A: 1000001 a: 1100001 5: 0110101 LF: 0001010 Notes 3

  6. Can we use less than 1 byte/code? Yes, but only if desperate... To represent: 1111 1111 • Boolean e.g., TRUE FALSE 00000000 • Application specific • e.g., RED  1 GREEN  3 • BLUE  2 YELLOW  4 … Notes 3

  7. To represent: • Dates e.g.: - Integer, # days since Jan 1, 1900 - 8 characters, YYYYMMDD - 7 characters, YYYYDDD (not YYMMDD! Why?) • Time e.g. - Integer, seconds since midnight - characters, HHMMSSFF Notes 3

  8. c c a a t t To represent: • String of characters • Null terminated e.g., • Length given e.g., - Fixed length 3 Notes 3

  9. To represent: • Bag of bits Length Bits Notes 3

  10. Key Point • Fixed length items • Variable length items • - usually length given at beginning Notes 3

  11. Also • Type of an item: Tells us how to interpret (plus size if fixed) Notes 3

  12. Overview Data Items Records Blocks Files Memory Notes 3

  13. Record - Collection of related data items (called FIELDS) E.g.: Employee record: name field, salary field, date-of-hire field, ... Notes 3

  14. Types of records: • Main choices: • FIXED vs VARIABLE FORMAT • FIXED vs VARIABLE LENGTH Notes 3

  15. Fixed format A SCHEMA (not record) contains following information - # fields - type of each field - order in record - meaning of each field Notes 3

  16. Example: fixed format and length Employee record (1) E#, 2 byte integer (2) E.name, 10 char. Schema (3) Dept, 2 byte code 55 s m i t h 02 Records 83 j o n e s 01 Notes 3

  17. Variable format • Record itself contains format “Self Describing” Notes 3

  18. 2 5 I 46 4 S 4 F O R D Example: variable format and length # Fields Code identifying field as E# Integer type Code for Ename String type Length of str. Field name codes could also be strings, i.e. TAGS Notes 3

  19. Variable format useful for: • “sparse” records • repeating fields • evolving formats But may waste space... Notes 3

  20. EXAMPLE: var format record with repeating fields Employee  one or more  children 3 E_name: Fred Child: Sally Child: Tom Notes 3

  21. Note: Repeating fields does not imply - variable format, nor - variable size John Sailing Chess -- • Key is to allocate maximum number of repeating fields (if not used  null) Notes 3

  22. Many variants between fixed - variable format: Ex. #1: Include record type in record record type record length tells me what to expect (i.e. points to schema) 5 27 . . . . Notes 3

  23. Record header - data at beginning that describes record May contain: - record type - record length - time stamp - other stuff ... Notes 3

  24. E.g.: All employees have E#, name, dept other fields vary. 25 Smith Toy 2 Hobby:chess retired # of var fields Ex #2of variant between FIXED/VAR format • Hybrid format • one part is fixed, other variable Notes 3

  25. Also, many variations in internal organization of record Just to show one: length of field * * * 3 10 F1 5 F2 12 F3 total size 3 32 5 15 20 F1 F2 F3 0 1 2 3 4 5 15 20 offsets Notes 3

  26. Question: We have seen examples for * Fixed format and length records * Variable format and length records (a) Does fixed format and variable length make sense? (b) Does variable format and fixed length make sense? Notes 3

  27. Other interesting issues: • Compression • within record - e.g. code selection • collection of records - e.g. find common patterns • Encryption Notes 3

  28. Encrypting Records new record r E(r) trusted processor dbms E(r1) E(r2) E(r3) E(r4) ... Notes 3

  29. Encrypting Records search F(r)=x ?? trusted processor dbms E(r1) E(r2) E(r3) E(r4) ... Notes 3

  30. Search Key in the Clear search k=2 Q: k=2 A: [2, E(b2)] trusted processor dbms [1, E(b1)] [2, E(b2)] [3, E(b3)] [4, E(b4)] ... • each record is [k,b] • store [k, E(b)] • can search for records with k=x Notes 3

  31. Encrypt Key search k=2 Q: k’=E(2) A: [E(2), E(b2)] trusted processor dbms [E(1), E(b1)] [E(2), E(b2)] [E(3), E(b3)] [E(4), E(b4)] ... • each record is [k,b] • store [E(k), E(b)] • can search for records with k=E(x) Notes 3

  32. Issues • Hard to do range queries • Encryption not good • Better to use encryption that does not always generate same cyphertext k k E(k, random) E D simplification Notes 3

  33. How Do We Search Now? ??? search k=2 A: [E(2,dhe), E(b2)] [E(2, lkz), E(b4)] Q: k’=E(2) trusted processor dbms [E(1, abc), E(b1)] [E(2, dhe), E(b2)] [E(3, nft), E(b3)] [E(2, lkz), E(b4)] ... • each record is [k,b] • store [E(k, rand), E(b)] • can search for records with k=E(x,???)? Notes 3

  34. Solution? • Develop new decryption function: D(f(k1), E(k2, rand)) is true if k1=k2 Notes 3

  35. Solution? • Develop new decryption function: D(f(k1), E(k2, rand)) is true if k1=k2 Q: check if D(f(2),*) true search k=2 A: [E(2,dhe), E(b2)] [E(2, lkz), E(b4)] trusted processor dbms [E(1, abc), E(b1)] [E(2, dhe), E(b2)] [E(3, nft), E(b3)] [E(2, lkz), E(b4)] ... Notes 3

  36. Issues? • Cannot do non-equality predicates • Hard to build indexes Notes 3

  37. assume fixed length blocks assume a single file (for now) Next: placing records into blocks blocks ... a file Notes 3

  38. Options for storing records in blocks: (1) separating records (2) spanned vs. unspanned (3) mixed record types – clustering (4) split records (5) sequencing (6) indirection Notes 3

  39. (1) Separating records R1 R2 R3 Block (a) no need to separate - fixed size recs. (b) special marker (c) give record lengths (or offsets) - within each record - in block header Notes 3

  40. (2) Spanned vs. Unspanned • Unspanned: records must be within one block block 1 block 2 ... • Spanned block 1 block 2 ... R1 R2 R3 R4 R5 R1 R2 R3 (a) R3 (b) R4 R5 R6 R7 (a) Notes 3

  41. R1 R2 R3 (a) R3 (b) R4 R5 R6 R7 (a) With spanned records: need indication need indication of partial record of continuation “pointer” to rest (+ from where?) Notes 3

  42. Spanned vs. unspanned: • Unspanned is much simpler, but may waste space… • Spanned essential if record size > block size Notes 3

  43. block 1 block 2 2050 bytes wasted 2046 2050 bytes wasted 2046 R1 R2 Example 106 records each of size 2,050 bytes (fixed) block size = 4096 bytes • Total wasted = 2 x 109 Utiliz = 50% • Total space = 4 x 109 Notes 3

  44. (3) Mixed record types • Mixed - records of different types (e.g. EMPLOYEE, DEPT) allowed in same block e.g., a block EMP e1 DEPT d1 DEPT d2 Notes 3

  45. Why do we want to mix? Answer: CLUSTERING Records that are frequently accessed together should be in the same block Notes 3

  46. Compromise: No mixing, but keep related records in same cylinder ... Notes 3

  47. Example Q1: select A#, C_NAME, C_CITY, … from DEPOSIT, CUSTOMER where DEPOSIT.C_NAME = CUSTOMER.C.NAME a block CUSTOMER,NAME=SMITH DEPOSIT,NAME=SMITH DEPOSIT,NAME=SMITH Notes 3

  48. If Q1 frequent, clustering good • But if Q2 frequent Q2: SELECT * FROM CUSTOMER CLUSTERING IS COUNTER PRODUCTIVE Notes 3

  49. (4) Split records Fixed part in one block Typically for hybrid format Variable part in another block Notes 3

  50. R2 (a) This block also has fixed recs. R2 (b) R2 (c) Block with variable recs. Block with fixed recs. R1 (a) R1 (b) Notes 3

More Related