1 / 57

Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications

Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications. C. Faloutsos Indexing and Hashing – part II. General Overview - rel. model. Relational model - SQL Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing.

zulema
Download Presentation

Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Carnegie Mellon Univ.Dept. of Computer Science15-415 - Database Applications C. Faloutsos Indexing and Hashing – part II

  2. General Overview - rel. model • Relational model - SQL • Formal & commercial query languages • Functional Dependencies • Normalization • Physical Design • Indexing 15-415 - C. Faloutsos

  3. Indexing- overview • ISAM and B-trees • hashing • Hashing vs B-trees • Indices in SQL • Advanced topics: • dynamic hashing • multi-attribute indexing 15-415 - C. Faloutsos

  4. (Static) Hashing Problem: “find EMP record with ssn=123” What if disk space was free, and time was at premium? 15-415 - C. Faloutsos

  5. Hashing A: Brilliant idea: key-to-address transformation: #0 page 123; Smith; Main str #123 page #999,999,999 15-415 - C. Faloutsos

  6. #0 page 123; Smith; Main str #123 page #999,999,999 Hashing Since space is NOT free: • use M, instead of 999,999,999 slots • hash function: h(key) = slot-id 15-415 - C. Faloutsos

  7. #0 page 123; Smith; Main str #h(123) M Hashing Typically: each hash bucket is a page, holding many records: 15-415 - C. Faloutsos

  8. #0 page 123; Smith; Main str. #h(123) M Hashing Notice: could have clustering, or non-clustering versions: 15-415 - C. Faloutsos

  9. EMP file ... #0 page ... 234; Johnson; Forbes ave 123 #h(123) 123; Smith; Main str. ... M 345; Tompson; Fifth ave ... Hashing Notice: could have clustering, or non-clustering versions: 15-415 - C. Faloutsos

  10. Indexing- overview • ISAM and B-trees • hashing • hashing functions • size of hash table • collision resolution • Hashing vs B-trees • Indices in SQL • Advanced topics: 15-415 - C. Faloutsos

  11. Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method 15-415 - C. Faloutsos

  12. Design decisions - functions • Goal: uniform spread of keys over hash buckets • Popular choices: • Division hashing • Multiplication hashing 15-415 - C. Faloutsos

  13. Division hashing h(x) = (a*x+b) mod M • eg., h(ssn) = (ssn) mod 1,000 • gives the last three digits of ssn • M: size of hash table - choose a prime number, defensively (why?) 15-415 - C. Faloutsos

  14. eg., M=2; hash on driver-license number (dln), where last digit is ‘gender’ (0/1 = M/F) in an army unit with predominantly male soldiers Thus: avoid cases where M and keys have common divisors - prime M guards against that! Division hashing 15-415 - C. Faloutsos

  15. h(x) = [ fractional-part-of ( x * φ ) ] * M φ: golden ratio ( 0.618... = ( sqrt(5)-1)/2 ) in general, we need an irrational number advantage: M need not be a prime number but φ must be irrational Multiplication hashing 15-415 - C. Faloutsos

  16. quadratic hashing (bad) ... conclusion: use division hashing Other hashing functions 15-415 - C. Faloutsos

  17. Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method 15-415 - C. Faloutsos

  18. eg., 50,000 employees, 10 employee-records / page Q: M=?? pages/buckets/slots Size of hash table 15-415 - C. Faloutsos

  19. eg., 50,000 employees, 10 employees/page Q: M=?? pages/buckets/slots A: utilization ~ 90% and M: prime number Eg., in our case: M= closest prime to50,000/10 / 0.9 = 5,555 Size of hash table 15-415 - C. Faloutsos

  20. Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method 15-415 - C. Faloutsos

  21. Collision resolution • Q: what is a ‘collision’? • A: ?? 15-415 - C. Faloutsos

  22. Collision resolution #0 page FULL #h(123) 123; Smith; Main str. M 15-415 - C. Faloutsos

  23. Collision resolution • Q: what is a ‘collision’? • A: ?? • Q: why worry about collisions/overflows? (recall that buckets are ~90% full) • A: ‘birthday paradox’ 15-415 - C. Faloutsos

  24. Collision resolution • open addressing • linear probing (ie., put to next slot/bucket) • re-hashing • separate chaining (ie., put links to overflow pages) 15-415 - C. Faloutsos

  25. Collision resolution linear probing: #0 page FULL #h(123) 123; Smith; Main str. M 15-415 - C. Faloutsos

  26. Collision resolution re-hashing #0 page h1() FULL #h(123) 123; Smith; Main str. h2() M 15-415 - C. Faloutsos

  27. Collision resolution separate chaining FULL 123; Smith; Main str. 15-415 - C. Faloutsos

  28. Design decisions - conclusions • function: division hashing • h(x) = ( a*x+b ) mod M • size M: ~90% util.; prime number. • collision resolution: separate chaining • easier to implement (deletions!); • no danger of becoming full 15-415 - C. Faloutsos

  29. Indexing- overview • ISAM and B-trees • hashing • Hashing vs B-trees • Indices in SQL • Advanced topics: • dynamic hashing • multi-attribute indexing 15-415 - C. Faloutsos

  30. Hashing vs B-trees: Hashing offers • speed ! ( O(1) avg. search time) ..but: 15-415 - C. Faloutsos

  31. Hashing vs B-trees: ..but B-trees give: • key ordering: • range queries • proximity queries • sequential scan • O(log(N)) guarantees for search, ins./del. • graceful growing/shrinking 15-415 - C. Faloutsos

  32. Hashing vs B-trees: thus: • B-trees are implemented in most systems footnotes: • hashing is not (why not?) • ‘dbm’ and ‘ndbm’ of UNIX: offer one or both 15-415 - C. Faloutsos

  33. Indexing- overview • ISAM and B-trees • hashing • Hashing vs B-trees • Indices in SQL • Advanced topics: • dynamic hashing • multi-attribute indexing 15-415 - C. Faloutsos

  34. Indexing in SQL • create index <index-name> on <relation-name> (<attribute-list>) • create unique index <index-name> on <relation-name> (<attribute-list>) • drop index <index-name> 15-415 - C. Faloutsos

  35. Indexing in SQL • eg., create index ssn-index on STUDENT (ssn) • or (eg., on TAKES(ssn,cid, grade) ): create index sc-index on TAKES (ssn, c-id) 15-415 - C. Faloutsos

  36. Indexing- overview • ISAM and B-trees • hashing • Hashing vs B-trees • Indices in SQL • Advanced topics: (theoretical interest) • dynamic hashing • multi-attribute indexing 15-415 - C. Faloutsos

  37. Problem with static hashing • problem: overflow? • problem: underflow? (underutilization) 15-415 - C. Faloutsos

  38. Solution: Dynamic/extendible hashing • idea: shrink / expand hash table on demand.. • ..dynamic hashing Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’ 15-415 - C. Faloutsos

  39. Extendible hashing #0 page FULL #h(123) 123; Smith; Main str. M 15-415 - C. Faloutsos

  40. Extendible hashing #0 page solution: split the bucket in two FULL #h(123) 123; Smith; Main str. M 15-415 - C. Faloutsos

  41. in detail: keep a directory, with ptrs to hash-buckets Q: how to divide contents of bucket in two? A: hash each key into a very long bit string; keep only as many bits as needed Eventually: Extendible hashing 15-415 - C. Faloutsos

  42. Extendible hashing directory 0001... 0111... 00... 01... 10101... 10... 10011... 10110... 11... 1101... 101001... 15-415 - C. Faloutsos

  43. Extendible hashing directory 0001... 0111... 00... 01... 10101... 10... 10011... 10110... 11... 1101... 101001... 15-415 - C. Faloutsos

  44. Extendible hashing directory 0001... 0111... 00... 01... 10101... 10... split on 3-rd bit 10011... 10110... 11... 101001... 1101... 15-415 - C. Faloutsos

  45. Extendible hashing directory 0001... 0111... 00... 01... new page / bucket 10... 10011... 10101... 11... 101001... 10110... 1101... 15-415 - C. Faloutsos

  46. 0001... 0111... 000... 001... 010... 10011... 10101... 011... 101001... 10110... 100... 1101... 101... 110... 111... Extendible hashing directory (doubled) new page / bucket 15-415 - C. Faloutsos

  47. 0001... 0111... 000... 00... 001... 01... 10101... 010... 10... 10101... 10011... 10110... 101001... 11... 011... 100... 101001... 10110... 101... 1101... 110... 111... Extendible hashing BEFORE AFTER 0001... 0111... 10011... 1101... 15-415 - C. Faloutsos

  48. Extendible hashing • Summary: directory doubles on demand • or halves, on shrinking files • needs ‘local’ and ‘global’ depth (see book) • Mainly, of theoretical interest - same for • ‘linear hashing’ of Litwin • ‘order preserving’ • ‘perfect hashing’ (no collisions!) 15-415 - C. Faloutsos

  49. Indexing- overview • ISAM and B-trees • hashing • Hashing vs B-trees • Indices in SQL • Advanced topics: • dynamic hashing • multi-attribute indexing 15-415 - C. Faloutsos

  50. multiple-key access • how to support queries on multiple attributes, like • grade>=3 and course=‘415’ • major motivation: Geographic Information systems (GIS) 15-415 - C. Faloutsos

More Related