
Hashing on the Disk

Presentation Transcript


  1. Hashing on the Disk • Keys are stored in “disk pages” (“buckets”) • several records fit within one page • Retrieval: • find address of page • bring page into main memory • searching within the page comes for free
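A minimal sketch of these retrieval steps (not from the slides): the page layout, the capacity of 4 records, and the helper names are assumptions.

/* Minimal sketch of page-based retrieval: hash the key to a page number,
 * read that one page from disk, then search it in memory "for free".
 * Record layout, page size, and number of pages are illustrative assumptions. */
#include <stdio.h>

#define PAGE_RECORDS 4              /* b: records per page (assumed) */
#define M            8              /* m: number of pages (assumed)  */

typedef struct { long key; long value; } Record;
typedef struct { int count; Record recs[PAGE_RECORDS]; } Page;

static int hash(long key) { return (int)(key % M); }   /* h(key) in 0..m-1 */

/* Returns 1 and fills *out if the key is found, 0 otherwise. */
int lookup(FILE *f, long key, Record *out) {
    Page page;
    int addr = hash(key);                                  /* find page address   */
    fseek(f, (long)addr * (long)sizeof(Page), SEEK_SET);   /* one disk access     */
    if (fread(&page, sizeof(Page), 1, f) != 1) return 0;   /* bring page into RAM */
    for (int i = 0; i < page.count; i++)                   /* in-memory search    */
        if (page.recs[i].key == key) { *out = page.recs[i]; return 1; }
    return 0;
}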

  2. [figure: hash function maps key space Σ to data pages 0 … m−1, each of size b] • page size b: maximum number of records in a page • space utilization u: fraction of the allocated space actually occupied by records

  3. Collisions [figure: a full page with a chained overflow page] • Keys that hash to the same address are stored within the same page • If the page is full, either: • page split: allocate a new page and split the page contents between the old and the new page, or • overflow: keep a list of overflow pages
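A minimal sketch of the overflow-chain alternative, kept in memory for brevity; the page type and capacity are assumptions, and an on-disk version would store a page number instead of a pointer.

/* Sketch of overflow chaining: keys hashing to a full page go to a chained
 * overflow page allocated on demand. */
#include <stdlib.h>

#define PAGE_RECORDS 4                 /* b (assumed) */

typedef struct OPage {
    int count;
    long keys[PAGE_RECORDS];
    struct OPage *next;                /* overflow chain */
} OPage;

/* Insert into the page for this hash address, chaining overflow pages. */
void insert(OPage *page, long key) {
    while (page->count == PAGE_RECORDS) {      /* page full: follow / extend chain */
        if (page->next == NULL)
            page->next = calloc(1, sizeof(OPage));
        page = page->next;
    }
    page->keys[page->count++] = key;
}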

  4. Access Time • Goal: find a key in one disk access • Access time ≈ number of disk accesses • Large u: good space utilization but many overflows or splits => more disk accesses • Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses

  5. Categories of Methods • Static: require file reorganization when the file grows • open addressing, separate chaining • Dynamic: the file grows and shrinks dynamically, adapting to the data size • dynamic hashing, • extendible hashing, • linear hashing, • spiral storage…

  6. Static Hashing Schemes • Separate chaining for disk: • one file: an array of pointers to lists of disk pages • data file: an array of data pages holding the keys mapped to the same address • Open addressing: one file with pages holding the keys mapped to the same address • Performance drops as u increases

  7. Dynamic Hashing Schemes • File size adapts to data size without total reorganization • Typically 1-3 disk accesses to access a key • Access time and u trade off against each other • u between 50-100% (typically 69%) • Complicated implementation

  8. Schemes With Index [figure: index pointing to data pages] • Two disk accesses: one to access the index, one to access the data • with the index in main memory => one disk access • Problem: the index may become too large • Dynamic hashing (Larson 1978) • Extendible hashing (Fagin et al. 1979)

  9. Schemes Without Index [figure: address space mapped directly to data space] • Ideally, less space and fewer disk accesses (at least one per retrieval) • Linear Hashing (Litwin 1980) • Linear Hashing with Partial Expansions (Larson 1980) • Spiral Storage (Martin 1979)

  10. Hash Functions • Must support a shrinking or growing file • as the address space shrinks or grows, the hash function adapts to these changes • hash functions using the first or last bits of the key: key = b_{n-1} b_{n-2} … b_i b_{i-1} … b_1 b_0 • h_i(key) = b_{i-1} … b_1 b_0 supports 2^i addresses • h_i uses one more bit than h_{i-1}, to address larger files
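The bit-selecting hash functions above translate directly to a bit mask; a small sketch:

/* h_i(key) keeps the i least significant bits of the key, so it addresses
 * 2^i pages; h_{i+1} uses one more bit than h_i. */
#include <stdint.h>

uint64_t h(uint64_t key, unsigned i) {
    return key & ((UINT64_C(1) << i) - 1);   /* b_{i-1} ... b_1 b_0 */
}
/* Example: h(key, 3) yields a value in 0..7; h(key, 4) yields a value in 0..15. */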

  11. Dynamic Hashing (Larson 1978) [figure: two-level index; 1st level h1(k) is a hash table, 2nd level h2(k) is a binary tree whose leaves point to data pages of size b] • Two-level index • primary h1(key): accesses a hash table • secondary h2(key): accesses a binary tree

  12. Index • Fixed (static) primary index: h1(key) = key mod m • Dynamic behavior on the secondary index • h2(key) uses i bits of the key • the bit sequence h2 = b_{i-1} … b_1 b_0 denotes which path to follow on the binary tree index in order to reach the data page • scan h2 from right to left (bit 1: follow the right branch, bit 0: follow the left branch)
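A minimal sketch of this two-level lookup, with an assumed tree-node layout (leaves carry a page pointer) and a caller-supplied h2:

/* Sketch of a Larson dynamic-hashing lookup: h1 picks a slot in a fixed
 * primary table, then the bits of h2 are scanned right to left to walk the
 * slot's binary tree down to a data page. */
#include <stdint.h>
#include <stddef.h>

typedef struct Node {
    struct Node *left, *right;   /* internal node: children        */
    void *page;                  /* leaf: pointer to the data page */
} Node;

typedef struct { unsigned m; Node **slots; } Index;   /* primary table of trees */

void *find_page(const Index *idx, uint64_t key, uint64_t (*h2)(uint64_t)) {
    Node *n = idx->slots[key % idx->m];      /* h1(key) = key mod m            */
    uint64_t bits = h2(key);
    while (n->page == NULL) {                /* internal node: keep descending */
        n = (bits & 1) ? n->right : n->left; /* bit 1: right, bit 0: left      */
        bits >>= 1;                          /* scan h2 right to left          */
    }
    return n->page;
}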

  13. Example [figure: primary index with slots 0 … 5 for h1(key) = key mod 6; slot 1 has a binary tree of depth 2 with data pages reached by h2 = “0”, h2 = “01”, h2 = “11”; slot 5 has a single data page (h2 = any); the number of h2 bits used equals the depth of the binary tree (here 2, e.g., h2(key) = “01”)]

  14. Insertions [figure: primary index 0 … 3; the page at slot 1 (h1 = 1, h2 = any) splits into two pages with h1 = 1, h2 = 0 and h1 = 1, h2 = 1] • Initially a fixed-size primary index and no data • insert the record in a new page under its h1 address • if the page is full, allocate one extra page • split the keys between the old and the new page • use one extra bit of h2 for addressing
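A sketch of the split step described above, assuming a leaf node type and that depth counts the h2 bits already consumed on the path to the leaf:

/* Sketch of the split: a full leaf page becomes an internal node with two
 * new leaf pages, and its keys are redistributed on the next bit of h2. */
#include <stdint.h>
#include <stdlib.h>

#define B 4                                     /* page capacity (assumed) */
typedef struct { int count; uint64_t keys[B]; } Page;
typedef struct Node {
    struct Node *left, *right;
    Page *page;                                 /* NULL for internal nodes */
} Node;

void split_leaf(Node *leaf, unsigned depth, uint64_t (*h2)(uint64_t)) {
    Page *old = leaf->page;
    Node *l = calloc(1, sizeof(Node)), *r = calloc(1, sizeof(Node));
    l->page = calloc(1, sizeof(Page));
    r->page = calloc(1, sizeof(Page));
    for (int i = 0; i < old->count; i++) {      /* next bit of h2 decides side */
        Page *dst = ((h2(old->keys[i]) >> depth) & 1) ? r->page : l->page;
        dst->keys[dst->count++] = old->keys[i];
    }
    leaf->page = NULL;                          /* leaf becomes internal node  */
    leaf->left = l;
    leaf->right = r;
    free(old);
}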

  15. Example [figure: successive insertions; the page at h1 = 0 first splits into pages with h2 = 0 and h2 = 1, then its h2 = 1 page splits into h2 = 01 and h2 = 11; the page at h1 = 3 splits into h2 = 0 and h2 = 1; the index and the storage grow accordingly]

  16. Deletions • Find the record to be deleted using h1, h2 • Delete the record • Check the “sibling” page: • do the two pages together hold fewer than b records? • if yes, merge the two pages • delete the (now empty) page • shrink the binary tree index by one level and reduce h2 by one bit

  17. Example [figure: deleting a record and merging two sibling pages into one]

  18. Extendible Hashing (Fagin et al. 1979) • Dynamic hashing without the primary hash table • Primary hashing is omitted • Only secondary hashing, with all binary trees kept at the same level • The index (directory) shrinks and grows with the file size • Data pages are attached to the index

  19. [figure: dynamic hashing vs. dynamic hashing with all binary trees at the same level; flattening the equal-depth trees yields a directory addressed by the bit strings 00, 01, 10, 11, annotated with the number of address bits]

  20. Insertions [figure: an index with a single entry pointing to one data page of size b] • Initially 1 index entry and 1 data page • 0 address bits • insert records into the data page • local depth l: number of address bits of a page • global depth d: the index has size 2^d
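A minimal sketch of an extendible-hashing directory lookup under these definitions; the directory representation and the use of suffix bits are assumptions:

/* The d low-order (suffix) bits of the hashed key select one of the 2^d
 * directory entries, which points to a data page. */
#include <stdint.h>

typedef struct { int local_depth; /* l */ /* ... records ... */ } EPage;
typedef struct { unsigned d; EPage **dir; } Directory;   /* dir has 2^d entries */

EPage *dir_lookup(const Directory *D, uint64_t hashed_key) {
    uint64_t slot = hashed_key & ((UINT64_C(1) << D->d) - 1);  /* d suffix bits */
    return D->dir[slot];          /* several slots may point to the same page
                                     when that page's local depth l < d        */
}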

  21. Page “0” Overflows [figure: after the overflow the index has entries 0 and 1 (global depth d = 1), each pointing to a data page with local depth l = 1]

  22. Page “0” Overflows (cont.) • 1 more key bit for addressing and 1 extra page => the index doubles!! • Split the contents of the previous page between the 2 pages according to the next bit of the key • Global depth d: number of index bits => index size 2^d • Local depth l: number of bits used for addressing the records of a page

  23. Page “0” Overflows (again) [figure: directory of global depth d = 2 with entries 00, 01, 10, 11; pages with local depth 2 contain records sharing the same 2 key bits, while the page with local depth 1 contains records sharing only the 1st bit and is pointed to by two directory entries]

  24. Page “01” Overflows [figure: directory of global depth d = 3 with entries 000 … 111; 1 more key bit is used for addressing; a page of local depth l is pointed to by 2^(d−l) directory entries]

  25. Page “100” Overflows [figure: directory 000 … 111 unchanged; the overflowed page splits and its local depth increases from 2 to 3] • no need to double the index • page 100 splits into two (1 new page) • its local depth l is increased by 1

  26. Insertion Algorithm • If l < d, split the overflowed page (1 extra page) • If l = d => the index is doubled and the page is split • d is increased by 1 => 1 more bit for addressing • update the pointers (either way): • if the d prefix bits are used for addressing: d = d + 1; for (i = 2^d − 1; i >= 0; i--) index[i] = index[i/2]; • if the d suffix bits are used: for (i = 0; i <= 2^d − 1; i++) index[i + 2^d] = index[i]; d = d + 1;
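For the suffix-bit case, the doubling step can be sketched as follows (the Directory type is an assumption; the copy mirrors the pointer-update loop above):

/* The new upper half of the directory initially mirrors the lower half, so
 * every page keeps the same set of (now twice as many) pointers. */
#include <stdlib.h>
#include <string.h>

typedef struct EPage EPage;
typedef struct { unsigned d; EPage **dir; } Directory;   /* 2^d entries */

int double_directory(Directory *D) {
    size_t old_size = (size_t)1 << D->d;
    EPage **grown = realloc(D->dir, 2 * old_size * sizeof(EPage *));
    if (grown == NULL) return -1;
    /* index[i + 2^d] = index[i] for all i, then d = d + 1 */
    memcpy(grown + old_size, grown, old_size * sizeof(EPage *));
    D->dir = grown;
    D->d += 1;
    return 0;
}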

  27. Deletion Algorithm • Find and delete the record • Check the sibling page • If the two pages together hold fewer than b records: • merge the pages and free the empty page • decrease the local depth l by 1 (the records of the merged page have 1 less common bit) • if l < d everywhere => halve the index • update the pointers
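A sketch of the index-halving check from this algorithm, again for the suffix-bit variant with an assumed Directory type:

/* If no page uses all d bits (l < d everywhere), the two halves of the
 * directory are identical and the index can be halved. */
#include <stdlib.h>

typedef struct { int local_depth; } EPage;
typedef struct { unsigned d; EPage **dir; } Directory;   /* 2^d entries */

void try_halve(Directory *D) {
    size_t size = (size_t)1 << D->d;
    if (D->d == 0) return;
    for (size_t i = 0; i < size; i++)
        if ((unsigned)D->dir[i]->local_depth == D->d)
            return;                        /* some page still needs all d bits */
    D->d -= 1;                             /* keep only the lower half         */
    EPage **shrunk = realloc(D->dir, (size / 2) * sizeof(EPage *));
    if (shrunk != NULL) D->dir = shrunk;
}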

  28. Example [figure: a deletion with merging reduces a page’s local depth from 3 to 2; once l < d holds everywhere, the directory is halved from entries 000 … 111 down to 00 … 11]

  29. Observations • If a page splits but more than b keys share the same next bit, the new page overflows again: • take one more bit for addressing (increase l) • if l = d the index doubles again!! • Hashing might fail for non-uniform key distributions (e.g., multiple keys with the same value) • if the distribution is known, transform it to uniform • Dynamic hashing handles non-uniform distributions better (it is only affected locally)

  30. Performance • For n records and page size b: • expected index size follows Flajolet’s analysis (slightly superlinear in n) • 1 disk access per retrieval when the index fits in main memory • 2 disk accesses when the index is on disk • overflows increase the number of disk accesses

  31. Storage Utilization with Page Splitting [figure: a full page of size b before splitting; two half-full pages after splitting] • In general 50% < u < 100% • On average u ≈ ln 2 ≈ 69% (no overflows)

  32. Storage Utilization with Overflows [figure: a primary page of size b with an attached overflow page] • Achieves higher u and avoids index doubling (when d = l) • higher u is achieved with small overflow pages • u = 2b/3b ≈ 66% after splitting • small overflow pages (e.g., size b/2) => u = (b + b/2)/2b = 75% • double the index only if the overflow page itself overflows!!

  33. Linear Hashing (Litwin 1980) • Dynamic scheme without an index: the hash value is used directly as a page address • Overflows are allowed • The file grows one page at a time • The page that splits is not always the one that overflowed • The pages split in a predetermined order

  34. Linear Hashing (cont.) [figure: a file of n pages of size b; the split pointer p marks the next page to split] • Initially n empty pages • p points to the page that splits next • Overflows are allowed

  35. File Growing • A page splits whenever the “splitting criterion” is satisfied: • a page is added at the end of the file • pointer p advances to the next page • the contents of the old page are split between the old and the new page based on the key values
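An in-memory sketch of one such split step, assuming the scheme of the following slides (h0 = key mod n, h1 = key mod 2n); the LFile/LPage types are illustrative:

/* One linear-hashing split: a page is appended, the keys of page p are
 * redistributed with h1 = key mod 2n, and p advances; when p reaches n the
 * file has doubled, so n is doubled and p is reset (h0 becomes the old h1). */
#include <stdlib.h>

typedef struct { long *keys; int count; } LPage;
typedef struct { LPage *pages; long n; long p; long npages; } LFile;

void split_next(LFile *f) {
    f->pages = realloc(f->pages, (f->npages + 1) * sizeof(LPage));
    LPage *newp = &f->pages[f->npages++];          /* page added at end of file */
    newp->keys = NULL; newp->count = 0;

    LPage *old = &f->pages[f->p];
    int kept = 0;
    for (int i = 0; i < old->count; i++) {         /* split on h1 = key mod 2n  */
        long k = old->keys[i];
        if (k % (2 * f->n) == f->p) old->keys[kept++] = k;
        else {
            newp->keys = realloc(newp->keys, (newp->count + 1) * sizeof(long));
            newp->keys[newp->count++] = k;
        }
    }
    old->count = kept;

    if (++f->p == f->n) { f->n *= 2; f->p = 0; }   /* file doubled: h0 = h1     */
}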

  36. Example [figure: pages 0 … 4 holding the keys 125, 320, 90, 435, 16, 711, 402, 27, 737, 712, 613, 303, 4, 319 under h0; new elements to insert: 438, 215, 522] • b = b_page = 4, b_overflow = 1 • initially n = 5 pages • hash function h0 = k mod 5 • splitting criterion: u > A% • alternatively, split when an overflow page overflows, etc.

  37. Example (cont.) [figure: after the split the file has pages 0 … 5; page 0 keeps 320, 90 and the new page 5 receives 125, 435, 215; p now points to page 1] • Page 5 is added at the end of the file • The contents of page 0 are split between pages 0 and 5 based on the hash function h1 = key mod 10 • p points to the next page

  38. Hash Functions • Initially h0 = key mod n • As new pages are added at the end of the file, h0 alone becomes insufficient • The file will eventually double its size • In that case use h1 = key mod 2n • In the meantime: • use h0 for pages not yet split • use h1 for pages that have already split • Split the contents of the page pointed to by p based on h1

  39. Hash Functions (cont.) • When the file has doubled its size, h0 is no longer needed • set h0 = h1 and continue (e.g., h0 = k mod 10) • The file will eventually double its size again • Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)

  40. Hash Functions • Initially n pages and 0 <= h0(k) <= n − 1 • Series of hash functions h_i(k) = k mod (2^i · n) • Selection of hash function: if h_i(k) >= p then use h_i(k), else use h_{i+1}(k)
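The selection rule as a small function, using the series h_i(k) = k mod (2^i · n) implied by the examples above:

/* Pages before the split pointer p have already been split, so they are
 * addressed with the next hash function in the series. */
long address(long key, long n, long i, long p) {
    long a = key % ((1L << i) * n);          /* h_i(k)                      */
    if (a >= p) return a;                    /* page not split yet: use h_i */
    return key % ((1L << (i + 1)) * n);      /* page already split: h_{i+1} */
}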

  41. Linear Hashing with Partial Expansions (Larson 1980) • Problem with Linear Hashing: pages to the right of p are split late • large overflow chains build up on the rightmost pages • Solution: do not make a page wait as long for its split • k partial expansions: take the pages in groups of k • all k pages of a group split together • the file grows at a lower rate

  42. Two Partial Expansions [figure: two pointers address the pages of the same group, e.g., pages 0 and n] • Initially 2n pages, n groups, 2 pages per group • groups: (0, n), (1, 1+n), …, (i, i+n), …, (n−1, 2n−1) • The pages of a group split together => some of their records go to a new page at the end of the file (position 2n for the first group)
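A small illustrative helper for this grouping during the first expansion; the Group type and the new-page position for later groups are assumptions derived from the slides:

/* During the first expansion page q belongs to group (q mod n) together with
 * its partner page, and the group's new page is appended at the end of the
 * file (position 2n + group index). */
typedef struct { long members[2]; long new_page; } Group;

Group group_of(long q, long n) {            /* first expansion: 2n pages, n groups */
    Group g;
    long j = q % n;                         /* group index                         */
    g.members[0] = j;                       /* pages j and j + n split together    */
    g.members[1] = j + n;
    g.new_page   = 2 * n + j;               /* records may move to this new page   */
    return g;
}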

  43. 1st Expansion [figure: the file grows from 2n to 3n pages] • After n splits, all pages have been split • the file has 3n pages (1.5 times larger) • the file grows at a lower rate • after the 1st expansion, take the pages in groups of 3: (j, j+n, j+2n), 0 <= j <= n−1

  44. 2nd Expansion [figure: pointers to the pages of the same group; the file grows from 3n to 4n pages] • After n more splits the file has size 4n • repeat the same process, now starting with 4n pages in 2n groups

  45. Performance [figure: disk accesses per retrieval (y axis, 1 to 1.6) vs. relative file size (x axis, 1 to 2) for Linear Hashing and Linear Hashing with 2 partial expansions]
Average disk accesses (b = 5, b' = 5, u = 0.85):
              Linear Hashing   2 part. exp.   3 part. exp.
retrieval     1.17             1.12           1.09
insertion     3.57             3.21           3.31
deletion      4.04             3.53           3.56

  46. Dynamic Hashing Schemes • Very good performance on membership, insert, and delete operations • Suitable for both main memory and disk • b = 1–3 records for main memory • b = 1–4 Kbytes for disk • Critical parameter: space utilization u • large u => more overflows, worse performance • small u => fewer overflows, better performance • Suitable for direct-access queries (random accesses) but not for range queries
