Hashing on the Disk
• Keys are stored in “disk pages” (“buckets”)
  • several records fit within one page
• Retrieval:
  • find the address of the page
  • bring the page into main memory
  • searching within the page comes for free
Hashing
[Figure: hash function mapping the key space Σ to data pages 0, 1, 2, …, m-1, each of size b]
• page size b: maximum number of records in a page
• space utilization u: measure of the use of space
Collisions
[Figure: a full page with an overflow page chained to it]
• Keys that hash to the same address are stored within the same page
• If the page is full, either:
  • page splits: allocate a new page and split the page contents between the old and the new page, or
  • overflows: keep a list of overflow pages
Access Time
• Goal: find a key in one disk access
• Access time ~ number of disk accesses
• Large u: good space utilization, but many overflows or splits => more disk accesses
• Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses
Categories of Methods
• Static: require file reorganization as the file grows
  • open addressing, separate chaining
• Dynamic: the file grows dynamically and adapts to its size
  • dynamic hashing, extendible hashing, linear hashing, spiral storage, …
Static Hashing Schemes
• Separate chaining: one file is an array of pointers to lists of disk pages; the data file is an array of data pages holding the keys mapped to the same address
• Open addressing: one file with pages holding the keys mapped to the same address
• Performance drops as u increases
Dynamic Hashing Schemes
• File size adapts to data size without total reorganization
• Typically 1-3 disk accesses to reach a key
• Access time and u are a typical trade-off
  • u between 50-100% (typically ~69%)
• More complicated implementation
Schemes With Index
[Figure: index pointing to data pages]
• Two disk accesses: one to access the index, one to access the data
  • with the index in main memory => one disk access
• Problem: the index may become too large
• Dynamic hashing (Larson 1978)
• Extendible hashing (Fagin et al. 1979)
Schemes Without Index
[Figure: address space mapped directly to data space]
• Ideally, less space and fewer disk accesses (but at least one)
• Linear Hashing (Litwin 1980)
• Linear Hashing with Partial Expansions (Larson 1980)
• Spiral Storage (Martin 1979)
Hash Functions
• Must support shrinking or growing files
  • as the address space shrinks or grows, the hash function adapts to these changes
• Hash functions using the first (last) bits of the key = b(n-1) b(n-2) … b2 b1 b0
  • hi(key) = b(i-1) … b2 b1 b0 (the last i bits) supports 2^i addresses
  • hi uses one more bit than h(i-1) to address larger files
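This bit-suffix hash family can be sketched directly; a minimal sketch, with the function name and example key my own:

```python
# h(i, key) keeps the i least-significant bits of the key, so it can
# address 2**i pages, and h(i+1, .) refines h(i, .) by one extra bit.

def h(i, key):
    """Return the i low-order bits of key: one of 2**i addresses."""
    return key & ((1 << i) - 1)

k = 0b10110101
assert h(3, k) == 0b101        # last 3 bits: one of 2**3 = 8 addresses
assert h(4, k) == 0b0101       # one more bit: one of 16 addresses
# h(i+1, .) agrees with h(i, .) on the low i bits:
assert h(4, k) & 0b111 == h(3, k)
```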
Dynamic Hashing (Larson 1978)
[Figure: two-level index: h1(k) selects a hash-table slot (1st level), h2(k) selects a path in a binary tree (2nd level) leading to data pages of size b]
• Two-level index
  • primary h1(key): accesses a hash table
  • secondary h2(key): accesses a binary tree
Index
• Primary index is fixed (static): h1(key) = key mod m
• Dynamic behavior is on the secondary index
  • h2(key) uses i bits of the key
  • the bit sequence h2 = b(i-1) … b2 b1 b0 denotes which path of the binary tree index to follow in order to access the data page
  • scan h2 from right to left (bit 1: follow the right path, bit 0: follow the left path)
[Figure: example with h1(key) = key mod 6 and a binary tree index of depth 2, i.e., h2(key) uses up to 2 bits, e.g., h1=1, h2=“0”; h1=1, h2=“01”; h1=1, h2=“11”; h1=5, h2=any]
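The two-level lookup can be sketched as follows; a hypothetical sketch assuming h1(key) = key mod m and h2 bits taken directly from the key, with class and function names my own:

```python
# Larson-style lookup sketch: h1 picks a slot in a fixed table; the bits
# of h2(key), scanned right to left, pick a path in that slot's binary
# trie (bit 1 -> right child, bit 0 -> left child) down to a data page.

class Node:
    def __init__(self, page=None):
        self.page = page               # leaf: the data page (a set of keys)
        self.left = self.right = None  # internal: children for bit 0 / bit 1

def lookup(table, m, key):
    node = table[key % m]              # primary: h1(key) = key mod m
    h2 = key                           # secondary: bits taken from the key
    while node.page is None:           # walk the trie down to a leaf
        node = node.right if h2 & 1 else node.left
        h2 >>= 1                       # consume bits right to left
    return key in node.page

# Slot 1 holds a depth-1 trie splitting on the last bit of the key.
m = 5
table = [Node(page=set()) for _ in range(m)]
table[1] = Node()
table[1].left = Node(page={6})         # keys with bit 0 = 0
table[1].right = Node(page={11})       # keys with bit 0 = 1

assert lookup(table, m, 6) and lookup(table, m, 11)
assert not lookup(table, m, 16)        # 16 mod 5 = 1, bit 0 = 0 -> page {6}
```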
Insertions
• Initially: fixed-size primary index and no data pages
• insert the record in a new page under its h1 address
• if the page is full, allocate one extra page
  • split the keys between the old and the new page
  • use one extra bit of h2 for addressing (e.g., h1=1, h2=0 vs. h1=1, h2=1)
[Figure: the page under h1=1 splitting into pages for h2=0 and h2=1]
[Figure: storage and index after successive splits: pages addressed by h1=0, h2=0; h1=0, h2=01; h1=0, h2=11; h1=3, h2=0; h1=3, h2=1]
Deletions
• Find the record to be deleted using h1, h2
• Delete the record
• Check the “sibling” page:
  • do the two pages together hold fewer than b records?
  • if yes, merge the two pages
  • delete the one empty page
  • shrink the binary tree index by one level and reduce h2 by one bit
[Figure: deletion example: after deleting a record, two sibling pages merge and the tree shrinks by one level]
Extendible Hashing (Fagin et al. 1979)
• Dynamic hashing with the primary hashing step omitted
• Only secondary hashing, with all binary trees kept at the same level
• The index shrinks and grows according to the file size
• Data pages are attached to the index
[Figure: dynamic hashing vs. dynamic hashing with all binary trees at the same level: the leaf pointers then form an array indexed by the number of address bits (here 2 bits: 00, 01, 10, 11)]
Insertions
[Figure: initial state: an index of size 1 pointing to one data page of size b]
• Initially: 1 index entry and 1 data page
  • 0 address bits
  • insert records into the data page
• local depth l: number of address bits used by a page
• global depth d: the index has size 2^d
Page “0” Overflows
[Figure: after the overflow: global depth d = 1, local depth l = 1, index entries 0 and 1 point to two pages]
Page “0” Overflows (cont.)
• 1 more key bit for addressing and 1 extra page => the index doubles!!
• Split the contents of the previous page between the 2 pages according to the next bit of the key
• Global depth d: number of index bits => index size 2^d
• Local depth l: number of bits used for addressing records within the page
Page “0” Overflows (again)
[Figure: d = 2, index entries 00, 01, 10, 11: pages with l = 2 contain records agreeing on 2 bits of the key; the page with l = 1 contains records agreeing on 1 bit of the key]
Page “01” Overflows
[Figure: d = 3, index entries 000-111: 1 more key bit for addressing; a page with local depth l has 2^(d-l) index pointers to it]
Page “100” Overflows
[Figure: d = 3 index; page “100” (l = 2) splits into two pages with l = 3]
• no need to double the index
• page “100” splits into two (1 new page)
• its local depth l is increased by 1
Insertion Algorithm
• If l < d: split the overflowed page (1 extra page)
• If l = d: the index is doubled and the page is split
  • d is increased by 1 => 1 more bit for addressing
• update the pointers (either way):
  • if d prefix bits are used for addressing:
    d = d + 1; for (i = 2^d - 1; i >= 0; i--) index[i] = index[i/2];
  • if d suffix bits are used:
    for (i = 0; i <= 2^d - 1; i++) index[i + 2^d] = index[i]; d = d + 1;
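A minimal runnable sketch of this insertion algorithm, using suffix bits for directory addressing; the page capacity B, class names, and the doubling-by-concatenation trick are my own choices:

```python
# Extendible-hashing insertion sketch: the d low-order bits of the key
# address a directory of page pointers; a full page either splits (l < d)
# or forces the directory to double first (l = d).

B = 2  # page capacity b

class Page:
    def __init__(self, depth):
        self.l = depth      # local depth: bits shared by all keys in the page
        self.keys = []

def insert(directory, d, key):
    """Insert key; returns the (possibly doubled) directory and new d."""
    while True:
        page = directory[key & ((1 << d) - 1)]
        if len(page.keys) < B:
            page.keys.append(key)
            return directory, d
        if page.l == d:                        # l = d: double the index
            directory = directory + directory  # suffix bits: i and i + 2**d alias
            d += 1
        bit = 1 << page.l                      # split on one more local bit
        page.l += 1
        new = Page(page.l)
        new.keys = [k for k in page.keys if k & bit]
        page.keys = [k for k in page.keys if not k & bit]
        for i in range(len(directory)):        # aliases with the new bit set
            if directory[i] is page and i & bit:
                directory[i] = new
        # retry the insert now that the full page has been split
        # (many keys with the same value would loop forever: the known
        # failure mode for non-uniform distributions noted later)

d, directory = 1, [Page(1), Page(1)]
for k in [0, 2, 4, 6]:          # all even keys: page "0" splits repeatedly
    directory, d = insert(directory, d, k)
assert d == 2 and len(directory) == 4
assert directory[0].keys == [0, 4] and directory[2].keys == [2, 6]
```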
Deletion Algorithm
• Find and delete the record
• Check the sibling page
• If the two pages together hold fewer than b records:
  • merge the pages and free the empty page
  • decrease the local depth l by 1 (records in the merged page have 1 less bit in common)
  • if l < d everywhere => reduce the index to half its size
  • update the pointers
[Figure: deletion example: two sibling pages with l = 3 merge into one with l = 2; once l < d everywhere, the index shrinks from 8 entries (d = 3) to 4 entries (d = 2)]
Observations
• A page may split while more than b keys still share the same next bit
  • take one more bit for addressing (increase l)
  • if d = l, the index doubles again!!
• Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with the same value)
  • if the distribution is known, transform it to a uniform one
• Dynamic hashing performs better for non-uniform distributions (it is affected only locally)
Performance
• For n records and page size b:
  • expected size of the index (analyzed by Flajolet)
• 1 disk access per retrieval when the index is in main memory
• 2 disk accesses when the index is on disk
• overflows increase the number of disk accesses
Storage Utilization with Page Splitting
[Figure: a full page of b records before splitting; two half-full pages after splitting]
• In general 50% < u < 100%
• On average u ~ ln 2 ~ 69% (with no overflows)
Storage Utilization with Overflows
[Figure: a full page with a small overflow page attached]
• Achieves higher u and avoids index doubling (when d = l)
• higher u is achieved with small overflow pages
  • u = 2b/3b ~ 66% after splitting
  • small overflow pages (e.g., of size b/2) => u = (b + b/2)/2b ~ 75%
• double the index only if the overflow page itself overflows!!
Linear Hashing (Litwin 1980)
• Dynamic scheme without an index
  • hash values refer directly to page addresses
• Overflows are allowed
• The file grows one page at a time
• The page that splits is not always the one that overflowed
• The pages split in a predetermined order
Linear Hashing (cont.)
[Figure: a file of n pages of size b with the split pointer p at page 0]
• Initially n empty pages
• p points to the page that splits next
• Overflows are allowed
File Growing
• A page splits whenever the “splitting criterion” is satisfied
  • a page is added at the end of the file
  • the pointer p moves to the next page
  • the contents of the old page are split between the old and the new page based on key values
[Figure: example with b = b_page = 4, b_overflow = 1; initially n = 5 pages and hash function h0 = k mod 5; the pages hold keys 125, 320, 90, 435, 16, 711, 402, 27, 737, 712, 613, 303, 4, 319; new elements 438, 215, 522 arrive]
• splitting criterion: u > A%
• alternatively, split when an overflow page overflows, etc.
• Page 5 is added at the end of the file
• The contents of page 0 are split between pages 0 and 5 based on the hash function h1 = key mod 10
• p points to the next page
Hash Functions
• Initially h0 = key mod n
• As new pages are added at the end of the file, h0 alone becomes insufficient
• The file will eventually double its size
  • in that case use h1 = key mod 2n
• In the meantime:
  • use h0 for pages that have not yet split
  • use h1 for pages that have already split
• Split the contents of the page pointed to by p based on h1
Hash Functions (cont.)
• When the file has doubled its size, h0 is no longer needed
  • set h0 = h1 and continue (e.g., h0 = k mod 10)
• The file will eventually double its size again
• Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)
Hash Functions (cont.)
• Initially n pages and 0 <= h0(k) <= n-1
• Series of hash functions: hi(k) = k mod (2^i n)
• Selection of hash function: if hi(k) >= p then use hi(k), else use h(i+1)(k)
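The address computation can be written directly; a sketch, with the h_i formula inferred from the running example where h0 = k mod 5 and h1 = k mod 10:

```python
# Linear-hashing address computation: use h_i(k) = k mod (2**i * n); if
# that lands on a page before the split pointer p (a page that has
# already split this round), rehash with h_{i+1}.

def address(k, n, level, p):
    a = k % (2 ** level * n)                 # h_level(k)
    if a < p:                                # page a has already split
        a = k % (2 ** (level + 1) * n)       # h_{level+1}(k)
    return a

# Matches the running example: n = 5, page 0 has split, p = 1.
assert address(125, 5, 0, 1) == 5   # 125 mod 5 = 0 < p, so 125 mod 10 = 5
assert address(320, 5, 0, 1) == 0   # 320 mod 10 = 0: stays in page 0
assert address(16, 5, 0, 1) == 1    # 16 mod 5 = 1 >= p: h0 suffices
```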
Linear Hashing with Partial Expansions (Larson 1980)
• Problem with linear hashing: pages to the right of p wait a long time to split
  • long chains of overflow pages on the rightmost pages
• Solution: do not wait that long to split a page
• k partial expansions: take the pages in groups of k
  • all k pages of a group split together
  • the file grows at a lower rate
Two Partial Expansions
[Figure: 2 pointers to the pages of the same group, spanning pages 0 … n … 2n]
• Initially 2n pages, n groups, 2 pages per group
  • groups: (0, n), (1, 1+n), …, (i, i+n), …, (n-1, 2n-1)
• Pages in the same group split together => some records go to a new page at the end of the file (position 2n)
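The grouping can be sketched as follows; the function name is mine, and "expanded" marks a group that has already split in the current expansion:

```python
# Two partial expansions: with 2n pages, group j = (j, j+n); when the
# group splits, its records spread over (j, j+n, 2n+j), so one full
# expansion grows the file from 2n to 3n pages (1.5x instead of 2x).

def group_pages(j, n, expanded):
    """Pages of group j before/after its split in the first expansion."""
    return (j, j + n, 2 * n + j) if expanded else (j, j + n)

n = 4
assert group_pages(0, n, False) == (0, 4)
assert group_pages(2, n, True) == (2, 6, 10)
# after all n groups have split, the file has 3n pages
assert max(p for j in range(n) for p in group_pages(j, n, True)) + 1 == 3 * n
```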
1st Expansion
[Figure: the file growing from 2n to 3n pages]
• After n splits, all pages have been split
  • the file has 3n pages (1.5 times larger)
  • the file grows at a lower rate
• After the 1st expansion, take the pages in groups of 3: (j, j+n, j+2n), 0 <= j <= n-1
2nd Expansion
[Figure: 2 pointers to the pages of the same group, spanning pages 0 … 2n … 4n]
• After n more splits the file has size 4n
• repeat the same process, starting with 4n pages in 2n groups
[Figure: disk accesses per retrieval vs. relative file size (1 to 2): linear hashing ranges from ~1.1 to ~1.6 accesses, while 2 partial expansions stays consistently lower]

Performance (b = 5, b' = 5, u = 0.85):

            Linear Hashing | 2 partial exp. | 3 partial exp.
retrieval        1.17      |      1.12      |      1.09
insertion        3.57      |      3.21      |      3.31
deletion         4.04      |      3.53      |      3.56
Dynamic Hashing Schemes
• Very good performance for membership, insert, and delete operations
• Suitable for both main memory and disk
  • b = 1-3 records for main memory
  • b = 1-4 KB pages for disk
• Critical parameter: space utilization u
  • large u => more overflows, worse performance
  • small u => fewer overflows, better performance
• Suitable for direct-access queries (random accesses) but not for range queries