1 / 33

Hashing

Hashing. Chapter 8. Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management In general, the following operations are performed on a symbol table determine if a particular name is in the table

weylin
Download Presentation

Hashing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hashing Chapter 8

  2. Symbol Table • Symbol table is used widely in many applications. • dictionary is a kind of symbol table • data dictionary is database management • In general, the following operations are performed on a symbol table • determine if a particular name is in the table • retrieve the attribute of that name • modify the attributes of that name • insert a new name and its attributes • delete a name and its attributes

  3. Symbol Table • Popular operations on a symbol table include search, insertion, and deletion • A binary search tree could be used to represent a symbol table. • The worse-case complexities for the operations are O(n). • Hashing: insertions, deletions & finds in O(1). • Static hashing • Dynamic hashing

  4. Static Hashing • Dictionary pairs are stored in a fixed-size table called hash table. • The address of location of an dictionary pairs, x, is obtained by computing some arithmetic function h(x). • The memory available to maintain the symbol table (hash table) is assumed to be sequential. • The hash table consists of bbuckets and each bucket contains s records. • h(x) maps the set of possible dictionary pairsonto the integers 0 through b-1.

  5. Hash Tables • The key density of a hash table is the ratio n/T, where n is the number of dictionary pairs in the table and T is the total number of possible keys. • The loading density or loading factor of a hash table is α= n/(sb).

  6. Hash Tables • Two keys, I1, and I2, are said to be synonyms with respect to h if h(I1) = h(I2). • An overflow occurs when a new keyiis mapped or hashed by h into a full bucket. • A collision occurs when two non-identical keys are hashed into the same bucket. • If the bucket size is 1, collisions and overflows occur at the same time.

  7. Example 8.1 • b=26, s=2 • Assume there are 10 distinct keysGA, A,G,L,A2,A1,A3,A4 and E • If no overflow occur, the time required for hashing depends only on the time required to compute the hash function h. • A1?

  8. Hash Functions • Requirements • Simple to compute • Minimize the number of collisions • Dependent upon all the characters in the keys • Uniform hash function • If there are b buckets, we hope to have h(x)=iwith the probability being (1/b) • Mid-square, division, folding, digit analysis

  9. Division • Using the modulo (%) operator. • A key kis divided by some number D and the remainder is used as the hash address for k. • The bucket addresses are in the range of 0 through D-1. • If Dis a power of 2, then h(k) depends only on the least significant bits of k.

  10. Division • If a division function h is used as the hash function, the table size should not be a power of two. • Since programmers have a tendency to use many variables with the same suffix  too many collisions • If Dis divisible by two, the odd keys are mapped to odd buckets and even keys are mapped to even buckets. Thus, the hash table is biased.

  11. Mid-Square • Mid-Square function • It is computed by squaring the key and then using an appropriate number of bits from the middle of the square to obtain the bucket address. • Table size is a power of two

  12. Folding • The key kis partitioned into several parts, all but the last being of the same length. • All partitions are added together to obtain the hash address for k. • Shift folding: different partitions are added together to get h(k). • Folding at the boundaries: key is folded at the partition boundaries, and digits falling into the same position are added together to obtain h(k). • This is similar to reversing every other partition and then adding.

  13. Example 8.2 • k=12320324111220 are partitioned into three decimal digits long. • P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20. • Shift folding: • h(k) = 123 + 203 + 241 + 112 + 20 = 699 • Folding at the boundaries: (reverse P2 and P4) • h(k) = 123+302+241+211+20=897

  14. Secure Hash Functions • Can be applied to any sized message M • Produces fixed-length output h • Is easy to compute h=H(M) for any message M • Given h is infeasible to find x s.t. H(x)=h • one-way property • Given x is infeasible to find y s.t. H(y)=H(x) • weak collision resistance • Is infeasible to find any x,ys.t. H(y)=H(x) • strong collision resistance

  15. Secure Hash Algorithm (SHA) • 步驟 1: 對訊息做前處理使得它的長度成為q*512位元,其中q是某個整數。前處理的程序可能得在訊息的尾端加上許多個0的字串。 • 步驟2: 初始化160-位元的輸出緩衝區OB,它包含五個32-位元的暫存器A, B, C, D, E,其中A = 67453401,B = efcdab89,C = 98badcfe,D = 10325476,E = c3d2e1f0。(所有的值為16進位)。 • 步驟3: for (inti = 1; i <= q; i++) { 令Bi = 訊息中第i個512位元區塊; OB = F (OB, Bi); } • 步驟4: 輸出OB。

  16. t=步驟數;0 ≤ t ≤ 79 ft(B, C, D)=步驟 t的基本(位元的)邏輯函數; 例如,(B^C) v (B^D) Sk=把32位元的暫存器左迴旋k個位元 Wt=從Bi推導出來的32位元值 Kt=常數 Atomic SHA Operation

  17. Example 8.4 • A = 67453401,B = efcdab89,C = 98badcfe,D = 10325476,E = c3d2e1f0 • ft(B,C,D)=(BCD), K0=5a82799, W0=0000000 • B=1110 1111 1100 1101 1010 1011 1000 1001 • S30(B)=01111011111100110110101011100010=7bf36ae2=C • A=6e3edeb6 • B=67453401, D=98badcfe, E=10325476

  18. Overflow Handling • There are two ways to handle overflow: • Open addressing • Linear probing • Quadratic probing • Rehasing • Chaining

  19. Open Addressing • Assumes the hash table is an array • The hash table is initialized so that each slot contains the null key. • When a new key is hashed into a full bucket, find the closest unfilled bucket. • linear probing or linear open addressing

  20. Linear Probing(Example 8.6) • Assume 26-bucket table with one slot per bucket and the following keys: GA, D, A, G, L, A2, A1, A3, A4, Z, ZA, E. Let the hash function h(k) = first character of k. • When entering G, G collides with GA and is entered at ht[7] instead of ht[6].

  21. Linear Probing • When linear open address is used to handle overflows, a hash table search for key k proceeds as follows: • compute h(k) • examine key at positions ht[h(k)], ht[(h(k)+1) % b], …, ht[(h(k)+j) % b], in this order until one of the following condition happens: • ht[(h(k)+j) % b]=k; in this case k is found • ht[h(k)+j] is empty; kis not in the table • We return to the starting position ht[h(k)]; the table is full and k is not in the table

  22. Linear Probing • When linear probing is used to resolve overflows, keys tend to cluster together. • This increases search time. • e.g., to find ZA, you need to examine ht[25], ht[0], …, ht[8] (total of 10 comparisons). • The number of comparisons to look up a key is approximated (2-α)/(2-2α), where α is the load density

  23. Quadratic Probing • One of the problems of linear open addressing is that it tends to create clusters of keys. • These clusters tend to merge as more keys are entered, leading to bigger clusters. • It is difficult to find an unused bucket. • A quadratic probing scheme improves the growth of clusters. A quadratic function of i is used as the increment when searching through buckets. • Perform search by examining bucket h(k), (h(k)+i2)%b, (h(k)-i2)%b for 1 ≤ i ≤ (b-1)/2.

  24. Example • (h(k)+i2)%b, (h(k)-i2)%b for 1 ≤ i ≤ (b-1)/2. • Example: b=10, insert: 89, 18, 49, 58, 69 • 89 % 10=9, 18 % 10=8 • 49 % 10=9, (9+12) % 10=0 • 58 % 10=8, (8+12) % 10=9, (8+22) % 10=2 • 69 % 10=9, (9+12) % 10=0, (9+22) % 10=3

  25. Rehashing • Another way to control the growth of clusters is to use a series of hash functions h1, h2, …, hm. This is called rehashing. • Buckets hi(k), 1 ≤ i ≤ m are examined in that order.

  26. Chaining • Linear probing performs poorly because the search for a key involves comparisons with keys that have different hash values. • Unnecessary comparisons can be avoided if all the synonyms are put in the same list, where one list per bucket. • Each chain has a head node. Head nodes are stored sequentially.

  27. Example Average search length is (6*1+3*2+1*3+1*4+1*5)/12 = 2

  28. Dynamic Hashing • Retain the fast retrieval time • Dynamically increasing and decreasing file size without penalty • Dynamic hashing is to minimize access to pages

  29. An Example Hash Function A: 100, B: 101, C: 110

  30. Hashing With Directory • Accessing any page only requires two steps: • First step: use the hash function to find the address of the directory entry. • Second step: retrieve the page associated with the address

  31. Dynamic Hash Tables with Directories • Insert C5 into the hash table: • h(C5,2)=01, d[01]A1, B1overflow • determine the least u such that h(k,u) is not the same for all • keys in the overflow bucket  suppose that u=3 • h(A1,3)=001, h(B1,3)=001, h(C5,3)=101 • Insert C1 into the hash table: • h(C1,2)=01, d[01]A1, B1overflow • determine the least u such that h(k,u) is not the same for all • keys in the overflow bucket  suppose that u=4 • h(A1,4)=0001, h(B1,4)=1001, h(C1,4)=0001

  32. Directoryless Dynamic Hashing • Hashing with directory requires at least one level of indirection • Directorylesshashing assume a continuous address space in the memory to hold all the records. Therefore, the directory is not needed. • Thus, the hash function must produce the actual address of a page containing the key. • Contrasting to the directory scheme in which a single page might be pointed at by several directory entries, in the directorylessscheme one unique page is assigned to every possible address.

  33. Inserting into a Directoryless Dynamic Hash Table 0~2r+q, 0≤q<2r • 0~q-1 and 2r~2r+q-1 are • indexed using h(k,r+1); • else, h(k,r) • Insert C5 (110101) • into the hash table: • h(C5,2)=01, d[01] • A1, B5overflow • activated 2r+q  • reallocating  (b) • 000, 100 indexed • using h(k,3)

More Related