Hashing

Hashing Chapter 8

Symbol Table • Symbol table is used widely in many applications. • dictionary is a kind of symbol table • data dictionary is database management • In general, the following operations are performed on a symbol table • determine if a particular name is in the table • retrieve the attribute of that name • modify the attributes of that name • insert a new name and its attributes • delete a name and its attributes

Symbol Table • Popular operations on a symbol table include search, insertion, and deletion • A binary search tree could be used to represent a symbol table. • The worse-case complexities for the operations are O(n). • Hashing: insertions, deletions & finds in O(1). • Static hashing • Dynamic hashing

Static Hashing • Dictionary pairs are stored in a fixed-size table called hash table. • The address of location of an dictionary pairs, x, is obtained by computing some arithmetic function h(x). • The memory available to maintain the symbol table (hash table) is assumed to be sequential. • The hash table consists of bbuckets and each bucket contains s records. • h(x) maps the set of possible dictionary pairsonto the integers 0 through b-1.

Hash Tables • The key density of a hash table is the ratio n/T, where n is the number of dictionary pairs in the table and T is the total number of possible keys. • The loading density or loading factor of a hash table is α= n/(sb).

Hash Tables • Two keys, I1, and I2, are said to be synonyms with respect to h if h(I1) = h(I2). • An overflow occurs when a new keyiis mapped or hashed by h into a full bucket. • A collision occurs when two non-identical keys are hashed into the same bucket. • If the bucket size is 1, collisions and overflows occur at the same time.

Example 8.1 • b=26, s=2 • Assume there are 10 distinct keysGA, A,G,L,A2,A1,A3,A4 and E • If no overflow occur, the time required for hashing depends only on the time required to compute the hash function h. • A1?

Hash Functions • Requirements • Simple to compute • Minimize the number of collisions • Dependent upon all the characters in the keys • Uniform hash function • If there are b buckets, we hope to have h(x)=iwith the probability being (1/b) • Mid-square, division, folding, digit analysis

Division • Using the modulo (%) operator. • A key kis divided by some number D and the remainder is used as the hash address for k. • The bucket addresses are in the range of 0 through D-1. • If Dis a power of 2, then h(k) depends only on the least significant bits of k.

Division • If a division function h is used as the hash function, the table size should not be a power of two. • Since programmers have a tendency to use many variables with the same suffix  too many collisions • If Dis divisible by two, the odd keys are mapped to odd buckets and even keys are mapped to even buckets. Thus, the hash table is biased.

Mid-Square • Mid-Square function • It is computed by squaring the key and then using an appropriate number of bits from the middle of the square to obtain the bucket address. • Table size is a power of two

Folding • The key kis partitioned into several parts, all but the last being of the same length. • All partitions are added together to obtain the hash address for k. • Shift folding: different partitions are added together to get h(k). • Folding at the boundaries: key is folded at the partition boundaries, and digits falling into the same position are added together to obtain h(k). • This is similar to reversing every other partition and then adding.

Example 8.2 • k=12320324111220 are partitioned into three decimal digits long. • P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20. • Shift folding: • h(k) = 123 + 203 + 241 + 112 + 20 = 699 • Folding at the boundaries: (reverse P2 and P4) • h(k) = 123+302+241+211+20=897

Secure Hash Functions • Can be applied to any sized message M • Produces fixed-length output h • Is easy to compute h=H(M) for any message M • Given h is infeasible to find x s.t. H(x)=h • one-way property • Given x is infeasible to find y s.t. H(y)=H(x) • weak collision resistance • Is infeasible to find any x,ys.t. H(y)=H(x) • strong collision resistance

Secure Hash Algorithm (SHA) • 步驟 1: 對訊息做前處理使得它的長度成為q*512位元，其中q是某個整數。前處理的程序可能得在訊息的尾端加上許多個0的字串。 • 步驟2: 初始化160-位元的輸出緩衝區OB，它包含五個32-位元的暫存器A, B, C, D, E，其中A = 67453401，B = efcdab89，C = 98badcfe，D = 10325476，E = c3d2e1f0。（所有的值為16進位）。 • 步驟3: for (inti = 1; i <= q; i++) { 令Bi = 訊息中第i個512位元區塊; OB = F (OB, Bi); } • 步驟4: 輸出OB。

t=步驟數；0 ≤ t ≤ 79 ft(B, C, D)=步驟 t的基本（位元的）邏輯函數；例如，(B^C) v (B^D) Sk=把32位元的暫存器左迴旋k個位元 Wt=從Bi推導出來的32位元值 Kt=常數 Atomic SHA Operation

Example 8.4 • A = 67453401，B = efcdab89，C = 98badcfe，D = 10325476，E = c3d2e1f0 • ft(B,C,D)=(BCD), K0=5a82799, W0=0000000 • B=1110 1111 1100 1101 1010 1011 1000 1001 • S30(B)=01111011111100110110101011100010=7bf36ae2=C • A=6e3edeb6 • B=67453401, D=98badcfe, E=10325476

Overflow Handling • There are two ways to handle overflow: • Open addressing • Linear probing • Quadratic probing • Rehasing • Chaining

Open Addressing • Assumes the hash table is an array • The hash table is initialized so that each slot contains the null key. • When a new key is hashed into a full bucket, find the closest unfilled bucket. • linear probing or linear open addressing

Linear Probing(Example 8.6) • Assume 26-bucket table with one slot per bucket and the following keys: GA, D, A, G, L, A2, A1, A3, A4, Z, ZA, E. Let the hash function h(k) = first character of k. • When entering G, G collides with GA and is entered at ht[7] instead of ht[6].

Linear Probing • When linear open address is used to handle overflows, a hash table search for key k proceeds as follows: • compute h(k) • examine key at positions ht[h(k)], ht[(h(k)+1) % b], …, ht[(h(k)+j) % b], in this order until one of the following condition happens: • ht[(h(k)+j) % b]=k; in this case k is found • ht[h(k)+j] is empty; kis not in the table • We return to the starting position ht[h(k)]; the table is full and k is not in the table

Linear Probing • When linear probing is used to resolve overflows, keys tend to cluster together. • This increases search time. • e.g., to find ZA, you need to examine ht[25], ht[0], …, ht[8] (total of 10 comparisons). • The number of comparisons to look up a key is approximated (2-α)/(2-2α), where α is the load density

Quadratic Probing • One of the problems of linear open addressing is that it tends to create clusters of keys. • These clusters tend to merge as more keys are entered, leading to bigger clusters. • It is difficult to find an unused bucket. • A quadratic probing scheme improves the growth of clusters. A quadratic function of i is used as the increment when searching through buckets. • Perform search by examining bucket h(k), (h(k)+i2)%b, (h(k)-i2)%b for 1 ≤ i ≤ (b-1)/2.

Example • (h(k)+i2)%b, (h(k)-i2)%b for 1 ≤ i ≤ (b-1)/2. • Example: b=10, insert: 89, 18, 49, 58, 69 • 89 % 10=9, 18 % 10=8 • 49 % 10=9, (9+12) % 10=0 • 58 % 10=8, (8+12) % 10=9, (8+22) % 10=2 • 69 % 10=9, (9+12) % 10=0, (9+22) % 10=3

Rehashing • Another way to control the growth of clusters is to use a series of hash functions h1, h2, …, hm. This is called rehashing. • Buckets hi(k), 1 ≤ i ≤ m are examined in that order.

Chaining • Linear probing performs poorly because the search for a key involves comparisons with keys that have different hash values. • Unnecessary comparisons can be avoided if all the synonyms are put in the same list, where one list per bucket. • Each chain has a head node. Head nodes are stored sequentially.

Example Average search length is (6*1+3*2+1*3+1*4+1*5)/12 = 2

Dynamic Hashing • Retain the fast retrieval time • Dynamically increasing and decreasing file size without penalty • Dynamic hashing is to minimize access to pages

An Example Hash Function A: 100, B: 101, C: 110

Hashing With Directory • Accessing any page only requires two steps: • First step: use the hash function to find the address of the directory entry. • Second step: retrieve the page associated with the address

Dynamic Hash Tables with Directories • Insert C5 into the hash table: • h(C5,2)=01, d[01]A1, B1overflow • determine the least u such that h(k,u) is not the same for all • keys in the overflow bucket  suppose that u=3 • h(A1,3)=001, h(B1,3)=001, h(C5,3)=101 • Insert C1 into the hash table: • h(C1,2)=01, d[01]A1, B1overflow • determine the least u such that h(k,u) is not the same for all • keys in the overflow bucket  suppose that u=4 • h(A1,4)=0001, h(B1,4)=1001, h(C1,4)=0001

Directoryless Dynamic Hashing • Hashing with directory requires at least one level of indirection • Directorylesshashing assume a continuous address space in the memory to hold all the records. Therefore, the directory is not needed. • Thus, the hash function must produce the actual address of a page containing the key. • Contrasting to the directory scheme in which a single page might be pointed at by several directory entries, in the directorylessscheme one unique page is assigned to every possible address.

Inserting into a Directoryless Dynamic Hash Table 0~2r+q, 0≤q<2r • 0~q-1 and 2r~2r+q-1 are • indexed using h(k,r+1); • else, h(k,r) • Insert C5 (110101) • into the hash table: • h(C5,2)=01, d[01] • A1, B5overflow • activated 2r+q  • reallocating  (b) • 000, 100 indexed • using h(k,3)

Hashing

Hashing

Presentation Transcript

Hashing

Hashing

Hashing

Hashing

Hashing

Hashing

Hashing

HASHING

Hashing

Hashing

Hashing

Hashing

Hashing

HASHING

Hashing

Hashing

Hashing, Hashing Tables

Hashing

Hashing

Hashing