Searching • Given distinct keys k_1, k_2, …, k_n and a collection T of n records of the form (k_1, I_1), (k_2, I_2), …, (k_n, I_n). • Search Problem - For key value K, locate the record (k_j, I_j) in T such that k_j = K. • Searching is a systematic method for locating the record(s) with key value k_j = K. • A successful search is one in which a record with key k_j = K is found. • An unsuccessful search is one in which no record with k_j = K is found (because none exists).
Searching Ordered Arrays • Binary Search - covered earlier. • Dictionary Search (interpolation search) • Estimate how far from an endpoint the value is likely to be. • Pos = lo + (value - A[lo]) / (A[hi] - A[lo]) * (hi - lo) • Probe at Pos rather than at the midpoint. • Assumes the keys are evenly distributed.
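The interpolation step above can be sketched in Python. This is a minimal sketch, assuming a sorted list of integer keys; the function name `interpolation_search` and the convention of returning -1 for an unsuccessful search are illustrative choices, not something the slides specify.

```python
def interpolation_search(a, value):
    """Return an index of value in the sorted integer list a, or -1 if absent."""
    lo, hi = 0, len(a) - 1
    # The value can only be present while it lies between the endpoint keys.
    while lo <= hi and a[lo] <= value <= a[hi]:
        if a[hi] == a[lo]:
            # All remaining keys are equal, so the value must equal them.
            # Returning here also avoids dividing by zero below.
            return lo
        # Estimate the position by interpolating between the endpoints.
        pos = lo + (value - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[pos] == value:
            return pos
        if a[pos] < value:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On evenly spaced keys such as [10, 20, 30, 40, 50], the first estimate for 40 lands directly on index 3. On skewed data the estimates can be poor, which is why the even-distribution assumption matters.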
Lists Ordered by Frequency • Order the list by (expected) frequency of access. • Perform a sequential search. • Cost to find the first record: 1 • Cost to find the second record: 2 • Expected search cost = 1·p_1 + 2·p_2 + 3·p_3 + … + n·p_n, where p_i is the probability of accessing record i • Worst case, when all records are equally likely: (n+1)/2 • Best when a few items are accessed many times
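The cost formula can be checked numerically. The sketch below assumes the access probabilities are given as a Python list in list order; `expected_search_cost` is a hypothetical helper name.

```python
def expected_search_cost(probs):
    """Expected number of comparisons: 1*p_1 + 2*p_2 + ... + n*p_n."""
    return sum(position * p for position, p in enumerate(probs, start=1))


# Equally likely records give (n + 1) / 2; a skewed distribution costs less.
uniform_cost = expected_search_cost([0.25, 0.25, 0.25, 0.25])   # 2.5
skewed_cost = expected_search_cost([0.5, 0.25, 0.125, 0.125])   # 1.875
```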
Self Organizing Lists • 80/20 rule: 80% of the accesses are to 20% of the records • Expected search cost under the 80/20 rule = 0.122n • Self organizing lists modify the order of records within the list based on the actual pattern of record accesses. • Self organizing lists use a rule, called a heuristic, for deciding how to reorder the list.
Self Organizing Heuristics • Order by actual frequency - most frequently used first • When a record is found, swap it with the first item • When a record is found, move it to the front of the list • When a record is found, swap it with the record ahead of it
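Two of these heuristics, commonly called move-to-front and transpose, can be sketched as follows. This is an illustrative sketch on a plain Python list; the function names are assumptions, and a linked list would make the move-to-front step cheaper than it is here.

```python
def search_move_to_front(items, key):
    """Sequential search; on a hit, move the record to the front of the list."""
    for i, item in enumerate(items):
        if item == key:
            items.insert(0, items.pop(i))
            return True
    return False


def search_transpose(items, key):
    """Sequential search; on a hit, swap the record with the one ahead of it."""
    for i, item in enumerate(items):
        if item == key:
            if i > 0:
                items[i - 1], items[i] = items[i], items[i - 1]
            return True
    return False
```

Move-to-front reacts quickly to a change in the access pattern, while transpose moves a record forward only one position per access.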
Hashing • Hashing is the process of mapping a key value to a position in a table. • A hash function maps key values to positions. • A hash table is an array that holds the records. • The hash table T has M slots, numbered 0 to M-1. • For any value K in the key range and some hash function h, h(K) = i where 0 ≤ i < M and key(T[i]) = K.
Hashing Situations • Hashing is appropriate for unique keys. • Good for both in-memory and disk-based applications. • Answers the question “What record, if any, has key value K?” • Example: Store n records with keys in the range 0 to n-1. • Store the record with key i in slot i. • Uses the hash function h(k) = k (the identity function).
Collisions • A more reasonable example • Store about 1000 records with keys in the range 0-16,383. • Impractical to keep a table of size 16,384. • We need a hash function to map keys to a smaller range. • Given a hash function h and different keys k1 and k2, let β be a position in the hash table. • If h(k1) = h(k2) = β, then k1 and k2 have a collision at β under h.
Collision Resolution • To search for the record with key K: • Compute the table location h(K). • Starting with slot h(K), locate the record containing key K using (if necessary) a collision resolution policy. • Collisions are inevitable in most applications. • Example: In a group of 23 people, the odds are better than even that at least one pair shares a birthday.
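The birthday example can be verified directly. The sketch below assumes 365 equally likely birthdays; the function name is illustrative. The same calculation gives the chance of at least one collision when that many keys are hashed uniformly into a table with `days` slots.

```python
def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share one of `days` birthdays."""
    p_all_distinct = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct
```

With 23 people the result is just over one half, and with 22 it is just under.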
Hash Functions • Must return a value within the table range. • Should evenly distribute the records to be stored among the table slots. • Ideally, the function should distribute records with equal probability to all the positions. In practice, how well it does so depends on the data. • If we know nothing about the key distribution, evenly distribute the key range among the positions. • If we know about the key distribution, use a distribution-dependent hash function.
Example Hash Functions • h(key) = key % 16 - uses only the last 4 bits. • h(key) = key % 1000 - uses only the last 3 digits. • Use % tablesize to make sure the result is in the range. • Mid-square method: square the key and take the middle r bits for a table of size 2^r. • Sum up the ASCII character codes and take the result modulo tablesize (a folding technique).
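These hash functions can be sketched in a few lines of Python. The function names and the `key_bits` parameter (the assumed width of a key, needed to decide which bits of the square count as the middle) are illustrative assumptions.

```python
def mod_hash(key, table_size):
    """Keep only the low-order part of the key."""
    return key % table_size


def mid_square_hash(key, r, key_bits=16):
    """Square the key and return the middle r bits (table size 2**r).

    Assumes keys fit in key_bits bits, so the square fits in 2 * key_bits bits.
    """
    squared = key * key
    shift = (2 * key_bits - r) // 2        # bits to drop from the low end
    return (squared >> shift) & ((1 << r) - 1)


def ascii_sum_hash(text, table_size):
    """Fold a string by summing its character codes."""
    return sum(ord(ch) for ch in text) % table_size
```

Note that the character sum ignores order, so anagrams such as "ab" and "ba" always collide.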
Collision Handling Categories • Open hashing - when there is a collision, put collided item outside the table. • Closed hashing - when there is a collision, put collided item inside the table.
Open Hashing • Treat each table element as the head of a linked list of the items that hash to that position. • The linked lists can be organized in many ways. • Ordered by key: an unsuccessful search can stop early. • Ordered by frequency: a good technique if a few items are searched for frequently. • If there are N records to be stored and the table is of size M, then the average search length is O(N/M). • Good for internal memory. On disk, linked nodes may be in different blocks and cause many disk accesses.
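A minimal sketch of open hashing follows. It assumes integer keys and the hash function key % M, and it uses a Python list per slot in place of a linked list; the class and method names are hypothetical.

```python
class ChainedHashTable:
    """Open hashing: each slot holds a chain of (key, value) records."""

    def __init__(self, size):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def _home(self, key):
        return key % self.size

    def insert(self, key, value):
        chain = self.slots[self._home(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)    # keys are unique: overwrite
                return
        chain.append((key, value))

    def search(self, key):
        """Return the value stored with key, or None if there is no such record."""
        for existing_key, value in self.slots[self._home(key)]:
            if existing_key == key:
                return value
        return None
```

Keys 7, 17 and 27 all land in slot 7 of a 10-slot table, so a search for any of them scans only that one chain.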
Closed Hashing - Linear Probe • If the item you are looking for is not in its hash position, look in the next position. • Do the same for insert until you find an empty location. • When you reach the end of the table, wrap around to the beginning. • Must have at least one empty slot or an unsuccessful search will loop forever. • Tends to cluster, since the probe sequence depends only on the collision position and not on the key (e.g. after a collision at position 4, every key tries position 5, then 6).
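Linear probing can be sketched as below, assuming integer keys, the hash function key % M, and `None` as the empty marker. Bounding the loop at M probes is a design choice made here to sidestep the infinite-loop problem on a full table; the function names are illustrative.

```python
EMPTY = None


def linear_probe_insert(table, key):
    """Store key in the first free slot at or after its home position."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m              # wrap around at the end of the table
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def linear_probe_search(table, key):
    """Return the slot holding key, or -1 if the search is unsuccessful."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is EMPTY:
            return -1                      # an empty slot ends the probe sequence
        if table[slot] == key:
            return slot
    return -1
```

In a 7-slot table, the keys 4, 11, 18 and 25 all have home position 4, so they end up in slots 4, 5, 6 and 0, which is the clustering described above.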
Better Linear Probe • Instead of going to the next slot, skip ahead by some constant c. • The table size M and c should be relatively prime. • This ensures the probe sequence visits every slot in the table. • Still has some clustering.
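The relatively-prime condition can be checked with a short experiment. The helper names below are hypothetical; `math.gcd` is from the Python standard library.

```python
from math import gcd


def probe_slots(home, c, m):
    """Slots visited by M probes when stepping by the constant c."""
    return [(home + i * c) % m for i in range(m)]


def visits_every_slot(c, m):
    """Stepping by c reaches all M slots exactly when gcd(c, M) == 1."""
    return gcd(c, m) == 1
```

With M = 10, a step of 3 reaches all ten slots, while a step of 2 only ever reaches the five even slots from an even home position.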
Quadratic Probe • Instead of adding i to the home position on the i-th probe, add i^2. • i is the probe number, so the offsets are 1, 4, 9, 16, ... • The result is still taken mod the table size.
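A small sketch of the quadratic probe sequence, with a hypothetical helper name, makes the offsets concrete:

```python
def quadratic_probe_slots(home, m, probes):
    """Slots visited by the first `probes` probes: (home + i*i) mod M."""
    return [(home + i * i) % m for i in range(probes)]
```

For home position 3 in a table of size 11, the first five slots are 3, 4, 7, 1 and 8. Unlike a linear probe with a relatively prime step, this sequence does not reach every slot: with M = 11 it only ever visits 6 distinct positions.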
Double Hashing • After a collision, compute the probe step with a second hash function. • Reduces clustering, because keys with the same home position follow different probe sequences. • For example, if h(k) causes a collision then use • p(k,i) = i*h2(k) • h2 is a different hash function • The i-th slot probed is (h(k) + p(k,i)) mod M
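Double hashing can be sketched as follows. The particular choices h1(k) = k % M and h2(k) = 1 + (k % (M - 1)), together with a prime table size so that every step is relatively prime to M, are assumptions made for illustration; the slides do not fix a specific h2.

```python
def h1(key, m):
    """Home position."""
    return key % m


def h2(key, m):
    """Probe step in the range 1 .. m-1 (never 0, so the probe always moves)."""
    return 1 + (key % (m - 1))


def double_hash_insert(table, key):
    """Insert key, probing slot (h1 + i * h2) mod M on the i-th probe."""
    m = len(table)
    home, step = h1(key, m), h2(key, m)
    for i in range(m):
        slot = (home + i * step) % m
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found")
```

In a 7-slot table, keys 10, 17 and 24 share home position 3 but have steps 5, 6 and 1, so they end up in slots 3, 2 and 4 rather than in one contiguous run.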
Analysis of Closed Hashing • Load factor lf = N/M • N is the number of records • M is the size of the table • N/M is the fraction of the table that is full • The larger the load factor, the greater the probability of a collision • Average search length is O(1/(1-lf))
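To see how quickly the cost grows, the bound can be evaluated at a few load factors. Taking the hidden constant to be exactly 1 is an assumption for illustration (it corresponds to an idealized probe sequence that picks slots at random); the helper names are hypothetical.

```python
def load_factor(n, m):
    """Fraction of the table that is full."""
    return n / m


def estimated_probes(lf):
    """1 / (1 - lf): estimated probes for a table with load factor lf < 1."""
    return 1.0 / (1.0 - lf)


# A half-full table needs about 2 probes; a 90%-full table needs about 10.
half_full = estimated_probes(load_factor(500, 1000))
nearly_full = estimated_probes(load_factor(900, 1000))
```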
Deletions • Deleting a value may stop a later search prematurely (break the probe chain). • Use a special mark to indicate that something was deleted. When searching, continue past this mark rather than stopping as if the slot were empty. • Once there are many deleted marks, we may wish to rehash everything remaining. • Best if we rehash the most frequently accessed items first.
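The special mark (often called a tombstone) can be sketched on top of linear probing. The sentinel `DELETED`, the function names, and the hash function key % M are assumptions for illustration.

```python
EMPTY = None
DELETED = object()   # special mark: a record used to be here


def probe_search(table, key):
    """Linear-probe search that continues past deleted marks."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is EMPTY:
            return -1                      # a truly empty slot ends the search
        if table[slot] is not DELETED and table[slot] == key:
            return slot
    return -1


def probe_delete(table, key):
    """Replace the record with the deleted mark so later searches keep going."""
    slot = probe_search(table, key)
    if slot == -1:
        return False
    table[slot] = DELETED
    return True
```

If keys 4, 11 and 18 occupy slots 4, 5 and 6 of a 7-slot table, deleting 11 leaves a mark in slot 5, and a search for 18 still probes past it to slot 6.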