Dictionary search
Exact string search
Paper on Cuckoo Hashing
Exact String Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them.
Hashing
Key issue: a good hash function
Basic assumption: uniform hashing
• Avg #keys per slot = n * (1/m) = n/m = α (load factor)
Search cost
Θ(1 + α) on average under uniform hashing, hence O(1) when m = Θ(n)
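To make the load-factor argument concrete, here is a minimal sketch of hashing with chaining. The class name `ChainedHashTable` and its methods are illustrative, and Python's built-in `hash` merely stands in for a "good" hash function; none of this comes from the slides.

```python
class ChainedHashTable:
    """Hash table with chaining: each of the m slots holds a list of keys."""

    def __init__(self, m):
        self.m = m                            # number of slots
        self.n = 0                            # number of stored keys
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        # Assumption: the built-in hash behaves like a uniform hash function.
        return hash(key) % self.m

    def insert(self, key):
        chain = self.slots[self._slot(key)]
        if key not in chain:                  # keep keys distinct
            chain.append(key)
            self.n += 1

    def search(self, key):
        # Scans one chain only: about 1 + n/m steps on average.
        return key in self.slots[self._slot(key)]

    def load_factor(self):
        return self.n / self.m                # alpha = n/m
```

With m proportional to n the chains stay short on average, which is where the constant expected search cost comes from.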
In practice
A trivial hash function is: h(k) = k mod m, with m prime
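A "provably good" hash is
$h_a(k) = \left( \sum_{i=0}^{r} a_i k_i \right) \bmod m$
• L = max string length (in bits), m = table size (prime)
• The key k is split into chunks k_0, k_1, ..., k_r of ≈ log2 m bits each, so r ≈ L / log2 m
• Each a_i is selected at random in [0, m)
• m not necessarily prime: use (... mod p) mod m, with p prime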
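The construction above can be sketched as follows, assuming byte-string keys and a prime table size m. The function name `make_universal_hash` and the choice of floor(log2 m) bits per chunk are assumptions of this sketch, not details taken from the slides.

```python
import random

def make_universal_hash(m, max_len_bits, rng=random):
    """Return h(k) = (sum of a_i * k_i) mod m, for a prime table size m."""
    chunk_bits = max(1, m.bit_length() - 1)    # floor(log2 m): chunk values stay below m
    r = -(-max_len_bits // chunk_bits)         # ceiling division: number of chunks
    a = [rng.randrange(m) for _ in range(r)]   # each a_i drawn at random from [0, m)
    mask = (1 << chunk_bits) - 1

    def h(key: bytes) -> int:
        if len(key) * 8 > max_len_bits:
            raise ValueError("key longer than max_len_bits")
        # Simplification: leading zero bytes are not distinguished.
        k = int.from_bytes(key, "big")
        total = 0
        for a_i in a:
            total += a_i * (k & mask)          # a_i * k_i for the current chunk
            k >>= chunk_bits
        return total % m

    return h
```

For example, `h = make_universal_hash(101, 64)` gives a function mapping keys of up to 8 bytes into slots 0..100; drawing a fresh `h` corresponds to picking new random coefficients.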
Cuckoo Hashing
2 hash tables, and 2 random choices where an item can be stored
[Figure: keys A, B, C, D, E stored in the two tables]
A running example
[Figure sequence: keys F and then G are inserted into the tables holding A, B, C, D, E; the last frame shows the keys rearranged by evictions]
Cuckoo Hashing Examples
Random (bipartite) graph: node = cell, edge = key
[Figure: graph over the keys A to G]
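The eviction process of the running example can be sketched as below. This is a minimal illustration, not the paper's exact procedure: the class name `CuckooHash`, the salted use of Python's built-in `hash` as the two hash functions, and the `max_kicks` bound are all assumptions.

```python
import random

class CuckooHash:
    """Two tables; each key may live in exactly one cell per table."""

    def __init__(self, size, max_kicks=32, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]
        rng = random.Random(seed)
        # Assumption: two salts turn the built-in hash into two hash functions.
        self.salts = [rng.getrandbits(32), rng.getrandbits(32)]

    def _pos(self, which, key):
        return hash((self.salts[which], key)) % self.size

    def search(self, key):
        # At most two probes, one per table.
        return any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.search(key):
            return True
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[which][pos] = self.tables[which][pos], key
            if key is None:
                return True            # the cell was free: done
            which = 1 - which          # the evicted key tries its other table
        # Too many evictions: the key in hand is dropped here;
        # a real implementation would rehash everything instead.
        return False
```

Search touches at most two cells, while an insertion may trigger a chain of evictions that follows a path in the cell/key graph described above.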
Natural Extensions
• More than 2 hashes (choices) per key
  • Very different: hypergraphs instead of graphs
  • Higher memory utilization
    • 3 choices: 90+% in experiments
    • 4 choices: about 97%
  • ...but more insert time (and random access)
• 2 hashes + bins of size B
  • Balanced allocation and tightly O(1)-size bins
  • Insertion sees a tree of possible evict+ins paths
  • More memory ...but more local
Dictionary search
Making one-sided errors
Paper on Bloom Filter
Crawling
How to keep track of the URLs visited by a crawler?
• URLs are long
• Check should be very fast
• No care about small errors (≈ page not crawled)
Bloom Filter over crawled URLs
Opt k = 5.545... for m/n = 8
We do have an explicit formula for the optimal k: k = (m/n) ln 2
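A small sketch of a Bloom filter over URLs, together with the optimal-k formula. The names `optimal_k` and `BloomFilter` are illustrative, and deriving the k positions by double hashing from a single SHA-256 digest is a convenience of this sketch, not something the slides prescribe.

```python
import hashlib
import math

def optimal_k(m, n):
    """Number of hash functions minimizing the false-positive rate: (m/n) ln 2."""
    return (m / n) * math.log(2)

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)       # one byte per bit, for clarity only

    def _positions(self, item: str):
        # Assumption: double hashing, position_i = (h1 + i * h2) mod m.
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # May answer True for an item never added (one-sided error),
        # but never answers False for an item that was added.
        return all(self.bits[p] for p in self._positions(item))
```

With m/n = 8 bits per URL, `optimal_k(8, 1)` is about 5.545, so in practice one would round to 5 or 6 hash functions.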
Dictionary search
Prefix-string search
Reading 3.1 and 5.2
Prefix-string Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.
Trie: speeding-up searches
[Figure: compacted trie over the dictionary strings, with edge labels such as "s", "y", "z", "stile", "zyg", "etic", "ial", "ygy", "aibelyite", "czecin", "omo"]
Pro: O(p) search time
Cons: edge + node labels and tree structure
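The O(p) descent can be sketched with a plain (uncompacted) trie, one character per edge; the slide's figure shows a compacted trie, so this is a simplification. The names `TrieNode`, `trie_insert` and `prefix_search` are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.is_end = False    # True if a dictionary string ends here

def trie_insert(root, s):
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True

def prefix_search(root, p):
    """Return all dictionary strings having p as a prefix, in sorted order."""
    node = root
    for ch in p:               # O(p) descent along the pattern
        node = node.children.get(ch)
        if node is None:
            return []
    out, stack = [], [(node, p)]
    while stack:               # collect every string in the subtree
        node, s = stack.pop()
        if node.is_end:
            out.append(s)
        for ch, child in node.children.items():
            stack.append((child, s + ch))
    return sorted(out)
```

The descent costs O(p) regardless of dictionary size; reporting the matches adds time proportional to the size of the subtree below the reached node.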
Front-coding: squeezing strings
Dictionary strings: … systile syzygetic syzygial syzygy …
URL list:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
Front-coded list (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
24 /Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
Gzip may be much better...
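Front-coding as illustrated on the URL list can be sketched in a few lines: each string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names `front_encode` and `front_decode` are illustrative.

```python
def front_encode(strings):
    """Encode each string as (shared-prefix length with the previous one, suffix)."""
    out, prev = [], ""
    for s in strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(pairs):
    """Rebuild the strings by copying lcp characters from the previous string."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

Decoding is inherently sequential, since each string depends on the one before it; this is why front-coding is usually applied per bucket, restarting with a full string (prefix length 0) at each bucket start.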
2-level indexing
• Internal memory: compacted trie (CT) on a sample of the strings (systile, szaibelyite)
• Disk: front-coded buckets … 70systile 92zygetic 85ial 65y 110szaibelyite 82czecin 92omo …
• 2 advantages:
  • Search ≈ typically 1 I/O
  • Space ≈ front-coding over buckets
• A disadvantage:
  • Trade-off ≈ speed vs space (because of bucket size)
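A rough sketch of the 2-level scheme: the first string of every bucket forms the in-memory sample, and a lookup reads only the buckets that can contain matches. For brevity the sample is searched with binary search instead of a compacted trie, and buckets are kept as plain lists rather than front-coded disk pages; both are simplifications, and the names `build_index` and `prefix_lookup` are illustrative.

```python
import bisect

def build_index(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets and sample each bucket's first string."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [b[0] for b in buckets]   # kept in internal memory
    return sample, buckets

def prefix_lookup(sample, buckets, p):
    """Return the strings prefixed by p, scanning as few buckets as possible."""
    # Last bucket whose first string is <= p: matches cannot start earlier.
    first = max(0, bisect.bisect_right(sample, p) - 1)
    out = []
    for b in buckets[first:]:          # each bucket read models one I/O
        if b[0] > p and not b[0].startswith(p):
            break                      # everything from here on is past the matches
        out.extend(s for s in b if s.startswith(p))
    return out
```

Larger buckets shrink the in-memory sample and improve front-coding compression, but make each bucket scan slower, which is the speed vs space trade-off noted above.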