
Dictionary search



Presentation Transcript


  1. Dictionary search: exact string search. Paper on Cuckoo Hashing.

  2. Exact String Search. Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them. Approach: hashing.

  3. Hashing with chaining

  4. Key issue: a good hash function. Basic assumption: uniform hashing. • Avg #keys per slot = n * (1/m) = n/m = α (load factor)

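  5. Search cost: with chaining under uniform hashing, an expected O(1 + α) per search, where α = n/m is the load factor. Taking m = Θ(n) keeps α constant, so the expected cost is O(1).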
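The chaining scheme and its load factor can be sketched as follows. This is a minimal illustration, not the slides' own code: the class name `ChainedHashTable` and its methods are hypothetical, and Python's built-in `hash` stands in for the "good" hash function that the slides assume.

```python
class ChainedHashTable:
    """m slots, each holding a chain (a Python list) of the keys hashed to it."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]
        self.n = 0

    def _slot(self, key):
        # Assumption: the built-in hash behaves like a uniform hash function.
        return hash(key) % self.m

    def insert(self, key):
        chain = self.slots[self._slot(key)]
        if key not in chain:
            chain.append(key)
            self.n += 1

    def search(self, key):
        # Cost is proportional to the chain length, n/m on average.
        return key in self.slots[self._slot(key)]

    def load_factor(self):
        return self.n / self.m


table = ChainedHashTable(m=8)
for word in ["systile", "syzygetic", "syzygial", "syzygy"]:
    table.insert(word)
```

With 4 keys in 8 slots the load factor is 0.5, so chains stay short on average as long as m grows proportionally to n.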

  6. In practice: a trivial hash function is [formula not captured in the transcript; one of its quantities is labelled "prime"].

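  7. A “provably good” hash is $h_a(k) = \left(\sum_{i=0}^{r} a_i k_i\right) \bmod m$, where:
     • L = max string length, m = table size (prime)
     • the key k is split into chunks $k_0, k_1, \ldots, k_r$ of ≈ log2 m bits each, so r ≈ L / log2 m
     • each $a_i$ is selected at random in [0, m), giving $a = (a_0, a_1, \ldots, a_r)$
     • m is not necessarily prime: one can use (... mod p) mod m, with p prime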
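A minimal sketch of such a hash, assuming the standard form h_a(k) = (sum of a_i * k_i) mod m with m prime. The function name `make_universal_hash`, the UTF-8 encoding of keys, and the `seed` parameter are illustrative assumptions, not part of the slides.

```python
import random


def make_universal_hash(m, max_len, seed=None):
    """Return h(key) = (sum_i a_i * k_i) mod m, for a prime table size m.

    max_len is the maximum key length in bytes. Keys are assumed to contain
    no leading NUL bytes, so distinct keys map to distinct chunk sequences.
    """
    rng = random.Random(seed)
    # Chunks of floor(log2 m) bits, so that every chunk value k_i is < m.
    chunk_bits = max(1, m.bit_length() - 1)
    num_chunks = (8 * max_len + chunk_bits - 1) // chunk_bits
    # One random coefficient a_i in [0, m) per chunk.
    a = [rng.randrange(m) for _ in range(num_chunks)]
    mask = (1 << chunk_bits) - 1

    def h(key):
        data = key.encode("utf-8")
        if len(data) > max_len:
            raise ValueError("key longer than max_len bytes")
        value = int.from_bytes(data, "big")
        total = 0
        for a_i in a:
            total += a_i * (value & mask)  # a_i * k_i
            value >>= chunk_bits           # move to the next chunk
        return total % m

    return h


h = make_universal_hash(m=101, max_len=16, seed=42)
slot = h("systile")
```

Drawing a fresh vector a (here, a different `seed`) yields a different function from the same family, which is what the random choice of the a_i amounts to.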

  8. Cuckoo Hashing: 2 hash tables, and 2 random choices where an item can be stored. [Figure: keys A, B, C, D, E stored in the two tables]

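  9. A running example. [Figure: keys A, B, C, D, E, F in the tables]

  10. A running example (continued). [Figure: keys A, B, C, D, E, F in the tables]

  11. A running example (continued). [Figure: keys A, B, C, D, E, F, with key G appearing for the first time]

  12. A running example (continued). [Figure: keys rearranged, read off the slide as E, G, B, C, F, A, D]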

  13. Cuckoo Hashing Examples. [Figure: keys A, B, C, D, E, F, G] Random (bipartite) graph: node = cell, edge = key.
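The insertion-with-eviction process of the running example can be sketched as below. This is a minimal illustration under stated assumptions: the class `CuckooHashTable`, the salted use of Python's built-in `hash`, the `max_kicks` bound, and the choice to grow the tables on rehash are all hypothetical details that the slides do not specify. `None` marks an empty cell, so it cannot be stored as a key.

```python
import random


class CuckooHashTable:
    """Two tables; each key has exactly one candidate cell in each table."""

    def __init__(self, m=11, max_kicks=32, seed=0):
        self.m = m
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * m, [None] * m]
        self._new_salts()

    def _new_salts(self):
        # Assumption: salting the built-in hash gives two independent functions.
        self.salts = [self.rng.getrandbits(32), self.rng.getrandbits(32)]

    def _pos(self, which, key):
        return hash((self.salts[which], key)) % self.m

    def search(self, key):
        # At most two cells are probed.
        return any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.search(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[which][pos] = self.tables[which][pos], key
            if key is None:
                return
            which = 1 - which  # the evicted key moves to its other table
        self._rehash(key)      # too many evictions: probably a cycle

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.m = 2 * self.m + 1
        self.tables = [[None] * self.m, [None] * self.m]
        self._new_salts()
        for k in keys:
            self.insert(k)


table = CuckooHashTable()
for key in "ABCDEFG":
    table.insert(key)
```

A search never looks at more than two cells, while an insertion may trigger a chain of evictions, which corresponds to following a path in the graph whose nodes are cells and whose edges are keys.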

  14. Natural Extensions
      • More than 2 hashes (choices) per key.
        • Very different: hypergraphs instead of graphs.
        • Higher memory utilization
          • 3 choices: 90+% in experiments
          • 4 choices: about 97%
      • 2 hashes + bins of size B.
        • Balanced allocation and tightly O(1)-size bins
        • Insertion sees a tree of possible evict+ins paths
      Slide annotations: but more insert time (and random access); more memory ...but more local

  15. Dictionary search: making one-sided errors. Paper on Bloom Filter.

  16. Crawling: how to keep track of the URLs visited by a crawler?
      • URLs are long
      • Check should be very fast
      • No care about small errors (≈ page not crawled)
      Solution: Bloom Filter over crawled URLs

  17. Searching with errors...

  18. Problem: false positives

  19. [Slide content was an image; only the fragment "2 TTT" survives]

  20. Not perfectly true but...

  21. With m/n = 8, Opt k = 8 ln 2 ≈ 5.545. We do have an explicit formula for the optimal k: k = (m/n) ln 2.
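A minimal Bloom filter sketch for the crawling scenario, together with the standard approximation of the false-positive rate, (1 - e^(-kn/m))^k, and the optimal k = (m/n) ln 2. The class name, the use of SHA-256 with an index prefix to simulate k hash functions, and the example URLs are assumptions made for illustration; the slides do not prescribe them.

```python
import hashlib
import math


class BloomFilter:
    """m bits and k hash functions; answers may be false positives, never false negatives."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity rather than space

    def _positions(self, item):
        # Assumption: hashing "i:item" with SHA-256 simulates the i-th hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def optimal_k(m, n):
    """Number of hash functions minimizing the approximate false-positive rate."""
    return (m / n) * math.log(2)


def false_positive_rate(m, n, k):
    """Standard approximation (1 - e^{-kn/m})^k."""
    return (1 - math.exp(-k * n / m)) ** k


# Hypothetical crawled URLs, 8 bits per URL, k rounded from 8 ln 2.
urls = [f"http://example.com/page{i}.html" for i in range(100)]
crawled = BloomFilter(m=8 * len(urls), k=6)
for url in urls:
    crawled.add(url)
```

A URL that was added is always reported as present; a URL that was never added is reported as present only with small probability, which in the crawler simply means that a page is occasionally skipped.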

  22. Dictionary search: prefix-string search. Reading: 3.1 and 5.2.

  23. Prefix-string Search. Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.

  24. Trie: speeding-up searches. [Figure: trie over the dictionary, with edge labels s, y, z, omo, aibelyite, stile, zyg, czecin, etic, ygy, ial and numeric node labels]
      • Pro: O(p) search time
      • Cons: edge + node labels and tree structure
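Prefix search over a trie can be sketched as follows. This is a plain (uncompacted) trie with one character per edge, which is a simplification of the figure, where edges carry whole substrings. The names `TrieNode`, `Trie` and `prefix_search` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary string ends here


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def prefix_search(self, p):
        """Return all dictionary strings having p as a prefix, in sorted order."""
        # Descend along p: O(p) steps, independent of the dictionary size.
        node = self.root
        for ch in p:
            node = node.children.get(ch)
            if node is None:
                return []
        # Every string in the subtree below this node starts with p.
        results = []
        stack = [(node, p)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie(["systile", "syzygetic", "syzygial", "syzygy",
             "szaibelyite", "szczecin"])
matches = trie.prefix_search("syz")
```

The descent takes time proportional to the pattern length, while the space cost comes from storing a node and a label per edge, which is the drawback the slide points at.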

  25. 5 5 2 3345% 0 http://checkmate.com/All/Natural/Washcloth.html... Front-coding: squeezing strings ….systile syzygetic syzygial syzygy…. 0 http://checkmate.com/All_Natural/ 33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html 25 Essence_Oils.html 25 Mineral_Bath_Crystals.html 38 Salt.html 33 Cream.html http://checkmate.com/All_Natural/ http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html http://checkmate.com/All/Natural/Washcloth.html ... Gzip may be much better...
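Front-coding of a sorted list can be sketched as below: each string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names `common_prefix_len`, `front_encode` and `front_decode` are hypothetical, chosen only for this illustration.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_strings):
    """Encode each string as (shared prefix length with previous, suffix)."""
    encoded = []
    prev = ""
    for s in sorted_strings:
        lcp = common_prefix_len(prev, s)
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by scanning the pairs from the first one."""
    strings = []
    prev = ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        strings.append(prev)
    return strings


words = ["systile", "syzygetic", "syzygial", "syzygy"]
coded = front_encode(words)
```

Decoding a string requires scanning from the start of the list, which is why front-coding is usually applied to small buckets rather than to the whole dictionary.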

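  26. 2-level indexing
      • Internal memory: CT on a sample (systile, szaibelyite)
      • Disk: front-coded buckets, each entry written as (string length, shared-prefix length, suffix):
        ….(7, 0, systile) (9, 2, zygetic) (8, 5, ial) (6, 5, y) (11, 0, szaibelyite) (8, 2, czecin) (9, 2, omo….
      • 2 advantages:
        • Search ≈ typically 1 I/O
        • Space ≈ Front-coding over buckets
      • A disadvantage:
        • Trade-off ≈ speed vs space (because of bucket size)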
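The two-level scheme can be sketched as follows. This is a simplification under stated assumptions: a sorted list of bucket heads searched with `bisect` stands in for the in-memory index on the sample, plain Python lists stand in for disk buckets, and each front-coded entry stores only the shared-prefix length and the suffix. The function names are hypothetical.

```python
import bisect


def lcp_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_index(sorted_strings, bucket_size):
    """Split the sorted dictionary into front-coded buckets; sample each bucket's first string."""
    heads, buckets = [], []
    for i in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[i:i + bucket_size]
        heads.append(chunk[0])  # kept in internal memory
        coded, prev = [], ""    # front-coding restarts in every bucket
        for s in chunk:
            shared = lcp_len(prev, s)
            coded.append((shared, s[shared:]))
            prev = s
        buckets.append(coded)   # would live on disk
    return heads, buckets


def decode_bucket(coded):
    strings, prev = [], ""
    for shared, suffix in coded:
        prev = prev[:shared] + suffix
        strings.append(prev)
    return strings


def prefix_search(heads, buckets, p):
    # Last bucket whose head is <= p: matches cannot start before it.
    b = max(0, bisect.bisect_right(heads, p) - 1)
    results = []
    while b < len(buckets):
        strings = decode_bucket(buckets[b])  # one bucket read, i.e. one I/O
        results.extend(s for s in strings if s.startswith(p))
        last = strings[-1]
        if last > p and not last.startswith(p):
            break  # everything after this bucket is larger than any match
        b += 1
    return results


dictionary = ["systile", "syzygetic", "syzygial", "syzygy",
              "szaibelyite", "szczecin"]
heads, buckets = build_index(dictionary, bucket_size=4)
```

Larger buckets shrink the in-memory sample and improve compression, but each search then decodes more strings per bucket, which is the speed versus space trade-off noted on the slide.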
