
Efficient Data Structures for Web Search Engines

Explore different data structures like Binary Search Trees, Radix Trees, and Tries for a web search engine with vast document indexing needs. Learn about space and time requirements, Boolean searches, and optimizing search results relevancy.


Presentation Transcript


  1. Design a Data Structure • Suppose you wanted to build a web search engine, a la Alta Vista (so you can search for “banana slugs” or “zyzzyvas”) • index say 100,000,000 documents of 1000 words each, i.e. 100 billion word occurrences • average word length: 8 • speed determined largely by disk accesses • may want boolean searches (e.g. “banana” and “slug”) • order results by relevance (title, keywords, repetitions…) • what data structure, algorithms? • what will the space requirements of your data structure be? • what will the time requirements be?

  2. Search Engine Ideas • Binary search tree • With a node for each word occurrence, memory needed: 100 billion nodes, 20-30 bytes each? • Insert, delete, find O(log n) – would that be OK? • Or one node for all occurrences of a word, with a linked list of pointers to documents? • perhaps 10 million nodes, each with a 10,000 element list? • keep nodes (but not lists) in RAM • each element of list has URL, title, excerpt – 8K bytes? • How about a list of documents with excerpts. • 1. Banana Slugs, http://…, “Banana slugs are yellow, 8” long…” • 8K per document would be 800 GB for the whole list.
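
The size estimates on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope aid: the 25-byte node is an assumed midpoint of the slide's "20-30 bytes", "8K" is read as 8,000 bytes so that the total matches the slide's 800 GB, and all constant and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Figures taken from the slides.
constexpr std::uint64_t kDocuments = 100000000ULL;      // 100 million documents
constexpr std::uint64_t kWordsPerDocument = 1000ULL;

// Assumptions made for this sketch (not stated exactly in the slides).
constexpr std::uint64_t kBytesPerNode = 25ULL;           // midpoint of "20-30 bytes"
constexpr std::uint64_t kBytesPerExcerpt = 8000ULL;      // "8K" read as 8,000 bytes

// Total word occurrences across the whole collection.
constexpr std::uint64_t wordOccurrences() {
    return kDocuments * kWordsPerDocument;
}

// Bytes needed for a BST with one node per word occurrence.
constexpr std::uint64_t bstBytesOneNodePerOccurrence() {
    return wordOccurrences() * kBytesPerNode;
}

// Bytes needed for a list holding one excerpt record per document.
constexpr std::uint64_t excerptListBytes() {
    return kDocuments * kBytesPerExcerpt;
}
```

Under these assumptions the one-node-per-occurrence BST needs about 2.5 TB, which is why the slide moves on to one node per distinct word.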

  3. Getting results • What should we store at the nodes of the BST? • A “hit list” for a word? 10000 entries? • Store a pointer to a hit list instead, to minimize BST size • For each hit store document number and byte offset • Order hit list by relevance criteria • Size of hit list: 8GB? • How many disk accesses to find the hits in a BST? • At 100 million * 20-30 bytes per node, the BST is large. Can we store it all in RAM? • How to perform a Boolean search? • or: union two lists (merge) • and: intersect two lists (merge-like algorithm) • Total disk accesses needed? • search BST + access hit list + access each document’s info
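
The merge-style Boolean operations mentioned above can be sketched as follows. This assumes each hit list is held as a vector of document numbers sorted in ascending order with no duplicates (a merge needs a common sort key, so a list ordered by relevance would have to be re-sorted or kept in a second copy). The names `HitList`, `intersectHits` and `unionHits` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a hit list is a vector of document numbers, ascending, no duplicates.
typedef std::vector<int> HitList;

// "and": keep only the documents that appear in both lists.
HitList intersectHits(const HitList& a, const HitList& b) {
    HitList out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;            // a[i] cannot be in b any more
        else if (b[j] < a[i]) ++j;       // b[j] cannot be in a any more
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}

// "or": standard merge that emits each document once.
HitList unionHits(const HitList& a, const HitList& b) {
    HitList out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) out.push_back(a[i++]);
        else if (b[j] < a[i]) out.push_back(b[j++]);
        else { out.push_back(a[i]); ++i; ++j; }
    }
    while (i < a.size()) out.push_back(a[i++]);   // leftovers from a
    while (j < b.size()) out.push_back(b[j++]);   // leftovers from b
    return out;
}
```

Both functions make a single pass over each list, so the cost is linear in the combined list length, which matters when the lists are streamed from disk.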

  4. A Better Data Structure • BSTs waste space. Much duplication in the keys • BSTs waste comparison time, for the same reason • Can we use the ideas of Radix Sort? • Search by bit? or by letter? • Build a search tree, but… • Go left if first bit is 0, right for 1 • Or, nodes have 26 children, for a..z • Words at the leaves. (Different sort of node.) • Each leaf node is a “hit list” • Don’t need to store the words! • How much space is needed? • suppose you have all 11.9M (26^5) possible 5-letter words. • space for tree about 1 pointer per word, 4 bytes, vs. 20(?) in BST • Space savings are possible, but what about wasted pointer space?

  5. Radix Search (Ch. 15) • Radix-search methods provide reasonable worst-case performance without balanced-tree complexity • Space savings are also possible. • They work by comparing pieces (“bytes”) of the key rather than the whole key, as in a BST • Analogous to Radix Sorting methods • Called “tries”, from “retrieval” (but, ironically, pronounced like the word “tries”)

  6. Symbol Tables (Ch. 12 quickie) • But first, a word about symbol tables and BSTs (review) • Symbol table: store items, retrieve them by key. • e.g. a compiler’s symbol table • e.g. a database with primary key • e.g. Perl’s hash data structure (essentially an array indexed by a word): $phone{"john"} = "x6789" • fundamental to much of computation • Symbol table ADT (with additional desirable ops): • insert, delete, find • select (kth largest) • sort • union (of two symbol tables) • Extensively studied and still an area of active research (e.g. the web)

  7. BSTs for Symbol Tables • The Binary Search Tree is a common data structure used to implement symbol tables • Operations: • insert, delete, find – recursive algs, O(n) worst case • O(log n) worst case in balanced BSTs • sort – inorder traversal • O(n) • kth largest? • augment tree with number of descendants stored at each node • O(log n) time in a balanced BST • pred, succ? • union?
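
The "kth largest" idea on this slide, augmenting each node with the size of its subtree, might look like the sketch below. It uses a plain unbalanced BST with integer keys to stay short, so the O(log n) bound only holds if the tree happens to be balanced. `SizedNode`, `insertSized` and `kthLargest` are illustrative names, and nodes are never freed.

```cpp
#include <cassert>

// BST node augmented with the number of nodes in its subtree.
struct SizedNode {
    int key;
    int size;              // nodes in this subtree, including this one
    SizedNode* left;
    SizedNode* right;
    explicit SizedNode(int k) : key(k), size(1), left(nullptr), right(nullptr) {}
};

int subtreeSize(const SizedNode* n) { return n ? n->size : 0; }

// Ordinary recursive BST insert that keeps the size fields up to date.
// Duplicate keys are ignored.
SizedNode* insertSized(SizedNode* root, int key) {
    if (!root) return new SizedNode(key);
    if (key < root->key) root->left = insertSized(root->left, key);
    else if (key > root->key) root->right = insertSized(root->right, key);
    root->size = 1 + subtreeSize(root->left) + subtreeSize(root->right);
    return root;
}

// Returns the node holding the kth largest key (k = 1 is the maximum),
// or nullptr if k is out of range. One step per level of the tree.
const SizedNode* kthLargest(const SizedNode* root, int k) {
    while (root) {
        int rightSize = subtreeSize(root->right);
        if (k <= rightSize) {
            root = root->right;            // answer is among the larger keys
        } else if (k == rightSize + 1) {
            return root;                   // exactly rightSize keys are larger
        } else {
            k -= rightSize + 1;            // skip this node and its right subtree
            root = root->left;
        }
    }
    return nullptr;
}
```

The kth smallest is the mirror image: compare k against the left subtree's size instead.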

  8. Digital Search Trees (Ch. 15 again) • Like a BST, but go left for 0, right for 1 in the bit in question • Store key at node • Root is most significant bit; ith level -> ith bit from left • Search: like BST search, but compare appropriate bit • Insert: ditto • Note: not inorder! • Each key is somewhere along the path specified by its bits… • Can’t support sort, select • Search time? O(b), b = # of bits

  9. Digital Search Tree Insertion • How to insert Z? • Z=11010 • Trace down bits until you find an empty spot • Runtime? O(b), b = number of bits
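
A minimal sketch of digital search tree insertion and search, assuming keys are distinct 5-bit unsigned integers (so Z = 26 = 11010, as on the slide). The names `DstNode`, `bitAt`, `dstInsert` and `dstContains` are hypothetical, and keys of 32 or more would break the bit arithmetic.

```cpp
#include <cassert>

const int kBits = 5;   // assumption: every key fits in 5 bits

struct DstNode {
    unsigned key;          // the key is stored at the node itself
    DstNode* child[2];     // child[0] for bit 0, child[1] for bit 1
    explicit DstNode(unsigned k) : key(k) { child[0] = child[1] = nullptr; }
};

// Bit number `level` of key, counting from the most significant of kBits bits.
int bitAt(unsigned key, int level) {
    return (key >> (kBits - 1 - level)) & 1u;
}

// Follow the key's bits from the root until an empty spot is found.
DstNode* dstInsert(DstNode* root, unsigned key) {
    if (!root) return new DstNode(key);
    DstNode* n = root;
    int level = 0;
    while (n->key != key) {                 // stop early if the key is already present
        int b = bitAt(key, level++);
        if (!n->child[b]) {
            n->child[b] = new DstNode(key); // empty spot: the key goes here
            break;
        }
        n = n->child[b];
    }
    return root;
}

// Same walk as insertion, comparing the whole key at each node on the path.
bool dstContains(const DstNode* n, unsigned key) {
    for (int level = 0; n; ++level) {
        if (n->key == key) return true;
        n = n->child[bitAt(key, level)];
    }
    return false;
}
```

Because a node at depth d agrees with the search key on its first d bits, the walk visits at most b + 1 nodes for b-bit keys.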

  10. Trie • How can we keep the BST order? • Trie: a binary tree with keys at the leaves: • for an empty set, it is a null pointer • for a single key, a leaf containing it • for many keys, a node with the keys starting with bit 0 in its left subtree and the keys starting with bit 1 in its right subtree

  11. Trie Insertion • Perform search as usual. • If search ends at null link, insert there • If the search ends on a leaf, we need to add enough nodes on the way down to differentiate the leaf and the inserted node • Runtime? O(b) -- or maybe better! • Inserting N random bitstrings requires lg N bit comparisons on average per insertion • Note that leaf nodes and internal nodes are different. Wasted space if we use only one sort. (This gets especially significant in a large radix!) • Even with different node types, there may be wasted space
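
The insertion rule above (insert at a null link, or split a leaf by adding internal nodes until the two keys' bits differ) can be sketched as follows. To keep it short, this uses one node type with an `isLeaf` flag, which is exactly the space-wasting choice the slide warns about. Keys are assumed to be distinct 5-bit unsigned integers, and all names are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int kBits = 5;   // assumption: every key fits in 5 bits

struct BitTrieNode {
    bool isLeaf;
    unsigned key;              // meaningful only in leaves
    BitTrieNode* child[2];     // used only in internal nodes
    BitTrieNode(bool leaf, unsigned k) : isLeaf(leaf), key(k) {
        child[0] = child[1] = nullptr;
    }
};

int bitOf(unsigned key, int level) {
    return (key >> (kBits - 1 - level)) & 1u;
}

// Replace `leaf` by a chain of internal nodes long enough to tell
// leaf->key and key apart, starting the comparison at bit `level`.
BitTrieNode* split(BitTrieNode* leaf, unsigned key, int level) {
    BitTrieNode* internal = new BitTrieNode(false, 0);
    int a = bitOf(leaf->key, level);
    int b = bitOf(key, level);
    if (a != b) {
        internal->child[a] = leaf;
        internal->child[b] = new BitTrieNode(true, key);
    } else {
        internal->child[a] = split(leaf, key, level + 1);
    }
    return internal;
}

BitTrieNode* trieInsert(BitTrieNode* n, unsigned key, int level = 0) {
    if (!n) return new BitTrieNode(true, key);       // search ended at a null link
    if (n->isLeaf) {
        if (n->key == key) return n;                 // already present
        return split(n, key, level);                 // search ended on a leaf
    }
    int b = bitOf(key, level);
    n->child[b] = trieInsert(n->child[b], key, level + 1);
    return n;
}

// Left-to-right leaf traversal; yields the keys in ascending order.
void collectInOrder(const BitTrieNode* n, std::vector<unsigned>& out) {
    if (!n) return;
    if (n->isLeaf) { out.push_back(n->key); return; }
    collectInOrder(n->child[0], out);
    collectInOrder(n->child[1], out);
}
```

Unlike the digital search tree, the leaves read left to right are in sorted order, so sort and select become possible again.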

  12. R-way Tries • You can save search time by using a larger radix (at the expense of wasted space…) • For example, have 26 children of each node, one for each letter of the alphabet

  13. Tries for strings • 26 pointers per internal node, one for each letter of the alphabet • What if one word is the prefix of another? • Example aardvark and aardvarkish • How do you represent that “aardvark” is a word if that node’s ‘i’ pointer points to another internal node? • Add a bit per letter which means “this is a word” • Keys are stored implicitly, by the sequence of links taken to find them.

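  14. A Trie node for strings

      struct node {
          char isword[26];   // isword[i] != 0: the path to this node plus letter i spells a word
          node *links[26];   // links[i]: child reached by letter i, or 0 if there is none
          node() {
              for (int i = 0; i < 26; i++) { isword[i] = 0; links[i] = 0; }
          }
      };

      • But where is the word stored?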

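  15. Insertion • How do you insert a string into a trie?

      // Map a lowercase letter 'a'..'z' to 0..25.
      int index(char ch) { return int(ch - 'a'); }

      // Insert word into the trie rooted at n, starting at character pos.
      // Assumes the word contains only lowercase letters a-z.
      void insert(const string &word, node *n, int pos) {
          if (word.empty()) return;              // avoids word.size() - 1 wrapping around
          int i = index(word[pos]);
          if (pos == int(word.size()) - 1) {     // last letter: mark it as ending a word
              n->isword[i] = 1;
              return;
          }
          if (n->links[i] == NULL)               // create the child on first use
              n->links[i] = new node;
          insert(word, n->links[i], pos + 1);
      }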
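
Lookup follows the same path as insertion. The sketch below is self-contained, so it repeats the node layout from slide 14 and restates insertion iteratively; `letterIndex` and `contains` are hypothetical names, and words are assumed to consist of lowercase a-z only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct node {
    char isword[26];   // isword[i] != 0: path to this node plus letter i is a word
    node* links[26];   // child for each letter, or null
    node() { for (int i = 0; i < 26; i++) { isword[i] = 0; links[i] = 0; } }
};

// Assumption: ch is a lowercase letter 'a'..'z'.
int letterIndex(char ch) { return int(ch - 'a'); }

// Iterative version of the slide's insert: walk or create links for all but
// the last letter, then set the word flag for the last letter.
void insert(const std::string& word, node* n) {
    if (word.empty()) return;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        int i = letterIndex(word[pos]);
        if (!n->links[i]) n->links[i] = new node;
        n = n->links[i];
    }
    n->isword[letterIndex(word[word.size() - 1])] = 1;
}

// True if word was inserted earlier. A prefix of an inserted word is not
// reported unless it was inserted itself.
bool contains(const std::string& word, const node* n) {
    if (word.empty()) return false;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        n = n->links[letterIndex(word[pos])];
        if (!n) return false;                  // path breaks off: not present
    }
    return n->isword[letterIndex(word[word.size() - 1])] != 0;
}
```

A lookup costs one array index per character, which is where the "7.4 one-character comparisons" figure on the next slide comes from.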

  16. Experimental Results • In my implementation a node used 132 bytes • 20068 words were read in • 45747 nodes were allocated • Total space 6,038,604 bytes (compared with 200k size of /usr/dict/words) • Average word length 7.4 characters • Average comparisons per search: 7.4 one-character comparisons (compared to 15 word comparisons for a balanced BST) • Much easier to implement than a balanced BST
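
The space figures on this slide are consistent with the node from slide 14 if pointers are 4 bytes wide. That is an assumption here (the slide does not state the platform): 26 one-byte flags rounded up to a 4-byte boundary, plus 26 four-byte pointers. The names below are illustrative.

```cpp
#include <cassert>

const unsigned long long kLetters = 26;
const unsigned long long kPointerBytes = 4;   // assumption: 32-bit pointers

// Round n up to the next multiple of align.
unsigned long long roundUp(unsigned long long n, unsigned long long align) {
    return (n + align - 1) / align * align;
}

// Flags are padded so that the pointer array starts on a pointer boundary.
unsigned long long nodeBytes() {
    return roundUp(kLetters, kPointerBytes) + kLetters * kPointerBytes;
}

unsigned long long totalBytes(unsigned long long nodes) {
    return nodes * nodeBytes();
}
```

With 8-byte pointers the same layout would come to 240 bytes per node, so the trie's space overhead roughly doubles on a 64-bit machine.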

  17. Using a Trie: Examples • Spell checker: fast but big • Symbol table with lots of short symbols • Boggle-playing program • read /usr/dict/words into a trie • generate a 4x4 square of random letters • DFS (or BFS) starting at each square, not re-using letters, finding all words from trie…
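
The Boggle search could be sketched as a depth-first search that walks the board and the trie in step, abandoning a path as soon as the trie has no matching link. The sketch takes the board as an argument instead of generating random letters and loading /usr/dict/words, treats all 8 surrounding squares as neighbours (an assumption; the slide does not define adjacency), and uses hypothetical names throughout.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

struct node {
    char isword[26];
    node* links[26];
    node() { for (int i = 0; i < 26; i++) { isword[i] = 0; links[i] = 0; } }
};

// Assumption: only lowercase letters 'a'..'z' appear in words and on the board.
int letterIndex(char ch) { return int(ch - 'a'); }

void insert(const std::string& word, node* n) {
    if (word.empty()) return;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        int i = letterIndex(word[pos]);
        if (!n->links[i]) n->links[i] = new node;
        n = n->links[i];
    }
    n->isword[letterIndex(word[word.size() - 1])] = 1;
}

// Extend the current path with square (r, c); n is the trie node reached by
// the letters already in prefix.
void dfs(const char board[4][4], bool used[4][4], int r, int c,
         const node* n, std::string& prefix, std::set<std::string>& found) {
    int i = letterIndex(board[r][c]);
    prefix.push_back(board[r][c]);
    used[r][c] = true;
    if (n->isword[i]) found.insert(prefix);      // the path so far spells a word
    const node* next = n->links[i];
    if (next) {                                  // some longer word continues here
        for (int dr = -1; dr <= 1; ++dr) {
            for (int dc = -1; dc <= 1; ++dc) {
                int nr = r + dr, nc = c + dc;
                if (nr < 0 || nr >= 4 || nc < 0 || nc >= 4) continue;
                if (used[nr][nc]) continue;      // also skips the current square
                dfs(board, used, nr, nc, next, prefix, found);
            }
        }
    }
    used[r][c] = false;                          // backtrack
    prefix.erase(prefix.size() - 1);
}

// Start a DFS from every square and collect the distinct words found.
std::set<std::string> findWords(const char board[4][4], const node* root) {
    std::set<std::string> found;
    bool used[4][4] = {};
    std::string prefix;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            dfs(board, used, r, c, root, prefix, found);
    return found;
}
```

The trie is what makes this fast: a path such as "xx" is cut off after two squares because no dictionary word starts that way.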
