Design a Data Structure

Design a Data Structure • Suppose you wanted to build a web search engine, à la Alta Vista (so you can search for “banana slugs” or “zyzzyvas”) • index, say, 100,000,000 documents of 1000 words each → 100 billion word occurrences • average word length: 8 • speed determined largely by disk accesses • may want boolean searches (e.g. “banana” and “slug”) • order results by relevance (title, keywords, repetitions…) • what data structure, algorithms? • what will the space requirements of your data structure be? • what will the time requirements be?

Search Engine Ideas • Binary search tree • With a node for each word occurrence, memory needed: 100 billion nodes, 20-30 bytes each? • Insert, delete, find O(log n) – would that be OK? • Or one node for all occurrences of a word, with a linked list of pointers to documents? • perhaps 10 million nodes, each with a 10,000 element list? • keep nodes (but not lists) in RAM • each element of list has URL, title, excerpt – 8K bytes? • How about a list of documents with excerpts. • 1. Banana Slugs, http://…, “Banana slugs are yellow, 8” long…” • 8K per document would be 800 GB for the whole list.

Getting results • What should we store at the nodes of the BST? • A “hit list” for a word? 10000 entries? • Store a pointer to a hit list instead, to minimize BST size • For each hit store document number and byte offset • Order hit list by relevance criteria • Size of hit list: 8GB? • How many disk accesses to find the hits in a BST? • At 100 million * 20-30 bytes per node, the BST is large. Can we store it all in RAM? • How to perform a Boolean search? • or: union two lists (merge) • and: intersect two lists (merge-like algorithm) • Total disk accesses needed? • search BST + access hit list + access each document’s info
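The Boolean-search bullets above can be sketched as a merge over two hit lists. This is a minimal sketch, not the lecture's implementation: it assumes each hit list is reduced to document numbers sorted in ascending order (a list ordered by relevance would need a separately sorted copy before merging). The names `HitList`, `intersect_hits`, and `union_hits` are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Assumption: a hit list is a vector of document numbers,
// sorted ascending, with no duplicates.
using HitList = std::vector<int>;

// "and": keep only document numbers present in both lists.
HitList intersect_hits(const HitList& a, const HitList& b) {
    HitList out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;                      // a[i] cannot be in b
        else if (b[j] < a[i]) ++j;                 // b[j] cannot be in a
        else { out.push_back(a[i]); ++i; ++j; }    // present in both
    }
    return out;
}

// "or": merge the two lists, emitting shared document numbers once.
HitList union_hits(const HitList& a, const HitList& b) {
    HitList out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) out.push_back(a[i++]);
        else if (b[j] < a[i]) out.push_back(b[j++]);
        else { out.push_back(a[i]); ++i; ++j; }
    }
    while (i < a.size()) out.push_back(a[i++]);    // leftovers from a
    while (j < b.size()) out.push_back(b[j++]);    // leftovers from b
    return out;
}
```

Both functions make a single pass over each list, so the cost is linear in the combined list length. The standard library's `std::set_intersection` and `std::set_union` perform the same merges on sorted ranges.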

A Better Data Structure • BSTs waste space. Much duplication in the keys • BSTs waste comparison time, for the same reason • Can we use the ideas of Radix Sort? • Search by bit? or by letter? • Build a search tree, but… • Go left if first bit is 0, right for 1 • Or, nodes have 26 children, for a..z • Words at the leaves. (Different sort of node.) • Each leaf node is a “hit list” • Don’t need to store the words! • How much space is needed? • Suppose you have all 26^5 ≈ 11.9M 5-letter words. • Space for tree: about 1 pointer per word, 4 bytes, vs. 20(?) in BST • Space savings possible, but what about wasted pointer space?

Radix Search (Ch. 15) • Radix-search methods provide reasonable worst-case performance without balanced-tree complexity • Space savings are also possible. • They work by comparing pieces (“bytes”) of the key rather than the whole key, as in a BST • Analogous to Radix Sorting methods • Called “tries”, from “retrieval” (but, ironically, pronounced like the word “tries”)

Symbol Tables (Ch. 12 quickie) • But first, a word about symbol tables and BSTs (review) • Symbol table: store items, retrieve them by key. • e.g. a compiler’s symbol table • e.g. a database with primary key • e.g. Perl’s hash data structure (essentially an array indexed by a word): $phone{“john”} = “x6789”. • fundamental to much of computation • Symbol table ADT (with additional desirable ops): • insert, delete, find • select (kth largest) • sort • union (of two symbol tables) • Extensively studied and still an area of active research (e.g. the web)

BSTs for Symbol Tables • The Binary Search Tree is a common data structure used to implement symbol tables • Operations: • insert, delete, find – recursive algs, O(n) worst case • O(log n) worst case in balanced BSTs • sort – inorder traversal • O(n) • kth largest? • augment tree with number of descendants stored at each node • O(log n) time in a balanced BST • pred, succ? • union?
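The "kth largest" bullet above can be sketched as follows. This is a minimal illustration, assuming integer keys and a plain unbalanced BST (so the O(log n) bound only holds if the tree happens to be balanced). `SizedNode`, `sized_insert`, and `kth_largest` are hypothetical names, and each node stores the size of its whole subtree (descendants plus itself).

```cpp
#include <cstddef>

// Hypothetical node type: size = this node plus all of its descendants.
struct SizedNode {
    int key;
    std::size_t size = 1;
    SizedNode* left = nullptr;
    SizedNode* right = nullptr;
    explicit SizedNode(int k) : key(k) {}
};

std::size_t subtree_size(const SizedNode* n) { return n ? n->size : 0; }

// Plain (unbalanced) BST insert that keeps the size fields up to date.
// Duplicate keys are ignored. Nodes are never freed in this sketch.
SizedNode* sized_insert(SizedNode* root, int key) {
    if (!root) return new SizedNode(key);
    if (key < root->key) root->left = sized_insert(root->left, key);
    else if (key > root->key) root->right = sized_insert(root->right, key);
    root->size = 1 + subtree_size(root->left) + subtree_size(root->right);
    return root;
}

// Returns the node holding the kth largest key (k = 1 is the maximum),
// or nullptr if k is out of range.
const SizedNode* kth_largest(const SizedNode* n, std::size_t k) {
    while (n) {
        std::size_t bigger = subtree_size(n->right);  // keys larger than n->key
        if (k <= bigger) n = n->right;                // answer is in the right subtree
        else if (k == bigger + 1) return n;           // this node is the answer
        else { k -= bigger + 1; n = n->left; }        // skip this node and its right subtree
    }
    return nullptr;
}
```

Each step of `kth_largest` descends one level, so the running time is proportional to the height of the tree.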

Digital Search Trees (Ch. 15 again) • Like a BST, but go left for 0, right for 1 in the bit in question • Store key at node • Root is most significant bit; ith level -> ith bit from left • Search: like BST search, but compare appropriate bit • Insert: ditto • Note: not inorder! • Each key is somewhere along the path specified by its bits… • Can’t support sort, select • Search time? O(b), b = # of bits

Digital Search Tree Insertion • How to insert Z? • Z=11010 • Trace down bits until you find an empty spot • Runtime? O(b), b = number of bits
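Digital search tree search and insertion can be sketched as below. This is only a sketch under stated assumptions: keys are 5-bit codes (A = 1 … Z = 26, which matches Z = 11010), and `DstNode`, `bit_at`, `dst_insert`, and `dst_contains` are illustrative names rather than anything from the lecture.

```cpp
#include <cassert>

// Assumption for this sketch: keys are 5-bit codes (A = 1 ... Z = 26).
const int kBits = 5;

struct DstNode {
    unsigned key;
    DstNode* child[2] = {nullptr, nullptr};
    explicit DstNode(unsigned k) : key(k) {}
};

// Bit number `level` of the key, counting from the most significant bit (level 0).
inline unsigned bit_at(unsigned key, int level) {
    return (key >> (kBits - 1 - level)) & 1u;
}

// Follow the key's bits until an empty link is found and hang the new node there.
// Returns the (possibly new) root. Duplicate keys are ignored.
DstNode* dst_insert(DstNode* root, unsigned key) {
    assert(key < (1u << kBits));
    if (!root) return new DstNode(key);
    DstNode* n = root;
    for (int level = 0; ; ++level) {
        if (n->key == key) return root;          // already present
        assert(level < kBits);                   // distinct keys differ within kBits bits
        unsigned b = bit_at(key, level);
        if (!n->child[b]) { n->child[b] = new DstNode(key); return root; }
        n = n->child[b];
    }
}

// Same walk as insertion, comparing the full key at every node on the path.
bool dst_contains(const DstNode* n, unsigned key) {
    for (int level = 0; n; ++level) {
        if (n->key == key) return true;
        if (level >= kBits) return false;        // ran out of bits
        n = n->child[bit_at(key, level)];
    }
    return false;
}
```

Inserting 1 (A), 19 (S), 5 (E), and then 26 (Z) in that order places Z two links to the right of the root: its first bit leads to S, and its second bit finds an empty right link there. At most b nodes are visited per operation, which is the O(b) bound above.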

Trie • How can we keep the BST order? • Trie: a binary tree with keys at the leaves: • for an empty set, a null pointer • for a single key, a leaf containing it • for many keys, a node with keys starting with bit 0 in its left subtree and keys starting with bit 1 in its right subtree

Trie Insertion • Perform search as usual. • If search ends at null link, insert there • If the search ends on a leaf, we need to add enough nodes on the way down to differentiate the leaf and the inserted node • Runtime? O(b) -- or maybe better! • Inserting N random bitstrings requires lg N bit comparisons on average per insertion • Note that leaf nodes and internal nodes are different. Wasted space if we use only one sort. (This gets especially significant in a large radix!) • Even with different node types, there may be wasted space
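The insertion steps above can be sketched for a binary trie as follows. This is a hedged sketch: it assumes 5-bit keys and, for brevity, uses a single node type with an `is_leaf` flag, which is exactly the space-wasting choice the slide warns about. `TrieNode`, `split`, `trie_insert`, `trie_contains`, and `trie_collect` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

const int kBits = 5;  // assumption: keys are 5-bit codes

inline unsigned bit_at(unsigned key, int level) {
    return (key >> (kBits - 1 - level)) & 1u;  // level 0 = most significant bit
}

// One node type for both roles (wasteful, but short):
// leaves use `key`, internal nodes use `child`.
struct TrieNode {
    bool is_leaf;
    unsigned key;
    TrieNode* child[2];
};

TrieNode* make_leaf(unsigned key) { return new TrieNode{true, key, {nullptr, nullptr}}; }
TrieNode* make_internal() { return new TrieNode{false, 0, {nullptr, nullptr}}; }

// Add internal nodes until the bits of the two distinct leaf keys differ.
TrieNode* split(TrieNode* a, TrieNode* b, int level) {
    assert(level < kBits);
    TrieNode* n = make_internal();
    unsigned ba = bit_at(a->key, level), bb = bit_at(b->key, level);
    if (ba != bb) { n->child[ba] = a; n->child[bb] = b; }
    else n->child[ba] = split(a, b, level + 1);
    return n;
}

TrieNode* trie_insert(TrieNode* n, unsigned key, int level = 0) {
    if (!n) return make_leaf(key);                 // search ended at a null link
    if (n->is_leaf) {                              // search ended on a leaf
        if (n->key == key) return n;               // duplicate key: nothing to do
        return split(n, make_leaf(key), level);
    }
    unsigned b = bit_at(key, level);
    n->child[b] = trie_insert(n->child[b], key, level + 1);
    return n;
}

bool trie_contains(const TrieNode* n, unsigned key) {
    int level = 0;
    while (n && !n->is_leaf) n = n->child[bit_at(key, level++)];
    return n && n->key == key;                     // one full-key comparison at the leaf
}

// Left-to-right traversal of the leaves yields the keys in sorted order.
void trie_collect(const TrieNode* n, std::vector<unsigned>& out) {
    if (!n) return;
    if (n->is_leaf) { out.push_back(n->key); return; }
    trie_collect(n->child[0], out);
    trie_collect(n->child[1], out);
}
```

Unlike the digital search tree, the shape of this trie does not depend on insertion order, and `trie_collect` shows that the sorted order of the keys is preserved.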

R-way Tries • You can save search time by using a larger radix (at the expense of wasted space…) • For example, have 26 children of each node, one for each letter of the alphabet

Tries for strings • 26 pointers per internal node, one for each letter of the alphabet • What if one word is the prefix of another? • Example: aardvark and aardvarkish • How do you represent that “aardvark” is a word if that node’s ‘i’ pointer points to another internal node? • Add a bit per letter which means “this is a word” • Keys are stored implicitly – by the sequence of links taken to find them.

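A Trie node for strings
    struct node {
        char isword[26];   // isword[i] != 0: the path to this node plus letter i is a word
        node *links[26];   // links[i]: child reached by letter i, or 0 if none
        node() {
            for (int i=0; i<26; i++) {
                isword[i]=0;
                links[i]=0;
            }
        }
    };
But where is the word stored?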

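Insertion • How do you insert a string into a trie?
    // Map a lowercase letter 'a'..'z' to a slot 0..25.
    int index(char ch) { return int(ch - 'a'); }
    // Insert word starting at character pos (call with pos = 0 on the root).
    void insert(const string &word, node *n, int pos) {
        if (word.empty()) return;             // nothing to store
        int i = index(word[pos]);
        if (pos == int(word.size()) - 1) {    // last letter: mark it as ending a word
            n->isword[i] = 1;
            return;
        }
        if (n->links[i] == NULL)              // create the child node on demand
            n->links[i] = new node;
        insert(word, n->links[i], pos + 1);
    }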
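The matching lookup is not shown on the slides; a minimal sketch might look like the following. It reuses the slide's `node` layout, but `letter_index`, `insert_word`, and `contains_word` are names introduced here for illustration, insertion is written iteratively, and words are assumed to consist of lowercase a..z only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same layout as the slide's node.
struct node {
    char isword[26];
    node* links[26];
    node() { for (int i = 0; i < 26; i++) { isword[i] = 0; links[i] = nullptr; } }
};

// Assumption: words contain only lowercase letters a..z.
int letter_index(char ch) {
    assert(ch >= 'a' && ch <= 'z');
    return ch - 'a';
}

// Iterative equivalent of the recursive insertion: follow (or create) a link
// for every letter except the last, then set the last letter's word flag.
void insert_word(const std::string& word, node* n) {
    if (word.empty()) return;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        int i = letter_index(word[pos]);
        if (!n->links[i]) n->links[i] = new node;
        n = n->links[i];
    }
    n->isword[letter_index(word.back())] = 1;
}

// Follow the links for every letter except the last, then test the flag.
bool contains_word(const std::string& word, const node* n) {
    if (word.empty()) return false;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        n = n->links[letter_index(word[pos])];
        if (!n) return false;                  // path does not exist
    }
    return n->isword[letter_index(word.back())] != 0;
}
```

With “aardvark” and “aardvarkish” both inserted, the node reached by “aardvar” has both its ‘k’ word flag set and a ‘k’ link to a further node, so the prefix case works. No node stores a word: the key is the path itself.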

Experimental Results • In my implementation a node used 132 bytes • 20068 words were read in • 45747 nodes were allocated • Total space 6,038,604 bytes (compared with 200k size of /usr/dict/words) • Average word length 7.4 characters • Average comparisons per search: 7.4 one-character comparisons (compared to 15 word comparisons for a balanced BST) • Much easier to implement than a balanced BST

Using a Trie: Examples • Spell checker: fast but big • Symbol table with lots of short symbols • Boggle-playing program • read /usr/dict/words into a trie • generate a 4x4 square of random letters • DFS (or BFS) starting at each square, not re-using letters, finding all words from trie…
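The Boggle idea above can be sketched as a depth-first search that walks the board and the trie in step, abandoning a path as soon as the trie has no matching link. This is a hedged sketch, not the lecture's program: it assumes a rectangular board of lowercase letters, that moves go to any of the eight neighbouring squares, and that there is no minimum word length. All names (`BoggleNode`, `slot`, `add_word`, `dfs`, `find_words`) are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical trie node: isword[i] marks that the path so far plus letter i is a word.
struct BoggleNode {
    bool isword[26] = {};
    BoggleNode* links[26] = {};
};

int slot(char ch) { assert(ch >= 'a' && ch <= 'z'); return ch - 'a'; }

void add_word(BoggleNode* n, const std::string& w) {
    if (w.empty()) return;
    for (std::size_t pos = 0; pos + 1 < w.size(); ++pos) {
        int i = slot(w[pos]);
        if (!n->links[i]) n->links[i] = new BoggleNode;
        n = n->links[i];
    }
    n->isword[slot(w.back())] = true;
}

using Board = std::vector<std::string>;  // assumption: all rows have equal length

// n is the trie node reached by the letters already on `path`.
void dfs(const Board& board, int r, int c, const BoggleNode* n,
         std::vector<std::vector<bool>>& used, std::string& path,
         std::set<std::string>& found) {
    int rows = static_cast<int>(board.size());
    int cols = static_cast<int>(board[0].size());
    if (r < 0 || r >= rows || c < 0 || c >= cols || used[r][c]) return;
    int i = slot(board[r][c]);
    path.push_back(board[r][c]);
    if (n->isword[i]) found.insert(path);          // path spells a dictionary word
    if (n->links[i]) {                             // some longer word continues this way
        used[r][c] = true;                         // do not re-use this square
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc)
                if (dr != 0 || dc != 0)
                    dfs(board, r + dr, c + dc, n->links[i], used, path, found);
        used[r][c] = false;
    }
    path.pop_back();
}

std::set<std::string> find_words(const Board& board, const BoggleNode& root) {
    std::set<std::string> found;
    if (board.empty() || board[0].empty()) return found;
    std::vector<std::vector<bool>> used(
        board.size(), std::vector<bool>(board[0].size(), false));
    std::string path;
    for (int r = 0; r < static_cast<int>(board.size()); ++r)
        for (int c = 0; c < static_cast<int>(board[0].size()); ++c)
            dfs(board, r, c, &root, used, path, found);
    return found;
}
```

The trie is what makes the search practical: a path is extended only while some dictionary word starts with the letters chosen so far.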
