Data Structures & Algorithms Radix Search

Data Structures & Algorithms Radix Search Richard Newman based on slides by S. Sahni and book by R. Sedgewick

Radix-based Keys • Key has multiple parts • Each part is an element of some set • Character • Numeral • Key parts can be accessed (e.g., string s[i]) • Size of set is radix
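A minimal sketch of what "accessing key parts" can look like, assuming the 5-bit letter codes (A = 1 ... Z = 26) used in the later examples. The helper names `bit`, `letter_code`, and `char_digit` are illustrative and not from the slides.

```python
def bit(key: int, i: int, width: int = 5) -> int:
    """Return bit i of key, where bit 0 is the leftmost of `width` bits (radix 2)."""
    return (key >> (width - 1 - i)) & 1


def letter_code(ch: str) -> int:
    """Map 'A'..'Z' to 1..26, so that A = 00001 and S = 10011."""
    return ord(ch) - ord("A") + 1


def char_digit(key: str, i: int) -> str:
    """Return character i of a string key, or '' once the key is exhausted."""
    return key[i] if i < len(key) else ""


# The bits of S, leftmost first
s_bits = [bit(letter_code("S"), i) for i in range(5)]
```

With bits as the key parts the radix is 2; with characters it is the size of the character set.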

Advantages of Radix-based Search • Good worst-case performance • Simpler than balanced trees, etc. • Fast access to data • Easy way to handle variable-length keys • Save space (part of key in structure)

Disadvantages of Radix-based Search • May be space-inefficient • Performance depends on access to bytes of keys • Must have distinct keys, or other way to handle duplicate keys

Digital Search Trees • Similar to binary search trees • Difference is that we use bits of the key to determine subtree to search • Path in tree = prefix of key

Digital Search Trees • Insert A-S-E-R-C-H-I-N-G • Key Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111 • [Figure: resulting DST, with 0 links to the left and 1 links to the right: A at the root; E and S at depth 1; C, H, R at depth 2; G, I, N at depth 3] • Note that binary tree is not sorted in BST sense

Digital Search Trees Prop 15.1: A search or insertion into a DST takes about lg N comparisons on average, and about 2 lg N comparisons in the worst case, in a tree built from N keys. The number of comparisons is never more than the number of bits in the search key.
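The DST insert and search described above can be sketched as follows. This is an illustrative version, assuming distinct 5-bit integer keys; the names `Node`, `dst_insert`, and `dst_search` are not from the slides.

```python
W = 5  # assumed key width in bits


def bit(key, i):
    """Bit i of key, leftmost bit first; positions past the key width read as 0."""
    return (key >> (W - 1 - i)) & 1 if i < W else 0


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def dst_insert(root, key):
    """Insert key and return the (possibly new) root. Duplicates are ignored."""
    if root is None:
        return Node(key)
    node, d = root, 0
    while node.key != key:
        # The d-th bit of the key picks the subtree at depth d.
        side = "right" if bit(key, d) else "left"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            break
        node, d = child, d + 1
    return root


def dst_search(root, key):
    """Compare the full key at each node, then branch on the next bit."""
    node, d = root, 0
    while node is not None:
        if node.key == key:
            return True
        node = node.right if bit(key, d) else node.left
        d += 1
    return False


codes = [ord(c) - ord("A") + 1 for c in "ASERCHING"]
root = None
for k in codes:
    root = dst_insert(root, k)
```

Each step makes one full key comparison and then consumes one bit, so the path length is bounded by the number of bits in the key.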

Tries • Use bits of key to guide search like DST • But keep keys in order like BST • Allow recursive sort, etc. • Pronounced “try-ee” or “try” • Keys kept at leaves of a binary tree

Tries • Defn. 15.1: A trie is a binary tree that has keys associated with each leaf, defined as follows: • a trie for an empty set is a null link • a trie for a single key is a leaf w/key • a trie for > 1 key is an internal node with left link referring to the trie for keys that start with 0, and right link referring to the trie for keys that start with 1, with the leading bit considered removed when building the subtries

Tries • Insert A-S-E-R-C-H-I-N-G • Key Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111 • Construct tree to point where prefixes match • [Figure: successive snapshots of the binary trie as keys are inserted, with 0 links to the left, 1 links to the right, and keys at the leaves]

Tries • Prop. 15.2: The structure of a trie is independent of key insertion order; there is one unique trie for any given set of distinct keys. • Prop. 15.3: Insertion or search for a random key in a trie built from N random keys takes about lg N bit comparisons on average. In the worst case, the number of bit comparisons is bounded only by the number of bits in the key.
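A sketch of a binary trie with keys at the leaves, following Defn. 15.1. It assumes distinct 5-bit integer keys; the class and function names (`Leaf`, `Internal`, `trie_insert`, `split`, `shape`) are illustrative. The `shape` helper is only there to make the order independence of Prop. 15.2 visible.

```python
W = 5  # assumed key width in bits


def bit(key, i):
    return (key >> (W - 1 - i)) & 1


class Leaf:
    def __init__(self, key):
        self.key = key


class Internal:
    def __init__(self):
        self.left = None
        self.right = None


def split(p, q, d):
    """Build internal nodes from bit d down until leaves p and q diverge."""
    t = Internal()
    bp, bq = bit(p.key, d), bit(q.key, d)
    if bp == bq:  # common prefix continues: one-way branching
        child = split(p, q, d + 1)
        if bp == 0:
            t.left = child
        else:
            t.right = child
    elif bp == 0:
        t.left, t.right = p, q
    else:
        t.left, t.right = q, p
    return t


def trie_insert(node, key, d=0):
    if node is None:
        return Leaf(key)
    if isinstance(node, Leaf):
        return node if node.key == key else split(Leaf(key), node, d)
    if bit(key, d) == 0:
        node.left = trie_insert(node.left, key, d + 1)
    else:
        node.right = trie_insert(node.right, key, d + 1)
    return node


def trie_search(node, key):
    d = 0
    while isinstance(node, Internal):
        node = node.right if bit(key, d) else node.left
        d += 1
    return node is not None and node.key == key


def shape(node):
    """Nested-tuple view of the trie, used to compare two tries for equality."""
    if node is None or isinstance(node, Leaf):
        return None if node is None else node.key
    return (shape(node.left), shape(node.right))


def build(keys):
    root = None
    for k in keys:
        root = trie_insert(root, k)
    return root


codes = [ord(c) - ord("A") + 1 for c in "ASERCHING"]
same_shape = shape(build(codes)) == shape(build(reversed(codes)))
```

Under these assumptions `same_shape` is true for any two insertion orders of the same key set.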

Tries • Annoying feature of tries: • One-way branching when keys have common prefix • Prop. 15.4: A trie built from N random w-bit keys has about N/ln 2 nodes on the average (about 1.44 N)

Patricia Tries • Annoying feature of tries: • One-way branching when keys have common prefix • Two different types of nodes in trie • Patricia tries: fix both of these • Practical Algorithm To Retrieve Information Coded In Alphanumeric

Patricia Tries • Avoid one-way branching: • Keep at each node the index of the next bit to test • Skip over common prefix! • Avoid two types of nodes: • Store data in internal nodes • Replace external links with back links

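Patricia Tries • Key Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111 • [Figure: patricia trie; each node shown is labeled with the index of the bit it tests and the key it stores: 0 S, 1 H, 4 R, 2 E, 3 C, 4 A]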

Patricia Tries • Prop 15.5: Insertion or search in a patricia trie built from N random bitstrings takes about lg N bit comparisons on average, and about 2 lg N in the worst case, but never more than the length of the key.
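The two patricia ideas (a bit index in every node, and back links instead of external links) can be sketched as below. This is an illustrative version, assuming distinct nonzero 5-bit integer keys and a dummy head node whose key is all zeros; the names `Patricia`, `Node`, and `_find` are not from the slides.

```python
W = 5  # assumed key width in bits


def digit(key, i):
    """Bit i of key, leftmost bit first."""
    return (key >> (W - 1 - i)) & 1


class Node:
    def __init__(self, key, bit):
        self.key, self.bit = key, bit
        self.left = self.right = self  # links start out as back links to self


class Patricia:
    def __init__(self):
        # Dummy head with an all-zero key, so real keys must be nonzero.
        self.head = Node(0, -1)

    def _find(self, key):
        """Follow links until one points upward (bit index stops increasing)."""
        p, h = self.head, self.head.left
        while h.bit > p.bit:
            p, h = h, (h.right if digit(key, h.bit) else h.left)
        return h

    def search(self, key):
        h = self._find(key)
        return h is not self.head and h.key == key

    def insert(self, key):
        t = self._find(key).key
        if t == key:
            return
        w = 0  # leftmost bit where key differs from the key that was found
        while digit(key, w) == digit(t, w):
            w += 1
        # Walk down again to the link where a node testing bit w belongs.
        p, h = self.head, self.head.left
        while h.bit > p.bit and h.bit < w:
            p, h = h, (h.right if digit(key, h.bit) else h.left)
        x = Node(key, w)
        if digit(key, w):
            x.left, x.right = h, x
        else:
            x.left, x.right = x, h
        if p is self.head or digit(key, p.bit) == 0:
            p.left = x
        else:
            p.right = x


codes = [ord(c) - ord("A") + 1 for c in "ASERCHING"]
trie = Patricia()
for k in codes:
    trie.insert(k)
```

A search tests only the bit named in each node and makes a single full key comparison at the end, which is why common prefixes cost nothing extra.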

Map • Radix search • Digital Search Trees • Tries • Patricia Tries • Multiway tries and TSTs • Text string algorithms

Multiway Tries • Like radix sort, can get benefit from comparing more than one bit at a time • Compare r bits at a time, speed up search by a factor of r • What could possibly be bad? • Number of links per node is now R = 2^r • Can waste a lot of space!

Multiway Tries • Structure is (almost) the same as binary tries • Except there are R branches • Search: start at root, leftmost digit • Follow ith link if next R-ary digit is i • If null link, then miss • If reach leaf, it contains only key with prefix matching path to it - compare

Existence Tries • Only keys, no records • Insert/search • Defn. 15.2: The existence trie for a set of keys is: • Empty set: null link • Non-empty set: internal node with links for each possible digit to tries built with the leading digit omitted

Existence Tries • Convenient to return null on miss, dummy record on hit • Convenient to have no duplicate keys and no key a prefix of another key • Keys of fixed length, or • Use termination character with value NULLdigit, only used as sentinel

Existence Tries • No need to store any data • All keys captured in trie structure • If reach NULLdigit at the same time we run out of key digits, search hit • Otherwise, search miss • Insert: search until find null link, then add nodes for each of the remaining digits in the key
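A small sketch of an existence trie with R links per node, assuming keys made of lowercase letters a..z and a NULLdigit encoded as digit 0 (so R = 27). The encoding and the names `digits` and `ExistenceTrie` are assumptions for illustration.

```python
R = 27  # assumed radix: NULLdigit (0) plus the letters a..z (1..26)


def digits(key):
    """R-ary digits of a lowercase key, followed by the NULLdigit sentinel."""
    return [ord(c) - ord("a") + 1 for c in key] + [0]


class ExistenceTrie:
    def __init__(self):
        self.root = None  # the trie for the empty set is a null link

    def insert(self, key):
        if self.root is None:
            self.root = [None] * R
        node = self.root
        for d in digits(key):
            if node[d] is None:  # null link: add nodes for the remaining digits
                node[d] = [None] * R
            node = node[d]

    def search(self, key):
        node = self.root
        for d in digits(key):
            if node is None or node[d] is None:
                return False  # null link: search miss
            node = node[d]
        return True  # followed the NULLdigit link as the key ran out: hit


trie = ExistenceTrie()
for word in "now is the time for".split():
    trie.insert(word)
```

Every node here carries R links, most of them null, which is the space cost that the later multiway trie and TST slides address.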

Existence Tries • Example text: now is the time for a • [Figure: existence trie for the words of the example text, with one link per letter]

Multi-way Tries • R-ary branching • Keys stored at leaves • Path to leaf defines prefix of key stored at leaf • Only build tree downward until prefixes become distinct

Multi-way Tries • Defn. 15.3: The multiway trie for a set of keys associated with leaves is: • Set empty: null link • Singleton set: leaf with key • Larger set: internal node with links for each possible digit to tries built with the leading digit omitted

Multi-way Tries • Prop. 15.6: Search or insertion in a standard R-ary trie built from N random keys takes about log_R N character comparisons on average, bounded by the length of the key; the number of links is about RN/ln R. • Classic time-space tradeoff! • Larger R = faster but more space

Ternary Search Trie (TST) • Each node has a character (digit) and three links • Left link refers to subtrie with current key digit less than that of the node • Middle link refers to subtrie with current key digit the same • Right link refers to subtrie with current key digit greater than node’s

Ternary Search Trie (TST) • TST equivalent to a multiway trie in which each node is replaced by a BST that uses the characters for non-null links as keys • TSTs are like 3-way radix sorting • in the same way that BSTs are like QuickSort • and M-ary tries are like RadixSort

Ternary Search Trie (TST) • Search: start at root • Recursively – • Compare next character in key with character in node • If less, take left link • If greater, take right link • If equal, take middle and go to next character in key • Miss if encounter null link or reach end of key before NULLdigit

Ternary Search Trie (TST) • Insert: start at root • Search – • Find location where prefix diverges • Add new nodes for characters not consumed by search
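The TST search and insert steps above can be sketched as an existence TST. This version assumes string keys that never contain the character "\0", which stands in for NULLdigit; the names `Node`, `tst_insert`, and `tst_search` are illustrative.

```python
END = "\0"  # assumed NULLdigit sentinel; keys must not contain it


class Node:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.mid = self.right = None


def char_at(key, i):
    return key[i] if i < len(key) else END


def tst_insert(node, key, i=0):
    """Insert key[i:] below node and return the subtrie root."""
    ch = char_at(key, i)
    if node is None:
        node = Node(ch)  # new node for a character not consumed by the search
    if ch < node.ch:
        node.left = tst_insert(node.left, key, i)
    elif ch > node.ch:
        node.right = tst_insert(node.right, key, i)
    elif ch != END:
        node.mid = tst_insert(node.mid, key, i + 1)
    return node


def tst_search(node, key):
    i = 0
    while node is not None:
        ch = char_at(key, i)
        if ch < node.ch:
            node = node.left
        elif ch > node.ch:
            node = node.right
        elif ch == END:
            return True  # matched the sentinel: search hit
        else:
            node, i = node.mid, i + 1
    return False  # null link: search miss


root = None
for word in "now is the time for".split():
    root = tst_insert(root, word)
```

Only the middle link consumes a character of the key; the left and right links behave like a small BST over the characters that actually occur at that position.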

Existence TST • Example keys: now is the time for • [Figure: existence TST for these keys, with n at the root, i to its left, and t to its right]

Ternary Search Trie (TST) • Prop. 15.7: A search or insertion in a full TST requires time proportional to the key length. The number of links in a TST is at most three times the number of characters in all the keys.

Ternary Search Trie (TST) • Can make more space efficient by • putting keys in leaves at point where prefix is unique, and • eliminating one-way branching as we did in Patricia Tries. • Can compromise between speed and space by having a large branch at the root (R or R^2 links) and making the rest of the trie a regular TST. • Works well if first char(s) well-distributed

Ternary Search Trie (TST) • Nice for practical use • Adapt to non-uniformity often seen • Though character set may be large, often only a few are used, or are used after a particular prefix • Don’t make many links we don’t need • Structured format keys • May have many symbols used • But only a few at each part of key

Ternary Search Trie (TST) • Nice for practical use • Search misses are really fast! • Can adapt for partial match searches • “Don’t care” characters in search key • Can adapt for “almost match” searches • All but (any) one character match • Access bytes or larger symbols rather than bits (like Patricia tries), which are often better supported/efficient, or more natural to the keys
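A partial-match search with "don't care" characters might look like the sketch below. It assumes an existence TST whose keys contain neither "\0" (the sentinel) nor "*" (used here as the don't-care symbol); the function names and the choice of "*" are assumptions for illustration.

```python
END = "\0"  # assumed NULLdigit sentinel


class Node:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.mid = self.right = None


def tst_insert(node, key, i=0):
    ch = key[i] if i < len(key) else END
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.left = tst_insert(node.left, key, i)
    elif ch > node.ch:
        node.right = tst_insert(node.right, key, i)
    elif ch != END:
        node.mid = tst_insert(node.mid, key, i + 1)
    return node


def tst_match(node, pattern, i=0, prefix=""):
    """Yield stored keys matching pattern, where '*' matches any one character."""
    if node is None:
        return
    p = pattern[i] if i < len(pattern) else END
    if p == "*" or p < node.ch:
        yield from tst_match(node.left, pattern, i, prefix)
    if p == END:
        if node.ch == END:
            yield prefix  # pattern and key end together
    elif node.ch != END and (p == "*" or p == node.ch):
        yield from tst_match(node.mid, pattern, i + 1, prefix + node.ch)
    if p == "*" or p > node.ch:
        yield from tst_match(node.right, pattern, i, prefix)


root = None
for word in "now is the time for".split():
    root = tst_insert(root, word)

three_letter = list(tst_match(root, "***"))
```

A wildcard simply explores all three links instead of one, so the search still never retraces a character of the pattern.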

Text-String-Index • Recall String Index built with BST with string pointers into a large text • Consider each position in text to be start of a string key that runs to the end of the text • Build a symbol table with these keys • Keys are all different (lengths alone suffice) • Most are very long • Suffix Tree = search tree for this

Text-String-Index • BSTs are simple and work well for suffix trees • Not likely to be a worst-case BST • Patricia tries designed to do this! • Need to have bit-level access • Fast on misses • TSTs • Simple, take advantage of byte ops • Can solve more complex problems • Can change == to mean “prefix”

Text-String-Index • If text is static, why not use Binary Search? • Fast • No need to support insert/delete • Uses less memory (fewer links/pointers) • But TSTs have some advantages • Never retrace steps • Support other operations • Can also build FSM • But that is better for linear search of new text
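For the static-text case, the binary search option can be sketched with a sorted list of suffix start positions. This is an illustration only: sorting by slices copies each suffix and is fine for small texts, not for large ones, and the names `build_suffix_index` and `find_occurrence` are not from the slides.

```python
def build_suffix_index(text):
    """Start positions of all suffixes of text, in lexicographic suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text, index, pattern):
    """Return one position where pattern occurs in text, or -1 if it does not."""
    m = len(pattern)
    lo, hi = 0, len(index)
    while lo < hi:  # find the first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        start = index[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(index) and text.startswith(pattern, index[lo]):
        return index[lo]
    return -1


text = "abracadabra"
index = build_suffix_index(text)
pos = find_occurrence(text, index, "cad")
```

The index stores one integer per text position and no links, which is the memory advantage the slide refers to.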

String Search • If problem is to look for a particular string s in a large text t • Naïve method: • Search t linearly for s[0] • When match found at t[i], • Match s[j] with t[i+j] for j = 1 to |s|-1 • If all |s| chars match, have a match! • Else go back to searching t at t[i+1] • Time? • |s| times |t| - not good
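The naive method above can be written out as a short sketch. The function name `naive_search` is illustrative.

```python
def naive_search(s, t):
    """Return the first index i with t[i:i+len(s)] == s, or -1 if there is none."""
    for i in range(len(t) - len(s) + 1):
        j = 0
        # Match s[j] against t[i+j] until a mismatch or the end of s.
        while j < len(s) and t[i + j] == s[j]:
            j += 1
        if j == len(s):
            return i  # all |s| characters matched
        # Otherwise resume the scan at t[i+1].
    return -1


first = naive_search("abraca", "xxabracadabra")
```

The inner loop can run up to |s| steps for each of about |t| starting positions, which gives the |s| times |t| worst case.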

FSM-based String Search • Fast way to look for a particular string s in one or more (large) texts: • Build FSM for search string • States represent prefix matched • Transition either extends match or • Fails to longest suffix of what has been seen that is a prefix of s • Can also build for multiple search strings

Finite State Machine • a.k.a. Finite State Automaton (FSA) • Set of states S: represented as nodes in graph • Set of input symbols Σ: labels on directed edges • Transition function δ: for state and input, gives next state • Initial state q0: where to start • Set of final states F: subset of S for “accept” • Example: Σ = {a,b,c,d}, F = {q1,q2}, δ(q1,c) = q3 • [Figure: state diagram with start state q0 and states q1, q2, q3; each edge is a transition labeled with its input symbols]

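FSM-based String Search • Search for abraca • Build recognizer skeleton • Add suffix-is-prefix links • Add failure links • Is that all of them? • [Figure: recognizer for abraca; states are labeled with the prefix matched so far (start, a, ab, abr, abra, abrac, abraca), abraca is the final state, the skeleton edges spell a, b, r, a, c, a, and the remaining edges fall back to shorter matched prefixes or to the start state]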

FSM-based String Search • Linear time in |s| to build FSM for s • Linear time (in |t|) to search large text t for all instances of s • Can’t hope for better than that! • What about searching for more than one string? • Build FSM for all the strings! • Linear time in sum of string lengths to build FSM • Linear time in |t| to search all of t for all strings
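One way to build such an FSM for a single search string is the table-driven construction sketched below, where state j means "the first j characters of s are matched". This is an illustrative version, assuming a non-empty search string; its build cost is proportional to |s| times the number of distinct characters in s. The names `build_fsm` and `find_all` are not from the slides.

```python
def build_fsm(s):
    """Return a list of transition dicts: fsm[state][char] -> next state."""
    alphabet = set(s)
    m = len(s)
    fsm = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    fsm[0][s[0]] = 1
    x = 0  # state reached by feeding s[1:j]: the longest suffix that is a prefix
    for j in range(1, m):
        for c in alphabet:
            fsm[j][c] = fsm[x][c]  # on mismatch, behave like the fallback state
        fsm[j][s[j]] = j + 1  # on match, extend the matched prefix
        x = fsm[x][s[j]]
    for c in alphabet:
        fsm[m][c] = fsm[x][c]  # keep going after a hit to find overlapping matches
    return fsm


def find_all(s, t):
    """Return the start positions of all occurrences of s in t."""
    fsm = build_fsm(s)
    state, hits = 0, []
    for i, c in enumerate(t):
        # Characters that do not occur in s send the machine back to the start.
        state = fsm[state].get(c, 0)
        if state == len(s):
            hits.append(i - len(s) + 1)
    return hits


hits = find_all("abraca", "abracabraca")
```

Each text character causes exactly one table lookup, so the scan is linear in |t| and never backs up in the text.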

Summary • Radix search • Digital Search Trees • Tries • Patricia Tries • Multiway tries and TSTs • Text string algorithms • FSMs for fast string matching
