Learn about binary trie data structure for efficient searching with variable-length keys and advantages like space-saving and independent structure. Explore examples, PATRICIA trie, sistrings, and PAT tree for flexible search. Understand prefix and proximity searching techniques.
CS533 Information Retrieval Dr. Michal Cutler Lecture #16 March 24, 1999
Binary Trie • A binary tree • Two types of nodes: branch and element • A branch node has two pointers, to the left and right children • An element node has the data
Binary Trie • For variable-length keys, no key may be a prefix of another key • Add a termination character to each key to guarantee this
Advantages • Keys may be of variable length • Search traverses a path to a leaf using only digit extraction; only one comparison of two keys is needed, at the leaf
Advantages • Can save space • The structure of a trie is independent of the order of insertion
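Example of a binary trie • Figure: a binary trie branching on bit 1, bit 2, bit 3 and bit 4 • Keys stored: 0000, 0001, 0010, 0011, 1000, 1001, 1100 • Key 1100 is already unique after bit 2, so its element node sits at the bit 3 level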
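The branch/element structure and the order-independence claim can be sketched as follows. This is a minimal illustration, not the lecture's code: the names `Branch`, `Element`, `insert`, `search` and `shape` are hypothetical, keys are assumed to be bit strings of equal length (so no key is a prefix of another), and an element node is placed as soon as a key becomes unique, as in the example figure.

```python
class Branch:
    """Branch node: two child pointers, no data."""
    def __init__(self):
        self.left = None
        self.right = None

class Element:
    """Element node: holds the key (the data)."""
    def __init__(self, key):
        self.key = key

def insert(node, key, depth=0):
    """Insert a bit string; assumes no key is a prefix of another key."""
    if node is None:
        return Element(key)
    if isinstance(node, Element):
        if node.key == key:
            return node
        # Two keys collide here: push both below a new branch node.
        branch = insert(Branch(), node.key, depth)
        return insert(branch, key, depth)
    if key[depth] == "0":
        node.left = insert(node.left, key, depth + 1)
    else:
        node.right = insert(node.right, key, depth + 1)
    return node

def search(node, key):
    """Follow bits to a leaf (digit extraction only), then compare once."""
    depth = 0
    while isinstance(node, Branch):
        node = node.left if key[depth] == "0" else node.right
        depth += 1
    return node is not None and node.key == key

def shape(node):
    """Nested-tuple view of the trie, used to compare two tries."""
    if node is None:
        return None
    if isinstance(node, Element):
        return node.key
    return (shape(node.left), shape(node.right))

KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root
```

Building the trie from `KEYS` and from `reversed(KEYS)` yields the same shape, which is the "independent of the order of insertion" property.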
Tries (multiway) • Natural extension of binary tries • Tree of degree m ≥ 2 in which branching at any level is determined by a prefix or a suffix of the key
Tries (multiway) • Contains two types of nodes: branch and element nodes • Sometimes suffix-based tries are shallower than prefix-based ones
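Prefix based trie • Figure: the root branches on the first letter (A, B, ..., G, ...); below G the trie branches on O and U, and below O on D and S • Keys stored: gull, god, gosh
Suffix based trie • Figure: the root branches on the last letter (A, B, D, H, L, ...) • Keys stored: gosh (under H), gull (under L), god (under D)
Trie with termination character • Figure: a trie on the letters t, o, g with a termination character b • Keys stored: to, together • The termination character separates "to" from the longer key "together"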
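A multiway trie with a termination character can be sketched with nested dictionaries. This is a simplified illustration: the names `END`, `trie_insert` and `trie_contains` are hypothetical, `"$"` is an assumed termination character, and every character gets its own level (element nodes are not hoisted to the point of uniqueness).

```python
END = "$"  # assumed termination character; must not occur inside keys

def trie_insert(root, word):
    """Insert word + END so that no stored key is a prefix of another."""
    node = root
    for ch in word + END:
        node = node.setdefault(ch, {})

def trie_contains(root, word):
    """True only if the whole word, including END, is on a path."""
    node = root
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True

root = {}
for w in ["to", "together"]:
    trie_insert(root, w)
```

Without the termination character, "to" would end at an internal node of the path for "together" and could not be told apart from a mere prefix such as "tog".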
Compressed binary trie - PATRICIA • Branch nodes with one non null pointer are eliminated • Bit position (or displacement) added to branch nodes
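Compressed binary trie - PATRICIA • A PATRICIA with n leaves has n-1 internal nodes • The expected height of a random tree is O(log₂ n)
Example of a binary trie • Figure: a binary trie branching on bit 1, bit 2, bit 3 and bit 4 • Keys stored: 0000, 0001, 0010, 0011, 1000, 1001, 1100
Example of a PATRICIA • Figure: the same keys in a PATRICIA; each branch node is labeled with the bit position it tests • Root tests bit 1 • Left: a node testing bit 3 with two children testing bit 4 (leaves 0000, 0001 and 0010, 0011) • Right: a node testing bit 2 with a child testing bit 4 (leaves 1000, 1001) and the leaf 1100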
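The compression step can be sketched by building the tree top-down from the whole key set: at each level, skip bit positions on which all remaining keys agree, and record the first position that splits them. The function names are hypothetical, keys are assumed to be distinct bit strings of equal length, and this batch construction is an illustrative substitute for incremental insertion.

```python
def build_patricia(keys, start=0):
    """Return a leaf (the key) or a tuple (bit, left, right).

    `bit` is the 1-based bit position tested at the branch node.
    Assumes distinct keys of equal length.
    """
    if len(keys) == 1:
        return keys[0]
    bit = start
    # Skip positions where all keys agree: these would be the branch
    # nodes with one null pointer that PATRICIA eliminates.
    while len({k[bit] for k in keys}) == 1:
        bit += 1
    left = [k for k in keys if k[bit] == "0"]
    right = [k for k in keys if k[bit] == "1"]
    return (bit + 1,
            build_patricia(left, bit + 1),
            build_patricia(right, bit + 1))

def count_nodes(node):
    """Return (number of leaves, number of internal nodes)."""
    if isinstance(node, str):
        return (1, 0)
    _, left, right = node
    ll, li = count_nodes(left)
    rl, ri = count_nodes(right)
    return (ll + rl, li + ri + 1)

KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]
tree = build_patricia(KEYS)
```

For the seven example keys, `count_nodes(tree)` gives 7 leaves and 6 internal nodes, matching the n-1 property.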
A sistring • A sistring is a semi-infinite string • A text is padded with an infinite number of null characters • No two sistrings are equal
Sistrings of a text • once there was a wizard he lived in Africa… • nce there was a wizard he lived in Africa… • ce there was a wizard he lived in Africa…
Sistrings of a text (word based) • once there was a wizard he lived in Africa… • there was a wizard he lived in Africa… • was a wizard he lived in Africa…
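Both kinds of sistrings can be enumerated in a few lines. This is a sketch with hypothetical function names; a real index would keep only the starting positions (pointers into the text) rather than copies of the suffixes, and the null padding is left implicit.

```python
import re

def char_sistrings(text):
    """One sistring per character position."""
    return [text[i:] for i in range(len(text))]

def word_sistrings(text):
    """One sistring per word start (words are runs of non-whitespace)."""
    starts = [m.start() for m in re.finditer(r"\S+", text)]
    return [text[i:] for i in starts]

TEXT = "once there was a wizard"
```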
The PAT tree • A PAT tree is a PATRICIA tree constructed over sistrings of a text
The PAT tree • PAT trees enable flexible and efficient search including: • phrases of any length • prefix search • range search • The result of a search is • a sistring, a tree, or a forest
The nodes of a PAT tree • Leaf nodes contain • a pointer to the sistring in the text
The nodes of a PAT tree • Internal nodes contain • the pointers to the left and right subtrees, • the bit position and • possibly the number of sistrings in their subtrees
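Generation of a PAT tree • Text: 01100100010111… with bit positions numbered 1, 2, 3, … • Internal nodes are labeled (bit number, number of sistrings) • Figure: the trees after inserting sistrings 1, 2 and 3; the root is labeled 1,2 and then 1,3, and a node 2,2 is added over sistrings 2 and 3
Generation of a PAT tree • Text: 01100100010111… • Figure: the trees after inserting sistring 4 (root 1,4) and sistring 5 (root 1,5); the left subtree grows from a node 2,2 to a node 2,3 with a new node 3,2, and the right subtree (node 2,2) stays the same
Generation of a PAT tree • Text: 01100100010111… • Figure: the trees after inserting sistring 6 (root 1,6; a node 4,2 is added over sistrings 6 and 3) and sistring 7 (root 1,7; the left subtree root becomes 2,4 with a second node 3,2); subtrees not affected by an insertion are marked "same"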
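A PAT tree - first 8 sistrings • Text: 01100100010111… • Root 1,8 • Left subtree 2,5: node 3,3 (leaf 7 and node 5,2 with leaves 4 and 8) and node 3,2 (leaves 5 and 1) • Right subtree 2,3: node 4,2 (leaves 6 and 3) and leaf 2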
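The tree for the first 8 sistrings can be reproduced with a short sketch. Assumptions: the function name `build_pat` is hypothetical, the tree is built in one batch rather than by successive insertions as on the slides, only the 14 bits shown on the slides are used as the text, and the text is long enough that no comparison runs past its end.

```python
TEXT = "01100100010111"  # the bits shown on the slides

def build_pat(text, indexes, pos=1):
    """Return a leaf (1-based sistring index) or (bit, count, left, right).

    Bit `pos` (1-based) of sistring i is text[i + pos - 2].
    """
    if len(indexes) == 1:
        return indexes[0]
    # Skip bit positions on which all sistrings in this subtree agree.
    while len({text[i + pos - 2] for i in indexes}) == 1:
        pos += 1
    left = [i for i in indexes if text[i + pos - 2] == "0"]
    right = [i for i in indexes if text[i + pos - 2] == "1"]
    return (pos, len(indexes),
            build_pat(text, left, pos + 1),
            build_pat(text, right, pos + 1))

tree = build_pat(TEXT, list(range(1, 9)))
```

Each internal node carries the (bit number, number of sistrings) label used in the figures, for example `(1, 8, ...)` at the root.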
Prefix search • Find all sistrings with prefix s • A path of branch nodes is traversed, following the bits of s, until: • the prefix is exhausted at an internal node, or • a leaf is reached
Prefix search - internal node • If bits were skipped on the path, compare one sistring in the subtree to the prefix. If they do not match there are no hits. • Otherwise, the number of hits is contained in the node. • The answer is the subtree whose root is the internal node
Prefix search - reached external node • Compare the prefix with the sistring • Clearly the complexity is bounded by the height of the tree
Example of a prefix search • q=101 • Go right, and left • Get to node labeled 4,2 • Exhausted prefix • Skipped bit 3 • sistring 6 starts with 100 • 0 hits
Example of a prefix search • q=00101 • Left, left, right, (skip bit 4) and right • reach sistring 8 • 00101=00101… • A hit
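The two worked examples can be checked with a small sketch of prefix search. The names are hypothetical, the tree is built in one batch from the 14 bits shown on the slides, and nodes are tuples `(bit, count, left, right)` with sistring indexes as leaves.

```python
TEXT = "01100100010111"  # the bits shown on the slides

def build_pat(text, indexes, pos=1):
    """Leaf = sistring index; internal node = (bit, count, left, right)."""
    if len(indexes) == 1:
        return indexes[0]
    while len({text[i + pos - 2] for i in indexes}) == 1:
        pos += 1
    left = [i for i in indexes if text[i + pos - 2] == "0"]
    right = [i for i in indexes if text[i + pos - 2] == "1"]
    return (pos, len(indexes),
            build_pat(text, left, pos + 1),
            build_pat(text, right, pos + 1))

def leaves(node):
    """Sistring indexes in the subtree, left to right."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[2]) + leaves(node[3])

def prefix_search(text, tree, prefix):
    """Return the sistring indexes whose sistring starts with prefix."""
    node = tree
    while isinstance(node, tuple):
        bit = node[0]
        if bit > len(prefix):
            break  # prefix exhausted at an internal node
        node = node[2] if prefix[bit - 1] == "0" else node[3]
    # Bits may have been skipped, so compare one sistring of the subtree.
    i = leaves(node)[0]
    if text[i - 1:i - 1 + len(prefix)] != prefix:
        return []
    return leaves(node)

tree = build_pat(TEXT, list(range(1, 9)))
```

For an internal node the hit count could be read directly from the node's `count` field; the sketch lists the leaves so the result can be inspected.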
Proximity searching • Find s1 at most q bits (characters or words) from (before or after) s2. • find “table” within 4 words from “manners” • Search the PAT tree for s1 • Search the PAT tree for s2
Proximity searching • Sort the smaller answer by sistring-index • Traverse the other (larger, unsorted) answer • For each sistring-index in the unsorted answer, • binary search the sorted one, • if successful print the sistring-index for s1
Analysis • Assume m1 answers for s1 and m2 for s2. Assume m1 < m2 • To find the answers requires: O(m1 log m1) to sort, and O(m2 log m1) to search the sorted array
“s1 within 4 of s2”, m1 < m2 • Pointers to s1: 5 21 13 55, sorted: 5 13 21 55 • Pointers to s2: 2 57 22 99 80 61 • For each pointer to s2, the binary search is for a pointer to s1 s.t. pointer-s2 - 4 <= pointer-s1 <= pointer-s2 + 4 • The hits are 5 55 21
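The sort-then-binary-search step can be sketched with the standard `bisect` module. The function name `within` is hypothetical, and the inputs are assumed to be the two lists of sistring-indexes already returned by the PAT tree searches.

```python
from bisect import bisect_left

def within(small, large, q):
    """Entries of `small` within q positions of some entry of `large`.

    Sorts the smaller answer once, then binary searches it for each
    entry of the larger, unsorted answer.
    """
    ordered = sorted(small)
    hits = []
    for p in large:
        k = bisect_left(ordered, p - q)  # first entry >= p - q
        while k < len(ordered) and ordered[k] <= p + q:
            if ordered[k] not in hits:   # report each entry once
                hits.append(ordered[k])
            k += 1
    return hits

s1 = [5, 21, 13, 55]
s2 = [2, 57, 22, 99, 80, 61]
```

With the example data, `within(s1, s2, 4)` returns the hits in the order they are found while traversing the pointers to s2.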
PAT arrays • Array in which external nodes are ordered lexicographically by their sistrings. • For our PAT tree example we would get the array 7 4 8 5 1 6 3 2.
PAT arrays • Same order that would be generated by preorder traversal of the PAT tree
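The example array can be reproduced by sorting the sistring indexes by their sistrings. This is an in-memory sketch with a hypothetical function name, using only the 14 bits shown on the slides as the text.

```python
TEXT = "01100100010111"  # the bits shown on the slides

def pat_array(text, n):
    """1-based indexes of the first n sistrings, in lexicographic order."""
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])

PAT = pat_array(TEXT, 8)
```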
Searching the PAT array • Binary search to find the index j of the PAT-array s.t. sistr(PAT[j-1]) < s <= sistr(PAT[j]) • This takes O(log n) string comparisons • If s is a prefix of sistr(PAT[j]) there is a hit, otherwise there are no hits
Searching the PAT array • Note that a PAT array does not contain “size of subtree”. • To find the number of hits, another search needs to be done for k, s.t. sistr(PAT[k]) <= s < sistr(PAT[k+1]), comparing only the first |s| characters of each sistring • Number of hits is k-j+1
Range search • For the range from r1 to r2: • Search for the smallest j s.t. sistr(PAT[j-1]) < r1 <= sistr(PAT[j]) • Search for the largest k s.t. sistr(PAT[k]) <= r2 < sistr(PAT[k+1]) • The answer is all sistrings PAT[i] with i = j, ..., k
Range search • The two searches can be combined in a single binary search. • The computation time is at most 2 log n comparisons, and 2 disk accesses for each comparison. • One disk access to get the sistring-index and one to get the sistring
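The two binary searches can be sketched as follows. The function name `prefix_range` is hypothetical, positions in the array are 0-based here, and each sistring is truncated to the length of s before comparing, which is one way to make "s <= sistr" behave as a prefix test.

```python
TEXT = "01100100010111"           # the bits shown on the slides
PAT = [7, 4, 8, 5, 1, 6, 3, 2]    # the example PAT array

def prefix_range(text, pat, s):
    """Return 0-based (j, k) bounding the hits for prefix s, or None."""
    def head(idx):
        # First len(s) characters of sistring idx (1-based).
        return text[idx - 1:idx - 1 + len(s)]

    lo, hi = 0, len(pat)
    while lo < hi:                # first position with head >= s
        mid = (lo + hi) // 2
        if head(pat[mid]) < s:
            lo = mid + 1
        else:
            hi = mid
    j = lo
    lo, hi = j, len(pat)
    while lo < hi:                # first position with head > s
        mid = (lo + hi) // 2
        if head(pat[mid]) <= s:
            lo = mid + 1
        else:
            hi = mid
    k = lo - 1
    return (j, k) if j <= k else None
```

The number of hits is then `k - j + 1`, and the hits themselves are `pat[j:k + 1]`.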
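A range search over the example array can be sketched with `bisect`. This is an in-memory illustration only: the function name is hypothetical, the sistrings are materialised as a list of strings (a disk-based index would fetch them on demand), and comparisons are plain lexicographic comparisons on the full sistrings, so an upper bound that is a proper prefix of a sistring excludes that sistring.

```python
from bisect import bisect_left, bisect_right

TEXT = "01100100010111"           # the bits shown on the slides
PAT = [7, 4, 8, 5, 1, 6, 3, 2]    # the example PAT array

def range_search(text, pat, r1, r2):
    """Sistring indexes whose sistring lies in [r1, r2], in array order."""
    keys = [text[i - 1:] for i in pat]   # already sorted, by construction
    j = bisect_left(keys, r1)            # first sistring >= r1
    k = bisect_right(keys, r2)           # one past the last sistring <= r2
    return pat[j:k]
```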
Building PAT arrays • If enough memory, array can be built internally. • The memory should be large enough to store both text and index.
Building a large PAT array • Use internal memory to build PAT arrays for the largest chunks of text that fit • Merge the PAT arrays
Merging a small and a large PAT array • The large array of size n2 is stored on disk. • The small one of size n1 is stored in main memory
Merging a small and a large PAT array • Count the number of sistrings sL of the large array that fall between each pair of consecutive sistrings of the small array • Let PATS denote the small array • count[j] = |{sL | sistr(PATS[j-1]) < sL <= sistr(PATS[j])}|
The algorithm (1)
  while sistring-indexes remain in large PAT do
    read next sistring-index, pointing to sL
    do a binary search on PATS to find j s.t. sistr(PATS[j-1]) < sL <= sistr(PATS[j])
    count[j] = count[j] + 1
  endwhile
The algorithm (2)
  for j = 1 to n1 do
    read count[j] sistring-indexes from large array and write on merged-file
    write PATS[j] on merged-file
  endfor
  read count[n1+1] sistring-indexes from large array and write on merged-file
Analysis • Memory large enough to store: text + small PAT-array + count array. • I/O operations - reading the large array twice, O(n2), and writing the merged array, O(n1+n2) • Operations in internal memory O(n2 log n1)
Example • Small PAT-array: 5 6 • Large PAT-array: 4 1 3 2 • Count array: 1 (<=s5), 1 (<=s6), 2 (>s6) • Merged array: 4 5 1 6 3 2
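The two passes of the merge can be sketched in memory. The function name `merge_pat` is hypothetical, lists stand in for the disk-resident large array and the merged file, array positions are 0-based, and the text is the 14 bits shown on the slides.

```python
from bisect import bisect_left
from itertools import islice

TEXT = "01100100010111"  # the bits shown on the slides

def merge_pat(text, small, large):
    """Merge two PAT arrays over the same text; returns (count, merged)."""
    def sistr(i):
        return text[i - 1:]

    small_keys = [sistr(i) for i in small]
    count = [0] * (len(small) + 1)
    # Pass 1: for each sL in the large array, binary search the small
    # array; bisect_left returns how many small sistrings are < sL.
    for i in large:
        count[bisect_left(small_keys, sistr(i))] += 1
    # Pass 2: interleave runs of the large array with the small entries.
    merged = []
    rest = iter(large)
    for j, s in enumerate(small):
        merged.extend(islice(rest, count[j]))
        merged.append(s)
    merged.extend(rest)
    return count, merged

count, merged = merge_pat(TEXT, small=[5, 6], large=[4, 1, 3, 2])
```

The large array is only ever read sequentially, which is what makes the scheme suitable when it lives on disk.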
Building a large PAT array • The text of length n is split into m chunks, so the small file is of length n/m • The large file is successively of length n/m, 2n/m, 3n/m, …, (m-1)n/m • The algorithm does O(nm) I/Os and O(nm log(n/m)) internal operations
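The totals follow from summing the per-merge costs of the earlier Analysis slide over the m-1 merges. The derivation below is a sketch under the assumption that merge step i combines a small array of size n/m with a large array of size i·n/m, and that it costs O(n1+n2) I/Os and O(n2 log n1) internal operations.

```latex
\begin{align*}
\text{I/O}
  &= \sum_{i=1}^{m-1} O\!\left(\frac{n}{m} + \frac{i\,n}{m}\right)
   = O\!\left(\frac{n}{m}\cdot\frac{(m-1)(m+2)}{2}\right)
   = O(nm), \\
\text{internal}
  &= \sum_{i=1}^{m-1} O\!\left(\frac{i\,n}{m}\,\log\frac{n}{m}\right)
   = O\!\left(\frac{n(m-1)}{2}\,\log\frac{n}{m}\right)
   = O\!\left(nm\,\log\frac{n}{m}\right).
\end{align*}
```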