CS533 Information Retrieval Dr. Michal Cutler Lecture #23 May 2, 2000
An alternative data structure to inverted files • In this class: • Binary and multiway tries • PATRICIA and PAT trees (PAT trees are also called suffix trees) • PAT arrays (also called suffix arrays)
Suffix trees and arrays • Suffix arrays and/or trees are an alternative to using inverted files • They enable efficient answers to some complex queries that would be hard to answer with inverted files • Phrase queries of any length, range queries, prefix queries, proximity queries, and sometimes part-of-word queries
Suffix arrays and trees • The main drawbacks of suffix arrays and trees are: • the text must be available at query time • query results are not given in text order • They enable complex queries over databases such as genetic sequences, the Bible, Shakespeare, etc. • Suffix (PAT) arrays are a space efficient implementation of suffix trees (PAT trees)
Binary Trie • A binary trie is a binary tree with: • Two types of nodes: branch and element • A branch node has a left child, a right child, or both • An element node is a leaf, containing the whole key and additional data
Binary Trie • In a trie with variable length keys no key can be a prefix of another key • Adding a termination character to each key ensures this condition (so both “to” and “together” can be stored)
Advantages • Keys may be variable length • Search involves traversing a path to a leaf using only digit extraction of the search key, and choosing a left child for 0 or a right child for 1 • Only one comparison of the search key and the element is needed
Advantages • Can save space • The structure of a trie is independent of the order of insertion
Example of a binary trie • [Figure: binary trie branching on bit 1 through bit 4, with element nodes 0000, 0001, 0010, 0011, 1000, 1001 and 1100] • Search: left child if the bit is 0, right child if it is 1
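The search rule above can be sketched in Python. This is a minimal illustration, not the lecture's code: the class names `Branch` and `Element`, the helpers `insert`, `search` and `shape`, and the choice of fixed-length bit strings as keys are all assumptions made for the example.

```python
class Branch:
    """Branch node: left child for bit 0, right child for bit 1."""
    def __init__(self):
        self.left = None
        self.right = None

class Element:
    """Element (leaf) node holding the whole key and additional data."""
    def __init__(self, key, data=None):
        self.key = key
        self.data = data

def insert(node, key, data=None, depth=0):
    """Insert key below node; return the root of the updated subtree."""
    if node is None:
        return Element(key, data)
    if isinstance(node, Element):
        if node.key == key:
            node.data = data
            return node
        # Two keys meet here: push the old element one level down and retry.
        branch = Branch()
        if node.key[depth] == "0":
            branch.left = node
        else:
            branch.right = node
        return insert(branch, key, data, depth)
    if key[depth] == "0":
        node.left = insert(node.left, key, data, depth + 1)
    else:
        node.right = insert(node.right, key, data, depth + 1)
    return node

def search(root, key):
    """Follow bits to a leaf, then do the single key comparison."""
    node, depth = root, 0
    while isinstance(node, Branch):
        node = node.left if key[depth] == "0" else node.right
        depth += 1
    return node if node is not None and node.key == key else None

def shape(node):
    """Nested-tuple view of the trie, used to compare structures."""
    if node is None:
        return None
    if isinstance(node, Element):
        return node.key
    return (shape(node.left), shape(node.right))

KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]
root = None
for k in KEYS:
    root = insert(root, k)
```

Inserting the same keys in a different order gives the same `shape`, which is the order-independence listed under Advantages.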
Tries (multiway) • Natural extension of binary tries • Tree of degree m ≥ 2 in which branching at any level is determined by either a digit of the prefix of the search key, or a digit of the suffix of the search key
Tries (multiway) • Contains two types of nodes: branch and element nodes • A branch node may have a child for every possible value of a “digit” • Suffix based tries are sometimes less deep than prefix based ones
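Prefix based trie • [Figure: trie branching on leading letters (G, then O or U, then D or S), with element nodes gull, god and gosh]
Suffix based trie • [Figure: trie branching on the last letter (H, L or D), with element nodes gosh, gull and god]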
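Trie with termination character • [Figure: trie over the keys “to” and “together”, branching on t, then o, then either g or the termination character b (a blank); the element nodes are “to b” and “together b”] • Each element node contains the termination character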
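A multiway trie with a termination character can be sketched with nested dictionaries. This is a simplified illustration under stated assumptions: the termination character `END = "$"` and the function names are hypothetical, and unlike the figures above this sketch spells every key out character by character instead of stopping at an element node as soon as the key is unique.

```python
END = "$"  # assumed termination character; must not occur inside a key

def insert(root, word):
    """Add word to the trie; each branch node is a dict of digit -> child."""
    node = root
    for ch in word + END:
        node = node.setdefault(ch, {})

def contains(root, word):
    """True if exactly this word (with its terminator) was inserted."""
    node = root
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True

def words_with_prefix(root, prefix):
    """All stored words that start with prefix, in sorted order."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found = []
    stack = [(node, prefix)]
    while stack:
        cur, text = stack.pop()
        for ch, child in cur.items():
            if ch == END:
                found.append(text)       # terminator marks a complete key
            else:
                stack.append((child, text + ch))
    return sorted(found)

trie = {}
for w in ["to", "together", "god", "gosh", "gull"]:
    insert(trie, w)
```

Because of the terminator, "to" and "together" coexist even though one is a prefix of the other.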
Compressed binary trie - PATRICIA • Branch nodes with one non null pointer are eliminated • Bit position (or displacement) added to branch nodes
Compressed binary trie - PATRICIA • A PATRICIA with n leaves has n-1 internal nodes • The expected height of a random tree is O(log₂ n)
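Example of a binary trie • [Figure: binary trie branching on bit 1 through bit 4, with element nodes 0000, 0001, 0010, 0011, 1000, 1001 and 1100]
Example of a PATRICIA • [Figure: the same keys stored in a PATRICIA; the six branch nodes are labeled with bit positions 1, 2, 3, 4, 4, 4 and the leaves are 0000, 0001, 0010, 0011, 1000, 1001 and 1100]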
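A small sketch of the compression step, assuming distinct fixed-length bit-string keys. The recursive `build` below is an illustrative shortcut (it splits a whole key set at the first bit where the keys differ) rather than the usual incremental PATRICIA insertion; the tuple layout `(bit, left, right)` and all function names are assumptions.

```python
def build(keys, start=0):
    """Return a leaf (the key itself) or a branch (bit, left, right).

    bit is the 1-based position tested at the branch. Positions where
    all keys agree are skipped, so no branch has a single child.
    """
    if len(keys) == 1:
        return keys[0]
    bit = start
    while len({k[bit] for k in keys}) == 1:
        bit += 1                      # skipped bit: every key agrees here
    zeros = [k for k in keys if k[bit] == "0"]
    ones = [k for k in keys if k[bit] == "1"]
    return (bit + 1, build(zeros, bit + 1), build(ones, bit + 1))

def count_internal(node):
    """Number of branch nodes in the tree."""
    if isinstance(node, str):
        return 0
    return 1 + count_internal(node[1]) + count_internal(node[2])

def lookup(node, key):
    """Extract only the stored bit positions, then compare once at the leaf."""
    while not isinstance(node, str):
        bit, left, right = node
        node = left if key[bit - 1] == "0" else right
    return node == key

KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]
tree = build(KEYS)
```

With 7 leaves the tree has 6 branch nodes, and the final comparison in `lookup` is what catches keys that differ only in a skipped bit.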
An application of Patricia • Create a “dictionary” of every string, or of every word, in a text • This is very useful for texts such as the Bible or Shakespeare • Allows searching for prefixes of words, phrases, sentences, etc.
A sistring • The text is viewed as containing a sequence of sistrings • A sistring is a semi-infinite string • The text is virtually padded with an infinite number of null characters • So no two sistrings are equal • Same idea as the terminating character
Sistrings of a text • May be character based as follows: • once there was a wizard he lived in Africa… • nce there was a wizard he lived in Africa… • ce there was a wizard he lived in Africa…
Sistrings of a text (word based) • once there was a wizard he lived in Africa… • there was a wizard he lived in Africa… • was a wizard he lived in Africa…
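Both kinds of sistrings can be listed with a few lines of Python. This is a sketch: the function names are hypothetical, positions are 0-based here, and only the start positions are stored because a sistring is fully determined by where it begins in the text.

```python
import re

def char_sistring_starts(text):
    """One sistring per character position."""
    return list(range(len(text)))

def word_sistring_starts(text):
    """One sistring per word start (a word is a run of non-blank characters)."""
    return [m.start() for m in re.finditer(r"\S+", text)]

def sistring(text, start):
    """The sistring beginning at start; the null padding is left implicit."""
    return text[start:]

TEXT = "once there was a wizard"
char_based = [sistring(TEXT, s) for s in char_sistring_starts(TEXT)]
word_based = [sistring(TEXT, s) for s in word_sistring_starts(TEXT)]
```

Here `char_based[1]` is "nce there was a wizard" and `word_based[2]` is "was a wizard".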
A PAT tree • A PAT tree is a PATRICIA tree constructed over sistrings of a text
The PAT tree • PAT trees enable flexible and efficient search including: • phrases of any length • prefix search • range search • The result of a search is • a sistring, a tree, or a forest
The nodes of a PAT tree • Leaf nodes contain • a pointer to the sistring in the text
The nodes of a PAT tree • Internal nodes contain • the pointers to the left and right subtrees, • the bit position and • possibly the number of sistrings in the tree whose root is the node
Building a PAT tree • The following slide assumes a binary “text”, and shows how a PAT tree is built sistring by sistring • The initial tree is an element node pointing to the sistring that starts at the first character, i.e., 1. • Each node contains: the bit number of the key that must be extracted, and the number of sistrings in the tree
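Generation of a PAT tree • The text: 01100100010111… (sistring numbers 12345678) • Each internal node is labeled with a bit number and a number of sistrings; each leaf is a pointer to a sistring • [Figure: the trees after inserting sistrings 1, 2 and 3: a single leaf 1; then root 1,2 with leaves 1 and 2; then root 1,3 with leaf 1 on the left and node 2,2 over leaves 3 and 2 on the right]
Generation of a PAT tree • [Figure: the trees after inserting sistrings 4 and 5; the roots are labeled 1,4 and 1,5, and the right subtree 2,2 stays the same while the left subtree grows to 2,2 and then 2,3 with node 3,2 below it]
Generation of a PAT tree • [Figure: the trees after inserting sistrings 6 and 7; the roots are labeled 1,6 and 1,7; sistring 6 adds node 4,2 to the right subtree 2,3, and sistring 7 adds a second node 3,2 to the left subtree 2,4]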
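A PAT tree - first 8 sistrings • Text: 01100100010111… (sistring numbers 12345678) • [Figure: PAT tree; internal nodes are labeled bit number,number of sistrings] • Root 1,8: left child 2,5, right child 2,3 • Node 2,5: left child 3,3, right child 3,2 • Node 3,3: left child sistring 7, right child 5,2 • Node 5,2: left child sistring 4, right child sistring 8 • Node 3,2: left child sistring 5, right child sistring 1 • Node 2,3: left child 4,2, right child sistring 2 • Node 4,2: left child sistring 6, right child sistring 3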
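The tree for the first 8 sistrings can be reproduced with a short sketch. As an assumption for brevity, the code builds the tree recursively from the whole set of sistrings instead of inserting them one at a time as on the slides; since the structure does not depend on insertion order, the result is the same. The tuple layout `(bit, count, left, right)` and the function names are hypothetical, and the 14 visible bits of the text are assumed to be enough to tell the 8 sistrings apart.

```python
TEXT = "01100100010111"

def sistring(i):
    """Sistring number i (1-based), as on the slides."""
    return TEXT[i - 1:]

def build(ids, start=0):
    """Leaf: a sistring number. Internal node: (bit, count, left, right)."""
    if len(ids) == 1:
        return ids[0]
    bit = start
    while len({sistring(i)[bit] for i in ids}) == 1:
        bit += 1                      # all sistrings agree: skip this bit
    zeros = [i for i in ids if sistring(i)[bit] == "0"]
    ones = [i for i in ids if sistring(i)[bit] == "1"]
    return (bit + 1, len(ids), build(zeros, bit + 1), build(ones, bit + 1))

def leaves(node):
    """Sistring numbers of a subtree, left to right."""
    if isinstance(node, int):
        return [node]
    return leaves(node[2]) + leaves(node[3])

tree = build(list(range(1, 9)))
```

The root comes out as node 1,8 with subtrees 2,5 and 2,3, and reading the leaves left to right gives 7 4 8 5 1 6 3 2.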
Prefix search • Find all sistrings with prefix s • A path of branch nodes is traversed until: • Internal node corresponding to prefix reached or • A leaf
Prefix search - internal node • If bits were skipped on the way, compare a sistring in the subtree to the prefix (this requires traversing to a leaf). If they do not match there are no hits. • Otherwise, the number of hits is the count stored in the node. • The answer is the subtree whose root is the internal node
Prefix search - reached external node • Compare the prefix with the sistring • Clearly the complexity is bounded by the height of the tree
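Example of a prefix search • q=101 • Bit 1 is 1: go right to the node labeled 2,3 • Bit 2 is 0: go left to the internal node 4,2 • That node tests bit 4, so bit 3 is skipped and the prefix is exhausted • Compare with a sistring of the subtree: sistring 6 starts with 100 • 0 hits
Q=101 • [Figure: the PAT tree of the first 8 sistrings of 01100100010111… with the search path marked: right at the root 1,8 (bit 1 is 1), left at node 2,3 (bit 2 is 0), stopping at node 4,2 with bit 3 skipped]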
Example of a prefix search • q=00101 • Left, left, right, (skip bit 4) and right • reach sistring 8 • 00101=00101… • A hit
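Q=00101 • [Figure: the PAT tree of the first 8 sistrings of 01100100010111… with the search path marked: left at the root 1,8 (0), left at node 2,5 (0), right at node 3,3 (1), bit 4 skipped, right at node 5,2 (1), ending at sistring 8]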
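The two example searches can be traced with the following sketch. It rebuilds the PAT tree of the first 8 sistrings with a recursive helper (an assumption for brevity, not the slides' incremental construction) and then applies the prefix search steps; the names `build`, `leaves` and `prefix_search` are hypothetical.

```python
TEXT = "01100100010111"

def sistring(i):
    return TEXT[i - 1:]               # sistring numbers are 1-based

def build(ids, start=0):
    """Leaf: sistring number. Internal node: (bit, count, left, right)."""
    if len(ids) == 1:
        return ids[0]
    bit = start
    while len({sistring(i)[bit] for i in ids}) == 1:
        bit += 1
    zeros = [i for i in ids if sistring(i)[bit] == "0"]
    ones = [i for i in ids if sistring(i)[bit] == "1"]
    return (bit + 1, len(ids), build(zeros, bit + 1), build(ones, bit + 1))

def leaves(node):
    if isinstance(node, int):
        return [node]
    return leaves(node[2]) + leaves(node[3])

def prefix_search(tree, prefix):
    """Return the sistring numbers whose sistrings start with prefix."""
    node = tree
    # Descend while the node tests a bit that the prefix still provides.
    while isinstance(node, tuple) and node[0] <= len(prefix):
        bit, _count, left, right = node
        node = left if prefix[bit - 1] == "0" else right
    # node is now a leaf, or an internal node past the end of the prefix.
    candidates = leaves(node)
    # One comparison with any sistring of the subtree checks skipped bits.
    if sistring(candidates[0]).startswith(prefix):
        return candidates
    return []

tree = build(list(range(1, 9)))
```

For q=101 the descent stops at node 4,2 and the comparison with sistring 6 fails, so there are no hits; for q=00101 it ends at sistring 8, which matches.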
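Word based PAT tree • Text (word numbers 1 2 3 4 5 6 7 8 9): “once there was a wizard he lived in Africa” • [Figure: PAT tree branching on characters, with root labeled 1,9 and two internal nodes labeled 2,2; the leaves are 1 once, 2 there, 3 was, 4 a, 5 wizard, 6 he, 7 lived, 8 in, 9 Africa]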
Proximity searching • Find s1 at most q units (bits, characters or words) from (before or after) s2. • Example: find “table” within 4 words of “manners” • Search the PAT tree for s1 • Search the PAT tree for s2
Proximity searching • Sort the smaller answer by sistring-index • Traverse the unsorted answer. • For each sistring-index in unsorted answer, • binary search the sorted one, • if successful print sistring-index for s1
Analysis • Assume m1 answers for s1 and m2 for s2. Assume m1 < m2 • To find the answers requires: O(m1 log m1) to sort, and O(m2 log m1) to search the sorted array
Example: “s1 within 4 of s2”, m1<m2 • Pointers to s2 (unsorted): 2 57 22 99 80 61 • Pointers to s1: 5 21 13 55; sorted: 5 13 21 55 • The binary search is for an s1 pointer s.t. pointer-s2 - 4 <= pointer-s1 <= pointer-s2 + 4 • The hits are 5 55 21
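The sort-then-binary-search procedure can be sketched with Python's standard `bisect` module. The function name `within` is hypothetical, and the sketch assumes the two answer sets are already available as lists of sistring indices, with s1 the smaller one.

```python
from bisect import bisect_left

def within(s1_positions, s2_positions, q):
    """Positions of s1 lying at most q units before or after some s2."""
    sorted_s1 = sorted(s1_positions)          # O(m1 log m1)
    hits = []
    for p2 in s2_positions:                   # m2 binary searches
        # First s1 position that is >= p2 - q.
        i = bisect_left(sorted_s1, p2 - q)
        # Report every s1 position up to p2 + q.
        while i < len(sorted_s1) and sorted_s1[i] <= p2 + q:
            hits.append(sorted_s1[i])
            i += 1
    return hits

hits = within([5, 21, 13, 55], [2, 57, 22, 99, 80, 61], 4)
```

On the example data this reports 5, 55 and 21, in the order the s2 pointers are traversed. An s1 position close to several s2 positions would be reported once per s2 in this sketch.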
PAT arrays • Array in which external nodes are ordered lexicographically by their sistrings. • For our PAT tree example we would get the array 7 4 8 5 1 6 3 2.
PAT arrays • Same order that would be generated by preorder traversal of the PAT tree
Searching the PAT array • Binary search to find the index j of the PAT array s.t. sistr(PAT[j-1]) < s <= sistr(PAT[j]) • This takes O(log n) string comparisons • If s matches the beginning of sistr(PAT[j]) there is a hit, otherwise there are no hits
Searching the PAT array • Note that a PAT array does not contain the “size of subtree”. • To find the number of hits another search needs to be done for k, s.t. sistr(PAT[k]) <= s < sistr(PAT[k+1]), comparing only the first |s| characters of each sistring • The number of hits is k-j+1
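The two binary searches can be sketched on the example array. Assumptions in this sketch: indices j and k are 0-based positions in the Python list (the slides do not fix a base), the second search compares only the first len(s) characters of each sistring, and the function name `prefix_range` is hypothetical.

```python
TEXT = "01100100010111"
PAT = [7, 4, 8, 5, 1, 6, 3, 2]        # the example PAT array

def sistr(i):
    return TEXT[i - 1:]               # sistring numbers are 1-based

def prefix_range(s):
    """Return (j, k), the inclusive span of entries starting with s, or None."""
    # First search: smallest j with s <= sistr(PAT[j]).
    lo, hi = 0, len(PAT)
    while lo < hi:
        mid = (lo + hi) // 2
        if sistr(PAT[mid]) < s:
            lo = mid + 1
        else:
            hi = mid
    j = lo
    # Second search: first entry whose first len(s) characters exceed s.
    lo, hi = j, len(PAT)
    while lo < hi:
        mid = (lo + hi) // 2
        if sistr(PAT[mid])[:len(s)] <= s:
            lo = mid + 1
        else:
            hi = mid
    k = lo - 1
    return (j, k) if j <= k else None

span = prefix_range("00")
hit_count = span[1] - span[0] + 1
```

For s = 00 the span covers the entries 7, 4 and 8, so the number of hits is k-j+1 = 3; for s = 101 no entry qualifies.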
Range search (r1-r2) • Search for the smallest j s.t. sistr(PAT[j-1]) < r1 <= sistr(PAT[j]) • Search for the largest k s.t. sistr(PAT[k]) <= r2 < sistr(PAT[k+1]) • The answer is all sistrings PAT[i] with index i=j,..,k
Range search • The two searches can be combined in a single binary search. • The computation time is at most 2 log n comparisons, and 2 disk accesses for each comparison. • One disk access to get the sistring-index and one to get the sistring
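A range search over the example array can be sketched as two binary searches. Assumptions: the comparisons are plain lexicographic string comparisons, exactly as the inequalities above are written, so a sistring that merely starts with r2 but is longer counts as greater than r2; the helper names are hypothetical, and the sketch keeps the two searches separate instead of combining them.

```python
TEXT = "01100100010111"
PAT = [7, 4, 8, 5, 1, 6, 3, 2]        # the example PAT array

def sistr(i):
    return TEXT[i - 1:]               # sistring numbers are 1-based

def first_index(pred):
    """Smallest array index whose sistring satisfies pred.

    pred must be False for a (possibly empty) front part of the sorted
    array and True for the rest, which holds for the comparisons below.
    """
    lo, hi = 0, len(PAT)
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(sistr(PAT[mid])):
            hi = mid
        else:
            lo = mid + 1
    return lo

def range_search(r1, r2):
    """Sistring numbers whose sistrings s satisfy r1 <= s <= r2."""
    j = first_index(lambda s: s >= r1)
    k = first_index(lambda s: s > r2) - 1
    return PAT[j:k + 1]

result = range_search("001", "011")
```

Between 001 and 011 this returns the sistrings 4, 8 and 5. Sistring 1 (0110…) is excluded under the plain comparison because it is longer than 011 and therefore sorts after it.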
Building PAT arrays • If there is enough memory, the array can be built internally. • The memory should be large enough to store both the text and the index. • An “indirect sort” is performed on the sistrings • O(n log n)
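An in-memory build by indirect sort can be sketched in one call to `sorted`: the sistring numbers are what gets sorted, and the text is only consulted to compare them. The function name `build_pat_array` is hypothetical, and slicing the text for each key is a simplification that copies characters; a careful implementation would compare in place.

```python
def build_pat_array(text, starts):
    """Sort the 1-based sistring numbers by the sistrings they point to."""
    return sorted(starts, key=lambda i: text[i - 1:])

TEXT = "01100100010111"
pat = build_pat_array(TEXT, range(1, 9))
```

For the first 8 sistrings of the example text this yields 7 4 8 5 1 6 3 2, the array given earlier.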