
CS533 Information Retrieval

Learn about the binary trie data structure for efficient searching with variable-length keys, and its advantages, such as space savings and a structure that is independent of insertion order. Explore examples, the PATRICIA trie, sistrings, and the PAT tree for flexible search. Understand prefix and proximity searching techniques.





Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #16 March 24, 1999

  2. Binary Trie • A binary tree • Two types of nodes: branch and element • A branch node has two pointers, to the left and right children • An element node has the data

  3. Binary Trie • With variable-length keys, no key may be a prefix of another key • Adding a termination character to every key deals with this problem

  4. Advantages • Keys may have different lengths • A search traverses a path to a leaf using only digit extraction, followed by a single comparison of two keys

  5. Advantages • Can save space • The structure of a trie is independent of the order of insertion

  6. Example of a binary trie • Keys: 0000, 0001, 0010, 0011, 1000, 1001, 1100 • [Figure: the trie for these keys, with branch levels labeled bit 1, bit 2, bit 3, bit 4]
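The insertion and search behavior described on the preceding slides can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: it assumes fixed-length bit-string keys (so no key is a prefix of another), and the names `Branch`, `insert`, `search`, `build`, and `shape` are hypothetical.

```python
class Branch:
    """Branch node: two child pointers and no data."""
    def __init__(self):
        self.children = [None, None]  # index 0 = left (bit 0), 1 = right (bit 1)

def insert(root, key):
    """Insert a bit string; element nodes are stored as plain strings."""
    node = root
    for depth, bit in enumerate(key):
        b = int(bit)
        child = node.children[b]
        if child is None:
            node.children[b] = key          # free slot: the key becomes an element node
            return
        if isinstance(child, str):
            if child == key:
                return                      # key already present
            # Two keys share this path: push the existing element one level down.
            pushed = Branch()
            pushed.children[int(child[depth + 1])] = child
            node.children[b] = pushed
            child = pushed
        node = child

def search(root, key):
    """Follow bits to an element node, then do the single key comparison."""
    node = root
    for bit in key:
        node = node.children[int(bit)]
        if node is None:
            return False
        if isinstance(node, str):
            return node == key
    return False

def build(keys):
    root = Branch()
    for k in keys:
        insert(root, k)
    return root

def shape(node):
    """Nested-tuple view of the trie, handy for comparing two tries."""
    if isinstance(node, Branch):
        return tuple(shape(c) for c in node.children)
    return node

KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]
root = build(KEYS)
print(search(root, "1001"), search(root, "1111"))
```

Building the trie from `KEYS` in reverse order gives the same `shape`, which illustrates the claim that the structure does not depend on insertion order.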

  7. Tries (multiway) • Natural extension of binary tries • Tree of degree m ≥ 2 in which branching at any level is determined by a prefix or suffix of the key

  8. Tries (multiway) • Contains two types of nodes: branch and element nodes • Sometimes suffix trees are less deep

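  9. Prefix based trie • Words: gull, god, gosh • [Figure: the root branches on the first letter (A, B, ..., G); below G the trie branches on the second letter (O, U), and below O on the third letter (D, S); U leads to gull, D to god, S to gosh]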

  10. Suffix based trie • Words: gosh, gull, god • [Figure: the root branches on the last letter; D leads to god, H to gosh, L to gull]

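  11. Trie with termination character • Words: to, together • [Figure: the path t, o is followed by a branch on the termination character b, which leads to "to", and a branch on g, which leads to "together"]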
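A small Python sketch can make the role of the termination character concrete. It assumes a dictionary-based prefix trie with one letter per level (no early element nodes) and uses "$" as the termination character where the slide uses a blank; `END`, `insert`, and `contains` are hypothetical names.

```python
END = "$"  # assumed termination character; must not occur inside any word

def insert(trie, word):
    """Walk or create one dictionary level per character of word + END."""
    node = trie
    for ch in word + END:
        node = node.setdefault(ch, {})

def contains(trie, word):
    """A word is present only if its path ends with the termination character."""
    node = trie
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = {}
for w in ["to", "together"]:
    insert(trie, w)
print(contains(trie, "to"), contains(trie, "tog"))
```

Without `END`, "to" would end at an interior node on the path to "together" and could not be told apart from a mere prefix such as "tog".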

  12. Compressed binary trie - PATRICIA • Branch nodes with one non null pointer are eliminated • Bit position (or displacement) added to branch nodes

  13. Compressed binary trie - PATRICIA • A PATRICIA tree with n leaves has n-1 internal nodes • The expected height of a random tree is O(log₂ n)

  14. Example of a binary trie • Keys: 0000, 0001, 0010, 0011, 1000, 1001, 1100 • [Figure: the trie for these keys, with branch levels labeled bit 1, bit 2, bit 3, bit 4]

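  15. Example of a PATRICIA • Keys: 0000, 0001, 0010, 0011, 1000, 1001, 1100 • [Figure: the compressed trie for these keys; its six branch nodes are labeled with the bit positions 1, 2, 3, 4, 4, 4]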
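The compression step can be sketched in Python as below. This is a simplified, batch-built variant that keeps keys in leaves, as the slides describe, rather than a full PATRICIA implementation; the tuple layout `(bit_position, left, right)` and the function names are assumptions made for the illustration.

```python
def build(keys, start=0):
    """Return a leaf (the key string) or a tuple (bit_position, left, right).

    Bit positions are 1-based. Positions on which all keys agree are skipped,
    so no branch node ends up with a single child.
    """
    if len(keys) == 1:
        return keys[0]
    pos = start
    while len({k[pos] for k in keys}) == 1:   # skip bits shared by every key
        pos += 1
    left = [k for k in keys if k[pos] == "0"]
    right = [k for k in keys if k[pos] == "1"]
    return (pos + 1, build(left, pos + 1), build(right, pos + 1))

def search(node, key):
    """Test only the stored bit positions, then compare once at the leaf."""
    while not isinstance(node, str):
        pos, left, right = node
        node = right if key[pos - 1] == "1" else left
    return node == key

def count_internal(node):
    if isinstance(node, str):
        return 0
    return 1 + count_internal(node[1]) + count_internal(node[2])

KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]
tree = build(KEYS)
print(tree)
print(count_internal(tree))   # n - 1 for n distinct keys
```

For the seven example keys the root tests bit 1, its left child skips bit 2 and tests bit 3, and there are six internal nodes, in line with the n-1 property.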

  16. A sistring • A sistring is a semi-infinite string • A text is padded with an infinite number of null characters • No two sistrings are equal

  17. Sistrings of a text • once there was a wizard he lived in Africa… • nce there was a wizard he lived in Africa… • ce there was a wizard he lived in Africa…

  18. Sistrings of a text (word based) • once there was a wizard he lived in Africa… • there was a wizard he lived in Africa… • was a wizard he lived in Africa…
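As a rough illustration of the two kinds of sistrings listed above, the following Python sketch enumerates them for a short text. The function names are hypothetical, and the suffixes are materialized only for display; an index would normally store just the starting positions.

```python
import re

def char_sistrings(text):
    """One sistring per character position: (start index, suffix)."""
    return [(i, text[i:]) for i in range(len(text))]

def word_sistrings(text):
    """One sistring per word start: (start index, suffix)."""
    return [(m.start(), text[m.start():]) for m in re.finditer(r"\S+", text)]

text = "once there was a wizard"
for start, s in char_sistrings(text)[:3]:
    print(start, s)
for start, s in word_sistrings(text)[:3]:
    print(start, s)
```

Because each sistring starts at a different position and the text is conceptually padded with nulls, the start index alone identifies a sistring.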

  19. The PAT tree • A PAT tree is a PATRICIA tree constructed over sistrings of a text

  20. The PAT tree • PAT trees enable flexible and efficient search including: • phrases of any length • prefix search • range search • The result of a search is • a sistring, a tree, or a forest

  21. The nodes of a PAT tree • Leaf nodes contain • a pointer to the sistring in the text

  22. The nodes of a PAT tree • Internal nodes contain • the pointers to the left and right subtrees, • the bit position and • possibly the number of sistrings in their subtrees

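  23. Generation of a PAT tree • Text: 01100100010111… (bit positions 12345678) • Each internal node is labeled with its bit number and its number of sistrings • [Figure: the trees after inserting sistrings 1, 2, and 3; the roots of the last two are labeled 1,2 and 1,3]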

  24. Generation of a PAT tree • Text: 01100100010111… • [Figure: the trees after inserting sistrings 4 and 5, with roots labeled 1,4 and 1,5; unchanged subtrees are marked "same"]

  25. Generation of a PAT tree • Text: 01100100010111… (bit positions 12345678) • [Figure: the trees after inserting sistrings 6 and 7, with roots labeled 1,6 and 1,7; unchanged subtrees are marked "same"]

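  26. A PAT tree - first 8 sistrings • Text: 01100100010111… (bit positions 12345678) • Internal nodes are labeled bit number, number of sistrings • Root 1,8 has children 2,5 (left) and 2,3 (right) • 2,5 has children 3,3 and 3,2 • 3,3 has leaf 7 and node 5,2, whose leaves are 4 and 8 • 3,2 has leaves 5 and 1 • 2,3 has node 4,2, whose leaves are 6 and 3, and leaf 2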
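The PAT tree over the first 8 sistrings can be reproduced with a short Python sketch. Two assumptions are worth flagging: the tree is built in one batch by recursive splitting rather than by the incremental insertion shown on the slides (the resulting trie is the same, since its structure does not depend on insertion order), and the null padding is represented by "0" bits. The tuple layout and function names are hypothetical.

```python
TEXT = "01100100010111"

def sistring(i):
    """Sistring starting at 1-based bit position i.

    Assumption: the null padding is modeled as '0' bits.
    """
    return TEXT[i - 1:] + "0" * len(TEXT)

def build(indexes, start=0):
    """Return a leaf (sistring index) or (bit_number, count, left, right)."""
    if len(indexes) == 1:
        return indexes[0]
    pos = start
    while len({sistring(i)[pos] for i in indexes}) == 1:  # skip shared bits
        pos += 1
    left = [i for i in indexes if sistring(i)[pos] == "0"]
    right = [i for i in indexes if sistring(i)[pos] == "1"]
    return (pos + 1, len(indexes), build(left, pos + 1), build(right, pos + 1))

def leaves(node):
    """Leaf indexes from left to right."""
    if isinstance(node, tuple):
        return leaves(node[2]) + leaves(node[3])
    return [node]

tree = build(list(range(1, 9)))
print(tree)
print(leaves(tree))
```

Each tuple starts with the same "bit number, number of sistrings" pair that labels the internal nodes on the slide, and the left-to-right leaf order is the lexicographic order of the sistrings.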

  27. Prefix search • Find all sistrings with prefix s • A path of branch nodes is traversed until: • Internal node reached or • A leaf

  28. Prefix search - internal node • If bits were skipped on the way down, compare one sistring in the subtree with the prefix; if they do not match, there are no hits • Otherwise, the number of hits is the count stored in the node • The answer is the subtree whose root is the internal node

  29. Prefix search - reached external node • Compare the prefix with the sistring • Clearly the complexity is bounded by the height of the tree

  30. Example of a prefix search • q=101 • Go right, and left • Get to node labeled 4,2 • Exhausted prefix • Skipped bit 3 • sistring 6 starts with 100 • 0 hits

  31. Example of a prefix search • q=00101 • Left, left, right, (skip bit 4) and right • reach sistring 8 • 00101=00101… • A hit
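The two worked searches can be checked with the Python sketch below. It rebuilds the example tree in batch form so that it is self-contained, models null padding as "0" bits, and, as a simplification, always performs the final comparison instead of only when bits were skipped. All function names and the tuple layout `(bit_number, count, left, right)` are assumptions.

```python
TEXT = "01100100010111"

def sistring(i):
    """Sistring at 1-based position i; null padding modeled as '0' bits (assumption)."""
    return TEXT[i - 1:] + "0" * len(TEXT)

def build(indexes, start=0):
    """Leaf = sistring index; internal node = (bit_number, count, left, right)."""
    if len(indexes) == 1:
        return indexes[0]
    pos = start
    while len({sistring(i)[pos] for i in indexes}) == 1:
        pos += 1
    left = [i for i in indexes if sistring(i)[pos] == "0"]
    right = [i for i in indexes if sistring(i)[pos] == "1"]
    return (pos + 1, len(indexes), build(left, pos + 1), build(right, pos + 1))

def leaves(node):
    if isinstance(node, tuple):
        return leaves(node[2]) + leaves(node[3])
    return [node]

def any_leaf(node):
    """Any sistring of the subtree will do for the final comparison."""
    while isinstance(node, tuple):
        node = node[2]
    return node

def prefix_search(root, prefix):
    node = root
    # Descend while the node tests a bit that the prefix still has.
    while isinstance(node, tuple) and node[0] <= len(prefix):
        bit, _, left, right = node
        node = right if prefix[bit - 1] == "1" else left
    # One comparison settles the bits that were skipped on the way down.
    if sistring(any_leaf(node)).startswith(prefix):
        return leaves(node)
    return []

tree = build(list(range(1, 9)))
print(prefix_search(tree, "101"))
print(prefix_search(tree, "00101"))
print(prefix_search(tree, "00"))
```

For q=101 the descent stops at the node labeled 4,2 and the comparison with sistring 6 fails; for q=00101 it reaches leaf 8 and succeeds. A query such as 00 stops at an internal node, and the whole subtree is the answer.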

  32. Proximity searching • Find s1 at most q bits (characters or words) from (before or after) s2. • find “table” within 4 words from “manners” • Search the PAT tree for s1 • Search the PAT tree for s2

  33. Proximity searching • Sort the smaller answer by sistring-index • Traverse the unsorted answer. • For each sistring-index in unsorted answer, • binary search the sorted one, • if successful print sistring-index for s1

  34. Analysis • Assume m1 answers for s1 and m2 for s2. Assume m1 < m2 • Finding the answers requires O(m1 log m1) to sort and O(m2 log m1) to search the sorted array

  35. Example: “s1 within 4 of s2”, m1<m2 • Pointers to s1: 5 21 13 55 • Sorted pointers to s1: 5 13 21 55 • Pointers to s2: 2 57 22 99 80 61 • The binary search is for an s1 pointer s.t. pointer-s2 - 4 <= pointer-s1 <= pointer-s2 + 4 • The hits are 5 55 21
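The sort-then-binary-search procedure can be sketched in Python with the standard `bisect` module. The function name `proximity` is hypothetical, and the sketch assumes, as the slides do, that the s1 answer is the smaller one.

```python
from bisect import bisect_left

def proximity(s1_ptrs, s2_ptrs, q):
    """Return the s1 pointers that lie within q positions of some s2 pointer."""
    sorted_s1 = sorted(s1_ptrs)                  # O(m1 log m1)
    hits = []
    for p2 in s2_ptrs:                           # m2 binary searches, O(log m1) each
        j = bisect_left(sorted_s1, p2 - q)       # first s1 pointer >= p2 - q
        while j < len(sorted_s1) and sorted_s1[j] <= p2 + q:
            hits.append(sorted_s1[j])
            j += 1
    return hits

print(proximity([5, 21, 13, 55], [2, 57, 22, 99, 80, 61], 4))
```

With the pointers from the example, the result is 5, 55, 21. The inner loop reports every s1 pointer in the window, so an s1 pointer close to two different s2 pointers would appear twice.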

  36. PAT arrays • Array in which external nodes are ordered lexicographically by their sistrings. • For our PAT tree example we would get the array 7 4 8 5 1 6 3 2.

  37. PAT arrays • Same order that would be generated by preorder traversal of the PAT tree

  38. Searching the PAT array • Binary search to find the index j of the PAT array s.t. sistr(PAT[j-1]) < s <= sistr(PAT[j]) • This takes O(log n) string comparisons • If s = sistr(PAT[j]) on the first |s| characters, it is a hit; otherwise there are no hits

  39. Searching the PAT array • Note that a PAT array does not contain the “size of subtree” • To find the number of hits, another search is needed for k s.t. sistr(PAT[k]) <= s < sistr(PAT[k+1]) • Number of hits: k-j+1
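The two binary searches can be sketched in Python over the example text. The sketch uses 0-based, half-open indexes, so the number of hits is `k - j` rather than the 1-based `k-j+1` of the slides, and it compares each sistring on its first |s| characters. The names `sistr`, `lower`, `upper`, and `prefix_hits` are assumptions.

```python
TEXT = "01100100010111"

def sistr(i):
    """Sistring starting at 1-based position i (Python's string order
    treats a shorter string as smaller, which plays the role of null padding)."""
    return TEXT[i - 1:]

# PAT array for the first 8 sistrings: indexes in lexicographic sistring order.
PAT = sorted(range(1, 9), key=sistr)

def lower(pat, s):
    """Smallest position whose sistring, cut to len(s), is >= s."""
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        if sistr(pat[mid])[:len(s)] < s:
            lo = mid + 1
        else:
            hi = mid
    return lo

def upper(pat, s):
    """Smallest position whose sistring, cut to len(s), is > s."""
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        if sistr(pat[mid])[:len(s)] <= s:
            lo = mid + 1
        else:
            hi = mid
    return lo

def prefix_hits(pat, s):
    j, k = lower(pat, s), upper(pat, s)
    return pat[j:k]          # k - j hits

print(PAT)
print(prefix_hits(PAT, "00"))
```

The array comes out as 7 4 8 5 1 6 3 2, and the prefix 00 selects the contiguous slice 7 4 8.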

  40. Range search • Given range bounds r1 and r2: • Search for the smallest j s.t. sistr(PAT[j-1]) < r1 <= sistr(PAT[j]) • Search for the largest k s.t. sistr(PAT[k]) <= r2 < sistr(PAT[k+1]) • The answer is all sistrings PAT[i] with i = j,..,k

  41. Range search • The two searches can be combined in a single binary search • The computation time is at most 2 log n comparisons, with 2 disk accesses for each comparison • One disk access gets the sistring-index and one gets the sistring
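A range search over the example PAT array might look like the Python sketch below. It runs the two binary searches separately rather than combining them, uses 0-based half-open indexes, and assumes that each sistring is compared with a bound on the bound's length, so sistrings that start with the upper bound are included. `first_index` and `range_search` are hypothetical names.

```python
TEXT = "01100100010111"

def sistr(i):
    """Sistring starting at 1-based position i."""
    return TEXT[i - 1:]

PAT = sorted(range(1, 9), key=sistr)   # 7 4 8 5 1 6 3 2 for this text

def first_index(pat, bound, strict):
    """Smallest position whose sistring, cut to len(bound), is >= bound
    (or > bound when strict is True)."""
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        key = sistr(pat[mid])[:len(bound)]
        if key < bound or (strict and key == bound):
            lo = mid + 1
        else:
            hi = mid
    return lo

def range_search(pat, r1, r2):
    j = first_index(pat, r1, strict=False)   # first sistring >= r1
    k = first_index(pat, r2, strict=True)    # one past the last sistring <= r2
    return pat[j:k]

print(range_search(PAT, "001", "011"))
```

For the bounds 001 and 011 the answer is the contiguous slice 4 8 5 1 of the array. Each probe in a disk-based version would cost the two accesses mentioned above: one for the array entry and one for the text it points to.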

  42. Building PAT arrays • If enough memory, array can be built internally. • The memory should be large enough to store both text and index.

  43. Building a large PAT array • Use internal memory for building PAT arrays for the largest chunks of text • Merge PAT arrays

  44. Merging a small and a large PAT array • The large array of size n2 is stored on disk. • The small one of size n1 is stored in main memory

  45. Merging a small and a large PAT array • Count the number of sistrings sL of the large array that fall between each pair of adjacent sistrings in the small array • Let PATS denote the small array: count[j] = |{sL | sistr(PATS[j-1]) < sL <= sistr(PATS[j])}|

  46. The algorithm (1)
      while sistring-indexes remain in the large PAT array do
          read the next sistring-index, pointing to sL
          do a binary search on PATS to find j s.t. sistr(PATS[j-1]) < sL <= sistr(PATS[j])
          count[j] = count[j] + 1
      endwhile

  47. The algorithm (2)
      for j = 1 to n1 do
          read count[j] sistring-indexes from the large array and write them to merged-file
          write PATS[j] to merged-file
      endfor
      read count[n1+1] sistring-indexes from the large array and write them to merged-file

  48. Analysis • Memory must be large enough to store: text + small PAT-array + count array • I/O operations: reading the large array twice, O(n2), and writing the merged array, O(n1+n2) • Operations in internal memory: O(n2 log n1)

  49. Example • Large PAT-array: 4 1 3 2 • Small PAT-array: 5 6 • Count array: 1 (<=s5), 1 (<=s6), 2 (>s6) • Merged array: 4 5 1 6 3 2
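The two passes of the merge can be sketched in Python on the same data. Everything is held in lists here, whereas the slides keep the large array on disk; the sistrings are taken from the running example text 01100100010111…, and `sistr` and `merge` are hypothetical names.

```python
TEXT = "01100100010111"

def sistr(i):
    """Sistring starting at 1-based position i."""
    return TEXT[i - 1:]

def merge(large, small):
    """Merge two PAT arrays (lists of sistring indexes in sorted sistring order)."""
    # Pass 1: for each sistring of the large array, binary search the small
    # array for the slot it falls into and count it there.
    count = [0] * (len(small) + 1)
    for idx in large:
        sL = sistr(idx)
        lo, hi = 0, len(small)
        while lo < hi:
            mid = (lo + hi) // 2
            if sistr(small[mid]) < sL:
                lo = mid + 1
            else:
                hi = mid
        count[lo] += 1
    # Pass 2: copy count[j] entries of the large array, then the j-th small entry.
    merged, pos = [], 0
    for j, s_idx in enumerate(small):
        merged.extend(large[pos:pos + count[j]])
        pos += count[j]
        merged.append(s_idx)
    merged.extend(large[pos:])        # the final count[n1+1] entries
    return count, merged

count, merged = merge([4, 1, 3, 2], [5, 6])
print(count, merged)
```

Pass 1 reads the large array once and does one binary search per entry; pass 2 reads it a second time in order, which matches the "reading large array twice" cost in the analysis.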

  50. Building a large PAT array • The small file is of length n/m • The large file is of lengths n/m, 2n/m, 3n/m, …, (m-1)n/m • The algorithm does O(nm) I/Os and O(nm log(n/m)) internal operations
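One way to see where these totals come from is to sum the per-merge costs from the earlier analysis, O(n1+n2) I/Os and O(n2 log n1) internal operations, over the m-1 merges. The derivation below is a sketch under the assumption that the i-th merge combines a large array of length in/m with a small array of length n/m:

```latex
\begin{align*}
\text{I/Os} &= \sum_{i=1}^{m-1} O\!\left(\frac{i\,n}{m} + \frac{n}{m}\right)
             = O\!\left(\frac{n}{m}\cdot\frac{m(m-1)}{2} + n\right)
             = O(nm),\\
\text{internal operations} &= \sum_{i=1}^{m-1} O\!\left(\frac{i\,n}{m}\,\log\frac{n}{m}\right)
             = O\!\left(nm\log\frac{n}{m}\right).
\end{align*}
```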
