
CS533 Information Retrieval

CS533 Information Retrieval. Dr. Michal Cutler. Lecture #23, May 2, 2000. An alternative data structure to inverted files. In this class: binary and multiway tries; Patricia trees (also called suffix trees); PAT arrays (also called suffix arrays).


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #23 May 2, 2000

  2. An alternative data structure to using inverted files • In this class: • Binary and multiway tries • Patricia trees (also called suffix trees) • PAT arrays (also called suffix arrays)

  3. Suffix trees and arrays • Suffix arrays and/or trees are an alternative to using inverted files • They enable efficient answers to some complex queries that would be hard to do with inverted files • Any length phrase queries, range queries, prefix queries, proximity queries, and sometimes part of word queries

  4. Suffix arrays and trees • The main drawbacks of suffix arrays and trees are: • the text must be available at query time • query results are not given in text order • They enable complex queries for databases such as genetic ones, the Bible, Shakespeare, etc. • Suffix (PAT) arrays are a space-efficient implementation of suffix trees (PAT trees)

  5. Binary Trie • A binary trie is a binary tree with: • Two types of nodes: branch and element • A branch node has a left child, a right child, or both • An element node is a leaf, containing the whole key and additional data

  6. Binary Trie • In a trie with variable length keys, no key can be a prefix of another key • Adding a termination character to each key ensures this condition (“to” and “together” can now both be stored)

  7. Advantages • Keys may be variable length • Search involves traversing a path to a leaf using only digit extraction of the search key, and choosing a left child for 0 or a right child for 1 • Only one comparison of the search key and the element is needed

  8. Advantages • Can save space • The structure of a trie is independent of the order of insertion

  9. Example of a binary trie [Figure: a binary trie branching on bit 1 through bit 4, with element nodes 0000, 0001, 0010, 0011, 1000, 1001, and 1100] • Search: left child if the bit is 0, right child if it is 1
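The binary trie described on the preceding slides can be sketched in Python as follows. This is only an illustration under stated assumptions: keys are equal-length strings of '0' and '1', an element node is represented by the key string itself, and the names `Branch`, `insert`, `search`, `build`, and `shape` are made up for this sketch rather than taken from the lecture.

```python
class Branch:
    """Branch node: child[0] is the left child (bit 0), child[1] the right (bit 1)."""
    def __init__(self):
        self.child = [None, None]


def insert(node, key, depth=0):
    """Insert bit-string `key` below `node`; return the (possibly new) subtree root."""
    if node is None:
        return key                              # element node holds the whole key
    if isinstance(node, str):                   # reached an existing element node
        if node == key:
            return node                         # key already stored
        branch = Branch()
        branch.child[int(node[depth])] = node   # push the old element one level down
        return insert(branch, key, depth)
    bit = int(key[depth])
    node.child[bit] = insert(node.child[bit], key, depth + 1)
    return node


def search(root, key):
    """Follow the bits of `key` to a leaf, then do one full key comparison."""
    node, depth = root, 0
    while isinstance(node, Branch):
        node = node.child[int(key[depth])]
        depth += 1
    return node == key


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


def shape(node):
    """Nested-tuple view of the trie, used to compare two tries."""
    if isinstance(node, Branch):
        return (shape(node.child[0]), shape(node.child[1]))
    return node


KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]
root = build(KEYS)
print(search(root, "1001"), search(root, "0100"))
```

Building the trie from the same keys in reverse order gives the same `shape`, which illustrates the claim that the structure is independent of insertion order.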

  10. Tries (multiway) • Natural extension of binary tries • Tree of degree m ≥ 2 in which branching at any level is determined by either a digit of the prefix of the search key, or the suffix of the search key

  11. Tries (multiway) • Contains two types of nodes: branch and element nodes • A branch node may have a child for each possible value of a “digit” • Sometimes suffix based tries are less deep

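  12. Prefix based trie [Figure: the root branches on the first letter (A, B, ..., G, ...); under G the second letter branches to O and U; U leads to gull, and under O the third letter branches to D (god) and S (gosh)]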

  13. Suffix based trie [Figure: the root branches on the last letter (A, B, ..., D, H, L, ...); D leads to god, H to gosh, and L to gull]

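  14. Trie with termination character [Figure: the path t, o branches on the third character; the termination character (drawn as b) leads to the element node for “to”, and g leads to the element node for “together”] • Each element node contains the termination character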
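A prefix based multiway trie with a termination character can be sketched with nested dictionaries. The sketch below is an assumption-laden illustration, not the lecture's implementation: it uses "$" as the termination character, stores the whole key in the element node reached through "$", and the function names are hypothetical.

```python
END = "$"  # assumed termination character; must not occur inside any key


def trie_insert(root, word):
    """Walk or create one branch node per character, then attach the element node."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = word                  # element node: holds the whole key


def trie_contains(root, word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def trie_with_prefix(root, prefix):
    """Return every stored key that starts with `prefix`, in sorted order."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for ch, child in current.items():
            if ch == END:
                found.append(child)   # child is the stored key
            else:
                stack.append(child)
    return sorted(found)


trie = {}
for w in ["to", "together", "gull", "god", "gosh"]:
    trie_insert(trie, w)
print(trie_with_prefix(trie, "go"))
```

Because of the terminator, "to" and "together" coexist even though one is a prefix of the other.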

  15. Compressed binary trie - PATRICIA • Branch nodes with one non null pointer are eliminated • Bit position (or displacement) added to branch nodes

  16. Compressed binary trie - PATRICIA • A PATRICIA tree with n leaves has n-1 internal nodes • The expected height of a random tree is O(log₂ n)

  17. Example of a binary trie [Figure: the binary trie over the keys 0000, 0001, 0010, 0011, 1000, 1001, and 1100, branching on bit 1 through bit 4]

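  18. Example of a PATRICIA [Figure: the same keys in a PATRICIA tree; the branch nodes are labeled with bit positions 1, 2, 3, 4, 4, 4, and the element nodes are 0000, 0001, 0010, 0011, 1000, 1001, and 1100]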
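One way to see the compression is to build the PATRICIA tree directly by splitting the key set on the first bit position where the keys disagree. The sketch below assumes distinct, equal-length bit-string keys and represents a branch node as a tuple `(bit, left, right)`; the function names are illustrative only, and a batch build is used here for brevity rather than incremental insertion.

```python
def patricia(keys, bit=1):
    """Build a PATRICIA tree over distinct bit strings, skipping non-branching bits."""
    if len(keys) == 1:
        return keys[0]                           # element node
    while True:
        zeros = [k for k in keys if k[bit - 1] == "0"]
        ones = [k for k in keys if k[bit - 1] == "1"]
        if zeros and ones:                       # this bit separates the keys
            return (bit, patricia(zeros, bit + 1), patricia(ones, bit + 1))
        bit += 1                                 # one-child branch node: eliminated


def internal_nodes(tree):
    if isinstance(tree, str):
        return 0
    return 1 + internal_nodes(tree[1]) + internal_nodes(tree[2])


KEYS = ["0000", "0001", "0010", "0011", "1000", "1001", "1100"]
tree = patricia(KEYS)
print(tree)
print(internal_nodes(tree), "internal nodes for", len(KEYS), "leaves")
```

For these seven keys the branch nodes test bits 1, 2, 3, 4, 4, and 4, giving six internal nodes, which agrees with the n-1 property.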

  19. An application of Patricia • Create a “dictionary” of every string or every word in a text • This is very useful for texts such as the Bible or Shakespeare • Allows searching for prefixes of words, phrases, sentences, etc.

  20. A sistring • The text is viewed as containing a sequence of sistrings • A sistring is a semi-infinite string • The text is virtually padded with an infinite number of null characters • So no two sistrings are equal • Same idea as the terminating character

  21. Sistrings of a text • May be character based, as follows: • once there was a wizard he lived in Africa… • nce there was a wizard he lived in Africa… • ce there was a wizard he lived in Africa…

  22. Sistrings of a text (word based) • once there was a wizard he lived in Africa… • there was a wizard he lived in Africa… • was a wizard he lived in Africa…
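Since every sistring is just a starting position in the text, it is enough to record offsets rather than copy strings. A small sketch, assuming words are runs of non-whitespace characters (the helper names are hypothetical):

```python
import re


def char_starts(text):
    """0-based start offset of every character based sistring."""
    return list(range(len(text)))


def word_starts(text):
    """0-based start offset of every word based sistring."""
    return [m.start() for m in re.finditer(r"\S+", text)]


TEXT = "once there was a wizard he lived in Africa"

for start in char_starts(TEXT)[:3]:
    print(TEXT[start:])
for start in word_starts(TEXT)[:3]:
    print(TEXT[start:])
```

The slices are taken only for printing; an index would keep the offsets and read the text on demand.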

  23. A PAT tree • A PAT tree is a PATRICIA tree constructed over sistrings of a text

  24. The PAT tree • PAT trees enable flexible and efficient search including: • phrases of any length • prefix search • range search • The result of a search is • a sistring, a tree, or a forest

  25. The nodes of a PAT tree • Leaf nodes contain • a pointer to the sistring in the text

  26. The nodes of a PAT tree • Internal nodes contain • the pointers to the left and right subtrees, • the bit position and • possibly the number of sistrings in the tree whose root is the node

  27. Building a PAT tree • The following slide assumes a binary “text”, and shows how a PAT tree is built sistring by sistring • The initial tree is an element node pointing to the sistring that starts at the first character, i.e., 1. • Each node contains: the bit number of the key that must be extracted, and the number of sistrings in the tree

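  28. Generation of a PAT tree [Figure: the PAT trees after inserting sistrings 1, 2, and 3 of the text 01100100010111…; each internal node is labeled with its bit number and its number of sistrings, and each leaf is a pointer to a sistring, identified by sistring number. After sistring 1: a single leaf 1. After sistring 2: node (1,2) over leaves 1 and 2. After sistring 3: node (1,3) over leaf 1 and node (2,2), which is over leaves 3 and 2]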

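  29. Generation of a PAT tree [Figure: the PAT trees after inserting sistrings 4 and 5 of 01100100010111…; the root becomes (1,4) and then (1,5); the left subtree becomes (2,2) over leaves 4 and 1, and then (2,3) over leaf 4 and node (3,2), which is over leaves 5 and 1; the right subtree (2,2) stays the same]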

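  30. Generation of a PAT tree [Figure: the PAT trees after inserting sistrings 6 and 7 of 01100100010111…; the root becomes (1,6) and then (1,7); inserting sistring 6 turns the right subtree into (2,3) over node (4,2), which is over leaves 6 and 3, and leaf 2; inserting sistring 7 turns the left subtree into (2,4) over node (3,2), which is over leaves 7 and 4, and node (3,2), which is over leaves 5 and 1]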

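  31. A PAT tree - first 8 sistrings [Figure: the PAT tree over the first 8 sistrings of 01100100010111…; the root is (1,8); its left subtree is (2,5) over node (3,3) and node (3,2), where (3,3) is over leaf 7 and node (5,2) with leaves 4 and 8, and (3,2) is over leaves 5 and 1; its right subtree is (2,3) over node (4,2), with leaves 6 and 3, and leaf 2]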
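The sistring-by-sistring construction can be sketched with the usual PATRICIA insertion: walk to a leaf, find the first bit where the new sistring differs from that leaf's sistring, then splice a new branch node in at that bit. This is a sketch under assumptions, not the lecture's code: leaves are plain sistring numbers, the 14 visible bits of the example text stand in for the whole text, and every bit that is inspected is assumed to lie inside the text (no null padding is modeled).

```python
class Node:
    """Internal node: bit number tested, subtrees, and number of sistrings below."""
    def __init__(self, bit, left, right, count):
        self.bit, self.left, self.right, self.count = bit, left, right, count


def bit_at(text, i, b):
    """Bit b (1-based) of sistring i (1-based); assumed to lie inside `text`."""
    return text[i - 1 + b - 1]


def splice(node, text, i, d):
    """Descend while nodes test bits before d, then add a branch node on bit d."""
    if isinstance(node, Node) and node.bit < d:
        node.count += 1
        if bit_at(text, i, node.bit) == "1":
            node.right = splice(node.right, text, i, d)
        else:
            node.left = splice(node.left, text, i, d)
        return node
    size = node.count if isinstance(node, Node) else 1
    if bit_at(text, i, d) == "1":
        return Node(d, node, i, size + 1)
    return Node(d, i, node, size + 1)


def insert(root, text, i):
    if root is None:
        return i                                  # first sistring: a single leaf
    leaf = root
    while isinstance(leaf, Node):                 # walk to some leaf
        leaf = leaf.right if bit_at(text, i, leaf.bit) == "1" else leaf.left
    d = 1
    while bit_at(text, i, d) == bit_at(text, leaf, d):
        d += 1                                    # first differing bit
    return splice(root, text, i, d)


def build_pat_tree(text, n):
    root = None
    for i in range(1, n + 1):
        root = insert(root, text, i)
    return root


def show(node):
    """Nested tuples ((bit, count), left, right); leaves are sistring numbers."""
    if isinstance(node, Node):
        return ((node.bit, node.count), show(node.left), show(node.right))
    return node


TEXT = "01100100010111"
print(show(build_pat_tree(TEXT, 8)))
```

For the first 8 sistrings this produces a root labeled (1,8) with subtrees (2,5) and (2,3), matching the labels on the slide.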

  32. Prefix search • Find all sistrings with prefix s • A path of branch nodes is traversed until: • Internal node corresponding to prefix reached or • A leaf

  33. Prefix search - internal node • If bits were skipped, compare a sistring in the subtree to the prefix (this requires traversing to a leaf). If they do not match, there are no hits. • Otherwise, the number of hits is contained in the node. • The answer is the subtree whose root is the internal node

  34. Prefix search - reached external node • Compare the prefix with the sistring • Clearly the complexity is bounded by the height of the tree

  35. Example of a prefix search • q=101 • Go right to the node labeled (2,3), then left • Reach the internal node (4,2); bit 3 is skipped and the prefix is exhausted • sistring 6 starts with 100, which does not match 101 • 0 hits

  36. Q=101 [Figure: the PAT tree over the first 8 sistrings of 01100100010111…, with the search path marked: right (1) at the root (1,8), left (0) at (2,3), stopping at (4,2); bit 3 is skipped]

  37. Example of a prefix search • q=00101 • Left, left, right, (skip bit 4) and right • reach sistring 8 • 00101=00101… • A hit

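  38. Q=00101 [Figure: the PAT tree over the first 8 sistrings of 01100100010111…, with the search path marked: left (0) at the root (1,8), left (0) at (2,5), right (1) at (3,3), right (1) at (5,2), reaching sistring 8; bit 4 is skipped]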
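The two worked queries can be replayed with a short sketch. The tree below is written out by hand as nested tuples `(bit, count, left, right)` for the first 8 sistrings of the example text, with leaves as sistring numbers; that encoding and the function names are assumptions made for this illustration.

```python
TEXT = "01100100010111"

# PAT tree over sistrings 1..8, written as (bit, count, left, right).
TREE = (1, 8,
        (2, 5, (3, 3, 7, (5, 2, 4, 8)), (3, 2, 5, 1)),
        (2, 3, (4, 2, 6, 3), 2))


def leaves(node):
    """Sistring numbers in the subtree, left to right."""
    if isinstance(node, int):
        return [node]
    return leaves(node[2]) + leaves(node[3])


def prefix_search(tree, text, q):
    """Return the sistring numbers whose sistring starts with the bit string q."""
    node = tree
    # Follow branch nodes while the tested bit is still inside the prefix.
    while not isinstance(node, int) and node[0] <= len(q):
        node = node[3] if q[node[0] - 1] == "1" else node[2]
    candidates = leaves(node)
    # Bits may have been skipped, so verify against one sistring of the subtree.
    first = candidates[0]
    if text[first - 1:first - 1 + len(q)] != q:
        return []
    return candidates


print(prefix_search(TREE, TEXT, "101"))
print(prefix_search(TREE, TEXT, "00101"))
```

The single verification step is what makes the skipped bits safe: for q=101 the walk ends at node (4,2), but sistring 6 starts with 100, so the answer is empty.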

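  39. Word based PAT tree [Figure: a PAT tree over the 9 word based sistrings of “once there was a wizard he lived in Africa”, branching on characters; the root is labeled (1,9), two nodes labeled (2,2) separate “a” from “Africa” and “was” from “wizard” on the second character, and the leaves are 1 once, 2 there, 3 was, 4 a, 5 wizard, 6 he, 7 lived, 8 in, 9 Africa]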

  40. Proximity searching • Find s1 at most q bits (characters or words) from (before or after) s2. • find “table” within 4 words from “manners” • Search the PAT tree for s1 • Search the PAT tree for s2

  41. Proximity searching • Sort the smaller answer by sistring-index • Traverse the unsorted answer. • For each sistring-index in unsorted answer, • binary search the sorted one, • if successful print sistring-index for s1

  42. Analysis • Assume m1 answers for s1 and m2 for s2, with m1 < m2 • Finding the answers requires O(m1 log m1) to sort and O(m2 log m1) to search the sorted array

  43. “s1 within 4 of s2”, m1<m2 • Pointers to s1: 5 21 13 55; sorted for s1: 5 13 21 55 • Pointers to s2: 2 57 22 99 80 61 • The binary search is for s1 s.t. pointer-s2 - 4 <= pointer-s1 <= pointer-s2 + 4 • The hits are 5 55 21
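The proximity procedure can be sketched with the standard-library `bisect` module. The sketch assumes the two answer sets are already available as lists of sistring indices and that s1 is the smaller set; the function name `proximity` is illustrative.

```python
from bisect import bisect_left


def proximity(s1_ptrs, s2_ptrs, q):
    """Return the s1 pointers that lie within q positions of some s2 pointer."""
    sorted_s1 = sorted(s1_ptrs)                 # O(m1 log m1)
    hits = []
    for p2 in s2_ptrs:                          # m2 binary searches: O(m2 log m1)
        k = bisect_left(sorted_s1, p2 - q)      # first s1 pointer >= p2 - q
        while k < len(sorted_s1) and sorted_s1[k] <= p2 + q:
            hits.append(sorted_s1[k])
            k += 1
    return hits


print(proximity([5, 21, 13, 55], [2, 57, 22, 99, 80, 61], 4))
```

Hits come out in the order of the unsorted s2 list, and an s1 pointer that is close to several s2 pointers would be reported once per match.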

  44. PAT arrays • Array in which external nodes are ordered lexicographically by their sistrings. • For our PAT tree example we would get the array 7 4 8 5 1 6 3 2. [Figure: the PAT tree over the first 8 sistrings, whose leaves read left to right are 7 4 8 5 1 6 3 2]

  45. PAT arrays • Same order that would be generated by preorder traversal of the PAT tree

  46. Searching the PAT array • Binary search to find the index j of the PAT array s.t. sistr(PAT[j-1]) < s <= sistr(PAT[j]) • This takes O(log n) string comparisons • If s = sistr(PAT[j]), compared over the length of s, it is a hit; otherwise there are no hits

  47. Searching the PAT array • Note that a PAT array does not contain the “size of subtree”. • To find the number of hits, another search is needed for k s.t. sistr(PAT[k]) <= s < sistr(PAT[k+1]) • The number of hits is k-j+1
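The two binary searches can be sketched as follows, using the example PAT array 7 4 8 5 1 6 3 2. The sketch assumes that each comparison looks only at the first len(s) characters of a sistring and uses 0-based, half-open array indices, so the hit count is `end - first` rather than k-j+1; the names are made up for the illustration.

```python
TEXT = "01100100010111"
PAT = [7, 4, 8, 5, 1, 6, 3, 2]          # sistring numbers in lexicographic order


def prefix_of(j, length):
    """First `length` characters of the sistring that PAT[j] points to."""
    start = PAT[j] - 1
    return TEXT[start:start + length]


def prefix_hits(s):
    """Return the sistring numbers whose sistring starts with s."""
    lo, hi = 0, len(PAT)
    while lo < hi:                       # first index whose prefix is >= s
        mid = (lo + hi) // 2
        if prefix_of(mid, len(s)) < s:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(PAT)
    while lo < hi:                       # first index whose prefix is > s
        mid = (lo + hi) // 2
        if prefix_of(mid, len(s)) <= s:
            lo = mid + 1
        else:
            hi = mid
    return PAT[first:lo]


print(prefix_hits("001"), len(prefix_hits("001")))
print(prefix_hits("101"))
```

Unlike the PAT tree, the array has no stored subtree size, so the count comes from the distance between the two search results.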

  48. Range search (r1-r2) • Search for the smallest j s.t. sistr(PAT[j-1]) < r1 <= sistr(PAT[j]) • Search for the largest k s.t. sistr(PAT[k]) <= r2 < sistr(PAT[k+1]) • The answer is all sistrings PAT[i] with i=j,..,k

  49. Range search • The two searches can be combined in a single binary search. • The computation takes at most 2 log n comparisons, with 2 disk accesses for each comparison: • one disk access to get the sistring-index and one to get the sistring
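A range query over the same example array can be sketched with one bounding helper called twice. For simplicity the sketch runs two separate binary searches rather than the combined search mentioned above, and it assumes each bound is compared against a sistring truncated to the bound's own length; `bound` and `range_search` are hypothetical names.

```python
TEXT = "01100100010111"
PAT = [7, 4, 8, 5, 1, 6, 3, 2]          # PAT array from the lecture's example


def sistr(i, length):
    """First `length` characters of sistring i (1-based)."""
    return TEXT[i - 1:i - 1 + length]


def bound(r, upper):
    """First array index whose truncated sistring is >= r, or > r when `upper`."""
    lo, hi = 0, len(PAT)
    while lo < hi:
        mid = (lo + hi) // 2
        key = sistr(PAT[mid], len(r))
        if key < r or (upper and key == r):
            lo = mid + 1
        else:
            hi = mid
    return lo


def range_search(r1, r2):
    """Sistring numbers whose sistrings fall between r1 and r2, inclusive."""
    return PAT[bound(r1, False):bound(r2, True)]


print(range_search("01", "100"))
```

In an on-disk setting each probe of `sistr(PAT[mid], ...)` corresponds to the two accesses counted on the slide: one for the array entry and one for the text.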

  50. Building PAT arrays • If there is enough memory, the array can be built internally. • The memory should be large enough to store both the text and the index. • An “indirect sort” is performed on the sistrings • O(n log n)
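An in-memory build amounts to sorting the start offsets by the sistrings they point to. The sketch below assumes the whole text fits in memory and, for brevity, uses slices as sort keys, which copies text that a production build would compare in place; the function name `pat_array` is illustrative.

```python
import re


def pat_array(text, starts):
    """Indirect sort: order the 0-based start offsets by the sistring at each one."""
    return sorted(starts, key=lambda s: text[s:])


# Binary example from the lecture: first 8 character based sistrings.
BITS = "01100100010111"
print([s + 1 for s in pat_array(BITS, range(8))])

# Word based example: report word numbers instead of offsets.
SENTENCE = "once there was a wizard he lived in Africa"
word_offsets = [m.start() for m in re.finditer(r"\S+", SENTENCE)]
word_order = [word_offsets.index(s) + 1 for s in pat_array(SENTENCE, word_offsets)]
print(word_order)
```

The comparison here is by raw character code, so the capitalized "Africa" sorts before the lowercase words; a real index would likely normalize case first.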
