Document Retrieval Problems S. Muthukrishnan
Storyline
• Zvi Galil gave a talk on the 13th on 13 open problems he posed 13 years ago in string matching...
• Update on the status of open problems.
• Eric Allender invited me to give a string matching talk at Rutgers U.
• Gives me a chance to look through 30 years of history.
• History may be divided into three movements: what moves rapidly, what moves slowly, and what appears not to move at all. Fernand Braudel
The Key Problem
• Occurrence listing: given a set of documents D to be preprocessed, the query is to list all the locations in the documents where a given pattern occurs.
  D = {aabaa, abaaa, bc}, with d1 = aabaa, d2 = abaaa, d3 = bc. P = aa. O = {(1,1), (1,4), (2,3), (2,4)}
• Document listing: given a set of documents D to be preprocessed, the query is to list all the documents in which a given pattern occurs.
  D = {aabaa, abaaa, bc}, with d1 = aabaa, d2 = abaaa, d3 = bc. P = aa. O = {1, 2}
Muthu: Use this problem to frame the discussion.
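To make the two query types concrete, here is a minimal brute-force sketch in Python. It scans every document at query time, so it illustrates the definitions only, not the preprocessing-based solutions discussed in the talk. The function names are illustrative assumptions, not from the slides.

```python
def occurrence_listing(docs, pattern):
    """Return every (document, position) pair, both 1-indexed, where pattern starts."""
    hits = []
    for d, text in enumerate(docs, start=1):
        start = text.find(pattern)
        while start != -1:
            hits.append((d, start + 1))
            # Advance by one character so overlapping occurrences are found.
            start = text.find(pattern, start + 1)
    return hits


def document_listing(docs, pattern):
    """Return the 1-indexed documents that contain pattern at least once."""
    return [d for d, text in enumerate(docs, start=1) if pattern in text]


D = ["aabaa", "abaaa", "bc"]
print(occurrence_listing(D, "aa"))  # [(1, 1), (1, 4), (2, 3), (2, 4)]
print(document_listing(D, "aa"))    # [1, 2]
```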
Occurrence Vs Document Listing
• Given n documents of total length N, occurrence listing can be solved with
  • O(N) preprocessing, and
  • O(m + output) time for a query pattern of size m.
• Elegant 1973 paper by Weiner introduced suffix trees and solved this problem: optimal, output sensitive.
• No such optimal result for document listing.
  • O((m + out) log n) time query processing.
  • log n loglog n by fractional cascading.
muthu: assuming you don’t hastily give the answer without looking at the entire document or the pattern!
Other Document Listing Problems
• Find all documents that contain at least K occurrences of the given pattern. (mining)
• Find all documents that contain two occurrences of the pattern separated by at most distance d. (proximity repeat)
• Find all documents that do NOT contain the given pattern. (negative query)
• Find all documents that contain pattern P but not Q. (boolean query)
• Combinations thereof…
Muthu: Normally, negative queries are not selective, but work within a selected subset or in conjunction with other patterns.
Nature of Document Retrieval Problems
• Document listing versions are natural.
• Occurrence listing versions primarily studied in Computational Biology and Data Mining.
• No optimal algorithms previously known.
  • Bounds are off by factors of log n … n in the worst case, depending on the problem.
• We will provide (near) optimal algorithms.
  • Optimal algorithm for the key document listing problem.
• Theory following Practice?
  • Inverted word index + variants, in IR.
Muthu: Motivated the discussion with this problem. It is also framed in history.
Talk Overview
• Optimal algorithm for the document listing problem.
  • List all documents that contain the given pattern.
• Efficient algorithm for the document mining problem.
  • List all documents that contain at least K occurrences of the given pattern.
• Techniques.
  • Colored range query data structural problems.
Preamble: Occurrence Listing
• Construct a suffix tree (compressed trie) of all the documents.
D = {abaa, aabaa, bc}
S = {abaa#, baa#, aa#, a#, aabaa#, bc#, c#}
[Figure: suffix tree of D. Each leaf lists the (document, position) pairs of its suffix: a# (1,4),(2,5); aa# (1,3),(2,4); aabaa# (2,1); abaa# (1,1),(2,2); baa# (1,2),(2,3); bc# (3,1); c# (3,2).]
http://commfaculty.fullerton.edu/lester/writings/1000_words.html
Preamble: Occurrence Listing
• Find all occurrences of pattern aa.
• Trace down the path aa and report all the leaves [Weiner 73].
[Figure: the same suffix tree as on the previous slide.]
Input: D = {abaa, aabaa, bc}
Output: (1,3), (2,4), (2,1)
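The following Python sketch mimics "trace down the path and report the leaves" with a sorted list of suffixes, which is a simpler stand-in for the suffix tree and not Weiner's construction. It assumes documents contain neither '#' nor the sentinel character '\uffff', and that '#' sorts below every document character. Construction here is far from O(N), and a query costs O(m log N + output) rather than O(m + output). All names are illustrative.

```python
from bisect import bisect_left, bisect_right


def build_suffix_index(docs):
    """Sort all '#'-terminated suffixes; keep (suffix, doc, pos), 1-indexed."""
    entries = []
    for d, text in enumerate(docs, start=1):
        terminated = text + "#"
        for p in range(len(text)):
            entries.append((terminated[p:], d, p + 1))
    entries.sort()
    keys = [suffix for suffix, _, _ in entries]
    return keys, entries


def occurrences(keys, entries, pattern):
    """Suffixes starting with pattern form one contiguous block of the sorted list."""
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern + "\uffff")
    return [(d, p) for _, d, p in entries[lo:hi]]


keys, entries = build_suffix_index(["abaa", "aabaa", "bc"])
print(occurrences(keys, entries, "aa"))  # [(1, 3), (2, 4), (2, 1)]
```

The contiguous block [lo, hi) plays the role of the subtree below the path aa: its entries are exactly the leaves reported on the slide.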
Document Listing
• Find all documents that contain pattern aa.
• Trace down the path aa and report the distinct “colors” on leaves.
[Figure: the suffix tree from the preamble, with each leaf labeled by its document colors instead of positions. Colors: 1, 2, 3.]
Input: D = {abaa, aabaa, bc}
Output sought: 1, 2
Challenge: Avoid reporting duplicate colors.
muthu: Use hot pink sparingly
Document Listing: Our Approach
[Figure: the colored suffix tree, with its leaves read left to right as an array of colors.]
1 2 1 2 2 1 2 1 2 3 3
Colored range query: Return distinct colors in given range.
Mathematics is the art of giving the same name to different things. --- Jules Henri Poincare
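Document Listing: Our Approach
List distinct colors.
Position:  1  2  3  4  5  6  7  8  9 10 11
Color:     1  2  1  2  2  1  2  1  2  3  3
Chain:    -1 -1  1  2  4  3  5  6  7 -1 10
Each chain entry is the position of the previous leaf with the same color, or -1 if there is none.
For the range [3,5] (the leaves below aa): list numbers less than 3. Colors do not matter anymore.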
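A small Python sketch of the chaining step, assuming the chain array stores the 1-indexed position of the previous leaf of the same color (-1 if none). A position in the range [l, r] is the first occurrence of its color inside the range exactly when its chain value is less than l. The linear scan below is only for illustration; it does not achieve output-sensitive time. Function names are illustrative.

```python
def chain(colors):
    """prev[i-1] = 1-indexed position of the previous leaf with the same color, else -1."""
    last_seen = {}
    prev = []
    for i, c in enumerate(colors, start=1):
        prev.append(last_seen.get(c, -1))
        last_seen[c] = i
    return prev


def distinct_colors(colors, prev, l, r):
    """Colors whose first occurrence within positions l..r (1-indexed, inclusive) is reported once."""
    return [colors[i - 1] for i in range(l, r + 1) if prev[i - 1] < l]


colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
prev = chain(colors)
print(distinct_colors(colors, prev, 3, 5))  # [1, 2]
```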
Document Listing: Our Approach
List numbers less than 3.
Position:  1  2  3  4  5  6  7  8  9 10 11
Chain:    -1 -1  1  2  4  3  5  6  7 -1 10
R = (l,r). Find all integers smaller than x in A[l,r]:
1. Perform rangemin(R) to determine i such that A[i] is smallest in A[l,r].
2. If A[i] is smaller than x, report A[i] and recurse on A[l,i-1] and A[i+1,r]; otherwise stop.
O(1) time per rangemin query, so O(output) time overall.
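The two steps above can be sketched in Python as follows. As an assumption for brevity, range minima are answered with a sparse table, which gives O(1) queries but needs O(N log N) preprocessing rather than the O(N) structure the talk relies on. The recursion is kept as on the slide, so very large outputs could hit Python's recursion limit. All names are illustrative.

```python
def chain(colors):
    """1-indexed position of the previous leaf with the same color, else -1."""
    last, prev = {}, []
    for i, c in enumerate(colors, start=1):
        prev.append(last.get(c, -1))
        last[c] = i
    return prev


def build_sparse_table(a):
    """table[k][i] = index of a minimum of a[i : i + 2**k]."""
    table = [list(range(len(a)))]
    k = 1
    while (1 << k) <= len(a):
        below, half, row = table[-1], 1 << (k - 1), []
        for i in range(len(a) - (1 << k) + 1):
            left, right = below[i], below[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def range_min(a, table, l, r):
    """Index of a minimum of a[l..r] (0-indexed, inclusive) in O(1)."""
    k = (r - l + 1).bit_length() - 1
    left, right = table[k][l], table[k][r - (1 << k) + 1]
    return left if a[left] <= a[right] else right


def report_smaller(a, table, l, r, x, out):
    """Append to out every index i in l..r with a[i] < x."""
    if l > r:
        return
    i = range_min(a, table, l, r)
    if a[i] < x:  # otherwise nothing in a[l..r] is smaller than x
        out.append(i)
        report_smaller(a, table, l, i - 1, x, out)
        report_smaller(a, table, i + 1, r, x, out)


def list_colors(colors, prev, table, l, r):
    """Distinct colors in positions l..r (1-indexed, inclusive)."""
    out = []
    report_smaller(prev, table, l - 1, r - 1, l, out)
    return sorted(colors[i] for i in out)


colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
prev = chain(colors)
table = build_sparse_table(prev)
print(list_colors(colors, prev, table, 3, 5))  # [1, 2]
```

Every call either reports one position or stops immediately, and each reported position spawns at most two further calls, which is where the O(output) bound comes from.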
Document Listing: Summary
• Given a set of documents of total size N, the document listing problem can be solved in
  • O(N) time and space for preprocessing, and
  • O(m + output) time for a query of size m.
• Uses Weiner’s O(N) time suffix tree construction.
• Overview of techniques
  • Reduce the problem to colored range searching.
  • “Chain” occurrences of suffixes from each document.
Necessity is not necessarily the mother of invention. Ruth Benedict in Patterns of Culture.
Muthu: Now, let us get started with fun stuff.
Document Mining • Find all documents that contain at least K occurrences of given pattern. Find colors that appear at least K times in this range.
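As a reference point for the approaches that follow, here is a brute-force Python sketch of the colored-range formulation of document mining. It counts colors over the whole range, so its cost grows with the range length rather than with the output size. The color array is the leaf order from the document listing example, and the function name is an illustrative assumption.

```python
from collections import Counter


def frequent_colors(colors, l, r, k):
    """Colors that appear at least k times in positions l..r (1-indexed, inclusive)."""
    counts = Counter(colors[l - 1:r])
    return sorted(c for c, n in counts.items() if n >= k)


colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
# Positions 3..5 are the leaves below the path aa for D = {abaa, aabaa, bc}.
print(frequent_colors(colors, 3, 5, 2))  # [2]
```

Document 2 (aabaa) is the only one with two occurrences of aa, which is why only color 2 is reported for K = 2.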
Document Mining: First Approach
• Fix K. Chain to the Kth occurrence of red to the left. Given range [l,r], determine all numbers in A[l,r] that are less than l.
Yesterday it worked
Today it is not working
Windows is like that.
Does not work: output * K
Document Mining: Second Approach • Given a set of colored intervals to be preprocessed, query is some interval I and we must determine the distinct colored intervals that are contained in I. Chain to the Kth occurrence of red to the left. Replace by red intervals. No optimal results known
Document Mining: Fixed K
[Figure: tree with two leaves labeled L and R.]
Mark the least common ancestor of (L, R) with red color.
Each query: find the set of distinct colors in a subtree.
O(N) preprocessing, O(m + output) time per query.
Document Mining: Variable K
• K is part of the query: o(NK) preprocessing?
[Figure: occurrences numbered 1, 2, 3, …, K, K+1, K+2, …, 2K-1.]
• For a fixed K, all LCAs lie in paths separated by K occurrences.
• Suffices to keep the lowest in each path.
muthu: that deserves the hot pink.
Document Mining: Variable K • For a fixed K, find the lowest LCA on each of the paths separated by O(K) occurrences of each document. • Preprocessing time: bin searching paths. • Query processing in O(m + output) time.
Summary
• Solving other document listing problems.
  • Optimal for negative query: list absent colors.
  • (Near) optimal for proximity repeats: structural properties of “gaps.”
  • Best known for two patterns: breaking the quadratic preprocessing bottleneck.
• Techniques: Chaining, Colored range queries (7+ such problems in the paper), Combinatorial structure.
Muthu: Solving these colored range searching problems is of independent interest…
muthu: Hope that whetted your appetite for algorithmics.
Discussion • “non” local chaining? • Find documents in which no two occurrences of the pattern are within distance K. OPEN • Try it in IPScope: Interactive Patents Analysis System.