Space-Efficient Algorithms for Document Retrieval Veli Mäkinen University of Helsinki Joint work with Niko Välimäki
Introduction
• Information Retrieval (field): Document Retrieval (problem), Inverted Index (solution)
• Combinatorial Pattern Matching (field): Text Indexing (problem), Suffix tree (solution)
• Diagram annotations: [Sad07 & this paper], [PST06], practice: space limits, theory: time limits, [Mut02]
Space-Efficient Document Retrieval
Text Indexing
• Let T = t1 t2 ... tn be a text string over an ordered alphabet Σ.
• The Text Indexing problem is to build an index structure for T that supports the following operations on a given pattern P = p1 p2 ... pm:
  • Count(P): How many times does P occur in T?
  • List(P): List the occurrence positions of P in T.
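A minimal sketch of Count(P) and List(P), assuming a plain (uncompressed) suffix array as the index. The function names, the naive construction, and the 1-based output positions are illustrative choices, not something the slides prescribe.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return (lo, hi) such that sa[lo:hi] are exactly the suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return hi - lo


def list_occurrences(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return sorted(p + 1 for p in sa[lo:hi])  # 1-based positions


text = "to be, or not to be"
sa = build_suffix_array(text)
print(count(text, sa, "to"), list_occurrences(text, sa, "be"))
```

Both operations reduce to finding one contiguous suffix array interval; Count(P) needs only its width, while List(P) has to read the entries inside it.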
Document Retrieval
• Let D = {T1, T2, ..., Tk} be a set of text documents of total length n.
• The Document Retrieval problem is to build an index for D that supports the following operation on a given pattern P = p1 p2 ... pm:
  • Find(P): List the documents that contain P (in order of relevance, ...).
Inverted Index & Document Retrieval
d1 (Hamlet): To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; ...
d2 (Merchant of Venice): PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.
Creating an inverted file over Shakespeare's plays:
...
be: (d1,4) (d1,18) ... (d2,74) (d2,139) ...
...
to: (d1,1) (d1,15) ... (d2,136) ...
...
Find("to be") = Remove duplicates((Find("to")+3) ∩ Find("be")) = d1 (Hamlet), d2 (Merchant of Venice), ...
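The phrase query on this slide can be sketched as follows, assuming a positional inverted index keyed by lower-cased words with 1-based character offsets (which matches the postings (d1,1), (d1,4), (d1,15), (d1,18) shown above). The helper names, the tokenizer, and the single-space assumption between the two words are illustrative, not taken from the slides.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased word to a set of (doc_id, 1-based character offset)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for match in re.finditer(r"[a-z']+", text.lower()):
            index[match.group()].add((doc_id, match.start() + 1))
    return index


def find_phrase(index, first, second):
    """Documents in which `second` starts exactly one space after `first`."""
    shift = len(first) + 1  # 3 for "to", as in Find("to")+3
    shifted = {(d, pos + shift) for d, pos in index.get(first, set())}
    hits = shifted & index.get(second, set())
    return sorted({d for d, _ in hits})  # remove duplicate documents


docs = {
    "d1": "To be, or not to be: that is the question",
    "d2": "I can easier teach twenty what were good to be done",
    "d3": "To die: to sleep",
}
index = build_inverted_index(docs)
print(find_phrase(index, "to", "be"))
```

The intersection works on (document, position) pairs, so the duplicate removal at the end is what turns occurrence hits into a document list.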
Suffix Array & Document Retrieval (1/2)
• Build the generalized suffix array of D:
[Figure: suffix array positions 1, 2, ..., 6853491, 6853492, 6853493, 6853494, ... pointing into the documents, illustrated with the Hamlet excerpt ("To be, or not to be: ...") and the Merchant of Venice excerpt ("PORTIA: If to do were as easy as to know ...").]
Suffix Array & Document Retrieval
• Build the generalized suffix array of D.
• Locate the interval containing all occurrences of pattern P.
• Remove duplicates.
[Figure: the occurrences of "to be" occupy a contiguous interval of suffix array positions (..., 6853491, 6853492, 6853493, 6853494, ...); removing duplicates yields d1 (Hamlet), d2 (Merchant of Venice), ...]
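The three steps above can be sketched as follows. This is a deliberately naive illustration: it materializes every suffix as a string and numbers documents from 1; both choices, and all function names, are assumptions made for readability.

```python
from bisect import bisect_left, bisect_right


def build_generalized_suffix_array(docs):
    """Sort all suffixes of all documents; return the suffixes and the parallel doc array."""
    entries = [
        (text[i:], d)
        for d, text in enumerate(docs, start=1)
        for i in range(len(text))
    ]
    entries.sort()
    suffixes = [s for s, _ in entries]
    doc = [d for _, d in entries]  # doc[i] = document owning the i-th smallest suffix
    return suffixes, doc


def pattern_interval(suffixes, pattern):
    """Half-open interval of suffixes that start with pattern."""
    m = len(pattern)
    prefixes = [s[:m] for s in suffixes]  # sorted because the suffixes are sorted
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


def find_documents(docs, pattern):
    suffixes, doc = build_generalized_suffix_array(docs)
    lo, hi = pattern_interval(suffixes, pattern)
    return sorted(set(doc[lo:hi]))  # remove duplicates


docs = ["to be, or not to be", "good to be done", "to die: to sleep"]
print(find_documents(docs, "to be"))
```

Note that the duplicate removal scans every occurrence in the interval, so its cost grows with the number of occurrences rather than with the number of matching documents.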
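Muthukrishnan's improvement
[Figure: the suffix array interval of "to be" with the doc and prev arrays aligned to it:
position: 1 2 .... 6853491 6853492 6853493 6853494 ...
doc: 6 4 .... 2 1 1 3
prev: -1 -1 ... 6853434 6853372 6853492 6853420 ...
Range minimum queries (min) are applied recursively to prev within the interval until the minimum found is > 6853490.]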
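A minimal sketch of the idea in the figure, assuming 0-based positions and a naive linear-scan range minimum in place of a constant-time RMQ structure. A position i in the interval is the first occurrence of its document there exactly when prev[i] lies to the left of the interval, so repeatedly taking the range minimum of prev reports each document once. The function names are illustrative.

```python
def build_prev(doc):
    """prev[i] = largest j < i with doc[j] == doc[i], or -1 if there is none."""
    last = {}
    prev = []
    for i, d in enumerate(doc):
        prev.append(last.get(d, -1))
        last[d] = i
    return prev


def list_documents(doc, prev, lo, hi):
    """Distinct documents in doc[lo..hi] (inclusive), each reported once."""
    result = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a > b:
            continue
        # Naive range minimum over prev[a..b]; a real index would answer this in O(1).
        i = min(range(a, b + 1), key=prev.__getitem__)
        if prev[i] >= lo:
            continue  # every document in [a, b] already occurs earlier inside [lo, hi]
        result.append(doc[i])
        stack.append((a, i - 1))
        stack.append((i + 1, b))
    return result


doc = [6, 4, 2, 1, 1, 3, 2]
prev = build_prev(doc)
print(sorted(list_documents(doc, prev, 2, 5)))
```

Each reported document triggers at most two further range minimum queries, which is where the ndoc term in the query time comes from once the range minimum itself is constant time.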
Time-Optimal Document Retrieval
• Theorem [Mut02]: The document retrieval problem can be solved in the optimal O(m + ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query.
• Observation: The solution is not space-optimal, as the document collection itself can be represented in n log |Σ| bits.
Space-Optimal Document Retrieval
• Theorem [Sad02]: The document retrieval problem can be solved in O(f(m,n) + ndoc·g(n)) time using an index structure of size |CSA| + 4n + o(n) + O(k log(n/k)) bits, where
  • |CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used;
  • f(m,n) = O(m log n) is the pattern search time; and
  • g(n) = Ω(log^ε n) is the time to decode a suffix array value.
Our Result: Space- and Time-Efficient Document Retrieval
• Theorem: The document retrieval problem can be solved in the optimal O(m + ndoc) time using an index structure of size |CSA| + 2n + o(n) + n log k (1+o(1)) bits, when |Σ|, k = O(polylog(n));
• for unbounded |Σ|, k the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.
Details of Our Result (1/3)
• We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
• We use the generalized wavelet tree [GGV03, FMMN07] to store document numbers according to the suffix array order.
Details of Our Result (2/3)
• Observation: prev[i] = select_{doc[i]}(doc, rank_{doc[i]}(doc, i) − 1), where
  • rank_{k'}(A, i) gives the number of times value k' appears in A[1,i]; and
  • select_{k'}(A, j) gives the position of the j-th occurrence of value k' in A.
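The observation can be checked with linear-time stand-ins for rank and select. In this sketch positions are 1-based as on the slide, and select is assumed to return -1 when asked for the 0-th occurrence (the slide does not state this convention, but it matches the -1 entries of prev shown earlier).

```python
def rank(A, value, i):
    """Number of occurrences of value in A[1..i] (1-based, inclusive)."""
    return sum(1 for x in A[:i] if x == value)


def select(A, value, j):
    """1-based position of the j-th occurrence of value in A; -1 if there is none."""
    if j < 1:
        return -1  # assumed convention: no previous occurrence
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == value:
            seen += 1
            if seen == j:
                return pos
    return -1


def prev_at(doc, i):
    """prev[i] computed only from rank and select on doc, so prev need not be stored."""
    d = doc[i - 1]
    return select(doc, d, rank(doc, d, i) - 1)


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
print([prev_at(doc, i) for i in range(1, len(doc) + 1)])
```

Because prev is recoverable this way, only doc (with rank and select support) has to be stored explicitly.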
Details of Our Result (3/3)
• The generalized wavelet tree representation of the doc array provides constant-time rank and select when k = O(polylog(n)).
• Constant-time Range Minimum Queries (RMQ) on the implicit prev array can be supported using 2n + o(n) bits [FH07].
A simpler way to obtain the O(ndoc log k) result...
Space: |CSA| + 2n + o(n) + n log k (1+o(1)) bits
[Figure: hierarchical decomposition of the doc array:
position: 1 2 3 4 5 6 7 8 9
doc: 2 3 4 2 1 2 3 1 4
second level: 2 2 1 2 1 | 3 4 3 4
leaves: 1 1 | 2 2 2 | 3 3 | 4 4]
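One plausible reading of the figure, offered as an assumption rather than as the authors' method: the doc array is stored as a binary wavelet tree, and the distinct documents in a range are listed by descending only into children whose mapped range is non-empty. The sketch below uses a pointer-based tree with prefix counts instead of succinct bitvectors; the class and function names are hypothetical.

```python
class WaveletNode:
    """Wavelet tree node over values in [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        # zeros[i] = how many of the first i values go to the left child (value <= mid)
        self.zeros = [0]
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletNode([v for v in values if v <= mid], lo, mid)
        self.right = WaveletNode([v for v in values if v > mid], mid + 1, hi)


def distinct_in_range(node, a, b, out=None):
    """Distinct values at 0-based positions [a, b) of the sequence stored at node."""
    if out is None:
        out = []
    if a >= b:
        return out  # empty range: no document below this node
    if node.lo == node.hi:
        out.append(node.lo)  # leaf reached: report the document once
        return out
    za, zb = node.zeros[a], node.zeros[b]
    distinct_in_range(node.left, za, zb, out)
    distinct_in_range(node.right, a - za, b - zb, out)
    return out


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
root = WaveletNode(doc, 1, 4)
print(distinct_in_range(root, 3, 6))
```

Every reported document corresponds to one root-to-leaf path of length about log k, which is consistent with the O(ndoc log k) bound in the slide title; no prev array or RMQ structure is needed in this variant.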
Extensions
• The approach can easily be extended to
  • report the documents in relevance order under standard scoring schemes like TF*IDF; and
  • show context around the first/several/all occurrences in selected documents.
Small experiment
• 50 MB English text
• k = 200
[Figure panels: query time (m=3), query time (m=4), size]