Tunable Compression of Word-level Index for Versioned Corpora
Klaus Berberich, Srikanta Bedathur, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbruecken, Germany
Introduction
• Most document collections are not static
  • Intranet documents, mail folders, blogs, source code, and contents of the World Wide Web
• Contents are being archived – possibly time-stamped and/or versioned
  • Wikis
  • Document repositories (SVN, CVS, …)
  • Desktop
  • Web archives!
• Search over evolving collections
  • Ability to query the collection "as of" a given time
  • Time-travel Search [BBNW'07]
EIIR 2008, Glasgow
Outline
• Time-travel Search
• Our Time-machine: FluxCapacitor/TTIX
• Phrase Queries in TTIX
• FUSION and Controlled FUSION
• Experimental Evaluation
Historical Information Needs
• News articles discussing the cola-drinks cancer controversy during 2005-2006
• Contemporary articles about "Harry Potter and the Philosopher's Stone"
• Angela Merkel's interview during 2002
Time-Travel Search
Example: "Angela Merkel Interview @ 2002" (keyword query: "Angela Merkel Interview"; time-context for evaluation and ranking: 2002)
Keyword search extended with a time-context for evaluation: Q = q @ ts
Evaluate q using the collection that existed at time ts
• Key Challenges
  • Dealing with the MASSIVE size
  • Adapting the scoring models (typically defined for static collections)
  • Efficient query processing
• Opportunities
  • Redundancy in content
  • Sufficiency of good approximations
  • Append-only data growth
FluxCapacitor/TTIX
[Berberich, Bedathur, Neumann, Weikum: SIGIR 2007, VLDB 2007]
• Adapt the inverted index structure to include the validity time-interval of each document version
• Index compaction via approximate temporal coalescing
• A sublist materialization framework for trading off space and performance
[Figure: version history of documents D1, D2, D3, observed to have changed at different times on a timeline t0 … t13, now, including a D3 "deletion"; next to it, the time-stamped inverted index (vocabulary, doc. ids) whose postings carry a score and a validity interval such as [t0,t3), [t0,t2), [t3,t7), [t0,t1)]
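A minimal sketch of how a time-stamped posting list could answer Q = q @ ts. The `Posting` class, the `time_travel_lookup` function, and the sample intervals and scores are illustrative assumptions, not the FluxCapacitor/TTIX implementation; the sketch scans the whole list and ignores coalescing and sublist materialization.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Posting:
    """One document version in a term's posting list (hypothetical layout)."""
    doc_id: str
    begin: int    # version is valid from this time (inclusive)
    end: int      # ... up to this time (exclusive)
    score: float


def time_travel_lookup(index: Dict[str, List[Posting]], term: str, ts: int) -> List[Posting]:
    """Return the postings of `term` whose validity interval [begin, end) contains ts."""
    return [p for p in index.get(term, []) if p.begin <= ts < p.end]


# Made-up sample data: two versions of D1 and one version of D3.
index = {
    "merkel": [
        Posting("D1", 0, 3, 1.87),
        Posting("D3", 0, 2, 1.6),
        Posting("D1", 3, 7, 1.9),
    ]
}

hits = time_travel_lookup(index, "merkel", ts=2)
print([(p.doc_id, p.score) for p in hits])  # -> [('D1', 1.87)]
```

Because the intervals are half-open, the D3 version valid over [0, 2) is not returned for ts = 2.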
Phrase Queries
• Significantly improve effectiveness
• Essential for quickly locating
  • entities – e.g., "Coca Cola", "Where Eagles Dare", …
  • concepts – e.g., "Water filtering"
  • …
• Indexing for phrase queries
  • For each word, need to store positional information for every occurrence
  • Index-size blowup
  • Size reduction via gap encoding + space-efficient coding on positions [Scholer et al. 2002]
Phrase Queries in FluxCapacitor
• Baseline: for each document version dtb, a posting of the following structure
  • Validity time-interval (= 64 bits)
  • Document identifier (= 64 bits)
  • List of word-positions
• Word-positions compressed using standard techniques
  • (Gap + Elias-/Golomb-) encodings
Can this be improved?
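A minimal sketch of gap encoding followed by an Elias code on the word-positions of one posting. Elias-gamma is chosen here as an assumption, since the slide names only an Elias code; the function names are hypothetical, and bits are kept in a Python string for readability rather than packed into bytes.

```python
from typing import List


def to_gaps(positions: List[int]) -> List[int]:
    """Turn strictly increasing 1-based positions into the first position plus differences."""
    prev, gaps = 0, []
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def elias_gamma(n: int) -> str:
    """Elias-gamma code of n >= 1: (bit length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias-gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_positions(positions: List[int]) -> str:
    """Gap-encode the positions and concatenate the gamma codes of the gaps."""
    return "".join(elias_gamma(g) for g in to_gaps(positions))


def decode_positions(bits: str) -> List[int]:
    """Inverse of encode_positions."""
    positions, i, prev = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # unary length prefix
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap                    # undo the gap encoding
        positions.append(prev)
    return positions


encoded = encode_positions([3, 7, 8, 20])   # gaps: 3, 4, 1, 12
print(encoded, len(encoded))                # -> 0110010010001100 16
```

Small gaps get short codes (a gap of 1 costs a single bit), which is why numerically close positions compress well.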
Word-Positions across Versions
• High level of redundancy between versions
  • Append-only changes leave most parts unchanged
  • word b between dt1 and dt2
• Numerical closeness of positions
  • Small shifts in positions
  • word c between dt2 and dt3
[Figure: positions of words b and c across the versions]
FUSION
• Idea:
  • Merge (or fuse) multiple consecutive document versions, and exploit redundancy and positional proximity => better compressibility
• Positions: all word-positions in any of the versions
• Timestamps: all intermediate version timestamps
• Signatures: for each version, a bit-signature of positions
[Figure: fused postings for words b and c]
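A minimal sketch of the three components of a fused posting listed above: the union of positions, the version timestamps, and one bit-signature per version. The `FusedPosting` layout and the function names are assumptions for illustration; the actual encoding used by FUSION is not given on the slide.

```python
from typing import List, NamedTuple, Tuple


class FusedPosting(NamedTuple):
    positions: List[int]    # sorted union of word-positions over all fused versions
    timestamps: List[int]   # one timestamp per fused version
    signatures: List[int]   # bit i set => positions[i] occurs in that version


def fuse(versions: List[Tuple[int, List[int]]]) -> FusedPosting:
    """Fuse consecutive versions given as (timestamp, positions) pairs."""
    positions = sorted({p for _, pos in versions for p in pos})
    slot = {p: i for i, p in enumerate(positions)}
    timestamps = [ts for ts, _ in versions]
    signatures = []
    for _, pos in versions:
        sig = 0
        for p in pos:
            sig |= 1 << slot[p]
        signatures.append(sig)
    return FusedPosting(positions, timestamps, signatures)


def positions_at(fused: FusedPosting, k: int) -> List[int]:
    """Recover the word-positions of the k-th fused version from its signature."""
    sig = fused.signatures[k]
    return [p for i, p in enumerate(fused.positions) if (sig >> i) & 1]


# Made-up example: an append-only change, then a small shift of positions.
fused = fuse([(10, [4, 9]), (20, [4, 9, 15]), (30, [5, 10, 15])])
print(fused.positions)         # -> [4, 5, 9, 10, 15]
print(positions_at(fused, 2))  # -> [5, 10, 15]
```

The three versions hold eight positions in total, while the fused posting stores five positions plus three 5-bit signatures.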
Query Processing – win some, lose some
• Save on overall space
• Naïve organization + processing => reads the whole list, computes ranking
  • FUSION maintains a smaller list, so faster (naïve) query processing
• Who is naïve!?
  • Skip pointers to jump ahead during query processing
  • In the worst case, FUSION ends up reading and processing all the versions, instead of just one version!
• Baseline: good performance, bad storage
• FUSION: bad (worst-case) performance, good storage
Controlled FUSION
• Compute a set of fusions over contiguous versions such that
  • it takes minimal storage for word positions
  • for any version, the maximum worst-case query processing overhead is within η
• Can be set up as an optimization problem
  • Optimal solution computable in O(n^3) time and O(n) space
  • Assumption: storage cost is monotone
  • In practice, we found it close to O(n^2)
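A minimal sketch of the optimization as a dynamic program over contiguous groups of versions. This is not the authors' algorithm: the cost model (number of distinct positions stored per group) and the overhead measure (positions read from the fused group divided by the positions of the smallest version in it) are simplifying assumptions, and `controlled_fusion` is a hypothetical name.

```python
from typing import List, Tuple


def controlled_fusion(versions: List[List[int]], eta: float) -> Tuple[int, List[Tuple[int, int]]]:
    """Partition versions into contiguous groups [i, j) of minimal total cost
    such that every version's overhead within its group is at most eta."""
    if eta < 1:
        raise ValueError("eta must be >= 1 (an unfused version has overhead 1)")
    n = len(versions)
    inf = float("inf")
    best = [0] + [inf] * n      # best[j]: minimal cost of covering versions[0:j]
    back = [0] * (n + 1)        # back[j]: start index of the last group
    for j in range(1, n + 1):
        union, min_len = set(), inf
        for i in range(j - 1, -1, -1):      # grow the last group to the left
            union |= set(versions[i])
            min_len = min(min_len, len(versions[i]))
            if len(union) > eta * min_len:
                break                       # overhead only grows as the group widens
            cost = best[i] + len(union)
            if cost < best[j]:
                best[j], back[j] = cost, i
    groups, j = [], n
    while j > 0:                            # follow back-pointers to list the groups
        groups.append((back[j], j))
        j = back[j]
    return best[n], groups[::-1]


versions = [[1, 2, 3, 4], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6], [11, 12, 13, 14]]
print(controlled_fusion(versions, eta=1.5))   # -> (10, [(0, 3), (3, 4)])
print(controlled_fusion(versions, eta=1.0))   # -> (19, [(0, 1), (1, 2), (2, 3), (3, 4)])
```

With η = 1.0 nothing can be fused and the cost equals the baseline of 19 positions; with η = 1.5 the first three versions are fused and the cost drops to 10.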
Experimental Evaluation
• English Wikipedia
  • Revision history (2004 – 2005)
  • 10% sample (~35,000 docs, ~900,000 versions)
• Baseline:
  • Elias- code: 97.51 GBytes
  • Elias- code: 97.77 GBytes
• FUSION:
  • η between 1.1 – 10
  • Elias- & Elias- for compressing word-positions in each fused posting
Experimental Results
• η = 1.5: 44% of the baseline
• η = 1.5: 35% of the baseline
[Figure: two result plots, one per annotation above]
Conclusions
• Time-travel Search
  • Key to archive search & analysis
  • An interesting and important problem!
• Our Time-machine: FluxCapacitor/TTIX
  • Builds on the inverted index framework
  • Tunable index-size reduction
• FUSION
  • Adds phrase-querying to FluxCapacitor/TTIX
  • More than 50% space reduction over baseline
  • With 50% worst-case overhead in query processing