Tunable Compression of Word-level Index for Versioned Corpora

Klaus Berberich, Srikanta Bedathur, Gerhard Weikum • Max-Planck Institute for Informatics, Saarbruecken, Germany

Introduction • Most document collections are not static • Intranet documents, Mail folders, Blogs, Source-code, and contents of the World Wide Web • Contents are being archived – possibly time-stamped and/or versioned • Wikis • Document repositories (SVN, CVS, …) • Desktop • Web Archives! • Search over evolving collections • Ability to query the collection “as of” given time • Time-travel Search [BBNW’07] EIIR 2008, Glasgow

Outline • Time-travel Search • Our Time-machine: FluxCapacitor/TTIX • Phrase Queries in TTIX • FUSION and Controlled FUSION • Experimental Evaluation

Historical Information Needs • News articles discussing Cola-drinks Cancer controversy during 2005-2006 • Contemporary articles about “Harry Potter and the Philosopher’s Stone” • Angela Merkel’s interview during 2002

Time-Travel Search • Example: “Angela Merkel Interview @ 2002” (keyword query plus a time-context for evaluation and ranking) • Keyword search extended with a time-context for evaluation: Q = q @ ts • Evaluate q using the collection that existed at time ts • Key Challenges • Dealing with the MASSIVE size • Adapting the scoring models (typically defined for static collections) • Efficient query processing • Opportunities • Redundancy in content • Sufficiency of good approximations • Append-only data growth


FluxCapacitor/TTIX [Berberich, Bedathur, Neumann, Weikum: SIGIR 2007, VLDB 2007] • Adapt the inverted index structure to include the validity time-interval of each document version • [Figure: version history of documents D1, D2, D3, observed to have changed at different times between t0 and now, including a “deletion” of D3; next to it, the vocabulary and the time-stamped inverted index, whose postings pair a document id and a score with a validity interval such as [t0,t3)] • Index compaction via approximate temporal coalescing • A sublist materialization framework for trading off space and performance
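The time-stamped postings described on this slide can be illustrated with a minimal sketch. It assumes integer timestamps and half-open validity intervals [begin, end); the `Posting` class, the `postings_as_of` helper, and the sample scores and intervals are hypothetical and are not taken from FluxCapacitor/TTIX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One entry of a time-stamped posting list (hypothetical layout)."""
    doc_id: str
    score: float
    begin: int  # inclusive start of the validity interval
    end: int    # exclusive end of the validity interval


def postings_as_of(postings, ts):
    """Keep only the postings whose interval [begin, end) contains ts."""
    return [p for p in postings if p.begin <= ts < p.end]


# Illustrative posting list for a single term; the values are made up.
term_list = [
    Posting("D1", 1.9, 0, 3),
    Posting("D1", 2.0, 3, 7),
    Posting("D2", 2.2, 0, 2),
    Posting("D3", 1.6, 2, 4),
]

# Evaluating a query "q @ 2" only looks at versions valid at time 2.
visible = postings_as_of(term_list, 2)
```

A linear scan is used here only for clarity; it says nothing about how the actual index organizes or skips over postings.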

Phrase Queries • Significantly improve effectiveness • Essential for quickly locating • entities – e.g., “Coca Cola”, “Where Eagles Dare”,… • concepts – e.g., “Water filtering” • … • Indexing for Phrase queries • For each word, need to store positional information for every occurrence • Index-size blowup • Size reduction via gap encoding + space-efficient coding on positions [Scholer et al. 2002]

Phrase Queries in FluxCapacitor • Baseline: for each document version dtb, a posting with the following structure • Validity Time-interval (= 64 bits) • Document Identifier (= 64 bits) • List of Word-Positions • Word-positions compressed using standard techniques • (Gap + Elias-/Golomb-) encodings • Can this be improved?
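As a reminder of what "gap + Elias" encoding of word positions means, here is a minimal sketch using the Elias-gamma code on bit strings. It assumes 1-based, strictly increasing positions; the function names are illustrative, and the slides do not say which bit-level layout the baseline actually uses.

```python
def to_gaps(positions):
    """Replace each position by its distance to the previous one."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def elias_gamma(n):
    """Elias-gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias-gamma is defined for positive integers only")
    binary = bin(n)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def encode_positions(positions):
    """Gap-encode the positions, then gamma-code every gap."""
    return "".join(elias_gamma(g) for g in to_gaps(positions))


def decode_positions(bits):
    """Invert encode_positions."""
    positions, i, prev = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        prev += int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        positions.append(prev)
    return positions


encoded = encode_positions([3, 7, 8, 20])  # gaps are 3, 4, 1, 12
```

Small gaps get short codes, which is why positions that stay close together compress well.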

Word-Positions across Versions • High level of redundancy between versions • Append-only changes leave most parts unchanged • e.g., word b between dt1 and dt2 • Numerical closeness of positions • Small shifts in positions • e.g., word c between dt2 and dt3 • [Figure: position lists of words b and c across the versions]

FUSION • Idea: • Merge (or fuse) multiple consecutive document versions, and exploit redundancy and positional proximity => better compressibility • Positions: all word-positions occurring in any of the versions • Timestamps: all intermediate version timestamps • Signatures: for each version, a bit-signature over the positions • [Figure: fused position lists of words b and c]
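The three components of a fused posting (positions, timestamps, signatures) can be sketched as follows. This is one illustrative reading of the slide, assuming each version is given as a timestamp plus a sorted position list for a single word in a single document; `fuse` and `positions_of` are hypothetical names, and the real posting layout may differ.

```python
def fuse(versions):
    """Fuse consecutive versions of one (document, word) pair.

    versions: list of (timestamp, positions) tuples in version order.
    Returns (timestamps, fused_positions, signatures), where each
    signature marks which fused positions belong to that version.
    """
    timestamps = [ts for ts, _ in versions]
    fused = sorted(set().union(*(pos for _, pos in versions)))
    slot = {p: i for i, p in enumerate(fused)}
    signatures = []
    for _, pos in versions:
        bits = [0] * len(fused)
        for p in pos:
            bits[slot[p]] = 1
        signatures.append(bits)
    return timestamps, fused, signatures


def positions_of(fused, signature):
    """Recover one version's positions from the fused list."""
    return [p for p, bit in zip(fused, signature) if bit]


# Made-up example: an append, then a shift by one position.
versions = [(1, [2, 9]), (2, [2, 9, 15]), (3, [3, 10, 16])]
timestamps, fused, signatures = fuse(versions)
```

In this example the fused list holds 6 positions instead of the 8 stored by three separate postings, and the neighbouring values (2 and 3, 9 and 10) produce small gaps.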

Query Processing – win some, lose some • Save on overall space • Naïve organization + processing => reads the whole list, computes ranking • FUSION maintains a smaller list, so (naïve) query processing is faster • Who is naïve!? • Skip pointers to jump ahead during query processing • In the worst case, FUSION ends up reading and processing all the fused versions, instead of just one version! • Baseline: good performance, bad storage • FUSION: bad (worst-case) performance, good storage

Controlled FUSION • Compute a set of fusions over contiguous versions such that • Storage for word positions is minimal • For any version, the worst-case query processing overhead is within a factor η • Can be set up as an optimization problem • Optimal solution computable in O(n^3) time and O(n) space • Assumption: storage cost is monotonic • In practice, we found the running time close to O(n^2)
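To make the optimization problem concrete, here is a simplified dynamic-programming sketch. It is not the authors' algorithm: it assumes that the storage cost of a fused group is just the number of distinct positions, and that a version's overhead is the fused list length divided by that version's own list length, which must not exceed η (with η >= 1). Both cost model and function name are assumptions made for illustration.

```python
def controlled_fusion(versions, eta):
    """Partition consecutive versions into fused groups.

    versions: list of position sets, one per version, in version order.
    eta: bound (>= 1) on fused-list length / own-list length per version.
    Returns (total_cost, groups) with groups as half-open (start, end).
    """
    n = len(versions)
    best = [0] + [float("inf")] * n  # best[j]: min cost for versions[:j]
    cut = [0] * (n + 1)              # cut[j]: start of the last group
    for j in range(1, n + 1):
        union, min_len = set(), float("inf")
        # Grow the last group backwards: versions[i:j].
        for i in range(j - 1, -1, -1):
            union |= versions[i]
            min_len = min(min_len, len(versions[i]))
            # The ratio only grows as the group grows, so stop early.
            if len(union) > eta * min_len:
                break
            cost = best[i] + len(union)
            if cost < best[j]:
                best[j], cut[j] = cost, i
    groups, j = [], n
    while j > 0:
        groups.append((cut[j], j))
        j = cut[j]
    groups.reverse()
    return best[n], groups


# Made-up example: three append-only versions, then a rewrite.
history = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {10, 11, 12, 13}]
relaxed = controlled_fusion(history, 1.5)
strict = controlled_fusion(history, 1.0)
```

With η = 1.5 the first three versions are fused and the rewritten one stays separate; with η = 1.0 nothing can be fused and every version is stored on its own, which corresponds to the baseline.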

Experimental Evaluation • English Wikipedia • Revision history (2004 – 2005) • 10% sample (~35,000 docs, ~900,000 versions) • Baseline: • Elias- code: 97.51 GBytes • Elias- code: 97.77 GBytes • FUSION: • η between 1.1 and 10 • Elias- & Elias- codes for compressing word-positions in each fused posting

Experimental Results • [Figure: two result plots] • η = 1.5 → 44% of the baseline • η = 1.5 → 35% of the baseline

Conclusions • Time-travel Search • Key to archive search & analysis • An interesting and important problem! • Our Time-machine: FluxCapacitor/TTIX • Builds on the inverted index framework • Tunable index-size reduction • FUSION • Adds phrase-querying to FluxCapacitor/TTIX • More than 50% space reduction over the baseline • With 50% worst-case overhead in query processing

Thank You! Questions?
