Web archives span a long time and grow continuously. Searching archives: the challenge is to support search with temporal constraints, e.g. obama @ [6/2009 – 6/2011]. Scaling archive search: the challenge is to scale search to growing archives.
Efficient and Scalable Archive Search
Avishek Anand

Searching Archives
• Web archives span a long time.
• Challenge: support search with temporal constraints, e.g. obama@[6/2009 – 6/2011].
• Need to design index structures which efficiently process time-travel queries and can be easily maintained.
• Approach: avoid accessing postings which do not overlap with the query time-interval.

Scaling Archive Search
• Web archives continuously grow over time.
• Challenge: scale search to growing archives.
• Approach: avoid re-computation of the index by creating shards incrementally.

Index Sharding
• Partitions each index-list disjointly.
• No index blow-up.
• Idealized Sharding: eliminates access to postings with no intersection with the query time-interval.
• Cost Aware Shard Merging: merges idealized shards by reconciling random and sequential access costs.

[Figure: an index-list with postings for Doc 1 to Doc 7 laid out along the time axis and partitioned into Shard 1 and Shard 2, with the query time-interval marked.]

Index Maintenance
• Incremental Sharding:
  • Online algorithm with approximation guarantee.
  • Append-only operation on shards.
  • Retains query performance.
• System Architecture: separate indexes for active and retired versions.
• End-time arrival order: versions are finalized in their end-time order.
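The sharding idea above can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes each posting carries a half-open validity interval [begin, end), and it assumes a "staircase" layout in which, inside one shard, postings sorted by begin time also have non-decreasing end times. Under that assumption a query can jump straight to the first posting that is still alive at the query start and stop at the first posting that begins after the query ends. The names `Posting`, `build_shards`, `query_shard`, and `time_travel_query`, and the greedy assignment rule, are hypothetical choices made for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc: str
    begin: int  # start of validity (inclusive), e.g. a month counter
    end: int    # end of validity (exclusive)


def build_shards(postings):
    """Partition one index-list into disjoint shards.

    Assumed greedy rule: visit postings in begin-time order and append
    each one to the first shard whose last end time does not exceed the
    posting's end time. Every shard then has non-decreasing begin AND
    end times, and each posting lands in exactly one shard.
    """
    shards = []
    for p in sorted(postings, key=lambda p: (p.begin, p.end)):
        for shard in shards:
            if shard[-1].end <= p.end:
                shard.append(p)
                break
        else:
            shards.append([p])
    return shards


def query_shard(shard, q_begin, q_end):
    """Return the postings of one shard that overlap [q_begin, q_end)."""
    # A real index would store the end times; they are rebuilt here for brevity.
    ends = [p.end for p in shard]
    start = bisect_right(ends, q_begin)  # first posting with end > q_begin
    hits = []
    for p in shard[start:]:
        if p.begin >= q_end:  # later postings begin even later: stop
            break
        hits.append(p)
    return hits


def time_travel_query(shards, q_begin, q_end):
    """Answer a time-travel query by probing every shard once."""
    return [p for shard in shards for p in query_shard(shard, q_begin, q_end)]
```

In this sketch each shard costs one binary search (a random access) followed by a sequential scan over overlapping postings only. That trade-off between the number of shards and the length of the scans is the kind of cost the poster's cost-aware merging step is described as reconciling.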
[Figure: system architecture. Labels: Crawls; Active Index ("In the live index now"); "Sent to Archive Indexing System"; Archive Index; In-memory Archive Index; External-memory Archive Index; Archive Index Shards; Shard buffers; "Inserted incoming version"; "Appended popped posting". Example versions shown: Doc 1: version 1; Doc 2: version 9; Doc 3: version 2; Doc 4: version 3; Doc 4: version 2.]

Experiments
• SB: Vertical Partitioning with trade-off between performance and index size [3]
• CA: Cost Aware Sharding
• IS: Idealized Sharding
• INC: Incremental Sharding

[Figure panels: Performance of incremental sharding; Index maintenance efficiency; Wallclock-times comparison with SB; Index-size comparison.]

References
• [1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. Index Sharding for Space-Time Efficiency in Archive Search. In SIGIR, 2011.
• [2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. Index Maintenance for Time-Travel Text Search. In SIGIR, 2012.
• [3] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum. A Time Machine for Text Search. In SIGIR, 2007.