
Efficient and Scalable Archive Search Avishek Anand

Web archives span long periods of time and grow continuously. Searching archives: the challenge is to support search with temporal constraints, for example the time-travel query obama @ [6/2009 – 6/2011]. Scaling archive search: the challenge is to scale search to growing archives.


Presentation Transcript


  1. Efficient and Scalable Archive Search
     Avishek Anand

     Goal: design index structures that process time-travel queries efficiently and can be easily maintained.

     Searching Archives
     • Web archives span a long time.
     • Challenge: support search with temporal constraints, e.g. obama @ [6/2009 – 6/2011].
     • Approach: avoid accessing postings that do not overlap with the query time-interval.

     Scaling Archive Search
     • Web archives continuously grow over time.
     • Challenge: scale search to growing archives.
     • Approach: avoid re-computation of the index by creating shards incrementally.

     Index Sharding
     [Figure: an index-list with postings for Doc 1 to Doc 7 laid out along the time axis and partitioned into Shard 1 and Shard 2, with the query time-interval marked.]
     • Index sharding partitions each index-list disjointly, so there is no index blow-up.
     • Idealized Sharding: eliminates access to postings with no intersection with the query time-interval.
     • Cost-Aware Shard Merging: merges idealized shards by reconciling random and sequential access costs.

     Index Maintenance
     • Incremental Sharding: an online algorithm with an approximation guarantee; append-only operations on shards; retains query performance.
     • System architecture: separate indexes for active and retired versions.
     • End-time arrival order: versions are finalized in the order of their end times.
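The sharding idea above can be illustrated with a small sketch. It is not the authors' implementation; it only shows why a disjoint partition of an index-list can let a time-travel query skip postings that do not overlap the query time-interval. The assumptions are: each posting carries a half-open validity interval [begin, end); shards are built with a simple greedy best-fit rule chosen here for illustration; and inside each shard both begin and end times are non-decreasing, so the overlapping postings form one contiguous run. The names `Posting`, `Shard`, `build_shards`, `query_shards` and `scan_unsharded` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Posting:
    doc: str
    begin: int  # first time point at which the version is valid
    end: int    # first time point at which it is no longer valid (exclusive)


@dataclass
class Shard:
    postings: list = field(default_factory=list)
    ends: list = field(default_factory=list)  # end times, kept for binary search

    def append(self, p):
        self.postings.append(p)
        self.ends.append(p.end)


def build_shards(postings):
    """Greedy partition (illustrative assumption): every posting goes to exactly
    one shard, and within a shard begin and end times are both non-decreasing."""
    shards = []
    for p in sorted(postings, key=lambda x: (x.begin, x.end)):
        best = None
        for s in shards:
            # p fits if it does not end before the shard's last posting ends.
            if s.ends[-1] <= p.end and (best is None or s.ends[-1] > best.ends[-1]):
                best = s
        if best is None:
            best = Shard()
            shards.append(best)
        best.append(p)
    return shards


def query_shards(shards, qb, qe):
    """Return (hits, postings_read) for the query time-interval [qb, qe)."""
    hits, read = [], 0
    for s in shards:
        # Skip every posting that ended at or before qb (ends are sorted).
        i = bisect_right(s.ends, qb)
        # Begin times are sorted too, so stop at the first posting starting at or after qe.
        while i < len(s.postings) and s.postings[i].begin < qe:
            hits.append(s.postings[i])
            read += 1
            i += 1
    return hits, read


def scan_unsharded(postings, qb, qe):
    """Baseline: one list ordered by begin time, scanned from its start."""
    hits, read = [], 0
    for p in sorted(postings, key=lambda x: (x.begin, x.end)):
        if p.begin >= qe:
            break
        read += 1
        if p.end > qb:
            hits.append(p)
    return hits, read


if __name__ == "__main__":
    index_list = [
        Posting("d1", 0, 10), Posting("d2", 1, 3), Posting("d3", 2, 4),
        Posting("d4", 5, 6), Posting("d5", 7, 12), Posting("d6", 8, 9),
    ]
    shards = build_shards(index_list)
    print(query_shards(shards, 5, 7))
    print(scan_unsharded(index_list, 5, 7))
```

On the six-posting toy list in the demo, the query interval [5, 7) returns d1 and d4 either way, but the sharded lookup reads 2 postings while the single begin-ordered list reads 4. The cost of opening each shard (one random access per shard) is ignored here; trading that cost against sequential reads is what the cost-aware merging step on the poster addresses.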
     [Figure: system architecture. Labels: Crawls; Active Index; in-memory Archive Index; external-memory Archive Index; shard buffers; Archive Index shards; "inserted incoming version"; "appended popped posting". Example versions (Doc 1: version 1, Doc 2: version 9, Doc 3: version 2, Doc 4: version 2, Doc 4: version 3) are marked either "sent to Archive Indexing System" or "in the live index now".]

     Experiments
     • SB: Vertical Partitioning with a trade-off between performance and index size [3]
     • CA: Cost-Aware Sharding
     • IS: Idealized Sharding
     • INC: Incremental Sharding
     [Figures: performance of incremental sharding; index maintenance efficiency; wallclock-time comparison with SB; index-size comparison.]

     References
     • [1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. Index Sharding for Space-Time Efficiency in Archive Search. In SIGIR, 2011.
     • [2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. Index Maintenance for Time-Travel Text Search. In SIGIR, 2012.
     • [3] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum. A Time Machine for Text Search. In SIGIR, July 2007.
