Efficient Index Updates over the Cloud
Panagiotis Antonopoulos, Microsoft Corp, panant@microsoft.com
Ioannis Konstantinou, National Technical University of Athens, ikons@ece.ntua.gr
Dimitrios Tsoumakos, Ionian University, dtsouma@ionio.gr
Nectarios Koziris, National Technical University of Athens, nkoziris@ece.ntua.gr
Requirements in the Web • Huge volume of datasets > 1.8 zettabytes, growing by 80% each year • Huge number of users > 2 billion users searching and updating web content • Explosion of User Generated Content Facebook: 90 updates/user/month, 30 billion/day Wikipedia: 30 updates/article/month, 8K new articles/day • Users demand fresh results
Our contribution A distributed system that allows fast and frequent updates on web-scale Inverted Indexes. Incremental processing of updates Distributed processing – MapReduce Distributed index storage and serving – NoSQL
Goals • Update time independent of existing index size • Fast and frequent updates on large indexes • Index consistency after an update • System stability and performance unaffected by updates • Scalability • Exploit large commodity clusters
Inverted Index Maps each term included in a collection of documents to the list of documents that contain the term: (term, list(doc_ref)) Popular for fast content search (e.g., search engines) Index Record: (term, doc_ref)
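The (term, list(doc_ref)) mapping above can be sketched in plain Python with in-memory dictionaries (a toy model for illustration only, not the HBase-backed index described later):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)          # one index record: (term, doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the quick brown fox", 2: "the lazy dog"}
index = build_inverted_index(docs)
# "the" occurs in both documents, so index["the"] == [1, 2]
```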
Related Work • Google, distributed index creation • Google Caffeine, fast and continuous index updates • Apache Solr, distributed search through index replication • Katta, distributed index creation and serving • CSLAB, distributed index creation and serving • LucidWorks, distributed index creation and updates on top of Solr (not open source)
Basic Update Procedure • Input: Collection of new/modified documents • For each new document: • Simply add each term to the corresponding list • For each modified document: • Delete all index records that refer to the old version • Add each term of the new version to the corresponding list
Basic Update Procedure For modified documents we need to: • Obtain the indexed terms of the old version • Locate and delete the corresponding index records • Complexity depends on the schema of the index • Update time critically depends on these operations! • How can we do it efficiently?
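To see why these operations are critical, consider the naive approach with no extra metadata: locating the old version's records means scanning every posting list. A minimal Python sketch (hypothetical in-memory model) makes the cost visible:

```python
def delete_doc_naive(index, doc_id):
    """Remove every index record referring to doc_id by scanning ALL
    posting lists. The cost grows with the total index size rather than
    the document size, which is what the following slides avoid."""
    for postings in index.values():
        postings.discard(doc_id)

index = {"fox": {1, 2}, "dog": {2}, "cat": {3}}
delete_doc_naive(index, 2)
# doc 2 is gone from every list: {"fox": {1}, "dog": set(), "cat": {3}}
```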
Proposed Schema • HBase: • Stores and indexes millions of columns per row • Stores varying number of columns for each row • Proposed Schema: • One row for every indexed term • One column for each document contained in the list of the corresponding term • Use the document ID as the column name
Proposed Schema Each cell (row, column) corresponds to an index record (term, docID) • Advantages • Fast record discovery and deletion Almost independent of the list size • Disadvantages • Required storage space (overhead per column)
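The row-per-term, column-per-document layout can be modeled with nested dictionaries (an illustrative stand-in for HBase cells, not the actual HBase client API):

```python
# One row per term; one column, named by the document ID, per index record.
def add_record(table, term, doc_id, payload=b""):
    table.setdefault(term, {})[doc_id] = payload   # cell (row=term, column=doc_id)

def delete_record(table, term, doc_id):
    # Direct cell lookup and delete: cost is almost independent of list size.
    table.get(term, {}).pop(doc_id, None)

inverted_table = {}
add_record(inverted_table, "fox", "doc42")
delete_record(inverted_table, "fox", "doc42")
```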
Forward Index • Forward Index: List of terms of each document • Advantages: • Immediate access to the terms of the old version • Retrieving the Forward Index is faster (smaller size) • Disadvantages: • Required storage space • Small overhead to the indexing process
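The Forward Index is simply the inverse mapping, document to terms; a minimal in-memory sketch:

```python
def build_forward_index(docs):
    """Store, for each document, the set of terms it was indexed under,
    so the old version's terms can be fetched in one small read instead
    of re-parsing or scanning the inverted index."""
    return {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

fi = build_forward_index({1: "the quick brown fox"})
# fi[1] == {"the", "quick", "brown", "fox"}
```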
Minimizing Index Changes General Idea: • Modifications in the documents’ content are limited • Update the index based only on the content modifications • Procedure: • Compare the two different versions of each document • Delete the terms contained in the old version but not in the new • Add the terms contained in the new version but not in the old
Minimizing Index Changes • No changes required for the common terms • Advantages: • Minimize the changes required to the index • Minimize costly insertions and deletions in HBase • Minimize volume of intermediate K/V pairs (distributed) • Disadvantages: • Increased complexity of the indexing process
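The comparison step reduces to two set differences; a sketch:

```python
def diff_terms(old_terms, new_terms):
    """Compare the indexed terms of the old and new document versions."""
    to_delete = old_terms - new_terms   # in the old version but not the new
    to_add = new_terms - old_terms      # in the new version but not the old
    return to_delete, to_add            # common terms need no index change

to_delete, to_add = diff_terms({"the", "quick", "fox"}, {"the", "lazy", "fox"})
# to_delete == {"quick"}, to_add == {"lazy"}; "the" and "fox" are untouched
```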
Distributed Index Updates • Better but still centralized! • Perfectly suited to the MapReduce logic: • Each document can be processed independently • The updates have to be merged before they are applied to the index • Utilizing MR model: • Easily distribute the processing • Exploit the resources of large commodity clusters
Distributed Index Updates
• Mappers:
• Scan each modified document
• Retrieve the old Forward Index
• Compare the two versions
• Emit K/V pairs for additions: (term, docID)
• Emit K/V pairs for deletions: (term, docID)
• Emit K/V pairs for the Forward Index and Content
• Combiners:
• Merge the K/V pairs into a list of values per key (only for additions and deletions)
• Emit a K/V pair for additions: (term, list(docID))
• Emit a K/V pair for deletions: (term, list(docID))
• Reducers:
• For additions: create an index record for each (term, docID) pair, write the records to HFiles, and Bulk Load the output HFiles to HBase
• For deletions: delete the corresponding cells using the HBase Client API
• Tables:
• Content Table: the raw documents
• Forward Index Table: the Forward Index
• Inverted Index Table: the Inverted Index using the schema described in the previous slides
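The map and combine steps of this pipeline can be sketched in Python (a simplified single-machine model, not the actual Hadoop/HBase implementation):

```python
from collections import defaultdict

def mapper(doc_id, new_text, forward_index):
    """Compare the new version against the old Forward Index entry and
    emit K/V pairs only for the terms that actually changed."""
    old_terms = forward_index.get(doc_id, set())
    new_terms = set(new_text.lower().split())
    for term in new_terms - old_terms:
        yield ("ADD", term), doc_id
    for term in old_terms - new_terms:
        yield ("DEL", term), doc_id

def combine(pairs):
    """Merge K/V pairs into one list of docIDs per (op, term) key."""
    merged = defaultdict(list)
    for key, doc_id in pairs:
        merged[key].append(doc_id)
    return dict(merged)

fi = {7: {"the", "quick", "fox"}}
out = combine(mapper(7, "the lazy fox", fi))
# out == {("ADD", "lazy"): [7], ("DEL", "quick"): [7]}
```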
Even Load Distribution • Two different types of keys: • Document ID: • One K/V pair for the Content and one for the FI of each document • Divide the keys into equally sized partitions using a hash function • Term: • Term frequencies follow a skewed (Zipfian) distribution in natural languages • The number of values per key-term varies significantly
Even Load Distribution • Solution: Sampling the input • Mappers: • Process a sample of the input using the same algorithm • Emit a (term, 1) K/V pair for each addition or deletion • Reducers (1 for additions, 1 for deletions): • Count the occurrences of each term to determine the splitting points • Indexer: • Loads the splitting points and chooses the reducer for each key
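Turning the sampled term counts into reducer splitting points can be sketched as follows (hypothetical helper names; in the actual system the sampling runs as a separate MapReduce pass):

```python
import bisect

def splitting_points(term_counts, num_reducers):
    """Pick term boundaries so each partition carries roughly the same
    number of K/V pairs, despite the Zipfian skew of term frequencies."""
    total = sum(term_counts.values())
    target = total / num_reducers
    points, acc = [], 0
    for term in sorted(term_counts):
        acc += term_counts[term]
        if acc >= target and len(points) < num_reducers - 1:
            points.append(term)   # partition boundary (inclusive)
            acc = 0
    return points

def reducer_for(term, points):
    """Route a key-term to its partition by binary search on the points."""
    return bisect.bisect_left(points, term)

points = splitting_points({"a": 5, "b": 1, "c": 1, "d": 5}, 2)
# points == ["b"]: "a" and "b" go to reducer 0, "c" and "d" to reducer 1
```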
Experimental Setup • Cluster: • 2-12 worker nodes (default: 8) • 8 cores @2GHz, 8GB RAM • Hadoop v.0.20.2-CDH3 (Cloudera) • HBase v.0.90.3-CDH3 (Cloudera) • 6 mappers and 6 reducers per node • Datasets: • Wikipedia snapshots on April 5, 2011 and May 26, 2011 • Default initial dataset: 64.2 GB, 23.7 million documents • Default update dataset: 15.4 GB, 2.2 million documents
Experimental Results Evaluating our design choices • Comparison: Depends on the number of indexed terms • Forward Index: Important in both cases • Bulk Loading: Depends on the number of indexed terms • Sampling: Not important, small number of intermediate K/V pairs
Experimental Results Update time vs. Update dataset size Update time is linear in the update dataset size For fixed size of initial dataset: 64.2GB (≈24 mil. documents)
Experimental Results Update time vs. Initial Dataset Size 4X larger initial dataset increases update time by less than 6% Update time roughly independent of the initial index size For fixed new/modified documents dataset: 5.1 GB (≈400 thousand docs)
Experimental Results Update time vs. Available resources (# of Mappers/Reducers) • 5X faster indexing from 2 to 12 nodes • Bulk loading to HBase does NOT scale as expected • 3.3X better performance in total For fixed size of initial/update datasets: 64.2GB/15.4GB
Conclusion Incremental Processing: • Process updates incrementally, minimizing required changes • Update time: • Almost independent of initial index size • Linear in the update dataset size Distributed Processing: • Reduced update time • Scalability
Conclusion Fast and frequent updates on web-scale Indexes • Wikipedia: >6X faster than index rebuild Disadvantages: • Slower index creation (done only once) • Increase in required storage space (low cost)
The End Questions… Thank you!