Efficient Index Updates over the Cloud

Presentation Transcript


  1. Efficient Index Updates over the Cloud Panagiotis Antonopoulos Microsoft Corp panant@microsoft.com Ioannis Konstantinou National Technical University of Athens ikons@ece.ntua.gr Dimitrios Tsoumakos Ionian University dtsouma@ionio.gr Nectarios Koziris National Technical University of Athens nkoziris@ece.ntua.gr

  2. Requirements in the Web • Huge volume of datasets > 1.8 zettabytes, growing by 80% each year • Huge number of users > 2 billion users searching and updating web content • Explosion of User Generated Content Facebook: 90 updates/user/month, 30 billion/day Wikipedia: 30 updates/article/month, 8K new/day • Users demand fresh results

  3. Our contribution A distributed system which allows fast and frequent updates on web-scale Inverted Indexes. • Incremental processing of updates • Distributed processing – MapReduce • Distributed index storage and serving – NoSQL

  4. Goals • Update time independent of existing index size • Fast and frequent updates on large indexes • Index consistency after an update • System stability and performance unaffected by updates • Scalability • Exploit large commodity clusters

  5. Inverted Index Maps each term included in a collection of documents to the documents that contain the term: (term, list(doc_ref)) • Popular for fast content search and search engines • Index Record: (term, doc_ref) • Example: see the sketch below
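
The example figure on the original slide is not reproduced in the transcript. As a rough illustration only (not the authors' code; the class name and naive tokenization are assumptions), an in-memory inverted index in Java could be built like this:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InvertedIndexExample {
        // Builds (term, list(doc_ref)) records from a docId -> content map.
        public static Map<String, List<String>> build(Map<String, String> docs) {
            Map<String, List<String>> index = new HashMap<String, List<String>>();
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                String docId = doc.getKey();
                // Naive tokenization; a real indexer would normalize, stem and filter terms.
                for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                    if (term.isEmpty()) continue;
                    List<String> postings = index.get(term);
                    if (postings == null) {
                        postings = new ArrayList<String>();
                        index.put(term, postings);
                    }
                    if (!postings.contains(docId)) postings.add(docId);
                }
            }
            return index;
        }
    }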

  6. Related Work • Google, distributed index creation • Google Caffeine, fast and continuous index updates • Apache Solr, distributed search through index replication • Katta, distributed index creation and serving • CSLAB, distributed index creation and serving • LucidWorks, distributed index creation and updates on top of Solr (not open-source)

  7. Basic Update Procedure • Input: Collection of new/modified documents • For each new document: • Simply add each term to the corresponding list • For each modified document: • Delete all index records that refer to the old version • Add each term of the new version to the corresponding list

  8. Basic Update Procedure For modified documents we need to: • Obtain the indexed terms of the old version • Locate and delete the corresponding index records • Complexity depends on the schema of the index • Update time critically depends on these operations! • How can we do it efficiently?
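
A hedged sketch of this basic procedure in Java; the Index and Document types and their methods are hypothetical placeholders for illustration, not the authors' API:

    import java.util.Set;

    interface Index {
        void addRecord(String term, String docId);
        void deleteRecord(String term, String docId);
    }

    interface Document {
        String id();
        Set<String> terms();
    }

    class BasicUpdater {
        // oldVersion is null for newly created documents.
        static void apply(Index index, Document newVersion, Document oldVersion) {
            if (oldVersion != null) {
                // Modified document: delete all records that refer to the old version.
                for (String term : oldVersion.terms()) {
                    index.deleteRecord(term, oldVersion.id());
                }
            }
            // Add each term of the new version to the corresponding list.
            for (String term : newVersion.terms()) {
                index.addRecord(term, newVersion.id());
            }
        }
    }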

  9. Proposed Schema • HBase: • Stores and indexes millions of columns per row • Stores varying number of columns for each row • Proposed Schema: • One row for every indexed term • One column for each document contained in the list of the corresponding term • Use the document ID as the column name

  10. Proposed Schema Each cell (row, column) corresponds to an index record (term, docID) • Advantages • Fast record discovery and deletion Almost independent of the list size • Disadvantages • Required storage space (overhead per column)
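
With this schema, adding or removing an index record is a single-cell operation. A hedged sketch against the HBase 0.90 client API (the table name, column family and empty cell value are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexCellOps {
        private static final byte[] FAMILY = Bytes.toBytes("d"); // assumed column family

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable index = new HTable(conf, "inverted_index"); // assumed table name

            // Add record (term, docID): row = term, column qualifier = docID, empty value.
            Put put = new Put(Bytes.toBytes("cloud"));
            put.add(FAMILY, Bytes.toBytes("doc42"), new byte[0]);
            index.put(put);

            // Delete record (term, docID): drop the single cell, independent of list size.
            Delete del = new Delete(Bytes.toBytes("cloud"));
            del.deleteColumns(FAMILY, Bytes.toBytes("doc42"));
            index.delete(del);

            index.close();
        }
    }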

  11. Forward Index • Forward Index: List of terms of each document • Example: • Advantages: • Immediate access to the terms of old version • Retrieving the Forward Index is faster (smaller size) • Disadvantages: • Required storage space • Small overhead to the indexing process
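
The example here is again a slide figure. Retrieving a document's old term list from a Forward Index table might look like the following hedged sketch (table, family and qualifier names, as well as the space-separated encoding, are assumptions):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ForwardIndexLookup {
        // Returns the indexed terms of the previously stored version of docId.
        public static Set<String> oldTerms(Configuration conf, String docId) throws Exception {
            HTable fi = new HTable(conf, "forward_index");    // assumed table name
            Result row = fi.get(new Get(Bytes.toBytes(docId)));
            fi.close();
            if (row.isEmpty()) return new HashSet<String>();  // new document, no old version
            byte[] value = row.getValue(Bytes.toBytes("f"), Bytes.toBytes("terms")); // assumed names
            return new HashSet<String>(Arrays.asList(Bytes.toString(value).split(" ")));
        }
    }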

  12. Minimizing Index Changes General Idea: • Modifications in the documents’ content are limited • Update the index based only on the content modifications • Procedure: • Compare the two different versions of each document • Delete the terms contained in the old version but not in the new • Add the terms contained in the new version but not in the old

  13. Minimizing Index Changes • No changes required for the common terms • Advantages: • Minimize the changes required to the index • Minimize costly insertions and deletions in HBase • Minimize volume of intermediate K/V pairs (distributed) • Disadvantages: • Increased complexity of indexing process
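
The core of this incremental step is a set difference between the two versions' term sets. A minimal sketch (class and variable names are illustrative, not the authors' code):

    import java.util.HashSet;
    import java.util.Set;

    public class VersionDiff {
        // Computes which index records must be deleted and which must be added.
        public static void diff(Set<String> oldTerms, Set<String> newTerms,
                                Set<String> toDelete, Set<String> toAdd) {
            toDelete.addAll(oldTerms);
            toDelete.removeAll(newTerms);   // in the old version only -> delete records

            toAdd.addAll(newTerms);
            toAdd.removeAll(oldTerms);      // in the new version only -> add records
            // Terms common to both versions require no index change.
        }
    }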

  14. Distributed Index Updates • Better but still centralized! • Perfectly suited to the MapReduce logic: • Each document can be processed independently • The updates have to be merged before they are applied to the index • Utilizing MR model: • Easily distribute the processing • Exploit the resources of large commodity clusters

  15. Distributed Index Updates • HBase tables: Content Table (the raw documents), Forward Index Table (the Forward Index), Inverted Index Table (the Inverted Index using the schema described in the previous slides) • Mappers: scan each modified document, retrieve its old FI, compare the two versions, and emit K/V pairs for additions (term, docID), for deletions (term, docID), and for the FI and Content • Combiners: merge the K/V pairs into a list of values per key (only for additions and deletions), emitting (term, list(docID)) pairs • Reducers: for additions, create an index record for each (term, docID) pair and write the records to HFiles; for deletions, delete the corresponding cells using the HBase Client API • Bulk Load the output HFiles to HBase
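
A hedged sketch of the mapper stage in Hadoop's Java MapReduce API, loosely following the pipeline above; the "A:"/"D:" key tagging, the input format and the helper methods are assumptions, not the authors' implementation:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UpdateMapper extends Mapper<Text, Text, Text, Text> {

        @Override
        protected void map(Text docId, Text newContent, Context context)
                throws IOException, InterruptedException {
            Set<String> newTerms = tokenize(newContent.toString());
            Set<String> oldTerms = fetchForwardIndex(docId.toString()); // empty for new documents

            for (String term : newTerms) {
                if (!oldTerms.contains(term)) {
                    context.write(new Text("A:" + term), docId);   // addition
                }
            }
            for (String term : oldTerms) {
                if (!newTerms.contains(term)) {
                    context.write(new Text("D:" + term), docId);   // deletion
                }
            }
            // K/V pairs for the Forward Index and Content tables (keyed by docId)
            // would also be emitted here.
        }

        private Set<String> tokenize(String content) {
            Set<String> terms = new HashSet<String>();
            for (String t : content.toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) terms.add(t);
            }
            return terms;
        }

        private Set<String> fetchForwardIndex(String docId) {
            // Placeholder: a real mapper would read the Forward Index table,
            // e.g. via the HBase client as sketched earlier.
            return new HashSet<String>();
        }
    }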

  16. Even Load Distribution • Two different types of keys: • Document ID: • One K/V pair for the Content and one for the FI of each document • Divide the keys into equally sized partitions using a hash function • Term: • Skewed-Zipfian distribution in natural languages • The number of values per key-term varies significantly

  17. Even Load Distribution • Solution: Sampling the input • Mappers: • Process a sample using the same algorithm • Emit a (term, 1) K/V pair for each addition or deletion • Reducers (1 for additions, 1 for deletions): • Count the occurrences to determine the splitting points • Indexer: • Loads the splitting points and chooses the reducer for each key
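
A hedged sketch of how sampled splitting points could drive the key-to-reducer assignment with a custom Hadoop Partitioner; the hard-coded split points stand in for the sampling job's output and are an assumption about the mechanism, not the authors' code:

    import java.util.Arrays;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class TermRangePartitioner extends Partitioner<Text, Text> {

        // Sorted term boundaries produced by the sampling job; hard-coded for illustration.
        private final String[] splitPoints = { "g", "n", "t" };

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int idx = Arrays.binarySearch(splitPoints, key.toString());
            int partition = (idx >= 0) ? idx : -(idx + 1);   // first range containing the key
            return Math.min(partition, numPartitions - 1);
        }
    }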

  18. Experimental Setup • Cluster: • 2-12 worker nodes (default: 8) • 8 cores @ 2 GHz, 8 GB RAM • Hadoop v.0.20.2-CDH3 (Cloudera) • HBase v.0.90.3-CDH3 (Cloudera) • 6 mappers and 6 reducers per node • Datasets: • Wikipedia snapshots from April 5, 2011 and May 26, 2011 • Default initial dataset: 64.2 GB, 23.7 million documents • Default update dataset: 15.4 GB, 2.2 million documents

  19. Experimental Results Evaluating our design choices • Comparison: Depends on the number of indexed terms • Forward Index: Important in both cases • Bulk Loading: Depends on the number of indexed terms • Sampling: Not important, small number of intermediate K/V pairs

  20. Experimental Results Update time vs. Update dataset size • Update time is linear in the update dataset size • For a fixed initial dataset: 64.2 GB (≈24 million documents)

  21. Experimental Results Update time vs. Initial Dataset Size • A 4X larger initial dataset increases update time by less than 6% • Update time roughly independent of the initial index size • For a fixed new/modified documents dataset: 5.1 GB (≈400 thousand docs)

  22. Experimental Results Update time vs. Available resources (# of Mappers/Reducers) • 5X faster indexing from 2 to 12 nodes • Bulk loading to HBase does NOT scale as expected • 3.3X better performance in total For fixed size of initial/update datasets: 64.2GB/15.4GB

  23. Conclusion Incremental Processing: • Process updates, minimize required changes • Update time: • Almost independent of initial index size • Linear in the update dataset size Distributed Processing: • Reduced update time • Scalability

  24. Conclusion Fast and frequent updates on web-scale Indexes • Wikipedia: >6X faster than index rebuild Disadvantages: • Slower index creation (done only once) • Increase in required storage space (low cost)

  25. The End Questions… Thank you!
