Papers on Parallel IR

Agenda
• Introduction
• Paper 1: Inverted file partitioning schemes in multiple disk systems
• Paper 2: Parallel search using partitioned inverted files
• Comparison
• Conclusion
• URL Links to Papers

Introduction
Parallelism in query processing involves:
• Multitasking simultaneous queries
– A thread or process for each user query, which can execute on a CPU
– The same thread or process completes an entire single query
– Ability to handle multiple concurrent queries
• Query partitioning
– A single query is broken into sub tasks
– Each sub task can run in parallel
– Improves the response time of a single query

Partitioning Query into Sub Tasks
• IR involves dealing with large amounts of data, so the data set can be partitioned between sub tasks
• Document Partitioning
– Divides the documents among sub tasks, so that each sub task processes a subset of the documents
• Term Partitioning
– Divides the indexing terms among sub tasks, so that the processing of each document is spread across sub tasks
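
The two partitioning styles above can be illustrated with a minimal Python sketch. The toy collection `DOCS`, the round-robin assignment, and the function names are assumptions made for illustration; neither paper prescribes them.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> index terms of that document.
DOCS = {
    0: ["parallel", "query"],
    1: ["index", "query"],
    2: ["parallel", "index"],
    3: ["disk", "query"],
}


def document_partition(docs, n_tasks):
    """Give each sub task a subset of the documents (round-robin by doc id)."""
    parts = [dict() for _ in range(n_tasks)]
    for doc_id, terms in docs.items():
        parts[doc_id % n_tasks][doc_id] = terms
    return parts


def term_partition(docs, n_tasks):
    """Give each sub task a subset of the index terms (round-robin by term).

    Each sub task holds, for its own terms, the ids of all documents that
    contain them, so one document is spread over several sub tasks.
    """
    terms = sorted({t for ts in docs.values() for t in ts})
    owner = {t: i % n_tasks for i, t in enumerate(terms)}
    parts = [defaultdict(list) for _ in range(n_tasks)]
    for doc_id, ts in docs.items():
        for t in ts:
            parts[owner[t]][t].append(doc_id)
    return [dict(p) for p in parts]


if __name__ == "__main__":
    print(document_partition(DOCS, 2))
    print(term_partition(DOCS, 2))
```

With document partitioning every sub task sees whole documents; with term partitioning a document such as 0 is split between the sub task that owns "parallel" and the one that owns "query".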

Theme of the Papers Being Presented
• Both papers explore the issues and performance implications of parallel IR systems that use inverted indexes when they employ:
– A) Document Partitioning
– B) Index Term Partitioning
• Paper 1: Inverted file partitioning schemes in multiple disk systems
• Paper 2: Parallel search using partitioned inverted files

P1: Inverted File Systems
• An inverted file system consists of:
– Index File: an ordered list of all keywords that have been used to index a collection of documents. Each term has fields that give the location and number of its postings in the posting file
– Posting File: a group of records, each holding the weight of the term and a pointer to the actual document in the document file
– Document File: contains the actual document records of the collection
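
As a rough illustration of how the three files relate, here is a minimal in-memory sketch in Python. The field names (`offset`, `count`, `doc_ptr`), the sample data, and the `lookup` helper are hypothetical; the paper's actual on-disk record layout is not given in these slides.

```python
import bisect
from dataclasses import dataclass


@dataclass
class IndexEntry:
    term: str
    offset: int  # location of the term's first posting in the posting file
    count: int   # number of postings for the term


@dataclass
class Posting:
    weight: float  # weight of the term in the document
    doc_ptr: int   # pointer (here: list position) into the document file


# Hypothetical sample data standing in for the three files.
DOCUMENT_FILE = [
    "doc about disks",
    "doc about queries",
    "doc about disks and queries",
]
POSTING_FILE = [
    Posting(0.9, 0), Posting(0.4, 2),  # postings for "disks"
    Posting(0.7, 1), Posting(0.5, 2),  # postings for "queries"
]
INDEX_FILE = [  # kept ordered by term so it can be binary-searched
    IndexEntry("disks", 0, 2),
    IndexEntry("queries", 2, 2),
]


def lookup(term):
    """Return (weight, document) pairs for a term, or [] if it is not indexed."""
    keys = [entry.term for entry in INDEX_FILE]
    i = bisect.bisect_left(keys, term)
    if i == len(keys) or keys[i] != term:
        return []
    entry = INDEX_FILE[i]
    postings = POSTING_FILE[entry.offset:entry.offset + entry.count]
    return [(p.weight, DOCUMENT_FILE[p.doc_ptr]) for p in postings]


if __name__ == "__main__":
    print(lookup("queries"))
```

A lookup therefore touches all three files in turn: the index file to find the term, the posting file to read its records, and the document file to fetch the documents they point to.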

P1: Inverted File Systems (cont.) [slide content not included in transcript]

P1: Load Balancing
In a multiple-CPU, multiple-disk system we need to:
• Balance the load on the processors
– Need to maximize CPU utilization
• Balance the load on the I/O devices, i.e. the disk drives
– Avoid I/O bottlenecks, which cause CPUs to go into wait states

P1: Partitioning an Inverted File
The paper explores two schemes:
• Based on Term Id
• Based on Document Id
• In both schemes the index file and the document file are partitioned the same way: the index file by index term id and the document file by document id
• The posting file carries both the document id and the index term id. One scheme partitions the posting file by term id, while the other partitions it by document id.

P1: Partitioning an Inverted File (cont.) [slide content not included in transcript]

P1: Objective of Partitioning Inverted Index
• Objective: to maximize performance
• Ideal: all I/O channels and disk drives are equally used when the sub tasks of a query execute in parallel
• However, data usage varies from query to query and cannot be predicted, so the ideal cannot be achieved
• The paper recognizes that I/O is a major cost factor in IR

A Brief Comparison [slide content not included in transcript]

A Brief Comparison (cont.)
• The main difference is the I/O characteristic: a sub task for a single query index term spreads its disk I/O across multiple disks under DocumentId partitioning, while under TermId partitioning it is limited to one disk.
• Which is better? It is a tradeoff.
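
The I/O difference can be made concrete with a small Python sketch that reports which disks one query term touches under each scheme. The posting list and the modulo placement rule are assumptions for illustration, not the paper's placement policy.

```python
# Hypothetical posting records as (term_id, doc_id) pairs.
POSTINGS = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (2, 3)]


def disks_touched(postings, term_id, n_disks, scheme):
    """Return the sorted disk numbers read when looking up one term.

    scheme="term": a posting lives on disk term_id % n_disks.
    scheme="doc":  a posting lives on disk doc_id % n_disks.
    """
    if scheme == "term":
        key = 0
    elif scheme == "doc":
        key = 1
    else:
        raise ValueError("scheme must be 'term' or 'doc'")
    return sorted({p[key] % n_disks for p in postings if p[0] == term_id})


if __name__ == "__main__":
    print(disks_touched(POSTINGS, 0, 4, "term"))  # one disk
    print(disks_touched(POSTINGS, 0, 4, "doc"))   # spread over several disks
```

Reading from one disk avoids fan-out but serializes the I/O for that term; reading from many disks parallelizes it at the price of occupying every disk for a single term.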

P1: Simulation Model
• To compare the two schemes, the paper defines a simulation model with the following factors:
– a) Collection Database Model: follows a natural language text distribution obeying Zipf's law, in which 20% of the index terms account for 80% of the posting entries. The model skews these ratios to observe the effect on query performance
– b) User Query Model: the paper uses two cases: a skewed query model, in which some low-rank terms are requested frequently, and a uniform query model, in which all terms have the same probability
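
To see how a Zipf-like distribution concentrates postings in a small share of terms, the following Python sketch computes the fraction of postings held by the top 20% of terms. The frequency form 1/rank^skew and the `skew` parameter are assumptions for illustration; the slides do not state the exact parameters the paper used.

```python
def top_share(n_terms, skew, top_fraction=0.2):
    """Fraction of all postings held by the most frequent terms.

    Term frequency is modelled as 1 / rank**skew (an assumed Zipf-like form).
    skew=0 gives a uniform collection; larger values give a more skewed one.
    """
    freqs = [1.0 / (rank ** skew) for rank in range(1, n_terms + 1)]
    top_n = round(n_terms * top_fraction)
    return sum(freqs[:top_n]) / sum(freqs)


if __name__ == "__main__":
    for skew in (0.0, 0.5, 1.0):
        print(skew, top_share(1000, skew))
```

Varying `skew` plays the role of the model's knob for skewing the ratios: at 0 the top 20% of terms hold exactly 20% of the postings, and the share grows as the skew increases.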

P1: Simulation Model (cont.)
– c) Queuing Model: concurrent I/O requests on the same device are queued in priority order. CPU usage requests on the same CPU are also queued
– d) Workload Model: varies the number of disks and CPUs

Simulation Results
• Increasing the number of disks up to a threshold improves performance by decreasing the response time
• When the index term and query term distributions are not skewed, the partitioning scheme based on term id performed best
• When the data was skewed, the partitioning scheme based on document id performed best. With skewed data (80/20) and TermId partitioning, the disks holding those 20% of terms become bottlenecks
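
The bottleneck effect under skew can be sketched with a deterministic load estimate in Python. This is not the paper's simulator: the posting counts, query probabilities, and modulo placement are hypothetical, and seek costs and queuing are ignored, so the sketch shows only load imbalance and not the advantage TermId partitioning has on unskewed data.

```python
def disk_loads(posting_counts, query_probs, n_disks, scheme):
    """Expected postings read per disk for a given partitioning scheme.

    posting_counts[t] is the posting list length of term t and
    query_probs[t] is the probability that a query asks for term t.
    """
    loads = [0.0] * n_disks
    for term_id, (count, prob) in enumerate(zip(posting_counts, query_probs)):
        work = count * prob  # expected postings read for this term
        if scheme == "term":
            loads[term_id % n_disks] += work  # whole list on one disk
        elif scheme == "doc":
            for d in range(n_disks):          # list spread over all disks
                loads[d] += work / n_disks
        else:
            raise ValueError("scheme must be 'term' or 'doc'")
    return loads


def imbalance(loads):
    """Busiest disk relative to the average disk (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    counts = [1000, 100, 10, 10]   # skewed collection (assumed numbers)
    probs = [0.7, 0.1, 0.1, 0.1]   # skewed queries (assumed numbers)
    for scheme in ("term", "doc"):
        print(scheme, imbalance(disk_loads(counts, probs, 4, scheme)))
```

In this toy setting the disk that owns the frequent term carries almost all of the load under TermId partitioning, while DocumentId partitioning keeps every disk equally busy.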

Paper 2 - Positioning w.r.t. Paper 1
• The thrust of Paper 1’s approach was to partition user queries by index term, with each index term query becoming a sub task. The objective then became to optimize the individual sub task with the biggest I/O bottleneck
• What if a user query has only one query index term? The disks are optimized, but the CPUs are idle
• Paper 2 recognizes that most user queries are single term only. Why?

P2: Search Topology Framework
• P2 proposes a different framework: [figure not included in transcript]

P2: Search Topology (cont.)
• Top Node: accepts a query from the client, distributes it to all of its child nodes, and awaits the results.
• Leaf Node: looks after only ONE PARTITION of the inverted file.
Each leaf node and the top node have a processor each. Within this framework, the paper’s objective is to evaluate which type of inverted index partitioning is better: DocId or TermId based.
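
The top node / leaf node arrangement can be sketched in Python with threads standing in for processors. The class names, the additive scoring, and the sample partitions are assumptions for illustration and do not reproduce the PLIERS implementation.

```python
from concurrent.futures import ThreadPoolExecutor


class LeafNode:
    """Looks after exactly one partition of the inverted file."""

    def __init__(self, partition):
        self.partition = partition  # term -> {doc_id: weight}

    def search(self, terms):
        scores = {}
        for term in terms:
            for doc_id, weight in self.partition.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0.0) + weight
        return scores


class TopNode:
    """Accepts a query, sends it to every leaf, and merges the results."""

    def __init__(self, leaves):
        self.leaves = leaves

    def search(self, terms, k=10):
        with ThreadPoolExecutor(max_workers=len(self.leaves)) as pool:
            partials = list(pool.map(lambda leaf: leaf.search(terms), self.leaves))
        merged = {}
        for partial in partials:
            for doc_id, score in partial.items():
                merged[doc_id] = merged.get(doc_id, 0.0) + score
        # Highest score first; ties broken by document id for determinism.
        return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    # Hypothetical DocId partitioning: each leaf indexes its own documents.
    leaves = [
        LeafNode({"parallel": {0: 1.0}, "query": {0: 0.5}}),
        LeafNode({"query": {1: 2.0}}),
    ]
    print(TopNode(leaves).search(["parallel", "query"]))
```

The same top node works for either partitioning scheme, because it simply adds up whatever partial scores the leaves return; what changes between DocId and TermId partitioning is the content of each leaf's partition.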

P2: Approach
• The paper uses real web collections instead of simulations for its experiments
• The PLIERS system is used on an AP3000 machine with 8 to 12 nodes
• The data comprised BASE1 (1 GB) to BASE10 (10 GB) of the VLC2 collection
• Queries were based on topics 351 to 400 of the TREC-7 ad hoc track
• Title-only and whole-topic queries were used
• DocId and TermId index partitioning were used
• Bottom line: real data instead of simulation

P2: Summary of Results
Within the framework of the experiment:
• DocId partitioning is better than TermId partitioning in a multiprocessor environment
• The TermId approach imposes too much communication overhead between the leaf nodes and the top node, because the final result for a given document depends on the results from each leaf node
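
One way to picture the communication overhead is to count how many per-document score entries the leaves must send to the top node. The Python sketch below does this for a toy index; the postings, the round-robin ownership rule, and the counting model are assumptions for illustration, not measurements from the paper.

```python
# Hypothetical postings: term -> ids of the documents containing it.
POSTINGS = {
    "index": [2, 3],
    "parallel": [0, 1, 2],
    "search": [1, 2, 3],
}


def entries_sent(postings, query, n_leaves, scheme):
    """Count the score entries all leaves send to the top node for one query.

    scheme="term": the leaf owning a term reports a partial score for every
                   document in that term's posting list.
    scheme="doc":  the leaf owning a document reports its complete score once.
    """
    per_leaf = [set() for _ in range(n_leaves)]
    for term_index, term in enumerate(sorted(postings)):
        if term not in query:
            continue
        for doc_id in postings[term]:
            if scheme == "term":
                leaf = term_index % n_leaves
            elif scheme == "doc":
                leaf = doc_id % n_leaves
            else:
                raise ValueError("scheme must be 'term' or 'doc'")
            per_leaf[leaf].add(doc_id)
    return sum(len(docs) for docs in per_leaf)


if __name__ == "__main__":
    query = ["index", "parallel", "search"]
    print(entries_sent(POSTINGS, query, 3, "term"))
    print(entries_sent(POSTINGS, query, 3, "doc"))
```

For a multi-term query, a document that matches several terms is reported by several leaves under TermId partitioning and must be combined at the top node, whereas under DocId partitioning each matching document is reported once with its final score.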

Comparison [slide content not included in transcript]

Conclusion
In combination, these two papers highlight the issues of processor and I/O utilization, in the context of the factors that affect partitioning inverted indexes under the DocumentId and TermId schemes

URL Links to Papers
Paper 1: Inverted file partitioning schemes in multiple disk systems. Byeong-Soo Jeong; Omiecinski, E. IEEE Transactions on Parallel and Distributed Systems, Volume 6, Issue 2, Feb 1995. http://ieeexplore.ieee.org/iel4/71/8001/00342125.pdf?isNumber=8001&prod=IEEE+JNL&arnumber=342125&arSt=142&ared=153&arAuthor=Byeong-Soo+Jeong%3B+Omiecinski%2C+E.%3B
Paper 2: Parallel search using partitioned inverted files. MacFarlane, A.; McCann, J.A.; Robertson, S.E. Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000), 2000. http://ieeexplore.ieee.org/iel5/7055/19010/00878197.pdf?isNumber=19010&prod=IEEE+CNF&arnumber=878197&arSt=209&ared=220&arAuthor=MacFarlane%2C+A.%3B+McCann%2C+J.A.%3B+Robertson%2C+S.E.%3B
