Papers on Parallel IR
Agenda
• Introduction
• Paper 1: Inverted file partitioning schemes in multiple disk systems
• Paper 2: Parallel search using partitioned inverted files
• Comparison
• Conclusion
• URL Links to Papers
Parallel IR
Parallel IR Introduction
Parallelism in query processing involves:
• Multitasking simultaneous queries
    • A thread or process for each user query, which can execute on a CPU
    • The same thread or process completes an entire single query
    • Ability to handle multiple concurrent queries
• Query partitioning
    • A single query is broken into subtasks
    • Each subtask can run in parallel
    • Improves the response time of a single query
Partitioning a Query into Subtasks
• IR involves dealing with large amounts of data, so the data set can be partitioned between subtasks
• Document partitioning
    • Divides the documents among subtasks, so that each subtask processes a subset of the documents
• Term partitioning
    • Divides the indexing terms among subtasks, so that the processing of each document is spread across subtasks
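The two partitioning styles can be sketched on a toy inverted index. This is only an illustration: the index contents, the function names, and the modulo assignment rule are assumptions, not taken from either paper.

```python
from collections import defaultdict

# Toy inverted index: term -> list of (doc_id, weight) postings.
# The data and all names below are illustrative assumptions.
INDEX = {
    "parallel": [(1, 0.5), (2, 0.3), (4, 0.9)],
    "disk":     [(2, 0.7), (3, 0.4)],
    "query":    [(1, 0.2), (3, 0.6), (4, 0.1)],
}

def partition_by_document(index, n_parts):
    """Each partition sees every term, but only postings for its own documents."""
    parts = [defaultdict(list) for _ in range(n_parts)]
    for term, postings in index.items():
        for doc_id, weight in postings:
            parts[doc_id % n_parts][term].append((doc_id, weight))
    return [dict(p) for p in parts]

def partition_by_term(index, n_parts):
    """Each partition holds the complete posting list for a subset of the terms."""
    parts = [{} for _ in range(n_parts)]
    for i, term in enumerate(sorted(index)):
        parts[i % n_parts][term] = list(index[term])
    return parts

doc_parts = partition_by_document(INDEX, 2)
term_parts = partition_by_term(INDEX, 2)
```

With document partitioning every subtask may hold a slice of each posting list; with term partitioning a posting list lives whole in exactly one partition.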
Theme of the Papers Being Presented
• Both papers explore the issues and performance implications of parallel IR systems that use inverted indexes with either:
    • A) Document partitioning
    • B) Index term partitioning
• Paper 1: Inverted file partitioning schemes in multiple disk systems
• Paper 2: Parallel search using partitioned inverted files
P1: Inverted File Systems
• An inverted file system consists of:
    • Index file: an ordered list of all keywords used to index a collection of documents. Each term carries fields giving the location and the number of its postings in the posting file
    • Posting file: a group of records, each holding the weight of the term and a pointer to the actual document in the document file
    • Document file: contains the actual document records of the collection
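The three files can be pictured with a minimal in-memory sketch. The record layouts, field names, and sample data are assumptions for illustration only; the paper's on-disk format is not described in these slides.

```python
import bisect
from dataclasses import dataclass

@dataclass
class IndexEntry:
    term: str
    offset: int   # position of the term's first record in the posting file
    count: int    # number of postings for the term

@dataclass
class Posting:
    weight: float
    doc_ptr: int  # position of the document in the document file

# Hypothetical sample collection.
DOCUMENT_FILE = ["doc about disks", "doc about queries", "doc about disks and queries"]
POSTING_FILE = [
    Posting(0.8, 0), Posting(0.4, 2),   # postings for "disk"
    Posting(0.6, 1), Posting(0.5, 2),   # postings for "query"
]
INDEX_FILE = [IndexEntry("disk", 0, 2), IndexEntry("query", 2, 2)]  # ordered by term

def lookup(term):
    """Index file -> posting file -> document file."""
    keys = [entry.term for entry in INDEX_FILE]
    i = bisect.bisect_left(keys, term)          # binary search on the ordered index
    if i == len(keys) or keys[i] != term:
        return []
    entry = INDEX_FILE[i]
    postings = POSTING_FILE[entry.offset:entry.offset + entry.count]
    return [(p.weight, DOCUMENT_FILE[p.doc_ptr]) for p in postings]

print(lookup("query"))
# [(0.6, 'doc about queries'), (0.5, 'doc about disks and queries')]
```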
P1: Inverted File Systems (cont.)
P1: Load Balancing
In a multiple-CPU, multiple-disk system we need to:
• Balance the load on the processors
    • Maximize CPU utilization
• Balance the load on the I/O devices, i.e. the disk drives
    • Avoid I/O bottlenecks, which cause the CPUs to enter wait states
P1: Partitioning an Inverted File
The paper explores two schemes:
• Based on Term Id
• Based on Document Id
• In both schemes the index file and the document file are partitioned the same way: the index file by index term id and the document file by document id
• The posting file carries both the document id and the index term id. One scheme partitions the posting file by Term Id, while the other partitions it by Document Id.
P1: Partitioning an Inverted File (cont.)
P1: Objective of Partitioning the Inverted Index
• Objective: to maximize performance
• Ideal: all I/O channels and disk drives are used equally when the subtasks of a query are executed in parallel
• However, data usage is dynamic from query to query and cannot be predicted, so the ideal cannot be achieved
• The paper recognizes that I/O is a major cost factor in IR
A Brief Comparison
A Brief Comparison (cont.)
• The main difference is the I/O characteristic: a subtask for a single query index term spreads its disk I/O across multiple disks under DocumentId partitioning, while under TermId partitioning its I/O is limited to one disk.
• Which is better? It is a tradeoff.
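A small sketch of that I/O difference, assuming (purely for illustration) that postings are placed on disks by `doc_id % n_disks` under DocumentId partitioning and by `term_id % n_disks` under TermId partitioning. The function names and sample ids are hypothetical.

```python
def disks_touched_docid(posting_doc_ids, n_disks):
    """DocumentId partitioning: a term's postings are spread by document id."""
    return sorted({doc_id % n_disks for doc_id in posting_doc_ids})

def disks_touched_termid(term_id, n_disks):
    """TermId partitioning: the whole posting list sits on one disk."""
    return [term_id % n_disks]

posting_doc_ids = [3, 8, 13, 21, 34, 42]   # documents containing the query term
term_id = 7
n_disks = 4

print(disks_touched_docid(posting_doc_ids, n_disks))  # [0, 1, 2, 3]
print(disks_touched_termid(term_id, n_disks))         # [3]
```

One term lookup fans out to every disk in the first case and to a single disk in the second, which is the tradeoff between parallel transfer and extra disk accesses.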
P1: Simulation Model
• To compare the two schemes, the paper defines a simulation model with the following factors:
a) Collection Database Model: follows the natural-language text distribution given by Zipf's law, in which 20% of the index terms account for 80% of the posting entries. The model skews these ratios to observe the effect on query performance
b) User Query Model: the paper used two cases, a skewed query model, with some terms of low rank frequently requested, and a uniform query model, with all terms having the same probability
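The 80/20 shape of a Zipf-like collection can be checked numerically. The sketch below assumes posting counts proportional to 1/rank^s; the exponent `s`, the collection size, and the function names are assumptions, not the paper's actual parameters.

```python
def zipf_posting_counts(n_terms, total_postings, s=1.0):
    """Posting count per term, proportional to 1 / rank**s (rank 1 = most frequent)."""
    weights = [1.0 / rank ** s for rank in range(1, n_terms + 1)]
    scale = total_postings / sum(weights)
    return [w * scale for w in weights]

def share_of_top_terms(counts, fraction=0.2):
    """Fraction of all postings held by the most frequent `fraction` of terms."""
    k = round(len(counts) * fraction)
    top = sorted(counts, reverse=True)[:k]
    return sum(top) / sum(counts)

counts = zipf_posting_counts(n_terms=1000, total_postings=1_000_000)
share = share_of_top_terms(counts)
print(f"top 20% of terms hold {share:.0%} of postings")
```

With 1000 terms and `s=1.0` the share comes out near 0.8. Raising `s` increases the skew, and `s=0` gives a uniform collection where the top 20% of terms hold exactly 20% of the postings.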
P1: Simulation Model (cont.)
c) Queuing Model: concurrent I/O requests on the same device are queued in priority order. CPU usage requests on the same CPU are also queued
d) Work Load Model: vary the number of disks and CPUs
Simulation Results
• Increasing the number of disks, up to a threshold, improves performance by decreasing the response time
• When the index term and query term distributions are not skewed, the partitioning scheme based on term id performed best
• When the data was skewed, the partitioning scheme based on document id performed best. With skewed data (80/20) and TermId partitioning, the disks holding that 20% of the terms become bottlenecks
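The direction of these results can be reproduced with a deliberately crude cost model. It is not the paper's simulator: the fixed per-access cost `SEEK`, the posting-list lengths, and the assumption that every term is queried once are all invented for illustration.

```python
SEEK = 50  # assumed fixed cost per disk access, in arbitrary units

def busiest_disk(lengths, n_disks, scheme):
    """Cost on the most loaded disk when every term is looked up once."""
    load = [0.0] * n_disks
    for term_id, length in enumerate(lengths):
        if scheme == "term":
            # Whole posting list on one disk: one access, full transfer.
            load[term_id % n_disks] += SEEK + length
        else:
            # "doc": list split evenly, so every disk pays an access.
            for d in range(n_disks):
                load[d] += SEEK + length / n_disks
    return max(load)

uniform = [100] * 40                                  # equal-length posting lists
skewed = [4000 // rank for rank in range(1, 41)]      # Zipf-like posting lists

for name, lengths in [("uniform", uniform), ("skewed", skewed)]:
    term_cost = busiest_disk(lengths, 4, "term")
    doc_cost = busiest_disk(lengths, 4, "doc")
    print(name, "term:", term_cost, "doc:", doc_cost)
```

In this toy model the term-based scheme has the cheaper busiest disk for the uniform collection, because it avoids the extra accesses, while the document-based scheme wins on the skewed collection, because the disk holding the longest posting list becomes the bottleneck.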
Paper 2 - Positioning w.r.t. Paper 1
• The thrust of Paper 1's approach was to partition user queries by index term, with each index term query becoming a subtask. The objective then became to optimize the individual subtask with the biggest I/O bottleneck
• What if a user query has only one index term? Your disks are optimized, but your CPUs are idle
• Paper 2 recognizes that most user queries are single-term only. Why?
P2: Search Topology Framework
• P2 proposes a different framework:
P2: Search Topology (cont.)
• Top node: accepts a query from the client, distributes it to all of its child nodes, and awaits the results.
• Leaf node: looks after only ONE PARTITION of the inverted file.
Each leaf node and the top node have one processor each. Within this framework the paper's objective is to evaluate which type of inverted index partitioning is better: DocId or TermId based.
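A minimal sketch of this top node / leaf node arrangement with DocId partitions. Threads stand in for the separate processors, and the class names, scoring rule (sum of term weights), and data are assumptions; this is not the PLIERS interface.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

class LeafNode:
    """Looks after one partition: term -> [(doc_id, weight)] for its own documents."""
    def __init__(self, partition):
        self.partition = partition

    def search(self, terms):
        scores = {}
        for term in terms:
            for doc_id, weight in self.partition.get(term, []):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight
        return scores

class TopNode:
    """Sends the query to every leaf, waits for the results, and merges them."""
    def __init__(self, leaves):
        self.leaves = leaves

    def search(self, terms, k=3):
        with ThreadPoolExecutor(max_workers=len(self.leaves)) as pool:
            partials = list(pool.map(lambda leaf: leaf.search(terms), self.leaves))
        merged = {}
        for part in partials:
            merged.update(part)  # DocId partitions are disjoint, so no keys collide
        return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])

leaves = [
    LeafNode({"parallel": [(2, 0.25), (4, 0.75)], "disk": [(2, 0.75)]}),  # even doc ids
    LeafNode({"parallel": [(1, 0.5)], "disk": [(3, 0.25)]}),              # odd doc ids
]
top = TopNode(leaves)
print(top.search(["parallel", "disk"]))  # [(2, 1.0), (4, 0.75), (1, 0.5)]
```

Because each document lives on exactly one leaf, a leaf's score for a document is already final and the top node only has to pick the best entries.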
P2: Approach
• The paper uses real web collections instead of simulations for its experiments
• The PLIERS system is used on an 8- to 12-node AP3000 machine
• The data used comprised BASE1 (1 GB) to BASE10 (10 GB) of the VLC2 collection
• Queries were based on topics 351 to 400 of the TREC-7 ad hoc track
• Title-only and whole-topic queries were used
• Both DocId and TermId index partitioning were used
• Bottom line: real data instead of simulation
P2: Summary of Results
Within the framework of the experiment:
• DocId partitioning is better than TermId partitioning in a multiprocessor environment
• The TermId approach imposes too much communication overhead between the leaves and the top node, because the final result for a given document depends on the results from each leaf node
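Why the final result depends on every leaf under TermId partitioning can be seen in a two-leaf sketch. The per-leaf scores, the additive scoring rule, and the function name are assumptions made for illustration.

```python
from collections import Counter

def merge_termid(partials):
    """Sum the per-document partial scores returned by every leaf."""
    total = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds values for matching doc ids
    return dict(total)

# Hypothetical partial scores: each leaf owns one query term's posting list.
leaf_parallel = {1: 0.5, 2: 0.25, 4: 0.75}   # scores from term "parallel"
leaf_disk = {2: 0.75, 3: 0.25}               # scores from term "disk"

merged = merge_termid([leaf_parallel, leaf_disk])
best_overall = max(merged, key=merged.get)
best_on_first_leaf = max(leaf_parallel, key=leaf_parallel.get)
print(best_overall, best_on_first_leaf)  # 2 4
```

No single leaf can rank document 2 correctly on its own, so each leaf has to ship its partial scores to the top node before any ranking is possible.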
Comparison
Conclusion
Taken together, these two papers highlight the issues of processor and I/O utilization in the context of the factors that affect the partitioning of inverted indexes under the DocumentId and TermId schemes.
URL Links to Papers
Paper 1: Inverted file partitioning schemes in multiple disk systems. Jeong, B.-S.; Omiecinski, E. IEEE Transactions on Parallel and Distributed Systems, Volume 6, Issue 2, Feb 1995.
http://ieeexplore.ieee.org/iel4/71/8001/00342125.pdf?isNumber=8001&prod=IEEE+JNL&arnumber=342125&arSt=142&ared=153&arAuthor=Byeong-Soo+Jeong%3B+Omiecinski%2C+E.%3B
Paper 2: Parallel search using partitioned inverted files. MacFarlane, A.; McCann, J.A.; Robertson, S.E. Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000), 2000.
http://ieeexplore.ieee.org/iel5/7055/19010/00878197.pdf?isNumber=19010&prod=IEEE+CNF&arnumber=878197&arSt=209&ared=220&arAuthor=MacFarlane%2C+A.%3B+McCann%2C+J.A.%3B+Robertson%2C+S.E.%3B