Alexander Gelbukh Gelbukh

Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 7 (book chapter 9): Parallel and Distributed IR. Alexander Gelbukh, www.Gelbukh.com

Previous Chapter: Conclusions • How to accelerate search? Same results as sequential • Ideas: • Quick-and-dirty rejection of bad objects, 100% recall • Fast data structure for search (based on clustering) • Careful check of all found candidates • Solution: mapping into fewer-D feature space • Condition: lower-bounding of the distance • Assumption: skewed spectrum distribution • Few coefficients concentrate energy, rest are less important

Previous Chapter: Research topics • Object detection (pattern and image recognition) • Automatic feature selection • Spatial indexing data structures (more than 1D) • New types of data • What features to select? How to determine them? • Mixed-type data (e.g., webpages, or images with sound and description) • What clustering/IR methods are better suited for what features? (What features for what methods?) • Similar methods in data mining, ...

The problem • Very large document collections • Google: 4,000,000,000 pages • Slow response? • Solution: parallel computing • Google: 10,000 computers

Parallel architectures

MIMD architecture • The most common • Can be • tightly coupled • loosely coupled • Distributed • Many computers interacting via network • PC Clusters • Similar to MIMD computers, but greater cost of communication • very loosely coupled • More coarse-grained programs

Performance improvement Time: speedup S • Ideally, N times (number of processors) • In practice impossible • The problem does not decompose into N equal parts • Communication and control overhead • S ≤ 1 / f, where f is the fraction of the problem that must be computed sequentially (Amdahl's law) Cost • Efficiency (speedup per processor): S / N
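The speedup bound on this slide can be explored numerically. The following is a minimal sketch of Amdahl's law, assuming f denotes the fraction of the work that must run sequentially; the function names `amdahl_speedup` and `efficiency` are illustrative and do not come from the lecture.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup with n processors when a fraction f is sequential."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    # The sequential part takes f; the rest (1 - f) is divided among n processors.
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup obtained per processor, S / N."""
    return speedup / n


if __name__ == "__main__":
    # With 5% sequential work the speedup stays below 1 / 0.05 = 20,
    # no matter how many processors are added.
    for n in (1, 10, 100, 10_000):
        s = amdahl_speedup(0.05, n)
        print(f"N={n:>6}  S={s:6.2f}  S/N={efficiency(s, n):.4f}")
```

The loop shows the point of the slide: past a certain N, extra processors mostly lower S / N rather than raise S.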

Two approaches to parallelism • Build new algorithms • E.g., neural nets • Naturally parallel • Problem: to define the retrieval task • Adapt the existing techniques to parallelism • Allows relying on well-studied approaches • We will consider this option

Ways to use parallelism • Multitasking • N search engines • Good for processing many queries Problems: • A single query is not sped up • Bottleneck: disk access (index) • Possible solution: replicating (part of) data. RAIDs • Parallel algorithms • IR = data. Main question: how to partition the data • Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)

Possible partitionings • Horizontal: document partitioning. Union of results • Vertical: term partitioning. Basically, intersect results
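The two partitionings can be sketched on a toy document/term matrix. This is an illustration under stated assumptions, not the book's algorithm: the data, the round-robin assignment of rows and columns to processors, and the function names are all hypothetical, and the "processors" are simulated by a loop.

```python
# Rows are documents, columns are terms; 1 means the term occurs in the document.
TERMS = ["distributed", "index", "parallel", "retrieval"]
MATRIX = [
    [0, 0, 1, 1],  # doc 0
    [1, 0, 0, 1],  # doc 1
    [0, 1, 1, 0],  # doc 2
    [1, 1, 0, 1],  # doc 3
]


def and_query_horizontal(matrix, terms, query, n_procs):
    """Document partitioning: each processor holds all terms for some rows."""
    cols = [terms.index(t) for t in query]
    result = set()
    for p in range(n_procs):
        rows = range(p, len(matrix), n_procs)  # rows owned by processor p
        local = {d for d in rows if all(matrix[d][c] for c in cols)}
        result |= local  # union of the partial results
    return sorted(result)


def and_query_vertical(matrix, terms, query, n_procs):
    """Term partitioning: each processor holds all rows for some columns."""
    cols = [terms.index(t) for t in query]
    result = set(range(len(matrix)))
    for p in range(n_procs):
        owned = [c for c in cols if c % n_procs == p]  # query terms owned by p
        for c in owned:
            local = {d for d in range(len(matrix)) if matrix[d][c]}
            result &= local  # intersect the partial results
    return sorted(result)


if __name__ == "__main__":
    q = ["distributed", "retrieval"]
    print(and_query_horizontal(MATRIX, TERMS, q, 2))
    print(and_query_vertical(MATRIX, TERMS, q, 2))
```

Both functions return the same documents; they differ only in which slice of the matrix each processor touches and in how the partial answers are combined.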

Inverted files: Logical partitioning • Logical vs. physical document partitioning • Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion

Inverted files: Logical partitioning Construction and updating • Also parallel Construction • Assign docs to processors • Order docs such that each processor has an interval • Process in parallel • Merge. Each piece is ordered already
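The construction steps listed on this slide can be sketched as follows. This is a minimal sketch, assuming documents are plain strings identified by their position in a list; the names `split_into_intervals`, `index_interval`, and `build_parallel` are hypothetical. A thread pool stands in for the processors to show the structure of the computation only; it is not meant as a performance claim.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_intervals(n_docs, n_procs):
    """Contiguous doc-id intervals [start, end), one per processor."""
    base, extra = divmod(n_docs, n_procs)
    bounds, start = [], 0
    for p in range(n_procs):
        end = start + base + (1 if p < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def index_interval(docs, start, end):
    """Local inverted index for one interval: term -> ascending doc ids."""
    local = {}
    for doc_id in range(start, end):
        for term in sorted(set(docs[doc_id].split())):
            local.setdefault(term, []).append(doc_id)
    return local


def build_parallel(docs, n_procs):
    """Index each interval in parallel, then merge the ordered pieces."""
    bounds = split_into_intervals(len(docs), n_procs)
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        # pool.map returns results in the same order as the intervals.
        parts = list(pool.map(lambda b: index_interval(docs, *b), bounds))
    merged = {}
    for local in parts:
        for term, postings in local.items():
            # Intervals are disjoint and increasing, so appending keeps order.
            merged.setdefault(term, []).extend(postings)
    return merged
```

Because each processor owns an interval of document ids, the merge is a plain concatenation per term, and the boundaries between pieces are exactly the per-processor pointers that logical partitioning needs.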

Inverted files: Physical document partitioning • Several separate collections, one per processor • Separate indices • Then the lists are merged (they are already ordered) • Priority queue is used • The result is not sorted; insertion is quick • The maximal element can be found quickly • First k elements can be found rather quickly • Details in the book • Consistent scores are needed • Global statistics are needed. Can be computed at index time
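The priority-queue merge can be sketched with Python's `heapq` module. This is a sketch under assumptions: each processor is taken to return `(score, doc_id)` pairs, and the scores are assumed to be comparable across processors (that is, already computed with global statistics). The function name `top_k` and the sample data are illustrative.

```python
import heapq


def top_k(partial_results, k):
    """Merge per-processor hit lists and return the k best hits.

    partial_results: one list of (score, doc_id) pairs per processor.
    Returns (score, processor_id, doc_id) tuples, best first.
    """
    heap = []
    for proc_id, hits in enumerate(partial_results):
        for score, doc_id in hits:
            # heapq is a min-heap, so store the negated score to pop the maximum.
            heapq.heappush(heap, (-score, proc_id, doc_id))
    best = []
    while heap and len(best) < k:
        neg_score, proc_id, doc_id = heapq.heappop(heap)
        best.append((-neg_score, proc_id, doc_id))
    return best


if __name__ == "__main__":
    results = [
        [(0.9, "d3"), (0.4, "d1")],               # processor 0
        [(0.8, "d7"), (0.5, "d5"), (0.1, "d6")],  # processor 1
    ]
    print(top_k(results, 3))
```

The heap is never fully sorted: each insertion costs O(log n), and only the k hits actually shown to the user are extracted.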

Logical or physical partitioning? • Logical requires less communication • Faster • Physical is more flexible. Simpler implementation • Simpler conversion of existing systems

Inverted files: Term partitioning • Each processor processes a part of the inverted file • The results are intersected (for AND) • (or combined as appropriate for the other Boolean operations, OR and NOT) • When term distribution in user queries is skewed, document partitioning is better • When uniform, term partitioning is better • Faster by a factor of 2 for long queries, 5 to 10 for short (Web-like) queries
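Combining results across term partitions can be sketched as set operations. This is a minimal illustration, assuming each term's whole postings list lives on exactly one processor; the `owner` rule (sum of the term's bytes modulo the number of processors) and all function names are hypothetical choices made for the example.

```python
def owner(term, n_procs):
    """Hypothetical rule assigning each term to one processor."""
    return sum(term.encode("utf-8")) % n_procs


def partition_by_term(index, n_procs):
    """Split an inverted index so each processor holds some terms entirely."""
    parts = [{} for _ in range(n_procs)]
    for term, postings in index.items():
        parts[owner(term, n_procs)][term] = set(postings)
    return parts


def fetch(parts, term):
    """Ask the owning processor for the postings of one term."""
    return parts[owner(term, len(parts))].get(term, set())


def boolean_and(parts, terms):
    sets = [fetch(parts, t) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(parts, terms):
    return set().union(*(fetch(parts, t) for t in terms))


def boolean_not(parts, term, all_docs):
    return set(all_docs) - fetch(parts, term)


if __name__ == "__main__":
    index = {"parallel": [0, 2], "retrieval": [0, 1, 3], "index": [2, 3]}
    parts = partition_by_term(index, 2)
    print(boolean_and(parts, ["parallel", "retrieval"]))
    print(boolean_or(parts, ["parallel", "index"]))
    print(boolean_not(parts, "index", range(4)))
```

Only the processors that own a query term do any work for that query, which is why the distribution of terms in user queries matters for this scheme.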

Suffix arrays • Array construction can be parallelized • Merges are parallel • Document partitioning is applied straightforwardly • Each processor maintains its own suffix array • Term partitioning can be applied • Each processor owns a branch of the tree (lexicographic interval) • Bottleneck: all processors need access to the entire text

Signature files • Document partitioning: straightforward • Create query signature, distribute to each processor • Merge results (using Boolean operations if needed) • Term partitioning: shorter signatures • Merging and eliminating false drops is slow • This method is not recommended
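Document partitioning for signature files can be sketched as follows. This is an illustration under assumptions: signatures are built by superimposed coding with a 64-bit width and 3 bits per term (both arbitrary choices), SHA-256 is used only as a convenient deterministic hash, and all names are hypothetical. Each "processor" is simulated by a dictionary holding its own documents.

```python
import hashlib

SIG_BITS = 64       # assumed signature width
BITS_PER_TERM = 3   # assumed number of bits set per term


def term_signature(term):
    """Set a few hash-selected bits for one term."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    sig = 0
    for i in range(BITS_PER_TERM):
        sig |= 1 << (digest[i] % SIG_BITS)
    return sig


def signature(terms):
    """Superimpose (OR together) the signatures of all terms."""
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig


def make_partitions(docs, n_procs):
    """Assign documents to processors; each stores (signature, text)."""
    parts = [{} for _ in range(n_procs)]
    for doc_id, text in docs.items():
        parts[doc_id % n_procs][doc_id] = (signature(text.split()), text)
    return parts


def search_partition(partition, query_terms, query_sig):
    """Work done by one processor for an AND query."""
    hits = []
    for doc_id, (doc_sig, text) in partition.items():
        if (doc_sig & query_sig) == query_sig:    # signature says "maybe"
            words = set(text.split())
            if all(t in words for t in query_terms):  # eliminate false drops
                hits.append(doc_id)
    return hits


def search(partitions, query_terms):
    """Create the query signature once, send it to every processor, merge."""
    query_sig = signature(query_terms)
    result = []
    for part in partitions:
        result.extend(search_partition(part, query_terms, query_sig))
    return sorted(result)
```

The signature test can only produce false positives, never false negatives, so checking the candidate texts locally on each processor keeps the merged result exact.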

SIMD computers • Single Instruction, Multiple Data • Uncommon • Good for simple operations • Bit operations in signature files • Details in the book • Ranking is supported in hardware in some computers • If signature file does not fit into memory, can be processed in batches • I/O overhead • Use multiple queries with the same batch • This improves throughput, but not response time

… SIMD computers • Inverted files are difficult to adapt to SIMD • The inverted file is restructured • Details in the book

Distributed IR • MIMD with • Slow communication • Not all nodes are used for a given query • Encryption issues • Document partitioning is usually used • Term partitioning imposes greater communication overhead • Document clustering can be useful (to distribute docs by processors) • Index clusters and then search only the best ones • Another approach: use training queries, then similarity of the user query to these
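The idea of indexing clusters and searching only the best ones can be sketched as below. This is one possible reading, not the method from the book: each node's collection is summarized by a term-count centroid, the query is compared to the centroids by cosine similarity, and only the top m nodes are contacted. The node names, data, and function names are hypothetical.

```python
import math
from collections import Counter


def centroid(docs):
    """Summarize a node's collection as summed term counts."""
    counts = Counter()
    for text in docs:
        counts.update(text.split())
    return counts


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_nodes(node_docs, query, m):
    """Return up to m node names whose centroids best match the query."""
    q = Counter(query.split())
    scored = sorted(
        ((cosine(q, centroid(docs)), name) for name, docs in node_docs.items()),
        reverse=True,
    )
    # Nodes with zero similarity are skipped entirely.
    return [name for score, name in scored[:m] if score > 0]


if __name__ == "__main__":
    nodes = {
        "sports": ["football match score", "tennis match"],
        "tech": ["parallel index retrieval", "distributed retrieval"],
    }
    print(select_nodes(nodes, "distributed retrieval", 1))
```

In a real system the centroids would be computed once at index time and kept at the broker, so that selecting nodes costs no network traffic.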

Research topics • How to evaluate the speedup • New algorithms • Adaptation of existing algorithms • Merging the results is a bottleneck • Meta search engines • Creating large collections with judgements • Is recall important?

Conclusions • Parallel computing can improve • response time for each query and/or • throughput: number of queries processed with the same speed • Document partitioning is simple • good for distributed computing • Term partitioning is good for some data structures • Distributed computing is MIMD computing with slow communication • SIMD machines are good for signature files • Both are out of favor now

Thank you! Till May 17? 18?, 6 pm
