EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES
Utku Irmak, Svilen Mihaylov, Torsten Suel, Samrat Ganguly*, and Rauf Izmailov*
Polytechnic University, Brooklyn
*NEC Laboratories America, Inc., New Jersey

A Motivating Application
Notify the subscriber if an “interesting” document appears on the web.

Problem Definition
• Given a large number of subscriptions (on the order of millions), how can we efficiently match a large number of incoming documents (thousands per second) against all subscriptions?

Challenges
• Scalability and load balancing
• Support for enhanced subscription capabilities
• Automatic resource (RSS) discovery and efficient crawling
• Improved service (a longer history of matches, ranking)

What is RSS?
• Rich Site Summary (version 0.91)
• RDF Site Summary (versions 0.9 and 1.0)
• Really Simple Syndication (version 2.0)
• Provides:
  - Web content (or summaries)
  - Meta-data (TITLE, URL and DESCRIPTION)
• Goals:
  - Web syndication
  - Allow readers to keep track of updates

Comparison to Traditional Search
• Retrospective Search:
  - On a previously crawled file collection
  - Searching the past
  - Collection of files is static
  - Queries are dynamic
• Prospective Search:
  - On newly added or updated files
  - Searching the future
  - Files are dynamic
  - Collection of queries is static

Internal Representations for Efficient Matching
• Use of Inverted Index:
  - Queries are indexed by their terms
  - Reduces the number of queries examined
• Queries, Terms and Documents are represented by unique identifiers (QIDs, TIDs, DIDs)
• Example queries:
  Q1: New York Yankees
  Q2: Yankees Red Sox
  Q3: Boston Red Sox
• Resulting inverted index:
  New: Q1
  York: Q1
  Yankees: Q1 Q2
  Red: Q2 Q3
  Sox: Q2 Q3
  Boston: Q3

Query (Subscription) Types
• AND only: All terms have to appear
• k-out-of-n: At least k (out of all n) terms have to appear
• Boolean: Boolean expression with AND, OR and NOT

For more information please send email to uirmak@cis.poly.edu or suel@poly.edu.
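The inverted-index example on the poster (queries Q1 to Q3 indexed by their terms) can be reproduced with a short Python sketch. This is an illustration only, not the authors' implementation: the function name `build_inverted_index` is hypothetical, terms are lowercased for simplicity, and strings stand in for the numeric TIDs and QIDs the poster describes.

```python
from collections import defaultdict


def build_inverted_index(queries):
    """Map each term to the sorted list of query IDs whose query contains it.

    `queries` maps a query ID to its text. Strings are used here for
    readability (an assumption); the poster uses unique numeric identifiers.
    """
    index = defaultdict(set)
    for qid, text in queries.items():
        for term in text.lower().split():
            index[term].add(qid)
    # Freeze into plain sorted lists so the posting lists are deterministic.
    return {term: sorted(qids) for term, qids in index.items()}


# The three example queries from the poster.
queries = {
    "Q1": "New York Yankees",
    "Q2": "Yankees Red Sox",
    "Q3": "Boston Red Sox",
}

index = build_inverted_index(queries)
for term in ("new", "york", "yankees", "red", "sox", "boston"):
    print(term, index[term])
```

With this structure, a document containing only "yankees" touches the posting list for that one term (Q1 and Q2), so Q3 is never examined.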
Datasets and Experimental Evaluations
• Subscriptions: Query logs from excite.com
• Documents: Crawled & parsed web pages
• Evaluation: Throughput with various numbers of subscriptions

A Primitive Matching Algorithm (AND only)
• For each TID in the document
  - Find queries that contain TID (using inverted index)
  - Maintain a counter (for each query returned)
• There is a match if (counter == query size)

Opt 2: Use of Bloom Filters
• Bloom Filter: A probabilistic, space-efficient method for membership queries
• For each new item, set the corresponding bit to 1
• False negatives are guaranteed not to occur
Advantage: Reduced cost of maintaining the accumulators

Opt 3: Partitioning the Queries
• Create multiple smaller inverted indexes
• Repeat the matching algorithm
Advantage: Better locality (in the processor cache)

A Clustering Approach
• Queries usually have common terms and some are contained by others
• If a query is already evaluated on a document, contained queries can be answered very efficiently
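The primitive AND-only matching algorithm described on the poster (one counter per query, match when the counter equals the query size) might look roughly like the Python sketch below. The names `build_index` and `match_and_only` are hypothetical, and terms are plain strings rather than TIDs. The sketch also assumes each distinct document term is counted once, which the poster does not state explicitly.

```python
from collections import defaultdict


def build_index(queries):
    """Return (inverted index, query sizes) for a dict of qid -> list of terms."""
    index = defaultdict(list)
    sizes = {}
    for qid, terms in queries.items():
        unique_terms = set(terms)
        sizes[qid] = len(unique_terms)  # number of distinct terms in the query
        for term in unique_terms:
            index[term].append(qid)
    return index, sizes


def match_and_only(doc_terms, index, sizes):
    """Return the sorted IDs of AND-only queries fully contained in the document."""
    counters = defaultdict(int)
    # Count each distinct document term once (assumption), so a repeated
    # term cannot inflate a query's counter.
    for term in set(doc_terms):
        for qid in index.get(term, ()):
            counters[qid] += 1
    # A query matches when every one of its terms was seen in the document.
    return sorted(qid for qid, count in counters.items() if count == sizes[qid])


queries = {
    "Q1": ["new", "york", "yankees"],
    "Q2": ["yankees", "red", "sox"],
    "Q3": ["boston", "red", "sox"],
}
index, sizes = build_index(queries)

doc = "the boston red sox beat the yankees".split()
print(match_and_only(doc, index, sizes))  # ['Q2', 'Q3']
```

Q1 is reached through the term "yankees" but its counter stops at 1 of 3, so it is not reported. A k-out-of-n subscription could plausibly reuse the same counters by testing `count >= k` instead of equality with the query size.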