To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Panos Ipeirotis (New York University), Eugene Agichtein (Microsoft Research), Pranay Jain (Columbia University), Luis Gravano (Columbia University)

Text-Centric Task I: Information Extraction
• Information extraction applications extract structured relations from unstructured text
• Example input: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
• Example task: Disease Outbreaks in The New York Times, using an Information Extraction System (e.g., NYU's Proteus)
• See also: Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan

Other Text-Centric Tasks
• Task II: Database Selection
• Task III: Focused Crawling
Details in the paper

An Abstract View of Text-Centric Tasks (for the rest of the talk)
Text Database → Extraction System
• Retrieve documents from database
• Process documents
• Extract output tokens

Executing a Text-Centric Task
Text Database → Extraction System: retrieve documents from database, process documents, extract output tokens
Two major execution paradigms (similar to the relational world):
• Scan-based: Retrieve and process documents sequentially
• Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results
→ Underlying data distribution dictates what is best
Unlike the relational world:
• Indexes are only "approximate": index is on keywords, not on tokens of interest
• Choice of execution plan affects output completeness (not only speed)

Execution Plan Characteristics
Text Database → Extraction System: retrieve documents from database, process documents, extract output tokens
Execution plans have two main characteristics:
• Execution time
• Recall (fraction of tokens retrieved)
Question: How do we choose the fastest execution plan for reaching a target recall?
Example: "What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"

Outline
• Description and analysis of crawl- and query-based plans
  • Crawl-based: Scan, Filtered Scan
  • Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
• Optimization strategy
• Experimental results and conclusions

Scan
• Scan retrieves and processes documents sequentially (until reaching target recall)
• Execution time = |Retrieved Docs| · (R + P), where R = time for retrieving a document and P = time for processing a document
• Question: How many documents does Scan retrieve to reach target recall?
• Filtered Scan uses a classifier to identify and process only promising documents (details in paper)

Estimating Recall of Scan
Modeling Scan for token t (e.g., <SARS, China>), where g(t) = frequency of token t:
• What is the probability of seeing t (with frequency g(t)) after retrieving S documents?
• A "sampling without replacement" process
• After retrieving S documents, the frequency of token t follows a hypergeometric distribution
• Recall for token t is the probability that the frequency of t in the S retrieved documents is > 0
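
The per-token recall described above can be sketched numerically. Under the hypergeometric model, the probability of missing a token entirely is C(|D| − g(t), S) / C(|D|, S), so the probability of seeing it at least once is one minus that value. The function and parameter names below (`prob_token_seen`, `db_size`, `token_freq`, `docs_retrieved`) are illustrative assumptions, not names from the talk or the paper.

```python
from math import comb


def prob_token_seen(db_size: int, token_freq: int, docs_retrieved: int) -> float:
    """Probability that a token appearing in `token_freq` of the `db_size`
    documents shows up at least once among `docs_retrieved` documents that
    Scan draws without replacement (hypergeometric model)."""
    if not 0 <= token_freq <= db_size:
        raise ValueError("token_freq must be between 0 and db_size")
    if not 0 <= docs_retrieved <= db_size:
        raise ValueError("docs_retrieved must be between 0 and db_size")
    # P(frequency of the token in the retrieved documents == 0)
    p_miss = comb(db_size - token_freq, docs_retrieved) / comb(db_size, docs_retrieved)
    return 1.0 - p_miss
```

For instance, in a hypothetical 10-document database, a token that occurs in a single document has a 50% chance of being seen after Scan retrieves 5 documents.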

Estimating Recall of Scan
Modeling Scan:
• Multiple "sampling without replacement" processes, one for each token (e.g., <SARS, China>, <Ebola, Zaire>)
• Overall recall is the average recall across tokens
→ We can compute the number of documents required to reach target recall
Execution time = |Retrieved Docs| · (R + P)
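
A minimal sketch of this computation, assuming the token frequencies g(t) are known: average the per-token hypergeometric recall, then binary-search for the smallest number of documents S that reaches the target (recall never decreases as S grows). All function names and the toy numbers in the usage note are assumptions for illustration, not values from the paper.

```python
from math import comb
from typing import Sequence


def scan_recall(db_size: int, token_freqs: Sequence[int], docs_retrieved: int) -> float:
    """Expected recall of Scan: average over tokens of P(token seen at least once)."""
    total = comb(db_size, docs_retrieved)
    seen = [1.0 - comb(db_size - g, docs_retrieved) / total for g in token_freqs]
    return sum(seen) / len(seen)


def docs_for_target_recall(db_size: int, token_freqs: Sequence[int],
                           target_recall: float) -> int:
    """Smallest S whose expected recall reaches `target_recall`.
    Assumes every token occurs at least once, so S = db_size gives recall 1."""
    lo, hi = 0, db_size
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(db_size, token_freqs, mid) >= target_recall:
            hi = mid          # mid documents are enough; try fewer
        else:
            lo = mid + 1      # mid documents are not enough
    return lo


def scan_time(docs_retrieved: int, R: float, P: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P)."""
    return docs_retrieved * (R + P)
```

With a hypothetical 10-document database and two tokens of frequency 1 and 2, `docs_for_target_recall(10, [1, 2], 0.5)` gives the number of documents to feed into `scan_time`.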

Outline
• Description and analysis of crawl- and query-based plans
  • Crawl-based: Scan, Filtered Scan
  • Query-based: Iterative Set Expansion, Automatic Query Generation
• Optimization strategy
• Experimental results and conclusions

Iterative Set Expansion
Steps, repeated iteratively:
• Query database with seed tokens (e.g., [Ebola AND Zaire])
• Process retrieved documents
• Extract tokens from docs (e.g., <Malaria, Ethiopia>)
• Augment seed tokens with new tokens
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R = time for retrieving a document, P = time for processing a document, and Q = time for answering a query
Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
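
The loop can be sketched on a toy in-memory database. This is a simplification under stated assumptions: a query for a token is assumed to return exactly the documents whose extracted tokens include it (a real keyword index is only approximate), seed tokens count as already known, and the names `iterative_set_expansion`, `docs`, and `target_tokens` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def iterative_set_expansion(
    docs: Dict[str, Set[str]],   # doc id -> tokens the extraction system finds in it
    seeds: List[str],            # seed tokens used for the first queries
    target_tokens: int,          # stop once this many distinct tokens are known
    R: float, P: float, Q: float,
) -> Tuple[Set[str], int, int, float]:
    known: Set[str] = set(seeds)
    queue = deque(seeds)
    retrieved: Set[str] = set()
    queries = 0
    while queue and len(known) < target_tokens:
        token = queue.popleft()
        queries += 1  # one query per token sent to the database
        for doc_id, doc_tokens in docs.items():
            if token in doc_tokens and doc_id not in retrieved:
                retrieved.add(doc_id)
                # Augment the token set with newly extracted tokens.
                for new_token in sorted(doc_tokens - known):
                    known.add(new_token)
                    queue.append(new_token)
    # Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
    elapsed = len(retrieved) * (R + P) + queries * Q
    return known, len(retrieved), queries, elapsed
```

If the queue empties before `target_tokens` is reached, the function simply returns what it found, which is how the recall limitation of this plan shows up in the sketch.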

Querying Graph
• The querying graph is a bipartite graph, containing tokens and documents
• Each token (transformed to a keyword query) retrieves documents
• Documents contain tokens
[Figure: bipartite querying graph linking tokens t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam> to documents d1 to d5]

Using Querying Graph for Analysis
We need to compute the:
• Number of documents retrieved after sending Q tokens as queries (estimates time)
• Number of tokens that appear in the retrieved documents (estimates recall)
To estimate these we need to compute the:
• Degree distribution of the tokens discovered by retrieving documents
• Degree distribution of the documents retrieved by the tokens
• (Not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees)
Elegant analysis framework based on generating functions (details in the paper)

Recall Limit: Reachability Graph
[Figure: querying graph (tokens t1 to t5, documents d1 to d5) and the corresponding reachability graph over tokens]
• Example edge t1 → t2: t1 retrieves document d1 that contains t2
• Upper recall limit: determined by the size of the biggest connected component
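
As a rough illustration of why reachability bounds recall, the sketch below builds a token-to-token reachability graph from two relations (which documents each token's query retrieves, and which tokens each document contains) and then collects every token reachable from a set of seeds. The edge lists are hypothetical; apart from "t1 retrieves d1, which contains t2", the edges of the figure are not recoverable from the slide. The sketch computes reachability from given seeds, which is a simplification of the connected-component argument.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy relations (assumed for illustration only).
RETRIEVES: Dict[str, List[str]] = {
    "t1": ["d1"], "t2": ["d2"], "t3": ["d2"], "t4": ["d4"], "t5": ["d5"],
}
CONTAINS: Dict[str, List[str]] = {
    "d1": ["t1", "t2"], "d2": ["t2", "t3"], "d4": ["t4", "t5"], "d5": ["t5"],
}


def build_reachability_graph(retrieves: Dict[str, List[str]],
                             contains: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Edge t -> t' exists when t retrieves some document that contains t'."""
    graph: Dict[str, Set[str]] = {}
    for token, docs in retrieves.items():
        graph[token] = {
            other for d in docs for other in contains.get(d, []) if other != token
        }
    return graph


def reachable_tokens(graph: Dict[str, Set[str]], seeds: List[str]) -> Set[str]:
    """Breadth-first search over the reachability graph, starting from the seeds."""
    seen: Set[str] = set(seeds)
    queue = deque(seeds)
    while queue:
        token = queue.popleft()
        for nxt in graph.get(token, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Dividing the number of reachable tokens by the total number of tokens gives an upper bound on the recall that an iterative, seed-driven plan can reach on this toy graph.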

Automatic Query Generation
• Iterative Set Expansion has a recall limitation due to the iterative nature of query generation
• Automatic Query Generation avoids this problem by creating queries offline (using machine learning), which are designed to return documents with tokens
Details in the paper

Outline
• Description and analysis of crawl- and query-based plans
• Optimization strategy
• Experimental results and conclusions

Summary of Cost Analysis
• Our analysis so far:
  • Takes as input a target recall
  • Gives as output the time for each plan to reach target recall (time = infinity, if plan cannot reach target recall)
• Time and recall depend on task-specific properties of database:
  • Token degree distribution
  • Document degree distribution
• Next, we show how to estimate degree distributions on-the-fly
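
Given such per-plan time predictions, picking a plan reduces to taking the minimum over the plans that can reach the target at all. The sketch below assumes the predictions are already available as a dictionary; the function name `choose_fastest_plan` and any numbers passed to it are hypothetical, not results from the paper.

```python
import math
from typing import Dict, Optional


def choose_fastest_plan(predicted_times: Dict[str, float]) -> Optional[str]:
    """Return the plan with the smallest predicted time for the target recall.
    A predicted time of infinity means the plan cannot reach the target."""
    feasible = {plan: t for plan, t in predicted_times.items() if not math.isinf(t)}
    if not feasible:
        return None  # no plan is predicted to reach the target recall
    return min(feasible, key=feasible.get)
```

Representing "cannot reach target recall" as `math.inf` keeps the comparison uniform: an infeasible plan never wins the minimum.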

Estimating Cost Model Parameters
• Token and document degree distributions belong to known distribution families
• Can characterize distributions with only a few parameters!

Parameter Estimation
• Naïve solution for parameter estimation:
  • Start with separate, "parameter-estimation" phase
  • Perform random sampling on database
  • Stop when cross-validation indicates high confidence
• We can do better than this!
  • No need for separate sampling phase
  • Sampling is equivalent to executing the task
→ Piggyback parameter estimation onto execution

On-the-fly Parameter Estimation
• Pick most promising execution plan for target recall assuming "default" parameter values
• Start executing task
• Update parameter estimates during execution
• Switch plan if updated statistics indicate so
Important:
• Only Scan acts as "random sampling"
• All other execution plans need parameter adjustment (see paper)
[Figure: initial, default estimation and successive updated estimations, compared with the correct (but unknown) distribution]
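
The pick, execute, update, switch cycle can be sketched as a small control loop. This is a schematic under strong assumptions: the cost models and the batch executor are passed in as hypothetical callables, parameter estimates are a plain dictionary, and each round re-ranks plans by their predicted time for the full target rather than by remaining cost. None of these names come from the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]
CostModel = Callable[[Params, float], float]  # (params, target recall) -> predicted time


def run_with_reoptimization(
    cost_models: Dict[str, CostModel],
    execute_batch: Callable[[str, Params], Tuple[Params, float]],
    default_params: Params,
    target_recall: float,
    max_batches: int = 100,
) -> List[str]:
    """Return the sequence of plans used, one entry per executed batch."""
    params = dict(default_params)
    plans_used: List[str] = []
    for _ in range(max_batches):
        # Re-rank the plans under the current parameter estimates.
        plan = min(cost_models, key=lambda name: cost_models[name](params, target_recall))
        if math.isinf(cost_models[plan](params, target_recall)):
            break  # no plan is predicted to reach the target recall
        plans_used.append(plan)
        # Executing a batch also yields updated parameter estimates and current recall.
        params, recall = execute_batch(plan, params)
        if recall >= target_recall:
            break
    return plans_used
```

A plan switch shows up as a change between consecutive entries of the returned list, for example when updated estimates make a query-based plan look infeasible and the loop falls back to Scan.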

Outline
• Description and analysis of crawl- and query-based plans
• Optimization strategy
• Experimental results and conclusions

Correctness of Theoretical Analysis
• Solid lines: Actual time
• Dotted lines: Predicted time with correct parameters
Task: Disease Outbreaks, Snowball IE system, 182,531 documents from NYT, 16,921 tokens

Experimental Results (Information Extraction)
• Solid lines: Actual time
• Green line: Time with optimizer
(Results similar in other experiments; see paper)

Conclusions
• Common execution plans for multiple text-centric tasks
• Analytic models for predicting execution time and recall of various crawl- and query-based plans
• Techniques for on-the-fly parameter estimation
• Optimization framework picks on-the-fly the fastest plan for target recall

Future Work
• Incorporate precision and recall of extraction system in framework
• Create non-parametric optimization (i.e., no assumption about distribution families)
• Examine other text-centric tasks and analyze new execution plans
• Create adaptive, "next-K" optimizer

Thank you!
