To search or to crawl?: Towards a query optimizer for text-centric tasks
Presented by Avinash S Bharadwaj
How can data be extracted from the web?
• Execution plans for text-centric tasks follow two general paradigms for processing a text database:
  • The entire web can be crawled or scanned for the text automatically.
  • Search engine indexes can be used to retrieve the documents of interest using carefully constructed, task-dependent queries.
Introduction
• Text is ubiquitous, and many applications rely on the text in web pages to perform a variety of tasks.
• Examples of text-centric tasks:
  • Reputation management systems download web pages to track the buzz around companies.
  • Comparative shopping agents locate e-commerce web sites and add the products offered on their pages to their own index.
Examples of text-centric tasks
• According to the paper, there are three main types of text-centric tasks:
  • Task 1: Information Extraction
  • Task 2: Content Summary Construction
  • Task 3: Focused Resource Discovery
Task 1: Information Extraction
• Information extraction applications extract structured relations from unstructured text.
• Example input: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
• [Figure: the text passes through an Information Extraction System (e.g., NYU’s Proteus) to produce the relation "Disease Outbreaks in The New York Times"]
• See also the Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan.
Task 2: Content Summary Construction
• Many text databases have valuable contents “hidden” behind search interfaces.
• Metasearchers search over multiple databases through a unified query interface.
• To do so, metasearchers rely on a content summary generated for each database.
Task 3: Focused Resource Discovery
• This task considers building applications based on a particular resource.
• The simplest approach is to crawl the entire web and classify the web pages accordingly.
• A much more efficient approach is to use a focused crawler.
• Focused crawlers concentrate on documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as determined by a number of heuristics.
An Abstract View of Text-Centric Tasks
[Figure: a Text Database feeding an Extraction System, which works in this order]
• Retrieve documents from database
• Process documents
• Extract output tokens
Execution Strategies
• The paper describes four execution strategies:
  • Crawl-based: Scan, Filtered Scan
  • Query- or index-based: Iterative Set Expansion, Automatic Query Generation
Execution Strategy: Scan
• Scan processes each document in the database exhaustively until the number of tokens extracted satisfies the target recall.
• Scan does not need training and does not send any queries to the database.
• Execution time = |Retrieved Docs| · (R + P), where R is the time for retrieving a document and P is the time for processing a document.
• Prioritizing the documents may help improve efficiency.
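The Scan loop and its cost formula can be sketched as follows. This is a minimal illustration, not the paper's implementation: the regex-based `extract` function, the sample documents, and the choice to express the target recall as a number of distinct tokens are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable, Set, Tuple

Token = Tuple[str, str]

# Hypothetical extractor: finds "<disease> outbreak in <location>" pairs.
PATTERN = re.compile(r"(\w+) outbreak in (\w+)")


def extract(doc: str) -> Set[Token]:
    return set(PATTERN.findall(doc))


def scan(documents: Iterable[str],
         extract_fn: Callable[[str], Set[Token]],
         target_tokens: int) -> Tuple[Set[Token], int]:
    """Process documents in order until enough distinct tokens are found."""
    tokens: Set[Token] = set()
    retrieved = 0
    for doc in documents:
        if len(tokens) >= target_tokens:
            break
        retrieved += 1
        tokens |= extract_fn(doc)
    return tokens, retrieved


def scan_time(retrieved_docs: int, r: float, p: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P)."""
    return retrieved_docs * (r + p)


docs = [
    "Ebola outbreak in Zaire",
    "stock markets fell",
    "Malaria outbreak in Ethiopia",
    "Cholera outbreak in Sudan",
]
tokens, retrieved = scan(docs, extract, target_tokens=2)
print(tokens, retrieved, scan_time(retrieved, r=1, p=4))
```

With a target of two tokens, the loop stops after the third document, so the fourth is never retrieved or processed.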
Execution Strategy: Filtered Scan
• Filtered Scan is an improvement over the basic Scan methodology.
• Unlike Scan, Filtered Scan uses a task-specific classifier to check whether a document contributes at least one token before processing it.
• Execution time = |Retrieved Docs| · (R + P + C), where R is the time for retrieving a document, C is the time for classifying a document, and P is the time for processing a document.
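A rough sketch of Filtered Scan, under the same toy assumptions as before: `looks_useful` is a hypothetical keyword test standing in for a trained classifier, and the documents are made up. The cost function follows the formula on the slide as written; the sketch also counts how many documents actually reach the extractor.

```python
import re
from typing import Iterable, Set, Tuple

Token = Tuple[str, str]
PATTERN = re.compile(r"(\w+) outbreak in (\w+)")


def extract(doc: str) -> Set[Token]:
    # Hypothetical extractor for "<disease> outbreak in <location>".
    return set(PATTERN.findall(doc))


def looks_useful(doc: str) -> bool:
    # Stand-in for a task-specific classifier (assumption, not a real model).
    return "outbreak" in doc


def filtered_scan(documents: Iterable[str],
                  target_tokens: int) -> Tuple[Set[Token], int, int]:
    """Classify every retrieved document; process only those that pass."""
    tokens: Set[Token] = set()
    retrieved = processed = 0
    for doc in documents:
        if len(tokens) >= target_tokens:
            break
        retrieved += 1
        if looks_useful(doc):
            processed += 1
            tokens |= extract(doc)
    return tokens, retrieved, processed


def filtered_scan_time(retrieved_docs: int, r: float, p: float, c: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P + C), as stated on the slide."""
    return retrieved_docs * (r + p + c)


docs = [
    "Ebola outbreak in Zaire",
    "stock markets fell",
    "Malaria outbreak in Ethiopia",
    "Cholera outbreak in Sudan",
]
tokens, retrieved, processed = filtered_scan(docs, target_tokens=2)
print(tokens, retrieved, processed, filtered_scan_time(retrieved, r=1, p=4, c=1))
```

In this toy run three documents are retrieved and classified, but only two are passed to the extractor, which is where the saving over plain Scan comes from when P is large relative to C.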
Execution Strategy: Iterative Set Expansion
[Figure: a loop between the Text Database and the Extraction System]
• Query generation: query the database with seed tokens (e.g., [Ebola AND Zaire]).
• Process the retrieved documents.
• Extract tokens from the documents (e.g., <Malaria, Ethiopia>).
• Augment the seed tokens with the new tokens, and repeat.
Execution Strategy: Iterative Set Expansion contd…
• Iterative Set Expansion has been successfully applied in many tasks.
• Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P is the time for processing a document, and Q is the time for answering a query.
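The Iterative Set Expansion loop can be illustrated with a small in-memory "database". Everything here is an assumption for the sake of the example: the four documents, the regex extractor, and the `search` function that treats a token as an AND query over its words.

```python
import re
from collections import deque
from typing import List, Sequence, Set, Tuple

Token = Tuple[str, str]
PATTERN = re.compile(r"(\w+) outbreak in (\w+)")

# Toy text database (made-up documents).
DOCS = [
    "Ebola outbreak in Zaire follows Malaria outbreak in Ethiopia",
    "Malaria outbreak in Ethiopia and Cholera outbreak in Sudan",
    "H5N1 outbreak in Vietnam",
    "stock markets fell",
]


def search(token: Token) -> List[int]:
    """AND query: ids of documents that contain every word of the token."""
    return [i for i, doc in enumerate(DOCS)
            if all(word in doc.split() for word in token)]


def extract(doc_id: int) -> Set[Token]:
    return set(PATTERN.findall(DOCS[doc_id]))


def iterative_set_expansion(seeds: Sequence[Token],
                            max_queries: int) -> Tuple[Set[Token], int, int]:
    tokens = set(seeds)
    pending = deque(seeds)        # tokens not yet sent as queries
    retrieved: Set[int] = set()   # documents already processed
    queries = 0
    while pending and queries < max_queries:
        query = pending.popleft()
        queries += 1
        for doc_id in search(query):
            if doc_id in retrieved:
                continue
            retrieved.add(doc_id)
            for token in sorted(extract(doc_id) - tokens):
                tokens.add(token)
                pending.append(token)
    return tokens, len(retrieved), queries


def ise_time(retrieved_docs: int, queries: int, r: float, p: float, q: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q."""
    return retrieved_docs * (r + p) + queries * q


tokens, n_docs, n_queries = iterative_set_expansion([("Ebola", "Zaire")], max_queries=10)
print(sorted(tokens), n_docs, n_queries, ise_time(n_docs, n_queries, r=1, p=4, q=2))
```

Starting from the single seed, the run discovers three tokens with three queries and two retrieved documents. The <H5N1, Vietnam> document is never reached, because no discovered token links to it.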
Execution Strategy: Automatic Query Generation
• Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
• Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents with tokens.
• Automatic Query Generation works in two stages:
  • In the first stage, it trains a classifier to categorize documents as useful or not for the task.
  • In the execution stage, it searches a database using queries that are expected to retrieve useful documents.
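The two stages can be mimicked with a deliberately simple stand-in. The slide does not say how queries are derived from the classifier, so the sketch below substitutes a crude heuristic (keep words that occur only in useful training documents, ranked by how many useful documents contain them). The training set, the database, and both function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def learn_queries(training: Sequence[Tuple[str, bool]], k: int) -> List[str]:
    """Offline stage (stand-in for classifier training): pick up to k query words."""
    useful: Counter = Counter()
    useless: Counter = Counter()
    for text, is_useful in training:
        words = set(text.lower().split())
        (useful if is_useful else useless).update(words)
    # Keep words never seen in useless documents; most frequent first.
    candidates = [w for w in useful if useless[w] == 0]
    candidates.sort(key=lambda w: (-useful[w], w))
    return candidates[:k]


def run_queries(queries: Sequence[str], database: Sequence[str]) -> List[str]:
    """Execution stage: retrieve each matching document once."""
    retrieved: List[str] = []
    for query in queries:
        for doc in database:
            if query in doc.lower().split() and doc not in retrieved:
                retrieved.append(doc)
    return retrieved


training = [
    ("Ebola outbreak in Zaire", True),
    ("Cholera outbreak in Sudan", True),
    ("markets fell in Asia", False),
]
database = [
    "H5N1 outbreak in Vietnam",
    "Malaria outbreak in Ethiopia",
    "stock markets fell",
]
queries = learn_queries(training, k=1)
print(queries, run_queries(queries, database))
```

Because the queries are fixed before execution, documents such as the <H5N1, Vietnam> one can be retrieved even though no previously extracted token mentions them.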
Estimating Execution Plan Costs: Scan
Modeling Scan for a token t (e.g., <SARS, China>), where g(t) is the frequency of token t:
• What is the probability of seeing t (with frequency g(t)) after retrieving S documents?
• Retrieval is a “sampling without replacement” process.
• After retrieving S documents, the frequency of token t follows a hypergeometric distribution.
• Recall for token t is the probability that the frequency of t in the S documents is > 0.
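Under the hypergeometric model, the probability that none of the S sampled documents contains t is C(|D| − g(t), S) / C(|D|, S), so the recall for t is one minus that ratio. The sketch below computes it, assuming g(t) counts the documents that contain t and |D| is the database size; the function and parameter names are illustrative.

```python
from math import comb
from typing import Sequence


def token_recall(db_size: int, freq: int, sample_size: int) -> float:
    """P(token appears in at least one of sample_size docs drawn without replacement)."""
    if not (0 <= freq <= db_size and 0 <= sample_size <= db_size):
        raise ValueError("freq and sample_size must lie in [0, db_size]")
    # Probability that all sampled documents come from the db_size - freq
    # documents that do not contain the token.
    p_miss = comb(db_size - freq, sample_size) / comb(db_size, sample_size)
    return 1.0 - p_miss


def expected_recall(db_size: int, freqs: Sequence[int], sample_size: int) -> float:
    """Average per-token recall over all tokens (expected fraction of tokens seen)."""
    return sum(token_recall(db_size, g, sample_size) for g in freqs) / len(freqs)


# A token occurring in 2 of 10 documents, after scanning 5 of them:
print(token_recall(db_size=10, freq=2, sample_size=5))
```

For the example values the miss probability is 56/252, giving a recall of 7/9, roughly 0.78. Frequent tokens are found early, while tokens with small g(t) need S close to |D|.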
Estimating Execution Plan Costs: Iterative Set Expansion
• The querying graph is a bipartite graph containing tokens and documents.
• Each token (transformed to a keyword query) retrieves documents.
• Documents contain tokens.
[Figure: querying graph linking tokens t1 to t5 with documents d1 to d5; t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam>]
Estimating Execution Plan Costs: Iterative Set Expansion contd…
We need to compute the:
• Number of documents retrieved after sending Q tokens as queries (estimates time)
• Number of tokens that appear in the retrieved documents (estimates recall)
To estimate these, we need to compute the:
• Degree distribution of the tokens discovered by retrieving documents
• Degree distribution of the documents retrieved by the tokens
• (These are not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
[Figure: the querying graph of tokens t1 to t5 and documents d1 to d5]
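The difference between the two kinds of degree distribution can be seen on a small graph. The sketch assumes that a token discovered through a retrieved document is reached by following an edge, so it is picked with probability proportional to its degree. The edge list below is made up, since the edges of the slide's figure are not recoverable.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, Sequence

# Hypothetical querying graph: token -> documents that contain it.
TOKEN_DOCS = {
    "t1": ["d1", "d2", "d3"],
    "t2": ["d1"],
    "t3": ["d3"],
    "t4": ["d4"],
    "t5": ["d5"],
}


def uniform_degree_distribution(degrees: Sequence[int]) -> Dict[int, Fraction]:
    """Degree distribution of a token chosen uniformly at random."""
    counts = Counter(degrees)
    return {d: Fraction(c, len(degrees)) for d, c in counts.items()}


def edge_biased_degree_distribution(degrees: Sequence[int]) -> Dict[int, Fraction]:
    """Degree distribution of a token reached by following a random edge."""
    counts = Counter(degrees)
    total_edges = sum(degrees)
    return {d: Fraction(d * c, total_edges) for d, c in counts.items()}


degrees = [len(docs) for docs in TOKEN_DOCS.values()]
print(uniform_degree_distribution(degrees))      # degree 3 has probability 1/5
print(edge_biased_degree_distribution(degrees))  # degree 3 has probability 3/7
```

In this toy graph the single high-degree token accounts for one fifth of all tokens but three sevenths of all edges, so it is more than twice as likely to be the one discovered.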
Conclusions
• Common execution plans for multiple text-centric tasks
• Analytic models for predicting execution time and recall of various crawl- and query-based plans
• Techniques for on-the-fly parameter estimation
• An optimization framework that picks, on the fly, the fastest plan for a target recall
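The final selection step can be pictured as follows, assuming the analytic models have already produced, for each plan, a predicted execution time and the recall it can reach. The numbers are invented for illustration and do not come from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class PlanEstimate(NamedTuple):
    name: str
    time: float     # predicted execution time (arbitrary units)
    recall: float   # predicted reachable recall, between 0 and 1


def pick_plan(estimates: Sequence[PlanEstimate],
              target_recall: float) -> Optional[PlanEstimate]:
    """Return the fastest plan predicted to reach the target recall, if any."""
    feasible = [e for e in estimates if e.recall >= target_recall]
    return min(feasible, key=lambda e: e.time) if feasible else None


# Made-up estimates for the four execution strategies.
estimates = [
    PlanEstimate("Scan", time=100.0, recall=1.0),
    PlanEstimate("Filtered Scan", time=70.0, recall=0.9),
    PlanEstimate("Iterative Set Expansion", time=40.0, recall=0.6),
    PlanEstimate("Automatic Query Generation", time=55.0, recall=0.8),
]

for target in (0.5, 0.75, 0.95):
    print(target, pick_plan(estimates, target).name)
```

With these numbers a low target favors Iterative Set Expansion, a moderate one favors Automatic Query Generation, and only Scan can reach a target of 0.95.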