Improving Query Results using Answer Corroboration





Presentation Transcript


  1. Improving Query Results using Answer Corroboration
  Amélie Marian, Rutgers University

  2. Motivations
  • Queries on databases traditionally return an exact answer
    • (set of) tuples that match the query exactly
  • Queries in information retrieval traditionally return the best documents containing the answer
    • (list of) documents within which users have to find the relevant information
  • Both query models are insufficient for today’s information needs
  • New models have been used and studied: top-k queries, question answering (QA)
  • But these models consider answers individually (except for some QA systems)

  3. Data Corroboration
  • Data sources cannot be fully trusted
    • Low-quality data (e.g., data integration, user-input data)
    • Web data (anybody can say anything on the web)
  • Non-exact query models
    • Top-k answers are requested
  • Repeated information lends more credence to the quality of the information
    • Aggregate similar information and increase its score

  4. Outline
  • Answer Corroboration for Data Cleaning (joint work with Yannis Kotidis and Divesh Srivastava)
    • Motivations
    • Multiple Join Path Framework
    • Our Approach
    • Experimental Evaluation
  • Answer Corroboration for Web Search
    • Motivations
    • Our Approach
    • Query Interface

  5. Motivating Example
  [Schema diagram: four applications (Sales, Inventory, Ordering, Provisioning) sharing key fields such as TN, CircuitID, BAN, PON, ORN, SubPON, and CustName]
  • TN: Telephone Number
  • ORN: Order Number
  • BAN: Billing Account Number
  • PON: Provisioning Order Number
  • SubPON: Related PON
  • Query: What is the Circuit ID associated with a Telephone Number that appears in SALES?

  6. Motivations
  • Data applications with overlapping features
    • Data integration
    • Web sources
  • Data quality issues (duplicates, nulls, default values, data inconsistencies)
    • Data-entry problems
    • Data integration problems

  7. Contributions
  • Multiple Join Path (MJP) framework
    • Quantifies answer quality
    • Takes corroborating evidence into account
    • Agglomerative scoring of answers
  • Answer computation techniques
    • Designed for MJP scoring methodologies
    • Several output options (top-k, top-few)
  • Experimental evaluation on real data
    • VIP integration platform
    • Quality of answers
    • Efficiency of our techniques

  8. Multiple Join Path Framework: Problem Definition
  • Queries of the form: “Given X = a, find the value of Y”
  • Examples:
    • Given the telephone number of a customer, find the ID of the circuit to which the telephone line is attached (one answer expected)
    • Given a circuit ID, find the names of the customers whose telephones are attached to that circuit (possibly several answers)

  9. Schema Graph
  • Directed acyclic graph
    • Nodes are field names
    • Intra-application edges link fields in the same application
    • Inter-application edges link fields across applications
  • All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications

  10. Data Graph
  • Given a specific value of the source node X, what are the values of the sink node Y?
  • Considers all join paths from X to Y in the schema graph
  [Data graph figure: join paths instantiated from the value of X; one branch dead-ends with no corresponding SALES.BAN]
  • Example: two paths lead to answer c1

  11. Scoring Answers
  • Which are the correct values?
    • Unclean data
    • No a priori knowledge
  • Technique to score data edges
    • What is the probability that the association the edge makes between two field values is correct?
  • Probabilistic interpretation of data edge scores to score full join paths
    • Edge score aggregation
    • Independent of the length of the path

  12. Scoring Data Edges
  • Rely on functional dependencies (we are considering fields that are keys), e.g., A → B and B → A for fields A and B within the same application
  • Data edge scores model the error in the data
  • Intra-application edge A → B: querying the application with value a instantiates values b1, …, bn for B; the functional dependency implies only one of them can be correct, so each edge (a, bi) is scored with the probability that it is the correct one, e.g., 1/n when all are equally likely (and symmetrically for B → A); see the sketch below
  • Inter-application edge: score equals 1, unless approximate matching is used
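The edge-score formula itself did not survive the transcript; the minimal Python sketch below implements the natural reading of the slide: under the functional dependency A → B, a probe with value a that instantiates n candidate values b1, …, bn assigns each edge (a, bi) probability 1/n. The `probe` callable is a hypothetical stand-in for a query against the application.

```python
def intra_edge_scores(probe, a):
    """Score intra-application data edges under the FD A -> B.

    probe(a) is a hypothetical helper that queries the application
    with value a and returns the instantiated values b_1..b_n of B.
    Clean data would return a single value; if n values come back,
    each association (a, b_i) is taken to be correct with
    probability 1/n (an assumed uniform reading of the slide).
    """
    values = probe(a)
    n = len(values)
    return {b: 1.0 / n for b in values} if n else {}

# Example: a telephone number mapping to two circuit IDs gives each
# edge score 0.5; a clean one-to-one mapping would keep score 1.0.
print(intra_edge_scores(lambda a: ["c1", "c2"], "tn-0555"))
```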

  13. Scoring Data Paths
  • A single data path is scored using a simple sequential composition of its data-edge probabilities (independence assumption)
    • Example: path X → a → b → Y with edge scores 0.5, 0.8, 0.6 gives pathScore = 0.5 * 0.8 * 0.6 = 0.24
  • Data paths leading to the same answer are scored using parallel composition
    • Example: a second path X → c → Y with edge scores 0.5 and 0.4 contributes 0.2, so pathScore = 0.24 + 0.2 - (0.24 * 0.2) = 0.392
  • Both compositions are sketched in code below
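A minimal Python sketch of the two compositions, reproducing the numbers from the slide; the independence assumption is the slide's own.

```python
from functools import reduce

def sequential_score(edge_scores):
    """Score of a single join path: the product of its data-edge
    probabilities (independence assumption)."""
    return reduce(lambda p, q: p * q, edge_scores, 1.0)

def parallel_score(path_scores):
    """Corroborated score of an answer reached through several paths:
    the probability that at least one path is correct,
    p1 + p2 - p1*p2, folded over all contributing paths."""
    total = 0.0
    for p in path_scores:
        total = total + p - total * p
    return total

p1 = sequential_score([0.5, 0.8, 0.6])   # 0.24
p2 = sequential_score([0.5, 0.4])        # 0.20
print(parallel_score([p1, p2]))          # 0.392
```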

  14. Identifying Answers
  • Only interested in the best answers
  • Standard top-k techniques do not apply
    • Answer scores can always be increased by new information
  • We keep score-range information (a sketch of the stop test follows)
    • Return top answers once they are identified, possibly without their complete scores (similar to NRA by Fagin et al.)
  • Two return strategies
    • Top-k
    • Top-few (weaker stop condition)
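A sketch of the NRA-style stop test the slide alludes to, assuming each candidate answer carries a (lower, upper) score range maintained elsewhere: lower is the score of the paths explored so far, upper adds the most the unexplored paths could still contribute. The function and its bookkeeping are illustrative, not the paper's exact algorithm.

```python
def can_return_topk(ranges, k):
    """NRA-style termination test over answer score ranges.

    ranges maps each candidate answer to a (lower, upper) pair.
    The top-k answers may be returned, possibly without their exact
    scores, once the k-th best lower bound is at least the upper
    bound of every remaining candidate.
    """
    by_lower = sorted(ranges.items(), key=lambda kv: kv[1][0], reverse=True)
    if len(by_lower) < k:
        return None
    kth_lower = by_lower[k - 1][1][0]
    if all(upper <= kth_lower for _, (_, upper) in by_lower[k:]):
        return [answer for answer, _ in by_lower[:k]]
    return None

# c1 is safe to return as top-1: even with all of its unexplored
# paths, c2 cannot exceed c1's current lower bound of 0.392.
print(can_return_topk({"c1": (0.392, 0.6), "c2": (0.1, 0.3)}, k=1))
```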

  15. Computing Answers
  • Take advantage of early pruning
    • Only interested in the best answers
  • Incremental data graph computation
    • Probes to each application
    • Cost model is the number of probes
  • Standard graph-search techniques (DFS, BFS) do not take advantage of score information
  • We propose a technique based on the notion of maximum benefit

  16. Maximum Benefit
  • Benefit computation of a path uses two components
    • Known scores of the explored data edges
    • Best possible way to augment an answer’s score, using the residual benefit of the unexplored schema edges
  • Our strategy makes the choices that maximize this benefit metric (toy sketch below)
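The talk does not spell the benefit metric out, so the toy sketch below shows one plausible reading: the benefit of probing a frontier node multiplies the known score of its explored prefix by an upper bound on what the unexplored schema edges beyond it could still contribute, mirroring the sequential composition of path scores. All names and the multiplicative form are assumptions.

```python
def next_probe(frontier):
    """Greedy maximum-benefit choice over the exploration frontier.

    Each entry is (node, known_score, residual_benefit):
    known_score composes the already-explored data edges leading to
    the node; residual_benefit bounds what the unexplored schema
    edges beyond it could add.  Multiplying the two is an assumed
    reading of the benefit metric, not the paper's exact definition.
    """
    return max(frontier, key=lambda entry: entry[1] * entry[2])

# A probe with a weaker known prefix but more upside wins here:
print(next_probe([("BAN", 0.9, 0.2), ("PON", 0.6, 0.8)]))  # -> PON
```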

  17. VIP Experimental Platform
  • Integration platform developed at AT&T
    • 30 legacy systems
    • Real data
  • Developed as a platform for resolving disputes between applications that are due to data inconsistencies
  • Front-end web interface

  18. VIP Queries
  • Random sample of 150 user queries
  • Analysis shows that queries can be classified according to the number of answers they retrieve:
    • noAnswer (nA): 56 queries
    • anyAnswer (aA): 94 queries
    • oneLarge (oL): 47 queries
    • manyLarge (mL): 4 queries
    • manySmall (mS): 8 queries
    • heavyHitters (hH): 10 queries that returned between 128 and 257 answers per query

  19. VIP Schema Graph
  [Chart: paths leading to an answer vs. paths leading to the top-1 answer, over the 94 anyAnswer queries]
  • Not considering all paths may lead to missing top-1 answers

  20. Number of Parallel Paths Contributing to the Top-1 Answer
  [Chart: parallel paths per top-1 answer]
  • Average of 10 parallel paths per answer, 2.5 of them significant

  21. Cost of Execution

  22. Related Work (Data Cleaning)
  • Keyword search in DBMSs (BANKS, DBXplorer, DISCOVER, ObjectRank)
    • Query is a set of keywords
    • Top-k query model
    • DB as a data graph
    • Do not agglomerate scores
  • Top-k query evaluation (TA, MPro, Upper)
    • Consider each tuple as an entity
    • Wait for exact answers (except for NRA)
    • Do not agglomerate scores
  • Probabilistic ranking of DB results
    • Queries not selective, large answer set
  • We take corroborative evidence into account to rank query results

  23. Contributions
  • Multiple Join Path framework
    • Uses corroborating evidence to identify high-quality results
    • Looks at all paths in the schema graph
  • Scoring mechanism
    • Probabilistic interpretation
    • Takes schema information into account
  • Techniques to compute answers
    • Take agglomerative scoring into account
    • Top-k and top-few

  24. Outline
  • Answer Corroboration for Data Cleaning
    • Motivations
    • Multiple Join Path Framework
    • Our Approach
    • Experimental Evaluation
  • Answer Corroboration for Web Search
    • Motivations
    • Our Approach
    • Challenges

  25. Motivations
  • Information on web sources is unreliable
    • Erroneous
    • Misleading
    • Biased
    • Outdated
  • Users check many web sites to confirm the information (data corroboration)
  • Can we do that automatically to save them time?

  26. Example: What is the gas mileage of my Honda Civic?
  • Query: “honda civic 2005 gas mileage” on MSN Search
  • Users may check several web sites to get an answer
    • Is the top hit, the carhybrids.com site, trustworthy?
    • Is the Honda web site unbiased?
    • Are all these values referring to the correct model year?

  27. Example: Aggregating Results using Data Corroboration
  • Combine similar values (sketch below)
  • Use the frequency of the answer as the ranking measure (out of the first 10 pages; one page had no answer)
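A minimal sketch of the aggregation step, assuming a simple relative-tolerance rule for deciding that two extracted numbers are "similar": values within the tolerance of a cluster's representative are merged, and clusters are ranked by frequency as on the slide. The 5% tolerance and the sample mileage values are illustrative assumptions.

```python
def corroborate(values, tol=0.05):
    """Merge numerically similar extracted values, rank by frequency.

    Two values are treated as the same answer when they differ by at
    most tol relative to the cluster representative; 5% is an
    illustrative choice, not from the talk.
    """
    clusters = []  # list of [representative, count]
    for v in values:
        for cluster in clusters:
            if abs(v - cluster[0]) <= tol * abs(cluster[0]):
                cluster[1] += 1
                break
        else:
            clusters.append([v, 1])
    return sorted(clusters, key=lambda c: c[1], reverse=True)

# Hypothetical mileage values extracted from the result pages:
print(corroborate([40, 38, 40, 51, 40, 38, 29, 40, 36]))
# -> [[40, 6], [51, 1], [29, 1], [36, 1]]
```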

  28. Challenges
  • Designing a meaningful ranking function (a toy sketch follows this list)
    • Frequency of the answer in the result set
    • Importance of the web pages containing the answer
      • As measured by the search engine (e.g., PageRank)
    • Importance of the answer within the page
      • Use of formatting information within the page
      • Proximity of the answer to the query terms
      • Multiple answers per page
    • Similarity of the page with other pages
      • Dampening factor
      • Reduce the impact of copy-paste sites
      • Reduce the impact of pages from the same domain
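The slide lists the signals but leaves their combination open; the toy function below assumes one way to put them together: each occurrence of a candidate answer contributes page importance times in-page importance, with a geometric dampening factor on repeated occurrences from the same domain to curb copy-paste sites. The linear form, the 0.8 factor, and the sample numbers are all assumptions.

```python
def answer_score(occurrences, damp=0.8):
    """Toy corroborative score for one candidate answer.

    occurrences is a list of (page_importance, in_page_weight,
    domain) triples: page_importance as reported by the search
    engine (e.g., a normalized PageRank), in_page_weight derived
    from formatting and proximity cues.  Repeat occurrences from
    the same domain are dampened geometrically; the whole
    functional form is illustrative.
    """
    seen = {}
    score = 0.0
    for page_importance, in_page_weight, domain in occurrences:
        repeats = seen.get(domain, 0)
        score += (damp ** repeats) * page_importance * in_page_weight
        seen[domain] = repeats + 1
    return score

# An answer found on three pages, two from the same domain:
print(answer_score([(0.9, 1.0, "honda.example.com"),
                    (0.5, 0.7, "mirror.example.com"),
                    (0.4, 0.7, "mirror.example.com")]))
```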

  29. Challenges (cont.)
  • Selecting the result set (web pages)
    • How deep into the search engine results do we go?
    • Low-ranked pages will not contribute much to the score: use top-k pruning techniques
  • Extracting information from the web pages
    • Use existing Information Extraction (IE) and Question Answering (QA) techniques

  30. Current Work
  • Focus on numerical queries
    • Analysis of MSN queries shows that they have a higher clickthrough rate than general queries
    • Answers are easier to identify in the text
  • Scoring function
    • Currently a simple aggregation of individual parameter scores
    • Working on a probabilistic approach
  • Number of pages accessed
    • Dynamic selection based on score information

  31. Evaluation
  • 15 million query logs from MSN
  • Focus on:
    • Queries with a high clickthrough rate
    • Numerical-value queries (for now)
  • Compare clickthrough with the best-ranked sites to measure precision and recall
  • User studies

  32. Interface

  33. Related Work
  • Web search
    • Our interface is built on top of a standard search engine
  • Question answering systems (START, askMSR, MULDER)
    • Some have used the frequency of an answer to increase its score (askMSR, MULDER)
    • We are considering more complex scoring mechanisms
  • Information extraction (Snowball)
    • We can use existing techniques to identify information within a page
    • Our problem is much simpler than standard IE
  • Top-k queries (TA, Upper, MPro)
    • We need pruning functionality to stop retrieving web search results

  34. Conclusions
  • Large amounts of low-quality data
    • Users have to rummage through a lot of information
  • Data corroboration can improve the quality of query results
    • Has not been used much in practice
    • Makes sense in many applications
  • Standard ranking techniques have to be modified to handle corroborative scoring
    • Standard ranking scores each answer individually
    • Corroborative ranking combines answers
    • Pruning conditions in top-k queries do not work on corroborative answers
