
Probabilistic Information Retrieval Approach for Ranking of Database Query Results

Presenter: Ketaki Gadre. Authors: Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum.


Presentation Transcript


  1. Probabilistic Information Retrieval Approach for Ranking of Database Query Results • Presenter: Ketaki Gadre • Authors: Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum

  2. Introduction (1/4) • Database Query Retrieval Model • Boolean Model • Many-Answers Problem: Too many tuples when query is not too selective

  3. Introduction (2/4) • Realtor Database • Each tuple represents home for sale in US • Query: select * from Homes where City=Seattle and View=Waterfront

  4. Introduction (3/4) • Many-Answers Problem in IR • Query reformulation techniques: User prompted to refine query • Automatic ranking: Query results ranked by degree of relevance

  5. Introduction (4/4) • Many-Answers Problem in Database • Query: select * from Homes where City=Seattle and View=Waterfront • Automatic Ranking: Preferable to first return homes with other desirable attributes like good SchoolDistrict, BoatDocks etc.

  6. Approach (1/3) • Look beyond attributes specified in query • Ranking function of tuple based on: • Global Score: Captures global importance of unspecified attribute values • Conditional Score: Strengths of dependencies between specified and unspecified attribute values

  7. Approach (2/3) • Query: select * from Homes where City=Seattle and View=Waterfront • SchoolDistrict=Excellent • Globally desirable → high rank • BoatDock=Yes • People desiring a waterfront view are likely to desire a boat dock → high rank

  8. Approach (3/3) • Challenge: Translate these intuitions into principled ranking function • Proposed Solution: Probabilistic IR • Why PIR? • Can extend to model data dependencies and correlations

  9. Outline • Adaptation of PIR to Structured Data • Special Cases • Generalizations • Implementation • Experiments

  10. Probabilistic IR - Overview • User information need → query representation • Document collection → document representation • How to match the two representations?

  11. Probabilistic IR - Overview • Boolean and Vector Space Model: • Query-document matching is done using index terms in the query and the document • Probabilistic IR: • Probability theory is used to estimate how likely it is that a document is relevant to a query • The goal is to estimate the probability of relevance of a document

  12. Some Probability Formulae • Bayes’ Rule: $p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$ • Product Rule: $p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$

  13. Probabilistic IR - Notations • $D$: Document collection • $Q$: Fixed query • $R$: Set of relevant documents • $\bar{R} = D - R$: Set of irrelevant documents • $p(R \mid t)$: The probability of the relevance of document $t$ • $p(\bar{R} \mid t)$: The probability of the non-relevance of document $t$

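  14. Probabilistic IR – Ranking Function • Rank documents by their odds of relevance: $Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$ • Ranking by odds gives the same ranking as ranking by probability of relevance, and the common denominator $p(t)$ can be ignored • $p(R)$ and $p(\bar{R})$ are the same for every document and thus are constants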
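A minimal sketch of ranking by odds of relevance, assuming estimates of the likelihoods p(t|R) and p(t|R̄) are already available for each document. By Bayes' rule the odds are proportional to this likelihood ratio, because the priors do not depend on the document. The document names and probability values are hypothetical and chosen only for illustration.

```python
def odds_score(p_t_given_rel: float, p_t_given_nonrel: float) -> float:
    """Likelihood ratio p(t|R) / p(t|R_bar); equals the odds of relevance
    up to the constant factor p(R) / p(R_bar)."""
    return p_t_given_rel / p_t_given_nonrel


# Hypothetical estimates per document: (p(t|R), p(t|R_bar)).
likelihoods = {
    "doc_a": (0.30, 0.10),
    "doc_b": (0.20, 0.40),
    "doc_c": (0.25, 0.05),
}

# Sort documents by descending score.
ranking = sorted(likelihoods, key=lambda d: odds_score(*likelihoods[d]), reverse=True)
print(ranking)
```

Only the relative order of the scores matters here, so dropping the constant prior ratio does not change the ranking.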

  15. How to adapt Probabilistic IR model for structured databases?

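  16. Notations • $D$: Database table with $n$ tuples $\{t_1, \dots, t_n\}$ over $m$ attributes $A = \{A_1, \dots, A_m\}$ • Query $Q$: select * from $D$ where $X_1 = x_1$ and … and $X_s = x_s$ • Specified attributes, $X = \{X_1, \dots, X_s\} \subseteq A$ • Unspecified attributes, $Y = A - X$ • Answer set, $S \subseteq D$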

  17. Types of Queries • Point query: select * from $D$ where $X_1 = x_1$ and … and $X_s = x_s$ • IN query: select * from $D$ where $X_1$ IN (…) and … and $X_s$ IN (…) • Range query: select * from $D$ where Sqft BETWEEN (2500, 3000)

  18. Adaptation for Structured Data (1/2) • Each tuple is treated as document • : subset of values corresponding to attributes in • : remaining subset of values corresponding to attributes in or

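  19. Adaptation for Structured Data (2/2) • $Score(t) \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$ • Approximate $\bar{R}$ as $D$: $Score(t) \propto \frac{p(t \mid R)}{p(t \mid D)} = \frac{p(X, Y \mid R)}{p(X, Y \mid D)}$ • All relevant tuples have the same $X$ values specified in the query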

  20. Limited Independence Assumption • Given a query $Q$ and a tuple $t$: • The $X$ (and $Y$) values within themselves are assumed to be independent • But dependencies between the $X$ and $Y$ values are allowed

  21. Presence of Functional Dependencies • If attributes are related through FDs, we derive equation without making limited independence assumption

  22. Functional Dependencies • Constraints derived from the meaning and interrelationships of the data attributes • The value of one attribute (or set of attributes) determines a unique value for another • i.e. whenever two tuples have the same value for the first, they must have the same value for the second • e.g. Zipcode → City
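The definition of a functional dependency can be checked directly against data. The sketch below is a minimal, hypothetical helper (`fd_holds` is not from the presentation) that tests whether one attribute determines another in a small in-memory table; the sample rows are illustrative only.

```python
from typing import Hashable, Iterable, Mapping


def fd_holds(rows: Iterable[Mapping[str, Hashable]], lhs: str, rhs: str) -> bool:
    """Return True if attribute `lhs` functionally determines `rhs` in `rows`,
    i.e. no two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value, different rhs value
        seen[key] = row[rhs]
    return True


# Hypothetical sample data.
homes = [
    {"Zipcode": "98033", "City": "Kirkland"},
    {"Zipcode": "98034", "City": "Kirkland"},
    {"Zipcode": "98052", "City": "Redmond"},
]

print(fd_holds(homes, "Zipcode", "City"))  # Zipcode determines City in this sample
print(fd_holds(homes, "City", "Zipcode"))  # City does not determine Zipcode
```

A check like this only shows that a dependency holds in the sample at hand; whether it is a true constraint depends on the meaning of the attributes.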

  23. Presence of FDs (1/4) • FDs apply only to data, not to workload • E.g. Zipcode → City • Applicable in data • A query Q in the workload that specifies a requested Zipcode may not have specified City

  24. Presence of FDs (2/4) • FD between attributes in , where,

  25. Presence of FDs (3/4) • Similarly, FDs between attributes in X where,

  26. Presence of FDs (4/4)

  27. $R$ is unknown • How to estimate $p(y \mid R)$?

  28. Workload-Based Estimation (1/6) • Workload $W$: collection of queries executed in the past • e.g. select * from Homes where City=Kirkland and Price=High • $W$ may reveal some patterns: • A large fraction of users who requested high-priced homes in Kirkland also requested waterfront views • The data alone does not indicate these user preferences

  29. Workload-Based Estimation (2/6) • Approximate $R$ as all query “tuples” in $W$ that also request $X$
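A minimal sketch of this approximation, assuming each past query is stored as a dictionary from attribute to requested value. The relevant set is taken to be the workload queries that request the same specified values as the current query, and the fraction of those that also request a given unspecified value serves as the estimate. The function name and the sample workload are hypothetical.

```python
def estimate_p_y_given_r(workload, specified, y_attr, y_value):
    """Estimate p(y|R) as the fraction of workload queries requesting all
    `specified` values that also request y_attr = y_value."""
    matching = [
        q for q in workload
        if all(q.get(attr) == value for attr, value in specified.items())
    ]
    if not matching:
        return 0.0  # no evidence in the workload; a real system might smooth here
    hits = sum(1 for q in matching if q.get(y_attr) == y_value)
    return hits / len(matching)


# Hypothetical workload of past point queries.
workload = [
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High"},
    {"City": "Seattle", "Price": "Moderate"},
]

p = estimate_p_y_given_r(
    workload, {"City": "Kirkland", "Price": "High"}, "View", "Waterfront"
)
print(p)
```

Scanning the workload per query is only for illustration; the later slides on atomic probabilities describe pre-computing such quantities instead.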

  30. Workload-Based Estimation (3/6) • Replace $R$ with $X, W$, so that $p(y \mid R)$ becomes $p(y \mid X, W)$

  31. Workload-Based Estimation (4/6) • Applying Bayes’ rule for $p(y \mid X, W)$, $p(y \mid X, W) = \frac{p(X \mid y, W)\, p(y \mid W)}{p(X \mid W)}$

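  32. Workload-Based Estimation (5/6) • Dropping constant $p(X \mid W)$, $Score(t) \propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)} \cdot \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}$ • First product: Global part • Second product: Conditional part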
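A minimal sketch of a score with a global part and a conditional part, assuming the form Score(t) ∝ ∏ p(y|W)/p(y|D) · ∏∏ p(x|y,W)/p(x|y,D) over unspecified values y and specified values x. The function signature, the dictionary layout, and all probability values are hypothetical; the sketch also assumes every needed probability is present and non-zero.

```python
from math import prod


def score(tuple_values, specified_attrs, p_w, p_d, p_cond_w, p_cond_d):
    """Score a tuple (dict attr -> value) for a query over `specified_attrs`.

    p_w[y], p_d[y]: p(y|W) and p(y|D), keyed by (attr, value).
    p_cond_w[(x, y)], p_cond_d[(x, y)]: p(x|y,W) and p(x|y,D).
    """
    xs = [(a, v) for a, v in tuple_values.items() if a in specified_attrs]
    ys = [(a, v) for a, v in tuple_values.items() if a not in specified_attrs]
    # Global part: importance of each unspecified value on its own.
    global_part = prod(p_w[y] / p_d[y] for y in ys)
    # Conditional part: dependency between specified and unspecified values.
    conditional_part = prod(
        p_cond_w[(x, y)] / p_cond_d[(x, y)] for y in ys for x in xs
    )
    return global_part * conditional_part


# Hypothetical atomic probabilities for one specified and one unspecified value.
x = ("View", "Waterfront")
y = ("BoatDock", "Yes")
s = score(
    {"View": "Waterfront", "BoatDock": "Yes"},
    {"View"},
    p_w={y: 0.5},
    p_d={y: 0.25},
    p_cond_w={(x, y): 0.75},
    p_cond_d={(x, y): 0.25},
)
print(s)
```

In this toy case the global ratio is 2 and the conditional ratio is 3, so the tuple's score is their product.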

  33. Workload-Based Estimation (6/6) • Considering functional dependencies, Global Conditional

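  34. Calculating Atomic Probabilities (1/2) • Pre-compute atomic quantities $p(y \mid W)$, $p(y \mid D)$, $p(x \mid y, W)$ and $p(x \mid y, D)$ for all distinct values in the database • $p(y \mid W)$ and $p(y \mid D)$: • Relative frequencies of each distinct value $y$ in the workload and the database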

  35. Calculating Atomic Probabilities (2/2) • $p(x \mid y, W)$ and $p(x \mid y, D)$: • Compute confidences of pairwise association rules in the workload and the database • “Association rule $y \Rightarrow x$ has confidence $c$ if $c\%$ of transactions in the database that contain $y$ also contain $x$”
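A minimal sketch of pre-computing both kinds of atomic quantities from a list of records, where a record is either a database tuple or a workload query stored as a dictionary. Relative frequencies are counts divided by the number of records, and the confidence of a pairwise rule is the co-occurrence count divided by the count of the conditioning value. The function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(records):
    """Return (freq, conf) where freq[v] is the relative frequency of value v
    and conf[(x, y)] is the confidence of y => x, i.e. an estimate of p(x|y).
    Values are (attribute, value) pairs."""
    n = len(records)
    single = Counter()
    pair = Counter()
    for rec in records:
        items = set(rec.items())
        single.update(items)
        pair.update(permutations(items, 2))  # ordered pairs of distinct values
    freq = {v: c / n for v, c in single.items()}
    conf = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return freq, conf


# Hypothetical workload; the same function could be run over database tuples.
workload = [
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High"},
    {"City": "Redmond"},
]

freq, conf = atomic_probabilities(workload)
print(freq[("City", "Kirkland")])
print(conf[(("View", "Waterfront"), ("City", "Kirkland"))])
```

Pairs that never co-occur are simply absent from `conf`, so a caller would need to decide how to treat missing entries.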

  36. Special Cases • Ranking function in absence of workload • Ranking function assuming no dependencies between attributes

  37. Ranking Function in Absence of Workload • Assume that: • $p(y \mid W)$ is the same for all distinct values $y$ • and $p(x \mid y, W)$ is the same for all pairs of distinct values $x$ and $y$
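A short worked derivation of what these assumptions imply, assuming the workload-based score has the product form with a global part and a conditional part. Here $c_1$ and $c_2$ are hypothetical names for the two constant workload probabilities; they are not from the slides.

```latex
\begin{align*}
Score(t) &\propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
            \cdot \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)} \\
         &= \prod_{y \in Y} \frac{c_1}{p(y \mid D)}
            \cdot \prod_{y \in Y} \prod_{x \in X} \frac{c_2}{p(x \mid y, D)} \\
         &\propto \prod_{y \in Y} \frac{1}{p(y \mid D)}
            \cdot \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}
\end{align*}
```

The last step holds because every tuple in the answer set has the same number of specified and unspecified values, so the factor $c_1^{|Y|} c_2^{|X||Y|}$ is identical for all tuples and does not affect the ranking.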

  40. Ranking Function Assuming No Dependencies Between Attributes • Make independence assumption between all attributes
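A short worked derivation of the effect of full independence, assuming a point query and a score of the product form with a global part and a conditional part. Under independence, conditioning on $y$ changes nothing, so each conditional factor depends only on $x$.

```latex
\begin{align*}
p(x \mid y, W) &= p(x \mid W), \qquad p(x \mid y, D) = p(x \mid D) \\
Score(t) &\propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
            \cdot \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid W)}{p(x \mid D)}
          \;\propto\; \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
\end{align*}
```

The conditional part drops out because all tuples in the answer set of a point query share the same specified values $x$, which makes that product the same constant for every tuple. Only the global part remains.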

  43. Generalizations • IN Queries • IN conditions in the Query • IN conditions in the Workload • Numeric Attributes • Estimating and • Multi-table Database

  44. IN Queries • Generalization of point queries select * from where City IN(Kirkland, Redmond)and Price IN(High, Moderate) • Challenge: two tuples that satisfy the query condition may differ in their specific values

  45. IN Conditions in the Query (1/4) • Recall equation, = .

  46. IN Conditions in the Query (2/4) • Score function with IN conditions in the queries

  47. IN Conditions in the Query (3/4) • Score function for point queries

  48. IN Conditions in the Query (4/4) • Extra factor needed to be multiplied • Equivalent equation is, Conditional Global

  49. IN Conditions in the Workload • Conceptually expand the workload • Split each IN query into point queries • Query: City IN (Bellevue, Redmond, Carnation) AND Price IN (High, Moderate) • Split into 3×2=6 point queries
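A minimal sketch of the conceptual expansion, assuming an IN query is stored as a dictionary from attribute to its list of allowed values. The Cartesian product of the value lists yields one point query per combination. The helper name `expand_in_query` is hypothetical.

```python
from itertools import product


def expand_in_query(conditions):
    """Split an IN query (attr -> list of allowed values) into point queries
    (attr -> single value), one per combination of values."""
    attrs = list(conditions)
    return [
        dict(zip(attrs, combo))
        for combo in product(*(conditions[a] for a in attrs))
    ]


query = {
    "City": ["Bellevue", "Redmond", "Carnation"],
    "Price": ["High", "Moderate"],
}
point_queries = expand_in_query(query)
print(len(point_queries))  # 3 x 2 = 6 point queries
print(point_queries[0])
```

The expansion is described as conceptual on the slide; materializing it like this is only meant to make the counting concrete.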

  50. Numeric Attributes (1/2) • Query: Age BETWEEN (5, 10) AND Sqft BETWEEN (2500, 3000) • Simple approach: • Treat numerical value as categorical value • Convert queries with range conditions to queries with IN conditions • Problem: • Many distinct values are not adequately represented in the workload
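A minimal sketch of the simple approach, assuming the distinct values of a numeric attribute are known: a range condition becomes an IN condition listing every distinct value that falls inside the range, with both bounds treated as inclusive. The helper name and the sample Sqft values are hypothetical.

```python
def range_to_in(values, low, high):
    """Convert a BETWEEN (low, high) condition into the sorted list of
    distinct values in the inclusive range, treating each as categorical."""
    return sorted(v for v in set(values) if low <= v <= high)


# Hypothetical Sqft values appearing in the database.
sqft_values = [2400, 2500, 2650, 2650, 2800, 3000, 3100]
in_list = range_to_in(sqft_values, 2500, 3000)
print(in_list)
```

Each value in the resulting list would then need its own workload statistics, which illustrates the stated problem: many of these distinct numeric values may rarely or never appear in the workload.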
