Probabilistic Information Retrieval Approach for Ranking of Database Query Results • Presenter: Ketaki Gadre • Authors: SURAJIT CHAUDHURI, GAUTAM DAS, VAGELIS HRISTIDIS, GERHARD WEIKUM
Introduction (1/4) • Database Query Retrieval Model • Boolean Model • Many-Answers Problem: Too many tuples when query is not too selective
Introduction (2/4) • Realtor Database • Each tuple represents home for sale in US • Query: select * from Homes where City=Seattle and View=Waterfront
Introduction (3/4) • Many-Answers Problem in IR • Query reformulation techniques: User prompted to refine query • Automatic ranking: Query results ranked by degree of relevance
Introduction (4/4) • Many-Answers Problem in Database • Query: select * from Homes where City=Seattle and View=Waterfront • Automatic Ranking: Preferable to first return homes with other desirable attributes like good SchoolDistrict, BoatDocks etc.
Approach (1/3) • Look beyond attributes specified in query • Ranking function of tuple based on: • Global Score: Captures global importance of unspecified attribute values • Conditional Score: Strengths of dependencies between specified and unspecified attribute values
Approach (2/3) • Query: select * from Homes where City=Seattle and View=Waterfront • SchoolDistrict=Excellent • Globally desirable → High rank • BoatDock=Yes • People desiring waterfront likely to desire boat dock → High rank
Approach (3/3) • Challenge: Translate these intuitions into principled ranking function • Proposed Solution: Probabilistic IR • Why PIR? • Can extend to model data dependencies and correlations
Outline • Adaptation of PIR to Structured Data • Special Cases • Generalizations • Implementation • Experiments
Probabilistic IR - Overview • Diagram: User Information Need → Query Representation • Document Collection → Document Representation • How to match the two representations?
Probabilistic IR - Overview • Boolean and Vector Space Model: • Query-Document matching is done using index terms in query and document • Probabilistic IR: • Probability theory is used to estimate how likely it is that a document is relevant to a query • The goal is the estimation of the probability of relevance of a document
Some Probability Formulae • Bayes’ Rule: $p(a \mid b) = \frac{p(b \mid a)\,p(a)}{p(b)}$ • Product Rule: $p(a, b \mid c) = p(a \mid c)\,p(b \mid a, c)$
Probabilistic IR - Notations • $D$: Document collection • $Q$: Fixed query • $R$: Set of relevant documents • $\bar{R} = D - R$: Set of irrelevant documents • $p(R \mid t)$: The probability of the relevance of document $t$ • $p(\bar{R} \mid t)$: The probability of the non-relevance of document $t$
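Probabilistic IR – Ranking Function • Rank documents by their odds of relevance: $Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\,p(R)/p(t)}{p(t \mid \bar{R})\,p(\bar{R})/p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$ • Applying Bayes’ rule gives the same ranking, and we can ignore the common denominator $p(t)$ • $p(R)$ and $p(\bar{R})$ are the same for every document and thus are constants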
How to adapt Probabilistic IR model for structured databases?
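Notations • $D$: Database table with $n$ tuples $\{t_1, \ldots, t_n\}$ over $m$ attributes $A = \{A_1, \ldots, A_m\}$ • Query $Q$: select * from $D$ where $X_1 = x_1$ and … and $X_s = x_s$ • Specified attributes, $X = \{X_1, \ldots, X_s\} \subseteq A$ • Unspecified attributes, $Y = A - X$ • Answer set, $S \subseteq D$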
Types of Queries • Point query: select * from $D$ where $X_1 = x_1$ and … and $X_s = x_s$ • IN query: select * from $D$ where $X_1$ IN (…) and … and $X_s$ IN (…) • Range query: select * from $D$ where Sqft BETWEEN (2500, 3000)
Adaptation for Structured Data (1/2) • Each tuple $t$ is treated as a document • $t(X)$, or simply $X$: subset of values corresponding to attributes in $X$ • $t(Y)$, or simply $Y$: remaining subset of values corresponding to attributes in $Y$
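Adaptation for Structured Data (2/2) • $Score(t) \propto \frac{p(t \mid R)}{p(t \mid D)} = \frac{p(X, Y \mid R)}{p(X, Y \mid D)} = \frac{p(X \mid R)}{p(X \mid D)} \cdot \frac{p(Y \mid X, R)}{p(Y \mid X, D)}$ • Approximate $\bar{R}$ as $D$ • All relevant tuples have the same $X$ values specified in the query, so $\frac{p(X \mid R)}{p(X \mid D)}$ is a constant and $p(Y \mid X, R) = p(Y \mid R)$, giving $Score(t) \propto \frac{p(Y \mid R)}{p(Y \mid X, D)}$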
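Limited Independence Assumption • Given a query $Q$ and a tuple $t$ • The $X$ (and $Y$) values within themselves are assumed to be independent • But dependencies between the $X$ and $Y$ values are allowed • $Score(t) \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid X, D)} \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \cdot \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$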
Presence of Functional Dependencies • If attributes are related through FDs, we derive equation without making limited independence assumption
Functional Dependencies • Constraints derived from the meaning and interrelationships of the data attributes • $A_i \rightarrow A_j$ holds if the value of $A_i$ determines a unique value for $A_j$ • i.e. whenever two tuples have the same value for $A_i$, they must have the same value for $A_j$ • e.g. Zipcode → City
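The definition above can be checked directly against a table. The following is a minimal sketch, not taken from the slides: the helper `fd_holds` and the toy `homes` rows are hypothetical, and the check only tells us whether a dependency holds in the given sample of tuples.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if attribute `lhs` functionally determines `rhs` in `rows`.

    The FD holds when any two rows that agree on `lhs` also agree on `rhs`.
    """
    seen = {}  # lhs value -> the rhs value first observed with it
    for row in rows:
        key, val = row[lhs], row[rhs]
        # setdefault stores val on first sight, otherwise returns the stored value
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sample of a Homes table.
homes = [
    {"Zipcode": "98101", "City": "Seattle"},
    {"Zipcode": "98101", "City": "Seattle"},
    {"Zipcode": "98033", "City": "Kirkland"},
    {"Zipcode": "98104", "City": "Seattle"},
]

zip_determines_city = fd_holds(homes, "Zipcode", "City")   # True
city_determines_zip = fd_holds(homes, "City", "Zipcode")   # False: Seattle has two zipcodes
```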
Presence of FDs (1/4) • FDs apply only to data, not to workload • E.g. Zipcode → City • Applicable in data • A query $Q$ in the workload that specifies a requested Zipcode may not have specified City
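Presence of FDs (2/4) • FDs between attributes in $Y$: $p(Y \mid X, D) = p(Y' \mid X, D)$, where $Y' \subseteq Y$ keeps only the values whose attributes are not functionally determined by other attributes in $Y$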
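Presence of FDs (3/4) • Similarly, FDs between attributes in $X$: $p(X \mid y, D) = p(X' \mid y, D)$, where $X' \subseteq X$ keeps only the values whose attributes are not functionally determined by other attributes in $X$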
$R$ is unknown • How to estimate $p(y \mid R)$?
Workload-Based Estimation (1/6) • Workload $W$: collection of queries executed in the past • select * from Homes where City=Kirkland and Price=High • $W$ may reveal some patterns • A large fraction of users who requested high-priced homes in Kirkland also requested waterfront views • Data does not indicate these user preferences
Workload-Based Estimation (2/6) • Approximate $R$ as all query “tuples” in $W$ that also request $X$
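Workload-Based Estimation (3/6) • Replace $p(y \mid R)$ with $p(y \mid X, W)$: $Score(t) \propto \prod_{y \in Y} \frac{p(y \mid X, W)}{p(y \mid D)} \cdot \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$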
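Workload-Based Estimation (4/6) • Applying Bayes’ rule for $p(y \mid X, W)$, $p(y \mid X, W) = \frac{p(X \mid y, W)\,p(y \mid W)}{p(X \mid W)} = \frac{p(y \mid W) \prod_{x \in X} p(x \mid y, W)}{p(X \mid W)}$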
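Workload-Based Estimation (5/6) • Dropping constant $p(X \mid W)$, $Score(t) \propto \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{Global}} \cdot \underbrace{\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{Conditional}}$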
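To make the two parts concrete, here is a minimal sketch of a scoring function that multiplies a global part, the product over unspecified values y of p(y|W)/p(y|D), by a conditional part, the product over pairs (x, y) of p(x|y,W)/p(x|y,D). The function name `score`, the dictionary layout, and all probability values below are hypothetical illustrations, not numbers from the slides; the sketch also assumes every probability it looks up is nonzero.

```python
from math import prod


def score(tuple_y, query_x, p_w, p_d, p_cond_w, p_cond_d):
    """Score one tuple as global part times conditional part.

    tuple_y  : unspecified (attribute, value) pairs of the tuple
    query_x  : specified (attribute, value) pairs of the query
    p_w, p_d : value -> p(value | W) and p(value | D)
    p_cond_w, p_cond_d : (x, y) -> p(x | y, W) and p(x | y, D)
    """
    global_part = prod(p_w[y] / p_d[y] for y in tuple_y)
    conditional_part = prod(
        p_cond_w[(x, y)] / p_cond_d[(x, y)] for y in tuple_y for x in query_x
    )
    return global_part * conditional_part


# Hypothetical query View=Waterfront and two candidate tuples that differ in BoatDock.
wf = ("View", "Waterfront")
dock_yes, dock_no = ("BoatDock", "Yes"), ("BoatDock", "No")

p_w = {dock_yes: 0.4, dock_no: 0.1}    # made-up workload frequencies
p_d = {dock_yes: 0.2, dock_no: 0.8}    # made-up database frequencies
p_cond_w = {(wf, dock_yes): 0.9, (wf, dock_no): 0.1}
p_cond_d = {(wf, dock_yes): 0.3, (wf, dock_no): 0.05}

score_with_dock = score([dock_yes], [wf], p_w, p_d, p_cond_w, p_cond_d)
score_without_dock = score([dock_no], [wf], p_w, p_d, p_cond_w, p_cond_d)
```

With these made-up numbers the home with a boat dock scores (0.4/0.2)·(0.9/0.3) = 6, against (0.1/0.8)·(0.1/0.05) = 0.25 for the home without one, so it is ranked first.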
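Workload-Based Estimation (6/6) • Considering functional dependencies, $Score(t) \propto \underbrace{\frac{\prod_{y \in Y} p(y \mid W)}{\prod_{y \in Y'} p(y \mid D)}}_{\text{Global}} \cdot \underbrace{\frac{\prod_{y \in Y} \prod_{x \in X} p(x \mid y, W)}{\prod_{y \in Y'} \prod_{x \in X'} p(x \mid y, D)}}_{\text{Conditional}}$, where $Y' \subseteq Y$ and $X' \subseteq X$ exclude values of attributes that are functionally determined by other attributes in the same set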
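Calculating Atomic Probabilities (1/2) • Pre-compute atomic quantities $p(x \mid W)$, $p(x \mid D)$, $p(x \mid y, W)$ and $p(x \mid y, D)$ for all distinct values $x$ and $y$ in the database • $p(x \mid W)$ and $p(x \mid D)$: • Relative frequencies of each distinct value $x$ in the workload and database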
Calculating Atomic Probabilities (2/2) • $p(x \mid y, W)$ and $p(x \mid y, D)$: • Compute confidences of pairwise association rules in the workload and database • “Association rule $y \rightarrow x$ has confidence $c$ if $c\%$ of transactions in the database that contain $y$ also contain $x$”
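The pre-computation described above amounts to counting. Below is a minimal sketch, assuming each database tuple or workload query is represented as a dictionary from attribute to value; the function name `atomic_probabilities` and the toy workload are hypothetical. Relative frequencies give p(x | ·), and the confidence of the pairwise rule y → x, count(x and y) / count(y), gives p(x | y, ·).

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(records):
    """Return (p, p_cond) estimated from a list of attribute -> value dicts.

    p[x]           : relative frequency of value x among the records
    p_cond[(x, y)] : confidence of the rule y -> x, i.e. count(x, y) / count(y)
    """
    n = len(records)
    single = Counter()  # occurrences of each (attribute, value)
    pair = Counter()    # co-occurrences of ordered pairs of (attribute, value)
    for rec in records:
        items = list(rec.items())
        single.update(items)
        pair.update(permutations(items, 2))
    p = {v: c / n for v, c in single.items()}
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p, p_cond


# Hypothetical workload: each query lists only the attributes it specifies.
workload = [
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High"},
    {"City": "Seattle", "View": "Waterfront"},
]

p_w, p_cond_w = atomic_probabilities(workload)
p_waterfront = p_w[("View", "Waterfront")]                                    # 3 of 4 queries
p_waterfront_given_kirkland = p_cond_w[(("View", "Waterfront"), ("City", "Kirkland"))]  # 2 of 3
```

The same function can be run on the database tuples to obtain the data-side quantities.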
Special Cases • Ranking function in absence of workload • Ranking function assuming no dependencies between attributes
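Ranking Function in Absence of Workload • Assume that: • $p(x \mid W)$ is the same for all distinct values $x$ • and $p(x \mid y, W)$ is the same for all pairs of distinct values $x$ and $y$ • $Score(t) \propto \prod_{y \in Y} \frac{1}{p(y \mid D)} \cdot \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$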
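When the workload terms are treated as constants, only the data-side terms remain: the score becomes the product of 1/p(y|D) over unspecified values y, times the product of 1/p(x|y,D) over pairs (x, y). The following is a minimal sketch of that data-only score; the function name `data_only_score` and the four toy rows are hypothetical, and the tuple being scored is assumed to be one of the rows so that every count is nonzero.

```python
from collections import Counter
from itertools import permutations
from math import prod


def data_only_score(tuple_y, query_x, rows):
    """Score from data frequencies only: prod 1/p(y|D) * prod 1/p(x|y,D)."""
    n = len(rows)
    single = Counter()  # occurrences of each (attribute, value)
    pair = Counter()    # co-occurrences of ordered pairs of (attribute, value)
    for row in rows:
        items = list(row.items())
        single.update(items)
        pair.update(permutations(items, 2))
    # 1 / p(y|D) = n / count(y)
    global_part = prod(n / single[y] for y in tuple_y)
    # 1 / p(x|y,D) = count(y) / count(x, y)
    conditional_part = prod(
        single[y] / pair[(x, y)] for y in tuple_y for x in query_x
    )
    return global_part * conditional_part


# Hypothetical Homes rows; the query is City=Seattle.
rows = [
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "No"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "No"},
    {"City": "Seattle", "View": "Street", "BoatDock": "No"},
]
query_x = [("City", "Seattle")]

score_dock = data_only_score(
    [("View", "Waterfront"), ("BoatDock", "Yes")], query_x, rows
)
score_no_dock = data_only_score(
    [("View", "Waterfront"), ("BoatDock", "No")], query_x, rows
)
```

In this toy table the tuple with the rarer value BoatDock=Yes scores 16/3 against 16/9 for BoatDock=No, so without a workload the ranking favors tuples whose unspecified values are infrequent in the data.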
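Ranking Function Assuming No Dependencies Between Attributes • Make independence assumption between all attributes • Then $p(x \mid y, W) = p(x \mid W)$ and $p(x \mid y, D) = p(x \mid D)$, so the conditional part is the same for every tuple in the answer set • $Score(t) \propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}$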
Generalizations • IN Queries • IN conditions in the Query • IN conditions in the Workload • Numeric Attributes • Estimating and • Multi-table Database
IN Queries • Generalization of point queries • select * from $D$ where City IN (Kirkland, Redmond) and Price IN (High, Moderate) • Challenge: two tuples that satisfy the query condition may differ in their specific $X$ values
IN Conditions in the Query (1/4) • Recall equation, = .
IN Conditions in the Query (2/4) • Score function with IN conditions in the queries
IN Conditions in the Query (3/4) • Score function for point queries
IN Conditions in the Query (4/4) • Extra factor needed to be multiplied • Equivalent equation is, Conditional Global
IN Conditions in the Workload • Conceptually expand the workload • Split each IN query into point queries • Query: City IN (Bellevue, Redmond, Carnation) AND Price IN (High, Moderate) • Split into 3×2=6 point queries
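The conceptual expansion is a Cartesian product over the IN lists. A minimal sketch, assuming an IN query is represented as a dictionary from attribute to its list of allowed values (the function name `split_in_query` and this representation are hypothetical):

```python
from itertools import product


def split_in_query(in_query):
    """Expand an IN query into the equivalent list of point queries."""
    attrs = list(in_query)
    value_lists = (in_query[a] for a in attrs)
    # One point query per combination of one value from each IN list.
    return [dict(zip(attrs, combo)) for combo in product(*value_lists)]


in_query = {
    "City": ["Bellevue", "Redmond", "Carnation"],
    "Price": ["High", "Moderate"],
}
point_queries = split_in_query(in_query)
# len(point_queries) == 6, e.g. {"City": "Bellevue", "Price": "High"}
```

Each resulting point query can then be counted in the workload like any other point query.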
Numeric Attributes (1/2) • Query: Age BETWEEN (5, 10) AND Sqft BETWEEN (2500, 3000) • Simple approach: • Treat numerical value as categorical value • Convert queries with range conditions to queries with IN conditions • Problem: • Many distinct values are not adequately represented in the workload
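The simple approach can be sketched as follows. This is an illustration under an assumption the slide does not spell out: the IN list is taken to be the distinct values of the column that fall inside the range, with both bounds inclusive as in SQL BETWEEN. The function name `range_to_in` and the sample Sqft values are hypothetical.

```python
def range_to_in(column_values, low, high):
    """Rewrite 'attr BETWEEN low AND high' as the IN list of distinct
    column values that fall inside the (inclusive) range."""
    return sorted({v for v in column_values if low <= v <= high})


# Hypothetical Sqft column of the Homes table.
sqft_column = [1800, 2500, 2650, 2650, 3000, 3200]
sqft_in_list = range_to_in(sqft_column, 2500, 3000)   # [2500, 2650, 3000]
```

The sketch also makes the stated problem visible: a value such as 2650 is treated as its own category, so it needs to appear often enough in the workload for its frequencies to be meaningful.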