Probabilistic Information Retrieval Approach for Ranking of Database Query Results

Presenter: Ketaki Gadre • Authors: Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum

Introduction (1/4) • Database Query Retrieval Model • Boolean Model • Many-Answers Problem: too many tuples are returned when the query is not very selective

Introduction (2/4) • Realtor Database • Each tuple represents a home for sale in the US • Query: select * from Homes where City=Seattle and View=Waterfront

Introduction (3/4) • Many-Answers Problem in IR • Query reformulation techniques: User prompted to refine query • Automatic ranking: Query results ranked by degree of relevance

Introduction (4/4) • Many-Answers Problem in Database • Query: select * from Homes where City=Seattle and View=Waterfront • Automatic Ranking: Preferable to first return homes with other desirable attributes like good SchoolDistrict, BoatDocks etc.

Approach (1/3) • Look beyond attributes specified in query • Ranking function of tuple based on: • Global Score: Captures global importance of unspecified attribute values • Conditional Score: Strengths of dependencies between specified and unspecified attribute values

Approach (2/3) • Query: select * from Homes where City=Seattle and View=Waterfront • SchoolDistrict=Excellent • Globally desirable → High rank • BoatDock=Yes • People desiring waterfront likely to desire boat dock → High rank

Approach (3/3) • Challenge: Translate these intuitions into principled ranking function • Proposed Solution: Probabilistic IR • Why PIR? • Can extend to model data dependencies and correlations

Outline • Adaptation of PIR to Structured Data • Special Cases • Generalizations • Implementation • Experiments

Probabilistic IR - Overview • [Diagram labels: User Information Need, Query Representation, Document Collection, Document Representation, How to match?]

Probabilistic IR - Overview • Boolean and Vector Space Model: • Query-document matching is done using index terms in the query and the document • Probabilistic IR: • Probability theory is used to estimate how likely it is that a document is relevant to a query • The goal is to estimate the probability of relevance of each document

Some Probability Formulae • Bayes’ Rule: p(a | b) = p(b | a) p(a) / p(b) • Product Rule: p(a, b | c) = p(a | c) p(b | a, c)

Probabilistic IR - Notations • D: Document collection • Q: Fixed query • R: Set of relevant documents • R̄ = D − R: Set of irrelevant documents • p(R | t): The probability of the relevance of document t • p(R̄ | t): The probability of the non-relevance of document t

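Probabilistic IR – Ranking Function • Rank documents by their odds of relevance: Score(t) = p(R | t) / p(R̄ | t) • This gives the same ranking as p(R | t); after applying Bayes’ rule to numerator and denominator we can ignore the common denominator p(t) • p(R) and p(R̄) are the same for every document and thus are constants • Score(t) = [p(t | R) p(R) / p(t)] / [p(t | R̄) p(R̄) / p(t)] ∝ p(t | R) / p(t | R̄)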
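The claim that ranking by odds gives the same order as ranking by the probability of relevance can be checked with a minimal sketch. It relies only on p(R̄ | t) = 1 − p(R | t); the document names and probabilities below are made up for illustration.

```python
# Hypothetical relevance probabilities p(R | t) for three documents.
p_relevant = {"t1": 0.20, "t2": 0.75, "t3": 0.50}

def odds(p: float) -> float:
    """Odds of relevance p / (1 - p), assuming 0 <= p < 1."""
    return p / (1.0 - p)

# Sort once by probability and once by odds; p -> p / (1 - p) is increasing,
# so both orderings should coincide.
rank_by_prob = sorted(p_relevant, key=lambda t: p_relevant[t], reverse=True)
rank_by_odds = sorted(p_relevant, key=lambda t: odds(p_relevant[t]), reverse=True)
print(rank_by_prob, rank_by_odds)
```

Because only the order matters, any factor that is identical for every document can be dropped from the score without changing the ranking.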

How to adapt Probabilistic IR model for structured databases?

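Notations • D: Database table with n tuples {t1, …, tn} over m attributes A = {A1, …, Am} • Query Q: select * from D where X1=x1 and … and Xs=xs • Specified attributes, X = {X1, …, Xs} • Unspecified attributes, Y = A − X • Answer set, S ⊆ {t1, …, tn}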

Types of Queries • Point query: select * from D where X1=x1 and … and Xs=xs • IN Query: select * from D where X1 IN (…) and … and Xs IN (…) • Range Query: select * from D where Sqft BETWEEN (2500, 3000)

Adaptation for Structured Data (1/2) • Each tuple t is treated as a document • X: subset of the values of t corresponding to the attributes in X • Y: remaining subset of the values of t, corresponding to the attributes in Y

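Adaptation for Structured Data (2/2) • Score(t) ∝ p(t | R) / p(t | R̄) • Approximate R̄ as D: Score(t) ∝ p(t | R) / p(t | D) = p(X, Y | R) / p(X, Y | D) = [p(X | R) / p(X | D)] · [p(Y | X, R) / p(Y | X, D)] • All relevant tuples have the same X values specified in query Q, so p(X | R) / p(X | D) is constant over the answer set and p(Y | X, R) = p(Y | R): Score(t) ∝ p(Y | R) / p(Y | X, D)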

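Limited Independence Assumption • Given a query Q and a tuple t • The X (and Y) values within themselves are assumed to be independent: for any condition C, p(X | C) = ∏_{x∈X} p(x | C) and p(Y | C) = ∏_{y∈Y} p(y | C) • But dependencies between the X and Y values are allowed • Score(t) ∝ ∏_{y∈Y} p(y | R) / p(y | D) · ∏_{y∈Y} ∏_{x∈X} 1 / p(x | y, D)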

Presence of Functional Dependencies • If attributes are related through FDs, we derive equation without making limited independence assumption

Functional Dependencies • Constraints derived from the meaning and interrelationships of the data attributes • An FD U → V holds if the value of U determines a unique value for V • i.e. whenever two tuples have the same value for U, they must have the same value for V • e.g. Zipcode → City
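The definition above translates directly into a check over a table: group the tuples by their left-hand-side values and verify that each group carries a single right-hand-side value. The following is a minimal sketch; the function name `fd_holds` and the sample rows are hypothetical, chosen only to illustrate a Zipcode/City style dependency.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every pair of rows agreeing on `lhs` also agrees on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value;
        # a later, different rhs value violates the dependency.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Made-up sample data for illustration.
homes = [
    {"Zipcode": "98033", "City": "Kirkland"},
    {"Zipcode": "98033", "City": "Kirkland"},
    {"Zipcode": "98052", "City": "Redmond"},
    {"Zipcode": "98034", "City": "Kirkland"},
]

zip_determines_city = fd_holds(homes, ["Zipcode"], ["City"])
city_determines_zip = fd_holds(homes, ["City"], ["Zipcode"])
print(zip_determines_city, city_determines_zip)
```

In this sample the dependency holds in one direction only, because one city appears with two different zip codes.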

Presence of FDs (1/4) • FDs apply only to data, not to workload • E.g. Zipcode → City • Applicable in data • A query Q in the workload that specifies a requested Zipcode may not have specified City

Presence of FDs (2/4) • FD between attributes in , where,

Presence of FDs (3/4) • Similarly, FDs between attributes in X where,

Presence of FDs (4/4)

R is unknown • How to estimate 𝑝(𝑦|𝑅)?

Workload-Based Estimation (1/6) • Workload W: collection of queries executed in the past • e.g. select * from Homes where City=Kirkland and Price=High • W may reveal some patterns: • A large fraction of users who requested high-priced homes in Kirkland also requested waterfront views • Data does not indicate these user preferences

Workload-Based Estimation (2/6) • Approximate R as all query “tuples” in W that also request X
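One way to read this approximation is as a filter over the workload: keep the past queries that specify the same X values, then measure how often they also specify a given value y. The sketch below assumes each workload query is stored as a dictionary of attribute=value conditions; that representation, the helper names, and the sample workload are all hypothetical.

```python
# Toy workload W: each past query is a dict of the conditions it specified (made-up data).
W = [
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High"},
    {"City": "Seattle", "Price": "High", "View": "Waterfront"},
]

def requests_all(query, X):
    """True if the workload query specifies every attribute=value pair in X."""
    return all(query.get(attr) == val for attr, val in X.items())

def p_given_X(y, X, workload):
    """Estimate p(y | X, W): fraction of workload queries requesting X that also request y."""
    r_approx = [q for q in workload if requests_all(q, X)]
    if not r_approx:
        return 0.0  # no evidence in the workload; a real system would need a fallback
    attr, val = y
    return sum(1 for q in r_approx if q.get(attr) == val) / len(r_approx)

X = {"City": "Kirkland", "Price": "High"}
estimate = p_given_X(("View", "Waterfront"), X, W)
print(estimate)
```

Here two of the three past queries for high-priced Kirkland homes also asked for a waterfront view, which is the kind of pattern the data alone would not reveal.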

Workload-Based Estimation (3/6) • Replace p(y | R) with p(y | X, W): Score(t) ∝ ∏_{y∈Y} p(y | X, W) / p(y | D) · ∏_{y∈Y} ∏_{x∈X} 1 / p(x | y, D)

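Workload-Based Estimation (4/6) • Applying Bayes’ rule for p(y | X, W), p(y | X, W) = p(X | y, W) p(y | W) / p(X | W), where p(X | y, W) = ∏_{x∈X} p(x | y, W) under the limited independence assumption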

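Workload-Based Estimation (5/6) • Dropping constant p(X | W), Score(t) ∝ [∏_{y∈Y} p(y | W) / p(y | D)] · [∏_{y∈Y} ∏_{x∈X} p(x | y, W) / p(x | y, D)] • First factor: Global part • Second factor: Conditional part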
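To make the two parts concrete, the score can be sketched as a product of a global factor, ∏ p(y | W) / p(y | D) over the unspecified values y, and a conditional factor, ∏ p(x | y, W) / p(x | y, D) over pairs of unspecified y and specified x. The code below is an illustrative sketch, not the authors' implementation: the toy tables, the dictionary representation, the helper `p`, and the `eps` floor against division by zero are all assumptions.

```python
from math import prod

# Toy database D: each tuple maps attribute -> value (made-up data).
D = [
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "No"},
    {"City": "Seattle", "View": "Street", "BoatDock": "No"},
    {"City": "Kirkland", "View": "Street", "BoatDock": "No"},
]
# Toy workload W: each past query lists the conditions it specified (made-up data).
W = [
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "No"},
    {"City": "Kirkland", "BoatDock": "No"},
]

def p(value, rows, given=None, eps=1e-3):
    """Relative frequency of `value` in `rows`, optionally among rows containing `given`.
    `eps` is an ad hoc floor to avoid zero probabilities (an assumption of this sketch)."""
    if given is not None:
        rows = [r for r in rows if r.get(given[0]) == given[1]]
    if not rows:
        return eps
    hits = sum(1 for r in rows if r.get(value[0]) == value[1])
    return max(hits / len(rows), eps)

def score(t, query):
    """Global part times conditional part for tuple t under a point query."""
    X = list(query.items())                                   # specified values
    Y = [(a, v) for a, v in t.items() if a not in query]      # unspecified values
    global_part = prod(p(y, W) / p(y, D) for y in Y)
    conditional_part = prod(p(x, W, given=y) / p(x, D, given=y) for y in Y for x in X)
    return global_part * conditional_part

query = {"City": "Seattle", "View": "Waterfront"}
answers = [t for t in D if all(t[a] == v for a, v in query.items())]
ranked = sorted(answers, key=lambda t: score(t, query), reverse=True)
print([(t["BoatDock"], score(t, query)) for t in ranked])
```

With this toy data the waterfront home with a boat dock ranks first, since boat docks are requested in the workload more often than they occur in the data.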

Workload-Based Estimation (6/6) • Considering functional dependencies, Global Conditional

Calculating Atomic Probabilities (1/2) • Pre-compute atomic quantities p(x | W), p(x | D), p(x | y, W), and p(x | y, D) for all distinct values x and y in the database • p(x | W) and p(x | D): • Relative frequencies of each distinct value x in the workload and the database

Calculating Atomic Probabilities (2/2) • p(x | y, W) and p(x | y, D): • Compute confidences of pairwise association rules in the workload and the database • “Association rule X → Y has confidence c if c% of transactions in the database that contain X also contain Y”
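Both kinds of atomic quantity can be gathered in one pass over a table by counting single values and co-occurring pairs. The sketch below is a minimal illustration under the assumption that the workload (or the database) is a list of attribute=value dictionaries; `atomic_tables` and the sample rows are hypothetical names and data.

```python
from collections import Counter
from itertools import permutations

def atomic_tables(rows):
    """Pre-compute p(x) and p(x | y) over `rows` (a list of attribute -> value dicts)."""
    n = len(rows)
    single = Counter()   # occurrences of each (attribute, value)
    pair = Counter()     # co-occurrences of ordered pairs (x, y), x != y
    for row in rows:
        items = sorted(row.items())
        single.update(items)
        pair.update(permutations(items, 2))
    p_x = {x: c / n for x, c in single.items()}
    # Confidence of the rule y -> x: fraction of rows containing y that also contain x.
    p_x_given_y = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_x, p_x_given_y

# Made-up workload for illustration.
workload = [
    {"City": "Kirkland", "Price": "High", "View": "Waterfront"},
    {"City": "Kirkland", "Price": "High"},
    {"City": "Seattle"},
]
p_x, p_x_given_y = atomic_tables(workload)
print(p_x[("City", "Kirkland")])
print(p_x_given_y[(("View", "Waterfront"), ("Price", "High"))])
```

Running the same function over the database rows instead of the workload would give the corresponding p(x | D) and p(x | y, D) tables.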

Special Cases • Ranking function in absence of workload • Ranking function assuming no dependencies between attributes

Ranking Function in Absence of Workload • Assume that: • p(x | W) is the same for all distinct values x • and p(x | y, W) is the same for all pairs of distinct values x and y • The workload terms then become constants and the score depends on the data alone: Score(t) ∝ ∏_{y∈Y} 1 / p(y | D) · ∏_{y∈Y} ∏_{x∈X} 1 / p(x | y, D)

Ranking Function Assuming No Dependencies Between Attributes • Make independence assumption between all attributes • Then p(x | y, W) = p(x | W) and p(x | y, D) = p(x | D), which are constant for a point query, so only the Global part remains: Score(t) ∝ ∏_{y∈Y} p(y | W) / p(y | D)
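Under full independence the score reduces to a product, over the unspecified attribute values y of a tuple, of the ratio between how often y is requested in the workload and how often it occurs in the data. A minimal sketch of that reduced score follows; the function names and the toy tables are hypothetical.

```python
from math import prod

def rel_freq(rows, attr, value):
    """Relative frequency of attr=value among `rows`."""
    return sum(1 for r in rows if r.get(attr) == value) / len(rows)

def global_score(t, specified, workload, database):
    """Product of p(y | W) / p(y | D) over the unspecified attribute values of t."""
    return prod(
        rel_freq(workload, a, v) / rel_freq(database, a, v)
        for a, v in t.items()
        if a not in specified
    )

# Made-up data: excellent school districts are rare in the data but often requested.
database = [
    {"City": "Seattle", "SchoolDistrict": "Excellent"},
    {"City": "Seattle", "SchoolDistrict": "Average"},
    {"City": "Seattle", "SchoolDistrict": "Average"},
    {"City": "Seattle", "SchoolDistrict": "Average"},
]
workload = [
    {"SchoolDistrict": "Excellent"},
    {"SchoolDistrict": "Excellent"},
    {"City": "Seattle"},
    {"SchoolDistrict": "Average"},
]
scores = [global_score(t, {"City"}, workload, database) for t in database]
print(scores)
```

Since every scored tuple comes from the database, p(y | D) is never zero here, so no smoothing is needed in this reduced form.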

Generalizations • IN Queries • IN conditions in the Query • IN conditions in the Workload • Numeric Attributes • Estimating and • Multi-table Database

IN Queries • Generalization of point queries • e.g. select * from D where City IN (Kirkland, Redmond) and Price IN (High, Moderate) • Challenge: two tuples that satisfy the query condition may differ in their specific X values

IN Conditions in the Query (1/4) • Recall equation, Score(t) ∝ p(t | R) / p(t | D) = [p(X | R) / p(X | D)] · [p(Y | X, R) / p(Y | X, D)]

IN Conditions in the Query (2/4) • Score function with IN conditions in the queries

IN Conditions in the Query (3/4) • Score function for point queries

IN Conditions in the Query (4/4) • An extra factor needs to be multiplied into the point-query score • The equivalent equation again separates into a Conditional part and a Global part

IN Conditions in the Workload • Conceptually expand the workload • Split each IN query into point queries • Query: City IN (Bellevue, Redmond, Carnation) AND Price IN (High, Moderate) • Split into 3×2=6 point queries
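The conceptual expansion is a Cartesian product over the IN lists. As a minimal sketch, assuming an IN query is represented as a dictionary from attribute to its list of allowed values (a hypothetical representation):

```python
from itertools import product

def split_in_query(in_query):
    """Expand an IN query into the point queries it conceptually stands for."""
    attrs = list(in_query)
    return [
        dict(zip(attrs, combo))
        for combo in product(*(in_query[a] for a in attrs))
    ]

in_query = {
    "City": ["Bellevue", "Redmond", "Carnation"],
    "Price": ["High", "Moderate"],
}
point_queries = split_in_query(in_query)
print(len(point_queries))
print(point_queries[0])
```

Each resulting dictionary can then be counted as one point query when the workload statistics are computed.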

Numeric Attributes (1/2) • Query: Age BETWEEN (5, 10) AND Sqft BETWEEN (2500, 3000) • Simple approach: • Treat numerical value as categorical value • Convert queries with range conditions to queries with IN conditions • Problem: • Many distinct values are not adequately represented in the workload
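The simple approach can be sketched as follows: treat each distinct numeric value in the data as a category and rewrite the range condition as an IN list of the values that fall inside it. The function name `range_to_in` and the sample rows are hypothetical, and the range is taken as inclusive on both ends, as SQL BETWEEN is.

```python
def range_to_in(rows, attr, low, high):
    """Distinct values of `attr` in `rows` that lie in [low, high], as a sorted IN list."""
    return sorted({r[attr] for r in rows if low <= r[attr] <= high})

# Made-up data for illustration.
homes = [
    {"Sqft": 2400},
    {"Sqft": 2500},
    {"Sqft": 2750},
    {"Sqft": 2750},
    {"Sqft": 3100},
]
in_values = range_to_in(homes, "Sqft", 2500, 3000)
print(in_values)
```

The weakness noted on the slide shows up directly: a value such as 2750 is useful for ranking only if past queries happened to mention exactly that value, which is unlikely for most distinct numbers.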
