Probabilistic Ranking of Database Query Results

Probabilistic Ranking of Database Query Results • Surajit Chaudhuri, Microsoft Research • Gautam Das, Microsoft Research • Vagelis Hristidis, Florida International University • Gerhard Weikum, MPI Informatik • Presented by: Ranjanalankarraju Sindhusatyanarayana

AGENDA • Introduction & Motivation • Problem Definition & Architecture • Definition of Ranking Function • Implementation • Experiments • Conclusions & Limitations


Introduction and Motivation REALTOR_DB

PROBLEM DEFINITION: MANY ANSWERS • Query: SELECT * FROM REALTOR_DB WHERE CITY=‘SEATTLE’; • Result of this query: too many answers

PROPOSED SOLUTIONS • QUERY REFORMULATION TECHNIQUES: -BY PROMPTING THE USER • AUTOMATIC RANKING: -USING GLOBAL AND CONDITIONAL SCORE

DEFINITIONS AND SYMBOLS • Specified attributes (denoted ‘X’): the attributes the query constrains, here City • Unspecified attributes (denoted ‘Y’): the remaining attributes, here View, Price, SchoolDistrict, BoatDock

PROPOSED RANKING FUNCTION • Global Score: global importance of unspecified attribute values, e.g. VIEW=‘WATERFRONT’ • Conditional Score: correlations between specified and unspecified attribute values, e.g. if the user asks for CITY=‘SEATTLE’ and VIEW=‘WATERFRONT’, will BOATDOCK=‘YES’ also interest them?

ARCHITECTURE

RANKING FUNCTIONS: Rules & Theorems for PIR (Probabilistic Information Retrieval) • Bayes’ Rule: p(a|b) = [p(b|a) p(a)] / p(b) • Product Rule: p(a,b|c) = p(a|c) p(b|a,c)

BAYES’ THEOREM EXAMPLE • 1% of the population has disease X. A screening test detects the disease in 90% of the people who have it. The test also indicates the disease in 15% of the people who do not have it (the false positives). Suppose a person screened for the disease tests positive. What is the probability that they have it?

BAYES’ THEOREM Cont… • Interpretation and assumption: D = event that the person has the disease; T = event that the test is positive • Given: p(D) = 1%, p(T|D) = 90%, p(T|D’) = 15% • Find: p(D|T)

Tree Structure Interpretation: Four Cases • 1. (D ∩ T): has disease and test +ve • 2. (D ∩ T’): has disease and test –ve • 3. (D’ ∩ T): no disease and test +ve • 4. (D’ ∩ T’): no disease and test –ve • [Figure: probability tree branching first on D / D’, then on T / T’]
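The screening example can be worked through numerically. The sketch below applies Bayes' rule, with p(T) expanded over the two branches of the tree (D and D’); the function and parameter names are illustrative and not from the slides.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return p(D|T) from p(D), p(T|D) and p(T|D')."""
    # Total probability of a positive test: cases 1 and 3 of the tree.
    p_t = sensitivity * prior + false_positive_rate * (1.0 - prior)
    # Bayes' rule: p(D|T) = p(T|D) p(D) / p(T)
    return sensitivity * prior / p_t


p_d_given_t = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.15)
print(f"p(D|T) = {p_d_given_t:.4f}")  # prints p(D|T) = 0.0571
```

Even after a positive test, the probability of disease is only about 5.7%, because the false positives from the healthy 99% of the population outnumber the true positives.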

Rules & Theorems for PIR cont… • t: tuple (document) • R: set of relevant documents • R̄: set of irrelevant documents (the complement of R)

Adaptation of PIR • Partition tuple ‘t’ into two parts t(X) and t(Y) • Replacing t with ‘X’ & ‘Y’

Adaptation of PIR cont… • QUERY SPECIFIED BY USER: Select * From Realtor_db where City=‘Seattle’ and Price=‘High’; • FINAL RANKING: • Waterfront Views • Greenbelt Views • Street Views

Limited Independence Assumption • X (and Y) values within themselves are assumed to be independent. • Dependencies between the X and Y values are allowed

Eliminating R Incoming Query: Select * from Realtor_db where City=‘Seattle’;

Workload-Based Estimation FINAL RANKING FORMULA Where: • p(y|W) = relative frequency of unspecified attribute value ‘y’ in the workload ‘W’ • p(y|D) = relative frequency of unspecified attribute value ‘y’ in the database ‘D’ • p(x|y,W) = relative frequency of x among the workload queries that also contain y • p(x|y,D) = relative frequency of x among the database tuples that contain y
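The formula itself did not survive on this slide. Assuming the product-of-ratios form given in the original paper, Score(t) ∝ ∏_{y∈Y} p(y|W)/p(y|D) · ∏_{y∈Y} ∏_{x∈X} p(x|y,W)/p(x|y,D), the four quantities defined above could be combined roughly as follows. The dictionary layout, the function name, and all probability values are hypothetical and chosen only for illustration.

```python
from math import prod


def score(x_values, y_values, p_y_w, p_y_d, p_xy_w, p_xy_d):
    """Score of one tuple: global part times conditional part (assumed form)."""
    # Global part: product over unspecified values y of p(y|W) / p(y|D).
    global_part = prod(p_y_w[y] / p_y_d[y] for y in y_values)
    # Conditional part: product over all (x, y) pairs of p(x|y,W) / p(x|y,D).
    conditional_part = prod(
        p_xy_w[(x, y)] / p_xy_d[(x, y)] for y in y_values for x in x_values
    )
    return global_part * conditional_part


# Made-up atomic probabilities for the query City='Seattle'.
p_y_w = {"View=Waterfront": 0.4, "View=Street": 0.1}
p_y_d = {"View=Waterfront": 0.1, "View=Street": 0.5}
p_xy_w = {("City=Seattle", "View=Waterfront"): 0.5,
          ("City=Seattle", "View=Street"): 0.2}
p_xy_d = {("City=Seattle", "View=Waterfront"): 0.25,
          ("City=Seattle", "View=Street"): 0.4}

x = ["City=Seattle"]
score_a = score(x, ["View=Waterfront"], p_y_w, p_y_d, p_xy_w, p_xy_d)
score_b = score(x, ["View=Street"], p_y_w, p_y_d, p_xy_w, p_xy_d)
print(score_a > score_b)  # the waterfront tuple ranks above the street-view tuple
```

A value that is requested often in the workload but is rare in the data gets a ratio above 1, which is what pushes the waterfront tuple up in this toy setup.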

Detailed Process

IMPLEMENTATION • Preprocessing: 1. Compute the atomic probabilities p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values of x and y. 2. Store these atomic probabilities as database tables in the intermediate knowledge representation layer, with appropriate indexes. 3. Run the index module, which produces the conditional and global list tables.
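Step 1 of the preprocessing amounts to counting. The sketch below estimates the atomic probabilities as plain relative frequencies, assuming that both the database tuples and the workload queries can be represented as attribute-to-value dictionaries; that representation, the helper names, and the toy data are assumptions, and smoothing of zero counts is not addressed.

```python
def p_value(records, attr, value):
    """Relative frequency of attr=value among the records."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(attr) == value) / len(records)


def p_cond(records, x, y):
    """Relative frequency of x among the records that contain y.

    x and y are (attribute, value) pairs.
    """
    with_y = [r for r in records if r.get(y[0]) == y[1]]
    if not with_y:
        return 0.0
    return sum(1 for r in with_y if r.get(x[0]) == x[1]) / len(with_y)


# Toy database D: one dictionary per tuple.
database = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "View": "Street"},
    {"City": "Bellevue", "View": "Street"},
    {"City": "Bellevue", "View": "Greenbelt"},
]
# Toy workload W: one dictionary per past query, holding its specified values.
workload = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Bellevue", "View": "Street"},
    {"View": "Waterfront"},
]

print(p_value(workload, "View", "Waterfront"))   # p(y|W)
print(p_value(database, "View", "Waterfront"))   # p(y|D)
print(p_cond(workload, ("City", "Seattle"), ("View", "Waterfront")))  # p(x|y,W)
print(p_cond(database, ("City", "Seattle"), ("View", "Waterfront")))  # p(x|y,D)
```

The same two helpers serve both D and W because each estimate is just a relative frequency over a different collection of records.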

IMPLEMENTATION cont… • CONDITIONAL LISTS Cx: contain <TID, CondScore> pairs in descending order of CondScore • GLOBAL LISTS Gx: contain <TID, GlobScore> pairs in descending order of GlobScore
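As a minimal sketch of the two list structures, the code below takes per-tuple scores that are assumed to have been computed already for one attribute value x and produces Cx and Gx as lists of pairs sorted in descending score order. The input layout, the function name, and the score values are hypothetical.

```python
def build_lists(scores_for_x):
    """Build (Cx, Gx) from a mapping TID -> (CondScore, GlobScore)."""
    # Cx: <TID, CondScore> pairs, highest conditional score first.
    c_x = sorted(
        ((tid, cond) for tid, (cond, _glob) in scores_for_x.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Gx: <TID, GlobScore> pairs, highest global score first.
    g_x = sorted(
        ((tid, glob) for tid, (_cond, glob) in scores_for_x.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return c_x, g_x


# Made-up scores for three tuples that contain the value x.
scores_for_x = {1: (0.5, 2.0), 2: (3.0, 0.4), 3: (1.5, 1.0)}
c_x, g_x = build_lists(scores_for_x)
print(c_x)  # [(2, 3.0), (3, 1.5), (1, 0.5)]
print(g_x)  # [(1, 2.0), (3, 1.0), (2, 0.4)]
```

Keeping both lists sorted by score is what allows a query processor to read the most promising tuples first instead of scanning every match.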

IMPLEMENTATION cont…

Conditional and Global Scores

Conditional and Global List tables

IMPLEMENTATION cont… • Query Processing Component.

List Merge Algorithm contd...

EXPERIMENTS • Datasets: • MSN HomeAdvisor database • Internet Movie Database (IMDB)

Quality Experiments • Examples of Ranking Results: Query: select * from SeattleHomes where City=‘Seattle’ and Bedroom=1; • The Conditional ranking placed condos with garages highest • The Global ranking failed to recognize the importance of the unspecified attribute value Garage=‘Y’

Quality Experiments • User Preference of Rankings: • Users given top 5 results of rankings for 5 queries • Ranking preferred by users indicated below:

CONCLUSION & LIMITATION CONCLUSION: Automated approach leverages data and workload statistics and correlations. LIMITATION: Existence of correlations between text and non-text data.
