310 likes | 321 Views
Explore a probabilistic ranking model for database queries, addressing the many-answers problem with query reformulation and automatic ranking. Learn about the architecture, ranking functions, experiments, and conclusions.
E N D
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik Presented by: Kiran Karnam
Agenda • Introduction & Motivation • Problem Definition • Architecture • Ranking Function • Implementation • Experiments • Conclusions & Limitations
Introduction and Motivation • Many-answers problem • Two alternative solutions: Query reformulation Automatic ranking • Apply probabilistic model in IR to DB tuple ranking
Example • Many answers problem SELECT * FROM REALTOR_DB WHERE CITY=‘SEATTLE’ ;
Proposed Solutions • Query reformulation • Automatic ranking
Problem Defination • Specified Attributes city • Unspecified Attributes View School District Boat Dock
Global Score: Global score which captures the global importance of unspecified attribute values. Eg: VIEW=‘WATERFRONT’ • Conditional Score: which captures the strengths of dependencies (or correlations) between specified and unspecified attribute values. Eg: If CITY=‘SEATTLE’ and VIEW=‘WATERFRONT’
Ranking Function • Important Rules and Theorem required • Bayes’ Rule: p(a/b) = [ p(b/a) p(a) ] / [p(b)] • Product Rule: p(a,b/c) = p(a/c) * p(b/a,c)
Bayes Rule • Bayes theorem shows the relation between two conditional probabilities which are the reverse of each other • The probability of an event A given an event B depends not only on the relationship between events A and B but on the marginal probability (or "simple probability") of occurrence of each event
PIR • Document (Tuple) t, Query QR: Relevant DocumentsR = D - R: Irrelevant Documents
Adaption of PIR to DB • Tuple t is considered as a document • Partition t into t(X) and t(Y) • t(X) and t(Y) are written as X and Y • Derive from initial scoring function until final ranking function is obtained
Limited Conditional Independence • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
Estimation using Workload • If Many Queries Specify Set X of Conditions then there is Preference Correlation between Attributes in X. • Global: E.g., If Many Queries ask for Waterfront then p(Waterfront=TRUE) is high. • Conditional: E.g., If Many Queries ask for 4-Bedroom Houses in Good School Districts, then p(Bedrooms=4 | SchoolDistrict=`good’), p(SchoolDistrict=`good’ | Bedrooms=4) are high.
Cont… • Final Ranking Formula is Where: p(y|W) = Relative frequency of unspecified attribute ‘y’ given workload ‘W’ p(y|D)= Relative frequency of unspecified attribute ‘y’ given data base ‘D’ p(x|y,W)=Frequency of correlation between x and y in W P(x|y,D)=Frequency of correlation between x and y in D
Different Stages • Pre processing • Atomic probability module • Index module • Intermediate Knowledge Reference layer • Query processing • Scan algorithm • List merge algorithm
Preprocessing • Computation of modules: p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values of x and y. • Storing these atomic probabilities as database tables in intermediate knowledge representation layer with appropriate indexes. • Computation of index module resulting in conditional and global lists table.
Cont… • CONDITIONAL LISTS Cx: Contains <TID, CondScore> in descending order • GLOBAL LISTS Gx: Contains <TID,GlobScore> in descending order
Scan Algorithm • Select Tuples that Satisfy the Query • Scan and Compute Score for Each Result-Tuple • Return Top-KTuples Scan algorithm is Inefficient Many tuples in the answer set • Another approach Pre-compute top-K tuples for all possible queries Still infeasible in practice • Trade-off solution Pre-compute ranked lists of tuples for all possible atomic queries At query time, merge ranked lists to get top-K tuples
Experiment • Databases Used • MSN Home Advisor database (http://houseandhome.msn.com/) • Internet Movie Database • Software and Hardware: • Microsoft SQL Server2000 RDBMS • P4 2.8-GHz PC, 1 GB RAM • C#, Connected to RDBMS through DAO
Type of Experiments • Quality Experiments • Performance Experiments
Quality Experiments Query: select * from SeattleHomes where City=‘Seattle’ and Bedroom=1; • Conditional ranked condos with garages the highest • Global failed to recognize importance of the unspecified attribute Garage=‘Y’
Cont… • User preference of rankings • 5 new queries • Users were given the top-5 results
Performance Experiments • Compare 2 algorithms • Scan algorithm • List Merge algorithm
Conclusion • Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations LIMITATION: Existence of correlations between text and non-text data. Future Work • Empty-Answer Problem • Handle Plain Text Attributes
References • Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum, Probabilistic Ranking of Database Query Results, VLDB 2004. • users.cs.fiu.edu/~vagelis/presentations/ProbRanking.ppt • http://crystal.uta.edu/~cse6339/Fall08DBIR.htm • http://crystal.uta.edu/~cse6339/Fall09DBIR.htm