Probabilistic Ranking of Database Query Results. Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik. Presented by Raghunath Ravi and Sivaramakrishnan Subramani, CSE@UTA.
Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems
Motivation • Many-answers problem • Two alternative solutions: query reformulation and automatic ranking • Apply probabilistic models from IR to DB tuple ranking
Example – Realtor Database • House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year • Query: City = 'Seattle' AND Waterfront = TRUE • Too Many Results! • Intuitively, Houses with lower Price, more Bedrooms, or a BoatDock are generally preferable
Rank According to Unspecified Attributes Score of a Result Tuple t depends on • Global Score: Global Importance of Unspecified Attribute Values [CIDR2003] • E.g., Newer Houses are generally preferred • Conditional Score: Correlations between Specified and Unspecified Attribute Values • E.g., Waterfront → BoatDock; Many Bedrooms → Good School District
Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems
Key Problems • Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function. Use Probabilistic Information Retrieval (PIR). • How to Calculate the Global and Conditional Scores. Use Query Workload and Data.
Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems
PIR Review • Bayes’ Rule • Product Rule • Document (Tuple) t, Query Q • R: Relevant Documents • R̄ = D − R: Irrelevant Documents
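The rules named on this slide can be written out explicitly. This is the standard PIR formulation (the notation below is assumed from context, not copied from the slides): documents, here tuples, are ranked by their relevance odds.

```latex
\text{Bayes' rule:}\quad P(R \mid t) \;=\; \frac{P(t \mid R)\,P(R)}{P(t)}
\qquad
\text{Product rule:}\quad P(a, b \mid c) \;=\; P(a \mid c)\,P(b \mid a, c)
\qquad
\text{Rank by:}\quad \mathit{Score}(t) \;=\; \frac{P(R \mid t)}{P(\bar{R} \mid t)}
```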
Adaptation of PIR to DB • Tuple t is considered as a document • Partition t into t(X), the specified attributes, and t(Y), the unspecified attributes • For brevity, t(X) and t(Y) are written as X and Y • Derive from an initial scoring function until the final ranking function is obtained
Limited Independence Assumptions • Given a query Q and a tuple t, the X values are assumed independent of one another (and likewise the Y values), though dependencies between the X and Y values are allowed
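Under these limited independence assumptions, the derivation yields a ranking function built from four atomic probabilities over the workload W and data D. A hedged reconstruction of the final form (it should be checked against the paper; the symbols X, Y, W, D are as used on the surrounding slides):

```latex
\mathit{Score}(t)\;\propto\;
\prod_{y \in Y}\frac{P(y \mid W)}{P(y \mid D)}
\;\times\;
\prod_{y \in Y}\,\prod_{x \in X}\frac{P(x \mid y, W)}{P(x \mid y, D)}
```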
Pre-computing Atomic Probabilities in Ranking Function • P(y | W): relative frequency of y in W (use workload) • P(y | D): relative frequency of y in D (use data) • P(x | y, W): (# of tuples in W that contain both x and y) / (total # of tuples in W) • P(x | y, D): (# of tuples in D that contain both x and y) / (total # of tuples in D)
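A minimal sketch of computing these frequencies, assuming each database tuple or workload entry is represented as a set of (attribute, value) pairs. The data and function names are illustrative, not from the paper's implementation:

```python
# Toy relation: each row is a set of (attribute, value) pairs,
# following the realtor example.
D = [
    {("City", "Seattle"), ("Waterfront", True), ("BoatDock", True)},
    {("City", "Seattle"), ("Waterfront", True), ("BoatDock", False)},
    {("City", "Austin"), ("Waterfront", False), ("BoatDock", False)},
    {("City", "Seattle"), ("Waterfront", False), ("BoatDock", False)},
]

def p(v, rows):
    """Relative frequency of value v among rows: estimates P(v | rows)."""
    return sum(v in row for row in rows) / len(rows)

def p_joint(x, y, rows):
    """Fraction of rows containing both x and y, as stated on the slide."""
    return sum(x in row and y in row for row in rows) / len(rows)

boatdock = ("BoatDock", True)
waterfront = ("Waterfront", True)
print(p(boatdock, D))                    # 0.25
print(p_joint(waterfront, boatdock, D))  # 0.25
```

The same two functions serve for the workload W by passing the workload's rows instead of D's.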
Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems
Scan Algorithm Preprocessing - Atomic Probabilities Module • Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y Execution • Select Tuples that Satisfy the Query • Scan and Compute Score for Each Result-Tuple • Return Top-K Tuples
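The execution phase above can be sketched in a few lines. This assumes scores are already available per tuple (in the real system they come from the pre-computed atomic probabilities); the data layout and predicate are illustrative:

```python
import heapq

# Toy relation; attribute names follow the realtor example.
HOUSES = [
    {"tid": 1, "City": "Seattle", "Waterfront": True,  "score": 0.9},
    {"tid": 2, "City": "Seattle", "Waterfront": True,  "score": 0.4},
    {"tid": 3, "City": "Austin",  "Waterfront": False, "score": 0.8},
    {"tid": 4, "City": "Seattle", "Waterfront": False, "score": 0.7},
]

def scan_topk(tuples, satisfies, score, k):
    """Scan algorithm: select tuples matching the query predicate,
    score each result tuple, and return the top-k."""
    return heapq.nlargest(k, (t for t in tuples if satisfies(t)), key=score)

query = lambda t: t["City"] == "Seattle" and t["Waterfront"]
top = scan_topk(HOUSES, query, lambda t: t["score"], k=1)
print([t["tid"] for t in top])  # [1]
```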
Beyond Scan Algorithm • Scan algorithm is inefficient: many tuples in the answer set • Another extreme: pre-compute top-K tuples for all possible queries; still infeasible in practice • Trade-off solution: pre-compute ranked lists of tuples for all possible atomic queries; at query time, merge the ranked lists to get the top-K tuples
Output from Index Module • CondList Cx {AttName, AttVal, TID, CondScore}; B+ tree index on (AttName, AttVal, CondScore) • GlobList Gx {AttName, AttVal, TID, GlobScore}; B+ tree index on (AttName, AttVal, GlobScore)
Preprocessing Component Preprocessing • For Each Distinct Value x of the Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows • For Each Tuple t Containing x, Calculate the Conditional and Global Scores and Add Them to Cx and Gx Respectively • Sort Cx, Gx by Decreasing Scores Execution Query Q: X1=x1 AND … AND Xs=xs • Execute Threshold Algorithm [Fag01] on the following lists: Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs
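The merge step at execution time uses the Threshold Algorithm [Fag01]. A simplified sketch, assuming an additive aggregate over nonnegative partial scores and that each list contains every tuple with a nonzero partial there (the paper's aggregate and list details may differ):

```python
import heapq

def threshold_algorithm(lists, full_score, k):
    """Threshold Algorithm sketch.
    lists: ranked lists of (tid, partial_score), sorted descending.
    full_score(tid): random access returning the tuple's total score."""
    top = []       # min-heap of (score, tid) holding the current top-k
    seen = set()
    depth = 0
    while True:
        frontier = []  # partial scores at this depth, one per live list
        for lst in lists:
            if depth < len(lst):
                tid, partial = lst[depth]
                frontier.append(partial)
                if tid not in seen:
                    seen.add(tid)
                    heapq.heappush(top, (full_score(tid), tid))
                    if len(top) > k:
                        heapq.heappop(top)
        if not frontier:
            break  # all lists exhausted
        # No unseen tuple can score above the sum of the frontier partials,
        # so stop once the k-th best seen score reaches that threshold.
        if len(top) == k and top[0][0] >= sum(frontier):
            break
        depth += 1
    return sorted(top, reverse=True)

# Toy ranked lists standing in for the Cx / Gx lists.
L1 = [("a", 5.0), ("b", 3.0), ("c", 1.0)]
L2 = [("b", 4.0), ("a", 2.0), ("c", 2.0)]
totals = {"a": 7.0, "b": 7.0, "c": 3.0}
print(threshold_algorithm([L1, L2], totals.get, k=2))
```

Note that tuple "c" is never fetched by random access: the threshold check stops the merge after two rounds of sorted access.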
Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems
Experimental Setup • Datasets: • MSR HomeAdvisor Seattle (http://houseandhome.msn.com/) • Internet Movie Database (http://www.imdb.com) • Software and Hardware: • Microsoft SQL Server 2000 RDBMS • P4 2.8-GHz PC, 1 GB RAM • C#, connected to the RDBMS through DAO
Quality Experiments • Conducted on Seattle Homes and Movies tables • Collect a workload from users • Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]
Quality Experiment – Average Precision • For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples • Let each user mark 10 tuples in Hi as most relevant to Qi • Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
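The match measure described here amounts to the overlap between the user's 10 marked tuples and an algorithm's top-10 (precision at 10). A minimal sketch with an illustrative function name and made-up tuple IDs:

```python
def top_k_overlap(user_picks, algo_topk):
    """Fraction of the user's relevant tuples that also appear in the
    algorithm's top-k (precision at k when both lists have size k)."""
    return len(set(user_picks) & set(algo_topk)) / len(user_picks)

user = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
algo = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(top_k_overlap(user, algo))  # 0.5
```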
Quality Experiment- Fraction of Users Preferring Each Algorithm • 5 new queries • Users were given the top-5 results
Performance Experiments • Datasets • Compare 2 Algorithms: • Scan algorithm • List Merge algorithm
Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems
Conclusions – Future Work Conclusions • Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations • Based on PIR Drawbacks • Multiple-table queries • Non-categorical attributes Future Work • Empty-Answer Problem • Handle Plain Text Attributes