1 / 31

Probabilistic Ranking of Database Query Results

Explore a probabilistic ranking model for database queries, addressing the many-answers problem with query reformulation and automatic ranking. Learn about the architecture, ranking functions, experiments, and conclusions.

eufemia
Download Presentation

Probabilistic Ranking of Database Query Results

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik Presented by: Kiran Karnam

  2. Agenda • Introduction & Motivation • Problem Definition • Architecture • Ranking Function • Implementation • Experiments • Conclusions & Limitations

  3. Introduction and Motivation • Many-answers problem • Two alternative solutions: Query reformulation Automatic ranking • Apply probabilistic model in IR to DB tuple ranking

  4. Example • Many answers problem SELECT * FROM REALTOR_DB WHERE CITY=‘SEATTLE’ ;

  5. Proposed Solutions • Query reformulation • Automatic ranking

  6. Problem Defination • Specified Attributes city • Unspecified Attributes View School District Boat Dock

  7. Global Score: Global score which captures the global importance of unspecified attribute values. Eg: VIEW=‘WATERFRONT’ • Conditional Score: which captures the strengths of dependencies (or correlations) between specified and unspecified attribute values. Eg: If CITY=‘SEATTLE’ and VIEW=‘WATERFRONT’

  8. Architecture

  9. Ranking Function • Important Rules and Theorem required • Bayes’ Rule: p(a/b) = [ p(b/a) p(a) ] / [p(b)] • Product Rule: p(a,b/c) = p(a/c) * p(b/a,c)

  10. Bayes Rule • Bayes theorem shows the relation between two conditional probabilities which are the reverse of each other • The probability of an event A given an event B depends not only on the relationship between events A and B but on the marginal probability (or "simple probability") of occurrence of each event

  11. PIR • Document (Tuple) t, Query QR: Relevant DocumentsR = D - R: Irrelevant Documents

  12. Adaption of PIR to DB • Tuple t is considered as a document • Partition t into t(X) and t(Y) • t(X) and t(Y) are written as X and Y • Derive from initial scoring function until final ranking function is obtained

  13. Cont….

  14. Limited Conditional Independence • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

  15. Estimation using Workload • If Many Queries Specify Set X of Conditions then there is Preference Correlation between Attributes in X. • Global: E.g., If Many Queries ask for Waterfront then p(Waterfront=TRUE) is high. • Conditional: E.g., If Many Queries ask for 4-Bedroom Houses in Good School Districts, then p(Bedrooms=4 | SchoolDistrict=`good’), p(SchoolDistrict=`good’ | Bedrooms=4) are high.

  16. Cont… • Final Ranking Formula is Where: p(y|W) = Relative frequency of unspecified attribute ‘y’ given workload ‘W’ p(y|D)= Relative frequency of unspecified attribute ‘y’ given data base ‘D’ p(x|y,W)=Frequency of correlation between x and y in W P(x|y,D)=Frequency of correlation between x and y in D

  17. Implementation

  18. Different Stages • Pre processing • Atomic probability module • Index module • Intermediate Knowledge Reference layer • Query processing • Scan algorithm • List merge algorithm

  19. Preprocessing • Computation of modules: p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values of x and y. • Storing these atomic probabilities as database tables in intermediate knowledge representation layer with appropriate indexes. • Computation of index module resulting in conditional and global lists table.

  20. Cont… • CONDITIONAL LISTS Cx: Contains <TID, CondScore> in descending order • GLOBAL LISTS Gx: Contains <TID,GlobScore> in descending order

  21. Scan Algorithm • Select Tuples that Satisfy the Query • Scan and Compute Score for Each Result-Tuple • Return Top-KTuples Scan algorithm is Inefficient Many tuples in the answer set • Another approach Pre-compute top-K tuples for all possible queries Still infeasible in practice • Trade-off solution Pre-compute ranked lists of tuples for all possible atomic queries At query time, merge ranked lists to get top-K tuples

  22. Index Module

  23. Query Processing Component

  24. Experiment • Databases Used • MSN Home Advisor database (http://houseandhome.msn.com/) • Internet Movie Database • Software and Hardware: • Microsoft SQL Server2000 RDBMS • P4 2.8-GHz PC, 1 GB RAM • C#, Connected to RDBMS through DAO

  25. Type of Experiments • Quality Experiments • Performance Experiments

  26. Quality Experiments Query: select * from SeattleHomes where City=‘Seattle’ and Bedroom=1; • Conditional ranked condos with garages the highest • Global failed to recognize importance of the unspecified attribute Garage=‘Y’

  27. Cont… • User preference of rankings • 5 new queries • Users were given the top-5 results

  28. Performance Experiments • Compare 2 algorithms • Scan algorithm • List Merge algorithm

  29. Execution time of performance algorithms

  30. Conclusion • Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations LIMITATION: Existence of correlations between text and non-text data. Future Work • Empty-Answer Problem • Handle Plain Text Attributes

  31. References • Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum, Probabilistic Ranking of Database Query Results, VLDB 2004. • users.cs.fiu.edu/~vagelis/presentations/ProbRanking.ppt • http://crystal.uta.edu/~cse6339/Fall08DBIR.htm • http://crystal.uta.edu/~cse6339/Fall09DBIR.htm

More Related