Database Searching and Information Retrieval
Presented by: Tushar Kumar.J, Ritesh Bagga
Background
• Motivation: We chose this topic because we wanted to deepen our knowledge of databases and because it supports our research work.
• Focus: Our focus is on the algorithms used to retrieve the top few results from a database. This has recently been one of the most exciting fields in database research.
Introduction to Problem
• Most often we query a single database.
• At times we need to query multiple databases holding heterogeneous data.
• It is difficult for a user to write a single SQL query that works across all of the databases.
• Solution: develop a middleware system that works on top of these subsystems.
• The middleware divides the query into subqueries and runs each one on the corresponding subsystem.
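Introduction to Problem
[Figure: the user query (Color = “Red”) AND (Shape = “Circle”) goes to the middleware system, on which the algorithms studied here run. The middleware sends Color = “Red” and Shape = “Circle” to the “Redness” and “Circle” subsystems, and each subsystem returns a graded list of the objects R1 to R4. Grades shown: R3 (1.00), R3 (0.70), R1 (1.00), R2 (0.50), R2 (0.00), R4 (0.40), R1 (0.10), R4 (0.00). The aggregation function MIN combines the two grades of each object to produce the result.]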
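The aggregation step can be sketched in a few lines of Python. This is only an illustration, assuming each subsystem returns a dictionary of grades; the object names, the grade values, and the helper names `aggregate_min` and `top_k` are hypothetical and are not taken from the figure.

```python
# Hypothetical grades returned by two subsystems (not the figure's values).
redness = {"A": 0.9, "B": 0.4, "C": 0.7}
roundness = {"A": 0.3, "B": 0.8, "C": 0.6}


def aggregate_min(*graded_lists):
    """Combine per-subsystem grades with MIN, as for a fuzzy AND query."""
    # Only objects graded by every subsystem can be combined.
    objects = set.intersection(*(set(g) for g in graded_lists))
    return {obj: min(g[obj] for g in graded_lists) for obj in objects}


def top_k(scores, k):
    """Return the k best (object, grade) pairs, ties broken by name."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


overall = aggregate_min(redness, roundness)
print(top_k(overall, 2))
```

This naive version reads every grade from every subsystem. The algorithms on the following slides aim to find the same top k objects while reading far fewer entries.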
Framework of this presentation
• Basic algorithms
• Comparative study of basic algorithms
• Modifications of the TA algorithm
• Advanced algorithms
• Related work
• How web-search engines rank web pages
• Conclusion
Basic algorithms: Fagin’s Algorithm
• The original algorithm for solving this problem was developed by Ron Fagin and is known as Fagin’s Algorithm (FA).
• FA consists of the following steps:
- Perform sorted access in parallel to each of the ‘m’ lists.
- For every new object R seen, perform random access to every other list to find its i-th field x_i.
- Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to compute the overall grade of every object, and store it in set ‘Y’.
- Define set ‘H’ as the objects that have been seen in all the lists.
- Stopping point: set ‘H’ contains at least k objects.
- Sort set ‘Y’ and output the top k objects.
Basic algorithms: Fagin’s Algorithm
Worked example on four lists. Graded entries shown in the figure, row by row:
Row 1: R5(1.00) R8(0.95) R10(1.00) R3(0.95)
Row 2: R7(0.95) R2(0.90) R10(0.80) R3(0.95)
Row 3: R4(0.70) R7(0.85) R5(0.85) R8(0.90)
Row 4: R8(0.80) R3(0.80) R8(0.65) R2(0.85)
Row 5: R5(0.75) R4(0.80) R7(0.75) R7(0.60)
Row 6: R3(0.70) R2(0.75) R9(0.70) R2(0.55)
Row 7: R4(0.65) R9(0.50) R6(0.60) R1(0.65)
Row 8: R1(0.60) R5(0.45) R1(0.50) R9(0.55)
Row 9: R4(0.40) R10(0.55) R6(0.45) R6(0.40)
Row 10: R6(0.50) R1(0.30) R9(0.30) R10(0.30)
Objects seen, with overall grades: R2 (3.05), R3 (3.40), R4 (2.55), R5 (3.05), R7 (3.15), R8 (3.30), R9 (2.05), R10 (2.65)
Objects seen in all 4 lists: R8, R7, R2
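A minimal Python sketch of the FA steps, assuming each list holds (object, grade) pairs sorted by grade in descending order and that every object appears in every list. The function name `fagin_top_k`, the dictionary-based random access, and the two small lists are hypothetical illustrations, not the data from the worked example.

```python
def fagin_top_k(lists, k, t=min):
    """Sketch of FA. `t` is a monotone aggregation function over a list of grades."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}  # object -> indices of lists where sorted access has seen it
    Y = {}        # object -> overall grade
    depth = 0
    while depth < len(lists[0]):
        # One round of sorted access, in parallel, to each of the m lists.
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            if obj not in Y:
                # Random access to the other lists for a newly seen object.
                Y[obj] = t([lookup[j][obj] for j in range(m)])
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
        # H: objects seen under sorted access in all m lists.
        H = [obj for obj, where in seen_in.items() if len(where) == m]
        if len(H) >= k:
            break
    ranked = sorted(Y.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k], depth


# Hypothetical data: two lists over five objects.
lists = [
    [("a", 1.0), ("b", 0.8), ("c", 0.5), ("d", 0.3), ("e", 0.1)],
    [("b", 0.9), ("d", 0.7), ("a", 0.6), ("e", 0.4), ("c", 0.2)],
]
print(fagin_top_k(lists, 2))
```

The second return value is the sorted-access depth reached, which makes it easy to compare how early different algorithms stop on the same input.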
Basic algorithms: Threshold Algorithm
• Similar to FA, with a slight modification.
• The Threshold Algorithm (TA) consists of the following steps:
- Perform sorted access in parallel to each of the ‘m’ lists.
- For every new object R seen, perform random access to every other list to find its i-th field x_i.
- Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to compute the overall grade of the object, and store it in set ‘Y’ only if it is among the top k objects seen so far.
- After every round of sorted access, compute the threshold value ‘T’ by applying the aggregation function to the last grade seen in each list.
- Stopping point: as soon as at least k objects have been seen whose grade is at least ‘T’.
- Return set ‘Y’, which holds the top k objects.
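Basic algorithms: Threshold Algorithm
Worked example on four lists (top 3 objects). Graded entries read at each depth, with the threshold T after that depth:
Depth 1: R5(1.00) R8(0.95) R10(1.00) R3(0.95), T = 3.90/4
Depth 2: R7(0.95) R2(0.90) R10(0.80) R3(0.95), T = 3.60/4
Depth 3: R4(0.70) R7(0.85) R5(0.85) R8(0.90), T = 3.30/4
Depth 4: R8(0.80) R3(0.80) R8(0.65) R2(0.85), T = 3.10/4
Remaining entries:
R5(0.75) R4(0.80) R7(0.75) R7(0.60)
R3(0.70) R2(0.65) R9(0.70) R2(0.55)
R4(0.65) R9(0.50) R6(0.60) R1(0.65)
R1(0.60) R5(0.45) R1(0.50) R9(0.55)
R4(0.40) R10(0.55) R6(0.45) R6(0.40)
R6(0.50) R1(0.30) R9(0.30) R10(0.30)
Overall grades of the objects seen: R3 (3.40/4), R8 (3.30/4), R7 (3.15/4), R5 (3.05/4), R2 (2.95/4), R10 (2.65/4)
Top 3 objects: R3, R8, R7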
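A minimal Python sketch of the TA steps, under the same assumptions as before: lists of (object, grade) pairs sorted by grade in descending order, every object present in every list, and a monotone aggregation function `t`. The name `threshold_top_k` and the sample lists are hypothetical. The buffer `Y` never holds more than k objects, so an object that is not in the buffer is graded again each time it is seen.

```python
def threshold_top_k(lists, k, t=min):
    """Sketch of TA with a buffer Y of at most k objects."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    Y = {}     # object -> overall grade, never more than k entries
    depth = 0
    for depth in range(1, len(lists[0]) + 1):
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth - 1]
            last_seen.append(grade)
            if obj in Y:
                continue
            # Random access to the other lists to grade this object.
            overall = t([lookup[j][obj] for j in range(m)])
            if len(Y) < k:
                Y[obj] = overall
            else:
                worst = min(Y, key=lambda o: (Y[o], o))
                if overall > Y[worst]:
                    del Y[worst]
                    Y[obj] = overall
        # Threshold: aggregation of the last grade seen in each list.
        T = t(last_seen)
        if len(Y) == k and min(Y.values()) >= T:
            break
    return sorted(Y.items(), key=lambda pair: (-pair[1], pair[0])), depth


# Hypothetical data: two lists over five objects.
lists = [
    [("a", 1.0), ("b", 0.8), ("c", 0.5), ("d", 0.3), ("e", 0.1)],
    [("b", 0.9), ("d", 0.7), ("a", 0.6), ("e", 0.4), ("c", 0.2)],
]
print(threshold_top_k(lists, 2))
```

With MIN and k = 2 on this input, the thresholds after depths 1, 2 and 3 are 0.9, 0.7 and 0.5, so the loop stops at depth 3, once both buffered grades (0.8 and 0.6) are at least T.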
Basic algorithms: Comparison between TA and FA
• FA is optimal only in some cases, whereas TA is optimal in all cases (it is instance optimal for every monotone aggregation function).
• TA needs only a bounded buffer, whereas FA requires a buffer that grows with the database size.
• TA may repeat its m-1 random accesses for an object that is not in the current top-k set each time that object is seen again, whereas FA performs these random accesses only once for every object newly seen under sorted access.
Modifications of the TA Algorithm
• Approximation algorithm: finds the top k objects to within an approximation factor ‘x’, and stops earlier than TA.
• Restricting sorted access: used when sorted access to some lists is not allowed, e.g. finding the best restaurant.
• Restricting random access:
- NRA (No Random Access) was developed for cases where no random access is allowed, e.g. a text retrieval system.
- CA (Combined Algorithm) was developed for situations where random access is allowed but very costly, e.g. random disk access. It is a combination of TA and NRA.
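The approximation variant can be sketched by relaxing the TA stopping rule. The sketch below assumes an approximation factor `theta` greater than 1 and halts as soon as k objects have a grade of at least T / theta. The name `approx_top_k`, the parameter name `theta`, and the data are hypothetical. For simplicity this version remembers the grade of every object seen rather than keeping a bounded buffer.

```python
def approx_top_k(lists, k, theta, t=min):
    """TA with a relaxed halting rule: stop once k grades reach T / theta."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    grades = {}
    best = []
    depth = 0
    for depth in range(1, len(lists[0]) + 1):
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth - 1]
            last_seen.append(grade)
            if obj not in grades:
                grades[obj] = t([lookup[j][obj] for j in range(m)])
        T = t(last_seen)
        best = sorted(grades.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
        # Written as grade * theta >= T, equivalent to grade >= T / theta.
        if len(best) == k and best[-1][1] * theta >= T:
            break
    return best, depth


# Hypothetical data: two lists over five objects.
lists = [
    [("a", 1.0), ("b", 0.8), ("c", 0.5), ("d", 0.3), ("e", 0.1)],
    [("b", 0.9), ("d", 0.7), ("a", 0.6), ("e", 0.4), ("c", 0.2)],
]
print(approx_top_k(lists, 2, theta=2))  # relaxed rule
print(approx_top_k(lists, 2, theta=1))  # same rule as plain TA
```

On this input the relaxed rule with theta = 2 stops after one round of sorted access, while theta = 1 needs three rounds. The answer returned under the relaxed rule is only guaranteed to be within a factor theta of the true top k, although here it happens to be exact.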
Advanced algorithms
• Suppose we already have several ranked lists of objects. The problem is to aggregate these lists into a single ranked list.
• The problem can be solved using a median-finding algorithm.
• The steps of the median-finding algorithm are:
- Find the rank of each object in each of the ranked lists.
- For each object, find the median of its ranks across the lists.
- Sort the objects by their median rank.
- Retrieve the results from this sorted list.
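The four steps above can be sketched directly in Python. The sketch assumes every ranking contains the same objects and uses 1-based ranks; the function name `median_rank_aggregate` and the three sample rankings are hypothetical.

```python
from statistics import median


def median_rank_aggregate(rankings):
    """Order objects by the median of their ranks across all rankings."""
    # Step 1: rank of each object in each list (1 = best).
    positions = [
        {obj: rank for rank, obj in enumerate(ranking, start=1)}
        for ranking in rankings
    ]
    # Step 2: median rank of each object across the lists.
    medians = {
        obj: median(pos[obj] for pos in positions) for obj in positions[0]
    }
    # Steps 3 and 4: sort by median rank (ties broken by name) and return.
    return sorted(medians, key=lambda obj: (medians[obj], obj))


# Hypothetical rankings of the same four objects.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
print(median_rank_aggregate(rankings))
```

Here the median ranks are a = 1, b = 2, c = 3 and d = 4, so the aggregated order is a, b, c, d. Note that step 1 looks up every object in every list, which is the cost the next slide addresses.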
Advanced algorithms
• The limitation of the median-finding algorithm is its large number of random accesses, which the MEDRANK algorithm overcomes.
• MEDRANK algorithm: access the ranked lists one element of every list at a time, until some element has been seen in more than half of the lists.
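A minimal Python sketch of the MEDRANK idea, assuming all rankings have the same length and contain the same objects. The function name `medrank`, the optional parameter `k` (to keep going until k winners are found), and the sample rankings are hypothetical.

```python
def medrank(rankings, k=1):
    """Return the first k objects seen in more than half of the rankings."""
    m = len(rankings)
    majority = m // 2 + 1  # smallest count that is more than half of m
    counts = {}
    winners = []
    for depth in range(len(rankings[0])):
        # Read one element from every list, using sorted access only.
        for ranking in rankings:
            obj = ranking[depth]
            counts[obj] = counts.get(obj, 0) + 1
            if counts[obj] == majority:
                winners.append(obj)
                if len(winners) == k:
                    return winners
    return winners


# Hypothetical rankings of the same four objects.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
print(medrank(rankings, k=2))
```

No random access is used: the winner is found after reading only the first row of the three lists. When two objects reach the majority at the same depth, this sketch reports them in the order in which the lists are scanned.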
Related work
• In 1996, Chaudhuri and Gravano presented an algorithm built on Fagin’s original FA algorithm.
• In 1997 and 1998, Carey and Kossmann presented techniques to optimize top-k queries.
• In 1999, Nepal and Ramakrishna presented variations on Fagin’s TA algorithm for processing queries over multimedia databases.
• In 2000, Güntzer et al. made a remarkable contribution to Fagin’s TA algorithm by reducing the number of random accesses.
• In 2002, Chang and Hwang presented an algorithm called MPro to optimize the execution of expensive predicates.
How web-search engines rank the web pages (1)
• Web-search engines rank web pages based on various factors.
• The frequency of occurrence and the location of the search terms are the primary factors.
• Two of the most important web-search engines are Google and AltaVista.
How web-search engines rank the web pages (2)
• AltaVista
- Maintains a huge phrase dictionary.
- The basic intuition behind its ranking of web pages is as follows:
- It first displays all the pages containing the phrase.
- Then it displays the pages in which the words are close to each other.
- These are followed by the pages containing all the terms, and then by the pages containing any of the terms.
- Another important factor is the popularity of the search being performed.
How web-search engines rank the web pages (3)
• Google
- Uses a very different technology called PageRank.
• PageRank technology
- Measures the importance of a web page by solving an equation.
- Interprets a link as a vote.
- Assesses a page’s importance by the number of votes it receives.
- Important pages receive a higher rank and appear at the top of the search results.
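The idea of treating links as votes can be sketched with a simple power iteration. This is a textbook-style simplification, not Google's production method. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively solve the rank equation for a small link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages vote for C, and C votes for A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C receives the most votes and ends up with the highest rank. Page A comes second even though only one page links to it, because that vote comes from the important page C.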
Conclusion
• The literature we studied shows that much work has been done on the problem of retrieving the top-k results from a database.
• We came across many algorithms that are quite tricky to understand.
• Research in this field is still very active.
• The current focus is on devising more sophisticated algorithms for aggregating ranked lists.
Thank you! Any Questions?