Automated Ranking of Database Query Results • Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis • Presented by Archana Vijayalakshmanan, 4/11/2006
Contents • Introduction • Different ranking functions • Breaking ties • Implementation • Conclusion
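Introduction • Automated ranking of query results is a popular aspect of information retrieval (IR). • Database systems support only a Boolean query model, which leads to two problems: • Empty answers: the query is too selective and returns nothing. • Many answers: the query is not selective enough and returns too many tuples. • Automated ranking of query results takes a user query and maps it to a Top-K query with a ranking function.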
Automated Ranking functions for the ‘Empty Answers Problem’ • IDF Similarity • QF Similarity • QFIDF Similarity
IDF Similarity • IR technique: • Q = set of keywords • IDF(w) = log(N/F(w)), where N is the number of documents and F(w) is the number of documents in which w occurs • TF(w,d) = frequency of occurrence of w in d • Cosine similarity between a query and a document is the normalized dot product of the two corresponding vectors • This similarity function is known as cosine similarity with TF-IDF weightings • Database adaptation (categorical attributes only): a word w corresponds to an <attribute, value> pair and a document d corresponds to a tuple • T = <t1, ..., tm> • Q = <q1, ..., qm>, where the condition is of the form “WHERE A1 = q1 AND ... AND Am = qm” • IDFk(t) = log(n/Fk(t)) • n = number of tuples in the database • Fk(t) = number of tuples in the database where Ak = t • Similarity between T and Q is the sum of the corresponding similarity coefficients over all attributes • The dot product is un-normalized • TF is irrelevant • This similarity function is known as IDF similarity • E.g., query = {CONVERTIBLE, NISSAN}
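The database side of the IDF similarity slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the table `cars`, the helper names `build_idf` and `idf_similarity`, and the per-attribute coefficient (IDFk(q) when the tuple value equals the query value, 0 otherwise) are assumptions made for the example.

```python
import math
from collections import Counter

def build_idf(rows, attributes):
    """Return {attribute: {value: IDF}} with IDF_k(t) = log(n / F_k(t))."""
    n = len(rows)
    idf = {}
    for a in attributes:
        freq = Counter(row[a] for row in rows)   # F_k(t) for every value t
        idf[a] = {v: math.log(n / f) for v, f in freq.items()}
    return idf

def idf_similarity(row, query, idf):
    """Un-normalized sum of IDF_k(q_k) over attributes where the tuple matches."""
    return sum(idf[a].get(q, 0.0) for a, q in query.items() if row.get(a) == q)

# Hypothetical toy database (assumption, not from the slides).
cars = [
    {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    {"MFR": "NISSAN", "TYPE": "SEDAN"},
    {"MFR": "TOYOTA", "TYPE": "SEDAN"},
    {"MFR": "HONDA", "TYPE": "SEDAN"},
]
idf = build_idf(cars, ["MFR", "TYPE"])
query = {"TYPE": "CONVERTIBLE", "MFR": "NISSAN"}
scores = [idf_similarity(row, query, idf) for row in cars]
```

In this toy table CONVERTIBLE is rarer than NISSAN, so a match on TYPE contributes more to the score than a match on MFR.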
Generalizations of IDF similarity • For numeric data: • It is inappropriate to use the previous similarity coefficients. • The frequency of a numeric value depends on nearby values. • Discretizing a numeric attribute into a categorical one is problematic. • Solution: • Let {t1, t2, ..., tn} be the values of attribute A. For every value t, the frequency of t is replaced by the sum of ”contributions” to t from every other point ti, with each contribution modeled as a Gaussian distribution. • The similarity function depends on a bandwidth parameter. • The similarity function is also generalized to conditions that specify a range or set of values.
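One plausible way to realize the Gaussian "contributions" idea is a kernel-smoothed frequency. The formulas below (a Gaussian kernel with bandwidth `h`, and a similarity that scales the closeness of t and q by the smoothed IDF of q) are assumptions chosen for illustration, since the slide does not spell them out; the function names and the sample `prices` are hypothetical.

```python
import math
import sys

def numeric_idf(t, values, h):
    """Smoothed IDF: log(n / sum_i exp(-0.5 * ((t_i - t) / h) ** 2))."""
    density = sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)
    density = max(density, sys.float_info.min)   # guard against underflow to 0
    return math.log(len(values)) - math.log(density)

def numeric_similarity(t, q, values, h):
    """Gaussian closeness of t to q, scaled by the smoothed IDF of q."""
    closeness = math.exp(-0.5 * ((t - q) / h) ** 2)
    return closeness * numeric_idf(q, values, h)

# Hypothetical attribute values: four prices close together, one far away.
prices = [10_000, 10_500, 11_000, 11_200, 30_000]
h = 1_000   # bandwidth parameter (assumed value)

crowded = numeric_idf(10_500, prices, h)   # many neighbours, low IDF
isolated = numeric_idf(30_000, prices, h)  # no neighbours, high IDF
```

A larger bandwidth makes more distant values count as "nearby", which flattens the IDF curve; a smaller one approaches the categorical behaviour.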
QF Similarity • The importance of attribute values is determined by the frequency of their occurrence in the workload. • For categorical data: • Query frequency QF(q) = RQF(q)/RQFMax, where RQF(q) is the raw frequency of occurrence of value q of attribute A in the query strings of the workload and RQFMax is the raw frequency of the most frequently occurring value in the workload. • S(t,q) = QF(q) if q = t, and 0 otherwise. • Similarity between pairs of different categorical attribute values can also be derived from the workload, e.g., to find S(TOYOTA, HONDA): • Analyze the IN clauses of queries: if a certain pair of values often occurs together in the workload, the values are similar, e.g., queries with C as “MFR IN {TOYOTA, HONDA, NISSAN}”. • Look for several recent queries in the workload by a specific user repeatedly requesting TOYOTA and HONDA.
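The workload-based quantities on the QF slide can be sketched as follows. The workload representation (each query as a dict from attribute to the list of values it mentions) is a hypothetical simplification, and the Jaccard overlap used for pairs such as (TOYOTA, HONDA) is only one possible way to measure how often two values occur together; the slide does not fix a formula.

```python
from collections import Counter

def build_qf(workload, attribute):
    """QF(q) = RQF(q) / RQFMax for the values of one attribute."""
    rqf = Counter()
    for query in workload:
        rqf.update(query.get(attribute, ()))      # raw frequency per value
    rqf_max = max(rqf.values(), default=0)
    return {v: c / rqf_max for v, c in rqf.items()} if rqf_max else {}

def qf_similarity(t, q, qf):
    """S(t, q) = QF(q) if q == t, else 0."""
    return qf.get(q, 0.0) if t == q else 0.0

def in_clause_jaccard(workload, attribute, u, v):
    """Assumed co-occurrence measure: Jaccard overlap of the queries naming u and v."""
    w_u = {i for i, qry in enumerate(workload) if u in qry.get(attribute, ())}
    w_v = {i for i, qry in enumerate(workload) if v in qry.get(attribute, ())}
    union = w_u | w_v
    return len(w_u & w_v) / len(union) if union else 0.0

# Hypothetical parsed workload (assumption, not from the slides).
workload = [
    {"MFR": ["TOYOTA", "HONDA", "NISSAN"]},   # MFR IN {TOYOTA, HONDA, NISSAN}
    {"MFR": ["TOYOTA"]},
    {"MFR": ["TOYOTA", "HONDA"]},
    {"TYPE": ["SEDAN"]},
]
qf = build_qf(workload, "MFR")
toyota_honda = in_clause_jaccard(workload, "MFR", "TOYOTA", "HONDA")
```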
QFIDF Similarity • QF is purely workload-based, which is a big disadvantage for insufficient or unreliable workloads. • For QFIDF similarity: • S(t,q) = QF(q) * IDF(q) when t = q, and 0 otherwise, where QF(q) = (RQF(q)+1)/(RQFMax+1). • Thus we get a small non-zero value even if a value is never referenced in the workload.
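A small worked sketch of the QFIDF coefficient, using the smoothed QF from the slide. The counts in `rqf` and `freq`, the database size `n`, and the function names are made-up values for illustration only.

```python
import math

def smoothed_qf(rqf_q, rqf_max):
    """QF(q) = (RQF(q) + 1) / (RQFMax + 1); never zero, even for unseen values."""
    return (rqf_q + 1) / (rqf_max + 1)

def qfidf_similarity(t, q, rqf, rqf_max, n, freq):
    """S(t, q) = QF(q) * IDF(q) when t == q, else 0."""
    if t != q:
        return 0.0
    return smoothed_qf(rqf.get(q, 0), rqf_max) * math.log(n / freq[q])

# Hypothetical statistics (assumptions): 100 tuples, SAAB never queried.
n = 100
freq = {"TOYOTA": 20, "SAAB": 2}   # occurrences in the database
rqf = {"TOYOTA": 9}                # occurrences in the workload
rqf_max = 9

s_toyota = qfidf_similarity("TOYOTA", "TOYOTA", rqf, rqf_max, n, freq)
s_saab = qfidf_similarity("SAAB", "SAAB", rqf, rqf_max, n, freq)
```

Here SAAB never appears in the workload, yet its coefficient stays positive because of the +1 smoothing and its high IDF.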
Breaking ties • Problem: many tuples may tie for the same similarity score and get ordered arbitrarily. This arises in both the empty-answers and many-answers problems. • Solution: determine weights for the missing attribute values that reflect their “global importance” for ranking purposes by using workload information. • Extend QF similarity and use the resulting quantity to break ties. • Extending IDF similarity by using IDF values presents challenges.
Implementation • Pre-processing component • Query-processing component
Pre-processing component • Compute and store a representation of the similarity function in auxiliary database tables. • For categorical data, to compute IDF(t) (resp. QF(t)), compute the frequency of occurrences of values in the database (resp. workload) and store the results in auxiliary database tables. • For numeric data, an approximate representation of the smooth function IDF() (resp. QF()) is stored, so that the function value can be retrieved at runtime.
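The categorical part of the pre-processing step might look like the following sketch, which uses Python's built-in `sqlite3` module as a stand-in DBMS. The table `cars`, the auxiliary table name `idf_<column>`, and the helper `build_idf_table` are hypothetical; table and column names are interpolated directly into SQL, so the sketch assumes they come from a trusted schema.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (tid INTEGER PRIMARY KEY, mfr TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO cars (mfr, type) VALUES (?, ?)",
    [("NISSAN", "CONVERTIBLE"), ("NISSAN", "SEDAN"),
     ("TOYOTA", "SEDAN"), ("HONDA", "SEDAN")],
)

def build_idf_table(conn, table, column):
    """Store IDF(t) = log(n / F(t)) for every value t of one categorical column."""
    n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    aux = f"idf_{column}"                       # auxiliary table name (assumed)
    conn.execute(f"CREATE TABLE {aux} (value TEXT PRIMARY KEY, idf REAL)")
    rows = conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column}"
    ).fetchall()
    conn.executemany(
        f"INSERT INTO {aux} VALUES (?, ?)",
        [(value, math.log(n / count)) for value, count in rows],
    )
    return aux

aux = build_idf_table(conn, "cars", "type")
# At query time the stored value is simply looked up.
idf_convertible = conn.execute(
    f"SELECT idf FROM {aux} WHERE value = ?", ("CONVERTIBLE",)
).fetchone()[0]
```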
Query processing component • Main task: given a query Q and an integer K, retrieve the Top-K tuples from the database using one of the ranking functions. • The ranking function is extracted in the pre-processing phase. • An SQL DBMS is used for solving the Top-K problem. • Handling a simpler query processing problem: • Input: a table R with M categorical columns and a key column TID, a condition C that is a conjunction of the form Ak = qk ..., and an integer K. • Output: the Top-K tuples of R most similar to Q. • Similarity function: overlap similarity.
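The simpler problem stated above can be illustrated with a full-scan baseline. The sketch assumes that overlap similarity counts the attributes on which a tuple agrees with the query, and that ties are broken by lower TID; both choices, along with the sample table `R`, are assumptions for the example.

```python
import heapq

def overlap_similarity(row, query):
    """Number of query attributes on which the tuple agrees with the query."""
    return sum(1 for a, q in query.items() if row.get(a) == q)

def top_k_scan(rows, query, k):
    """Score every tuple and keep the K best (ties broken by lower TID)."""
    return heapq.nlargest(
        k, rows, key=lambda r: (overlap_similarity(r, query), -r["TID"])
    )

# Hypothetical table R with key column TID and two categorical columns.
R = [
    {"TID": 1, "MFR": "NISSAN", "TYPE": "SEDAN"},
    {"TID": 2, "MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    {"TID": 3, "MFR": "TOYOTA", "TYPE": "SEDAN"},
    {"TID": 4, "MFR": "TOYOTA", "TYPE": "CONVERTIBLE"},
]
Q = {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"}
top2 = [row["TID"] for row in top_k_scan(R, Q, 2)]
```

A scan like this touches every tuple, which is the cost that an index-based strategy tries to avoid.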
Implementation of Top-K operator • Traditional approach • Index-based approach • The overlap similarity function satisfies the following monotonic property: if T and U are two tuples such that for all k, Sk(tk,qk) < Sk(uk,qk), then SIM(T,Q) < SIM(U,Q). • This property allows the TA (Threshold Algorithm) to be adapted. • To adapt TA, sorted and random access methods are implemented. • The algorithm performs sorted access for each attribute, retrieves the complete tuples with the corresponding TIDs by random access, and maintains a buffer of the Top-K tuples seen so far.
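The adapted TA loop described above can be sketched in memory as follows. This is an illustration under stated assumptions, not the index-based implementation from the slides: sorted access is simulated with pre-sorted Python lists instead of database indexes, random access is a dictionary lookup by TID, and the names `threshold_algorithm`, `weight`, and the sample table are hypothetical.

```python
import heapq

def threshold_algorithm(rows, query, k, weight=lambda a, q: 1.0):
    """Top-k by SIM(T,Q) = sum_k S_k(t_k,q_k), S_k = weight on a match, else 0."""
    by_tid = {r["TID"]: r for r in rows}           # random access by TID

    def coeff(row, a):
        return weight(a, query[a]) if row.get(a) == query[a] else 0.0

    # Sorted access: one list per query attribute, best coefficient first.
    lists = {a: sorted(rows, key=lambda r, a=a: -coeff(r, a)) for a in query}

    seen, top = set(), []                          # top: min-heap of (score, -TID)
    for depth in range(len(rows)):
        last = {}                                  # coefficient last seen per list
        for a, lst in lists.items():
            row = lst[depth]
            last[a] = coeff(row, a)
            tid = row["TID"]
            if tid not in seen:
                seen.add(tid)
                full = by_tid[tid]                 # fetch the complete tuple
                score = sum(coeff(full, b) for b in query)
                heapq.heappush(top, (score, -tid))
                if len(top) > k:
                    heapq.heappop(top)             # keep only the K best so far
        threshold = sum(last.values())             # best score any unseen tuple can reach
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(-neg_tid, score) for score, neg_tid in sorted(top, reverse=True)]

rows = [
    {"TID": 1, "MFR": "NISSAN", "TYPE": "SEDAN"},
    {"TID": 2, "MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    {"TID": 3, "MFR": "TOYOTA", "TYPE": "SEDAN"},
    {"TID": 4, "MFR": "TOYOTA", "TYPE": "CONVERTIBLE"},
]
query = {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"}
best = threshold_algorithm(rows, query, 1)
```

The early exit is justified by the monotonic property: once the K-th best buffered score reaches the threshold, no unseen tuple can score higher. Passing a different `weight` function (for example, a lookup of IDF or QF values) turns the same loop into a sketch for the other ranking functions.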
Index-based TA (ITA) • Sorted access • Random access
Conclusion • Thus TF-IDF based techniques were extended to numerical and mixed data. • Workload tracking was used as a weak form of collaborative filtering.