Query Flocks: A Generalization of Association-Rule Mining
D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, A. Rosenthal
Motivations
• Market basket analysis has been successful, partially due to the a-priori optimization
• Extend this trick to a more general context
  • efficiently mine large databases for patterns
  • use parametrized queries with a filter condition
  • spend most of the time evaluating the “interesting” cases
Query Flocks
• Two parts:
  • generate parametrized queries (parameters are denoted by names starting with $)
  • filter the results of the queries
• Result is the set of tuples which are “acceptable” assignments of values for the parameters
Market Basket Example
Datalog query:
    answer(B) :- baskets(B,$1) AND baskets(B,$2)
Filter:
    COUNT(answer.B) >= 20
• Find all pairs of items that appear in at least 20 market baskets
• Result is all pairs of items ($1,$2) such that at least 20 baskets have both items
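The flock above can be evaluated directly, without any pruning. The following is a minimal Python sketch under stated assumptions: `baskets` is a list of (basket ID, item) pairs, the function name `frequent_pairs` and the sample data are hypothetical, and each unordered pair is reported once by keeping the two items in sorted order.

```python
from collections import defaultdict
from itertools import combinations

def frequent_pairs(baskets, support=20):
    """Return item pairs ($1, $2) that co-occur in at least `support` baskets."""
    # Group items by basket; a set removes duplicate (basket, item) rows.
    items_by_basket = defaultdict(set)
    for bid, item in baskets:
        items_by_basket[bid].add(item)

    # Count, for every pair of items, the baskets that contain both.
    counts = defaultdict(int)
    for items in items_by_basket.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1

    # Filter step: COUNT(answer.B) >= support.
    return {pair for pair, n in counts.items() if n >= support}

# Hypothetical sample data with a small support threshold.
baskets = [
    (1, "beer"), (1, "diapers"),
    (2, "beer"), (2, "diapers"), (2, "milk"),
    (3, "beer"), (3, "milk"),
]
print(sorted(frequent_pairs(baskets, support=2)))
# [('beer', 'diapers'), ('beer', 'milk')]
```

This version counts every pair that occurs in any basket, which is the cost the a-priori optimization is meant to avoid.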
Why Not SQL?
The same query in SQL:
    SELECT i1.Item, i2.Item
    FROM baskets i1, baskets i2
    WHERE i1.Item < i2.Item AND i1.BID = i2.BID
    GROUP BY i1.Item, i2.Item
    HAVING 20 <= COUNT(i1.BID)
• A-Priori trick is not implemented by conventional optimizers
• Claim: necessary code optimizations could be implemented in SQL systems
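For readers unfamiliar with the a-priori trick mentioned here, the sketch below shows the idea for item pairs in Python: count single items first, then count only pairs whose items are both frequent. It is an illustration, not the code a SQL optimizer would generate; the function name `apriori_pairs` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def apriori_pairs(baskets, support=20):
    """Two-pass a-priori: prune infrequent items before counting pairs."""
    items_by_basket = defaultdict(set)
    for bid, item in baskets:
        items_by_basket[bid].add(item)

    # Pass 1: number of baskets containing each single item.
    item_counts = Counter(
        item for items in items_by_basket.values() for item in items
    )
    frequent = {item for item, n in item_counts.items() if n >= support}

    # Pass 2: count only pairs made of frequent items. A pair can reach
    # the threshold only if both of its items do.
    pair_counts = Counter()
    for items in items_by_basket.values():
        kept = sorted(items & frequent)
        pair_counts.update(combinations(kept, 2))

    return {pair for pair, n in pair_counts.items() if n >= support}

# Hypothetical sample data: "caviar" is pruned in pass 1.
sample = [
    (1, "beer"), (1, "diapers"),
    (2, "beer"), (2, "diapers"), (2, "milk"),
    (3, "beer"), (3, "milk"), (3, "caviar"),
]
print(sorted(apriori_pairs(sample, support=2)))
# [('beer', 'diapers'), ('beer', 'milk')]
```

The result is the same as counting all pairs directly; only the amount of intermediate counting changes.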
Generalizing the A-Priori Technique
• First evaluate a less expensive query and eliminate parameter values that already fail the filter
• Use a subset of the subgoals of the query
• This subset must form a safe query
Safe Queries
• Every variable in the head must appear in a nonnegated, nonarithmetic subgoal
• Every variable in a negated subgoal must appear in a nonnegated subgoal
• Every variable in an arithmetic subgoal must appear in a nonnegated, nonarithmetic subgoal
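The three conditions can be checked mechanically. The Python sketch below is one possible encoding, not taken from the slides: the `Subgoal` class and `is_safe` function are hypothetical, parameters such as `$s` are treated like ordinary variables (an assumption), and the stricter "nonnegated, nonarithmetic" reading is applied to all three conditions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subgoal:
    kind: str             # "positive", "negated", or "arithmetic"
    variables: frozenset  # variable and parameter names used by the subgoal

def is_safe(head_vars, subgoals):
    """Check the safety conditions for one rule."""
    # Variables bound by nonnegated, nonarithmetic subgoals.
    bound = set()
    for goal in subgoals:
        if goal.kind == "positive":
            bound |= goal.variables

    # Condition 1: every head variable is bound.
    if not set(head_vars) <= bound:
        return False
    # Conditions 2 and 3: every variable of a negated or arithmetic
    # subgoal is bound as well.
    return all(g.variables <= bound for g in subgoals if g.kind != "positive")

exhibits = Subgoal("positive", frozenset({"P", "$s"}))
diagnosed = Subgoal("positive", frozenset({"P", "D"}))
not_causes = Subgoal("negated", frozenset({"D", "$s"}))

print(is_safe({"P"}, [exhibits, diagnosed, not_causes]))  # True
print(is_safe({"P"}, [exhibits, not_causes]))             # False: D is unbound
```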
Example
Relations:
    diagnosed(patient, disease)
    exhibits(patient, symptom)
    treatments(patient, medicine)
    causes(disease, symptom)
Query:
    answer(P) :- exhibits(P,$s) AND treatments(P,$m) AND diagnosed(P,D) AND NOT causes(D,$s)
Find symptoms $s and medicines $m such that many (at least 20) patients exhibit the symptom and are taking the medicine, but their disease does not explain the symptom
Some Safe Subqueries
• answer(P) :- exhibits(P,$s).
  20+ patients exhibit the symptom
• answer(P) :- treatments(P,$m).
  20+ patients were given the medicine
• answer(P) :- diagnosed(P,D) AND exhibits(P,$s) AND NOT causes(D,$s).
  20+ patients have an unexplained symptom
• answer(P) :- exhibits(P,$s) AND treatments(P,$m).
  20+ patients are taking the medicine and exhibit the symptom
A Formal Query Plan Using a Sequence of Filter Steps
    okS($s) := FILTER($s,
        answer(P) :- exhibits(P,$s),
        COUNT(answer.P) >= 20);
    okM($m) := FILTER($m,
        answer(P) :- treatments(P,$m),
        COUNT(answer.P) >= 20);
    ok($s,$m) := FILTER({$s,$m},
        answer(P) :- okS($s) AND okM($m) AND diagnosed(P,D) AND exhibits(P,$s)
                     AND treatments(P,$m) AND NOT causes(D,$s),
        COUNT(answer.P) >= 20);
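The three filter steps can be traced on in-memory relations. The Python sketch below assumes each relation is a collection of tuples in the column order given for the example relations; the function names `count_filter` and `unexplained_flock` and the sample data are hypothetical.

```python
from collections import defaultdict

def count_filter(pairs, support):
    """FILTER step: keep values that occur with at least `support` distinct patients."""
    patients = defaultdict(set)
    for patient, value in pairs:
        patients[value].add(patient)
    return {value for value, ps in patients.items() if len(ps) >= support}

def unexplained_flock(exhibits, treatments, diagnosed, causes, support=20):
    ok_s = count_filter(exhibits, support)    # okS($s)
    ok_m = count_filter(treatments, support)  # okM($m)

    diseases = defaultdict(set)
    for patient, disease in diagnosed:
        diseases[patient].add(disease)
    meds = defaultdict(set)
    for patient, medicine in treatments:
        if medicine in ok_m:                  # pruned by okM
            meds[patient].add(medicine)

    # ok($s,$m): evaluate the full query only for surviving parameter values.
    answer = defaultdict(set)                 # ($s, $m) -> set of patients P
    for patient, symptom in exhibits:
        if symptom not in ok_s:               # pruned by okS
            continue
        # diagnosed(P,D) AND NOT causes(D,$s): some disease of P does not cause $s.
        if any((d, symptom) not in causes for d in diseases[patient]):
            for medicine in meds[patient]:
                answer[(symptom, medicine)].add(patient)
    return {sm for sm, ps in answer.items() if len(ps) >= support}

# Hypothetical sample data with a small support threshold.
exhibits = [(1, "rash"), (2, "rash"), (3, "rash"), (1, "fever"), (2, "fever")]
treatments = [(1, "drugX"), (2, "drugX"), (3, "drugX")]
diagnosed = [(1, "flu"), (2, "flu"), (3, "flu")]
causes = {("flu", "fever")}
print(unexplained_flock(exhibits, treatments, diagnosed, causes, support=2))
# {('rash', 'drugX')}
```

Here "fever" passes the first filter but is explained by "flu", so only the rash and drugX pair survives the final step.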
But Which Subqueries Are Best?
• Depends on sizes of relations, and numbers of patients, diseases, etc.
• Use heuristics for restricting the search for a query plan
A Dynamic Technique
• Use the sizes of the intermediate relations, after computation, to decide whether to filter
  • if the relation size gives an average number of tuples per value assignment that is much lower than previous steps, filter
  • if the set of parameters has not been seen before, compare number of tuples per value assignment with support threshold
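The two rules can be written as a small decision function. The Python sketch below is only one reading of the slide and rests on several assumptions: "previous steps" is taken to mean the last step with the same parameter set, "much lower" is modelled by a hypothetical `drop_factor`, and a new parameter set triggers a filter when its average falls below the support threshold. The name `should_filter` is also hypothetical.

```python
def should_filter(params, num_tuples, num_assignments, history, support,
                  drop_factor=0.5):
    """Decide whether to insert a filter step after an intermediate relation.

    params          -- parameters appearing in the relation, e.g. {"$s", "$m"}
    num_tuples      -- size of the intermediate relation
    num_assignments -- number of distinct parameter-value assignments in it
    history         -- dict mapping parameter sets to their last observed average
    """
    average = num_tuples / num_assignments if num_assignments else 0.0
    key = frozenset(params)
    previous = history.get(key)
    history[key] = average

    if previous is None:
        # Parameter set not seen before: compare with the support threshold
        # (assumed direction: a low average suggests many assignments fail).
        return average < support
    # Seen before: filter only if the average dropped sharply (assumed factor).
    return average < drop_factor * previous

history = {}
print(should_filter({"$s"}, 100, 50, history, support=20))  # True
print(should_filter({"$s"}, 90, 50, history, support=20))   # False
print(should_filter({"$s"}, 40, 50, history, support=20))   # True
```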
Example
1. Compare the number of patients with the number of symptoms
2. Compare the number of patients with the number of medicines
3. Compare the size of the relation with symptoms * medicines
4. Compare the number of patients in the relation from 3 with the number of patients from the leaf
5. Must be done to get the query result
[Figure: query-plan tree with decision points 1 to 5. Labels, top to bottom: 5; 4 and NOT causes(D,$s); 3 and diagnosed(P,D); 1 and 2; exhibits(P,$s) and treatments(P,$m).]
Summary
• This is a way of describing operations on large-scale databases
  • flocks consist of parametrized queries and filters for the results of the queries
  • exploit the a-priori algorithm with subqueries
  • use techniques for limiting the search for query plans