500 likes | 600 Views
SQAK: Doing More with Keywords. Sandeep Tata, Guy M. Lohman IBM Almaden Research Center Presented by Alex Zlotnik Seminar in Databases, 236826 azlotnik@tx.technion.ac.il. Aggregate Query A query that uses one of these functions: {Count, Average, Sum, Min, Max}.
E N D
SQAK:Doing More with Keywords Sandeep Tata, Guy M. Lohman IBM Almaden Research Center Presented by Alex Zlotnik Seminar in Databases, 236826 azlotnik@tx.technion.ac.il
Aggregate Query A query that uses one of these functions: {Count, Average, Sum, Min, Max}
Content • The problem • Solution • Research • Experiments • Other Challenges • Power vs. Ease of Use • Overview
3 Courses Professor Section Enrollment Courses Department Section Enrollment Student Section 4 courseid courseid deptid courseid courseid sectionid id id sectionid Join courseid name name sectionid studentid name sectionid studentid name name Join sectionid dept deptid deptid location term deptid term grade grade term instructor Section courseid sectionid term instructor The problem Number of students registered Write SQL for Number of students registered for the course “Seminar in Databases” in the Fall semester in 2009 course “Seminar in Databases” Fall semester in 2009 in less than 3 minutes 1 2
The problem 5. SQL: SELECT courses.name, section.term, count(students.id) as count FROM students, enrollment, section, courses WHERE students.id = enrollment.id AND section.classid = enrollment.classid AND courses.courseid = section.courseid AND lower(courses.name) LIKE ’\%seminar in databases\%’ AND lower(section.term) = ’\%fall 2009\%’ GROUP BY courses.name, section.term Number of students registered Write SQL for Number of students registered for the course “Seminar in Databases” in the Fall semester in 2009 course “Seminar in Databases” Fall semester in 2009 in less than 3 minutes
Perfect world solution SQAK: “Seminar in Databases” “Fall 2009” number students SQL: SELECT courses.name, section.term, count(students.id) as count FROM students, enrollment, section, courses WHERE students.id = enrollment.id AND section.classid = enrollment.classid AND courses.courseid = section.courseid AND lower(courses.name) LIKE ’\%seminar in databases\%’ AND lower(section.term) = ’\%fall 2009\%’ GROUP BY courses.name, section.term
Perfect world solution • Create aggregate queries using simple keywords • Little or no knowledge of the schema is required from the user • No changes required in the database • Any existing database
Progress • The need • The goal • Research Yada • Experiments • Other Challenges • Power vs. Ease of Use • Overview
Generating SQL - Parser Parser (keywords) Keyword {Candidates | Candidate = (Table, Column)} • Matching keyword against schema elements with approximate string matching Example: “students” Enrollment.studentid “students” Student.id • Inverted index from all text values to their columns • Example: “Seminar in Databases” Courses.name • Aggregates: sum, count, avg, min, max Candidate Interpretations = Cross product of each Candidates list “Seminar in Databases” “Fall 2009” number students {(Course.name, Section.term, count Enrollment.studentid), (Course.name, Section.term, count Student.id )}
number student “Cohen” {(count, Student.Id, Student.Name), (count, Student.Name, Student.Name)} number student “Cohen” {(count, Student.Id, Student.Name), (count, Student.Id, Enroll.StudentId)} Generating SQL - Parser Parser (keywords) Initial Filtering - CI with 2 keywords corresponding to the same column - CI with 2 columns that are primary and foreign key
SQN Builder (Candidate Interpretations) For every candidate interpretation Build the best matching sub-graph of the schema Enrollment Section Students Courses Generating SQL – SQN Builder “Seminar in Databases” “Fall 2009” number “Alex” (Course.name, Section.term, count Student.name)
Generating SQL – Scorer Scorer (SQNs) Find the best SQN and create SQL for it
Building Sub-graph (SQN) • Input: Tables as nodes in the directed schema graph • Output: Connected sub-graph covering the tables • Principle: Simplest model Making fewest assumptions, used in other papers too • Attempt #1: Minimal covering sub-graph with directed path between every 2 tables
Enrollment Section Section Professor Students Course Building Sub-graph (SQN) • Attempt #1: Minimal covering sub-graph with directed path between every 2 tables • Problem: Many-to-Many relationships
Enrollment Section Students Professor Course Weak Reference Department Building Sub-graph (SQN) • Attempt #2: Minimal covering sub-graph with Node Clarity • Node Clarity: The sub-graph doesn’t contain any node with multiple incoming edges
Enrollment Section Students Students Course Course Weak Reference Department Building Sub-graph (SQN) • Example:Find the number of students per course • Query: courses count students For each course, list the number of students that are in the same department that offers the course Enrollment Section Students Professor Course Department
Building Sub-graph (SQN) • Input: Tables as nodes in the directed schema graph • Output: Minimal sub-graph with Node Clarity covering the tables • Observation: The output is a tree NP Complete Reduction from Exact 3-Cover
Building Sub-graph (SQN) Greedy Heuristic Algorithm for finding min SQN CI = {nodes (tables) of keywords} Qagg = Aggregate node SQN = {} Start BFS (non-directed) from the aggregate node, For every step i 1. Qi = Nodes discovered in step i 2. for every node q in CI Qi 2.1 If NodeClear(q.path SQN) 2.1.1 SQN SQN q.path 2.1.2 CI CI \ {q} 2.1.3 if (CI = {}) return SQN 3. If no progress was made 3.1 backtrack the added path 4. Stop when all nodes in CI where found or BFS finishes Minimality: By BFS
Building Sub-graph (SQN) Greedy Heuristic Algorithm for finding min SQN The algorithm finds minimal SQN Complexity: Without backtracking: O(q2E2) Otherwise, exponential Time limit: Stop the algorithm after fixed time and run without node clarity = approx. Steiner In this case SQAK warns the user that the result might not be accurate
Generating SQL – Scorer (reminding) Scorer (SQNs) Find the best SQN and create SQL for it Score(CI, SQN) =
Generating SQL Procedure makeSimpleStatement(CI, SQN) • Make SELECT clause from elements in CI • Make FROM clause from nodes in SQN • Make WHERE clause from edges in SQN • Make GROUP BY clause from elements of CI except aggregated node • Add predicates in CI to the WHERE clause • Return statement end procedure
Generating SQL • Input: CI, SQN • Output: SQL • 3 types of queries: • Simple: “Seminar in DB” count students • Top1, single level aggregate: “department with max num students” • Top1, double level aggregate: “department student with max avg grade”
Generating SQL translateSQN(CI, SQN) 1. if SQN does not have a w-node then 1.1 Return makeSimpleStatement(CI, SQN) end if 2. if SQN has a w-node and a single level aggregate then 2.1 Produce view u = makeSimpleStatement(CI,SQN) 2.2 Remove w-node from u’s SELECT clause and GROUP BY clause 2.3 r = makeSimpleStatement(CI, SQN) 2.4 Add u to r’s FROM clause 2.5 Add join conditions joining all the columns in u to the corresponding ones in r 2.6 return r end if
Generating SQL: Top1 single level aggregate • “department with max num students” WITH temp(DEPTID, COURSEID) AS ( SELECT DEPARTMENT.DEPTID, count(COURSES.COURSEID) FROM COURSES, DEPARTMENT WHERE DEPARTMENT.DEPTID = COURSES.DEPTID GROUP BY DEPARTMENT.DEPTID), temp2(COURSEID) AS (SELECT max(COURSEID) FROM temp) SELECT temp.DEPTID, temp.COURSEID FROM temp, temp2 WHERE temp.COURSEID = temp2.COURSEID
Generating SQL translateSQN(CI, SQN) … 3. if SQN has a w-node and a double level aggregate then 3.1 Produce view u = makeSimpleStatement (CI,SQN) 3.2 Produce view v = aggregate of u from the second level aggregate term in the CI excluding the w-node in the SELECT and GROUP BY clauses 3.3 Produce r = Join u and v, equation on all the common columns 3.4 Return r end if
Generating SQL: Top1 double level aggregate • “department student with max avg grade” WITH temp( DEPTID, ID, GRADE) AS ( SELECT STUDENTS.DEPTID, STUDENTS.ID, avg(ENROLLMENT.GRADE) FROM ENROLLMENT, STUDENTS WHERE STUDENTS.ID = ENROLLMENT.ID GROUP BY STUDENTS.DEPTID , STUDENTS.ID), temp2( DEPTID, GRADE) AS (SELECT DEPTID, max(GRADE) FROM temp GROUP BY DEPTID) SELECT temp.DEPTID, temp.ID, temp.GRADE FROM temp, temp2 WHERE temp.DEPTID = temp2.DEPTID AND temp.GRADE = temp2.GRADE
SQAK Expressiveness • No formal definition of rSQL expressiveness – Future work • Queries based on weak reference – cannot be expressed
Progress • The need • The goal • Research Yada • Experiments • Other Challenges • Power vs. Ease of Use • Overview
Experiments Metrics • Data Precision Irrelevant, either the correct data is retrieved or not • Effectiveness • Savings • Parameters • Cost
Experiments - Effectiveness Average grade received by a student named William from the EECS department • SQAK – 93% (14 out of 15) • Steiner – 60% ( 9 out of 15)
Experiments - Effectiveness SQAK – 100% Steiner – 87% (13 of 15)
Experiments - Savings CS num students VS. SELECT DEPARTMENT.NAME, count(STUDENTS.ID) FROM STUDENTS, DEPARTMENT WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID AND lower(DEPARTMENT.NAME) LIKE ’%cs%’ GROUP BY DEPARTMENT.NAME
Experiments - Savings • Measure: #of schema elements + # of join conditions • M(CS num students) = 1 • M(SQL) = 5 SELECT DEPARTMENT.NAME, count(STUDENTS.ID) FROM STUDENTS, DEPARTMENT WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID AND lower(DEPARTMENT.NAME) LIKE ’%cs%’ GROUP BY DEPARTMENT.NAME • Saved = 5 – 1 = 4
Experiments - Savings • Measure: #of schema elements + # of join conditions • Not taken into account • SQL syntax and correctness • Top1 single and double level constructions
Experiments - Savings • Measure: #of schema elements + # of join conditions • Average Savings:
Experiments - Parameters • - Mismatch tolerance threshold Match(keyword to column) < ? MATCH : NOT_MATCH • f – Mismatch penalty (used by the scorer)
Experiments - Parameters • Low sensitivity to mismatch threshold [0.4, 0.8] • In lower thresholds wrong columns were selected for CI • As expected, lower penalty (f=1.5) leads to lower accuracy • Robust: f=2 or 3, between 0.4-0.8
Experiments – More Schemas • Large DB of IT assets of large enterprise • 600 tables, each with several columns • Sample queries provide accuracy as presented • Generating SQL, always less than 1 second • Warehouse DB • 14 tables, star schema • Star schema is easier for SQAK – no backtracking
Progress • The need • The goal • Research Yada • Experiments • Other Challenges • Power vs. Ease of Use • Overview
Other challenges • Approximate Matching • The user doesn’t know columns names • Proposal: hint list, either on paper or in code, or ontology based normalization • Missing Referential Integrity • Referential constraints not defined by the DBA • Proposal: Use (out of scope) algorithm to discover referential constraints
Other challenges • Tied or Close Plans • Several SQNs with (close to) best score • Can occur in similar names, different semantic areas of DB • The user selects the relevant SQL • Future research: Visualizing the interesting SQNs
Other challenges • Expressiveness • Users adopt SQAK quickly and pose queries such as age > 18 • SQAK adds an appropriate WHERE clause
Overview • SQAK - A system to create SQL queries with aggregates from keywords • Useful – No knowledge of schema or changes are required • Expressive, but with limitations • Trade-off between correctness and computability cost. • Execution in many common cases – polynomial, exponential at worst case
Personal opinion • The idea is powerful • Experiments on industrial data would emphasize the strengths and the weaknesses • Some of the results were expected (Steiner tree) • Translating CI to SQN is solving the same NP complete problem repeatedly. Caching mechanism would be very beneficiary