1 / 58

Completeness of Queries over Incomplete Databases

Completeness of Queries over Incomplete Databases. Werner Nutt j oint work with Marco Montali , Sergey Paramonov , Simon Razniewski , Ognjen Savkovic , Alex Tomasi , Fariz Darari

gerd
Download Presentation

Completeness of Queries over Incomplete Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CompletenessofQueriesoverIncomplete Databases Werner Nutt jointworkwith Marco Montali, Sergey Paramonov, Simon Razniewski, OgnjenSavkovic, Alex Tomasi, FarizDarari (VLDB‘11, CIKM’12, BPM‘13, ISWC‘13)

  2. Background Incompletenessis omnipresent in data management • Null values in relational databases: Codd 1975 • Representation systems: Imielinski/Lipski • 1984 Focus on certain/possible answers • Query completeness over incomplete databases: little attention Completeness of Queries over Incomplete Databases

  3. School Data Management in Bolzano Decentrally maintained database Statistical reports decentrallymaintaineddatabase („Popcorn“) School Statistics (ASTAT) ?? generally incomplete require complete data generallyincompleterequirecompletedata Completeness of Queries over Incomplete Databases

  4. Incompleteness in theSchool Data Facts in real world Facts in school database result(Paul, Math, A) result(Giulia, Math, A) result(Paul,Math,NULL) Missing information in the school database: - no entry for Giulia (missing record) - no grade for Paul (missing value) Completeness of Queries over Incomplete Databases

  5. Consequence: Query Answers are Incorrect Query Q: ”How many pupils have grade A in Math?” In the real world: Q( ) = 2 According to available database: Q( ) = 0  If data is incomplete, query answers become incorrect. result(Paul, Math, A) result(Giulia, Math, A) result(Paul, Math, NULL) Completeness of Queries over Incomplete Databases

  6. Why are Data About Pupils Incomplete? • Data have not yet been copied from the local school database to the central database • The copying procedure has been aborted • Pupils have been already registered/classes have been formed, but pupils have not yet been entered into the database • Some schools (e.g. vocational schools) administer student grades with Popcorn, others not • School careers of immigrants are often not captured Completeness of Queries over Incomplete Databases

  7. But: Data are Partially Complete • Grades of students at vocational schools are complete … • Grades of students at vocational schools are complete, … after reports have been handed out • Classes at school X are complete, when the classes have been formed … and entered into Popcorn How can we use information about partial completeness? Meta data! Business rules Stadium of a business process Deadlines ofbusiness processes Completeness of Queries over Incomplete Databases

  8. Use Metadata to Guarantee Completeness! ... vocational schools use the information system of the province to manage grades Suppose, we know whether parts of a db are complete, e.g., • “The grades from vocational schools are complete” • “The Math grades from primary schools are complete” Idea: Assess completeness of a query using completeness assertions for (parts of) tables ... primary schools took part in a survey of Math education Completeness of Queries over Incomplete Databases

  9. Reasoningabout Query Completeness Grades fromvocationalschoolsarecomplete I wanttoknow “Howmanypupilshave grade A in Math?“ Can I trustthequeryanswer? All Mathgrades fromprimaryschoolsarecomplete Biologygrades fromhigh schoolsarecomplete Youcannot, becauseinformationaboutpupilsfrom high schoolscouldbemissing Space of possible information Assertions about partial completeness Completeness of Queries over Incomplete Databases

  10. Reasoningabout Query Completeness (2) Grades fromvocationalschoolsarecomplete I wanttoknow“Howmanypupilsatvocationalschoolshave grade A in Math?“ Can I trustthequeryanswer? All Mathgrades fromprimaryschoolsarecomplete Biologygrades fromhigh schoolsarecomplete Youcan, because all neededinformationiscomplete in thedatabase Space of possible information Assertions about partial completeness Completeness of Queries over Incomplete Databases

  11. Research Questions: Howcanone ... … assert completeness of parts of a possibly incomplete database? • … formalize completeness of query answers? Grades fromvocationalschoolsarecomplete I wanttoknow“Howmanypupilsatvocationalschoolshave grade A in Math?“ Can I trustthequeryanswer? All Mathgrades fromprimaryschoolsarecomplete … implement such reasoning techniques? … infer completeness of query answers from such assertions? Biologygrades fromhigh schoolsarecomplete Youcan, because all neededinformationiscomplete in thedatabase Space of possible information Assertions about partial completeness Completeness of Queries over Incomplete Databases

  12. Interface Layer Data Layer MAGIK (= Managing Incomplete Knowledge) Output Input Evaluate SQL Queries (database mode) Schema constraints Gen. and Spec. Ideas: • Database queries = logical formulas • Completeness meta data = logical formulas • Analysis = logical inference • Implementation: using software for logical inferences TC-statements Query Results Query Completeness SQL queries Partially Complete Database Read Database Schema (database mode) Reasoning Layer Program Business Logic Read/Write Virtual Schema (virtual mode) Completeness Reasoner TuProlog DLV Engine Hibernate Store/Load TCs and Queries Meta-information Storage Completeness of Queries over Incomplete Databases

  13. RunningExample: Schema result(name, subject, grade) pupil(name, age, schoolName, schoolType) Completeness of Queries over Incomplete Databases

  14. Notation: Databases Database instances are sets of ground atoms, e.g., D = { result(Paul, Math, NULL), result(Giulia, Math, A), pupil(Paul, 17, Verdi, Voc) }, possibly containing NULLs. Completeness of Queries over Incomplete Databases

  15. Notation: Conjunctive Queries A single block SQL queries, possibly with DISTINCT, SELECT r.grade FROM result r, pupil p WHERE r.name = p.name ANDr.subject = ’Math’ ANDp.age <= 11 is expressed as a conjunctive query (CQ), using a Datalog rule: Q(g) :- result(n, Math, g), pupil(n, a, sn, st), a  11 Completeness of Queries over Incomplete Databases

  16. Notation: Conjunctive Queries (2) Q(x) :- L(x, y), M • L(x, y) conjunction of relational atoms • M conjunction of comparisons • x vector of distinguished (= output) variables • yvector of non-distinguished (= existential) variables Query answers (under set semantics): Q(D) = x L  D, ╞M Bag semantics: each  contributes a copy of x As a default, we assume set semantics Completeness of Queries over Incomplete Databases

  17. Possible Completeness Statements “We get complete answers to the following queries: • Which pupils have grade A in Math? • Which pupils from vocational schools have grade A in Math?Query Completeness Statements “The database contains • all subjects and grades of pupils from vocational schools • all subjects studied by pupils from vocational schools “ Table Completeness Statements Completeness of Queries over Incomplete Databases

  18. Formalization: IncompleteDatabase [Motro 1989] When talking about incompleteness, we need a complete reference An incomplete database Dis a pair of an ideal database Diand an available database Da D= (Di, Da) such that for each record in Da there is a “more informative” record in Di For databases w/o Nulls, this means DaDi Completeness of Queries over Incomplete Databases

  19. Example: An IncompleteDatabase Di Da result(Paul, Math, A) result(Giulia, Math, A) result(Maria, Math, A) pupil(Paul, 17, Hofer, Voc) pupil(Giulia, 15, Verdi, Sec) result(Paul, Math, A) result(Giulia, Math, NULL) pupil(Paul, 17, Hofer, Voc) less informativethan Completeness of Queries over Incomplete Databases

  20. Formalization: Query Completeness [Motro 1989] Query Q “The answer to Q is complete” Notation: Compl(Q) Semantics: (Di, Da) ╞ Compl(Q) iffQ(Di) = Q(Da) To be precise, we have to distinguish between set and bag semantics Completeness of Queries over Incomplete Databases

  21. Example: Query Completeness QMathA(n) :- result(n, Math, A) QMathA(Di) = {Paul, Giulia, Maria} QMathA(Da) = {Paul} QMathAis not complete over (Di, Da) Di Da result(Paul, Math, A) result(Giulia, Math, A) result(Maria, Math, A) pupil(Paul, 17, Hofer, Voc) pupil(Giulia, 15, Verdi, Sec) result(Paul, Math, A) result(Giulia, Math, NULL) pupil(Paul, 17, Hofer, Voc) Completeness of Queries over Incomplete Databases

  22. Example: Query Completeness (2) QMathAVoc(n) :- result(n, Math, A), pupil(n, a, sn, Voc) QMathAVoc(Di) = {Paul} QMathAVoc(Da) = {Paul} Di Da QMathAVocis completeover (Di, Da) result(Paul, Math, A) result(Giulia, Math, A) result(Maria, Math, A) pupil(Paul, 17, Hofer, Voc) pupil(Giulia, 15, Verdi, Sec) result(Paul, Math, A) result(Giulia, Math, NULL) pupil(Paul, 17, Hofer, Voc) Completeness of Queries over Incomplete Databases

  23. Table Completeness Statements: Idea This is a fulltuple-generatingdependency (TGD) “The tableresultcontains all resultsofpupilsfromvocationalschools“ means “If (n,s,g) is a resultrecord according to the ideal db, and (n, a, sn, Voc) is a pupilrecord in the ideal db, then (n,s,g) is in the resulttable of the available db” This canbeexpressedbytherule resulti(n,s,g), pupili(n, a, sn, Voc)  resulta(n, s, g) Wewritethistablecompletenessstatementas Compl( result(n, s, g) ; pupil(n, a, s, Voc)) Idea: an incompletedbsatisfiesthestatementiffitsatisfiestherule Completeness of Queries over Incomplete Databases

  24. Table Completeness Statements [Halevy 96] G may contain both, relational and built-in atoms A table completeness (TC) statement for a relation Ris an expression Compl(R(s1,…,sn) ; G) consisting of • an R-atom R(s1,…, sn) • a condition G such that R(s1,…, sn), G is safe. The TC-statement C = Compl(R(s1,…, sn) ; G) can be seen as a rule rC = Ri (s1,…, sn), GiRa (s1,…, sn) Semantics: (Di, Da) ╞ C iff(Di, Da) ╞ rC Completeness of Queries over Incomplete Databases

  25. Example: TC Statement Satisfaction resulti(n, s, g), pupili(n, a, sn, Voc)  resulta(n, s, g) holds over (Di, Da) becauseresult(Paul, Math, A) is in Da Di Da result(Paul, Math, A) result(Giulia, Math, A) result(Maria, Math, A) pupil(Paul, 17, Hofer, Voc) pupil(Giulia, 15, Verdi, Sec) result(Paul, Math, A) result(Giulia, Math, NULL) pupil(Paul, 17,Hofer, Voc) Completeness of Queries over Incomplete Databases

  26. The TC-QC ReasoningProblem I wanttoknow “HowmanypupilsatvocationalschoolshavetakenMath?“ Can I trustthequeryanswer? Grades fromvocationalschoolsarecomplete Set oftablecompletenessstatementsC QueryQ All Mathgrades fromprimaryschoolsarecomplete Biology grades from high schoolsarecomplete Space of possible information Assertions about partial completeness C╞ Compl(Q) ? Completeness of Queries over Incomplete Databases

  27. Reasoning: The Principle “Whichpupilsatvocationalschoolshad an A in Math?“ QMathAVoc(n) :- result(n, Math, A), pupil(n, sn, Voc) 1. AssumeQMathAVocreturnsn overDi 2. See whichfacts must be in Di result(n‘, Math, g‘)pupil(n‘, sn‘, Voc) Completeness of Queries over Incomplete Databases

  28. Reasoning: The Principle (2) 3. Usetablecompletenesstoderivefacts in Da “All resultsofpupilsatvocationalschoolsareavailable“ resulti(n, s, g), pupili(n, sn, Voc)  resulta(n, s, g) “All pupilsareavailable“ pupili(n, sn, st)  pupila(n, sn, st) Di Da result(n‘, Math, g‘) pupil(n‘, sn‘, Voc) result(n‘, Math, g‘) pupil(n‘, sn‘, Voc) Completeness of Queries over Incomplete Databases

  29. Reasoning: The Principle (3) 4. Query theavailabledatabase “Pupilsatvocationalschoolswith an A in Math“ QMathAVoc(Da) = {n'}    n' is also in Q(Da) Conclusion: QMathAVociscompletegiventhetablecompletenessstatements Da result(n‘, Math, g‘) pupil(n‘, sn‘, Voc) Completeness of Queries over Incomplete Databases

  30. TC-Transformation To C= Compl(R(s) ; G) we associate the query QC(s) :- R(s) , G and the transformation on db instances TC(D) :=  R(t) tQC(D)  For a set C of TC statements we define the transformation TC(D) := C  CTC(D) Completeness of Queries over Incomplete Databases

  31. TC-Transformations: Properties • (D, TC(D)) is an incomplete database • (D, TC(D)) ╞ C • (Di, Da) ╞ C iffTC(Di)  Da In other words: • (D, TC(D)) is the least incomplete database • with ideal dbD • that satisfies C Completeness of Queries over Incomplete Databases

  32. TC-QC Reasoning: Relational Case variables x, y considered as constants x, y Let • C set of relational TC statements • Q(x) :- L relational query • L := frozen version of L Theorem: C ╞ Compl(Q)iffxQ(TC(L)) What if C or Q contain comparisons? Completeness of Queries over Incomplete Databases

  33. Example: TC-QC with Comparisons Query: Qpupil(n) :- pupil(n, a, sn, st) C =  C10 : pupili(n, a, sn, st) , a  10  pupila(n, a, sn, st) C10 : pupili(n, a, sn, st) , a  10  pupila(n, a, sn, st) How can we chase L ={ pupil(n, a, sn, st) } with C10 , C10 ? Idea: Case analysis! Substitute “representative values” for a 10, a 10, a 10 Substitution yields: [a/9]L = { pupil(n, 9, sn, st) } to which we can apply C10 … • We retrieve n in all 3 cases • The cases cover all possibilities •  Q is complete wrtC Completeness of Queries over Incomplete Databases

  34. TC-QC Reasoningwith Comparisons Let • C set of TC statements with comparisons • Q(x) :- L, M •  set of representative value substitutions for C, Q Theorem: The following are equivalent • C ╞ Compl(Q) • xQ(TC(L)) for all  Completeness of Queries over Incomplete Databases

  35. Set Semantics vs. Bag Semantics Q(x) :- L query (Di, Da) ╞ Complset(Q) iff every answer of Q over Diis returned over Da, too iffL  Di ex.  s.th. L  Daand x = x (Di, Da) ╞ Complbag(Q) iff every answer of Q over Diis returned over Dathe same number of times iffL  Di L  Da “no assignments get lost” Completeness of Queries over Incomplete Databases

  36. TC-QC Reasoningfor Bag Semantics Let • C set of TC statements with comparisons • Q(x) :- L, M •  set of representative value substitutions for C, Q Theorem: C ╞ Complbag(Q)iff L TC(L) for all   Corollary: IfC has no comparisons, then: C ╞ Complbag(Q)iffL TC(L) Completeness of Queries over Incomplete Databases

  37. Complexity Classes of conjunctive queries: • CQ: Conjunctive queries with comparisons over dense orders • RQ: Relational conjunctive queries (i.e., without comparisons) • LCQ: Linear conjunctive queries (i.e., without self-joins) • LRQ: Linear relational conjunctive queries Completeness of Queries over Incomplete Databases

  38. TC-QCbag - Complexity Note, the axes are asymmetric: • NPappears with repeated relation symbols in the query • coNP appears with comparisons in the TC statements Completeness of Queries over Incomplete Databases

  39. TC-QCset - Complexity Intuition: the query has to be contained in the TC-statements . . . . . . but that does not explain it all Completeness of Queries over Incomplete Databases

  40. How Can One Implement Completeness Reasoning? Idea: Map reasoning tasks to a generic reasoner Candidate reasoners: • SMT (SAT modulo theories) solvers ? • encoding may be of exp. size for P2problems • Disjunctive Logic Programming with Answer Set Semantics ? • can express all P2problems • demo implementation for • conjunctive queries • finite domain constraints • keys and (acyclic) foreign keys Completeness of Queries over Incomplete Databases

  41. Completeness of Queries over Incomplete Databases

  42. Completeness on the Semantic Web Completeness of Queries over Incomplete Databases

  43. DBPedia Misses Some Facts … Completeness of Queries over Incomplete Databases

  44. IMDB Has Completeness Guarantees Completeness of Queries over Incomplete Databases

  45. If Completeness Info Were Available in RDF … Completeness of Queries over Incomplete Databases

  46. Federated Framework Completeness of Queries over Incomplete Databases

  47. Completeness of SPARQL Queriesover RDF Sources • Completeness statements in RDF • Reasoning algorithms for queries with • DISTINCT • OPT • over RDFS sources • Generation of queries with SERVICE calls over federated sources • Prototypical implementation using Apache Jena http://rdfcorner.wordpress.com Completeness of Queries over Incomplete Databases

  48. Verifying Query Completenessover Processes • Data often created following processes • Many processes are executed only partially formal (pen & paper, email, phone, …)  Valid information may be stored in databases with delays Database content is of questionable completeness Completeness of Queries over Incomplete Databases

  49. Enrolment Process in a School Database query: How many pupils? 0 Is that correct? Database query: How many pupils? 137 Is that correct? Completeness of Queries over Incomplete Databases 49

  50. Observation • At some points, new facts in the real worldhave not yet been stored queries may give wrong answers • At other points, all facts that hold in the real world have been stored  queries give correct answers Completeness of Queries over Incomplete Databases

More Related