1 / 26

Extracting Schema From Data

Extracting Schema From Data. The difference between schemas for semistructured data and traditional schemas is that a given semistructured data can have more than one schema .

umed
Download Presentation

Extracting Schema From Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Extracting Schema From Data • The difference between schemas for semistructured data and traditional schemas is that a given semistructured data can have more than one schema . • Given a semistructured data, compute automatically some schema for it, given several possible answers, we want the schema that best describes the structure of that particular data.This is called Schema Extraction.

  2. Schema Extraction for schema graphs • Schema Extraction for Datalog Typings

  3. Data Guides • Our goal is to construct a new OEM graph that is a finite description of the list of paths. This is called Data Guide. The two properties to be fulfilled: • Accurate : Every path in the data occurs in the data guide, and every path in the data guide occurs in the data. • Concise : Every path occurs exactly once.

  4. We proceed as follows: The Data Guide will have a root node, call it Root. Next we examine one by one each path in the list and add new nodes to the data guide, as needed: • employee • employee.name • employee.manages • employee.manages.managedby • employee.manages.managedby.manages • employee.manages.managedby.manages.managedby • company

  5. Root &r Employees &p1,&p2,&p3,&p4 &p5,&p6,&p7,&p8 employee position name company phone managedby manages managedby Regular &p2,&p3,&p5 &p7,&p8 Boss &p1,&p4,&p6 worksfor worksfor name phone manages position name Company &c worksfor name A Data Guide

  6. Root employee manages company Emp worksfor managedby name Comp position phone name Schema graph

  7. Simulation between a data graph and a data guide Node in data graph Node in data guide &r Root &p1, &p2, &p3, &p4, &p5, Employee &p6, &p7, &p8 &p1, &p4, &p6 Boss &p2, &p3, &p5, &p7, &p8 Regular &c Company Simulation from the data guide to the schema graph Node in data guide Node in schema graph Root Root Employee Emp Boss Emp Regular Emp Company Comp

  8. This construction of the data guide resembles the technique to transform a nondeterministic finite state automaton into a deterministic one . • The data guide is the most specific schema graph for that data with the following features: • The data guide is a deterministic schema graph. • Any other deterministic schema graph to which our data conforms subsumes the data guide.

  9. A nondeterministic schema Root&r employee employee company managedby Regular &p2,&p3,&p5 &p7,&p8 Boss &p1, &p4,&p6 worksfor manages name name phone Comp &c worksfor

  10. Extracting Datalog rules from data We have a semistructured data instance and want to extract automatically the most specific typing given by a set of Datalog rules. We create one predicate for each complex value object in the data. We create the following predicates: pred_r, pred_c, pred_p1, pred_p2, pred_p3, pred_p4, pred_p5, pred_p6, pred_p7, pred_p8 corresponding to the objects &r, &c, &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8 .

  11. Next we write a set of Datalog rules defining each predicate based exactly on the outgoing edges of its corresponding object: pred_r(X) :- ref(X, company, Y), pred_c(Y), ref(X, employee, Z1), pred_p1(Z1), …… ref(X, employee, Z8), pred_p8(Z8) pred_c(X) :- ref(X, name, N), string(N) pred_p1(X) :- ref(X, worksfor, Y), pred_c(Y), ref(X, name, N) , string(N), ref(X, phone, P), string(P), ref(X, manages, Z), pred_p2(Z), ref(X, manages, U) , pred_p3(U)

  12. pred_p2(X) :- ref(X , worksfor, Y) , pred_c(Y), ref(X, name, N) , string(N), ref(X, manageby, Z), pred_p1(Z) pred_p3(X) :- ….. …… We have to compute the largest fixpoint of the Datalog program on the given data.

  13. Object Predicate pred_r pred_c pred_p1 pred_p2 pred_p3 pred_p4 pred_p5 pred_p6 pred_p7 pred_p8 &r &c, &p1, &p2, &p3, &p4,&p5, &p6, &p7, &p8 &p1 &p2, &p3, &p5, &p7, &p8 &p3, &p5, &p7, &p8 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p3, &p5, &p7, &p8 Extents of predicates after one iteration

  14. Object Predicate pred_r pred_c pred_p1 pred_p2 pred_p3 pred_p4 pred_p5 pred_p6 pred_p7 pred_p8 &r &c, &p1, &p2, &p3, &p4,&p5, &p6, &p7, &p8 &p1 &p2, &p3 &p3 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p3, &p5, &p7, &p8 Extents of predicates after two iterations

  15. We obtain the following Datalog rules: Root(X) :- ref(X, company, Y), Company(Y), ref(X, employee, Z1), Boss1(Z1), ref(X, employee, Z2), Boss2(Z2), ref(X, employee, U1), Regular1(U1),…….., ref(X, employee, U3), Regular3(U3) Company :- ref(X, name, N), string(N) Boss1(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, phone, P), string(P), ref(X, manages,Z), Regular1(Z), ref(X, manages, U), Regular2(U) Boss2(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, phone, P), string(P), ref(X, manages,Z), Regular3(Z) Regular1(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, managedby, Z), Boss1(Z) Regular2(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, position, P), string(P), ref(X, managedby, Z), Boss1(Z) Regular3(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, position, P), string(P), ref(X, managedby, Z), Boss2(Z)

  16. Inferring Schemas From Queries Some semistructured data instances are the result of queries. Query Result Schema Inferring The following query takes a bibliography file and constructs a homepage for every author: where bib -> L -> X, X - > “author” -> A, X -> “title” -> T, X -> “year” -> Ycreate Root( ), HomePage(A), YearEntry(A,Y), PageEntry(X)link Root() -> “person” -> HomePage(A), Homepage(A) -> “year” -> YearEntry(A,Y) YearEntry(A,Y) -> “paper” -> PaperEntry(X) PaperEntry(X) -> “title” -> T, PaperEntry(X) -> “author” -> HomePage(A), PaperEntry(X) -> “year” -> Y

  17. Root person person Homepage (“smith”) Homepage (“Jones”) author author author year year year YearEntry (“smith”,1997) YearEntry (“smith”,1995) YearEntry (“smith”,1997) author paper paper paper paper PaperEntry (o423) PaperEntry (o552) PaperEntry (o153) year title year year title title

  18. Schema graph inferred from the query The schema will have one class for each function , and one edge for each line in the link clause. Root person HomePage year YearEntry author paper PaperEntry title year

  19. where • create Root( ), F(X), F(Y), G(X), H(Y) • link Root( ) -> “A” -> F(X), F(X) -> “C” -> G(X), Root( ) -> “B” -> F(Y), F(X) -> “D” -> H(Y) For the following example: We reach the following schema: Root: {A : F, B : F} F : {C : G, D : H}

  20. Path Constraints • In Relational Databases • in RDB, the relational declaration tell us more than the types • imposes a key constraint so that no two tuples have the same key Example Create table Employees ( Emp Id: integer, EmpName: char(30), DeptId: integer, … primary key(EmpId), foreign key(DeptId) references Departments ) Create table Departments( DeptID: integer, Dname: char(10), …… primary key(DeptId) )

  21. In Object-Oriented Databases • Interface Publication extent publication { attribute String title; attribute Date date; relationship set<Author> auth --->inclusion constraints inverse Author::pub; --->inverse relationship } • Interface Author extent author { attribute String title; attribute String address; relationship set<Publication> pub --->inclusion constraints inverse Publication::auth; --->inverse relationship }

  22. Inclusion constrainsts: • For any publication p, the set p.auth is a subset of the set author. Similarly, for any author a, the set a.pub is a subset of publication. Inverse relationships: • For any publication p, and for any author a in p.auth, p is a member of a.pub . • For any author a, and for any publication p in a.pub, a is a member of p.auth .

  23. Illustration of path constraints on semistructured data r publication publication author author auth auth auth pub pub pub title date name address title date name address ... ... ... ... ... ... ... ...

  24. In semistructured data • inclusion constraint is expressed as follows • p (a (author(r,a)  pub(a,p)) -> publication(r,p)) • The general form of an inclusion constraint is • x ((r,x)) -> (r,x)) • inverse relationship is • p ( publication(r,p) ->  a(auth(p,a) -> pub(a,p))) The general form of this constraint is x ((r,x)) -> y((x,y)-> (y,x)))

  25. Constraints are also important in Query Optimization. Here is an example: Select row: P2from r.publication P1, r.publication P2, P1.auth Awhere “Database Systems” in P1.title and A in P2.auth Select row: P’from r.publication P, P.auth A, A.pub P’where “Database Systems” in P.title The query plan implicit in the first one requires two iterations over publication - with P1,P2 - whereas the second requests only one iteration - with P .

More Related