A Type System for a Semistructured and XML Data Base Management System. Ph. D. Thesis Proposal Dario Colazzo. Thesis Goals. Formal development and study of a type system for XML querying. Implementation of a concrete type system for an XML data base management system: the Xtasy system.
A Type System for a Semistructured and XML Data Base Management System Ph. D. Thesis Proposal Dario Colazzo
Thesis Goals • Formal development and study of a type system for XML querying • Implementation of a concrete type system for an XML data base management system: the Xtasy system
Presentation outline • Semistructured data and XML • Data models • Type languages: DTD, XML Schema • Querying XML data: Tequyla • Processing XML data: XDuce • Thesis goals
Semistructured data • Irregular and unstable structure • Self-describing representation • No separate schema information: few guarantees of reliability and efficiency of applications
OEM graph (figure) • Root addrbook with two person edges • First person: name "Dario Colazzo", age 30, addr "Pisa" • Second person: name (with first "Carlo" and second "Sartiani"), age 30, email "sartia@xyz.com"
XML syntax <addrbook> <person> <name>Dario Colazzo</name> <addr>Pisa</addr> </person> <person> <name> <first>Carlo </first> <second>Sartiani</second> </name> <addr>Pisa</addr> <email>sartia@xyz.com</email> </person> </addrbook>
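The self-describing, irregular structure of this document can be inspected with a few lines of Python. This is a minimal sketch using the standard `xml.etree.ElementTree` module; the helper name `child_labels` is an assumption made for illustration and is not part of the thesis.

```python
import xml.etree.ElementTree as ET

# The address book from the slide, laid out over several lines.
ADDRBOOK_XML = """
<addrbook>
  <person>
    <name>Dario Colazzo</name>
    <addr>Pisa</addr>
  </person>
  <person>
    <name><first>Carlo</first><second>Sartiani</second></name>
    <addr>Pisa</addr>
    <email>sartia@xyz.com</email>
  </person>
</addrbook>
"""

def child_labels(element):
    """Return the tags of the direct children, in document order."""
    return [child.tag for child in element]

root = ET.fromstring(ADDRBOOK_XML)
# One list of child tags per person: the two persons have different shapes.
shapes = [child_labels(person) for person in root.findall("person")]
print(shapes)
```

The two `person` elements yield different tag lists, which is the kind of irregularity a type system for semistructured data has to describe.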
Attributes and element reference <db> <state id="01"> <name>Italy</name> <code>IT</code> </state> ....... <city region="Toscana" state-of="01"> <name>Italy</name> <code>PI</code> </city> </db>
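A reference attribute such as `state-of` can be followed by indexing elements on their `id` attribute. The sketch below assumes that `id` is the only identifier attribute and that `state-of` holds a single reference; the function name `resolve_state` is hypothetical, and the element values are copied from the slide as they stand.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide (elided entries omitted).
DB_XML = """
<db>
  <state id="01"><name>Italy</name><code>IT</code></state>
  <city region="Toscana" state-of="01"><name>Italy</name><code>PI</code></city>
</db>
"""

def resolve_state(db, city):
    """Return the element whose id matches the city's state-of attribute, or None."""
    index = {el.get("id"): el for el in db.iter() if el.get("id") is not None}
    return index.get(city.get("state-of"))

db = ET.fromstring(DB_XML)
city = db.find("city")
state = resolve_state(db, city)
print(state.find("code").text)
```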
XML Query Data Model • Based on a forest of node-labeled trees (a set of documents) • Several kinds of nodes: • element node • attribute node • value node • Identifier and reference attributes are modeled as general attributes
XML Tree (figure) • The addrbook document drawn as a node-labeled tree, with a legend distinguishing element nodes, attribute nodes and value nodes • The two person nodes carry name, addr, age and email nodes, with the values "Dario Colazzo", "Pisa", 30, "sartia@xyz.com", "Carlo" and "Sartiani"
XML schema languages • Document Type Definitions (DTDs): schemas as grammars for documents. Regular type expressions • XML Schemas: closer to traditional type languages
DTD • Regular type expressions: • T | U union • T,U sequence • T* zero or more • T? zero or one • X=T[X] recursive definitions • coupled-tag element declarations • global definitions • only one base type: string (PCDATA) • no type reuse
DTD, example <!DOCTYPE addrbook [ <!ELEMENT addrbook (person*)> <!ELEMENT person (name, addr, tel?)> <!ELEMENT name (#PCDATA)> <!ELEMENT addr (#PCDATA)> <!ELEMENT tel (#PCDATA)> ]> • person*: zero or more • tel?: zero or one
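Because DTD content models are regular expressions over child element names, conformance of a single element can be checked with an ordinary regex. The following sketch hard-codes the content models of the example; encoding each child tag followed by a space is a choice made for this illustration, not something a DTD validator prescribes.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the example DTD, as regexes over space-terminated tags.
CONTENT_MODELS = {
    "addrbook": r"(person )*",       # person*
    "person": r"name addr (tel )?",  # name, addr, tel?
}

def conforms(element):
    """Check an element and its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # Elements without an entry are treated as #PCDATA leaves.
        return len(element) == 0
    word = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, word) is not None and all(
        conforms(child) for child in element
    )

valid = ET.fromstring(
    "<addrbook><person><name>A</name><addr>Pisa</addr></person></addrbook>"
)
invalid = ET.fromstring(
    "<addrbook><person><name>A</name><addr>Pisa</addr>"
    "<email>a@xyz.com</email></person></addrbook>"
)
print(conforms(valid), conforms(invalid))
```

The second document is rejected because `email` is not allowed by the content model of `person`.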
XML Schema • decoupled-tag: elements and types may be defined separately • local definitions • base types: integers, string, decimal,... • type reuse: • type refining • type extension with subtyping
XML Schema, example <xsd:complexType name="person"> <xsd:sequence> <xsd:element name="name" type="xsd:string" /> <xsd:element name="age" type="xsd:ageType"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="newPerson" base="person" derivedBy="extension"> <xsd:element name="car" type="xsd:string" /> </xsd:complexType>
Querying XML data • XML querying is based on the use of patterns to select portions of a document • Untyped query languages: • XQL • XML-QL • Quilt • Typed: • Tequyla • XDuce (functional language) • Forthcoming W3C query language: probably Quilt
Tequyla • SQL-like query language • free nesting of queries • typed: • query correctness • query typing • Currently: only non-algorithmic definitions, and weak subtyping
Tequyla queries • The body of a Tequyla query is a from clause composed of XPath patterns • x=addressbook.xml; • binds x to the root element of addressbook.xml • y in x//person/addr • starting from the root (x), search for a person element at an arbitrary depth (//), then for an addr sub-element (/), and finally bind the node found to y
A Tequyla query Q = from x=addressbook.xml; y in x//person/addr; z in x//person/name; where y="Pisa" select nome[z] • The bindings in the from clause are XPath patterns
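The query can be approximated in Python with `xml.etree.ElementTree`, whose limited XPath support covers `//` and `/` steps. This sketch assumes that the from clause behaves like nested iteration, so `y` and `z` are bound independently and one `nome` element is produced per (y, z) pair; that reading is an assumption of the example, not a statement of the Tequyla semantics.

```python
import xml.etree.ElementTree as ET

ADDRBOOK_XML = """
<addrbook>
  <person><name>Dario Colazzo</name><addr>Pisa</addr></person>
  <person>
    <name><first>Carlo</first><second>Sartiani</second></name>
    <addr>Pisa</addr>
    <email>sartia@xyz.com</email>
  </person>
</addrbook>
"""

def run_query(xml_text):
    """from x=...; y in x//person/addr; z in x//person/name; where y="Pisa" select nome[z]"""
    x = ET.fromstring(xml_text)                # x: root element of the document
    results = []
    for y in x.findall(".//person/addr"):      # y in x//person/addr
        for z in x.findall(".//person/name"):  # z in x//person/name
            if (y.text or "").strip() == "Pisa":
                nome = ET.Element("nome")      # select nome[z]
                nome.append(z)
                results.append(nome)
    return results

results = run_query(ADDRBOOK_XML)
print(len(results))
```

Under this reading the result contains every name once for each matching address, since the two patterns are not correlated through a shared person variable.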
XDuce • Typed functional language • Regular expression types • Type-based pattern language
XDuce schema • A schema is a set of type definitions E= { Addressbook = addrbook [(Name, Addr, Tel?) *] Name = name [String] Addr = addr[String] Tel = tel[String] }
An XDuce function: telephone list • Consider T= (Name, Addr,Tel?) in fun mkTelList : T* --> (Name,Tel)* = name[n], addr[a], tel[t], rest:T* --> name[n],tel[t], mkTelList(rest) | name[n], addr[a], rest: T* --> mkTelList(rest) | () --> ()
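The same recursion can be mirrored in Python. This sketch assumes each entry is a tuple `(name, addr, tel)` where `tel` is `None` when the optional element is missing; the three branches correspond to the three patterns of `mkTelList`. The telephone number in the usage example is made up.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, str, Optional[str]]  # (name, addr, tel?) as assumed above

def mk_tel_list(entries: List[Entry]) -> List[Tuple[str, str]]:
    """Keep (name, tel) for entries that have a telephone number."""
    if not entries:                       # () --> ()
        return []
    (name, _addr, tel), rest = entries[0], entries[1:]
    if tel is not None:                   # name[n], addr[a], tel[t], rest
        return [(name, tel)] + mk_tel_list(rest)
    return mk_tel_list(rest)              # name[n], addr[a], rest

book = [
    ("Dario Colazzo", "Pisa", None),
    ("Carlo Sartiani", "Pisa", "555-0100"),  # made-up number
]
print(mk_tel_list(book))
```

Python does not check statically that the result has type `(Name,Tel)*`; in XDuce that guarantee comes from the type checker.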
XDuce subtyping: language inclusion • XDuce provides a simple but rather powerful notion of subtyping based on inclusion between sets of values • Examples • Name, Addr <: Name, Addr,Tel? • Name, Addr,Tel <: Name, Addr,Tel? • XML Schema extension subtyping is not captured
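For types without repetition the set of values is finite, so inclusion can be checked by plain enumeration. The sketch below uses a tuple encoding invented for this example (`lab`, `opt`, `alt`, `seq`) and works on sequences of element labels only; it is not how XDuce decides subtyping, which relies on tree automata techniques.

```python
def lang(t):
    """Enumerate the label sequences denoted by a star-free type."""
    kind = t[0]
    if kind == "lab":                     # a single element type, e.g. Name
        return {(t[1],)}
    if kind == "opt":                     # T?
        return {()} | lang(t[1])
    if kind == "alt":                     # T | U
        return lang(t[1]) | lang(t[2])
    if kind == "seq":                     # T1, T2, ..., Tn
        result = {()}
        for part in t[1:]:
            result = {a + b for a in result for b in lang(part)}
        return result
    raise ValueError(kind)

def is_subtype(t, u):
    """t <: u iff every value of t is also a value of u."""
    return lang(t) <= lang(u)

NAME, ADDR, TEL = ("lab", "Name"), ("lab", "Addr"), ("lab", "Tel")
name_addr = ("seq", NAME, ADDR)
name_addr_tel = ("seq", NAME, ADDR, TEL)
name_addr_opt_tel = ("seq", NAME, ADDR, ("opt", TEL))

print(is_subtype(name_addr, name_addr_opt_tel))
print(is_subtype(name_addr_tel, name_addr_opt_tel))
```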
Type language • As expressive as DTD and XML Schema • Base types • Attributes and id/idref types • Type refining and extension • Local type definitions • Unordered sequence types
Schema extraction and schema inference • For untyped data, a schema will be inferred according to the XML Schema style • For typed XML data, the schema will be converted into the internal schema representation • Type inference for query results
Data conformity • An algorithm will be defined to check data conformity to a schema • The problem is EXPTIME-complete • Optimization techniques exist • Further ones have to be found to deal with unordered sequence types and id/idref types
Query correctness • Only type correct queries will be executed • Type correctness is based on successful matching between the query structural requirements and the type of the data to be queried
Correct queries, an example (1/2) Consider E= { Addressbook = addrbook [Person*] Person = (Name, Addr, Tel?) Name = name [String] Addr = addr[String] Tel = tel[String] }
Correct queries, an example (2/2) • A correct query: Q = from x=addressbook.xml; y in x//person/addr; z in x//person/name; where y="Pisa" select nome[z]
Correctness & union types • Consider: Q’ = from x=addressbook.xml; y in x//person/addr; z in x//person/tel; where y="Pisa" select results[z] • Should we consider this query correct?
Correctness & union types: existential approach • The previous query is considered correct • The user will be warned about optional elements required by patterns
Total approach • The previous query is considered incorrect • Too severe a discipline • Many queries with non-empty results would be cut off
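The difference between the two approaches can be made concrete on a simplified model. The sketch below assumes a content model is a flat sequence of elements, each marked optional or mandatory; a path step is existentially correct if some value of the type contains the element, and totally correct if every value does. The function names are hypothetical.

```python
# Person = (Name, Addr, Tel?) as a list of (label, is_optional) pairs.
PERSON = [("name", False), ("addr", False), ("tel", True)]

def existentially_correct(content, label):
    """Some value of the type has a child with this label."""
    return any(l == label for l, _optional in content)

def totally_correct(content, label):
    """Every value of the type has a child with this label."""
    return any(l == label and not optional for l, optional in content)

for step in ("addr", "tel", "email"):
    print(step, existentially_correct(PERSON, step), totally_correct(PERSON, step))
```

Here `person/tel` passes the existential check but fails the total one, while `person/email` fails both, which is the case a type checker should always reject.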
Type equivalences • Several type equivalence laws will be considered • In particular: • (T | U) , S = (T , S) | (U , S) • Useful to simplify schema definitions
Subtyping • A subtype relation E <: E’ will be defined such that: • If a query Q is correct wrt E’ then it is also correct wrt E • Type extension will be supported: if E is an extension of E’ then E <: E’
Parametric polymorphism (1/3) • Used in some functional languages (e.g. ML and Haskell) to define generic functions, for example: function Sort (t: Type; L: List t; Ord: t × t → Bool): List t begin ..... end. • It will allow us to define generic queries
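A generic sort of this shape can be written in Python with a type variable. The sketch assumes `Ord` is a binary predicate meaning "comes before or is equal to"; the name `sort_by` and that reading of `Ord` are assumptions, since the slide leaves the body unspecified.

```python
from functools import cmp_to_key
from typing import Callable, List, TypeVar

T = TypeVar("T")  # plays the role of the type parameter t

def sort_by(items: List[T], ord_: Callable[[T, T], bool]) -> List[T]:
    """Sort items of any type t using the ordering predicate ord_."""
    def compare(a: T, b: T) -> int:
        if ord_(a, b) and not ord_(b, a):
            return -1
        if ord_(b, a) and not ord_(a, b):
            return 1
        return 0
    return sorted(items, key=cmp_to_key(compare))

print(sort_by([3, 1, 2], lambda a, b: a <= b))
print(sort_by(["Pisa", "Bari"], lambda a, b: a <= b))
```

The same definition is used at type `int` and at type `str`, which is the kind of reuse generic queries aim for.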
Parametric polymorphism (2/3) • Parametric types fit well in the description of irregular data structures • For example E(t)= {Addressbook = addrbook [(Name, Addr, Tel?) *] Name = name [String] Addr = addr[t] Tel = tel[String]} • The content of addr elements can have, for example, a street and a city sub-element
Parametric polymorphism (3/3) • A generic query: Q = t: Type; a : E(t) . from x= a ; y in x//person/addr; z in x//person/name; where z="dario" select indirizzo[y] • More precise typing: the type Any* is different from t*
Conclusions • The type system will provide: • union types • reference types • recursive types • subtyping • parametric polymorphism
Presentation outline • Proposal • What has been done • Ongoing and future work
Thesis Goals • Formal development and study of a type system for XML querying • The query language is an abstract version of XQuery (W3C) • The type language is expressive enough to capture the essence of current standards
XQuery type system • Only result analysis: the XQuery type system is defined to determine and check, at query-analysis time, the output type of a query on documents conforming to an expected input type. • Query correctness is not defined and checked (only some ideas).
What has been done • We have: • formally defined the notion of query type correctness • defined a type system to statically check it and to perform result analysis; the rules define a terminating algorithm. • introduced an approach to recursive types that is an alternative to the XQuery one
Observations • Our type system also performs query analysis and, in this respect, presents some differences wrt the XQuery approach • So far, we have considered a type system featuring product, union and recursive types • We have discovered that these type mechanisms are sufficient to make the study interesting and (as we will see) rather subtle.
Observations • We have discovered that, for particular queries (fortunately infrequent ones), the type system is not able to exactly capture the semantic characterization of correctness • We have introduced a further notion of correctness, path covering, and provided rules to check this property
Papers • A first definition of the type system can be found in A Typed Text Retrieval Query Language for XML Documents, Journal of the American Society for Information Science and Technology (JASIS), Special Issue 2001 • In Types for Correctness of Queries over Semistructured Data, the system has been improved by a finer notion of query correctness and by the notion of path covering. The work will be submitted to the WebDB 2002 workshop
Tequyla (or µXQuery) • SQL-like query language • free nesting of queries • typed: • type conformance of data • query correctness • query typing (result analysis)
Tequyla queries • The body of a Tequyla query is a from clause composed of XPath patterns • x=addressbook.xml; • binds x to the root element of addressbook.xml • y in x//person/addr • starting from the root (x), search for a person element at an arbitrary depth (//), then for an addr sub-element (/), and finally bind the node found to y
Types • T, U ::= () (empty sequence) | B (atomic type: char, int, …) | T + U (union) | T; U (sequence) | l[T] (element type) | X (type name) • Type environments: type definitions + type bindings for query free variables • E ::= () | X=T, E | x:X, E
A type environment • E= Addressbook= addrbook [ Person*], Person= person[Name, Addr, (Tel +EMail)], Name = name [String], Addr = addr[String], Tel= tel[String], EMail= email[String], x: Addressbook
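To illustrate what conformance of data to such an environment means, here is a small backtracking checker written against the type grammar of the previous slide. It is a sketch under several assumptions: types are encoded as tagged tuples, `Person*` is encoded through an auxiliary recursive name `Persons`, an element is a pair `(label, children)`, and text is a Python string. It is not the algorithm of the thesis and makes no attempt at efficiency. The telephone number is made up.

```python
def seq(*ts):
    """Build the right-nested sequence T1; T2; ...; Tn."""
    result = ("empty",)
    for t in reversed(ts):
        result = ("seq", t, result)
    return result

def ref(x):
    return ("name", x)

ENV = {
    "Addressbook": ("elem", "addrbook", ref("Persons")),
    # Person* encoded by recursion: Persons = () + (Person; Persons)
    "Persons": ("union", ("empty",), ("seq", ref("Person"), ref("Persons"))),
    "Person": ("elem", "person",
               seq(ref("Name"), ref("Addr"), ("union", ref("Tel"), ref("EMail")))),
    "Name": ("elem", "name", ("base", "String")),
    "Addr": ("elem", "addr", ("base", "String")),
    "Tel": ("elem", "tel", ("base", "String")),
    "EMail": ("elem", "email", ("base", "String")),
}

def ends(t, forest, i):
    """Return all positions j such that forest[i:j] has type t.

    Terminates for environments like ENV, where every recursive use of a
    name is preceded by a type that consumes at least one node.
    """
    kind = t[0]
    if kind == "empty":
        return {i}
    if kind == "base":
        return {i + 1} if i < len(forest) and isinstance(forest[i], str) else set()
    if kind == "union":
        return ends(t[1], forest, i) | ends(t[2], forest, i)
    if kind == "seq":
        return {k for j in ends(t[1], forest, i) for k in ends(t[2], forest, j)}
    if kind == "elem":
        if i < len(forest) and isinstance(forest[i], tuple):
            label, children = forest[i]
            if label == t[1] and len(children) in ends(t[2], children, 0):
                return {i + 1}
        return set()
    if kind == "name":
        return ends(ENV[t[1]], forest, i)
    raise ValueError(kind)

def conforms(forest, type_name):
    """True if the whole forest has the type bound to type_name in ENV."""
    return len(forest) in ends(ref(type_name), forest, 0)

good = [("addrbook", [("person", [("name", ["Dario Colazzo"]),
                                   ("addr", ["Pisa"]),
                                   ("tel", ["555-0100"])])])]  # made-up number
bad = [("addrbook", [("person", [("name", ["Dario Colazzo"]),
                                  ("addr", ["Pisa"])])])]
print(conforms(good, "Addressbook"), conforms(bad, "Addressbook"))
```

The second document is rejected because `Person` requires either a `tel` or an `email` element after `addr`.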