Capturing Semantics in XML Documents. Tok Wang Ling Department of Computer Science National University of Singapore. Roadmap. XML documents and current XML schema languages ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [4] The applications of ORA-SS
Capturing Semantics in XML Documents Tok Wang Ling Department of Computer Science National University of Singapore KDXD 2006, Singapore
Roadmap
• XML documents and current XML schema languages
• ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [4]
• The applications of ORA-SS
• Discovering Semantics in XML documents
• Conclusion

[4] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.
1. XML – Brief introduction
• XML (eXtensible Markup Language) is
  • Released by W3C
  • A simplified subset of SGML
  • A promising standard for publishing, integrating and exchanging data on the web
• XML schema languages
  • DTD (Document Type Definition) [3]
  • XSD (XML Schema Definition), the W3C recommended standard [6, 7, 8]

[3] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation, 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[6] XML Schema Part 0: Primer Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[7] XML Schema Part 1: Structures Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[8] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/
1. XML – A motivating example
• Suppose we have an XML document “psj.xml” about parts, suppliers and projects, where
  • The document has a root element psj;
  • Under psj, there is a sequence of part elements;
  • Under part, there is a sequence of supplier elements;
  • Under supplier, there is a sequence of project elements.
Example 1. psj.xml

<?xml version="1.0" encoding="UTF-8"?>
<psj xmlns:xsi="…" xsi:noNamespaceSchemaLocation="…">
  <part>
    <pno>P001</pno> <pname>Nut</pname> <color>Silver</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>5</price>
      <project> <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>60</qty> </project>
      <project> <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>650</qty> </project>
    </supplier>
    <supplier>
      <sno>S002</sno> <sname>Beta</sname> <city>Atlanta</city> <city>New York</city> <price>5.5</price>
      <project> <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>70</qty> </project>
      <project> <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>50</qty> </project>
    </supplier>
  </part>
  … …
  <part>
    <pno>P002</pno> <pname>Nut</pname> <color>Copper</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>4.6</price>
      <project> <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>60</qty> </project>
    </supplier>
    <supplier>
      <sno>S003</sno> <sname>Beta</sname> <city>New York</city> <price>5</price>
      <project> <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>20</qty> </project>
      <project> <jno>J004</jno> <jname>Blue fireworks</jname> <budget>20000</budget> <qty>50</qty> </project>
    </supplier>
  </part>
</psj>
1. XML – the DTD of “psj.xml”

(a) “psj.dtd”, the DTD of “psj.xml”

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

(b) psj.dtd as a Data Guide

▼♦ psj
  ▼♦ part
    ♦ pno
    ♦ pname
    ♦ color
    ▼♦ supplier
      ♦ sno
      ♦ sname
      ♦ city
      ♦ price
      ▼♦ project
        ♦ jno
        ♦ jname
        ♦ budget
        ♦ qty
1. XML – what the DTD says
• A DTD is a simple definition of an XML document, in which users can define
  • Element/attribute types
  • Occurrence constraints (e.g. ?, +, *)
  • Containment among different element types (the structure)
• A DTD cannot express
  • Numeric occurrence constraints (e.g. 2 to 8)
  • Uniqueness/key constraints on a combination of attributes/elements (in a DTD, an ID can be declared on only one attribute at a time)
  • Relationship types among elements and their degrees
  • The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type. Simple elements are element types that contain only PCDATA and have no attribute types.
1. XML – XSD
“psj.xsd”, the XSD schema of the motivating example data.

<xs:schema xmlns:xs="…">
  <xs:element name="psj">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="part" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pno" type="xs:string"/>
              <xs:element name="pname" type="xs:string"/>
              <xs:element name="color" type="xs:string"/>
              <xs:element name="supplier" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sno" type="xs:string"/>
                    <xs:element name="sname" type="xs:string"/>
                    <xs:element name="city" type="xs:string" maxOccurs="unbounded"/>
                    <xs:element name="price" type="xs:string"/>
                    <xs:element name="project" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="jno" type="xs:string"/>
                          <xs:element name="jname" type="xs:string"/>
                          <xs:element name="budget" type="xs:string"/>
                          <xs:element name="qty" type="xs:string"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PK">
      <xs:selector xpath="part"/>
      <xs:field xpath="pno"/>
    </xs:key>
  </xs:element>
</xs:schema>

• maxOccurs="unbounded" is the XSD definition of an element occurrence constraint.
• The xs:key element is the XSD definition of a key constraint. It requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.
1. XML – what XSD can tell
• XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors. It
  • has an extensible XML syntax,
  • supports more data types (user-defined types and 44 built-in types),
  • can represent uniqueness/key constraints for both attribute types and element types,
  • and has many other improvements over DTD.
1. XML – XSD still has flaws
XSD is not sufficient for expressing the relational semantics in XML data, for example:
• A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD, and it is quite different from the key constraint in relational databases.
  • E.g. in the previous XSD, the values of the key attribute pno of part must be unique within the set of part elements in the whole document.
  • Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs), because the same supplier or project may legitimately appear under several parent elements.
1. XML – XSD still has flaws (cont.)
• The key element must contain the following (in order):
  • One and only one selector element
    • contains an XPath expression that specifies the set of elements across which the values specified by the fields must be unique
  • One or more field elements
    • each contains an XPath expression that specifies the values that must be unique within the set of elements specified by the selector element
• The key constraint is similar to the unique constraint, except that the columns on which a unique constraint is defined can have null values.
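The scope problem of the key element can be checked on the data itself. The following is a minimal Python sketch, assuming an abbreviated copy of Example 1 and a hypothetical helper `duplicates` that only mimics the uniqueness part of a key with a simple selector and field path; it is not an XSD validator.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated copy of Example 1 (only the elements needed here).
DOC = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S002</sno></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S003</sno></supplier>
  </part>
</psj>"""


def duplicates(root, selector, field):
    """Return the field values that occur more than once among the selected elements.

    Hypothetical helper: it imitates the uniqueness check of a key whose
    selector and field are the given simple paths, evaluated from `root`.
    """
    values = [el.findtext(field) for el in root.findall(selector)]
    return sorted(v for v, n in Counter(values).items() if n > 1)


root = ET.fromstring(DOC)
print(duplicates(root, "part", "pno"))           # [] : pno is unique in the document
print(duplicates(root, "part/supplier", "sno"))  # ['S001'] : sno is not
```

A document-wide key on sno would therefore reject valid data, because supplier S001 supplies both P001 and P002.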
1. XML – XSD still has flaws (cont.)
• XSD does not support relationship types and other relational semantic constraints.
  • E.g. the ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.
• XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
  • E.g. price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.
Reconsider the semantics in Example 1
• The XML data in Example 1 (psj.xml) is a typical data-centric XML document, derived from structured data that is usually stored in relational or object-relational databases.
• The semantics of the data in Example 1 can be described by the following ER diagram.
[Figure: The ER diagram of the data in Example 1.]
One of the object-relational database representations of psj.xml
[Figure: the tables part, supplier, project, PS and PSJ]
There are 5 tables in the relational schema:
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)
2. ORA-SS in a nutshell
• ORA-SS is a semantically rich data model for semi-structured data.
• It can easily represent the relational semantics and constraints in XML data.
• The ORA-SS model is also a bridge that connects the tree structure of XML with the semantics in relational and object-relational databases.
• Unlike a traditional ER diagram, an ORA-SS schema diagram also represents the hierarchical structure of XML data.
2. ORA-SS in a nutshell
• A complete ORA-SS model has 4 diagrams
  • Schema diagram
    • Represents the structure of and the constraints (business rules) on XML documents
  • Instance diagram
    • Visually represents the graph structure of the XML data
  • Functional dependency diagram
    • Represents the FDs in relationship types
  • Inheritance diagram
    • Represents the specialization/generalization relationships among different object classes in ORA-SS
2. ORA-SS data models
• Object class
  • attributes of object class
  • ordering on object class
• Relationship type
  • degree of relationship type
  • participating object classes in relationship type
  • attributes of relationship type
  • disjunctive relationship type
  • recursive relationship type
  • ID dependent relationship type
2. ORA-SS data models (cont.)
• Attribute
  • attributes of object class or relationship type
  • key attribute (OID)
  • foreign key / referential constraint (IDREF/IDREFS)
  • composite attribute
  • disjunctive attribute
  • attribute with unknown structure
  • ordering on attributes
  • fixed or default value of attribute
  • derived attribute
[Figure: The ORA-SS schema diagram of Example 1. The object classes part, supplier and project are nested in that order. The edge from part to supplier is labelled "PS, 2, +, +" and the edge from supplier to project is labelled "PSJ, 3, +, +". Attributes shown: pno, pname, color under part; sno, sname, city (+) and price (edge labelled PS) under supplier; jno, jname, budget and qty (edge labelled PSJ) under project.]
• part, supplier and project are modeled as object classes.
• PS is a binary relationship type between part and supplier; PSJ is a ternary relationship type defined among part, supplier and project.
• pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
• price is an attribute of the relationship type PS, and qty is an attribute of PSJ.
ORA-SS – Features
• ORA-SS can represent the following semantics
  • Object ID attributes play the role of key constraints in object-relational databases, i.e. the object ID attributes functionally determine (or multi-valued determine) the object attributes of the same object class.
  • Various relationship types, including ID dependent relationship types, with their degrees and participating object classes.
  • The distinction between relationship attributes and object attributes.
3. ORA-SS applications
• Because of its rich semantics, the ORA-SS model can be used widely in
  • Normal form XML schema
  • Relational/object-relational storage of XML data
  • XML view creation and validation [1]
  • XML schema/data integration
  • XML data query, especially with graphical user interfaces [5]
  • XML query optimization
  • etc.

[1] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER 2002, Tampere, Finland, Oct 7-11, 2002.
[5] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
3. ORA-SS applications
Store ORA-SS in object-relational databases
• Existing storage approaches store XML in flat files (NF relations), which are long and difficult to query and update;
• A pure relational DBMS needs many time-consuming joins.
• ORA-SS reflects the nested structure of semi-structured data
  • Fewer joins are needed with nested relations
3. ORA-SS applications
Store ORA-SS in object-relational databases (cont.)
Given an ORA-SS schema diagram
• Each object class is stored as an object relation with its object ID and its object attributes (e.g. part, supplier, project).
• Each relationship type is stored as a relationship relation with the object IDs of the participating object classes and its relationship attributes (e.g. PS and PSJ).
• Multi-valued attributes and composite attributes are stored as nested relations (e.g. city).
3. ORA-SS applications
Store ORA-SS in object-relational databases (cont.)
Storage schema for ORA-SS/XML databases of the data in Example 1.
[Figure: ORA-SS schema diagram]
Storage schema
Object relations
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
Relationship relations
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)
Constraint: PSJ[pno, sno] ⊆ PS[pno, sno]
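The mapping from the nested document to this storage schema can be sketched in Python. This is only an illustration under stated assumptions: the function name `shred`, the use of dictionaries as relations, and the abbreviated sample data are all hypothetical, and the object IDs and relationship attributes are taken as already known from the ORA-SS schema.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Example 1: supplier S001 appears under two parts.
DOC = """<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
    <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>5</price>
      <project><jno>J001</jno><jname>Rocket boots</jname><budget>20000</budget><qty>60</qty></project>
    </supplier>
  </part>
  <part><pno>P002</pno><pname>Nut</pname><color>Copper</color>
    <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>4.6</price>
      <project><jno>J002</jno><jname>Diving helm</jname><budget>18000</budget><qty>60</qty></project>
    </supplier>
  </part>
</psj>"""


def shred(root):
    """Split the nested document into object relations and relationship relations."""
    part, supplier, project = {}, {}, {}  # object relations, keyed by object ID
    ps, psj = {}, {}                      # relationship relations, keyed by ID tuples
    for p in root.findall("part"):
        pno = p.findtext("pno")
        part[pno] = (p.findtext("pname"), p.findtext("color"))
        for s in p.findall("supplier"):
            sno = s.findtext("sno")
            # city is multi-valued, so it is kept as a nested tuple of values.
            cities = tuple(c.text for c in s.findall("city"))
            supplier[sno] = (s.findtext("sname"), cities)
            ps[(pno, sno)] = s.findtext("price")
            for j in s.findall("project"):
                jno = j.findtext("jno")
                project[jno] = (j.findtext("jname"), j.findtext("budget"))
                psj[(pno, sno, jno)] = j.findtext("qty")
    return part, supplier, project, ps, psj


part, supplier, project, ps, psj = shred(ET.fromstring(DOC))
print(supplier)  # S001 is stored once although it appears under two parts
print(ps)        # its two prices live in PS, one per (pno, sno) pair
# The inclusion constraint between PSJ and PS holds by construction here.
print({key[:2] for key in psj} <= set(ps))
```

Storing price in PS rather than in supplier is what keeps the two different prices of S001 from overwriting each other.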
3. ORA-SS applications
Store ORA-SS in object-relational databases (cont.)
An example showing the advantage of using an object-relational database instead of a relational database.
[Figure: ORA-SS schema diagram]
Storage schema in ORDB
Employee (eno, ename, (hobby)*, qualification(year, degree, Univ)*, job_history(year, job_title, company)*)
Storage schema in traditional RDB
Employee (eno, ename)
E_hobby (eno, hobby)
E_qualification (eno, year, degree, Univ)
E_job_history (eno, year, job_title, company)
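The cost of the flat schema can be seen in a small Python sketch. The employee record, the helper names `flatten` and `rebuild`, and the restriction to the hobby and job_history attributes are all assumptions made for illustration; the slide gives only the schemas, not data.

```python
# Hypothetical employee in nested (ORDB-style) form: one record holds everything.
employee = {
    "eno": "E1",
    "ename": "Ann",
    "hobby": ["chess", "tennis"],
    "job_history": [{"year": 2001, "job_title": "Engineer", "company": "Acme"}],
}


def flatten(emp):
    """Split one nested record into flat (RDB-style) tables that repeat eno."""
    return {
        "Employee": [(emp["eno"], emp["ename"])],
        "E_hobby": [(emp["eno"], h) for h in emp["hobby"]],
        "E_job_history": [
            (emp["eno"], j["year"], j["job_title"], j["company"])
            for j in emp["job_history"]
        ],
    }


def rebuild(tables, eno):
    """Reassemble one employee; every flat table needs its own lookup on eno."""
    ename = next(name for e, name in tables["Employee"] if e == eno)
    return {
        "eno": eno,
        "ename": ename,
        "hobby": [h for e, h in tables["E_hobby"] if e == eno],
        "job_history": [
            {"year": y, "job_title": t, "company": c}
            for e, y, t, c in tables["E_job_history"]
            if e == eno
        ],
    }


tables = flatten(employee)
print(tables["E_hobby"])  # [('E1', 'chess'), ('E1', 'tennis')]
print(rebuild(tables, "E1") == employee)  # True
```

Each lookup in `rebuild` stands for a join on eno in the relational schema, whereas the nested form returns the whole employee in one access.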
3. ORA-SS applications
Define and validate XML views
• Valid XML views in ORA-SS
• View definition operators: select, project/drop, swap, join
For example, consider the following swap operation, which exchanges the positions of supplier and part in the hierarchy:
[Figure: a valid view and an invalid view after the swap]
Because price is a relationship attribute, it cannot be moved up together with the supplier elements; doing so would be semantically meaningless in the result view.
3. ORA-SS applications
Define and validate XML views (cont.)
As another example, consider the following projection operation, which drops supplier from the structure:
[Figure: an invalid view and a valid view after the projection]
Dropping supplier turns price and qty into multi-valued attributes, so aggregation functions should be applied to obtain a meaningful view.
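The need for aggregation can be illustrated on the PSJ rows of part P001 from Example 1. This Python sketch assumes that summing qty is the intended aggregation; the slide does not name a specific function, and `drop_supplier` is a hypothetical helper.

```python
from collections import defaultdict

# PSJ rows (pno, sno, jno, qty) for part P001, taken from Example 1.
PSJ = [
    ("P001", "S001", "J001", 60),
    ("P001", "S001", "J003", 650),
    ("P001", "S002", "J002", 70),
    ("P001", "S002", "J003", 50),
]


def drop_supplier(psj_rows):
    """Project supplier out of PSJ and sum qty for each (part, project) pair."""
    totals = defaultdict(int)
    for pno, _sno, jno, qty in psj_rows:
        totals[(pno, jno)] += qty
    return dict(totals)


print(drop_supplier(PSJ))
# {('P001', 'J001'): 60, ('P001', 'J003'): 700, ('P001', 'J002'): 70}
```

Without the sum, the pair (P001, J003) would carry the two unrelated values 650 and 50, which is the multi-valued qty the slide warns about.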
3. ORA-SS applications
Graphical XML query based on ORA-SS
A graphical XML query language is designed on the basis of ORA-SS.
Query 1: Select and display the projects that do not have any suppliers located in Atlanta.
[Figure: screenshot of the user interface of our graphical query language]
• The schema panel loads the ORA-SS schema diagram.
• A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
• Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.
3. ORA-SS applications
XML query optimization
• The semantic information represented in ORA-SS also helps in optimizing XML queries. Consider the following simple query:
(Query 2) Display the budget of project “J001”.
3. ORA-SS applications
XML query optimization
• Traditional processing has to scan the whole XML document, checking every project with jno=“J001” and collecting all corresponding budget values.
• In ORA-SS, however, jno is the object ID and we have the functional dependency jno → budget, so the optimized processing only needs to find the first project instance with jno=“J001” and return the corresponding budget value.
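The difference between the two strategies can be sketched in Python on an abbreviated copy of Example 1. The function names are hypothetical and the `visited` counters are added only to make the saving visible. Note that ElementTree parses the whole document up front, so this sketch shortens only the traversal; stopping the read itself early would need a streaming parser.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Example 1: J001 occurs under two different suppliers.
DOC = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno>
      <project><jno>J001</jno><budget>20000</budget><qty>60</qty></project>
      <project><jno>J003</jno><budget>250000</budget><qty>650</qty></project>
    </supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno>
      <project><jno>J001</jno><budget>20000</budget><qty>20</qty></project>
    </supplier>
  </part>
</psj>"""


def budget_full_scan(root, jno):
    """Without schema semantics: visit every project, collect each matching budget."""
    visited, budgets = 0, set()
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            budgets.add(project.findtext("budget"))
    return budgets, visited


def budget_first_match(root, jno):
    """With jno -> budget known: stop at the first matching project."""
    visited = 0
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            return project.findtext("budget"), visited
    return None, visited


root = ET.fromstring(DOC)
print(budget_full_scan(root, "J001"))    # ({'20000'}, 3)
print(budget_first_match(root, "J001"))  # ('20000', 1)
```

Both functions return the same budget; the early exit is safe only because the functional dependency guarantees that every J001 instance carries the same value.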
4. Discover semantics in XML documents
• Problem definition
  • Input: a well-formed XML document, possibly with a DTD or XSD schema
  • Output: the semantics that are needed for an ORA-SS schema
• It is a process of enriching an XML schema into an ORA-SS schema by using mining techniques.
4. Discover semantics in XML documents
• Related issues in mining semantics
  • Object classes
    • Identify object classes
    • Identify object IDs
    • Identify object attributes and their cardinalities
    • Identify IDREF(S) attributes
  • Relationship types
    • Find relationship types with their degrees and participating object classes
    • Find the attributes of relationship types and their cardinalities
4. Discover semantics in XML documents
[Figure: overview of the whole process, showing the main flow of the process, the input flow and the output flow.]
4. Discover semantics in XML documents
• Assumption
  • To simplify the discussion, we do not consider the order of attributes and elements.
• User verification
  • The findings of each step of the process should be verified by the user.
  • The verified findings of earlier steps are used in later steps.
4. Discover semantics in XML documents
Find object classes
• Identify object classes from element types:
  • Scan the XML document or, if available, its DTD/XSD to select all internal nodes in the document tree.
  • An internal node is a node that has child nodes, i.e. XML attribute types and/or subelement types.
  • An internal node may not be an object class, but an object class must correspond to an internal node. Therefore, internal nodes are the candidates for object classes.
4. Discover semantics in XML documents
Find object classes (cont.)
• Detecting composite attributes among the candidate object classes
  • Although composite attributes are also internal nodes, some special patterns indicate that they are not object classes. The first pattern is that all subelement types or attributes are
    • single-valued,
    • always occurring in the same order,
    • and no functional dependency can be found among the component attributes of the composite attribute.
[Figure: an XML element whose children are XML elements or XML attribute values.]
4. Discover semantics in XML documents
Find object classes (cont.)
The second pattern is that all subelement types or attributes are
• of the same type (repeated), and
• the set of subelement/attribute values is often determined by other element/attribute values (e.g. studNo determines the values of the hobby elements under the “hobbies” element).
[Figure: a student element with studNo and an XML element that contains repeated XML elements or XML attribute values.]
4. Discover semantics in XML documents
Find object classes (cont.)

The DTD of Example 1

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

Dataguide

▼♦ psj
  ▼♦ part
    ♦ pno
    ♦ pname
    ♦ color
    ▼♦ supplier
      ♦ sno
      ♦ sname
      ♦ city
      ♦ price
      ▼♦ project
        ♦ jno
        ♦ jname
        ♦ budget
        ♦ qty

From the DTD of Example 1, the element types psj, part, supplier and project are internal nodes (this can be seen directly in the Dataguide). The list {psj, part, supplier, project} therefore contains the candidate object classes. Because a well-formed XML document usually has a document root that is not part of the data itself, we can drop the root node psj from the list and get the final result {part, supplier, project}.
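The selection of internal nodes can be sketched in Python by scanning the document rather than the DTD. This is a minimal illustration on an abbreviated copy of Example 1; the function name `candidate_object_classes` is hypothetical, and dropping the root follows the assumption stated above that the root only wraps the data.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Example 1: one part, one supplier, one project.
DOC = """<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
    <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>5</price>
      <project><jno>J001</jno><jname>Rocket boots</jname><budget>20000</budget><qty>60</qty></project>
    </supplier>
  </part>
</psj>"""


def candidate_object_classes(root):
    """Return the internal element types in document order, excluding the root."""
    candidates = []
    for element in root.iter():
        # Internal node: has subelements and/or XML attributes.
        is_internal = len(element) > 0 or bool(element.attrib)
        if is_internal and element is not root and element.tag not in candidates:
            candidates.append(element.tag)
    return candidates


print(candidate_object_classes(ET.fromstring(DOC)))  # ['part', 'supplier', 'project']
```

The result is only a candidate list; composite attributes still have to be filtered out and the user has to verify the outcome.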
4. Discover semantics in XML documents
Identify multi-valued attributes
• After object classes and composite attributes have been identified, we pick out all multi-valued attributes for later use.
• Multi-valued attributes can be detected by checking the occurrence constraints in the DTD/XSD, or by counting directly in the document.
• A multi-valued attribute can belong either to an object class (e.g. city of supplier) or to a relationship type. To decide which, we need to find the object IDs first.
• Setting the multi-valued attributes aside also makes the search for object IDs easier.
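Counting directly in the document can be sketched as follows. The helper `multi_valued_attributes` is hypothetical and considers only simple subelements (no children, no XML attributes); the sample is an abbreviated copy of Example 1.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated copy of Example 1: supplier S002 has two city elements.
DOC = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><city>Atlanta</city><price>5</price></supplier>
    <supplier><sno>S002</sno><city>Atlanta</city><city>New York</city><price>5.5</price></supplier>
  </part>
</psj>"""


def multi_valued_attributes(root):
    """Return (parent tag, child tag) pairs where a simple child repeats in one parent."""
    found = set()
    for parent in root.iter():
        # Count only simple elements: no subelements and no XML attributes.
        counts = Counter(c.tag for c in parent if len(c) == 0 and not c.attrib)
        found.update((parent.tag, tag) for tag, n in counts.items() if n > 1)
    return found


print(multi_valued_attributes(ET.fromstring(DOC)))  # {('supplier', 'city')}
```

Counting can only confirm that an attribute is multi-valued in the given data; an attribute that happens to occur once everywhere may still be declared with + in the DTD.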
4. Discover semantics in XML documents
Find object IDs
• For each identified (and user-verified) object class
  • If it is located at the first level below the document root, and the DTD/XSD specifies an ID attribute or a key constraint, then the corresponding attribute/element should be the object ID.
  • Otherwise
    • Build a temporary table that contains all XML attributes and single-valued simple subelement types of the object class.
    • Find the full functional dependencies in the temporary table.
    • If all attributes/elements are fully functionally dependent on an attribute/element k, then k is most likely the object ID; else,
      • find the attribute/element k’ that functionally determines the largest number of attributes/elements; k’ is suggested as the object ID,
      • and the attributes/elements that are not determined by k’ are classified as single-valued attributes of some relationship types to be determined later.
• The result should be verified by the user.
4. Discover semantics in XML documents
Find object IDs (cont.)
Candidate object classes list: {part, supplier, project}

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

Three temporary tables
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)

Notice that, at this stage, all simple subelement types and attributes are treated in the same way. Multi-valued attributes such as city are not included in the temporary tables.
4. Discover semantics in XML documents
Find object IDs (cont.)
Three temporary tables
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)
1. In part_temp, we find that pno → pname, color; thus, pno is the object ID of part.
2. In supplier_temp, we only have sno → sname; thus, sno is the object ID of supplier, and price is picked out as a relationship attribute.
3. In project_temp, we only have jno → jname, budget; thus, jno is the object ID of project, and qty is picked out as a relationship attribute.
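The search in a temporary table can be sketched in Python for supplier_temp, using the four supplier instances of Example 1. The helpers `determines` and `suggest_object_id` are hypothetical, and only single-column determinants are tried. Ties are broken by column order, which is an arbitrary choice; on a small sample several columns may tie, which is one reason the result needs user verification.

```python
def determines(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds in every row."""
    seen = {}
    for row in rows:
        # Remember the first rhs value seen for each lhs value; any later
        # row with a different rhs value violates the dependency.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def suggest_object_id(rows):
    """Suggest the column that determines the most other columns.

    Returns the suggested object ID and the columns it does not determine,
    which are candidates for relationship attributes.
    """
    columns = list(rows[0])

    def score(k):
        return sum(determines(rows, k, c) for c in columns if c != k)

    best = max(columns, key=score)  # ties fall to the earliest column
    leftover = [c for c in columns if c != best and not determines(rows, best, c)]
    return best, leftover


# supplier_temp built from the four supplier instances in Example 1.
supplier_temp = [
    {"sno": "S001", "sname": "Alfa", "price": "5"},
    {"sno": "S002", "sname": "Beta", "price": "5.5"},
    {"sno": "S001", "sname": "Alfa", "price": "4.6"},
    {"sno": "S003", "sname": "Beta", "price": "5"},
]
print(suggest_object_id(supplier_temp))  # ('sno', ['price'])
```

S001 appears with the prices 5 and 4.6, so sno does not determine price, and price is set aside as a relationship attribute.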
4. Discover semantics in XML documents
Find object IDs
• After the process of identifying object IDs, we have found:
  • The object ID of each object class,
  • The single-valued object attributes and their corresponding object classes,
  • The single-valued relationship attributes, without yet knowing which relationship types they belong to.
4. Discover semantics in XML documents
Multi-valued attributes of object classes
• Recall that all multi-valued attributes were identified before the search for object IDs. Given a multi-valued attribute under an object class, we check,
  • for each object ID value of the object class, whether there is a unique set of values of the attribute.
  • If so, it is a multi-valued attribute of the object class; otherwise, it is classified as a multi-valued attribute of some relationship type that is not known yet.
4. Discover semantics in XML documents
Multi-valued attributes of object classes
• For example, city is a multi-valued attribute under supplier.
• We check sno and city: since each sno value is always associated with the same set of city values, city is a multi-valued attribute of supplier.
[Figure: The temporary table of sno and city]
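The check on sno and city can be sketched in Python on an abbreviated copy of Example 1. The helper `is_object_attribute` is hypothetical; it compares the set of attribute values across all instances that share one object ID value.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Example 1: supplier S001 appears under both parts.
DOC = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><city>Atlanta</city><price>5</price></supplier>
    <supplier><sno>S002</sno><city>Atlanta</city><city>New York</city><price>5.5</price></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno><city>Atlanta</city><price>4.6</price></supplier>
  </part>
</psj>"""


def is_object_attribute(root, class_tag, id_tag, attr_tag):
    """True if every instance with the same ID carries the same set of attr values."""
    seen = {}
    for element in root.iter(class_tag):
        values = frozenset(c.text for c in element.findall(attr_tag))
        # The first value set seen for an ID is the reference for later instances.
        if seen.setdefault(element.findtext(id_tag), values) != values:
            return False
    return True


root = ET.fromstring(DOC)
print(is_object_attribute(root, "supplier", "sno", "city"))   # True
print(is_object_attribute(root, "supplier", "sno", "price"))  # False
```

The same test applied to price fails, because S001 has different prices under P001 and P002, which matches the earlier classification of price as a relationship attribute.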