XML Schema Integration
Resources: Louise Lane & Kalpdrum Passi, Sanjay Madria and Mukesh Mohania, “A Model for XML Schema Integration”, and my research in Fall 2001 with Dr. Madria
Contents
• What is XML
• Data Integration
• Why business applications use XML
• What is XML Schema
• Different ways to integrate XML data
• XML Schema Integration
• XML Namespaces
• Phases in Schema Integration
• XML Schema Data Model
• Graphical representation of the model
Contents contd..
• Conflict resolution
• Integration phase
• Construction of Global schema
• Advantages
• Disadvantages
• Conclusion
What is XML
XML is a markup language for documents containing structured information. A markup language is a mechanism for identifying structures in a document. Because XML documents are self-describing, XML provides a platform-independent way to describe data and can therefore be used to transport data from one platform to another. XML documents can be created and consumed by applications.
Data Integration
E-commerce applications use data from different sources, and this data needs to be integrated. A mediated schema is created to represent a particular application domain, and the data sources are mapped as views over the mediated schema.
Why business applications use XML
Business applications need to exchange data with one another. The data should be independent of any particular representation and platform. XML is also used when two or more organizations merge: the merged organization needs interoperability among its documents, which can be achieved through XML integration.
XML Schema
XML Schema is recommended by the W3C as the standard schema language for validating XML documents. XML Schema has greater expressive power than DTDs for the purpose of exchanging and integrating data from various sources.
Different ways to integrate XML data
• Integrating XML documents
• Mapping local schemas to a global/integrated schema if the global schema is known, or querying the data to obtain the required global schema
• Integrating XML Schemas
Extracting Schema from XML Documents
Minimal spanning graphs can be extracted from different documents, and the schema can be constructed by applying heuristic rules to these graphs. The paper “Re-engineering Structures from Web Documents” by Chuang-Hue, Ee-Peng, and Wee-Keong deals with constructing a DTD schema for given XML documents.
Complexities in integrating XML Documents
• The schema needs to be extracted from each document.
• The extracted schemas must be integrated, or, if a global schema already exists, each individual schema must be mapped to it.
• The XML documents must be parsed and their data integrated according to the global schema. Alternatively, the XML documents can be queried to obtain the integrated document.
Tukwila Data Integration System
The Tukwila data integration system uses a mediated schema to integrate data from different sources. The user poses a query over the mediated schema, and the system reformulates the query over the data sources and executes it. Tukwila uses a query reformulator and optimizer to query large amounts of data efficiently. The MiniCon algorithm is used to map the query from the mediated schema to the data sources. Tukwila also uses an x-scan operator that can query streaming XML data.
Tukwila x-scan operator
Querying techniques such as XML-QL and XQL require the complete XML document to be downloaded before it can be queried.
Tukwila x-scan operator contd..
The Tukwila x-scan operator matches regular path expression patterns from the query and returns results in a pipelined fashion as the data streams across the network.
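The pipelined behaviour described above can be illustrated with a small Python sketch. This is not Tukwila's x-scan implementation: it assumes a fixed element path instead of a full regular path expression, and the function name stream_matches is hypothetical. It only shows how matches can be returned while the document is still arriving in chunks.

```python
import xml.etree.ElementTree as ET


def stream_matches(chunks, path):
    """Yield the text of every element whose path from the root equals `path`.

    `chunks` is any iterable of string fragments, e.g. data arriving from a
    network. Results are yielded as soon as the matching end tag is parsed,
    without waiting for the rest of the document.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    stack = []  # tags of the currently open elements, root first
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                stack.append(elem.tag)
            else:
                if stack == path:
                    yield (elem.text or "").strip()
                stack.pop()
                elem.clear()  # drop content that is no longer needed


# Illustrative document, cut into 16-character chunks to mimic streaming.
DOC = (
    "<gs_equipment>"
    "<machine><serial_number>FRD6754</serial_number></machine>"
    "<machine><serial_number>123456145</serial_number></machine>"
    "</gs_equipment>"
)
CHUNKS = [DOC[i:i + 16] for i in range(0, len(DOC), 16)]
SERIALS = list(stream_matches(CHUNKS, ["gs_equipment", "machine", "serial_number"]))
```

Because the function is a generator, a consumer can start processing the first serial number before the last chunk has been fed to the parser.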
XML Schema Integration
The automated integration of XML schemas benefits both traditional view integration and database integration. An integrated schema forms the basis for a valid query language over a particular set of XML documents. Because the schemas to be integrated currently validate a set of existing XML documents, data integrity and continued document delivery are chief concerns of the integration process.
XML Namespace
XML Schema requires the use of namespaces to uniquely identify schema structures (elements, attributes, datatypes, etc.). The name of each structure is prefaced by a namespace prefix that identifies the namespace in which the structure is defined. A practical example of schema integration is the merger of two companies.
Documents and schemas of the companies that merge

<?xml version="1.0" ?>
<gs_equipment xmlns="http://www.GSE1example.org"
              xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
              xsi:schemaLocation="http://www.GSE1example.org GSE1.xsd">
  <machine type="baggage_handler">
    <supplier>Air to Ground</supplier>
    <serial_number>FRD6754</serial_number>
    <service_agreement>
      <expiry_date>01-01-2006</expiry_date>
    </service_agreement>
    <service_hours>345</service_hours>
  </machine>
  <location>
    <airport>Vancouver</airport>
    <terminal>6A</terminal>
  </location>
</gs_equipment>

<?xml version="1.0" ?>
<gs_equipment xmlns="http://www.GSE2example.org"
              xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
              xsi:schemaLocation="http://www.GSE2example.org GSE2.xsd">
  <placement>
    <airport>Winnipeg</airport>
    <terminal>main</terminal>
  </placement>
  <machine type="tow_truck">
    <serial_number>123456145</serial_number>
    <vendor>Quick as a Jet GSE</vendor>
    <service_agreement>QJ-TT-123456145-September 2003</service_agreement>
    <service_hours>1090.75</service_hours>
  </machine>
</gs_equipment>
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema"
        xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
        targetNamespace="http://www.GSE1example.org"
        elementFormDefault="qualified"
        xmlns:GSE1="http://www.GSE1example.org">
  <element name="gs_equipment">
    <complexType>
      <sequence>
        <element ref="GSE1:machine" minOccurs="1" maxOccurs="1"/>
        <element ref="GSE1:location" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="machine">
    <complexType>
      <sequence>
        <element name="supplier" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="serial_number" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element ref="GSE1:service_agreement" minOccurs="1" maxOccurs="1"/>
        <element name="service_hours" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
      </sequence>
      <xsd:attribute name="type" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="boarding_stairs"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </complexType>
  </element>
  <element name="service_agreement">
    <complexType>
      <sequence>
        <element name="expiry_date" type="xsd:date" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="location">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
</schema>

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema"
        xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
        targetNamespace="http://www.GSE2example.org"
        elementFormDefault="qualified"
        xmlns:GSE2="http://www.GSE2example.org">
  <element name="gs_equipment">
    <complexType>
      <sequence>
        <element ref="GSE2:placement" minOccurs="1" maxOccurs="1"/>
        <element ref="GSE2:machine" minOccurs="0" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="placement">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="machine">
    <complexType>
      <all>
        <element name="vendor" type="xsd:string" minOccurs="0" maxOccurs="1"/>
        <element name="service_hours" type="xsd:decimal" minOccurs="0" maxOccurs="1"/>
        <element name="serial_number" type="xsd:positiveInteger" minOccurs="0" maxOccurs="1"/>
        <element name="service_agreement" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      </all>
      <xsd:attribute name="type" use="optional">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="tow_truck"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </complexType>
  </element>
</schema>
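Both companies use the element names gs_equipment and machine, so the namespaces are what keep the two vocabularies apart. The following Python sketch, using only the standard library, shows how a namespace-aware parser reports the names. The two document fragments are shortened versions of the documents above, and the helper name qualified_names is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the two company documents (illustrative only).
DOC_GSE1 = (
    '<gs_equipment xmlns="http://www.GSE1example.org">'
    '<machine type="baggage_handler"/>'
    '</gs_equipment>'
)
DOC_GSE2 = (
    '<gs_equipment xmlns="http://www.GSE2example.org">'
    '<machine type="tow_truck"/>'
    '</gs_equipment>'
)


def qualified_names(xml_text):
    """Return the namespace-qualified tag of every element, in document order."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]


NAMES_1 = qualified_names(DOC_GSE1)
NAMES_2 = qualified_names(DOC_GSE2)
# ElementTree writes qualified names as "{namespace-uri}local-name", so the
# two "machine" elements are distinct even though the local name is the same.
```

An integration tool can therefore tell the two machine definitions apart before deciding whether they correspond to each other.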
An object-oriented data model called XSDM (XML Schema Data Model) is defined. A three-layered architecture consisting of pre-integration, comparison and integration is used for the integration. A global schema must meet three criteria: completeness, minimality and understandability. The optionality of elements is expanded so that the global schema satisfies the occurrence bounds of every local schema.
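One plausible reading of "expanding optionality" is to take the smallest lower bound and the largest upper bound of the corresponding elements. The Python sketch below encodes that reading as an assumption; it is not taken from the paper, and the name merge_occurrence is hypothetical. An element that is missing from one schema is treated as having the bounds (0, 0).

```python
def merge_occurrence(bounds_a, bounds_b):
    """Combine two (minOccurs, maxOccurs) pairs so that both remain valid.

    Assumption: the global bound is the loosest range covering both inputs.
    """
    min_a, max_a = bounds_a
    min_b, max_b = bounds_b
    return (min(min_a, min_b), max(max_a, max_b))


ABSENT = (0, 0)  # bounds assumed for an element a schema does not declare

# serial_number: required in the first schema, optional in the second.
SERIAL_NUMBER = merge_occurrence((1, 1), (0, 1))
# supplier: required in the first schema, not declared in the second.
SUPPLIER = merge_occurrence((1, 1), ABSENT)
```

Under this assumption both elements become optional in the global schema, so every document that was valid against a local schema stays valid.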
Three Phases of integration
Pre-integration: Element, attribute and datatype definitions are extracted by parsing the actual schema document.
Comparison: The correspondences between elements and attributes are determined, either through semantic learning or through human interaction.
Integration: Conflicts between the corresponding elements and/or attributes, such as naming conflicts, datatype conflicts and structural conflicts, are resolved.
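The pre-integration step can be sketched in Python with the standard library parser. This is a simplified illustration, not the tool used in the research: it assumes the schema namespace that appears in these slides, collects only element declarations, and the function name extract_elements is hypothetical.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2000/10/XMLSchema"  # namespace used in these slides

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
  <xsd:element name="location">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>"""


def extract_elements(schema_text):
    """Return one dict per element declaration, in document order."""
    root = ET.fromstring(schema_text)
    found = []
    for elem in root.iter("{%s}element" % XSD_NS):
        found.append({
            "name": elem.get("name") or elem.get("ref"),
            "type": elem.get("type"),  # None for elements with an inline type
            "minOccurs": elem.get("minOccurs", "1"),
            "maxOccurs": elem.get("maxOccurs", "1"),
        })
    return found


ELEMENTS = extract_elements(SCHEMA)
```

The resulting list is the kind of raw material the comparison phase needs: names to match, plus the types and occurrence bounds that may later conflict.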
XML Schema Data Model (XSDM)
Four structures are defined: Node Object, Child Object, Datatype Object and Attribute Object.
Node Object: Represents an element, which may be either non-terminal or terminal. Each node is defined by a set of properties: Name, Namespace, Attribute, Datatype, Substitution Group Name, Child List and Node Type. Node Type takes one of six values: terminal, sequence, choice, all, any or empty.
Child Object: Represents an element that is part of a Child List. Each child is defined by Name, Namespace, Max Occurrences and Min Occurrences.
XML Schema Data Model (XSDM) contd..
Datatype Object: Represents the datatype of elements and attributes. It is defined by Name, Variety (atomic, union, list), Kind (43 simple and derived datatypes) and Constraining Facets.
Attribute Object: Represents an attribute associated with a non-terminal or terminal element. It is defined by Name, Namespace, Use, Datatype and Value (default value).
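As a rough illustration, the four XSDM structures can be written as Python dataclasses. The field names follow the properties listed above, but the class layout, the defaults and the Python types are assumptions made for this sketch, not the notation used in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatatypeObject:
    name: str
    variety: str = "atomic"  # atomic, union or list
    kind: Optional[str] = None  # e.g. "string", "date"
    facets: Dict[str, object] = field(default_factory=dict)


@dataclass
class AttributeObject:
    name: str
    namespace: str
    use: str = "optional"
    datatype: Optional[DatatypeObject] = None
    value: Optional[str] = None  # default value


@dataclass
class ChildObject:
    name: str
    namespace: str
    min_occurs: int = 1
    max_occurs: int = 1


@dataclass
class NodeObject:
    name: str
    namespace: str
    node_type: str  # terminal, sequence, choice, all, any or empty
    attributes: List[AttributeObject] = field(default_factory=list)
    datatype: Optional[DatatypeObject] = None
    substitution_group: Optional[str] = None
    child_list: List[ChildObject] = field(default_factory=list)


NS = "http://www.GSE1example.org"
LOCATION = NodeObject(
    name="location",
    namespace=NS,
    node_type="sequence",
    child_list=[ChildObject("airport", NS), ChildObject("terminal", NS)],
)
AIRPORT = NodeObject(
    name="airport",
    namespace=NS,
    node_type="terminal",
    datatype=DatatypeObject(name="string", kind="string"),
)
```

Here location is a non-terminal node whose child list points to airport and terminal, while airport is a terminal node that carries a datatype instead of children.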
Conflict Resolution
Naming conflicts:
Synonym naming conflict: Different names but the same definition. Resolved using substitution group names.
Homonym naming conflict: The same name but a different structure. Homonym conflicts at non-terminals are called structural conflicts; at terminals they are called datatype conflicts.
Conflict Resolution contd..
Datatype & scale differences:
• Disjoint or incompatible datatypes: union. E.g. string, integer
• Compatible datatypes: scale adjustment. E.g. integer, float
• Enumerated datatypes: take the set of all the enumerations. E.g. {a,b}, {b,c} => {a,b,c}
• Scale differences: constraining facet redefinition
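The first three rules above can be sketched as two small Python functions. The widening order in NUMERIC_WIDENING and the tuple used to represent a union are assumptions made for illustration, and the function names are hypothetical; the paper's actual compatibility rules may differ.

```python
# Assumed ordering of compatible numeric types, narrowest first.
NUMERIC_WIDENING = ["integer", "decimal", "float"]


def resolve_datatype(type_a, type_b):
    """Pick a global datatype for two corresponding terminal elements."""
    if type_a == type_b:
        return type_a
    if type_a in NUMERIC_WIDENING and type_b in NUMERIC_WIDENING:
        # Compatible datatypes: adjust to the wider of the two.
        wider = max(NUMERIC_WIDENING.index(type_a), NUMERIC_WIDENING.index(type_b))
        return NUMERIC_WIDENING[wider]
    # Disjoint or incompatible datatypes: fall back to a union.
    return ("union", type_a, type_b)


def merge_enumerations(values_a, values_b):
    """Enumerated datatypes: the global enumeration is the set of all values."""
    return sorted(set(values_a) | set(values_b))


SERVICE_HOURS = resolve_datatype("integer", "decimal")
SERIAL_NUMBER_TYPE = resolve_datatype("string", "positiveInteger")
MACHINE_TYPES = merge_enumerations(
    ["baggage_handler", "boarding_stairs"],
    ["baggage_handler", "tow_truck"],
)
```

With the airline example, service_hours widens to decimal, serial_number becomes a union of string and positiveInteger, and the type attribute accepts all three machine kinds.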
Conflict Resolution contd..
Structural conflicts:
Type conflicts: An element is a terminal in one schema and a non-terminal in the other. Both are added to the global schema.
Key conflicts: If both schemas have their own keys, the global schema's key should be a composite of both keys. If an element is declared as a key in one schema and as a non-key in the other, complete knowledge of the data present in the documents is required to resolve the conflict. If the same element is declared as a key in both schemas, a prefix can be added to the keys to make the key elements globally unique.
Integration phase
• Constructing the correspondences table
• Constructing the dependencies table
The correspondences table contains information about the corresponding elements/attributes. An entry in the dependencies table denotes the dependency of an element on other elements/attributes. An element or attribute is integrated only after its dependencies have been integrated.
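Integrating each element only after its dependencies amounts to a topological ordering of the dependencies table. The Python sketch below uses graphlib from the standard library (Python 3.9 or later). The table contents are a hypothetical example based on the airline schemas, and the function name integration_order is an assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies table: each element maps to the elements it
# depends on, which must be integrated before it.
DEPENDENCIES = {
    "gs_equipment": {"machine", "location"},
    "machine": {"service_agreement"},
    "location": set(),
    "service_agreement": set(),
}


def integration_order(dependencies):
    """Return the elements in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


ORDER = integration_order(DEPENDENCIES)
```

TopologicalSorter raises graphlib.CycleError if the table contains a circular dependency, which would need to be handled separately.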
Construction of the Global schema Document
Once the integration process is complete, the global schema in XSDM notation is used to construct the global XML Schema document. Constructing the document is straightforward because all the information about the schema is present in the XSDM notation.
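To show why this step is mechanical, the following Python sketch serializes one non-terminal element into schema markup. It is an illustration only: it handles a single sequence of simple children, takes plain tuples instead of real XSDM objects, and the function name build_element is hypothetical.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2000/10/XMLSchema"  # namespace used in these slides
ET.register_namespace("xsd", XSD_NS)


def build_element(name, children):
    """Build an element declaration whose content is a sequence.

    `children` is a list of (name, type, min_occurs, max_occurs) tuples.
    """
    element = ET.Element("{%s}element" % XSD_NS, {"name": name})
    complex_type = ET.SubElement(element, "{%s}complexType" % XSD_NS)
    sequence = ET.SubElement(complex_type, "{%s}sequence" % XSD_NS)
    for child_name, child_type, min_occurs, max_occurs in children:
        ET.SubElement(sequence, "{%s}element" % XSD_NS, {
            "name": child_name,
            "type": child_type,
            "minOccurs": str(min_occurs),
            "maxOccurs": str(max_occurs),
        })
    return element


LOCATION_XSD = ET.tostring(
    build_element("location", [
        ("airport", "xsd:string", 1, 1),
        ("terminal", "xsd:string", 1, 1),
    ]),
    encoding="unicode",
)
```

Each property stored in the model maps directly onto one attribute or nested tag, so no further analysis of the schemas is needed at this stage.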
Global schema document

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema"
        xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
        targetNamespace="http://www.GSEMexample.org"
        elementFormDefault="qualified"
        xmlns:GSEM="http://www.GSEMexample.org"
        xmlns:GSE2="http://www.GSE2example.org">
  <element name="gs_equipment">
    <complexType>
      <choice>
        <sequence>
          <element ref="GSEM:machine" minOccurs="1" maxOccurs="1"/>
          <element ref="GSEM:location" minOccurs="1" maxOccurs="1"/>
        </sequence>
        <sequence>
          <element ref="GSEM:location" minOccurs="1" maxOccurs="1"/>
          <element ref="GSEM:machine" minOccurs="0" maxOccurs="1"/>
        </sequence>
      </choice>
    </complexType>
  </element>
  <element name="machine">
    <complexType>
      <all>
        <element name="supplier" type="xsd:string" minOccurs="0" maxOccurs="1"/>
        <element name="serial_number" type="GSEM:serial_number_type" minOccurs="0" maxOccurs="1"/>
        <element ref="GSEM:service_agreement" minOccurs="0" maxOccurs="1"/>
        <element ref="GSE2:service_agreement" minOccurs="0" maxOccurs="1"/>
        <element name="service_hours" type="xsd:decimal" minOccurs="0" maxOccurs="1"/>
        <element name="vendor" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      </all>
      <xsd:attribute name="type" use="optional">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="boarding_stairs"/>
            <xsd:enumeration value="tow_truck"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </complexType>
  </element>
  <xsd:simpleType name="serial_number_type">
    <xsd:union memberTypes="xsd:string xsd:positiveInteger"/>
  </xsd:simpleType>
  <element name="service_agreement">
    <complexType>
      <sequence>
        <element name="expiry_date" type="xsd:date" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="location" substitutionGroup="GSEM:placement">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="placement">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
</schema>
Advantages
This method is useful when a required global schema is not already available. The global XML schema obtained is complete, minimal and understandable. Human interaction is required only to a limited extent. Even when the local schemas are large and complex, the global schema can be obtained efficiently.
Disadvantages
User interaction is required; the task cannot be done using semantic learning alone. The method does not resolve all key conflicts, because complete knowledge of the data is required to resolve them. The method has no cross-check on the user's input. The process may produce a non-minimal schema if the user does not recognize all the correspondences.
Conclusion
This method is successful in integrating schema documents, and it can be implemented in practice.