XML Today. Assoc. Prof. Vladimir Dimitrov, Faculty of Mathematics and Informatics, Sofia University, POB 1829, Sofia 1000, Bulgaria, cht@fmi.uni-sofia.bg
Abstract The state of the art of XML usage in information systems is discussed. XML products are classified into the following categories: Middleware, XML-Enabled Databases, Native XML Databases, XML Servers, Wrappers, Content Management Systems, XML Query Engines, and XML Data Binding. DBMS-based ways of incorporating XML are analyzed.
Introduction XML is designed for: • Document interchange between different systems (local or remote); • Structured data export/import from databases; • Textual databases (semi-structured and unstructured documents) used for full text retrieval.
Introduction XML products can be classified into the following categories: • Middleware. This software is used by data-centric applications to transfer data between XML documents and databases; • XML-Enabled Databases. Database systems extended for data transfer between XML documents and the database. They are used in data-centric applications; • Native XML Databases. Database systems that store XML in "native" form, which can be some variant of the DOM mapped to an underlying persistent data store. They are used in data- and document-centric applications; • XML Servers. These are XML-aware J2EE servers, Web application servers, integration engines, and custom servers. These servers can be used for distributed applications or simply for publishing XML documents on the Web. They are used in data- and document-centric applications; • Wrappers. This kind of software treats XML documents as a source of relational data. Typically, wrappers support SQL for querying XML documents. They are used in data-centric applications; • Content Management Systems. These are applications that support content/document management. They can be implemented with XML databases or directly on the file system. Usually they include features such as check-in/check-out, versioning, and editors. They are used in document-centric applications; • XML Query Engines. These are standalone engines that support querying of XML documents. They are used in data- and document-centric applications; • XML Data Binding. Products that bind XML documents to objects. They can also persist objects to the database. They are used in data-centric applications.
Middleware Middleware software is used by data-centric applications to transfer data between XML documents and databases. It usually runs in the process space of the application and usually accesses data in relational databases using ODBC, JDBC, or OLE DB. Some examples of this kind of software are: • ADO from Microsoft. ADO can save Recordset objects as XML documents and can restore Recordset objects from XML documents. In this case, Recordset objects are used to transfer data between XML documents and databases. An ADO XML document is divided into two parts: the first part maps the XML data in the second part to the Recordset. This mapping is described with an annotated version of XML-Data Reduced. In ADO, an XML tree of nested elements is represented as a tree of nested Recordsets, and vice versa. Updates, deletes, and inserts are flagged in the XML document with ADO-specific tags. • Delphi from Borland. It is an application development tool that supports the transfer of data between XML documents and databases through the use of client data sets. All of the data is local to the client in the client data set. Client data sets can be bound to databases or XML documents, so they act as a mediator between the two. Client data sets are mapped to XML documents and vice versa via tables. With client data sets it is possible to emulate an object-relational mapping. • XML SQL Utility for Java, XSQL Servlet from Oracle. XML SQL Utility for Java is a set of Java classes for transferring data between a relational database and an XML document. They can be used through the provided front ends or in a user-written application. If the database system supports SQL 3 object views, the product uses an object-relational mapping; otherwise it uses a table-based mapping for a single table. The XML SQL Utility for Java accepts XML documents or DOM Documents. It returns results as XML documents, DOM Documents, or SAX2 events and may include inline XML Schemas.
The XML SQL Utility for Java supports updates and deletes. XSQL Servlet is a Java servlet that uses the XML SQL Utility for Java.
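The table-based mapping that these middleware products rely on can be illustrated with a short sketch. It is not the API of ADO, Delphi, or the XML SQL Utility; it assumes Python with the standard `sqlite3` and `xml.etree.ElementTree` modules, and the function names, the `row` element name, and the `customer` table are hypothetical choices made for the illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Export a table as <table><row><column>value</column>...</row></table>."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    root = ET.Element(table)
    for record in cur:
        row = ET.SubElement(root, "row")  # "row" is an assumed element name
        for name, value in zip(columns, record):
            # NULL values become empty elements
            ET.SubElement(row, name).text = None if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_table(conn, xml_text):
    """Insert each <row> into the table named by the root element."""
    root = ET.fromstring(xml_text)
    for row in root.findall("row"):
        names = [child.tag for child in row]
        values = [child.text for child in row]
        column_list = ", ".join(f'"{n}"' for n in names)
        placeholders = ", ".join("?" for _ in names)
        conn.execute(
            f'INSERT INTO "{root.tag}" ({column_list}) VALUES ({placeholders})',
            values,
        )


def demo():
    """Round-trip one hypothetical table through XML and back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Ana')")
    document = table_to_xml(conn, "customer")
    conn.execute("DELETE FROM customer")
    xml_to_table(conn, document)
    return document, conn.execute("SELECT id, name FROM customer").fetchall()
```

Calling `demo()` exports the table, empties it, and reloads it from the XML document. Element names must match table and column names, which is the restriction that table-based mappings typically impose.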
XML-Enabled Databases XML-enabled database systems have extensions for transferring data between XML documents and their own databases. They are used by data-centric applications. Some examples of this kind of software are: • Access 2003 from Microsoft. It transfers data to/from XML documents using a table-based mapping. Individual data values must be in child elements (attributes are ignored) and table/column names must match element names. Access 2003 can create an XML Schema document describing exported data. • DB2 from IBM. DB2 supports XML in the base DB2 product (as publishing functions in SQL/XML), in the XML Extender and Text Extender, and in its Web services framework. The DB2 XML Extender (DB2 UDB Extender) can store XML documents in columns of type VARCHAR or CLOB, or in files, using the XMLVARCHAR, XMLCLOB, or XMLFILE user-defined types, or in XML collections. A Data Access Definition (DAD) file allows one or more elements/attributes to be indexed. XML collections map non-XML data to an XML document according to a DAD document. There are two different mappings: SQL mapping and RDB node mapping. SQL mapping uses templates to specify where the results should be placed. RDB node mapping is an object-relational mapping and can be used to transfer data both to and from the database. A visual tool is provided for constructing DAD documents, that is, for mapping elements and attributes to tables and columns. Applications use stored procedures to invoke the XML Extender. The XML Extender manages DAD documents and DTDs in its own tables. The XML Extender can send XML documents to and retrieve XML documents from MQSeries message queues, validate XML documents against XML Schemas or DTDs, transform XML documents with XSLT, copy XML documents between files and the database, and extract values from XML documents. The DB2 Text Extender supports many search technologies, such as fuzzy searches, synonym searches, and searches by sentence or paragraph.
DB2 WORF uses DADX documents to define Web services. DADX documents extend the functionality of DAD documents and describe how a Web service accesses data in the database. Supported functionality includes storing and retrieving documents with the XML Extender, executing SQL queries, and calling stored procedures. DB2 WORF can also generate WSDL documents from DADX documents. • FoxPro from Microsoft. Visual FoxPro transfers data between an XML document and a FoxPro table with CURSORTOXML, XMLTOCURSOR, and XMLUPDATEGRAM. CURSORTOXML and XMLTOCURSOR use a table-based mapping. Column data can be represented either as attributes or as child elements. FoxPro can use an XML Schema to determine the mapping. If no XML Schema is present, FoxPro analyzes the XML document to determine its structure and to construct the mapping. FoxPro generates an inline schema when transferring data from the database to XML. • Informix from IBM. Informix supports XML through its Object Translator and through the Web DataBlade. In Object Translator, XML support is provided through generated methods that transfer data between objects and XML documents. A GUI tool can be used to create object-relational mappings from XML documents to the database. The Web DataBlade is an application that creates XML documents from templates containing embedded SQL statements and other scripting language commands.
XML-Enabled Databases • Oracle 8i, Oracle 9i XDB from Oracle. Oracle 9i XDB supports both XML-enabled and native storage of XML data. It blurs the boundaries between relational data and XML data by providing SQL features (implemented at the engine level) that allow users to view relational data as XML and XML data as relational. The main feature is the XMLType data type. This is a predefined object type that can store an XML document. Like any object type, XMLType can be used as the data type of a column in a table or view. The latter usage is important, as it means that an XML "view" - a virtual XML document - can be constructed over any data, regardless of whether it is relational data or XML data. A number of operators have been added to SQL to help view XML data as relational data and vice versa. XMLType data can be stored in either of two ways: with object-relational storage or as a CLOB. The storage options are interchangeable and XML applications use the same code regardless of which option is chosen. XMLType data can be accessed in several ways. Java Beans (which can be generated from an XML Schema) can be used when the data uses object-relational storage. The DOM can be used regardless of the storage option. (The DOM implementation populates nodes lazily for better concurrency.) Both methods can cache changes and store them later with a call to XMLType.save(). In addition, data can be accessed by executing SQL statements that use the operators mentioned earlier. The other major feature of XDB is the XML Repository. This provides a file system-like view of XMLType objects in the database. That is, XMLType objects (regardless of whether they actually contain XML data or are just XML views over relational data) can be assigned a path and corresponding URL in the repository hierarchy. These can then be accessed via WebDAV, FTP, JNDI, and SQL; the latter has special operators for this purpose. 
In addition, the repository maintains properties for each object, such as owner, modification date, version, and access control. • SQL Server 2000 from Microsoft. Microsoft SQL Server 2000 supports XML in three ways: the FOR XML clause in SELECT statements, XPath queries that use annotated XML-Data Reduced schemas, and the OpenXML function in stored procedures. SELECT statements and XPath queries can be submitted via HTTP, either directly or in a template file. The FOR XML clause has three options, which specify how the SELECT statement is mapped to XML. RAW models the result set as a table, with one element (named "row") returned for each row. Columns can be returned either as attributes or child elements. AUTO is the same as RAW, except that: 1) the row elements are named after the table, and 2) the resulting XML is nested in a linear hierarchy in the order in which tables appear in the select list. The third option, EXPLICIT, lets the query itself define the shape of the resulting XML tree. Annotated XML-Data Reduced schemas contain extra attributes that map elements and attributes to tables and columns. These specify an object-relational mapping between the XML document and the database, and are used to query the database using a subset of XPath. A tool exists to construct mapping schemas graphically. The OpenXML function uses a table-based mapping to extract any part of an XML document as a table and use it in most places a table name can be used, such as the FROM clause of a SELECT statement. This can be used in conjunction with an INSERT statement to transfer data from an XML document to the database. An XPath expression identifies the element or attribute that represents a row of data. Additional XPath expressions identify the related elements, attributes, or PCDATA that comprise the columns in each row, such as the children of the row element. • Sybase ASE 12.5 from Sybase. Sybase supports XML in two ways. First, the ResultSetXml class can transfer data between an XML document and the database.
A ResultSetXml object can be created from an XML document or a SELECT statement. Among other things, applications can modify the data in a ResultSetXml object, serialize the data to an XML document, or create an SQL script to create a table for the data and store the data in the database. The XML document used by ResultSetXml has a proprietary format that contains a set of ColumnMetaData elements followed by a set of Row and Column elements. Sybase also has native XML capabilities. It can store XML documents in a pre-parsed, indexed form in BLOB columns. These can then be queried with XQL.
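The RAW and AUTO shapes described for the SQL Server FOR XML clause differ mainly in how the row elements are named. The following is a minimal sketch of that difference for a single table, written in Python rather than SQL. It does not call SQL Server; the function name, the wrapping `root` element, and the sample data are assumptions made for the illustration, and the nesting that AUTO produces for multi-table queries is left out.

```python
import xml.etree.ElementTree as ET


def result_set_to_xml(table, columns, rows, mode="RAW"):
    """Serialize a single-table result set with one element per row.

    RAW names every row element "row"; AUTO names it after the table.
    Columns are emitted as attributes and NULL (None) columns are omitted.
    """
    if mode not in ("RAW", "AUTO"):
        raise ValueError("mode must be 'RAW' or 'AUTO'")
    root = ET.Element("root")  # assumed wrapper so the output is one document
    element_name = "row" if mode == "RAW" else table
    for record in rows:
        attributes = {
            name: str(value)
            for name, value in zip(columns, record)
            if value is not None
        }
        ET.SubElement(root, element_name, attributes)
    return ET.tostring(root, encoding="unicode")
```

For example, `result_set_to_xml("customer", ["id", "name"], [(1, "Ana")])` yields one `row` element carrying `id` and `name` attributes, while passing `mode="AUTO"` yields a `customer` element with the same attributes.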
Native XML Databases A native XML database is one that: • Defines a (logical) model for an XML document - as opposed to the data in that document - and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Examples of such models are the XPath data model, the XML Infoset, and the models implied by the DOM and the events in SAX 1.0. • Has an XML document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its fundamental unit of (logical) storage. • Is not required to have any particular underlying physical storage model. For example, it can be built on a relational, hierarchical, or object-oriented database, or use a proprietary storage format such as indexed, compressed files.
Native XML Databases Native XML databases fall into two broad categories: • Text-based storage. Store the entire document in text form and provide some sort of database functionality in accessing the document. A simple strategy for this might store the document as a BLOB in a relational database or as a file in a file system and provide XML-aware indexes over the document. A more sophisticated strategy might store the document in a custom, optimized data store with indexes, transaction support, and so on. • Model-based storage. Store a binary model of the document (such as the DOM or a variant thereof) in an existing or custom data store. For example, this might map the DOM to relational tables such as Elements, Attributes, and Entities or store the DOM in pre-parsed form in a data store written specifically for this task. This includes the category formerly known as "Persistent DOM Implementations".
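Model-based storage on top of relational tables can be sketched as follows. This is a simplified illustration, not the design of any product named here: it assumes Python with `sqlite3` and `xml.etree.ElementTree`, uses hypothetical `Elements` and `Attributes` tables, and ignores entities, comments, namespaces, and text that follows a child element (mixed content).

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_document(conn, xml_text):
    """Shred a document into Elements/Attributes rows; return the root id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Elements ("
        "id INTEGER PRIMARY KEY, parent_id INTEGER, position INTEGER, "
        "tag TEXT, text TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Attributes ("
        "element_id INTEGER, name TEXT, value TEXT)"
    )

    def insert(element, parent_id, position):
        cur = conn.execute(
            "INSERT INTO Elements (parent_id, position, tag, text) "
            "VALUES (?, ?, ?, ?)",
            (parent_id, position, element.tag, element.text),
        )
        element_id = cur.lastrowid
        for name, value in element.attrib.items():
            conn.execute(
                "INSERT INTO Attributes VALUES (?, ?, ?)",
                (element_id, name, value),
            )
        # The position column preserves document order among siblings
        for index, child in enumerate(element):
            insert(child, element_id, index)
        return element_id

    return insert(ET.fromstring(xml_text), None, 0)


def load_document(conn, root_id):
    """Rebuild the document text from the stored model."""

    def build(element_id):
        tag, text = conn.execute(
            "SELECT tag, text FROM Elements WHERE id = ?", (element_id,)
        ).fetchone()
        element = ET.Element(tag)
        element.text = text
        attributes = conn.execute(
            "SELECT name, value FROM Attributes "
            "WHERE element_id = ? ORDER BY rowid",
            (element_id,),
        ).fetchall()
        for name, value in attributes:
            element.set(name, value)
        children = conn.execute(
            "SELECT id FROM Elements WHERE parent_id = ? ORDER BY position",
            (element_id,),
        ).fetchall()
        for (child_id,) in children:
            element.append(build(child_id))
        return element

    return ET.tostring(build(root_id), encoding="unicode")
```

Because the document is stored as rows, fragments can be retrieved or recombined with ordinary queries on `Elements`, at the cost of reassembling the tree whenever a whole document is requested.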
Native XML Databases There are two major differences between the two strategies. First, text-based storage can exactly round-trip the document, down to such trivialities as whether single or double quotes surround attribute values. Model-based storage can only round-trip documents at the level of the underlying document model. This should be adequate for most applications but applications with special needs in this area should check to see exactly what the model supports. The second major difference is speed. Text-based storage obviously has the advantage in returning entire documents or fragments in text form. Model-based storage probably has the advantage in combining fragments from different documents, although this does depend on factors such as document size, parsing speed (for text-based storage), and retrieval speed (for model-based storage). Whether it is faster to return an entire document as a DOM tree or SAX events probably depends on the individual database, again with parsing speed competing against retrieval speed.
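The round-tripping difference is easy to observe with any parser that builds a document model. The sketch below assumes Python's standard `xml.etree.ElementTree`, whose default parser discards comments and whose serializer always writes double quotes; the sample document and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET


def round_trip(original):
    """Return the text-based copy and the model-based copy of a document."""
    # Text-based storage keeps the characters exactly as they arrived.
    text_copy = original
    # Model-based storage keeps only what the model records, then re-serializes.
    model_copy = ET.tostring(ET.fromstring(original), encoding="unicode")
    return text_copy, model_copy


SAMPLE = "<note lang='en'><!-- draft --><body>Hi</body></note>"
TEXT_COPY, MODEL_COPY = round_trip(SAMPLE)
```

Here `TEXT_COPY` is identical to `SAMPLE`, whereas `MODEL_COPY` has lost the comment and the single quotes around the attribute value, although the elements, attributes, and character data are unchanged.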
Native XML Databases Native XML databases differ from XML-enabled databases in three main ways: • Native XML databases can preserve physical structure (entity usage, CDATA sections, etc.) as well as comments, PIs, DTDs, etc. While XML-enabled databases can do this in theory, this is generally not done in practice. • Native XML databases can store XML documents without knowing their schema (DTD), assuming one even exists. Although XML-enabled databases could generate schemas on the fly, this is impractical, especially when dealing with schema-less documents. • The only interface to the data in native XML databases is XML and related technologies, such as XPath, the DOM, or an XML-specific API. XML-enabled databases, on the other hand, offer direct access to the data, such as through ODBC.
Native XML Databases Some examples of this kind of software are: • Berkeley DB XML from Sleepycat Software. Berkeley DB XML is an application-specific native XML data manager built on Berkeley DB. Berkeley DB XML provides storage and retrieval for native XML data and semi-structured data. Berkeley DB XML is supplied as a library that links directly into the application's address space. This eliminates bottlenecks that occur in client-server systems. APIs are available in a number of languages, including C++, Java, Python, Perl, Ruby, and Tcl. Berkeley DB XML stores XML documents in collections. A single application may operate on many collections at the same time. A single application may also combine data from different collections easily. Non-XML data may be included by creating standard Berkeley DB tables. Tables and collections may be used together, with full support for Berkeley DB transactions and recovery services, by multiple users simultaneously. Berkeley DB XML enables fast look up by allowing individual collections to be indexed differently. This allows Berkeley DB XML to speed up the common queries over particular collections. Each collection supports multiple indexes. A wide variety of available indexing schemes support different XPath queries efficiently. Berkeley DB XML's Query Processor implements XPath 1.0. A cost-based query optimizer considers the indices that exist, the data volume that a query is likely to produce and the cost of computation and disk I/O to select a query plan with the lowest run-time cost. • Lore from Stanford University. Semi-structured data is data with more structure than a conversation, but less structure than a telephone book. A good example is a resume (curriculum vitae). While virtually all resumes include a name, address, and telephone number, only some will include an email address, Web site, or FAX number. Most will include a list of previous jobs, but others might include only a list of university courses. 
Depending on the profession, there might be a list of software used or licenses held. XML is well-suited to storing semi-structured data and shares a feature common to many semi-structured data models: it is self-describing. That is, it carries a certain amount of metadata with the data. In the case of XML, this is in the form of element type and attribute names. The legality of well-formed documents mirrors another feature found in many semi-structured data models: the data model is not required to have a definitive schema, and the model can be extended at will by the addition of new fields. Lore is a database designed for storing semi-structured data. Although it predates XML, it has been migrated for use as an XML database. It includes a query language (Lorel), multiple indexing techniques, a cost-based query optimizer, multi-user support, logging, and recovery, as well as the ability to import external data. Because Lore is designed for use with semi-structured data, XML documents without DTDs can be easily stored. An interesting feature of Lore is a DataGuide, which is a "structural summary of all paths in the database". Unlike structured databases, in which the structure is specified first and data is added according to that structure, data is entered first into Lore and the structure is then summarized. The resulting information is useful for query processing. The Lore executables are "available for public use". Source code may be available in some circumstances.
Native XML Databases • Tamino from Software AG. Tamino XML Server is a suite of products built in three layers - core services, enabling services, and solutions (third-party applications) - which may be purchased in a variety of combinations. Core services include a native XML database, an integrated relational database, schema services, security, administration tools, and Tamino X-Tension, a service that allows users to write extensions that customize server functionality. The XML engine uses the Data Map, which describes where the data in a given XML document is stored. This allows individual XML documents to be composed of data from multiple, heterogeneous sources, such as the native XML data store, relational databases, and the file system. Since the connections to external data (made through the X-Node module) are live and bidirectional, Tamino may thus be used to perform heterogeneous joins and updates. Tamino's XML support includes the DOM, JDOM, SAX, and XML:DB APIs, an extended XPath implementation called X-Query (not to be confused with W3C XQuery, which it predates), full-text retrieval, processing of XML documents with server-side XSL and CSS, and limited support for SOAP. It can store schema-less documents and can use schema information (including a subset of XML Schemas) if it is available. The internal SQL engine is directly addressable through ODBC, JDBC, and OLE DB. However, when addressed via these APIs, it cannot integrate data from the internal XML data store or from external data sources. Enabling services include X-Port, X-Plorer, X-Application, various APIs, X-Node, and the WebDAV Server. X-Port provides URL-based data transfer through various standard HTTP servers, X-Plorer is a browser-based navigation tool for documents stored in Tamino, and X-Application is a set of JSP tags for accessing Tamino through Web pages. 
The WebDAV Server adds namespace management, additional properties, and overwrite protection to the existing Tamino XML Server functionality. This allows Tamino to serve as a virtual file system where the information can be stored and retrieved using a standard Web browser and the common drag-and-drop metaphor. Tamino is not built on top of Adabas, a hierarchical database from Software AG. Instead, the Tamino data store was built from the ground up as a native XML database, drawing on the knowledge gained from developing Adabas.
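The idea behind the DataGuide mentioned for Lore, entering data first and summarizing its structure afterwards, can be sketched in a few lines. This is a deliberately simplified illustration and not Lore's actual data structure or API: it assumes Python with `xml.etree.ElementTree`, represents the summary as a flat set of path strings, and uses a hypothetical function name and hypothetical resume documents.

```python
import xml.etree.ElementTree as ET


def summarize_paths(documents):
    """Return the sorted set of distinct element and attribute paths."""
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.add(path)
        for name in element.attrib:
            paths.add(f"{path}/@{name}")
        for child in element:
            walk(child, path)

    for xml_text in documents:
        walk(ET.fromstring(xml_text), "")
    return sorted(paths)


# Two schema-less resumes with different optional fields
RESUMES = [
    "<resume><name>A</name><email>a@example.org</email></resume>",
    "<resume><name>B</name><job year='2001'>Developer</job></resume>",
]
SUMMARY = summarize_paths(RESUMES)
```

A query processor can consult such a summary to reject a path that occurs in no document, or to decide which paths are worth indexing, without scanning the data itself.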
XML Servers XML servers are XML-aware J2EE servers, Web application servers, integration engines, and custom servers. Unlike middleware, XML servers usually run in a separate process space from the application. Some XML servers are used to build distributed applications, such as e-commerce and business-to-business applications, where XML serves as the data transport. Others are used simply to publish XML documents to the Web. XML servers often contain complete application development environments and may provide access to data in a variety of data stores, including legacy databases, email messages, and application data. Net.Data from IBM is a Web server add-on for transferring data from a database to XML (or any text-based format). The product uses templates with a Net.Data-specific macro language. This is quite flexible, including variables, function definitions, loops, and if statements, as well as being able to parameterize SQL statements for nested queries.
Wrappers Wrappers are systems that treat XML documents as a source of relational data. (The term comes from federated database systems, where a wrapper is a component that "wraps" a source system so its data uses the model (usually relational) of a target system.) You can think of wrappers as the opposite of XML-enabled databases. That is, with wrappers, XML data is treated as relational data, while with XML-enabled databases, relational data is treated as XML data. Wrappers can be used in a variety of situations. One common use is so that data from an XML document can be included in a heterogeneous join, that is, a SELECT statement that joins data from different systems. Another common use is for editing XML documents. Although this latter use might seem surprising, it provides developers an easy and familiar way to modify XML documents that are structured like a table. Wrappers typically implement an SQL query engine, use an object-relational or table-based mapping, and work only with data-centric documents. DB2 Information Integrator from IBM and the OpenXML function in SQL Server from Microsoft are examples of this kind of software.
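A heterogeneous join of the kind a wrapper enables can be sketched as follows. The sketch does not use DB2 Information Integrator or OpenXML; it assumes Python with `sqlite3` and `xml.etree.ElementTree`, copies the XML rows into a temporary table rather than querying the document in place, and uses hypothetical table, column, and function names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def expose_as_table(conn, table, xml_text, row_path, columns):
    """Expose the elements matched by row_path as rows of a temporary table.

    Each name in `columns` is read from the child element of the same name.
    """
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TEMP TABLE "{table}" ({column_list})')
    for row in ET.fromstring(xml_text).findall(row_path):
        conn.execute(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [row.findtext(c) for c in columns],
        )


ORDERS_XML = (
    "<orders>"
    "<order><customer_id>1</customer_id><total>9.50</total></order>"
    "<order><customer_id>2</customer_id><total>20.00</total></order>"
    "<order><customer_id>1</customer_id><total>5.25</total></order>"
    "</orders>"
)


def totals_per_customer():
    """Join a relational table with data that lives in an XML document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Boris")]
    )
    expose_as_table(conn, "orders", ORDERS_XML, "order", ["customer_id", "total"])
    # XML values arrive as text, so they are cast before comparing and summing
    return conn.execute(
        "SELECT c.name, SUM(CAST(o.total AS REAL)) "
        "FROM customers AS c "
        "JOIN orders AS o ON c.id = CAST(o.customer_id AS INTEGER) "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()
```

The mapping is table-based: it only works because the document is data-centric, with one repeating element per row and one child element per column.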
Content Management Systems Content management systems are systems for storing, retrieving, and assembling documents from document fragments (content). They generally include such features as editors, version control, and multi-user access. Although they are usually built on top of a database (some are built on top of the file system), this is generally hidden from the user. An example is SiberSafe from SiberLogic. SiberSafe is a 100% Java, TCP/IP, HTTP and WebDAV-enabled multithreaded, load-balanced XML repository server that provides XML content management functionality in the following areas: • User-defined fragmentation of XML documents; • Fragment-level storage, locking and retrieval of XML documents; • Fragment-level versioning of XML documents; • Fragment-level indexing and search of XML documents; • XML document dependency tracking and external entity management; • Publishing XML documents into various formats, such as PDF, RTF, and HTML; • Fully integrated workflow control, including tracking of tasks and assignments; • Project-wide fragment-level branching and merging; • Multi-language translation automation; • Windows NT, WebDAV and HTTP (browser) clients available out-of-the-box. SiberSafe can also store documents other than XML, such as images and DTDs. SiberSafe is DTD-neutral and can work with any user-defined DTD. It can run on any JDBC-compatible database, but is configured by default for Microsoft Access.
Conclusion The information presented here was collected from the Web. There are many software products in the categories defined above; only the most important ones from a commercial and scientific point of view are presented here as examples. There are two main problems with XML in database systems: • How to store and retrieve XML documents? • How to merge older technologies (relational and object-relational ones) with XML? These problems are the focus of implementations of the quickly evolving XQuery standard, which only a few database systems support. Whether XML will be merged into current database systems, as happened when object-oriented and relational databases were merged into object-relational ones, the future will show, but undoubtedly XML is a new challenge for database systems.