The Utility of XML
Martin Doerr
Center for Cultural Informatics, Institute of Computer Science, Foundation for Research and Technology - Hellas
Heraklion, May 25, 2001
XML is
XML is a compromise between databases and free text. It takes the best of both without being perfect at either. It is readable. It allows meaning to be disambiguated. It is simple. It is rich enough to open a new systems paradigm.
What is a Document?
• A composite statement: a unit relating known facts, items and categories to new knowledge, expressed in language or in other media.
• It has an inner logic: the pure knowledge being rendered, independent of language and form.
• It has a meaningful structure: the sequence, arrangement or linking used to render the inner logic.
• It has a presentation: structure and style that assist perception and impression.
The statements...
Diego Velázquez is Spanish.
Diego Velázquez lived 1599-1660.
Diego Velázquez painted “Juan de Pareja”.
“Juan de Pareja” is a painting.
“Juan de Pareja” has dimensions 81.3 x 69.9 cm.
Juan de Pareja is Moorish.
Juan de Pareja is a painter.
Philip IV sent Velázquez to Italy.
...
What’s Wrong with HTML
• Written properly, ordinary HTML may reflect a document's presentation, but it cannot adequately represent the semantics and structure of the data.
The slide annotates each line of the markup with the field it actually holds; none of these labels appears in the HTML itself:
    <B>MONET, Claude</B><BR>                      <-- Artist Name
    Haystacks at Chailly at Sunrise<BR>           <-- Artifact Title
    1865<BR>                                      <-- Date
    Oil on canvas<BR>                             <-- Material
    30 x 60 cm (11 7/8 x 23 3/4 in.)<BR>          <-- Dimensions
    San Diego Museum of Art<BR>                   <-- Museum
    <P>
    <IMG SRC="http://192.41.13.240/artchive/m/monet/hayricks.jpg">   <-- Image Reference
User Problems / Design Reasons
• Preserving information units: who said that / self-contained
• Entering data:
    • what can I say,
    • what should I say,
    • how can I say it.
• Rendering data: how to tell my child, the public...
• Accessing data: querying, mediation
• Reusing data: transmission to other environments, merging, evolution of the local system, preservation for future use.
In Technical Terms
• Transformation that preserves meaning
• Correct adaptation of the presentation without knowing the meaning
• Packaging information for presentation as “1 document”
• Sequencing categories for data input
• Interpretation of intended meaning: searching
• Automatic relating of common meaning: merging of different statements
What’s wrong with
• Free texts: clear packaging, but rendered for one target only; not machine processable (poor querying, categories not comprehensive); poorly reusable; no help in entering or transforming data.
• HTML: solves platform independence of presentation, but the connection between meaning and presentation structure is weak. Not much better than free text.
• Databases: clear logical structure, categorization, machine processable, excellent querying; but presentation, transformation, merging and evolution are difficult, and there are no information units.
• XML: clear packaging, logical structure, machine processable if correctly used, clear separation and relation of meaningful structure and presentation. Helpful for entering data; easy to extend, transform and present. Can be queried, but the structure is not independent of the user's view.
XML and databases
• Databases:
    • Schema first: a complete, inflexible analysis of all categories and their relations is required before any data is entered.
    • Table structures: indexes are prepared in advance; excellent consistency enforcement.
• XML:
    • Data first: the structure is explanatory, can come second and need not be formalized; it is extensible, and DTDs can be combined.
    • Semi-structured: flexible, but with a weaker guarantee that a given question can be answered, and weaker consistency enforcement.
    • Embedded schema: each instance carries the schema it uses, so it can be queried by parsing, without index structures. This makes it an ideal transport format.
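The "semi-structured" trade-off above can be sketched in a few lines of Python with the standard library's xml.etree.ElementTree. This is only an illustration under stated assumptions: the ARTWORKS wrapper and the NOTE element are hypothetical names that do not appear in the slides. The query needs no index and no prior schema, but it also has no guarantee that every record can answer it.

```python
import xml.etree.ElementTree as ET

# Two records with different structure; only the first one carries a DATE.
# ARTWORKS and NOTE are hypothetical element names used for illustration.
RECORDS = """
<ARTWORKS>
  <ARTIFACT><TITLE>Haystacks at Chailly at Sunrise</TITLE><DATE>1865</DATE></ARTIFACT>
  <ARTIFACT><TITLE>Juan de Pareja</TITLE><NOTE>Painted by Diego Velazquez</NOTE></ARTIFACT>
</ARTWORKS>
"""

def dates_by_title(xml_text):
    """Query by parsing: map each TITLE to its DATE, or None if DATE is absent."""
    root = ET.fromstring(xml_text.strip())
    return {a.findtext("TITLE"): a.findtext("DATE") for a in root.iter("ARTIFACT")}

dates = dates_by_title(RECORDS)
for title, date in dates.items():
    print(title, "->", date if date is not None else "no date recorded")
```

A database with a NOT NULL date column would have rejected the second record at entry time; here the gap only shows up when the question is asked.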
Data First, Embedded Schema
• This document carries the interpretation with it. It is readable without knowledge of the schema.
    <ARTIST>
      <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>
      <ARTWORK>
        <ARTIFACT>
          <TITLE>Haystacks at Chailly at Sunrise</TITLE>
          <DATE>1865</DATE>
          <MATERIAL>Oil on canvas</MATERIAL>
          <DIM Metric='cm'><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
          <DIM Metric='in'><HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>
          <LOCATION>San Diego Museum of Art</LOCATION>
          <IMAGE File='http://192.41.13.240/artchive/m/monet/hayricks.jpg'/>
        </ARTIFACT>
      </ARTWORK>
    </ARTIST>
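To make "readable without knowledge of the schema" concrete, here is a minimal Python sketch that parses the example document with the standard library and reads fields purely by their element names. The function name describe and the keys of the returned dictionary are my own choices for illustration, not part of the talk.

```python
import xml.etree.ElementTree as ET

# The example document from the slide (straight quotes, space before File).
DOC = """
<ARTIST>
  <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>
  <ARTWORK>
    <ARTIFACT>
      <TITLE>Haystacks at Chailly at Sunrise</TITLE>
      <DATE>1865</DATE>
      <MATERIAL>Oil on canvas</MATERIAL>
      <DIM Metric='cm'><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
      <DIM Metric='in'><HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>
      <LOCATION>San Diego Museum of Art</LOCATION>
      <IMAGE File='http://192.41.13.240/artchive/m/monet/hayricks.jpg'/>
    </ARTIFACT>
  </ARTWORK>
</ARTIST>
"""

def describe(xml_text):
    """Pull out a few fields using nothing but the tag names in the document."""
    root = ET.fromstring(xml_text.strip())
    artifact = root.find("ARTWORK/ARTIFACT")
    # One (height, width) pair per DIM element, keyed by its Metric attribute.
    dims = {d.get("Metric"): (d.findtext("HEIGHT"), d.findtext("WIDTH"))
            for d in artifact.findall("DIM")}
    return {
        "artist": root.findtext("NAME/FIRST") + " " + root.findtext("NAME/LAST"),
        "title": artifact.findtext("TITLE"),
        "date": artifact.findtext("DATE"),
        "dimensions": dims,
        "image": artifact.find("IMAGE").get("File"),
    }

info = describe(DOC)
print(info["artist"], "-", info["title"], "-", info["dimensions"]["cm"])
```

No DTD was consulted: the tags alone say which string is the title and which pair of numbers is in centimetres, which is exactly what the HTML version could not express.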
What’s important
• Data first: delayed analysis, preserves data.
• Embedded schema: facilitates data transport, readable in the future.
• Separation of semantics and presentation: enables information reuse.
• Guides and controls data entry.
• The same meaning can be encoded in multiple formats:
    • DTD design depends on purpose: transport, presentation, data entry...
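The point about separating semantics from presentation can be illustrated with a short Python sketch: one semantic record, two independent renderings. The function names to_html and to_caption and the output layouts are assumptions made for this example; in practice a stylesheet language would usually play this role.

```python
import xml.etree.ElementTree as ET
from html import escape

# One semantic record, using element names from the example document.
RECORD = ("<ARTIFACT><TITLE>Haystacks at Chailly at Sunrise</TITLE>"
          "<DATE>1865</DATE><MATERIAL>Oil on canvas</MATERIAL></ARTIFACT>")

def to_html(artifact):
    """Web presentation: bold title, one field per line."""
    return "<B>{}</B><BR>{}<BR>{}".format(
        escape(artifact.findtext("TITLE")),
        escape(artifact.findtext("DATE")),
        escape(artifact.findtext("MATERIAL")),
    )

def to_caption(artifact):
    """Plain-text presentation, e.g. for a printed label."""
    return "{} ({}), {}".format(
        artifact.findtext("TITLE"),
        artifact.findtext("DATE"),
        artifact.findtext("MATERIAL"),
    )

artifact = ET.fromstring(RECORD)
html_view = to_html(artifact)
caption_view = to_caption(artifact)
print(html_view)
print(caption_view)
```

Both views are derived from the same record, so a correction to the data reaches every presentation, and a new presentation can be added without touching the data.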
Useful Applications
• Prescription for documentation / input
• Data transfer between systems (“middleware”)
• Document bases with full query access
• Combining a database with XML documents: mission-critical data in tables and DTD, rich extensible structures in DTD only
• Creating data for long-term use: even machine readable from paper!
• Creating information sets for multiple presentations
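One possible reading of "combining a database with XML documents" is sketched below in Python with the standard sqlite3 module: the fields that must always be present and queryable live in table columns, while the richer, extensible description travels as an XML fragment in a text column. The table name, column names and the split between columns and XML are assumptions for illustration only.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Mission-critical fields are typed, mandatory columns; the rest is XML text.
conn.execute(
    "CREATE TABLE artifact ("
    " title TEXT NOT NULL,"
    " year INTEGER NOT NULL,"
    " detail_xml TEXT)"
)
conn.execute(
    "INSERT INTO artifact VALUES (?, ?, ?)",
    (
        "Haystacks at Chailly at Sunrise",
        1865,
        "<ARTIFACT><MATERIAL>Oil on canvas</MATERIAL>"
        "<LOCATION>San Diego Museum of Art</LOCATION></ARTIFACT>",
    ),
)

def materials_before(connection, year):
    """Filter with SQL on the table column, then read detail from the XML."""
    rows = connection.execute(
        "SELECT title, detail_xml FROM artifact WHERE year < ?", (year,)
    )
    return [(title, ET.fromstring(xml).findtext("MATERIAL")) for title, xml in rows]

result = materials_before(conn, 1900)
print(result)
```

The database enforces and indexes what it knows about; new elements can be added to the XML fragment later without a schema migration.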
Final Remark
• How to encode meaning without structural ambiguities? => use RDF / RDFS
• How to standardize the meaning of element types (tags)? => use ontologies, e.g. formulated in RDFS!
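The structural ambiguity mentioned in the first bullet can be sketched in Python. Two XML documents nest the same fact in opposite ways, and both reduce to the same subject-predicate-object triple, which is the basic shape of an RDF statement. This is not RDF syntax and uses no RDF library; the two document layouts and the predicate name "painted" are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The same fact, nested artist-first and artwork-first (hypothetical layouts).
ARTIST_FIRST = ("<ARTIST><NAME>Claude Monet</NAME>"
                "<ARTWORK><TITLE>Haystacks at Chailly at Sunrise</TITLE></ARTWORK>"
                "</ARTIST>")
ARTWORK_FIRST = ("<ARTWORK><TITLE>Haystacks at Chailly at Sunrise</TITLE>"
                 "<ARTIST><NAME>Claude Monet</NAME></ARTIST>"
                 "</ARTWORK>")

def triples_artist_first(xml_text):
    """Read (artist, 'painted', title) triples from the artist-first layout."""
    root = ET.fromstring(xml_text)
    name = root.findtext("NAME")
    return {(name, "painted", w.findtext("TITLE")) for w in root.findall("ARTWORK")}

def triples_artwork_first(xml_text):
    """Read the same kind of triples from the artwork-first layout."""
    root = ET.fromstring(xml_text)
    title = root.findtext("TITLE")
    return {(a.findtext("NAME"), "painted", title) for a in root.findall("ARTIST")}

a = triples_artist_first(ARTIST_FIRST)
b = triples_artwork_first(ARTWORK_FIRST)
print(a == b)
```

Each layout needs its own reader, but once the statements are expressed as triples the nesting choice disappears and the two sources can be merged as plain sets.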