XML Vocabularies: Opportunities for Efficiency and Reliability • Steven R. Newcomb, srn@techno.com • TechnoTeacher and ISOGEN Int’l Corp.
A “Markup Vocabulary” is a list of names • Minimally, XML parsing yields elements with named types (tag names). • The list of these named element types (tag names) is the “vocabulary” of the document. (The names of their attributes are also part of the “vocabulary”.)
I can parse it, but what is it? • Is information in XML interchangeable?
<hperm>
  <thone>UDLINGBLON</thone>
  <kallow>29</kallow>
  <spec>GUINEA FOWL</spec>
  <date>2000 3 9</date>
</hperm>
(Vocabulary: hperm, thone, kallow, spec, date)
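A minimal sketch of what "the vocabulary of the document" means in practice: parse the permit and collect the element type names and attribute names it uses. The function name `vocabulary` is an assumption made for illustration. Note that the parser happily yields the names without saying anything about what they mean.

```python
import xml.etree.ElementTree as ET

PERMIT = """<hperm>
  <thone>UDLINGBLON</thone>
  <kallow>29</kallow>
  <spec>GUINEA FOWL</spec>
  <date>2000 3 9</date>
</hperm>"""


def vocabulary(xml_text):
    """Return (element type names, attribute names) used in a document."""
    root = ET.fromstring(xml_text)
    element_names, attribute_names = set(), set()
    for element in root.iter():
        element_names.add(element.tag)
        # Iterating over attrib yields the attribute names.
        attribute_names.update(element.attrib)
    return element_names, attribute_names


elements, attributes = vocabulary(PERMIT)
print(sorted(elements))    # the element type names, alphabetically
print(sorted(attributes))  # the permit uses no attributes
```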
XML “Namespaces” • XML “Namespaces” are vocabularies. • The XML “Namespace” recommendation is a step on the road toward interoperability for XML messages. • A “namespace” amounts to an abstract “place” where there is a list of element type names (tag names) and/or attribute names. • URIs specify the namespaces in use. • There is no requirement that the specified URI be valid, much less that the indicated resource conform to any sort of specification. • XML “Namespaces” provide a way to guarantee that names are unique, and that’s all.
XML “Namespaces” and expectations • In some sense, an XML resource that uses the names of an XML “namespace” must inherit from it expectations as to the meaning and conventional use of each of the names. (Right? Otherwise, why use it at all?)
I can parse it, but what is it? • Do namespaces help with interchange?
<hp:hperm xmlns:hp='http://www.gov.ng/hp.x'>
  <hp:thone>UDLINGBLON</hp:thone>
  <hp:kallow>29</hp:kallow>
  <hp:spec>GUINEA FOWL</hp:spec>
  <hp:date>2000 3 9</hp:date>
</hp:hperm>
(Vocabulary in the HP namespace: hperm, thone, kallow, spec, date)
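As a rough illustration of "unique names, and that’s all", the sketch below parses the namespaced permit with Python's standard library. The parser qualifies each name with the namespace URI but never dereferences that URI, so nothing is learned about structure or meaning. The helper `split_name` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

PERMIT_NS = """<hp:hperm xmlns:hp='http://www.gov.ng/hp.x'>
  <hp:thone>UDLINGBLON</hp:thone>
  <hp:kallow>29</hp:kallow>
  <hp:spec>GUINEA FOWL</hp:spec>
  <hp:date>2000 3 9</hp:date>
</hp:hperm>"""


def split_name(tag):
    """Split ElementTree's '{uri}local' form into (uri, local name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag  # name in no namespace


root = ET.fromstring(PERMIT_NS)
for element in root.iter():
    uri, local = split_name(element.tag)
    # The URI only disambiguates the name; it is never fetched.
    print(uri, local)
```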
XML “Namespaces”: Unresolved issues • How to express a namespace so it can be shared? What is the list of names? • How to write processing software for a namespace? • How to determine whether software for a namespace works according to the expectations of other users of the namespace? • How to determine whether an XML resource conforms to the syntactic and semantic requirements of the namespace? • How to determine, when information interchange fails, whose software is at fault? (The software that created the XML resource, or the recipient application?) Accountability is vital to interchange.
XML Vocabularies in open environments • Ideally, an XML resource is self-describing. • Since many XML resources use the same vocabularies, it’s efficient to describe them in terms of the vocabularies they use. • Anybody who receives a well-described XML resource should be able to interpret it accurately. • Anybody should be able to create an XML resource that uses a vocabulary correctly, so that its recipient will interpret it accurately. • Vocabularies should be able to support entire industries and areas of human endeavor, in open, multivendor environments. • Vocabularies should offer huge advantages in efficiency and reliability.
XML Vocabularies in closed environments • Closed syndicates and would-be cartels need to resolve the same issues, so that their XML messages will interoperate. • It’s extremely inefficient for each syndicate to invent its own methodologies and tools for guaranteeing reliable vocabulary-based interoperability. • It’s also a net contraction in the noosphere of the syndicate. Where to find technical expertise? How to maintain it? Etc. • Enlightened self-interest demands that the same methodologies and tools that support open interoperability be used internally. • Vocabularies should offer huge advantages in efficiency and reliability.
Methodologies and Tools for Vocabularies • Vocabularies can be used to make XML resources fully self-describing, fully interchangeable and fully interoperable, down to the last syntactic and semantic feature. • This can be accomplished using existing W3C and ISO recommendations and standards, all from the XML and SGML families of recommendations and standards. • Alternatively, the same principles could be applied using different modeling syntaxes, purpose-built for the Web. • …but if it can be done without reinventing everything, why bother?
Processing of XML resources: 2 stages • The first stage of vocabulary processing can be accomplished by a single generic piece of software, the XML parser. • XML parsers don’t do much vocabulary processing yet. • First stage: vocabulary syntax processing and validation: • Check for conformance of the XML resource to each of the vocabularies it uses, to see whether invalid names were used. • Check for conformance to the structural model (DTD?) of each vocabulary used. Is each name used in a valid context with respect to the other names in the same namespace? Meanings of names may change with context! • Check for conformance of data and attribute values to the lexical models of valid data for each element/attribute in each vocabulary.
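The three Stage 1 checks can be sketched generically for the hunting-permit example. This is only an illustration under stated assumptions: the tables `CONTENT_MODEL` and `LEXICAL_MODEL` are hand-written stand-ins for what a DTD (plus lexical typing) would declare, the patterns are guesses at plausible lexical models, and Python's standard library is used as the parser, which does not itself validate against DTDs.

```python
import re
import xml.etree.ElementTree as ET

ROOT_NAME = "hperm"
# Hypothetical structural model: the ordered children each name may contain.
CONTENT_MODEL = {
    "hperm": ["thone", "kallow", "spec", "date"],
    "thone": [], "kallow": [], "spec": [], "date": [],
}
# Hypothetical lexical models for the character data of each leaf element.
LEXICAL_MODEL = {
    "thone": re.compile(r"\S.*"),
    "kallow": re.compile(r"\d+"),
    "spec": re.compile(r"\S.*"),
    "date": re.compile(r"\d{4} \d{1,2} \d{1,2}"),  # YYYY MM DD
}


def validate(xml_text):
    """Return a list of error strings; an empty list means no errors found."""
    errors = []
    root = ET.fromstring(xml_text)
    if root.tag != ROOT_NAME:
        errors.append(f"root element must be {ROOT_NAME}, found {root.tag}")
    for element in root.iter():
        name = element.tag
        # Check 1: is the name part of the vocabulary at all?
        if name not in CONTENT_MODEL:
            errors.append(f"unknown name: {name}")
            continue
        # Check 2: is the name used in a valid structural context?
        children = [child.tag for child in element]
        if children != CONTENT_MODEL[name]:
            errors.append(f"{name}: expected children {CONTENT_MODEL[name]}, found {children}")
        # Check 3: does the data conform to the lexical model?
        pattern = LEXICAL_MODEL.get(name)
        if pattern is not None and not pattern.fullmatch((element.text or "").strip()):
            errors.append(f"{name}: value does not match its lexical model")
    return errors


PERMIT = ("<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
          "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>")
print(validate(PERMIT))
print(validate(PERMIT.replace("2000 3 9", "next spring")))
```

Because the models are data rather than code, the same `validate` function could in principle serve any vocabulary whose models are expressed this way, which is the efficiency argument the slide makes for doing this work once, in the parser.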
Processing of XML resources: 2 stages • The second stage of vocabulary processing is semantic interpretation of the vocabularies. • Since all vocabularies are different, according to the natures of their applications, no generic piece of software can interpret all vocabularies. • However, a paradigm in which vocabulary-specific processing never needs to duplicate code found in the software that processes any other vocabulary could offer significant efficiencies and enhanced reliability. More on this in a moment.
More efficiency/reliability in Stage 1 • (Reminder: Stage 1 is vocabulary syntax processing and validation.) • Provide a formalism for the expression of vocabularies: the list of names, the contexts in which names can be used, and lexical models for the data contained in elements and in the values of attributes named in vocabularies. The existing DTD formalism can already do most of this. • Let’s not force applications to duplicate the functionality of checking the validity of vocabulary usage in XML resources. Let’s build it into re-usable validating XML parsers. They already validate against DTDs. Why not use that existing functionality for inherited vocabularies?
There is an ISO standard for declaring, in an XML resource, conformance to one or more inheritable XML vocabularies. (In the ISO context, such a vocabulary is called an “inheritable information architecture”.) Vocabularies can inherit from other vocabularies. A single XML resource can inherit from more than one vocabulary. Vocabularies are expressed using ordinary DTD syntax (with minor, optional enhancements). Demonstration using the Topic Map inheritable vocabulary. SX already validates inherited vocabularies.
How to document vocabularies? • It would be great to be able to document vocabularies more effectively than we can now.
Which constructs are the comments about?
<!-- Indigenous person's hunting permit -->
<!ELEMENT hperm (thone, kallow, spec, date)>
<!ELEMENT thone (#PCDATA)> <!-- person's name -->
<!ELEMENT kallow (#PCDATA)> <!-- kill allowance -->
<!ELEMENT spec (#PCDATA)> <!-- species -->
<!ELEMENT date (#PCDATA)> <!-- expiration date -->
<!-- lexical model: YYYY MM DD -->

<hperm>
  <thone>UDLINGBLON</thone>
  <kallow>29</kallow>
  <spec>GUINEA FOWL</spec>
  <date>2000 3 9</date>
</hperm>
Documenting vocabularies • Topic maps are an extremely powerful way of documenting DTDs. • ...but that’s another story for another time.
More efficiency/reliability in Stage 2 • Reminder: “Stage 2” is application-specific (i.e., vocabulary-specific) processing of XML resources, after parsing and other processing common to all XML resources has already been done. • Stage 2 is about resource interoperability, not just about interchangeability. It’s about how we can guarantee that everyone understands the resource in the same way. • It’s about the meaning of each name in a vocabulary. • It’s about the meaning of the data associated with each vocabulary name in each resource that uses the vocabulary. • It’s about expectations: the resource creator’s expectations about what will be understood by recipients of the resource, and the recipients’ expectations about the kinds of things that a resource that uses a certain vocabulary can say.
More efficiency/reliability in Stage 2 • No generic processor can understand all vocabularies. In general, a special processor is needed for each vocabulary. • Still, there are huge opportunities, even in Stage 2, for efficiency and reliability: • There can be a common way to express vocabulary-specific semantics. At least some of these expressions can be formal and machine-readable, so tools can be built that enhance the productivity of application builders. • Many XML resources can inherit multiple vocabularies, thus recycling existing knowledge about vocabularies, and avoiding redundant learning cycles. (Example: XLL combined with BizTalk.) • A re-usable software engine can be built for each vocabulary, and means for plugging such engines into applications can be developed. (Same example applies.)
Modeling is the key • In Stage 1 of XML resource processing, models of the structural and lexical requirements associated with each vocabulary can drive a generic parsing/validating process. • In Stage 2 of XML resource processing, models of the abstract information sets that can be conveyed by specific vocabularies can be created. • These “abstract APIs” give names to each of the properties of the information set that “emerges” from processing a vocabulary. • Abstract API models are contracts between programmers, just as a DTD is a contract between information users and providers. • In an actual implementation of a vocabulary processing engine, these property names can become function calls (or whatever). • In other words, these abstract information set models can drive a generic engine-building process that produces vocabulary-specific engines.
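To make "property names can become function calls" concrete, here is a hedged sketch of an abstract API for the hunting-permit vocabulary. The class `HuntingPermit` and its property names are invented for illustration; the meanings assigned to `thone`, `kallow`, `spec` and `date` follow the comments in the example DTD. The method `is_valid_on` stands for a property that is not written in the markup but emerges from it.

```python
import datetime
import xml.etree.ElementTree as ET


class HuntingPermit:
    """Hypothetical abstract API over a parsed <hperm> element."""

    def __init__(self, element):
        self._element = element

    @property
    def holder_name(self):
        return self._element.findtext("thone")

    @property
    def kill_allowance(self):
        return int(self._element.findtext("kallow"))

    @property
    def species(self):
        return self._element.findtext("spec")

    @property
    def expiration_date(self):
        # Lexical model from the DTD comment: YYYY MM DD
        year, month, day = (int(p) for p in self._element.findtext("date").split())
        return datetime.date(year, month, day)

    def is_valid_on(self, day):
        """Derived property: true until the expiration date has passed."""
        return day <= self.expiration_date


permit = HuntingPermit(ET.fromstring(
    "<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
    "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>"))
print(permit.holder_name, permit.kill_allowance, permit.expiration_date)
```

An application written against these names never needs to know the tag names, which is what lets the interchange structure and the API be designed separately.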
Bi-directional transformation • All XML resources convey information that really has two forms: • the interchangeable (but otherwise useless) XML form, and • the parsed, processed, application-internal form. • “Stages 1 and 2” are about the conversion from the interchange form to the useful form. • The other transformation, from the useful form to the interchange form, is at least equally important. • For reliable, efficient information interchange, the nature of both transformations must be documented. • It would be great if the URI of the vocabulary’s “namespace” pointed at a document that had both models, and explained the algorithms involved in transforming information between them.
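A small sketch of the two transformations for the hunting-permit example, assuming a hypothetical internal form (`Permit`) chosen only for illustration. The round-trip check at the end is one simple way to test that the two documented transformations agree with each other.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Permit:
    """Hypothetical application-internal form of an <hperm> resource."""
    holder: str
    allowance: int
    species: str
    expires: tuple  # (year, month, day)


def from_xml(xml_text):
    """Interchange form -> internal form."""
    root = ET.fromstring(xml_text)
    expires = tuple(int(part) for part in root.findtext("date").split())
    return Permit(root.findtext("thone"), int(root.findtext("kallow")),
                  root.findtext("spec"), expires)


def to_xml(permit):
    """Internal form -> interchange form."""
    root = ET.Element("hperm")
    ET.SubElement(root, "thone").text = permit.holder
    ET.SubElement(root, "kallow").text = str(permit.allowance)
    ET.SubElement(root, "spec").text = permit.species
    ET.SubElement(root, "date").text = " ".join(str(n) for n in permit.expires)
    return ET.tostring(root, encoding="unicode")


original = Permit("UDLINGBLON", 29, "GUINEA FOWL", (2000, 3, 9))
# Round trip: serializing and re-parsing should give back the same information.
assert from_xml(to_xml(original)) == original
print(to_xml(original))
```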
A common fallacy: DTD is API • The fallacy is: the structure of an XML resource should also be the API to the information it contains. • Trying to make the element structure also be the API makes it impossible to have both a good interchange structure and a good API. The attempt introduces inefficiency and invites unreliability of information interchange. • The Document Object Model (DOM) is an API to the generic structure of XML resources. It is not and can never be the API to the information sets conveyed by all vocabularies. • If, e.g., the XLL vocabulary’s functionality gets built into the DOM, what vocabulary’s functionality shouldn’t be built into the DOM? No committee can possibly do all this work!
Desirable qualities in an interchange syntax • Maximal appropriateness to the information it conveys • intrinsic character of information well reflected in interchange structure. • Communications efficiency • no redundancy • Validatability • no ambiguity • Neutrality • no hidden assumptions about platform, vendor or application • Self-description • conformance to intelligible, well-documented formal model
Interchange syntax model is a contract • DTD is a contract between • information creators • information consumers • applications developers • DTD enhanced with type checking, lexical typing, etc., is a more detailed contract between the same players
Desirable qualities in an Abstract API • Maximal convenience for applications developers • Abstract API is intuitive for learning and use • Abstract APIs often need redundant access methods, for the convenience of programmers • Processing tasks common to all applications (beyond parsing and validation) are supported by the implementation of the abstract API. • Abstract API should include both: • Properties directly derivable from syntactic structure of interchange form. • Properties implicit in architecture but not reflected in syntactic structures. • Neutrality • no hidden assumptions about platform, vendor or application. • Self-description • API is intelligible, well-documented
Abstract API model is a contract, too • ...between programmers of applications that, with respect to a given vocabulary: • Create XML resources. • Receive XML resources and use the information they convey. • Support the creation of XML resources that link to the emergent properties of other resources. • Support the querying of XML resources with respect to the values of specific emergent properties.
Two sides of one coin • The interchange syntax model and the abstract API are two aspects of the same information set: • Syntax model = consensus about the interchange format of the information set • Abstract API = consensus about the abstract properties of the information set
XML needs: • Enhanced syntactic modeling capabilities for generic XML processing/validation. Especially: means for inheriting multiple vocabularies in XML instances, and for proving that they are all used correctly. Note: lexical modeling features, and many other syntactic enhancements, can be made to XML by means of vocabularies. • Semantic modeling capabilities that allow us to give names to the emergent properties of XML resources that use vocabularies. • A convention, such as that which exists for XML “Namespaces” today, for pointing to these models from within XML resources, so as to indicate the use of a given vocabulary.
Semantic modeling: emergent properties • Example of an “emergent” property: The property of being a target of an xlink (considering XLL as a vocabulary, as it is in ISO-land). • All emergent properties of a vocabulary must be described clearly, comprehensively, unambiguously, and formally, because • accuracy and reliability are important. • the information is expected to be useful in multi-vendor application environments (if not, why inherit a vocabulary at all?). • implementation of vocabulary-specific applications must be done at reasonable cost.
Semantic validation becomes a side-effect • Computing an emergent property value often isn’t possible without validating the interchanged information on which the computation is based. • For example, if an element that inherits from a vocabulary specifies a "start-time" attribute and an "end-time" attribute, we may intend that the duration of time between the start-time and the end-time be calculable and that it fall within a certain range (or at least be non-negative). In any case, we can’t calculate the value of the “duration” property unless the start-time and end-time values exist and are amenable to calculation.
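The start-time/end-time example can be sketched as follows. Several details are assumptions made for illustration and are not in the slides: the element name `event`, ISO 8601 as the lexical form of the time values, the 24-hour upper bound, and the `SemanticError` exception. The point is that computing the "duration" property necessarily checks the inputs along the way.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta


class SemanticError(ValueError):
    """Raised when an emergent property cannot be computed from the resource."""


def duration(element, max_duration=timedelta(hours=24)):
    """Compute the emergent 'duration' property, validating as a side effect."""
    start_text = element.get("start-time")
    end_text = element.get("end-time")
    if start_text is None or end_text is None:
        raise SemanticError("both start-time and end-time must be present")
    try:
        # Assumption: values are ISO 8601 date-times.
        result = datetime.fromisoformat(end_text) - datetime.fromisoformat(start_text)
    except (ValueError, TypeError) as exc:
        raise SemanticError(f"time values are not amenable to calculation: {exc}") from exc
    if result < timedelta(0):
        raise SemanticError("end-time precedes start-time")
    if result > max_duration:
        raise SemanticError("duration exceeds the permitted range")
    return result


event = ET.fromstring(
    '<event start-time="2000-03-09T08:00:00" end-time="2000-03-09T09:30:00"/>')
print(duration(event))
```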
A standard property language exists… • It’s called “Property Sets” • A property set is an XML document that conforms to the ISO standard DTD for property sets. • Already in commercial use; the software already works with XML. • Every class of information component (“node”), and every property of every class, has a unique name. • These names can be used in queries. • This whole idea is often called “the Grove Paradigm.” It’s the basis of SGML processing, and the SGML Property Set aided the development of the DOM.
In the Grove Paradigm... • Vocabulary-specific engines can be plugged together in applications that support XML resources that use multiple vocabularies. • Vocabulary-specific engines generate a "grove" (object graph with relevant Property Set as schema) from any vocabulary-conforming XML instance. • Vocabulary-specific engines can mature and offer reliable semantic validation and processing services in a variety of application contexts, instead of being rebuilt in each application. • Time and cost of developing applications is reduced, while reliability of information interchange increases.
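A loose sketch of the idea, not of the ISO property-set syntax: a plain Python dictionary stands in for a property set (which, per the previous slide, is really an XML document conforming to an ISO DTD), and a tiny engine builds a grove-like object graph from the hunting-permit instance. All class names, property names and the dotted query notation are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for a property set: each node class and the properties it carries.
PROPERTY_SET = {
    "permit": ["holder", "allowance", "species", "expires"],
    "person": ["name"],
}


class Node:
    """A grove node whose properties must match its class in the schema."""

    def __init__(self, node_class, **properties):
        expected = PROPERTY_SET[node_class]
        if sorted(properties) != sorted(expected):
            raise ValueError(f"{node_class} nodes must have exactly {expected}")
        self.node_class = node_class
        self.properties = properties


def build_grove(xml_text):
    """Hypothetical vocabulary-specific engine: <hperm> instance -> grove."""
    root = ET.fromstring(xml_text)
    holder = Node("person", name=root.findtext("thone"))
    expires = tuple(int(part) for part in root.findtext("date").split())
    return Node("permit", holder=holder, allowance=int(root.findtext("kallow")),
                species=root.findtext("spec"), expires=expires)


def query(node, path):
    """Follow a dotted path of property names, e.g. 'holder.name'."""
    value = node
    for name in path.split("."):
        value = value.properties[name]
    return value


grove = build_grove("<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
                    "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>")
print(query(grove, "holder.name"), query(grove, "allowance"))
```

Because every class and property has a name fixed by the schema, a query such as `holder.name` means the same thing to every application that uses the engine, regardless of how the information was tagged in the interchange form.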
The Grove Paradigm is Portable The Grove Paradigm is highly portable: it can be used with any notation, not just XML and SGML. Property sets can be used as a way to represent consensus about how to address the abstract properties of any notation. Think about it: a vocabulary is a notation. (And XML is a notation for vocabulary-notations.) Let’s look at some groves! (GroveMinder demo.)
Summary: Designing XML Vocabularies • Questions to ask: • Must certain semantic processing and validation operations be performed by all applications of this vocabulary? • Will more than one application have to deal with this vocabulary? • If so, its syntactic requirements deserve to be made explicit in a DTD (or something like a DTD), and • A property set (or other explicit Abstract API) defined for it will pay big dividends • in software reuse • in achieving widespread consensus about what the vocabulary really means • in determining what went wrong when vocabulary-mediated information interchange fails
The preceding SX and GroveMinder demos are available from Steve Newcomb, srn@techno.com