
The XML/SGML Conundrum



Presentation Transcript


  1. The XML/SGML Conundrum • Presented by Joseph V. Gangemi, Senior Consultant, J.V.G. Consulting Services © 2005

  2. Agenda • Compare aspects of XML and SGML • Explore rationale for choosing between them • Discuss effect of XML on publishing applications • Some personal thoughts on SGML • Where do we go from here? • Audience participation is encouraged

  3. What is a Conundrum? • A question or problem having only a conjectural answer • Should I use SGML or XML? • Do I need a DTD or an XSD? • Will publishing survive the XML marketing hype? • An intricate and difficult problem • Does XML address the needs of my application?

  4. What is SGML? • Markup Language for Text Processing • ISO Standard 8879 • Syntax rules for defining a markup language • Does not include a set of tags • Full-featured to support a wide array of applications • Perceived as complex, but in reality, text applications are complex • Not all features are used in all applications

  5. What is XML? • Subset of SGML • Designed for the transport of text over the Internet • Extensible Markup Language meant to complement HTML • More generic; not a formatting language • Structure rather than format oriented • Less complex from a programmer’s perspective • Not necessarily the user’s!

  6. The XML Myth • XML does not require a DTD or XSD • Yes and No • From a programming perspective, the DTD or XSD is optional (for some applications) • This is also true for SGML • From a practical design perspective, IT IS MANDATORY • No inherent benefit to XML (or SGML) without one • If you cannot define and enforce a tag set, calling it a language is a misnomer.
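
The "optional from a programming perspective" point can be illustrated with a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser (not something the slides mention). The element names are invented for the example. With no DTD or XSD, the parser checks well-formedness only, so any tag set is accepted.

```python
import xml.etree.ElementTree as ET

# No DOCTYPE and no schema: the parser only checks well-formedness,
# so an arbitrary (hypothetical) tag set is accepted without complaint.
doc = "<memo><to>Editor</to><body>Any tag names parse.</body><whatever/></memo>"
root = ET.fromstring(doc)
tags = [child.tag for child in root]

# Improper nesting is still rejected, with or without a DTD.
try:
    ET.fromstring("<memo><to>Editor</memo></to>")
    badly_nested_accepted = True
except ET.ParseError:
    badly_nested_accepted = False
```

Nothing here enforces which elements may appear or in what order, which is the design-level gap the slide describes.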

  7. Choosing between XML and SGML (Putting aside the marketing appeal and looking at the technology)

  8. XML / SGML Feature Sets • Review SGML features omitted from XML • Original Intent of feature • Reason for omission from XML • Effect of Omission on Publishing Applications • How is user affected? • How is product vendor affected?

  9. SGML Declaration • Describes the processing environment in which the document can be processed. • Identifies character set • Specifies features required for successful processing • Purely technical information • User has minimal, if any, awareness of it • Product vendors benefit by being able to tailor product to each client’s application environment • Once environment is defined for an organization, it has minimal value (except, of course, to the vendor). • XML has a defined environment (World Wide Web) • In effect, it has a built-in SGML declaration • Processing is restricted to the defined environment

  10. SGML DCL Features • XML defaults the following features to NO (which means they are not supported) • DATATAG • OMITTAG • RANK • LINK • CONCUR • SUBDOC • SHORTREF

  11. Datatag • A string of characters that acts as both data and the end-tag of the currently open element. • Actually defined poorly since its purpose is to delimit repetitive elements. If it closes an element, it inherently opens the next one too. • Never implemented because short references became a better way to achieve the same objective. • Considered an irrelevant technique and is not supported by vendors.

  12. Omittag • Refers to tag minimization • Start and end tags can be omitted under certain parsing conditions • Method of reducing the character count or physical size of a file back when it meant something to do that • Vendor products make the feature moot.

  13. Rank • Rank IS rank • Irrelevant concept, poorly conceived, badly defined, and eschewed by us elite purists in the industry • Attempt to take linear tagging for typesetting and treat it hierarchically • Even Charles wants to see this gone! • No relevance to vendor or user

  14. Link • Method whereby a process can be associated with an element through an attribute • A mechanism for inserting process-specific information into a document • Attempt to connect some esoteric concepts that rambled around in Charles's mind with some real world processing • Difficult to define, impossible to understand, and not considered relevant to most forms of text processing • Users ignore it; vendors shun it.

  15. Concur • Concur • Good concept, but never fully grasped by the user community • Concurrent document structures coexisting in the same instance. • Primarily meant to express the document structure and the formatting structure associated with the document concurrently • Never implemented • Replaced (kind-of) by namespaces

  16. Subdoc • Ability to include subordinate documents into a master document • Defined primarily to support things like anthologies • Subdocument may have its own DTD, but must conform to a single SGML declaration • Limited value to the user • Overkill for the vendor • Workarounds are simple and more practical

  17. Short References • Character strings that represent an entity reference • Originally intended to reduce keystrokes • Allows characters in content to act as mark-up • A Shortref declaration defines a set of string-to-entity mappings • Named set • Different mappings for same string in different sets • Activated by USEMAP declaration that associates map set to element • In effect for duration of element

  18. Feature Recap • DATATAG – not supported in SGML or XML • No effect on user or vendor • OMITTAG – not supported in XML • Vendor tools have reduced its value • RANK – not supported in SGML or XML • LINK – not supported by vendors or XML • CONCUR – not supported by vendors or XML • SUBDOC – not supported in XML • minimal SGML support • SHORTREF – not supported in XML • No negative impact on the publishing process

  19. Shared Features • SGML and XML share many features • Not always equally • Public and System Identifiers • Notations (with restrictions) • Parameter entities (with restrictions) • Marked sections (with restrictions) • Character and Entity References

  20. Public Identifiers • Used to identify something associated with but separate from the document instance • A consistent way to refer to another entity regardless of what it is or where it is • Must be unique within its processing universe • Global: must be registered with central authority • Local: unregistered, but managed within the processing scope • Resolved through a standard catalog entry • XML copped out and requires system identifiers

  21. System Identifiers • A system identifier points to a specific object • XML uses a URI to specify the entity (usually a file) • Since XML is designed for the Web, the URI is invariably a URL • Entity considered static and not likely to change • In SGML, a System Identifier usually points to a physical file • Seldom used in a production application • Entities considered dynamic and subject to change • Unregistered, local public identifiers are preferred

  22. Public vs System Identifiers • XML is designed for the Web • Represents text in a specific instance for a specific purpose • URLs are preferred method of accessing external entities • System Identifiers support URLs easily • SGML is non-denominational • Designed for text in any environment • Supports information management from data capture, through editorial processing, to finished product • Public Identifiers are more versatile and better suited to changing entities • XML syntax supports Public Identifiers • Not all XML-compliant applications do. • XML products derived from SGML products usually do.
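
The catalog indirection behind public identifiers can be sketched roughly as follows. This is an illustration only: the catalog text imitates the `PUBLIC "identifier" "location"` entry style of SGML Open catalogs, the identifiers and paths are invented, and `load_catalog` and `resolve` are hypothetical helper names rather than part of any real tool.

```python
import shlex

# Hypothetical catalog: each entry maps a public identifier to a local file.
CATALOG_TEXT = '''
PUBLIC "-//EXAMPLE//DTD Article//EN" "dtds/article.dtd"
PUBLIC "-//EXAMPLE//ENTITIES Symbols//EN" "entities/symbols.ent"
'''

def load_catalog(text):
    """Read PUBLIC entries into a dict of public identifier -> location."""
    catalog = {}
    for line in text.splitlines():
        fields = shlex.split(line)  # handles the quoted strings
        if len(fields) == 3 and fields[0] == "PUBLIC":
            catalog[fields[1]] = fields[2]
    return catalog

def resolve(public_id, system_id, catalog):
    """Prefer the catalog entry; fall back to the system identifier."""
    return catalog.get(public_id, system_id)

catalog = load_catalog(CATALOG_TEXT)
via_catalog = resolve("-//EXAMPLE//DTD Article//EN",
                      "http://example.com/article.dtd", catalog)
via_system = resolve(None, "http://example.com/article.dtd", catalog)
```

When an entity moves, only the catalog entry changes and the documents that cite the public identifier stay untouched. A system identifier alone ties every document to one fixed location.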

  23. Notation and Parameter Entities • Notation is similar in XML and SGML • Vendor support is usually proprietary • More conceptual than practical • No industry-wide implementation model across platforms • Parameter entities are supported in XML, but XML restricts their use to DTDs; i.e., not permitted in marked sections (or XSDs)

  24. Marked Sections • Marked Sections are restricted to CDATA in XML • Recognizes string <![CDATA[ before the content and ]]> after the content • CDATA is not parsed by the parser; i.e., embedded tags are ignored • SGML allows any type of data in a marked section, even parsable data • You can control if the marked section is included (parsed or at least passed to the application) or ignored by the parser
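
A short sketch of the CDATA behavior, assuming Python's standard `xml.etree.ElementTree` parser. The sample text is invented. Everything between `<![CDATA[` and `]]>` reaches the application as character data, so the embedded tag and the bare ampersand are not treated as markup.

```python
import xml.etree.ElementTree as ET

doc = "<para>Markup sample: <![CDATA[<b>not a tag</b> & not an entity]]></para>"
para = ET.fromstring(doc)

# The CDATA content is delivered as plain text; no <b> child element is built.
text = para.text
child_count = len(para)
```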

  25. Effect on Publishing • XML restrictions limit value of Marked Sections in a publishing application • Need include / ignore option • Require parameter entity support to implement include / ignore • Workarounds are tedious, especially if Marked Sections are not on element boundaries

  26. More XML Variants • The PIC (processing instruction close) delimiter is ?> • Quantities and capacities are effectively unlimited • Names are case sensitive • (not necessarily a good thing) • Underscore and colon are allowed in names • Names can use Unicode characters and are not restricted to ASCII • Unicode is not widely supported in publishing systems • SGML can and does accept these variants by modifying the SGML declaration
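
A few of these variants can be checked with a small sketch, assuming Python's standard `xml.etree.ElementTree` parser. The element names are invented and `is_well_formed` is a hypothetical helper. The colon case is left out because a namespace-aware parser treats the colon as a prefix separator.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

same_case = is_well_formed("<Chapter></Chapter>")
mixed_case = is_well_formed("<Chapter></chapter>")   # names are case sensitive
underscore_name = is_well_formed("<front_matter/>")  # underscore allowed in names
unicode_name = is_well_formed("<résumé/>")           # names are not limited to ASCII
```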

  27. Built-in XML Entity References • Predefined entities in XML • &amp; for ampersand (&) • &lt; for less than (<) • &gt; for greater than (>) • &apos; for apostrophe (') • &quot; for quotation mark (") • Not predefined in SGML • Must be declared if used • Programmer’s convenience if DTD not used.
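
A minimal sketch of the predefined entities, assuming Python's standard `xml.etree.ElementTree` parser. The five built-in references resolve with no DTD present, while any other entity name (here `&nbsp;`, chosen for the example) must be declared before use.

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no declaration.
text = ET.fromstring("<p>&amp; &lt; &gt; &apos; &quot;</p>").text

# Any other entity reference fails when nothing declares it.
try:
    ET.fromstring("<p>&nbsp;</p>")
    nbsp_accepted = True
except ET.ParseError:
    nbsp_accepted = False
```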

  28. External Entity References • References to external data entities in content are not supported in XML • Significant restriction to data organization facilities built into SGML • Often used to represent embedded symbols in running text • Unicode replaces this approach, but not always supported • External entities must be managed within the application’s environment • Simple workaround • Use empty element with an attribute whose value is declared as an entity • Affects DTD or XSD because element must be declared
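
The workaround might look like the following sketch. The `symbol` element, its `ref` attribute, the `alpha-glyph` entity, and the file name are all invented for illustration. The Python side assumes the standard `xml.parsers.expat` module, which is non-validating but does report declarations from the internal DTD subset.

```python
import xml.parsers.expat

# An empty element carries an ENTITY-typed attribute instead of an
# external data entity reference in the running text.
DOC = """<?xml version="1.0"?>
<!DOCTYPE para [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY alpha-glyph SYSTEM "alpha.gif" NDATA gif>
  <!ELEMENT para (#PCDATA | symbol)*>
  <!ELEMENT symbol EMPTY>
  <!ATTLIST symbol ref ENTITY #REQUIRED>
]>
<para>The angle <symbol ref="alpha-glyph"/> is small.</para>"""

entities = {}    # entity name -> (system identifier, notation name)
references = []  # entity names used on <symbol> elements

def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    # Unparsed (NDATA) entities arrive with a notation name.
    if notation is not None:
        entities[name] = (system_id, notation)

def on_start(name, attrs):
    if name == "symbol":
        references.append(attrs["ref"])

parser = xml.parsers.expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.StartElementHandler = on_start
parser.Parse(DOC, True)

# The application, not the parser, connects each reference to its entity.
resolved = [(ref, *entities[ref]) for ref in references]
```

The extra `symbol` declaration in the DTD is the cost the slide mentions: the element exists only to carry the entity name.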

  29. Choosing XML or SGML • Compare inherent features • Identify features that apply to your application • Estimate effort to support omitted features • Imprecise SWAG is usually sufficient • Will DTD or XSD define your doctype(s)? • If you can’t define it, it won’t work. • Can you use a reasonable workaround?

  30. Document Type Definition or XML Schema Definition

  31. Where do you stand? • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs / XSDs are better for text applications • DTDs / XSDs are better for data processing applications • DTDs are obsolete

  32. Where do you stand? • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs are better for text processing applications • DTDs / XSDs are better for data processing applications • DTDs are obsolete

  33. Where do you stand? • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs are better for text processing applications • XSDs are better for data processing applications • DTDs are obsolete

  34. Here is where I stand • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs are better for text processing applications • XSDs are better for data processing applications • DTDs are NOT obsolete • Dem’s fighting woids !!!

  35. DTDs Are Better • Easy to use as a working notation during document design • Easier for less technical people (like editors) to follow as a visual notation for a document’s structure • Easier for a person to read and interpret • Clear, concise, and user friendly • Able to be processed by computers as well

  36. DTDs are Better • Specifically geared to address text notation requirements • Text content • Text order and hierarchy • Text appearance (required or optional) • Text occurrence (repeatable) • Other data characteristics are not relevant to the application • Established methodology with existing support from vendor community

  37. DTDs are Better • Designed for text processing • Supports editorial activity • Easily changed as needs evolve • Addresses text content issues with exceptions • Inclusions allow elements to occur randomly within text (good when used correctly) • Exclusions eliminate recursion that could introduce processing anomalies • Reduces tag set substantially • External entities can be declared in the external DTD subset at the start of the instance

  38. XSDs Are Better • Data applications have different requirements • Document structure is simple • If data is in a database, design issues are minimal • Primarily used by programmers and other technical personnel (not editors, per se) • Characteristics of data are relevant

  39. XSDs Are Better • Simplified parser because schema is in same syntax as data • Simplified parsing because data is well-formed • Physical characteristics of data can be validated as well as structure • No exceptions to contend with

  40. XSDs are Better • Variations in data are minimal • Not intended for editorial processing • Exceptions cannot occur because content is in the instance • No inclusions • No exclusions • External entities need not be declared because instance contains specific URLs • Entity declarations are not relevant because all entity references are resolved in the instance

  41. Another Myth • XML Schema is easier to parse than a DTD • Syntactically maybe, because it uses the same XML parser as the content • Semantically, not really since the same hierarchical structure expressed in the DTD must be determined; i.e., the internal document object must be built • Additional support for the data types and corresponding validation processing also make the process more complex
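
The syntactic-versus-semantic distinction can be sketched as follows, assuming Python's standard `xml.etree.ElementTree` parser. The two-element schema is invented. `looks_like_decimal` is a hypothetical and deliberately rough stand-in for real `xs:decimal` validation, which has stricter lexical rules than Python's `Decimal`.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price" type="xs:decimal"/>
  <xs:element name="title" type="xs:string"/>
</xs:schema>"""

# Syntactic step: the ordinary XML parser reads the schema like any document.
schema = ET.fromstring(SCHEMA)
declared = {el.get("name"): el.get("type") for el in schema.iter(XS + "element")}

# Semantic step: what each declaration means is still the application's job.
def looks_like_decimal(text):
    """Rough approximation of an xs:decimal check (an assumption, not the spec)."""
    try:
        Decimal(text)
        return True
    except InvalidOperation:
        return False

price_ok = looks_like_decimal("12.50")
price_bad = looks_like_decimal("twelve")
```

Reading the declarations took two lines. Interpreting content models and data types is where the remaining effort goes, which is the point of the slide.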

  42. Ah, but it’s free!!! • And you get what you pay for • Just the parser is free, there is so much more to the application than the parser • And there is XSL in two flavors • XSLT for data transformation • XSL-FO for data formatting • Unproven technology with horrendous syntax • Expectations exceed language’s potential • Ah, but it’s free!!!

  43. Where Should We Go From Here?

  44. Clean Up SGML • Eliminate features that time has proven irrelevant • Adjust Basic Concrete Syntax to meet today’s needs • Improve support for multiple DTDs • Simplify parsing requirements wherever possible • Add <!DATATYPE …> declaration to DTD • For die-hards who think it’s necessary • Incorporate XML variants (already done) • Web SGML Annex

  45. XML Extensions to SGML • HCRO delimiter (for hex numeric character references); for XML this is &#x • EMPTYNRM feature that allows elements declared EMPTY to have end-tags • NESTC delimiter (NET-enabling start-tag close) – permits empty tag, e.g. <tag/> • Duplicate enumerated attribute tokens are allowed • Goldfarbism, specifically rejected by committee • Relaxation of rules on use of parameter entity references inside groups • The rules were wrong anyway • Multiple ATTLIST declarations for a single element type • ATTLIST declarations which don't declare any attributes • What? Must be a programming thing • KEEPRSRE feature that turns off SGML's rules for ignoring RSs and REs • The rules are inconsistent anyway and sometimes outright wrong • Fully-tagged SGML documents need not be type-valid • This makes all XML documents, including those that are well-formed but not valid, conforming SGML documents • Predefined data character entities in the SGML declaration • (for ampersand, less than, and so on) • Unlimited capacities and quantities
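
Three of these extensions are visible from the XML side in a short sketch, assuming Python's standard `xml.etree.ElementTree` parser. The `doc` and `br` element names are invented. The sketch shows a hexadecimal character reference (`&#x41;`) beside its decimal form, an empty element written as `<br/>`, and the same empty element written with an explicit end-tag.

```python
import xml.etree.ElementTree as ET

doc = "<doc>&#x41;&#65;<br/><br></br></doc>"
root = ET.fromstring(doc)

# Hex and decimal character references both resolve to "A".
text = root.text

# Both spellings of the empty element produce the same result.
empties = [(child.tag, child.text, len(child)) for child in root]
```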

  46. Extend XML • Add publishing features that time has proven relevant • Support different character sets • ASCII is still around • External entity references in content • Catalog support for Public Identifiers • Full Marked Section support • Continued support for the DTD • Add <!DATATYPE … > declaration • Adjust Basic Concrete Syntax to meet publishing’s needs

  47. Use the Right Approach • SGML is for Text Processing • Data capture and editorial processing • Complex entity management • Reusable text entities • Illustration management • XML is for delivery over the Web • Designed for the Web environment • Converts easily to HTML

  48. Avoid the Hype • SGML is an obsolete technology • Proven over 20 years in the field for practical applications • Adaptable to meet wide array of text processing needs • DTDs are bad for your programmers • Parsers already exist • Proven methodology • User friendly

  49. More Hype • XML is better than SGML • XML IS SGML • XML is a SUBSET of SGML • More limited; fewer capabilities • XSDs will replace DTDs • For data processing applications, why not? • Overkill for text applications • And more restrictive • No entity references!

  50. What Should Publishers Do? • Stop ignoring your SGML vendor • Use the four-letter acronym, it’s OK • SGML is your friend • Tell your boss it’s XML • Can they tell the difference? • It’s got angle brackets, doesn’t it? • Use your clout to get what you need • Stop following the leader (remember the lemmings) • Where do you want to go today? (MS jingle in background) • Don’t accept the programmer’s position as your own • You have different needs • Free parsers do not make better applications
