The XML/SGML Conundrum Presented by Joseph V. Gangemi Senior Consultant J.V.G. Consulting Services ©2005
Agenda • Compare aspects of XML and SGML • Explore rationale for choosing between them • Discuss effect of XML on publishing applications • Some personal thoughts on SGML • Where do we go from here? • Audience participation is encouraged
What is a Conundrum? • A question or problem having only a conjectural answer • Should I use SGML or XML? • Do I need a DTD or an XSD? • Will publishing survive the XML marketing hype? • An intricate and difficult problem • Does XML address the needs of my application?
What is SGML? • Markup Language for Text Processing • ISO Standard 8879 • Syntax rules for defining a markup language • Does not include a set of tags • Full-featured to support a wide array of applications • Perceived as complex, but in reality, text applications are complex • Not all features are used in all applications
What is XML? • Subset of SGML • Designed for the transport of text over the Internet • Extensible Markup Language meant to complement HTML • More generic; not a formatting language • Structure rather than format oriented • Less complex from a programmer’s perspective • Not necessarily the user’s!
The XML Myth • XML does not require a DTD or XSD • Yes and No • From a programming perspective, the DTD or XSD is optional (for some applications) • This is also true for SGML • From a practical design perspective, IT IS MANDATORY • No inherent benefit to XML (or SGML) without one • If you cannot define and enforce a tag set, calling it a language is a misnomer.
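The "yes and no" above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ALLOWED_TAGS`, `undeclared_tags`, and the memo sample are hypothetical names invented here, with the set standing in for what a DTD or XSD would declare. The standard-library parser checks well-formedness only, so a mistyped tag passes until something enforces a tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag set standing in for what a DTD or XSD would declare.
ALLOWED_TAGS = {"memo", "to", "body"}


def undeclared_tags(xml_text):
    """Return, sorted, the tags in xml_text that are outside ALLOWED_TAGS."""
    root = ET.fromstring(xml_text)  # checks well-formedness only
    return sorted({el.tag for el in root.iter()} - ALLOWED_TAGS)


# Well-formed, so the parser accepts it, although <bdy> is a typo for <body>.
sample = "<memo><to>Editors</to><bdy>Deadline moved.</bdy></memo>"
print(undeclared_tags(sample))
```

A real application would hand this job to a validating parser and a DTD or schema; the hand-written set only makes the point that something has to define the language.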
Choosing between XML and SGML (Putting aside the marketing appeal and looking at the technology)
XML / SGML Feature Sets • Review SGML features omitted from XML • Original Intent of feature • Reason for omission from XML • Effect of Omission on Publishing Applications • How is user affected? • How is product vendor affected?
SGML Declaration • Describes the processing environment in which the document can be processed. • Identifies character set • Specifies features required for successful processing • Purely technical information • User has minimal, if any, awareness of it • Product vendors benefit by being able to tailor product to each client’s application environment • Once environment is defined for an organization, it has minimal value (except, of course, to the vendor). • XML has a defined environment (World Wide Web) • In effect, it has a built-in SGML declaration • Processing is restricted to the defined environment
SGML DCL Features • XML defaults the following features to NO (which means they are not supported) • DATATAG • OMITTAG • RANK • LINK • CONCUR • SUBDOC • SHORTREF
Datatag • A string of characters that acts as both data and the end-tag of the currently open element. • Actually defined poorly since its purpose is to delimit repetitive elements. If it closes an element, it inherently opens the next one too. • Never implemented because short references became a better way to achieve same objective. • Considered an irrelevant technique and is not supported by vendors.
Omittag • Refers to tag minimization • Start and end tags can be omitted under certain parsing conditions • Method of reducing the character count or physical size of a file back when it meant something to do that • Vendor products make the feature moot.
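To see what dropping OMITTAG means in practice, here is a minimal Python sketch using the standard-library XML parser. The helper name `is_well_formed` and the two sample strings are made up for this illustration. The first string omits end tags the way a minimized SGML instance might (given a suitable DTD); an XML parser rejects it.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# End tags for <item> omitted, as tag minimization would allow in SGML.
minimized = "<list><item>one<item>two</list>"
# The fully tagged form that XML requires.
full = "<list><item>one</item><item>two</item></list>"

print(is_well_formed(minimized), is_well_formed(full))
```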
Rank • Rank IS rank • Irrelevant concept, poorly conceived, badly defined, and eschewed by us elite purists in the industry • Attempt to take linear tagging for typesetting and treat it hierarchically • Even Charles wants to see this gone! • No relevance to vendor or user
Link • Method whereby a process can be associated with an element through an attribute • A mechanism for inserting process-specific information into a document • Attempt to connect some esoteric concepts that rambled around in Charles’s mind with some real world processing • Difficult to define, impossible to understand, and not considered relevant to most forms of text processing • Users ignore it; vendors shun it.
Concur • Good concept, but never fully grasped by the user community • Concurrent document structures coexisting in the same instance. • Primarily meant to express the document structure and the formatting structure associated with the document concurrently • Never implemented • Replaced (kind-of) by namespaces
Subdoc • Ability to include subordinate documents into a master document • Defined primarily to support things like anthologies • Subdocument may have its own DTD, but must conform to a single SGML declaration • Limited value to the user • Overkill for the vendor • Workarounds are simple and more practical
Short References • Character strings that represent an entity reference • Originally intended to reduce keystrokes • Allows characters in content to act as mark-up • A Shortref declaration defines a set of string-to-entity mappings • Named set • Different mappings for same string in different sets • Activated by USEMAP declaration that associates map set to element • In effect for duration of element
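The mechanics described above can be mimicked with a toy model in Python. This is not an SGML parser and does not follow SGML's recognition rules; it only imitates the idea of named map sets, a per-element association, and string-to-entity substitution. Every name here (`ENTITIES`, `SHORTREF_MAPS`, `USEMAP`, `expand`, and the element and entity names) is hypothetical.

```python
# Toy model only: a real SGML parser applies short references while parsing.

# Entity name -> replacement text.
ENTITIES = {"cellsep": "</cell><cell>", "altsep": " or "}

# Named map sets: each maps a literal string to an entity name.
# The same string ("|") maps to different entities in different sets.
SHORTREF_MAPS = {
    "rowmap": {"|": "cellsep"},
    "choicemap": {"|": "altsep"},
}

# Stand-in for USEMAP declarations: element name -> active map set.
USEMAP = {"row": "rowmap", "choice": "choicemap"}


def expand(element, content):
    """Expand short references in content using the map active for element."""
    mapping = SHORTREF_MAPS.get(USEMAP.get(element), {})
    for string, entity in mapping.items():
        content = content.replace(string, ENTITIES[entity])
    return content


print(expand("row", "<cell>a|b|c</cell>"))
print(expand("choice", "tea|coffee"))
print(expand("para", "a|b"))  # no map in effect for this element
```

The point of the sketch is the keystroke saving: the author types `a|b|c` and the map in effect for the element decides what the bar means.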
Feature Recap • DATATAG – not supported in SGML or XML • No effect on user or vendor • OMITTAG – not supported in XML • Vendor tools have reduced its value • RANK – not supported in SGML or XML • LINK – not supported by vendors or XML • CONCUR – not supported by vendors or XML • SUBDOC – not supported in XML • minimal SGML support • SHORTREF – not supported in XML • No negative impact on the publishing process
Shared Features • SGML and XML share many features • Not always equally • Public and System Identifiers • Notations (with restrictions) • Parameter entities (with restrictions) • Marked sections (with restrictions) • Character and Entity References
Public Identifiers • Used to identify something associated with but separate from the document instance • A consistent way to refer to another entity regardless of what it is or where it is • Must be unique within its processing universe • Global: must be registered with central authority • Local: unregistered, but managed within the processing scope • Resolved through a standard catalog entry • XML copped out and requires system identifiers
System Identifiers • A system identifier points to a specific object • XML uses a URI to specify the entity (usually a file) • Since XML is designed for the Web, the URI is invariably a URL • Entity considered static and not likely to change • In SGML, a System Identifier usually points to a physical file • Seldom used in a production application • Entities considered dynamic and subject to change • Unregistered, local public identifiers are preferred
Public vs System Identifiers • XML is designed for the Web • Represents text in a specific instance for a specific purpose • URLs are preferred method of accessing external entities • System Identifiers support URLs easily • SGML is non-denominational • Designed for text in any environment • Supports information management from data capture, through editorial processing, to finished product • Public Identifiers are more versatile and better suited to changing entities • XML syntax supports Public Identifiers • Not all XML-compliant applications do. • XML products derived from SGML products usually do.
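The difference between the two identifier types can be sketched with Python's standard-library `xml.dom.minidom`, which exposes both identifiers from the DOCTYPE declaration. The catalog below is a plain dictionary invented for this illustration (it is not a real catalog file format), and the public identifier, path, and URL are hypothetical.

```python
from xml.dom import minidom

# Hypothetical local catalog: public identifier -> where this site keeps the entity.
CATALOG = {"-//Example Press//DTD Book//EN": "/dtds/book.dtd"}


def resolve_dtd(xml_text, catalog=CATALOG):
    """Prefer a catalog entry for the public identifier, else the system identifier."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None:
        return None  # no DOCTYPE declaration at all
    return catalog.get(doctype.publicId, doctype.systemId)


with_public = ('<!DOCTYPE book PUBLIC "-//Example Press//DTD Book//EN" '
               '"http://example.com/book.dtd"><book/>')
system_only = '<!DOCTYPE book SYSTEM "http://example.com/book.dtd"><book/>'

print(resolve_dtd(with_public))
print(resolve_dtd(system_only))
```

The design point is the indirection: if the DTD moves, only the catalog entry changes, while every instance that relies on the system identifier alone carries a fixed location.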
Notation and Parameter Entities • Notation is similar in XML and SGML • Vendor support is usually proprietary • More conceptual than practical • No industry-wide implementation model across platforms • Parameter entities are supported in XML, but XML restricts their use to DTDs; i.e., not permitted in marked sections (or XSDs)
Marked Sections • Marked Sections are restricted to CDATA in XML • Recognizes string <![CDATA[ before the content and ]]> after the content • CDATA is not parsed by the parser; i.e., embedded tags are ignored • SGML allows any type of data in a marked section, even parsable data • You can control if the marked section is included (parsed or at least passed to the application) or ignored by the parser
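A short Python sketch of the CDATA behaviour, using the standard-library parser; the element name and content are made up for the example. Markup characters inside the CDATA section reach the application as plain character data and no child elements are created.

```python
import xml.etree.ElementTree as ET

# The <b> tags and the bare ampersand sit inside a CDATA marked section.
snippet = "<example><![CDATA[<b>bold</b> & more]]></example>"
root = ET.fromstring(snippet)

print(root.text)  # the content arrives as character data
print(len(root))  # number of child elements created from it
```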
Effect on Publishing • XML restrictions limit value of Marked Sections in a publishing application • Need include / ignore option • Require parameter entity support to implement include / ignore • Workarounds are tedious, especially if Marked Sections are not on element boundaries
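One common style of workaround is to mark conditional content with an attribute and filter it at processing time. The Python sketch below is an assumption-laden illustration, not a method taken from the presentation: the `condition` attribute, the function name `apply_conditions`, and the sample paragraph are all hypothetical. It also shows the limitation noted above, since only whole elements can be switched on or off.

```python
import xml.etree.ElementTree as ET


def apply_conditions(root, active):
    """Remove every element whose 'condition' attribute is not in active."""
    for parent in list(root.iter()):
        for child in list(parent):
            cond = child.get("condition")
            if cond is None or cond in active:
                continue
            # Keep the text that follows the dropped element.
            tail = child.tail or ""
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + tail
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    return root


SOURCE = ('<para>Install the unit.<note condition="pro"> Pro models need a '
          'licence key.</note> Restart when done.</para>')

basic = apply_conditions(ET.fromstring(SOURCE), {"basic"})
pro = apply_conditions(ET.fromstring(SOURCE), {"pro"})
print("".join(basic.itertext()))
print("".join(pro.itertext()))
```

Content that starts in one element and ends in another cannot be wrapped this way, which is where the tedium comes in.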
More XML Variants • The PIC (processing instruction close) delimiter is ?> • Quantities and capacities are effectively unlimited • Names are case sensitive • (not necessarily a good thing) • Underscore and colon are allowed in names • Names can use Unicode characters and are not restricted to ASCII • Unicode is not widely supported in publishing systems • SGML can and does accept these variants by modifying the SGML declaration
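The naming rules can be checked quickly against Python's standard-library parser. This is a small sketch; the helper `is_well_formed` and the sample element names are invented for the illustration. The colon case is left out because this particular parser treats a colon as a namespace prefix separator.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


mixed_case = "<Para>text</para>"          # start and end tags differ in case
underscore = "<_note_1>text</_note_1>"    # underscore in a name
non_ascii = "<résumé>text</résumé>"       # non-ASCII characters in a name

print(is_well_formed(mixed_case))
print(is_well_formed(underscore))
print(is_well_formed(non_ascii))
```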
Built-in XML Entity References • Predefined entities in XML • &amp; for ampersand (&) • &lt; for less than (<) • &gt; for greater than (>) • &apos; for apostrophe (') • &quot; for quotation mark (") • Not predefined in SGML • Must be declared if used • Programmer’s convenience if DTD not used.
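A brief Python check of the predefined entities, using the standard-library parser with no DTD present. The helper name `parses` is made up here, and `&copy;` is simply an example of an entity that is not predefined and is not declared in the sample.

```python
import xml.etree.ElementTree as ET

# All five predefined references resolve without any declarations.
predefined = ET.fromstring("<t>&amp; &lt; &gt; &apos; &quot;</t>").text
print(predefined)


def parses(xml_text):
    """Return True if the XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Any other entity must be declared before it can be referenced.
print(parses("<t>&copy;</t>"))
```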
External Entity References • References to external data entities in content are not supported in XML • Significant restriction to data organization facilities built into SGML • Often used to represent embedded symbols in running text • Unicode replaces this approach, but not always supported • External entities must be managed within the application’s environment • Simple workaround • Use empty element with an attribute whose value is declared as an entity • Affects DTD or XSD because element must be declared
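The workaround can be sketched as follows in Python with the standard-library `xml.dom.minidom`, which records entity declarations from the internal DTD subset. The element name `symbol`, the attribute `ref`, the entity `logo`, and the file name are all hypothetical; the sketch assumes the application, not the parser, is responsible for fetching the external data.

```python
from xml.dom import minidom

DOC = """<!DOCTYPE para [
  <!ELEMENT para (#PCDATA | symbol)*>
  <!ELEMENT symbol EMPTY>
  <!ATTLIST symbol ref ENTITY #REQUIRED>
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
]>
<para>Press <symbol ref="logo"/> to continue.</para>"""


def resolve_symbols(xml_text):
    """Return (entity name, system identifier, notation) for each <symbol>."""
    doc = minidom.parseString(xml_text)
    entities = doc.doctype.entities
    resolved = []
    for sym in doc.getElementsByTagName("symbol"):
        # The attribute value names an entity declared in the DTD.
        entity = entities.getNamedItem(sym.getAttribute("ref"))
        resolved.append((entity.nodeName, entity.systemId, entity.notationName))
    return resolved


print(resolve_symbols(DOC))
```

Note the cost mentioned above: the `symbol` element and its attribute had to be added to the document type just to carry the reference.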
Choosing XML or SGML • Compare inherent features • Identify features that apply to your application • Estimate effort to support omitted features • Imprecise SWAG is usually sufficient • Will DTD or XSD define your doctype(s)? • If you can’t define it, it won’t work. • Can you use a reasonable workaround?
Where do you stand? • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs / XSDs are better for text applications • DTDs / XSDs are better for data processing applications • DTDs are obsolete
Where do you stand? • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs are better for text processing applications • DTDs / XSDs are better for data processing applications • DTDs are obsolete
Where do you stand? • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs are better for text processing applications • XSDs are better for data processing applications • DTDs are obsolete
Here is where I stand • DTDs are better than XSDs in general • XSDs are better than DTDs in general • DTDs are better for text processing applications • XSDs are better for data processing applications • DTDs are NOT obsolete Dem’s fighting woids !!!
DTDs Are Better • Easy to use as a working notation during document design • Easier for less technical people (like editors) to follow as a visual notation for a document’s structure • Easier for a person to read and interpret • Clear, concise, and user friendly • Able to be processed by computers as well
DTDs are Better • Specifically geared to address text notation requirements • Text content • Text order and hierarchy • Text appearance (required or optional) • Text occurrence (repeatable) • Other data characteristics are not relevant to the application • Established methodology with existing support from vendor community
DTDs are Better • Designed for text processing • Supports editorial activity • Easily changed as needs evolve • Addresses text content issues with exceptions • Inclusions allow elements to occur randomly within text (good when used correctly) • Exclusions eliminate recursion that could introduce processing anomalies • reduces tag set substantially • External entities can be declared in the external DTD subset at the start of the instance
XSDs Are Better • Data applications have different requirements • Document structure is simple • If data is in a database, design issues are minimal • Primarily used by programmers and other technical personnel (not editors, per se) • Characteristics of data are relevant
XSDs Are Better • Simplified parser because schema is in same syntax as data • Simplified parsing because data is well-formed • Physical characteristics of data can be validated as well as structure • No exceptions to contend with
XSDs are Better • Variations in data are minimal • Not intended for editorial processing • Exceptions cannot occur because content is in the instance • No inclusions • No exclusions • External entities need not be declared because instance contains specific URLs • Entity declarations are not relevant because all entity references are resolved in the instance
Another Myth • XML Schema is easier to parse than a DTD • Syntactically maybe, because it uses the same XML parser as the content • Semantically, not really since the same hierarchical structure expressed in the DTD must be determined; i.e., the internal document object must be built • Additional support for the data types and corresponding validation processing also make the process more complex
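The two halves of this argument can be sketched in Python. Reading the schema really is one call to the ordinary XML parser, but turning it into something that validates data is separate work. Everything below is a toy under stated assumptions: it is not an XSD validator, it understands a single hypothetical declaration, it compares the `type` attribute as a plain string, and the datatype check is a simplification written for this example.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="quantity" type="xs:positiveInteger"/>
</xs:schema>"""


def is_positive_integer(text):
    # Simplified lexical check; a real validator handles more cases.
    return re.fullmatch(r"\+?[0-9]+", text) is not None and int(text) > 0


# Hand-written datatype support: the "additional processing" the slide mentions.
CHECKS = {"xs:positiveInteger": is_positive_integer}

# Step 1 (syntax): the schema is read with the same parser as the content.
schema_root = ET.fromstring(SCHEMA)

# Step 2 (semantics): its meaning still has to be built into a model.
declared = {el.get("name"): el.get("type")
            for el in schema_root.findall(XS + "element")}


def invalid_elements(instance_text):
    """Return the tags whose text fails the declared type check."""
    root = ET.fromstring(instance_text)
    failures = []
    for el in root.iter():
        check = CHECKS.get(declared.get(el.tag))
        if check is not None and not check((el.text or "").strip()):
            failures.append(el.tag)
    return failures


print(invalid_elements("<order><quantity>12</quantity></order>"))
print(invalid_elements("<order><quantity>twelve</quantity></order>"))
```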
Ah, but it’s free!!! • And you get what you pay for • Just the parser is free; there is so much more to the application than the parser • And there is XSL in two flavors • XSLT for data transformation • XSL-FO for data formatting • Unproven technology with horrendous syntax • Expectations exceed language’s potential • Ah, but it’s free!!!
Clean Up SGML • Eliminate features that time has proven irrelevant • Adjust Basic Concrete Syntax to meet today’s needs • Improve support for multiple DTDs • Simplify parsing requirements wherever possible • Add <!DATATYPE …> declaration to DTD • For die-hards who think it’s necessary • Incorporate XML variants (already done) • Web SGML Annex
XML Extensions to SGML • HCRO delimiter (for hex numeric character references); for XML this is &#x • EMPTYNRM feature that allows elements declared EMPTY to have end-tags • NESTC delimiter (NET-enabling start-tag close) – permits empty tag, e.g. <tag/> • Duplicate enumerated attribute tokens are allowed • Goldfarbism, specifically rejected by committee • Relaxation of rules on use of parameter entity references inside groups • The rules were wrong anyway • Multiple ATTLIST declarations for a single element type • ATTLIST declarations which don't declare any attributes • What? Must be a programming thing • KEEPRSRE feature that turns off SGML's rules for ignoring RSs and REs • The rules are inconsistent anyway and sometimes outright wrong • Fully-tagged SGML documents need not be type-valid • This makes all XML documents, including those that are well-formed but not valid, conforming SGML documents • Predefined data character entities in the SGML declaration • (for ampersand, less than, and so on) • Unlimited capacities and quantities
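Three of these items are visible in everyday XML and can be shown with a short Python sketch using the standard-library parser. The sample string is made up: `&#x41;` is a hex character reference (the HCRO case), `<br/>` uses the empty-element form (NESTC), and `<hr></hr>` is an empty element written with an explicit end-tag (the situation EMPTYNRM permits, although no DTD declares it EMPTY here).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>&#x41;<br/><hr></hr></p>")

print(root.text)                       # text produced by the hex reference
print([child.tag for child in root])   # both spellings yield empty elements
print([child.text for child in root])  # neither has any content
```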
Extend XML • Add publishing features that time has proven relevant • Support different character sets • ASCII is still around • External entity references in content • Catalog support for Public Identifiers • Full Marked Section support • Continued support for the DTD • Add <!DATATYPE … > declaration • Adjust Basic Concrete Syntax to meet publishing’s needs
Use the Right Approach • SGML is for Text Processing • Data capture and editorial processing • Complex entity management • Reusable text entities • Illustration management • XML is for delivery over the Web • Designed for the Web environment • Converts easily to HTML
Avoid the Hype • SGML is an obsolete technology • Proven over 20 years in the field for practical applications • Adaptable to meet wide array of text processing needs • DTDs are bad for your programmers • Parsers already exist • Proven methodology • User friendly
More Hype • XML is better than SGML • XML IS SGML • XML is a SUBSET of SGML • More limited; fewer capabilities • XSDs will replace DTDs • For data processing applications, why not? • Overkill for text applications • And more restrictive • No entity references!
What Should Publishers Do? • Stop ignoring your SGML vendor • Use the four-letter acronym, it’s OK • SGML is your friend • Tell your boss it’s XML • Can they tell the difference? • It’s got angle brackets, doesn’t it? • Use your clout to get what you need • Stop following the leader (remember the lemmings) • Where do you want to go today? (MS jingle in background) • Don’t accept the programmer’s position as your own • You have different needs • Free parsers do not make better applications