Use of XML in the Publications Office: Critical issues for publishing
Dr. Holger Bagola
Publications Office, DIR/R 5 “IT Projects”, section “Formats & Linguistic Informatics”
October 2006
• History
• From SGML to XML
• Structure of publications in Formex
• Streamlining of models
• Current status of Formex
• Particular needs for publishing
• Conclusion
History
• In the 1970s, more and more publication procedures were supported by computer applications.
• There was no common standard for applications in the context of publishing.
Publishing houses were confronted with a large variety of formats.
History
• A considerable number of documents published in the Official Journal can be wholly or partially reused in the publication of other documents.
As the electronic formats of published documents were not standardized, it was impossible to establish convenient procedures for such reuse.
History
• The first information on SGML as a future standard for the exchange of documents was published in the early 1980s.
• Main advantages of the approach:
  • Independence from any application or operating platform
  • Description of the logical document structure instead of its presentation
History
In 1982 the Publications Office decided to define a format for the exchange of published documents: Formex (Format for the exchange of electronic publications).
History
• Publication of the Formex specifications in 1984/1985
• Formex part of the framework contract for OJ publications in 1985
• 1986: Adoption of the SGML standard by ISO (ISO 8879)
History
BUT ...
There was no real support for the format on the market (parsers, editors, etc.).
The approach seemed rather exotic to printing houses, which were used to working with the presentation of documents.
The quality of the delivered SGML documents was rather poor.
History
• Revision and partial redesign of Formex
• Addition of a basic table model
Formex 2 was easier for the framework contractors to understand.
Better quality, but still insufficient for publication: it was impossible to derive the document presentation from the rough description of the document structure.
History
• Total redesign of the Formex specifications
• Implementation of a more flexible table model
• Integration of metadata into the SGML document structure
• Finer granularity and distinct elements for the description of the document structure (possibility of deriving the presentation from the structure)
History
A rather complex specification, which required intensive validation of the deliveries.
History
• 1998: XML was adopted by the W3C as a new but compatible standard.
• XML was immediately accompanied by additional standards supporting the navigation and transformation of documents.
• A new standard for the specification of XML grammars was adopted in 2001: XML Schema.
History
• In 2001 the Publications Office organized a Formex user meeting to discuss the future development of the approach.
The main results of this meeting were:
• Migration to XML, for which various tools were on the market (partly as open source)
• Replacement of the DTD methodology for specifying XML grammars by XML Schema
From SGML to XML
• Revision of the approach in order to define a grammar which meets the needs of printing houses without abandoning the description of the logical document structure
• Definition of a table model based on the HTML model (keeping logical relations and functions in attributes)
From SGML to XML
• Abandonment of parallel models: the distinction is made by context analysis
• Replacement of the character encoding based on ISO 2022 by Unicode (UTF-8, the default for XML instances)
• All documents contain a reference to the Formex schema on the web: http://formex.publications.europa.eu
From SGML to XML
• Distinction of up to four levels of a publication
• Definition of rules for automatic validation of Formex instances beyond parsing
• Development of a tool for comparing the contents of Formex instances with the corresponding PDF instances
• Automatic extraction of metadata for updating EUR-Lex
From SGML to XML
• The XML-based Formex 4 specifications entered into force on 1 May 2004.
• The current release is 3.00.
Structure of publications in Formex
• Formex instances concern OJ publications only (L and C series).
• Other publications are possible, but have not yet been realized.
Structure of publications in Formex
• Description of the publication structure:
  • Description of the structure and composition of the publication stricto sensu
  • Description of the structure and composition of a document
  • Contents of the document and sub-documents
  • Non-XML parts or fragments of documents
Structure of publications in Formex
[Diagram: hierarchy of a publication]
Publication: description of logical structure and composition; references to documents
  Document: references to main and sub-documents
    Main document (with non-XML instance)
    Sub-document (with non-XML instance)
    Sub-document (with non-XML instance)
  Document: references to main and sub-documents
    Main document
Structure of publications in Formex
• In order to keep a minimum of metadata together with the contents of a document, some of the corresponding items are present on several levels.
• All sub-levels contain references to the next higher hierarchical level (except for non-XML instances).
Streamlining of models
• Whenever a Formex 3 element could appear in various contexts, distinct elements were created. Thus there were parallel models such as TI.DOC, TI.ANNEX, TI.GRSEQ etc.
These elements have been merged into one; the context now expresses the distinct functions.
Streamlining of models
Old                New             Selection by context
ACT/TI.DOC         ACT/TITLE       TITLE[parent::ACT]
ANNEX/TI.ANNEX     ANNEX/TITLE     TITLE[parent::ANNEX]
GR.SEQ/TI.GRSEQ    GR.SEQ/TITLE    TITLE[parent::GR.SEQ]
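The context-based distinction can be illustrated with a short Python sketch. It is only an illustration: the sample instance and its nesting are invented for the example and are not taken from the Formex schema; only the element names ACT, ANNEX, GR.SEQ and TITLE come from the slide. Python's standard `xml.etree.ElementTree` does not support the `parent::` axis, so the sketch walks each parent and inspects its direct children instead.

```python
import xml.etree.ElementTree as ET

# Invented sample; the nesting is an assumption made for illustration only.
SAMPLE = """<ACT>
  <TITLE>Title of the act</TITLE>
  <GR.SEQ>
    <TITLE>Title of a group of sequences</TITLE>
  </GR.SEQ>
  <ANNEX>
    <TITLE>Title of the annex</TITLE>
  </ANNEX>
</ACT>"""


def titles_by_context(root):
    """Map each parent element name to the texts of its direct TITLE children."""
    found = {}
    for parent in root.iter():
        for child in parent:
            if child.tag == "TITLE":
                found.setdefault(parent.tag, []).append(child.text)
    return found


titles = titles_by_context(ET.fromstring(SAMPLE))
for context, texts in titles.items():
    print(context, texts)
```

A single TITLE element is enough here because the parent name carries the information that TI.DOC, TI.ANNEX and TI.GRSEQ used to encode in the element name itself.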
Streamlining of models
Old table model
• The table model in Formex 1-3 was a logical one, distinguishing between the column and line headings and the body.
• The body could easily be identified and copied to another linguistic version.
Streamlining of models
Old table model
• Empty cells were not present in old instances.
• Attributes expressed the relation between cells and columns.
Streamlining of models
New table model
• Top-down model for headings and body.
• Attributes express the distinct function of a specific cell.
• Empty cells are present and carry a special attribute which explicitly confirms the absence of any contents.
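A check for the empty-cell rule might look like the following sketch. The element names TABLE, ROW and CELL and the attribute EMPTY="YES" are hypothetical placeholders chosen for this example; the slide does not give the actual names, so they should not be read as the Formex vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: TABLE, ROW, CELL and EMPTY="YES" are illustrative names only.
SAMPLE = """<TABLE>
  <ROW><CELL>Product</CELL><CELL>Quantity</CELL></ROW>
  <ROW><CELL>Wheat</CELL><CELL EMPTY="YES"/></ROW>
  <ROW><CELL>Barley</CELL><CELL/></ROW>
</TABLE>"""


def unmarked_empty_cells(table):
    """Return (row, column) positions of cells that are empty but not flagged as such."""
    problems = []
    for r, row in enumerate(table.findall("ROW"), start=1):
        for c, cell in enumerate(row.findall("CELL"), start=1):
            has_content = bool((cell.text or "").strip()) or len(cell) > 0
            if not has_content and cell.get("EMPTY") != "YES":
                problems.append((r, c))
    return problems


problems = unmarked_empty_cells(ET.fromstring(SAMPLE))
print(problems)
```

Requiring an explicit marker lets a validator tell a deliberately empty cell apart from one whose content was lost during production.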
Current status of Formex
• Formex 4 is entirely based on W3C XML Schema.
• It has been in use since May 2004.
• Minor changes have been integrated (release 3.0).
• All OJ (L and C) documents are covered.
• Further document types (not published in the OJ) will be taken into account.
Current status of Formex
• Specification, documentation of all elements, physical specification and examples (> 600) are publicly available on the website: http://formex.publications.europa.eu
Current status of Formex
• Availability of Formex via the LegisWrite interface
• XML instances are not (yet?) publicly accessible
• Different quality levels according to validation
Current status of Formex
[Diagram: production workflow. Labels: Printing house, CERES, EUDOR; Quality 1, Quality 2, Quality 3; automatic validation, manual validation, conversion to LW; LegisWrite interface; two clients]
Particular needs for publishing
• Publishing mostly concerns the presentation of documents in a readable form.
• A “good” logical XML model allows the presentation of a given document to be derived.
• Printing houses are obliged to work with Formex instances throughout the production processes.
Particular needs for publishing
• Some parts of a document (words, parts of a sentence) require a specific presentation which cannot always be derived from the logical structure.
• Specific elements for text highlighting and presentation had to be created.
Example: foreign words are set in italics in some language versions.
Particular needs for publishing
• Quotation marks differ from one language version to the other.
• Exceptions for the use on nested levels require the presence of the specific symbols.
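The language dependence of quotation marks can be sketched in Python as follows. The table of marks is an illustrative assumption (common typographic conventions for three languages, with spacing rules ignored) and is not taken from the Formex specifications; the function name `quote` is likewise invented for the example.

```python
# Illustrative conventions only (assumed, not taken from the Formex specifications).
# Each language lists its (opening, closing) marks by nesting level.
QUOTES = {
    "EN": [("‘", "’"), ("“", "”")],
    "FR": [("«", "»"), ("“", "”")],
    "DE": [("„", "“"), ("‚", "‘")],
}


def quote(text, lang, level=0):
    """Wrap text in the quotation marks assumed for the language and nesting level."""
    pairs = QUOTES[lang]
    opening, closing = pairs[level % len(pairs)]
    return f"{opening}{text}{closing}"


print(quote("Beispiel", "DE"))
print(quote("example", "EN", level=1))
```

A regular scheme like this covers only the default case. As the slide notes, exceptions on nested levels cannot be derived by rule, which is why the specific symbols have to be present in the instance.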
Particular needs for publishing
• For special cases the printing houses are allowed to use temporary additional markup (processing instructions, elements from other namespaces).
• In most cases this kind of information depends on the publishing system.
Particular needs for publishing
• All this additional information has to be deleted before the electronic version of the publication is sent.
• For the design of new elements, the relation to presentation has to be analyzed.
• In most cases this relation has to be assured in order to guarantee the correct identification of the new element.
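The clean-up of temporary markup before delivery could be sketched as follows. This is a minimal illustration under stated assumptions: the delivered vocabulary is assumed to use no namespace, so any namespaced element counts as temporary; the namespace URI, the `tmp:hint` element and the `layout` processing instruction are invented for the example. Python's default `xml.etree.ElementTree` parser already discards processing instructions, so only the foreign elements need explicit removal.

```python
import xml.etree.ElementTree as ET

# Invented sample: the tmp namespace, tmp:hint and the layout PI are hypothetical.
SAMPLE = """<ACT xmlns:tmp="urn:example:printing-house">
  <?layout break-page?>
  <TITLE>Title of<tmp:hint type="break"/> the act</TITLE>
</ACT>"""


def strip_foreign(elem):
    """Remove namespaced child elements recursively, keeping the text that follows them."""
    previous = None
    for child in list(elem):
        if child.tag.startswith("{"):
            # Re-attach the tail text so that no document content is lost.
            tail = child.tail or ""
            if previous is None:
                elem.text = (elem.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            elem.remove(child)
        else:
            strip_foreign(child)
            previous = child


root = ET.fromstring(SAMPLE)  # the default parser drops processing instructions
strip_foreign(root)
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

The sketch deletes foreign elements together with their content. Whether a temporary element should instead be unwrapped, keeping the text inside it, depends on how the publishing system uses it.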
Particular needs for publishing
• Conversion into other electronic formats requires similar measures.
• Regular derivations are:
  • Presentation in the Official Journal
  • Presentation in LegisWrite
  • Presentation in HTML
Particular needs for publishing
[Diagram: derivations from a Formex (XML) instance]
Formex (XML) instance
  → Format “Official Journal” (PDF)
  → Format “LegisWrite” (RTF)
  → Format “EUR-Lex” (HTML)
Conclusion
• From the beginning, Formex has been a common exchange format that is independent of any application or platform.
• Clear character encoding in all versions
Conclusion
• Availability of tools on the market for XML-based instances:
  • RXP for validating parsing against DTDs
  • XSV for validating parsing against XML Schema
  • XMLSpy for development (+ Saxon)
  • XMetaL for content editing
  • RenderX for generation of PDF
Conclusion
• Stylesheets (based on XSL-FO) for presentation
• Future enhancements:
  • Better integration of other source formats (RTF/LegisWrite)
  • Addition of other document types not necessarily related to the Official Journal