IN350 Lecture 2: Document Properties and Markup Languages August 29, 2002 Judith A. Molka-Danielsen Reference: Ch.6 Baeza-Yates, Text & Multimedia Languages & Properties. Overview. Review Properties of Documents Introduce the concept of Markup Languages. Describe the role of XML.
Overview • Review Properties of Documents • Introduce the concept of Markup Languages. • Describe the role of XML.
Classes of document processing • Text Processing: Initially computers were used to do tedious repetitive calculations (billing transactions) on information. • Often the calculations required preprocessing or typesetting of text. • Other issues include information storage (and compression algorithms to optimally store) and storage methods (indexing) and approaches to information retrieval. • Finally there was the preparation and processing of text for presentation purposes.
Classes of document processing • Document Processing: In the 1980s, technologies such as the PC, Ethernet, laser printers, graphical user interfaces with bitmap displays, and object-based text processing allowed individuals to process documents. A text processing system called Scribe (by Brian Reid at CMU) represented a new kind of processing. • In text processors like IBM's Script, the user marked up text in terms of formatting characteristics, such as "12 point bold courier". • Scribe instead formatted in terms of structural characteristics, such as "heading". This was the transition to document processing.
Classes of document processing • Hypertext Processing: In the 1990s we saw the development of internetworks and ubiquitous interfaces (windows). • Tim Berners-Lee at CERN, the European particle physics laboratory, created HTML and the URL (Uniform Resource Locator) naming scheme. HTML gave a simple standardized form of markup, structural in the way Scribe was and defined as an application of SGML, that could be used to describe documents. The URL naming scheme allowed for the universal identification of documents. • Documents could then be viewed in graphical format, and large collections could be linked across multiple internets. This is hypertext processing.
Properties of Documents • Syntax - can express structure, presentation style, semantics, and external actions. It can be implicit in the contents of a document or expressed in a language. • Structure - tells how the elements relate to each other within the document. A structural element like a section can have a formatting style associated with it. • Presentation Style - is how the document is displayed or printed. It can be embedded in the document, as in TeX, or applied through macros, as in LaTeX. It can also be defined separately, as CSS is for HTML documents. Presentation style can be determined by the author (in applications or languages) or by the reader (Web browser). • Semantics - the meaning within a language, which can be associated with use.
Characteristics continued... • Metadata - information about the organization of the data, that is, data about the data, such as author, publication date, subject codes, etc.
Structured Label Information in Documents • There is a difference between Data and Documents. • Documents are formatted. • WYSIWYG word processors have problems: • They make documents that are for one output medium (printer, online). • Proprietary codes are used for both style and format. • It is hard to convert old document collections (for example, to merge LaTeX and Word). • Formats like "headline" only mean a BIG font size, but have no structural meaning within the document. • People use too many options within a document (30 fonts on a page).
Text and formats • File formats - • Word processing formats that are binary formats include Word and WordPerfect. • text - ASCII (American Standard Code for Information Interchange), standardized as ANSI X3.4. Alternatively there is 16-bit Unicode (ISO 10646). • raster graphics - • TIFF - Tagged Image File Format • GIF - Graphics Interchange Format • JPEG - Joint Photographic Experts Group • An example of a vector graphics standard is CGM (Computer Graphics Metafile). • printing - PostScript, PDF, EPS, PCL, LCDS, XML Printing Formats, ISO/IEC 10180 Standard Page Description Language, ISO 8613 Open Document Architecture (ODA)
Text and formats • File formats continued - • multimedia • MPEG (Moving Picture Experts Group) • AVI (Audio Video Interleave) • email • email header - RFC 822 • SMTP - Simple Mail Transfer Protocol, RFC 821 • POP - Post Office Protocol • IMAP - Internet Message Access Protocol (more advanced than POP) • MIME - Multipurpose Internet Mail Extensions (attachments)
Text and formats • File formats continued - • For document interchange between applications there is RTF (Rich Text Format). • Compression and archive formats include ARJ and ZIP. The related uuencode/uudecode tools do not compress; they encode binary files as text so they can be sent over text-only channels. • Streaming video formats include: QuickTime (MOV/QT), DivX (MPEG-4), Real Audio/Video (RAM/RM), and Windows Media (WMV).
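The difference between ASCII and Unicode text can be seen with a short Python sketch. The sample words are invented for illustration; the point is only that ASCII covers a limited character set at one byte per character, while a 16-bit Unicode encoding uses two bytes for each of these characters and can also represent non-English letters.

```python
# Sample strings are hypothetical; any text would do.
plain = "donut"
nordic = "blåbær"  # contains characters outside the ASCII range

ascii_bytes = plain.encode("ascii")        # one byte per character
utf16_bytes = plain.encode("utf-16-be")    # two bytes per character here

try:
    nordic.encode("ascii")
    nordic_fits_ascii = True
except UnicodeEncodeError:
    # ASCII has no code for å or æ, so the encoding fails.
    nordic_fits_ascii = False

nordic_utf16 = nordic.encode("utf-16-be")

print(len(ascii_bytes), len(utf16_bytes), nordic_fits_ascii, len(nordic_utf16))
```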
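As a rough illustration of how the email formats fit together, the sketch below uses Python's standard `email` package to read header fields and a MIME attachment from a message. The message text, the addresses, the boundary string, and the file name are all made up for this example.

```python
from email import message_from_string

# Hypothetical message: header fields first, then a MIME multipart body.
RAW_MESSAGE = """\
From: jim@example.org
To: joe@example.org
Subject: Donuts again
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Donuts are here.
--XYZ
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="menu.bin"
Content-Transfer-Encoding: base64

AAEC
--XYZ--
"""

msg = message_from_string(RAW_MESSAGE)

subject = msg["Subject"]                   # header lookup by field name
parts = msg.get_payload()                  # list of MIME parts
part_types = [part.get_content_type() for part in parts]
attachment_name = parts[1].get_filename()
attachment_bytes = parts[1].get_payload(decode=True)  # undo the base64 encoding

print(subject, part_types, attachment_name, attachment_bytes)
```

The header block carries the addressing information, and MIME labels each body part with a content type so that binary attachments can travel inside a text message.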
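The contrast between compressing and text-encoding a file can be sketched with Python's standard `zlib` and `binascii` modules. Here `zlib` stands in for the compression used inside formats like ZIP, and the sample data is invented; this is an illustration of the idea, not of any particular tool.

```python
import binascii
import zlib

# Hypothetical, highly repetitive sample data (600 bytes).
data = b"donut " * 100

# Compression removes redundancy, so the result is smaller.
compressed = zlib.compress(data)

# uuencoding turns every 3 bytes into 4 printable characters,
# so the result is larger. Each encoded line holds at most 45 bytes.
uu_lines = [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]
uu_encoded = b"".join(uu_lines)

# Both transformations are reversible.
restored_from_zlib = zlib.decompress(compressed)
restored_from_uu = b"".join(binascii.a2b_uu(line) for line in uu_lines)

print(len(data), len(compressed), len(uu_encoded))
```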
What is Markup? • Markup is everything in a document that is not content. Typesetters used procedural markup to lay out instructions for how a document should look (16 pt bold Helvetica). • Word processing software like Microsoft Word uses procedural markup, with a specific set of markup codes. The codes apply to a single physical way of presenting information, such as a printed page. They do not define the appearance on other media like CD-ROM or the Internet. • Descriptive markup, or generic markup, describes the structure of the document rather than its appearance. Content is separate from style. You can publish on all media using the same structural markup.
SGML • SGML (Standard Generalized Markup Language, ISO 8879, 1986) specifies a standard method for describing the structure of a document. Structural elements are, for example: title, chapter, paragraph. It is an extensible meta language. It can support an unlimited variety of document structures, such as information bulletins, technical manuals, parts catalogs, design specifications, reports, letters, and memos. • The Document Type Definition (DTD) describes the structure of the document (like a schema in a database). The DTD provides a framework of elements (chapters, headers). The DTD specifies rules for the relationships between elements, e.g. a chapter header must come after the start of a chapter. A document instance is a document whose content is tagged in conformance with a DTD. A DTD can be applied throughout a whole organization.
SGML continued • SGML uses tagging to identify the content's position within a DTD structure, so we insert tags around the content. You can nest elements. A parser program verifies that a document follows the rules of a DTD. The parser checks whether the document is structurally correct. • Documents can be ported to different formats for different output media (printer, screen, CD-ROM, speaker, TV). • Style is usually handled separately by style sheets, like Cascading Style Sheets (CSS).
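The idea that descriptive markup lets one document be published on several media can be sketched in Python. The element names (`memo`, `heading`, `para`) and the two rule sets are hypothetical; the document text says only what each piece is, and a separate set of rules per medium decides how it looks.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: the tags name the structure, not the appearance.
DOC = "<memo><heading>Donuts again</heading><para>Donuts are here.</para></memo>"

# One hypothetical rule set per output medium.
PRINT_RULES = {
    "heading": lambda text: text.upper(),
    "para": lambda text: "    " + text,
}
WEB_RULES = {
    "heading": lambda text: f"<h1>{text}</h1>",
    "para": lambda text: f"<p>{text}</p>",
}

def render(xml_text, rules):
    """Apply the rule for each structural element, in document order."""
    root = ET.fromstring(xml_text)
    return "\n".join(rules[child.tag](child.text or "") for child in root)

print_version = render(DOC, PRINT_RULES)
web_version = render(DOC, WEB_RULES)

print(print_version)
print(web_version)
```

With procedural markup, the "16 pt bold" instruction would be stored in the document itself, and a second medium would need a second copy of the document.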
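The two kinds of checking a parser performs can be sketched as follows. This is a simplified illustration using XML syntax and Python's standard library, which does not validate against DTDs. The `CONTENT_MODEL` dictionary is a hypothetical stand-in for a DTD, far less expressive than the real thing.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: each element maps to the exact
# ordered list of child elements it must contain.
CONTENT_MODEL = {
    "book": ["title", "chapter"],
    "chapter": ["header", "para"],
}

def is_well_formed(text):
    """True if the tags are properly opened, closed, and nested."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def conforms(element):
    """True if the element and its descendants follow CONTENT_MODEL."""
    expected = CONTENT_MODEL.get(element.tag)
    if expected is None:
        # Not in the model: treated as a text-only leaf element.
        return len(element) == 0
    children = [child.tag for child in element]
    return children == expected and all(conforms(child) for child in element)

good = "<book><title>T</title><chapter><header>H</header><para>P</para></chapter></book>"
misordered = "<book><title>T</title><chapter><para>P</para><header>H</header></chapter></book>"
broken = "<book><title>T</book></title>"

for name, text in (("good", good), ("misordered", misordered), ("broken", broken)):
    ok = is_well_formed(text)
    print(name, ok, ok and conforms(ET.fromstring(text)))
```

The `misordered` document is properly nested but breaks the rule that a chapter header comes before the paragraph, which is the kind of error only a check against the document type can catch.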
HTML • HTML (first version in 1992) is a tagging language that could be used on the World Wide Web for text formatting and linking documents. It adopts the syntax of SGML and is an application of SGML described by a particular DTD. HTML is not an extensible language. Authors cannot add their own tags. HTML supports style sheets written in the CSS language (color, font, layout for web pages) and Frameset to partition the browser window. • XHTML is a modular approach that allows markup tags to be supported in smaller client devices like cell phones, TVs, cars, kiosks, etc.
Positive features of HTML • HTML uses tags to separate content (text) from format (structure, appearance). • It lets amateurs control markup (good and bad). • HTML tags were used for appearance formatting, but little attention was paid to content structuring.
Negative features of HTML • HTML did not offer enough custom control over the WYSIWYG environment. • Things looked different in different browsers (reader interpreted, not author interpreted). • Navigating through hypertext requires user memory. • Designing hypertext (document collections) for easy searching is hard to do. Spiders, crawlers, robots, and search engines such as AltaVista all try to index the web.
Comments on CSS • Cascading Style Sheets helped HTML by moving format information out of tags like <font> and <b> and into the style sheet. • This lets tags like <header> carry structure information. • CSS is a styling tool that can also work with other markup languages like XML.
Comments on separation of format and content • A document combines formatting (structure and appearance) with content (information and data). • Structure: HTML does this a little bit. XML has a DTD or Schema. • Appearance (or presentation): HTML used to do this with tags like <b>, but now all appearance control should be taken out of HTML documents and put in CSS or XSL files.
Why a migration to XML was needed. • Binary files (in native formats) compress tightly for efficient transmission, but they are complex and proprietary. (A negative point for XML: its files are larger, because with markup there is more to store and transfer.) • Moving documents between applications is hard. Data had to be saved in a text format and moved, and conversions were not always good. (With XML, authors define open formats and standards for loading, saving, and transferring data, including between databases.) • Lock-in let Microsoft sell new versions of Word that could read the old format and save in a new format, which old versions could not read. XML, in contrast, handles both document description and data description, so structure and labels are not lost in the move.
XML – what is it? • XML (Extensible Markup Language, XML 1.0, 1998) is also a meta language in that it is used to define other languages. There is no predefined list of elements. • Elements are specified using a DTD or Schema. Style sheets (XSL) can also be used to specify the output format of each element. • XML is based on SGML, but it is a subset and is considered easier to program. Most current browser versions can also display XML.
XML related standards • XPath: specifies the data model and grammar for navigating an XML document. • XSL (eXtensible Stylesheet Language): includes a language for transforming XML documents (XSLT) and a formatting vocabulary (XSL-FO). • XSLT (eXtensible Stylesheet Language Transformations): defines a transformation language to convert XML documents into other formats. • XLL (eXtensible Linking Language): allows logic to be placed on linking.
XML related standards & groups • OAGIS: the Open Applications Group's (www.openapplications.org) Integration Specification for interoperability between ERP packages. • OASIS-ebXML: the Organization for the Advancement of Structured Information Standards (OASIS) Electronic Business XML (www.ebxml.org). • FinXML: Financial Markup Language (www.finxml.com) supports a universal standard for data interchange within the capital market. • FpML: Financial Products Markup Language (www.fpml.org) enables e-commerce activities in the financial derivatives field. • OFX: Open Financial Exchange (www.ofx.net) for the electronic exchange of financial data.
Other languages • MathML - tags for presenting formulas. • SMIL - Synchronized Multimedia Integration Language, a language for scheduling multimedia. It uses XML markup to identify and manage the presentation of files containing text, images, sound and video in multimedia presentations. • RDF - Resource Description Framework, a format for holding metadata information for XML. • HyTime - an SGML architecture that specifies the generic hypermedia structure of documents. It allows for the design of meta-DTDs for complex multimedia presentations, such as providing music together with other media. • For more information on markup languages, see http://www.w3.org/
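To give a feel for XPath-style navigation, here is a small sketch using Python's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The `memos` document and its `priority` attribute are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection of memos.
MEMOS = """<memos>
  <memo priority="high"><from>Jim</from><to>Joe</to><subject>Donuts again</subject></memo>
  <memo priority="low"><from>Joe</from><to>Jim</to><subject>Thanks</subject></memo>
</memos>"""

root = ET.fromstring(MEMOS)

# Path steps: every subject element inside every memo.
all_subjects = [s.text for s in root.findall("./memo/subject")]

# Attribute predicate: senders of high-priority memos only.
high_senders = [f.text for f in root.findall("./memo[@priority='high']/from")]

# Child-text predicate: memos addressed to Joe.
to_joe = [m.findtext("subject") for m in root.findall("./memo[to='Joe']")]

print(all_subjects, high_senders, to_joe)
```

A path expression reads like a file system path through the element tree, with bracketed predicates filtering the elements selected at each step.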
Here is the donut.xml file:
<?xml version="1.0"?>
<?xml-stylesheet href="donut.xsl" type="text/xsl"?>
<memo>
  <from>Jim</from>
  <to>Joe</to>
  <subject>Donuts again</subject>
  <date>April 13, 2001</date>
  <content>Donuts are here. But they will not be here for long. Benny ate 3. </content>
</memo>
Here is what you see in IE 6.0 for the donut.xml file:
From: Jim
To: Joe
Re: April 13, 2001
Donuts are here. But they will not be here for long. Benny ate 3.
Here is the style sheet donut.xsl:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <html><body>
      <xsl:apply-templates select="memo"/>
    </body></html>
  </xsl:template>
  <xsl:template match="memo">
    <p>From: <xsl:value-of select="from"/> </p>
    <p>To: <xsl:value-of select="to"/> </p>
    <p>Re: <xsl:value-of select="date"/></p><hr />
    <p><xsl:value-of select="content"/></p>
  </xsl:template>
</xsl:stylesheet>
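To make the transformation step concrete, the sketch below does by hand, in Python, roughly what the style sheet asks the browser to do. It is not an XSLT processor (Python's standard library has none), and the function name `memo_to_html` is made up. It mirrors the style sheet as written, which prints the date after the "Re:" label and does not output the subject.

```python
import xml.etree.ElementTree as ET

DONUT_XML = """<?xml version="1.0"?>
<memo>
  <from>Jim</from>
  <to>Joe</to>
  <subject>Donuts again</subject>
  <date>April 13, 2001</date>
  <content>Donuts are here. But they will not be here for long. Benny ate 3.</content>
</memo>"""

def memo_to_html(xml_text):
    """Hand-written equivalent of the 'memo' template in donut.xsl."""
    memo = ET.fromstring(xml_text)
    body = ET.Element("body")
    # Each pair is (label shown to the reader, element selected from the memo).
    for label, tag in (("From", "from"), ("To", "to"), ("Re", "date")):
        ET.SubElement(body, "p").text = f"{label}: {memo.findtext(tag)}"
    ET.SubElement(body, "hr")
    ET.SubElement(body, "p").text = memo.findtext("content")
    html = ET.Element("html")
    html.append(body)
    return ET.tostring(html, encoding="unicode", method="html")

html_output = memo_to_html(DONUT_XML)
print(html_output)
```

The XML file holds only the labeled content. Changing the style sheet (or this function) changes the presentation without touching the memo itself.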