XML for Information Management 12.1.-16.1. 2009 University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/
Day 1: Course introduction, XML examples and concepts Outline • 1. Course introduction • 2. XML examples • 3. XML concepts
1. Course introduction: Instructor
• Home university: University of Jyväskylä in Finland, Faculty of Information Technology
• Home page: http://users.jyu.fi/~airi/
• Experience Jyväskylä: http://www3.jkl.fi/international/experience/index.html
1. Course introduction: Instructor
• My research areas: structured documents, content management in organizations, document standardization, semantic web, information retrieval
• My XML-related research has concerned:
  • modelling structured text
  • querying structured text
  • SGML/XML standardization
1. Course introduction: Instructor
Tague, J., Salminen, A., & McClellan, C. (1991). Complete formal model for information retrieval systems. In Proc. of the 14th ACM SIGIR Conference, 14-20. New York: ACM Press.
Salminen, A., & Watters, C. (1992). A two-level structure for textual databases to support hypertext access. Journal of the American Society for Information Science 43 (6), 432-447.
Salminen, A., & Tompa, F. (1993). PAT expressions: an algebra for text search. Acta Linguistica Hungarica 41 (1-4), 277-306. http://www.cs.jyu.fi/~airi/papers/COMPLEX-1992.pdf
Salminen, A., Tague-Sutcliffe, J., & McClellan, C. (1995). From text to hypertext by indexing. ACM Transactions on Information Systems 13 (1), 69-99.
Salminen, A., Lehtovaara, M., & Kauppinen, K. (1996). Standardization of digital legislative documents - a case study. In Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences (pp. 72-81). Los Alamitos, CA: IEEE Computer Society Press.
Kuikka, E., & Salminen, A. (1997). Two-dimensional filters for structured text. Information Processing and Management 33 (1), 37-54.
1. Course introduction: Instructor
Salminen, A., Kauppinen, K., & Lehtovaara, M. (1997). Towards a methodology for document analysis. Journal of the American Society for Information Science 48 (7), Special Issue on Structured Information/Standards for Document Architectures, 644-655.
Salminen, A., & Tompa, F. (1999). Grammars++ for modelling information in text. Information Systems 24 (1), 1-24.
Salminen, A., Tiitinen, P., & Lyytikäinen, V. (1999). Usability evaluation of a structured document archive. In Proc. of the Thirty-Second Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press.
Lyytikäinen, V., Tiitinen, P., & Salminen, A. (2001). XML metadata for accessing heterogeneous legal databases. In Proc. of the XML Europe 2001 Conference. http://www.gca.org/papers/xmleurope2001/papers/html/s27-4.html
Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. In Proc. of the ACM Symposium on Document Engineering (DocEng '01), 85-94. New York: ACM Press.
Salminen, A., Lyytikäinen, V., Tiitinen, P., & Mustajärvi, O. (2001). Experiences of SGML standardization: The case of the Finnish legislative documents. In Proc. of the Thirty-Fourth Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press.
1. Course introduction: Instructor
Salminen, A. (2003). Document analysis methods. Encyclopedia of Library and Information Science, Second Edition, Revised and Expanded (pp. 916-927). New York: Marcel Dekker.
Korhonen, R., & Salminen, A. (2003). Visualization of EDI messages: Facing the problems in the use of XML. In Proc. of the Fifth International Conference on Electronic Commerce, 466-473. New York: ACM Press.
Salminen, A., Lyytikäinen, V., Tiitinen, P., & Mustajärvi, O. (2004). Implementing digital government in the Finnish Parliament. In Digital Government: Strategies and Implementation (pp. 242-259). Hershey, PA: IDEA Group Publishing.
Salminen, A. (2005). Building digital government by XML. In Proc. of the Thirty-Eighth Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press.
Salminen, A., Nurmeksela, R., Lehtinen, A., Lyytikäinen, V., & Mustajärvi, O. (2006). Content production strategies for e-Government. In Encyclopedia of Digital Government, Vol. I (pp. 224-230). Hershey, PA: IDEA Group Publishing.
Nurmeksela, R., Jauhiainen, E., Salminen, A., & Honkaranta, A. (2007). XML document implementation: Experiences from three cases. In Proceedings of the Second International Conference on Digital Information Management (pp. 224-229). Los Alamitos, CA: IEEE.
1. Course introduction: Instructor
XML-related projects
• RASKE (1994-1998): Developing Standards for Structured Documents
• inSGML (1998-2001): Methods for SGML standardization in industry
• EULEGIS (1998-2000): European User Views to Legislative Information in Structured Form
• AirXML (2002-2004): XML and Data Warehousing in Air Defence
• RASKE2 (2003-2006): Methods for the Integration of Systems and Services in e-Government
1. Course introduction
• Syllabus: http://users.jyu.fi/~airi/opetus/xml/erlangen/
• Course Readings: available on the course web site
• Project Assignment: http://users.jyu.fi/~airi/opetus/xml/erlangen/project.html
• Contact by email: airi.salminen@jyu.fi
1. Course introduction: project
• Purpose
  • The projects are intended to explore the application of XML in various contexts. Students interested in practical XML exercises are free to suggest a practical project where they can test some XML software and/or build an application of their own.
  • The project can also be an investigation of an existing or planned XML solution in an organizational context together with an analysis of the impacts of the solution.
• Topics: Proposed by students
• Teams of two, or individual projects
• The phases
  • 2-page topic proposal: due on Feb. 20
  • Project report: due on March 31
2. XML examples
• separation of the primary content and markup
• markup is metadata adding some information to the primary content

<?xml version = "1.0"?>
<poem author = "Murasaki Shikibu" author_born = "974">
  <stanza>
    <line>This life of ours would not cause you sorrow</line>
    <line>if you thought of it as like</line>
    <line>the mountain cherry blossoms</line>
    <line>which bloom and fade in a day.</line>
  </stanza>
</poem>

Note: The text of the line elements is taken from http://www.bopsecrets.org/rexroth/translations/japanese.htm, which contains Kenneth Rexroth's translations of Japanese poetry.
2. XML examples
External presentation for human perception can be defined in a separate stylesheet. With a suitable stylesheet, the previous XML document might look like:

  This life of ours would not cause you sorrow
  if you thought of it as like
  the mountain cherry blossoms
  which bloom and fade in a day.

For examples of attaching stylesheets, try searching Google for "xml examples".
2. XML examples A piece of prose in the TEI Guidelines: http://www.tei-c.org/Guidelines/Customization/Lite/U5-eg.html
3. XML concepts
XML = Extensible Markup Language
A set of rules for defining and representing information as structured documents for applications on the Internet. XML is a restricted form of the older markup language called SGML.
T. Bray, J. Paoli, & C. M. Sperberg-McQueen (Eds.), Extensible Markup Language (XML) 1.0, W3C Recommendation 10 February 1998, http://www.w3.org/TR/1998/REC-xml-19980210/
T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, & F. Yergeau (Eds.), Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation 26 November 2008, http://www.w3.org/TR/2008/REC-xml-20081126/
T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, F. Yergeau, & J. Cowan (Eds.), Extensible Markup Language (XML) 1.1 (Second Edition), W3C Recommendation 16 August 2006, http://www.w3.org/TR/2006/REC-xml11-20060816/
XML Development History: http://www.w3.org/XML/hist2002
3. XML concepts: XML processor
Processing XML documents: XML Document → XML Processor → Application
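The flow from document to processor to application can be sketched in Python. This is only an illustration: it assumes the standard-library module xml.etree.ElementTree in the role of the XML processor, and the function `application` and the one-element sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# The XML document, held in a string for simplicity.
DOCUMENT = "<year>1654</year>"


def application(root):
    """Hypothetical application: it sees the parsed structure, not the raw markup."""
    return f"element '{root.tag}' with content '{root.text}'"


# The XML processor (parser) reads the document and passes its structure on.
root = ET.fromstring(DOCUMENT)
print(application(root))
```

The application never scans for angle brackets itself; recognizing the markup is the processor's job.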
3. XML concepts: physical and logical structure
An XML processor recognizes two structures in a document:
• physical structure, consisting of entities
• logical structure, where elements are the core components
3. XML concepts: entity Entity • file (text or some other kind of data) • named piece of text
3. XML concepts: entity
Example of an entity structure
[Figure: a root entity and the entities part 1, part 2, figure1.jpg, figure2.jpg and figure3.jpg, linked by entity references. Legend: entity, entity reference.]
3. XML concepts: entity
Entity as a named piece of text, like in HTML:
  Y&ouml; Jyv&auml;skyl&auml;ss&auml;
is displayed as
  Yö Jyväskylässä
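A named piece of text can also be declared in the document itself. The sketch below assumes Python's standard-library xml.etree.ElementTree as the processor; the entity name `city` and the `note` document are hypothetical.

```python
import xml.etree.ElementTree as ET

# An internal entity named "city" is declared in the internal DTD subset
# and referenced twice as &city; in the content. &#228; is the character ä.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY city "Jyv&#228;skyl&#228;">
]>
<note>Night in &city;. Greetings from &city;!</note>"""

note = ET.fromstring(DOCUMENT)

# The processor replaces each entity reference with the entity's text.
print(note.text)
```

The text is written once in the declaration and reused wherever the reference appears.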
3. XML concepts: element
Element
An element is marked up by a begin-tag and an end-tag.

<year>1654</year>

• begin-tag: <year>
• content: 1654
• end-tag: </year>
3. XML concepts: element
Example 1: a document of seven elements

<?xml version="1.0"?>
<rhymecollection>
  <rhyme>
    <line>Ole aina iloinen</line>
    <line> niin kuin pikku varpunen</line>
  </rhyme>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>
3. XML concepts: tree structure
Example 1 as an element tree

rhymecollection (root element)
  rhyme
    line
    line
  rhyme
    line
    line

• There is always one root element
• Every non-root element is a child element of a parent element
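The element tree of Example 1 can be walked programmatically. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree; the helper name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

EXAMPLE_1 = """<?xml version="1.0"?>
<rhymecollection>
  <rhyme>
    <line>Ole aina iloinen</line>
    <line>niin kuin pikku varpunen</line>
  </rhyme>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>"""


def outline(element, depth=0):
    """Return one indented line per element, each parent before its children."""
    lines = ["  " * depth + element.tag]
    for child in element:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(EXAMPLE_1)
print("\n".join(outline(root)))
print("elements:", sum(1 for _ in root.iter()))
```

The count covers the root element, two rhyme elements and four line elements.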
3. XML concepts: attribute
Extra information can be attached to elements by attributes. An attribute has:
• name
• value (character string)

<lastname earlier="Rantanen">Korhonen</lastname>
(name: earlier, value: Rantanen)

Two predefined attributes: xml:lang and xml:space.
• xml:lang for identifying the language of the content of an element
• xml:space for signaling that white space should be preserved by the application
3. XML concepts: elements and attributes Data in XML elements: • as element content • as attribute value
3. XML concepts: elements and attributes
Three alternative ways of giving two last names for a person:

1. <lastname earlier="Rantanen">Korhonen</lastname>

2. <lastname>
     <earlier>Rantanen</earlier>
     <now>Korhonen</now>
   </lastname>

3. <lastname earlier="Rantanen" now="Korhonen"></lastname>

What is the difference?
3. XML concepts: elements and attributes
In the logical structure:
• The child elements of a parent element are ordered.
• The order in which attributes are written in an element is insignificant.
3. XML concepts: elements and attributes
Different structures:

<lastname>
  <earlier>Rantanen</earlier>  <!-- 1. child element -->
  <now>Korhonen</now>          <!-- 2. child element -->
</lastname>

<lastname>
  <now>Korhonen</now>          <!-- 1. child element -->
  <earlier>Rantanen</earlier>  <!-- 2. child element -->
</lastname>
3. XML concepts: elements and attributes
Equivalent solutions:

<lastname earlier="Rantanen" now="Korhonen"></lastname>

<lastname now="Korhonen" earlier="Rantanen"></lastname>
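The difference between ordered child elements and unordered attributes can be checked with a parser. This is a small sketch, assuming Python's standard-library xml.etree.ElementTree; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Same attributes, written in a different order.
a = ET.fromstring('<lastname earlier="Rantanen" now="Korhonen"></lastname>')
b = ET.fromstring('<lastname now="Korhonen" earlier="Rantanen"></lastname>')

# Same child elements, written in a different order.
c = ET.fromstring("<lastname><earlier>Rantanen</earlier><now>Korhonen</now></lastname>")
d = ET.fromstring("<lastname><now>Korhonen</now><earlier>Rantanen</earlier></lastname>")

# Attributes form an unordered mapping from names to values.
same_attributes = a.attrib == b.attrib
# Child elements form an ordered sequence.
same_children = [e.tag for e in c] == [e.tag for e in d]

print(same_attributes, same_children)
```

The two attribute versions compare as equal, while the two child-element versions do not.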
3. XML concepts: Unicode
XML documents are encoded in Unicode, which is intended for content written in any natural language of the world.
• The development work is done by the Unicode Consortium
• The latest version at the time of the course: Unicode 5.1.0
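How a processor maps stored bytes to Unicode characters can be illustrated as follows. The sketch assumes Python's standard-library xml.etree.ElementTree; the `title` element and the choice of the two encodings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

text = "Yö Jyväskylässä"

# The same document serialized in two encodings; the encoding declaration
# tells the XML processor how to decode the bytes.
as_utf8 = f'<?xml version="1.0" encoding="UTF-8"?><title>{text}</title>'.encode("utf-8")
as_latin1 = f'<?xml version="1.0" encoding="ISO-8859-1"?><title>{text}</title>'.encode("iso-8859-1")

# The byte sequences differ, but both decode to the same Unicode characters.
bytes_differ = as_utf8 != as_latin1
same_characters = ET.fromstring(as_utf8).text == ET.fromstring(as_latin1).text == text

print(bytes_differ, same_characters)
```

Whatever the storage encoding, the application receives the same sequence of Unicode characters.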
3. XML concepts: DTD
XML is a metalanguage intended for defining languages for special application areas.
Document Type Definition (DTD) is the mechanism for defining such languages.
3. XML concepts: DTD
DTD:

<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ELEMENT line (#PCDATA)>
]>

Example 1 meets the constraints defined in the DTD.
3. XML concepts: DTD
Attributes added

<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author   CDATA   #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
3. XML concepts: DTD
A DTD can be attached to a document
• as an internal subset
• as an external subset
• by combining internal and external markup declarations
The DTD consists of all the markup declarations together.
3. XML concepts: DTD
Internal DTD

<?xml version="1.0" ?>
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author   CDATA   #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
<rhymecollection>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>
3. XML concepts: DTD
External DTD

<?xml version="1.0"?>
<!DOCTYPE rhymecollection SYSTEM "myrhyme.dtd">
<rhymecollection>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>

The system identifier "myrhyme.dtd" gives the address of the external DTD.
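3. XML concepts: DTD
Contents of "myrhyme.dtd": a text declaration followed by the markup declarations. An external subset has no <!DOCTYPE ... [ ]> wrapper, and a text declaration must include an encoding declaration (UTF-8 is assumed here).

Text declaration:
  <?xml version="1.0" encoding="UTF-8"?>
Markup declarations:
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author   CDATA   #IMPLIED >
  <!ELEMENT line (#PCDATA)>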
3. XML concepts: DTD
DTD is just one of the definition mechanisms available for constraining XML data. The most important alternatives:
• XML Schema
• RELAX NG
The term schema (or XML schema) can refer to a definition written with any definition mechanism developed for XML data. The languages for defining schemas are called schema languages.
3. XML concepts: XML application
An XML application is an XML-based language, (usually) defined by some schema language.
Examples of XML applications:
• XHTML: http://www.w3.org/TR/xhtml1/
• RSS (Really Simple Syndication): http://blogs.law.harvard.edu/tech/rss
• TEI (Text Encoding Initiative): http://www.tei-c.org/index.xml
• ebXML (Electronic Business using XML): http://www.ebxml.org/
3. XML concepts: XML application XML -- SGML – HTML -- XHTML • XML is a subset of SGML • HTML is an SGML application • XHTML is an XML application
3. XML concepts: well-formed and valid
Two kinds of constraints in the XML specification:
• well-formedness constraints: every XML document has to meet them; a document that meets them is called well-formed
• validity constraints: a document that is associated with a DTD and meets the validity constraints (which include the constraints expressed in the DTD) is called valid
3. XML concepts: well-formed and valid
A requirement for well-formed documents: each child element has to be completely contained in its parent element.

<date><day>24<month>1</day></month><year>2005</year></date>

NOT well-formed: the month element begins inside day but ends outside it.
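A processor reports a violation of well-formedness as a fatal error. The check below is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree; the helper name `is_well_formed` and the properly nested variant of the date are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the XML processor parses the document without a fatal error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The month element overlaps the day element instead of nesting inside it.
overlapping = "<date><day>24<month>1</day></month><year>2005</year></date>"
# Each child element is completely contained in its parent.
nested = "<date><day>24</day><month>1</month><year>2005</year></date>"

print(is_well_formed(overlapping), is_well_formed(nested))
```

Note that this only tests well-formedness. ElementTree is a non-validating processor, so it does not check a document against a DTD.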
3. XML concepts: well-formed and valid

<?xml version="1.0" ?>
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author   CDATA   #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
<rhymecollection>
  <rhyme xml:lang = "fi">
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>

VALID, even though the attribute value is not correct: the rhyme is in English, but xml:lang says "fi" (Finnish).
3. XML concepts: well-formed and valid

<?xml version="1.0" ?>
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author   CDATA   #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
<rhymecollection>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>

NOT valid: the required attribute xml:lang is missing from the rhyme element.
3. XML concepts: Namespaces
There is often a need to use elements and attributes originating from different environments (or applications). The vocabularies of two environments may include common names intended for different purposes. If declarations from several vocabularies are used in a single DTD, name collisions must be avoided.
3. XML concepts: Namespaces
• XML namespaces
  • provide a method for qualifying element and attribute names so that name collisions can be avoided
• Motivation: modularity and documentation
If a well-understood markup vocabulary for element and attribute names exists, it should be re-used rather than re-invented, especially if there is also software available.
http://www.w3c.org/TR/REC-xml-names
3. XML concepts: Namespaces
XML namespace
Collection of names, identified by a URI
No formal rules for defining names in a namespace
URI (Uniform Resource Identifier)
• URL (Uniform Resource Locator) or
• URN (Uniform Resource Name)
Generic Syntax, RFC 3986: http://www.ietf.org/rfc/rfc3986.txt
In Namespaces in XML 1.1, URI has been replaced by IRI (Internationalized Resource Identifier, RFC 3987: http://www.rfc-editor.org/rfc/rfc3987.txt).
3. XML concepts: Namespaces • Example • Namespace: http://uwaterloo.ca • Element names: department, name, professor, student, last_name, first_name, ... • Global attribute names: id, ... • Per-element-type attribute names: student: supervisor, ...
3. XML concepts: Namespaces
Namespace declaration: defines a label (prefix) for the namespace and associates it with the namespace identifier (URI)
Qualified name: a namespace prefix and a local part, separated by a colon

<?xml version="1.0"?>
<report xmlns:uw="http://uwaterloo.ca">
  <uw:department>
    <uw:name>Department of Computer Science</uw:name>
    ...
</report>
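What a namespace-aware processor does with qualified names can be sketched as follows, assuming Python's standard-library xml.etree.ElementTree. To make the report parseable, the elided part of the example is replaced here by a closing uw:department tag, which is an assumption; the lookup prefix `w` is likewise an arbitrary choice.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<report xmlns:uw="http://uwaterloo.ca">
  <uw:department>
    <uw:name>Department of Computer Science</uw:name>
  </uw:department>
</report>"""

report = ET.fromstring(DOCUMENT)
department = report[0]

# ElementTree reports a qualified name as {namespace URI}local-part.
print(report.tag)
print(department.tag)

# A lookup may use any prefix that is mapped to the same namespace URI.
name = report.find("w:department/w:name", {"w": "http://uwaterloo.ca"})
print(name.text)
```

The prefix is only a local label. The names are identified by the namespace URI together with the local part, which is why the lookup works with a different prefix.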
3. XML concepts: Namespaces Prefix xml is reserved for W3C development work and its identifier is http://www.w3.org/XML/1998/namespace. The namespace can be declared in a document but it can be used without declaration. Prefix xmlns is used only for declaring namespaces. It cannot be used as a name of a namespace.
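That the xml prefix works without a declaration can be observed with a namespace-aware processor. This is a sketch, assuming Python's standard-library xml.etree.ElementTree; the constant name `XML_NS` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# No xmlns:xml declaration appears anywhere in this document.
rhyme = ET.fromstring('<rhyme xml:lang="en"><line>See, see! What shall I see?</line></rhyme>')

# The processor still resolves the xml prefix to its fixed namespace identifier.
print(rhyme.attrib)
print(rhyme.get(f"{{{XML_NS}}}lang"))
```

The attribute is stored under the fixed namespace identifier of the xml prefix, even though the document never declares that prefix.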
Open source software for experimentation: http://www.w3.org/Status