XML for Information Management 26.4.-30.4.2010 University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/
Outline 1. Structured documents 2. Formal grammars in XML 3. Natural languages in XML documents 4. Adding meaning by markup 5. Text indexing 6. Logical structure of XML documents
1. Structured documents Structured document • structure, content, and external presentation can be separated from each other and processed separately • structural components have names • structural components can be recognized by software modules • possible to define the structure
1. Structured documents Structured document • Structure: different languages for defining the structure, e.g., DTD, XML Schema, RELAX NG for XML • Content: an open language standard, e.g., SGML, XML • Layout: different languages for defining the layout, e.g., CSS and XSL for XML
1. Structured documents Structured document (structure, content, layout) Example files: DTD.txt, rhymes-with-ext-dtd.txt, rhymes-with-ext-dtd.xml, rhymes-style.txt, rhymes-style.css, rhymes-with-style-and-ext-dtd.txt, rhymes-with-style-and-ext-dtd.xml
1. Structured documents Management of structured documents • document management • management of the data contained in documents
1. Structured documents Characteristics in the management of structured documents • Design. Adopting the approach of structured document management in an environment often requires careful planning before any documents are created. This includes schema design and layout design. • Content production. Content can be produced with different types of software, e.g., a syntax-directed editor, and its validity is checked against the schema. • Evolution. Schema versioning, layout versioning. • Operations. The most typical operation is some kind of transformation. • Software. Many kinds of software systems are used.
2. Formal grammars in XML A formal grammar is a way to describe the syntax of a language. It consists of • terminal symbols (alphabet) • nonterminal symbols • production rules • a start symbol The language defined by a grammar consists of all those strings over the alphabet that can be generated by starting with the start symbol and then applying the production rules until no nonterminal symbols are present.
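The derivation process described above can be sketched in a few lines of Python. The toy grammar below (S -> a S b | a b) is a hypothetical example and does not come from the slides; the sketch always expands the leftmost nonterminal and collects the strings that contain only terminal symbols.

```python
# Hypothetical toy grammar, not taken from the slides: S -> a S b | a b
GRAMMAR = {"S": [["a", "S", "b"], ["a", "b"]]}
START = "S"


def generate(max_steps):
    """Return every terminal string derivable from START in at most max_steps rule applications."""
    results = set()
    frontier = [[START]]  # sentential forms that still contain a nonterminal
    for _ in range(max_steps):
        next_frontier = []
        for form in frontier:
            # Position of the leftmost nonterminal in this sentential form.
            idx = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
            for rhs in GRAMMAR[form[idx]]:
                new_form = form[:idx] + rhs + form[idx + 1:]
                if any(sym in GRAMMAR for sym in new_form):
                    next_frontier.append(new_form)
                else:
                    # No nonterminals left: the string belongs to the language.
                    results.add("".join(new_form))
        frontier = next_frontier
    return results


strings = generate(3)
```

With three rule applications the sketch reaches the three shortest strings of the language; a larger max_steps yields longer ones.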
2. Formal grammars in XML In XML there are two kinds of formal grammars with their own notations: • the grammar defining the XML syntax in the XML specification • DTD
2. Formal grammars in XML The XML specification uses the EBNF (Extended Backus-Naur Form) notation with the metasymbols ?, *, +, |, and ( ). The syntax of XML 1.0 is described by production rules numbered from [1] to [89]. Some of the rules included in the first edition have been left out of later editions, and others have been added, for example, [28a] and [28b]. The notation of the XML syntax is described in Section 6 of the specification: 6. Notation.
2. Formal grammars in XML
A?      A is optional
A | B   A and B are alternatives
A+      A occurs once or more
A*      A occurs zero or more times
A - B   A but not B
A B     B after A
( )     grouping
Example rules in XML 1.0:
document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
Misc ::= Comment | PI | S
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
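As an illustration of how such a rule reads, the Comment production can be translated almost symbol by symbol into a regular expression. This is only a sketch: it assumes that Char can be approximated by "any character", whereas the specification excludes some code points, and the name is_comment is introduced here for illustration.

```python
import re

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Assumption: Char is approximated by "any character"; the real Char
# production in the specification excludes some code points.
COMMENT_RE = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text):
    """Return True if the whole string matches the Comment production."""
    return COMMENT_RE.fullmatch(text) is not None


examples = {
    "<!-- a note -->": is_comment("<!-- a note -->"),
    "<!-- a -- b -->": is_comment("<!-- a -- b -->"),
    "<!-- a --->": is_comment("<!-- a --->"),
}
```

The two alternatives inside the group mirror the rule: either a character other than '-', or a '-' followed by a character other than '-'. This is why a double hyphen inside a comment, or a comment ending in '--->', is rejected.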
2. Formal grammars in XML Production rules in a DTD:
<!ELEMENT rhymecollection (title?, rhyme+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT rhyme (line+)>
<!ELEMENT line (#PCDATA)>
The element type declarations of a DTD do not describe the concrete syntax of elements, only their hierarchical structure. The details of the concrete syntax (start-tag, end-tag, etc.) are described in the XML specification.
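To show that these declarations act as production rules over child element names, the sketch below translates the two element content models by hand into regular expressions and checks a parsed document against them. It is not a DTD validator: the translation is manual, the names CONTENT_MODELS, PCDATA_ONLY and conforms are hypothetical, and character data inside element content is not checked.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD above, translated by hand into regular
# expressions over space-terminated child element names (an assumption
# of this sketch, not a feature of any DTD tool).
CONTENT_MODELS = {
    "rhymecollection": r"(title )?(rhyme )+",
    "rhyme": r"(line )+",
}
# Element types declared as (#PCDATA): no child elements allowed.
PCDATA_ONLY = {"title", "line"}


def conforms(element):
    """Check the hierarchical structure of element against the models above."""
    if element.tag in PCDATA_ONLY:
        return len(element) == 0
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # undeclared element type
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)


good = ET.fromstring(
    "<rhymecollection><title>T</title>"
    "<rhyme><line>a</line><line>b</line></rhyme></rhymecollection>"
)
bad = ET.fromstring(
    "<rhymecollection><rhyme><line>a</line></rhyme>"
    "<title>T</title></rhymecollection>"
)
```

Here conforms(good) holds, while conforms(bad) fails because title follows rhyme, which the sequence (title?, rhyme+) does not allow.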
2. Formal grammars in XML The XML specification defines the concrete syntax of XML documents. The distinction between the concrete and abstract syntax of XML is not quite clear. W3C has developed four slightly different models to describe the abstract syntax: • XML Information Set • DOM model • XPath 1.0 model • XQuery 1.0 and XPath 2.0 data model Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.
3. Natural languages in XML documents Natural language may occur in XML marked-up text in • the content of elements • the markup: element, attribute, and entity names; attribute values; comments
3. Natural languages in XML documents Natural language in the markup is NOT utilized by the XML processor, BUT it can be utilized by • human individuals in reading the marked-up text, in information access, and in communicating with other individuals about the schema or the marked-up content • some software applications, for example, text analysis software
4. Adding meaning by markup It is important that the element and attribute names are meaningful to human readers. In the following markup the names are not useful in information access:
<AAA XXX="5">
  <rki YYY="Hamlet">Where wilt thou lead me? speak; I'll go no further.</rki>
  <rki YYY="ghost">Mark me.</rki>
</AAA>
4. Adding meaning by markup • Natural language in XML documents provides semantic information to human readers and for human communication. • Meaningful markup is useful for human users in information retrieval and in specifying transformations. • Markup may provide rich semantic and linguistic information.
4. Adding meaning by markup Example of combining structural, semantic and linguistic markup for the text "She smelled like trees.":
<Chapter section='1'>
  <Paragraph id='143' FragmentCode='1.12'>
    <Narration narrator='Benjy'>
      <Subject person='Caddy'>She</Subject>
      <Senses mode='smell'>smelled</Senses> like
      <Imagery referent='tree'>trees</Imagery>
    </Narration>
  </Paragraph>
</Chapter>
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.
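A minimal sketch of how software can use such markup for information access, assuming Python's standard xml.etree.ElementTree module and its limited XPath support. The variable names are introduced here for illustration; the queries work only because the element and attribute names carry meaning.

```python
import xml.etree.ElementTree as ET

# The marked-up sentence from the example above.
MARKED_UP = """<Chapter section='1'>
  <Paragraph id='143' FragmentCode='1.12'>
    <Narration narrator='Benjy'>
      <Subject person='Caddy'>She</Subject>
      <Senses mode='smell'>smelled</Senses> like
      <Imagery referent='tree'>trees</Imagery>
    </Narration>
  </Paragraph>
</Chapter>"""

root = ET.fromstring(MARKED_UP)

# Words marked as sense impressions of smell.
smell_words = [e.text for e in root.findall(".//Senses[@mode='smell']")]

# Persons referred to as subjects.
people = [e.get("person") for e in root.iter("Subject")]

# The plain text, recovered by dropping the markup and normalizing whitespace.
plain_text = " ".join("".join(root.itertext()).split())
```

The same queries against markup with names such as AAA or rki would be possible only for someone who already knows what those names stand for.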
4. Adding meaning by markup Another markup for the same text "She smelled like trees.":
<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'>
          <Subject person='Caddy'>She</Subject> smelled like trees.
        </Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.
4. Adding meaning by markup Some other examples: http://nrrc.mitre.org/NRRC/Docs_Data/MPQA_04/approval_time.htm http://www.cs.cmu.edu/~awb/festival_demos/sable.html http://www.etang.umontreal.ca/bwp1800/essays/flanders_encoding4.html
4. Adding meaning by markup • In the Semantic Web, semantic information about the meaning of the markup vocabulary of documents is available as additional metadata in a formal, standardized form. • The concepts and meanings are defined in formal ontologies. • Software applications can process the meanings.
5. Text indexing [Diagram: documents, index, search engine, query, answer] In information retrieval environments, collections of natural language documents are usually indexed, and retrieval is based on the index terms included in the index.
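A minimal sketch of index-based retrieval, assuming a simple inverted index in which every lowercased word is an index term. The function names and document identifiers are hypothetical; the sample texts are taken from the earlier slides.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each index term to the set of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every term of the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Sample collection built from texts quoted on earlier slides.
DOCS = {
    "d1": "She smelled like trees.",
    "d2": "Where wilt thou lead me? speak; I'll go no further.",
    "d3": "Mark me.",
}
INDEX = build_index(DOCS)
hits = search(INDEX, "mark me")
```

The query is answered from the index alone, without scanning the documents. A real search engine would add steps such as stemming, stop-word removal and ranking, which this sketch leaves out.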
6. Logical structure of XML documents Components of the logical structure • declarations • elements • comments • processing instructions
6. Logical structure of XML documents document ::= prolog element Misc* • prolog: declarations, comments, processing instructions • element: elements, comments, processing instructions • Misc*: comments, processing instructions
6. Logical structure of XML documents Declarations (production rule numbers in brackets):
• XML declaration [23]
• document type declaration [28]
• markup declaration [29]
  • element type declaration [45]
  • attribute list declaration [52]
  • entity declaration [70]
  • notation declaration [82]
• encoding declaration [80]
• standalone document declaration [32]
• text declaration [77]
Some declarations constrain the logical structure (e.g., element type and attribute list declarations), others the physical structure (e.g., entity declarations).
6. Logical structure of XML documents Typical element type declarations:
<!ELEMENT product (mfg, model, description, clock?)>    element content defined
<!ELEMENT model (#PCDATA)>
<!ELEMENT description (#PCDATA | feature)*>    mixed content defined
<!ELEMENT clock EMPTY>    empty element defined
6. Logical structure of XML documents Empty element defined:
<!ELEMENT clock EMPTY>
Two forms of the element are allowed in a well-formed document:
<clock></clock>
<clock/>
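A short sketch, assuming Python's standard xml.etree.ElementTree parser, showing that the two forms are only different concrete syntax for the same logical element. The helper describe is introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# The two concrete forms of an empty element.
long_form = ET.fromstring("<clock></clock>")
short_form = ET.fromstring("<clock/>")


def describe(element):
    """Summarize what the parser reports about an element."""
    return (element.tag, dict(element.attrib), element.text, len(element))


same = describe(long_form) == describe(short_form)
```

After parsing, an application can no longer tell which of the two forms was used in the document.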
6. Logical structure of XML documents Element content: definition by content models with metasymbols
*     iteration (zero or more)
+     iteration (once or more)
|     alternatives
?     optional
,     successive
( )   grouping
Example from the XHTML 1.0 Strict DTD:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
#PCDATA is not accepted in an element content model!
6. Logical structure of XML documents Mixed content: the definition has basically two forms
(#PCDATA)
(#PCDATA | e1 | … | en)*
Examples:
<!ELEMENT text (#PCDATA)>
<!ELEMENT section (#PCDATA | subsection)*>
<!ELEMENT section (#PCDATA | subsection | paragraph)*>
#PCDATA is always included in the content specification and comes first in the list of alternatives.
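A sketch of what mixed content looks like to an application, assuming Python's standard xml.etree.ElementTree module, where character data is exposed through the text and tail properties. The sample section content and the function content_items are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of: <!ELEMENT section (#PCDATA | subsection)*>
section = ET.fromstring(
    "<section>Intro text <subsection>Details</subsection> closing text</section>"
)


def content_items(element):
    """List the character data and child elements of element in document order."""
    items = []
    if element.text:
        items.append(("text", element.text))
    for child in element:
        items.append(("element", child.tag))
        if child.tail:
            # Character data that follows the child inside the parent.
            items.append(("text", child.tail))
    return items


items = content_items(section)
```

Character data and child elements alternate freely, which is exactly what the iterated choice (#PCDATA | subsection)* permits.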
6. Logical structure of XML documents Attribute list declarations are used • to define the set of attributes pertaining to a given element type • to establish type constraints for these attributes • to provide default values for attributes
6. Logical structure of XML documents <!ATTLIST poem author CDATA #REQUIRED> • poem: element type • author: attribute name • CDATA: attribute type (string) • #REQUIRED: constraint (the attribute must be specified for all elements of type poem)
6. Logical structure of XML documents Defining constraints [60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue) • #REQUIRED: the attribute must be provided in all elements of the given type • #IMPLIED: the attribute may be provided in an element; no default value is provided • AttValue: a default value is given between single or double quotes • #FIXED AttValue: instances of the attribute must match the given default value
6. Logical structure of XML documents Attribute types [54] AttType ::= StringType | TokenizedType | EnumeratedType string type: • CDATA: character string tokenized types: • ENTITY, ENTITIES: entity names • NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names • ID: names that uniquely identify elements • IDREF, IDREFS: references to ID type identifiers enumerated types: • NOTATION: identifies notations • enumeration
6. Logical structure of XML documents
<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line id ID #REQUIRED
               seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>
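The ID and IDREFS constraints of this example can be checked by hand, as the following sketch shows. It assumes Python's standard xml.etree.ElementTree module, which parses the internal DTD subset but does not validate against it, so the checks are written out explicitly. The function check_references and the attribute names hard-coded in it are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line id ID #REQUIRED
               seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>"""


def check_references(xml_text):
    """Return a list of violations of the ID and IDREFS constraints on line elements."""
    root = ET.fromstring(xml_text)
    lines = list(root.iter("line"))
    problems = []
    seen = set()
    for line in lines:
        ident = line.get("id")
        if ident is None:
            problems.append("line without id")        # violates #REQUIRED
        elif ident in seen:
            problems.append(f"duplicate id {ident}")  # violates ID uniqueness
        else:
            seen.add(ident)
    for line in lines:
        # An IDREFS value is a whitespace-separated list of references.
        for ref in line.get("seeline", "").split():
            if ref not in seen:
                problems.append(f"dangling reference {ref}")
    return problems


problems = check_references(DOCUMENT)
```

For the document above the list of problems is empty. A validating parser would perform these checks itself on the basis of the ATTLIST declaration.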
6. Logical structure of XML documents XML-aware web browsers support the visualization of the tree structure. Example:
<Chapter section='1'><Narration narrator='Benjy'>
<Imagery place='tree' mode='simile' sense='smell'>
<Fragment code='1.12'><Paragraph id='143'>
<Subject person='Caddy'>She</Subject>smelled like trees.
</Paragraph></Fragment></Imagery>
</Narration></Chapter>
6. Logical structure of XML documents Different abstract models describe the tree in slightly different ways.
<poem author="Murasaki Shikibu" born="974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>
6. Logical structure of XML documents Node types of XPath 1.0 in the tree of the poem example:
Root node
  Element node: poem
    Attribute node: author = Murasaki Shikibu
    Attribute node: born = 974
    Comment node: The poem is translated from Japanese by Kenneth Rexroth
    Element node: line
      Text node: This life of ours would not cause you sorrow
    Element node: line
      Text node: if you thought of it as like
    Element node: line
      Text node: the mountain cherry blossoms
    Element node: line
      Text node: which bloom and fade in a day.
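The node inventory of the poem can be approximated with Python's standard xml.dom.minidom module. This is a sketch under two assumptions: the DOM model is used as a stand-in for the XPath 1.0 model, with the DOM document node playing the role of the XPath root node, and the poem is written without whitespace between the elements so that no whitespace-only text nodes appear. The names KIND and count_nodes are introduced here for illustration.

```python
from collections import Counter
from xml.dom import Node, minidom

# The poem example, written without whitespace between elements.
POEM = (
    '<poem author="Murasaki Shikibu" born="974">'
    "<!-- The poem is translated from Japanese by Kenneth Rexroth -->"
    "<line>This life of ours would not cause you sorrow</line>"
    "<line>if you thought of it as like</line>"
    "<line>the mountain cherry blossoms</line>"
    "<line>which bloom and fade in a day. </line>"
    "</poem>"
)

# DOM node types mapped to the XPath 1.0 node type names used above.
KIND = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}


def count_nodes(node, counts=None):
    """Count the nodes of each type in the tree below node."""
    counts = Counter() if counts is None else counts
    counts[KIND.get(node.nodeType, "other")] += 1
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes are attached to their element, not listed among its children.
        counts["attribute"] += node.attributes.length
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts


counts = count_nodes(minidom.parseString(POEM))
```

The counts correspond to the tree above: one root node, five element nodes, two attribute nodes, one comment node and four text nodes.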