XML for Information Management
12.1.-16.1. 2009
University of Erlangen-Nuremberg, Computational Linguistics
Instructor: Professor Airi Salminen
http://users.jyu.fi/~airi/
Day 3: Formal and Natural Languages in XML
Outline
1. Formal grammars in XML
2. Natural language in XML documents
3. The meaning of markup
4. Text indexing
1. Formal grammars in XML
A formal grammar is a way to describe the syntax of a language. It consists of:
• terminal symbols (the alphabet)
• nonterminal symbols
• production rules
• a start symbol
The language defined by a grammar consists of all those strings over the alphabet that can be generated by starting with the start symbol and applying the production rules until no nonterminal symbols remain.
1. Formal grammars in XML
There are different formalisms for presenting a formal grammar, for example Backus-Naur Form (BNF) and the various Extended Backus-Naur Forms (EBNF).
Example. A grammar for a tree order:
    tree_order ::= order ('; ' quantity)*
    order      ::= tree ': ' quantity
    tree       ::= 'MAPLE' | 'OAK' | 'PINE'
    quantity   ::= '5' | '10'
Alphabet: '5', '10', 'OAK', 'MAPLE', 'PINE', ': ', '; '
Nonterminal symbols: tree_order, order, tree, quantity
Start symbol: tree_order
The metasymbols * and | indicate repetition and alternatives, respectively.
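As an illustration, the tree-order grammar can be turned into a small recognizer. The following is a minimal sketch in Python, not part of the original slides; the function names (match_terminal, match_order, is_tree_order) are hypothetical, and the code follows the production rules exactly as they are written above.

```python
TREES = ("MAPLE", "OAK", "PINE")
QUANTITIES = ("5", "10")

def match_terminal(text, pos, options):
    """Return the position after the first option found at pos, or None."""
    for option in options:
        if text.startswith(option, pos):
            return pos + len(option)
    return None

def match_order(text, pos):
    """order ::= tree ': ' quantity"""
    pos = match_terminal(text, pos, TREES)
    if pos is None:
        return None
    pos = match_terminal(text, pos, (": ",))
    if pos is None:
        return None
    return match_terminal(text, pos, QUANTITIES)

def is_tree_order(text):
    """tree_order ::= order ('; ' quantity)*"""
    pos = match_order(text, 0)
    if pos is None:
        return False
    # Repetition: consume '; ' quantity until the input is exhausted.
    while pos < len(text):
        pos = match_terminal(text, pos, ("; ",))
        if pos is None:
            return False
        pos = match_terminal(text, pos, QUANTITIES)
        if pos is None:
            return False
    return True
```

With the rules as given, is_tree_order("OAK: 10; 5") holds, while "OAK: 10; PINE: 5" is rejected, because the repeated part of tree_order is a quantity and not an order.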
1. Formal grammars in XML
In XML there are two kinds of formal grammars, each with its own notation:
• the grammar defining the XML syntax in the XML specification
• the DTD
1. Formal grammars in XML
The XML specification uses the EBNF notation with the metasymbols ?, *, +, |, and ( ). The syntax of XML 1.0 is described by production rules numbered from [1] to [89]. Some of the rules included in the first edition have been left out of later editions, and some others have been added, for example [28a] and [28b]. The notation is described in Section 6 of the specification: 6. Notation.
1. Formal grammars in XML
    A?       A is optional
    A | B    A and B are alternatives
    A+       A occurs one or more times
    A*       A occurs zero or more times
    A - B    A but not B
    A B      A followed by B
    ( )      grouping
Example rules in XML 1.0:
    document ::= prolog element Misc*
    prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
    Misc     ::= Comment | PI | S
    Comment  ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
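The Comment rule translates almost symbol by symbol into a regular expression. The sketch below is in Python and is not part of the original slides; the function name is_xml_comment is hypothetical, and Char is approximated by "any character", which is looser than the Char production in the specification.

```python
import re

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
#   (Char - '-')        ->  [^-]
#   '-' (Char - '-')    ->  -[^-]
# Assumption: Char is approximated by "any character".
COMMENT_PATTERN = re.compile(r"<!--(?:[^-]|-[^-])*-->")

def is_xml_comment(text):
    """Return True if the whole string matches the Comment rule."""
    return COMMENT_PATTERN.fullmatch(text) is not None
```

The rule allows a single hyphen inside a comment but excludes the sequence "--", so is_xml_comment("<!-- a - b -->") holds while "<!-- a -- b -->" and "<!-- x --->" are rejected.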
1. Formal grammars in XML
Production rules in a DTD:
    <!ELEMENT rhymecollection (title?, rhyme+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT rhyme (line+)>
    <!ELEMENT line (#PCDATA)>
The element type declarations of a DTD do not describe the concrete syntax of elements, only their hierarchical structure. The details of the concrete syntax (start-tag, end-tag, etc.) are described in the XML specification.
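To show how element type declarations constrain only the hierarchy, the content models above can be rewritten by hand as regular expressions over the names of an element's children. This is a minimal Python sketch using the standard library, not a DTD validator: the CONTENT_MODELS table, the conforms function, and the sample document are all hypothetical, and the check ignores attributes and stray text in element content.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, rewritten by hand as regular expressions
# over a sequence of child element names, each followed by a space.
# None stands for (#PCDATA): text only, no child elements.
CONTENT_MODELS = {
    "rhymecollection": r"(title )?(rhyme )+",
    "title": None,
    "rhyme": r"(line )+",
    "line": None,
}

def conforms(element):
    """Check the element and its descendants against CONTENT_MODELS."""
    if element.tag not in CONTENT_MODELS:
        return False
    model = CONTENT_MODELS[element.tag]
    children = list(element)
    if model is None:
        return not children
    child_names = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, child_names) is None:
        return False
    return all(conforms(child) for child in children)

# Hypothetical sample document for illustration.
VALID = ET.fromstring(
    "<rhymecollection><title>Rhymes</title>"
    "<rhyme><line>first line</line><line>second line</line></rhyme>"
    "</rhymecollection>"
)
```

Here conforms(VALID) holds. A rhymecollection without any rhyme, or with the title placed after a rhyme, fails the check, although the parser accepts both as well-formed XML.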
1. Formal grammars in XML
The XML specification defines the concrete syntax of XML documents. The distinction between the concrete and the abstract syntax of XML is not quite clear. W3C has developed a number of slightly different models to describe the abstract syntax.
2. Natural language in XML documents
Natural language may occur in XML marked-up text in the:
• content of elements
• markup
• element and attribute names
• attribute values
• comments
2. Natural language in XML documents
Natural language in the markup is NOT utilized by the XML processor, BUT it can be utilized by
• human individuals in
    • reading the marked-up text
    • information access
    • communicating with other individuals about the schema or the marked-up content
• some software applications, for example, text analysis software
3. Adding meaning by markup
It is important that the element and attribute names are meaningful to human readers.
    <AAA XXX="5">
      <rki YYY="Hamlet"> Where wilt thou lead me? speak; I'll go no further. </rki>
      <rki YYY="ghost"> Mark me. </rki>
    </AAA>
The names in this example are not useful in information access.
3. Adding meaning by markup
• Natural language in XML documents provides semantic information to human readers and for human communication.
• Meaningful markup is useful for human users in information retrieval and in specifying transformations.
• Semantic and linguistic information can be added to natural language content by markup.
3. Adding meaning by markup
Example of combining structural, semantic and linguistic markup:
She smelled like trees.
    <Chapter section='1'>
      <Paragraph id='143' FragmentCode='1.12'>
        <Narration narrator='Benjy'>
          <Subject person='Caddy'>She</Subject>
          <Senses mode='smell'>smelled</Senses>
          like
          <Imagery referent='tree'>trees</Imagery>
        </Narration>
      </Paragraph>
    </Chapter>
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.
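Once semantic and linguistic information is carried by named elements and attributes, software can query it. The following Python sketch, not part of the original slides, uses the standard library's xml.etree.ElementTree on the markup above; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The marked-up sentence from the example above.
MARKED_UP = """<Chapter section='1'>
  <Paragraph id='143' FragmentCode='1.12'>
    <Narration narrator='Benjy'>
      <Subject person='Caddy'>She</Subject>
      <Senses mode='smell'>smelled</Senses>
      like
      <Imagery referent='tree'>trees</Imagery>
    </Narration>
  </Paragraph>
</Chapter>"""

root = ET.fromstring(MARKED_UP)

# Words marked as referring to the sense of smell.
smell_words = [e.text for e in root.findall(".//Senses[@mode='smell']")]

# Who the pronoun refers to, and who narrates the passage.
people = [e.get("person") for e in root.iter("Subject")]
narrators = [e.get("narrator") for e in root.iter("Narration")]
```

None of this information (that "She" is Caddy, that the narrator is Benjy) is present in the plain sentence; it is available to the query only because the markup records it.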
3. Adding meaning by markup
Another markup for the same text:
She smelled like trees.
    <Chapter section='1'>
      <Narration narrator='Benjy'>
        <Imagery place='tree' mode='simile' sense='smell'>
          <Fragment code='1.12'>
            <Paragraph id='143'>
              <Subject person='Caddy'>She</Subject> smelled like trees.
            </Paragraph>
          </Fragment>
        </Imagery>
      </Narration>
    </Chapter>
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.
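In this second design the semantic information sits in attributes of enclosing elements instead of in elements wrapped around single words. A minimal Python sketch, not part of the original slides and with hypothetical names, shows that the plain sentence can still be recovered by discarding the markup, while the interpretation is read from the attributes:

```python
import xml.etree.ElementTree as ET

# The second markup of the same sentence.
MARKED_UP_2 = """<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'>
          <Subject person='Caddy'>She</Subject> smelled like trees.
        </Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>"""

def plain_text(element):
    """Concatenate all character data and normalize whitespace."""
    return " ".join("".join(element.itertext()).split())

root2 = ET.fromstring(MARKED_UP_2)
sentence = plain_text(root2)
imagery = root2.find(".//Imagery")
interpretation = (imagery.get("place"), imagery.get("mode"), imagery.get("sense"))
```

A query written for one of the two designs does not work on the other: here the sense of smell is an attribute value of Imagery, not the content of a Senses element.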
3. Adding meaning by markup
Some other examples:
http://nrrc.mitre.org/NRRC/Docs_Data/MPQA_04/approval_time.htm
http://www.cs.cmu.edu/~awb/festival_demos/sable.html
http://www.etang.umontreal.ca/bwp1800/essays/flanders_encoding4.html
3. Adding meaning by markup
• In the Semantic Web, information about the meaning of the markup vocabulary of documents is available as additional metadata in a formal, standardized form.
• The concepts and meanings are defined in formal ontologies.
• Software applications can then process the meanings.
4. Text indexing
[Figure: documents, index, search engine, query, answer]
In information retrieval environments, collections of natural language documents are usually indexed; retrieval is based on the index terms included in the index.
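The idea of index-based retrieval can be sketched with a small inverted index that maps each index term to the set of documents containing it. This Python sketch is not part of the original slides: the function names, the document identifiers, and the simple lowercase tokenization are assumptions made for illustration, and the sample texts are taken from the earlier examples.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase index terms (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(documents):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of the documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical document ids; the texts come from the earlier examples.
DOCUMENTS = {
    "hamlet-1": "Where wilt thou lead me? speak; I'll go no further.",
    "ghost-1": "Mark me.",
    "benjy-1": "She smelled like trees.",
}
INDEX = build_index(DOCUMENTS)
```

The query is answered from the index alone, without rereading the documents: search(INDEX, "mark me") yields only "ghost-1", while search(INDEX, "me") yields both "hamlet-1" and "ghost-1".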