XML for Information Management 12.1.-16.1. 2009 University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/
Day 4: Logical and Physical Structure of XML Documents
Outline
1. Components of the logical structure
2. XML documents as trees
3. Entity types
4. Entity declarations and references
5. XML processor treatment of entity references
6. Motivations for the use of entities
1. Components of the logical structure • declarations • elements • comments • processing instructions
1. Components of the logical structure
document ::= prolog element Misc*
• prolog: declarations, comments, processing instructions
• element: elements, comments, processing instructions
• Misc*: comments, processing instructions
1. Components of the logical structure
Declarations:
• XML declaration [23]
• document type declaration [28]
• markup declaration [29]
  • element type declaration [45] (to constrain the logical structure)
  • attribute list declaration [52] (to constrain the logical structure)
  • entity declaration [70] (to constrain the physical structure)
  • notation declaration [82] (to constrain the physical structure)
• encoding declaration [80]
• standalone document declaration [32]
• text declaration [77]
1. Components of the logical structure
Typical element type declarations:
<!ELEMENT product (mfg, model, description, clock?)>    element content defined
<!ELEMENT model (#PCDATA)>                              mixed content defined
<!ELEMENT description (#PCDATA | feature)*>             mixed content defined
<!ELEMENT clock EMPTY>                                  empty element defined
1. Components of the logical structure
Empty element defined:
<!ELEMENT clock EMPTY>
Two forms of the element are allowed in a well-formed document:
<clock></clock>
<clock/>
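The equivalence of the two forms can be checked with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_empty is hypothetical and only illustrates the point.

```python
import xml.etree.ElementTree as ET


def is_empty(element: ET.Element) -> bool:
    """Return True if the element has no child elements and no character data."""
    return len(element) == 0 and not element.text


# The two forms allowed for an element declared EMPTY.
start_end_form = ET.fromstring("<clock></clock>")
empty_tag_form = ET.fromstring("<clock/>")

print(start_end_form.tag, is_empty(start_end_form))
print(empty_tag_form.tag, is_empty(empty_tag_form))
```

Both forms give the application the same information: an element named clock with no content.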
1. Components of the logical structure
Element content: defined by content models built with metasymbols
*    iteration (zero or more)
+    iteration (one or more)
|    alternatives
?    optional
,    sequence
( )  grouping
Example from the XHTML 1.0 Strict DTD:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
#PCDATA is not allowed in an element-content model.
1. Components of the logical structure
Mixed content: the definition has basically two forms
(#PCDATA)
(#PCDATA | e1 | … | en)*
Examples:
<!ELEMENT text (#PCDATA)>
<!ELEMENT section (#PCDATA | subsection)*>
<!ELEMENT section (#PCDATA | subsection | paragraph)*>
#PCDATA is always included in the content specification and comes first in the list of alternatives.
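To see what mixed content looks like to an application, the sketch below parses a section element that interleaves character data with a subsection child. It assumes Python's xml.etree.ElementTree, which stores character data in the text and tail attributes; the sample content and the helper mixed_content_parts are invented for illustration.

```python
import xml.etree.ElementTree as ET


def mixed_content_parts(element: ET.Element) -> list:
    """List the character data and child elements of an element in document order."""
    parts = []
    if element.text:
        parts.append(("text", element.text))
    for child in element:
        parts.append(("element", child.tag))
        if child.tail:  # character data that follows the child element
            parts.append(("text", child.tail))
    return parts


section = ET.fromstring(
    "<section>Intro text <subsection>nested</subsection> closing text</section>"
)
for kind, value in mixed_content_parts(section):
    print(kind, repr(value))
```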
1. Components of the logical structure
Attribute list declarations
• define the set of attributes pertaining to a given element type
• establish type constraints for these attributes
• provide default values for attributes
1. Components of the logical structure
<!ATTLIST poem author CDATA #REQUIRED>
• poem: element type
• author: attribute name
• CDATA: attribute type (string)
• #REQUIRED: constraint; the attribute must be specified for all elements of type poem
1. Components of the logical structure
Defining constraints
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)
• #REQUIRED: the attribute must be provided in every element of the given type
• #IMPLIED: the attribute may be provided in an element; no default value is given
• AttValue: a default value is given between single or double quotes
• #FIXED AttValue: instances of the attribute must match the given default value
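The four cases of DefaultDecl can be summarized as a small decision function. This is only a sketch of the rules listed above: the function effective_value and the label "DEFAULT" for a plain AttValue are hypothetical names, not part of any XML API.

```python
from typing import Optional


def effective_value(kind: str, default: Optional[str], supplied: Optional[str]) -> Optional[str]:
    """Return the attribute value an application would see, or raise on a violation.

    kind is '#REQUIRED', '#IMPLIED', '#FIXED', or 'DEFAULT' (a plain AttValue).
    supplied is the value written in the start-tag, or None if the attribute is absent.
    """
    if kind == "#REQUIRED":
        if supplied is None:
            raise ValueError("required attribute is missing")
        return supplied
    if kind == "#IMPLIED":
        return supplied  # no default: the attribute may simply be absent
    if kind == "#FIXED":
        if supplied is not None and supplied != default:
            raise ValueError("value does not match the fixed default")
        return default
    if kind == "DEFAULT":
        return default if supplied is None else supplied
    raise ValueError("unknown default declaration: " + kind)


print(effective_value("DEFAULT", "fi", None))
print(effective_value("#IMPLIED", None, None))
```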
1. Components of the logical structure
Attribute types
[54] AttType ::= StringType | TokenizedType | EnumeratedType
string type:
• CDATA: character string
tokenized types:
• ENTITY, ENTITIES: entity names
• NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names
• ID: names that uniquely identify elements
• IDREF, IDREFS: references to ID type identifiers
enumerated types:
• NOTATION: identifies notations
• enumeration
1. Components of the logical structure
<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line id ID #REQUIRED
               seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1"> This is the second line, but look at the first too </line>
</text>
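The ID and IDREFS constraints in this document can be checked by hand in a few lines. The sketch below assumes Python's xml.etree.ElementTree, which is non-validating and does not expose the declared attribute types, so the attribute names id and seeline are passed in explicitly; the function check_ids is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line id ID #REQUIRED
               seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>"""


def check_ids(root: ET.Element, id_attr: str = "id", idrefs_attr: str = "seeline") -> list:
    """Report duplicate ID values and IDREFS tokens that match no ID."""
    problems = []
    seen = set()
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append("duplicate ID " + repr(value))
            seen.add(value)
    for element in root.iter():
        # An IDREFS value is a whitespace-separated list of names.
        for ref in (element.get(idrefs_attr) or "").split():
            if ref not in seen:
                problems.append("IDREF " + repr(ref) + " matches no ID")
    return problems


print(check_ids(ET.fromstring(DOC)))
```

A validating processor performs these checks itself, based on the ATTLIST declaration.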
2. XML documents as trees
<Chapter section='1'><Narration narrator='Benjy'>
<Imagery place='tree' mode='simile' sense='smell'>
<Fragment code='1.12'><Paragraph id='143'>
<Subject person='Caddy'>She</Subject>smelled like trees.
</Paragraph></Fragment></Imagery>
</Narration></Chapter>
XML-aware web browsers can display this hierarchical structure.
2. XML documents as trees
The XML specification defines a concrete syntax for XML documents. W3C has defined four slightly different abstract models to describe the abstract syntax of XML documents:
• XML Information Set
• DOM model
• XPath 1.0 model
• XQuery 1.0 and XPath 2.0 data model
Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.
2. XML documents as trees
<poem author = "Murasaki Shikibu" born = "974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>
2. XML documents as trees
Node types of XPath 1.0
Root node
  Element node: poem
    Attribute node: author = Murasaki Shikibu
    Attribute node: born = 974
    Comment node: The poem is translated from Japanese by Kenneth Rexroth
    Element node: line
      Text node: This life of ours would not cause you sorrow
    Element node: line
      Text node: if you thought of it as like
    Element node: line
      Text node: the mountain cherry blossoms
    Element node: line
      Text node: which bloom and fade in a day.
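The node types of the poem tree can be counted programmatically. The sketch below uses Python's xml.dom.minidom, which implements the DOM model rather than the XPath 1.0 model, so it only approximates the diagram: the DOM Document node stands in for the XPath root node, and whitespace-only text nodes are skipped. The function count_nodes is a hypothetical helper.

```python
from collections import Counter
from xml.dom import Node, minidom

POEM = """<poem author="Murasaki Shikibu" born="974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>"""


def count_nodes(node, counts: Counter) -> Counter:
    """Walk the tree and count nodes by (approximate) XPath 1.0 node type."""
    if node.nodeType == Node.DOCUMENT_NODE:
        counts["root"] += 1
    elif node.nodeType == Node.ELEMENT_NODE:
        counts["element"] += 1
        counts["attribute"] += node.attributes.length
    elif node.nodeType == Node.COMMENT_NODE:
        counts["comment"] += 1
    elif node.nodeType == Node.TEXT_NODE and node.data.strip():
        counts["text"] += 1  # whitespace-only text between elements is ignored
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts


counts = count_nodes(minidom.parseString(POEM), Counter())
print(dict(counts))
```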
3. Entity types
The physical structure of an XML document consists of entities. An entity is a unit recognized by the XML processor; its content is text or some other kind of data.
3. Entity types
Three-dimensional categorization:
• parsed entities vs. unparsed entities
• internal entities vs. external entities
• general entities vs. parameter entities
3. Entity types
parsed entity
• intended to be parsed by the XML processor
• content consists of marked-up text
unparsed entity
• not intended to be parsed by the XML processor
• content can be any kind of data
3. Entity types
internal entity
• name and value given in an entity declaration
• always a parsed entity
external entity
• not internal
• parsed or unparsed
3. Entity types
general entity
• used in elements and attributes
• parsed or unparsed
• internal or external
parameter entity
• used in the document type definition
• always parsed
• internal or external
3. Entity types
Alternatives [figure not preserved]
3. Entity types
INPUT FILES for XML processing:
• root entity, external subset of DTD
• other files intended for XML processing
UNPARSED ENTITIES:
• files not intended for XML processing but referred to by entity references in the INPUT FILES
INTERNAL ENTITIES:
• name and textual content given in DTD
The XML processor passes to the application information about:
• elements and attributes
• comments
• processing instructions
• character data
• namespaces
• notations and locations of unparsed entities
4. Entity declarations and references
EntityDecl ::= GEDecl | PEDecl
GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
PEDecl ::= '<!ENTITY' S '%' Name S PEDef S? '>'
EntityDef ::= EntityValue | (ExternalID NDataDecl?)
PEDef ::= EntityValue | ExternalID
• EntityValue: entity definition for an internal entity
• ExternalID (with optional NDataDecl): entity definition for an external entity
4. Entity declarations and references
internal entity: name and value (= literal value) given
<!ENTITY % Shape "(rect | circle | poly | default )">
<!ENTITY JY "Jyväskylän yliopisto">
• Shape, JY: name
• the quoted string: literal value
4. Entity declarations and references
external entity: name and system identifier (possibly together with a public identifier) given; for an unparsed entity, also a notation
<!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" "xhtml-symbol.ent">
<!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN" "xhtml-special.ent">
Declarations from XHTML specification: http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html
<!ENTITY virtuaaliyliopistouutiset SYSTEM "http://virtuaaliyliopisto.jyu.fi/kotisivut/sisalto/etusivu/newsfeed.xml">
4. Entity declarations and references
Unparsed entity
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
• gif: notation name
The notation must have been declared, for example:
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
4. Entity declarations and references
References to parameter entities:
%Shape; %HTMLsymbol;
References to parsed general entities:
&JY; &virtuaaliyliopistouutiset;
Reference to an unparsed general entity:
<poem image="image1">
The type of the attribute has to be ENTITY or ENTITIES.
4. Entity declarations and references
In addition to entity references, XML documents may contain character references.
• A character reference refers to a specific Unicode character.
• It gives the decimal or hexadecimal representation of the character's Unicode code point.
Example: &#34;
One-character entity defined: <!ENTITY quot "&#34;">
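A short sketch shows that the decimal form, the hexadecimal form, and the predefined entity quot all denote the same character. It assumes Python's xml.etree.ElementTree; the helpers decimal_ref and hex_ref are hypothetical.

```python
import xml.etree.ElementTree as ET


def decimal_ref(char: str) -> str:
    """Build the decimal character reference for a single character."""
    return "&#" + str(ord(char)) + ";"


def hex_ref(char: str) -> str:
    """Build the hexadecimal character reference for a single character."""
    return "&#x" + format(ord(char), "X") + ";"


# Decimal reference, hexadecimal reference, and the predefined entity quot.
element = ET.fromstring("<q>" + decimal_ref('"') + hex_ref('"') + "&quot;</q>")
print(decimal_ref('"'), hex_ref('"'), repr(element.text))
```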
4. Entity declarations and references
Where can an entity or character reference occur?
[Table with columns "reference to" and "can occur in"; rows not preserved]
5. XML processor treatment of entity references
References to unparsed entities
A validating processor makes the identifiers for the entities and associated notations available to the application.
<poem image="figure1">
<!-- From a poem of Aale Tynni -->
<line>Seisoin ikkunassa ja nauroin. Ihana puu.</line>
<line>Ihana pesä.</line>
</poem>
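How a processor hands these identifiers to the application can be sketched with Python's xml.parsers.expat, using its EntityDeclHandler and NotationDeclHandler callbacks. Expat is a non-validating processor, so this only illustrates the reporting step. The sample document reuses the image1 and gif declarations from section 4; the ATTLIST declaration and the function collect_unparsed are assumptions added for the example.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE poem [
<!ELEMENT poem (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST poem image ENTITY #IMPLIED>
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
]>
<poem image="image1"><line>Ihana puu.</line></poem>"""


def collect_unparsed(document: str):
    """Return (unparsed entities, notations) reported by the parser."""
    entities = {}
    notations = {}

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation_name):
        # Only unparsed entities carry a notation name (NDATA).
        if notation_name is not None:
            entities[name] = (system_id, notation_name)

    def on_notation(name, base, system_id, public_id):
        notations[name] = public_id

    parser = expat.ParserCreate()
    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.Parse(document, True)
    return entities, notations


entities, notations = collect_unparsed(DOC)
print(entities)
print(notations)
```

The parser never opens birdnest.gif; it only reports the system identifier and notation name, and the application decides what to do with them.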
5. XML processor treatment of entity references
References to parsed entities
Dealing with two kinds of entity values:
• literal value: the character string written between quotes in the entity definition
• replacement text: derived by replacing the character references and parameter entity references in the literal value by their character values and replacement texts, respectively
The XML processor replaces the entity reference by its replacement text.
5. XML processor treatment of entity references
Entity declaration:
<!ENTITY rhyme1 '<rhyme xml:lang="fi">
<line>Ole aina iloinen</line>
<line>niin kuin pikku varpunen</line>
</rhyme>'>
Entity reference:
<rhymecollection> &rhyme1; </rhymecollection>
Replacement text = literal value:
<rhyme xml:lang="fi">
<line>Ole aina iloinen</line>
<line>niin kuin pikku varpunen</line>
</rhyme>
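The replacement of an entity reference by its replacement text can be observed with any XML processor. The sketch below assumes Python's xml.etree.ElementTree, which expands internal general entities declared in the internal DTD subset. The entity value is delimited with single quotes because it contains double quotes.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE rhymecollection [
<!ENTITY rhyme1 '<rhyme xml:lang="fi">
<line>Ole aina iloinen</line>
<line>niin kuin pikku varpunen</line>
</rhyme>'>
]>
<rhymecollection> &rhyme1; </rhymecollection>"""

root = ET.fromstring(DOC)

# After parsing, the reference &rhyme1; has been replaced by the rhyme element.
rhyme = root[0]
lines = [line.text for line in rhyme.iter("line")]
print(rhyme.tag, lines)
```

The application sees the rhyme element as if it had been written directly inside rhymecollection.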
5. XML processor treatment of entity references
<!ENTITY % StyleSheet "CDATA"> <!-- style sheet data -->
<!ENTITY % Text "CDATA"> <!-- used for titles etc. -->
<!ENTITY % coreattrs "id ID #IMPLIED class CDATA #IMPLIED style %StyleSheet; #IMPLIED title %Text; #IMPLIED">
Declarations from XHTML specification: http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html
literal value of coreattrs:
id ID #IMPLIED class CDATA #IMPLIED style %StyleSheet; #IMPLIED title %Text; #IMPLIED
replacement text of coreattrs:
id ID #IMPLIED class CDATA #IMPLIED style CDATA #IMPLIED title CDATA #IMPLIED
5. XML processor treatment of entity references
Exercise 10 (Course Text, Chapter 5)
Entity declaration from the XHTML Strict DTD:
<!ENTITY % Block "(%block; | form | %misc; )*">
What is the (a) literal value and (b) replacement text of entity Block?
(a) literal value: (%block; | form | %misc; )*
5. XML processor treatment of entity references
Other entity declarations needed from the DTD:
<!ENTITY % heading "h1| h2| h3| h4| h5| h6">
<!ENTITY % lists "ul | ol | dl">
<!ENTITY % blocktext "pre | hr | blockquote | address">
<!ENTITY % block "p | %heading; | div | %lists; | %blocktext; | fieldset | table">
<!ENTITY % misc.inline "ins | del | script">
<!ENTITY % misc "noscript | %misc.inline;">
Declarations from XHTML specification: http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html
5. XML processor treatment of entity references
Deriving the replacement text of Block: references to parameter entities in the literal value (%block; | form | %misc;)* are replaced by their replacement texts.
Literal value of block:
p | %heading; | div | %lists; | %blocktext; | fieldset | table
Replacement text of block:
p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table
Literal value of misc:
noscript | %misc.inline;
Replacement text of misc:
noscript | ins | del | script
Replacement text of Block:
(p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table | form | noscript | ins | del | script )*
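The derivation can be reproduced mechanically by substituting parameter entity references until none remain. The following is a simplified sketch: the dictionary LITERAL_VALUES and the function replacement_text are hypothetical, and the recursion is a shortcut. A real XML processor computes each replacement text when the entity is declared, which gives the same result here because every entity is declared before it is referenced.

```python
import re

# Literal values copied from the XHTML declarations above.
LITERAL_VALUES = {
    "heading": "h1| h2| h3| h4| h5| h6",
    "lists": "ul | ol | dl",
    "blocktext": "pre | hr | blockquote | address",
    "block": "p | %heading; | div | %lists; | %blocktext; | fieldset | table",
    "misc.inline": "ins | del | script",
    "misc": "noscript | %misc.inline;",
    "Block": "(%block; | form | %misc; )*",
}

# A parameter entity reference: '%', a name, ';'.
PE_REFERENCE = re.compile(r"%([A-Za-z_:][\w.:\-]*);")


def replacement_text(name: str, literal_values: dict) -> str:
    """Replace every parameter entity reference in the literal value of name."""
    return PE_REFERENCE.sub(
        lambda match: replacement_text(match.group(1), literal_values),
        literal_values[name],
    )


print(replacement_text("misc", LITERAL_VALUES))
print(replacement_text("Block", LITERAL_VALUES))
```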
6. Motivations for the use of entities
The use of entities supports:
• use of non-textual data (audio, graphics, etc.) in XML documents (such data can also be added via stylesheets)
• modularization of documents
• consistency
• reuse of definitions
• adding semantic information through informative entity names and comments attached to entity declarations