
lis512 lecture 5


carol

Presentation Transcript


  1. lis512 lecture 5 XML based metadata formats

  2. constraints on XML • XML defines a general syntax for documents. • These documents can contain records. • Records are put together in an XML document. • But to make sense of XML, we need something more.

  3. valid XML • XML that is well-formed may be valid against a set of constraints. • Such a set of constraints can be formally written down. • An old form of a formally written set of constraints is called a DTD. • More recently, schema files have become more popular. They do the same thing.
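As a rough illustration of the first step, the sketch below checks well-formedness only, using Python's standard library. It does not check validity: xml.etree.ElementTree does not validate against a DTD or schema, so that step would need a separate validating tool. The function name is_well_formed and the sample records are made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample records; the second one never closes <title>.
good = "<record><title>Piers Plowman</title></record>"
bad = "<record><title>Piers Plowman</record>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document that fails this check cannot be valid against any set of constraints, because validity presupposes well-formedness.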

  4. example: XHTML DTD • [some of this not quite true] • XHTML is an XML DTD. • Basically it is a set of constraints that an XML document has to follow if it is to be a web page. • These constraints are written out in a file called the XHTML DTD.

  5. sample HTML constraints • The root element of an HTML document is called <html>. • It admits two child elements, called <head> and <body>. • There must be one <head> and one <body>. • The <head> appears first, then the <body>. • <html><head>...</head> <body>...</body></html>
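The constraints on this slide are simple enough to check by hand in a few lines. The sketch below is only an illustration of what such a check amounts to, not how a DTD validator works internally. The function name check_html_skeleton is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

def check_html_skeleton(text):
    """Return a list of problems with the top-level structure (empty if none)."""
    problems = []
    root = ET.fromstring(text)
    # Constraint: the root element is called <html>.
    if root.tag != "html":
        problems.append(f"root element is <{root.tag}>, expected <html>")
    # Constraint: exactly one <head> followed by exactly one <body>.
    children = [child.tag for child in root]
    if children != ["head", "body"]:
        problems.append(f"children of the root are {children}, expected ['head', 'body']")
    return problems

print(check_html_skeleton("<html><head/><body/></html>"))
print(check_html_skeleton("<html><body/><head/></html>"))
```

The first call reports no problems. The second reports one, because the <body> comes before the <head>.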

  6. more HTML constraints • All elements in the body are classified as either text-level or block-level. • A text-level element has to be a child of a block-level element. • A text-level element cannot have block-level elements as children.
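A simplified reading of these two rules can be sketched as follows. The element lists BLOCK and TEXT are small made-up samples, not the full classification from the DTD. The first rule is approximated by flagging text-level elements that sit directly under <body>.

```python
import xml.etree.ElementTree as ET

# Assumed, deliberately incomplete samples of each class of element.
BLOCK = {"p", "div", "ul", "h1"}
TEXT = {"em", "a", "span", "img"}

def check_levels(body):
    """Return a list of violations of the two rules, in document order per rule."""
    problems = []
    # Rule 1 (approximation): no text-level element directly under <body>.
    for child in body:
        if child.tag in TEXT:
            problems.append(f"<{child.tag}> is a direct child of <body>")
    # Rule 2: a text-level element has no block-level children.
    for parent in body.iter():
        if parent.tag in TEXT:
            for child in parent:
                if child.tag in BLOCK:
                    problems.append(f"<{child.tag}> inside <{parent.tag}>")
    return problems

ok = ET.fromstring("<body><p>Some <em>text</em></p></body>")
bad = ET.fromstring("<body><em><p>text</p></em></body>")
print(check_levels(ok))
print(check_levels(bad))
```

The second document breaks both rules: the <em> is not inside a block-level element, and it contains a <p>.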

  7. more HTML constraints • There is an element <img> that can be placed in the body. It is a text-level element. It is always an empty element. It requires an attribute called src= and an attribute called alt=. It may take a bunch of other attributes. • The fact that <img> creates an image is understood. It is not something that can be formally specified in the DTD.

  8. meaning what • meaning that a computer can check that there is a src= attribute pointing to an image file. • It can check that the image can be accessed (at this time). • It cannot check that the image is “appropriate” (e.g. that it is the American flag rather than the Russian flag).
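The machine-checkable part of the <img> rules can be sketched like this. The function name check_images and the file names are made up. The sketch only looks at the document itself; it makes no attempt to fetch the image, and nothing in it could tell whether the picture is the right one.

```python
import xml.etree.ElementTree as ET

def check_images(root):
    """Return a list of problems found on <img> elements below root."""
    problems = []
    for img in root.iter("img"):
        # Both src= and alt= are required.
        for name in ("src", "alt"):
            if name not in img.attrib:
                problems.append(f"<img> is missing {name}=")
        # <img> must be empty: no child elements and no text content.
        if len(img) > 0 or (img.text or "").strip():
            problems.append("<img> is not empty")
    return problems

doc = ET.fromstring(
    '<body><p><img src="flag.png" alt="a flag"/><img src="x.png"/></p></body>'
)
print(check_images(doc))
```

Here the second image is reported because it has no alt= attribute.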

  9. other XML formats • I will mention here a few other formats that are used in the bibliographic universe. • Contrary to HTML, which is really a document format, these formats are used to encode certain records. • One we have already seen, MARC XML.

  10. constraint languages • There are basically three languages that are used to model constraints on XML documents. • DTDs • XML Schema files • Relax NG • They don't have the same capabilities. Some can implement constraints that others can't. • No language is able to implement all possible constraints.

  11. DTDs • DTDs were defined with a parent format of XML called SGML. • DTDs are not written in an XML syntax, they have their own syntax. • No software has been written that fully implements SGML and its DTDs. • DTDs are now considered a legacy format.

  12. XML Schema • It's a W3C-sponsored format that aimed to be the ultimate schema language. • It is quite abominably complicated and fails miserably in certain areas. For example, it has poor support for unordered contents. • A look at http://www.w3.org/TR/xmlschema-1/ will help.

  13. Relax NG • This is a more informal standard for constraining XML documents. • It is much easier to learn and use than XML Schema. • Free software implements the entire standard.

  14. in the archival world • Documents and other artifacts held in archives are usually not catalogued item by item. That would be too expensive. • Instead archivists create "finding aids", documents that describe what can be found in an archival collection. • A large archive may have hundreds of finding aids, usually one per box of "stuff".

  15. encoded archival description • http://www.archivists.org/saagroups/ead/aboutEAD.html says "EAD stands for Encoded Archival Description, and is a non-proprietary de facto standard for the encoding of finding aids for use in a networked (online) environment. Finding aids are inventories, indexes, or guides that are created by archival and manuscript repositories to provide information about specific collections. (cont.)

  16. continuation of previous quote. • While the finding aids may vary somewhat in style, their common purpose is to provide detailed description of the content and intellectual organization of collections of archival materials. EAD allows the standardization of collection information in finding aids within and across repositories."

  17. EAD history • EAD started in 1993 (just like RePEc) at the University of California as a project led by Daniel V. Pitti. • EAD is formally specified as an SGML DTD, but now also as an XML Schema file and a Relax NG schema. • The current version is called EAD 2002. It is maintained by the Library of Congress.

  18. other formats • There are two modern versions of EAD around. • A Relax NG form http://www.loc.gov/ead/ead.rng • An XML Schema http://www.loc.gov/ead/ead.xsd • Both Relax NG and XML Schema are schema languages. They are ways to specify constraints on XML documents.

  19. EAD problem • EAD is very difficult to implement for archivists, who generally have low IT skills. • They are then reliant on somebody else to do it for them. Such personnel do not come cheap. • See Sonia Yaco’s paper: "It’s Complicated: Barriers to EAD Implementation", based on survey work by Anlex consulting. • Library of Congress example appears invalid!

  20. in the humanities: TEI • The TEI is the Text Encoding Initiative. • It is a DTD, Relax NG and XML Schema specification, with documentation, on how to encode texts. • The texts they talk about are mainly historic and cultural artifacts, such as • poetry • historical documents

  21. problems to address • How to represent faithfully a printed text in an XML form? • imitate page appearance • character recognition and normalization • guessing formal structure from visual appearance.

  22. example: Langland's Piers Plowman • <l> <seg>In a somer seson,</seg> <seg>whan softe was the sonne,</seg></l> <l> <seg>I shoop me into shroudes</seg> <seg>as I a sheep were,</seg></l> <l> <seg>In habite as an heremite </seg> <seg>unholy of werkes,</seg></l> <l> <seg>Went wide in this world </seg> <seg>wondres to here.</seg></l>
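Once the half-lines are marked up with <seg>, software can pull them apart without guessing where the caesura falls. The sketch below uses the first two lines of the example. The wrapping <lg> element and the function name half_lines are assumptions added here so that the fragment has a single root and can be parsed on its own.

```python
import xml.etree.ElementTree as ET

# First two lines of the slide's example, wrapped in an assumed <lg> root.
sample = """<lg>
<l><seg>In a somer seson,</seg> <seg>whan softe was the sonne,</seg></l>
<l><seg>I shoop me into shroudes</seg> <seg>as I a sheep were,</seg></l>
</lg>"""

def half_lines(text):
    """Return, for each <l>, the list of its <seg> contents."""
    root = ET.fromstring(text)
    return [
        [seg.text.strip() for seg in line.findall("seg")]
        for line in root.iter("l")
    ]

for first, second in half_lines(sample):
    print(first, "||", second)
```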

  23. example from Pope's Essay on Criticism <div type="book" n="1" met="-+|-+|-+|-+|-+/" rhyme="aa"> <lg n="1" type="paragraph">  <l>'Tis hard to say, if greater Want of Skill</l>  <l>Appear in <hi>Writing</hi> or in <hi>Judging</hi> ill;</l>  <l>But, of the two, less dang'rous is th'Offence,</l> <l>To tire our <hi>Patience</hi>, than mis-lead our <hi>Sense</hi>:</l> </lg> </div>
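The attributes and the <hi> elements in this example carry information that a program can read back out. The sketch below uses a shortened copy of the example. Reading the met= value as feet separated by "|" with "/" marking the end of the line is an interpretation made for this illustration, not something stated on the slide.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example (first two lines only).
sample = """<div type="book" n="1" met="-+|-+|-+|-+|-+/" rhyme="aa">
<lg n="1" type="paragraph">
<l>'Tis hard to say, if greater Want of Skill</l>
<l>Appear in <hi>Writing</hi> or in <hi>Judging</hi> ill;</l>
</lg>
</div>"""

div = ET.fromstring(sample)

# Assumed reading of met=: "|" separates feet, "/" closes the line.
feet = div.get("met").rstrip("/").split("|")

# itertext() joins the text of a line across its <hi> children.
lines = ["".join(line.itertext()) for line in div.iter("l")]

# The words the encoder marked as highlighted.
highlighted = [hi.text for hi in div.iter("hi")]

print(len(feet), "feet per line, rhyme scheme", div.get("rhyme"))
print(lines[1])
print(highlighted)
```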

  24. example for manuscript description <handDesc hands="3"> <handNote xml:id="Eirsp-1" scope="minor"> <p>The first part of the manuscript, <locus from="1v" to="72v:4">fols 1v-72v:4</locus>, is written in a practised Icelandic Gothic bookhand. This hand is not found elsewhere.</p> </handNote> <handNote xml:id="Eirsp-2" scope="major"> <p>The second part of the manuscript, <locus from="72v:4" to="194v">fols 72v:4-194</locus>, is written in a hand contemporary with the first; it can also be found in a fragment of <title>Knýtlinga saga</title>, <ref>AM 20b II fol.</ref>.</p> </handNote> <handNote xml:id="Eirsp-3" scope="minor"> <p>The third hand has written the majority of the chapter headings. This hand has been identified as the one also found in <ref>AM 221 fol.</ref>.</p> </handNote> </handDesc>
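A structured description like this lets a program answer simple questions, such as which hand is the major one, or whether the declared number of hands matches the number of notes. The sketch below uses a cut-down version of the example with placeholder paragraph text. Note that the xml: prefix is predefined in XML, so the standard library parser reports xml:id under the XML namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Cut-down version of the slide's example; the <p> text is a placeholder.
sample = """<handDesc hands="3">
<handNote xml:id="Eirsp-1" scope="minor"><p>First part.</p></handNote>
<handNote xml:id="Eirsp-2" scope="major"><p>Second part.</p></handNote>
<handNote xml:id="Eirsp-3" scope="minor"><p>Chapter headings.</p></handNote>
</handDesc>"""

desc = ET.fromstring(sample)
notes = desc.findall("handNote")

# Map each hand's identifier to its scope.
scopes = {note.get(XML_ID): note.get("scope") for note in notes}

# Cross-check the hands= attribute against the number of <handNote> elements.
consistent = int(desc.get("hands")) == len(notes)

print(scopes)
print("hands= matches the number of notes:", consistent)
```

The cross-check at the end is the kind of constraint that is easy to state in prose but that, as far as I know, a DTD cannot express.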

  25. ONIX • ONIX is a record format used in the book trading and publishing industry. • The London-based EDItEUR group, a membership-based organization, maintains the standard.

  26. purpose • ONIX is essentially related to product descriptions in the publishing industry. • Publishers can use it to communicate with vendors. • Publishers can use it internally to streamline procedures. • There are three formats • books • serials • rights

  27. MODS • This is the Metadata Object Description Schema. • It is basically a MARC 'light'. • By converting MARC to MODS, you lose some information. • The resulting information is less detailed and less complicated.

  28. METS • http://www.loc.gov/standards/mets/METS%20Documentation%20final%20070930%20msw.pdf says • "The Metadata Encoding and Transmission Standard (METS) is a data encoding and transmission specification, expressed in XML, that provides the means to convey the metadata necessary for both the management of digital objects within a repository and the exchange of such objects between repositories (or between repositories and their users)."

  29. METS • METS is a container structure that not only allows data about an object of interest to be encoded, but also allows the object itself to be encoded. • As a result, it is very complicated.

  30. Journal Publishing Tag Set • This XML DTD is issued by the National Library of Medicine. • It is the format followed by records in the PubMed database. • PubMed is the largest sort-of freely available bibliographic dataset for scientific articles. • Because of its market power, the tag set is spreading through the publishing industry.
