TEXT ENCODING INITIATIVE (TEI) Inf 384C Block II, Module C
TEI History
• The developing organizations first met in 1987
  - Association for Computers and the Humanities (ACH)
  - Association for Computational Linguistics (ACL)
  - Association for Literary and Linguistic Computing (ALLC)
• 1990: first version, TEI P1
• 1992: TEI P2
• 1994: TEI P3
TEI History Continued
• Principles for the development of TEI
  - Standard format for data interchange in humanities research
  - Guidelines for encoding texts in the same format
  - Define a recommended syntax
  - Define a metalanguage for the description of text-encoding schemes
• Future Developments
  - Linguistic description and grammatical annotation
  - Historical analysis and interpretation
  - Base tag sets for further document types
  - Manuscript analysis and physical description of text
The Evolution of SGML and XML
• 1960s: IBM develops the Generalized Markup Language (GML)
• 1970s and 1980s: ANSI initiates a project to develop a standard text-description language based on GML
• 1983: SGML becomes an industry standard
• 1986: ISO ratifies SGML as an international standard (ISO 8879)
• 1990s: Tim Berners-Lee develops HTML, a simple formatting markup language for the World Wide Web
• Mid-1990s: the W3C develops XML to combine the flexibility of SGML with the simplicity of HTML
Benefits of SGML and XML
• SGML is a toolkit for developing specialized markup languages
• Specifies the structure of information
• Enables interoperability between multiple platforms
• Acts like a database
• All-encompassing
• The DTD acts as a blueprint for document structure
• XML provides a manageable framework in which you can define your own elements
XML Syntax
• Information content must be enclosed in start and end tags
• Case is significant: <Title> and <title> are different elements
• Elements may not overlap
• Elements can nest one inside another
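
The syntax rules above can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples, one per rule on the slide.
samples = {
    "nested": "<p>Elements may <emph>nest</emph> inside one another.</p>",
    "overlapping": "<p>Elements may <emph>not</p> overlap.</emph>",
    "wrong_case": "<p>Case is significant.</P>",
    "unclosed": "<p>Every start tag needs an end tag.",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

Only the nested sample is accepted; the other three each break one of the rules and are rejected by the parser.
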
The XML Environment
• XML editor
• XML parser/validator
• Display program
• DTD or schema to define elements
• Style sheet for display of elements
The XML Document
• Document prologue
  - XML declaration
  - Document type declaration
    - Points to root element
    - Points to external standards (DTDs, namespaces)
• Document itself
  - Bracketed by root element
  - Contains elements, attributes, entities
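
A small document makes the two parts concrete. This is a sketch only: the `article` vocabulary, the `tei` entity, and the `id` value are invented for illustration, and the document type declaration carries its declarations inline rather than pointing to an external DTD. Python's standard-library parser reads the prologue but does not validate against it.

```python
import xml.etree.ElementTree as ET

# Prologue: XML declaration plus document type declaration naming the root element.
# Document itself: everything bracketed by <article> ... </article>.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE article [
  <!ELEMENT article (title, para+)>
  <!ATTLIST article id ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
  <!ENTITY tei "Text Encoding Initiative">
]>
<article id="a1">
  <title>About the &tei;</title>
  <para>First paragraph.</para>
</article>
"""

root = ET.fromstring(DOCUMENT)
print(root.tag)                  # the root element named in the DOCTYPE
print(root.get("id"))            # an attribute
print(root.find("title").text)   # element content with the entity expanded
```
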
The DTD: Document Type Definition
• The DTD defines a document's structure, i.e. it is a set of rules and declarations that specify which tags can be used and what these tags can contain
• The DTD validates documents
  - determines which documents conform to the language
  - reduces the possibility of errors
• The DTD provides a blueprint for documents
  - specifies how to handle elements
  - specifies which elements are allowed
The DTD: Document Type Definition
• The DTD has four main functions:
  1. declares a set of allowed elements ("vocabulary")
  2. defines a content model for each element ("grammar")
  3. declares a set of allowed attributes for each element
  4. provides various mechanisms to make management of the model easier
(Ray, Chapter 5, p. 148)
Basic Structure of a DTD: Element Declaration
<!ELEMENT name (content-model)>
Serves two functions:
• Adds a new element
• States what can go inside the element
• Every element that appears in the document must be declared in the DTD
• Order of declarations can matter (for example, a parameter entity must be declared before it is used)
<!ELEMENT name (content-model)>
• name ("vocabulary"): the name of the element as it appears in the markup tag; case-sensitive, lowercase in these examples, e.g. title, graphic, article, thingie
• content-model ("grammar"): a formula that specifies what kind of content is allowed, how much, and in what order
  - Empty elements: EMPTY
  - No content restrictions (of little value): ANY
  - Only character data, no elements: (#PCDATA)
  - Only elements: a formula of element names
  - Mixed content: a content model combining #PCDATA with elements
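
To show how a content model works as a "grammar", the sketch below lists one hypothetical declaration per kind of content model and then checks an element's children against one of them. Python's standard library does not validate against DTDs, so the model `(title, para+, graphic?)` is translated by hand into a regular expression; the element names and the helper `children_match` are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical declarations, one per kind of content model named above.
DTD_EXAMPLES = """
<!ELEMENT graphic EMPTY>
<!ELEMENT thingie ANY>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT emph    (#PCDATA)>
<!ELEMENT article (title, para+, graphic?)>
<!ELEMENT para    (#PCDATA | emph)*>
"""

# Hand translation of (title, para+, graphic?) into a regular expression
# over the child element names, each name followed by one space.
ARTICLE_MODEL = re.compile(r"title (para )+(graphic )?")


def children_match(element: ET.Element, model: re.Pattern) -> bool:
    """Check the order and count of an element's children against a model."""
    sequence = "".join(child.tag + " " for child in element)
    return model.fullmatch(sequence) is not None


good = ET.fromstring(
    "<article><title>T</title><para>a</para><para>b</para></article>"
)
bad = ET.fromstring("<article><para>a</para><title>T</title></article>")

print(children_match(good, ARTICLE_MODEL))
print(children_match(bad, ARTICLE_MODEL))
```

The first article satisfies the model; the second fails because its title follows a paragraph. A validating parser performs this kind of check for every declared element.
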
Basic Structure of a DTD: Attribute Declaration
<!ATTLIST name attname1 atttype1 attdesc1 attname2 atttype2 attdesc2>
• For each element that appears in the document, the attributes of that element must be declared
• All attributes of an element are declared in one place, the attribute list
<!ATTLIST name attname1 atttype1 attdesc1>
• name ("vocabulary"): the name of the element to which the attributes belong; the same name as the element declared earlier, e.g. title, article, thingie
• Attribute declarations
  - attname1: gives the attribute name
  - atttype1: specifies the datatype of the attribute or a list of allowed values, e.g. CDATA, NMTOKEN, ID
  - attdesc1: describes the attribute's behavior
    1. a default value, e.g. "high"
    2. a keyword governing the author-specified value: #REQUIRED, #FIXED, #IMPLIED
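
The effect of the attribute description can be seen with a parser that reads the declarations. The sketch below assumes a made-up `thingie` element with three illustrative attributes. Python's standard-library parser is non-validating, so it will not enforce #REQUIRED or the list of allowed values, but it does read the inline declarations and fills in declared defaults.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute list: an enumerated type with a default value,
# an optional CDATA attribute, and a fixed-value attribute.
DOCUMENT = """<!DOCTYPE thingie [
  <!ELEMENT thingie (#PCDATA)>
  <!ATTLIST thingie
    priority (low | high) "high"
    label    CDATA        #IMPLIED
    version  CDATA        #FIXED "1.0">
]>
<thingie label="demo">content</thingie>
"""

root = ET.fromstring(DOCUMENT)

# Only label was written by the author; priority and version come from the DTD.
for name in ("priority", "label", "version"):
    print(name, "=", root.get(name))
```
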
The DTD: Document Type Definition
"It is important to remember that every document type definition is an interpretation of a text. There is no single DTD which encompasses any kind of absolute truth about a text, although it may be convenient to privilege some DTDs above others for particular types of analysis."
TEI Guidelines for Electronic Text Encoding and Interchange
http://etext.virginia.edu/TEI.html
The TEI DTD
• Uses the basic structural elements of a general DTD
• Designed to simplify the task of choosing an appropriate set of tags for the text in hand
• The encoder selects an appropriate combination of smaller tag sets, each containing tags likely to be used together
  1. core tag sets: standard components that are always included; no encoder action needed
  2. base tag sets: basic building blocks for text types; the encoder must select at least one
  3. additional tag sets: extra tags compatible with all other tag sets; the encoder may add them to the base tag sets in any combination
http://www.tei-c.org/P4X/DTD/
Basic Elements of TEI
• Paragraphs: <p>
• Punctuation: <stop.abbr>, <stop.sent>
• Quotations: <q> or <quote>
• Lists: <list>, <item>, etc.
• Bibliographic citations: <bibl>
• THE HEADER! <teiHeader>
The TEI Header
• Required of every TEI text; composed of four parts
• May be large and complex or very simple
• The header may differ for documents not based on written text, such as computer files or spoken text
• The header is not a library cataloging record, although the intent is similar
Four Parts
• File Description <fileDesc>
• Encoding Description <encodingDesc>
• Profile Description <profileDesc>
• Revision Description <revisionDesc>
File Description <fileDesc>
• <titleStmt>
• <editionStmt>
• <extent>
• <publicationStmt>
• <seriesStmt>
• <notesStmt>
• <sourceDesc>
Encoding Description <encodingDesc>
• <projectDesc>
• <samplingDecl>
• <editorialDecl>
• <tagsDecl>
• <refsDecl>
• <classDecl>
• <fsdDecl>
• <metDecl>
• <variantEncoding>
Profile Description <profileDesc>
• <creation>
• <langUsage>
• <textClass>
Revision Description <revisionDesc>
• <change>
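
The four-part layout can be sketched as a skeleton built in code. This is an illustration under stated assumptions, not a validated TEI header: the function name `build_header_skeleton`, the sample text, and the choice of child elements (taken from the lists above, with the `<change>` content modeled loosely on TEI Lite usage) are all hypothetical, and nothing here is checked against the TEI DTD.

```python
import xml.etree.ElementTree as ET


def build_header_skeleton(title: str) -> ET.Element:
    """Build an illustrative <teiHeader> with its four parts in order."""
    header = ET.Element("teiHeader")

    # File description: bibliographic information about the electronic file.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Unpublished classroom sample."
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = "No print source; written for illustration."

    # Encoding description: how the text was encoded.
    encoding_desc = ET.SubElement(header, "encodingDesc")
    project = ET.SubElement(encoding_desc, "projectDesc")
    ET.SubElement(project, "p").text = "Created to illustrate the header layout."

    # Profile description: non-bibliographic context of the text.
    profile_desc = ET.SubElement(header, "profileDesc")
    ET.SubElement(profile_desc, "creation").text = "Sample text, date unknown."

    # Revision description: change log for the electronic file.
    revision_desc = ET.SubElement(header, "revisionDesc")
    change = ET.SubElement(revision_desc, "change")
    ET.SubElement(change, "date").text = "2000-01-01"
    resp = ET.SubElement(change, "respStmt")
    ET.SubElement(resp, "name").text = "Sample Encoder"
    ET.SubElement(change, "item").text = "File created."

    return header


header = build_header_skeleton("A Sample Text")
print(ET.tostring(header, encoding="unicode"))
```
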
Examples and Application
• Dumble Geological Survey
  - A geological survey of Texas from the late 19th century, comprised of twelve volumes
  - Digitally imaged monographs processed with OCR software to produce text
  - Text marked up in XML using the TEI Lite specifications
  - http://www.lib.utexas.edu/books/dumble/
Dumble DTD
• Element and attribute definitions
• Entity references
Dumble Header
• Four basic sections
  - File description
  - Encoding description
  - Profile description
  - Revision description
• Contains bibliographic information
• Contains information on the creation of the digital file
Why XML?
• Ability to record information about a document within the document
• Ability to separate structure from format
• Ability to "wrap" or embed information in layers of XML
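
Separating structure from format means the same markup can be displayed in more than one way. The sketch below renders one structurally tagged sentence with two different style tables; the `render` function and both tables are hypothetical stand-ins for a real style sheet, and the sketch does no character escaping.

```python
import xml.etree.ElementTree as ET

# The source records structure only: this phrase is a title.
SOURCE = "<p>See <title>Learning XML</title> for details.</p>"

# Two illustrative "style sheets": element name -> (opening text, closing text).
HTML_STYLE = {"p": ("<p>", "</p>"), "title": ("<i>", "</i>")}
PLAIN_STYLE = {"title": ("_", "_")}


def render(element: ET.Element, style: dict) -> str:
    """Render an element and its descendants using the given style table."""
    opening, closing = style.get(element.tag, ("", ""))
    parts = [opening, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")  # text that follows the child element
    parts.append(closing)
    return "".join(parts)


root = ET.fromstring(SOURCE)
print(render(root, HTML_STYLE))
print(render(root, PLAIN_STYLE))
```

The source text never changes; only the style table decides whether a title appears in italics or between underscores.
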
XML Beyond TEI
• Open Archives Initiative (OAI)
• Semantic Web
• Open Archival Information System (OAIS)
• Digital preservation
• Information discovery
References
• A Sample TEI Markup: www.tei-c.org/Lite/U5-eg.html
• Appendix A.2, Elements in TEI Lite: www.tei-c.org/Lite/U5-taglist.html
• OAI: www.openarchives.org/
• OAIS: http://www.rlg.org/longterm/oais.html
• Learning XML: Erik T. Ray