Explore TEI, a text encoding standard that makes text structure explicit, enabling reliable text processing. Learn about XML, schema languages, TEI development, and the legacy and goals of TEI.
Introduction to TEI
Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Overview • Introduction to text markup • What is TEI • Some examples of usage
The ontology of text • Where is the text? • in the shape of letters and their layout? • in the original from which this copy derives? • in the ideas it brings forth? in their format, or their intentions? • Texts are abstractions conjured up by readers. • Markup encodes those abstractions.
Encoding of texts • Texts are more than sequences of encoded characters • they have structure and content • they also have multiple readings • Encoding, or markup, is a way of making these things explicit • Only that which is explicit can be reliably processed
Some definitions • Markup makes explicit the distinctions we want to make when processing a string of bytes • Markup is a way of naming and characterizing the parts of a text in a formalized way • It’s (usually) more useful to mark up what things are than what they look like
What does markup capture? Compare
  <head>Upon Julia’s Clothes</head>
  <lg>
    <l>Whenas in silks my <hi>Julia</hi> goes,</l>
    <l>Then, then (me thinks) how sweetly flowes</l>
    <l>That liquefaction of her clothes.</l>
  </lg>
and
  <s n="1" role="head">
    <w type="pp">Upon</w> <w type="np">Julia</w><w type="pos">’s </w> <w type="nn2">Clothes</w>
  </s>
  <s n="2" role="line">
    <w type="adv">Whenas</w> <w type="pp">in</w> <w type="nn2">silks</w> ...
  </s>
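The two encodings of the poem make different things explicit, so they answer different questions. The following is a minimal sketch in Python (standard library only) of that point: the structural encoding yields verse lines, the linguistic one yields tagged words. The `<text>` wrapper element and the function names are assumptions added for illustration, and only the heading sentence of the second encoding is used because the slide elides the rest.

```python
import xml.etree.ElementTree as ET

# Structural encoding from the slide, wrapped in a hypothetical <text> root
# so that the fragment has a single top-level element.
STRUCTURAL = """<text>
<head>Upon Julia’s Clothes</head>
<lg>
<l>Whenas in silks my <hi>Julia</hi> goes,</l>
<l>Then, then (me thinks) how sweetly flowes</l>
<l>That liquefaction of her clothes.</l>
</lg>
</text>"""

# Linguistic encoding of the heading only (the slide elides the remainder).
LINGUISTIC = """<s n="1" role="head">
<w type="pp">Upon</w> <w type="np">Julia</w><w type="pos">’s </w>
<w type="nn2">Clothes</w>
</s>"""


def verse_lines(xml_text):
    """Return the text of every <l> element, ignoring inline markup such as <hi>."""
    root = ET.fromstring(xml_text)
    return ["".join(line.itertext()) for line in root.iter("l")]


def tagged_words(xml_text):
    """Return (word, type) pairs for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text.strip(), w.get("type")) for w in root.iter("w")]


lines = verse_lines(STRUCTURAL)
words = tagged_words(LINGUISTIC)
print(len(lines), "verse lines; first:", lines[0])
print(words)
```

Neither function can be run usefully on the other encoding: the first encoding has no `<w>` elements and the second has no `<l>` elements, which is exactly the sense in which only explicit distinctions can be processed.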
What is the point of markup? • To make explicit (to a machine) what is implicit (to a person) • To add value by supplying multiple annotations • To facilitate re-use of the same material • in different formats • in different contexts • for different users
XML • XML is structured data represented as strings of text: • XML is extensible • XML must be well-formed • XML can be validated • XML is application-, platform-, and vendor- independent • XML empowers the content provider and facilitates data integration
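As a rough illustration of "XML must be well-formed", the sketch below uses Python's standard `xml.etree.ElementTree` parser, which rejects any input that is not well-formed. The helper name `is_well_formed` and the two sample strings are assumptions made for this example. Note that this checks well-formedness only; validation against a schema is a separate step that this parser does not perform.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested elements: well-formed.
good = "<lg><l>Whenas in silks my <hi>Julia</hi> goes,</l></lg>"
# Overlapping <hi> and <l> elements: not well-formed.
bad = "<lg><l>Whenas in silks my <hi>Julia</l> goes,</hi></lg>"

print(is_well_formed(good), is_well_formed(bad))
```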
Schema languages • XML schemas are used to: • define the element and attribute vocabularies for particular text types • define content models for elements • define data types of attributes (and elements) • Schemas can be written in: • XML DTD Language • W3C schema language • ISO Relax NG schema language • (TEI mostly uses Relax NG)
Developing schemas • For simple annotations, one can define a project-specific schema from scratch • But if the markup will be complicated, it is better to take one of the standard schemas • Using standard schemas means: • better documentation • better interchange • better tool support • There are many schemas around, but only one initiative deals with encoding arbitrary texts for scholarly purposes
Text Encoding Initiative The TEI provides a framework for the definition of multiple XML schemas • it defines and names several hundred useful textual distinctions • it provides a set of modules that can be used to define schemas making those distinctions • it provides a customization mechanism for modifying and combining those definitions with new ones using the same conceptual model
Where did the TEI come from? • Originally, a research project within the humanities • Sponsored by three professional associations • Funded 1990-1994 by US, EU • Major influences • digital libraries and text collections • language corpora • scholarly datasets • International consortium established June 1999 (see http://www.tei-c.org/)
Goals of the TEI • better interchange and integration of scholarly data • support for all texts, in all languages, from all periods • guidance for the perplexed: what to encode — hence, a user-driven codification of existing best practice • assistance for the specialist: how to encode — hence, a loose framework into which unpredictable extensions can be fitted These apparently incompatible goals result in a flexible and modular environment
TEI Guidelines • A set of recommendations for text encoding, covering both generic text structures and some highly specific areas based on (but not limited by) existing practice • A very large collection of element definitions with associated declarations for various schema languages • A modular system for creating personalized schemas from the foregoing • For the full picture see http://www.tei-c.org/Guidelines/
Legacy of the TEI • a way of looking at what ‘text’ really is • a codification of current scholarly practice • (crucially) a set of shared assumptions and priorities about the digital agenda: • focus on content and function (rather than presentation) • identify generic solutions (rather than application-specific ones)
Users of TEI • Over 100 projects listed on the TEI project page • Main areas of use: • digital libraries • text-critical editions • computer corpora • dictionaries
Versions of the Guidelines • TEI P3 (1994) first public version: • SGML + book (1200pp) and soon also on the Web. • TEI P4 (2002): • provides equal support for XML and SGML applications using the TEI scheme; • error correction, while maintaining backward compatibility: documents conforming to TEI P3 will not become illegal when processed with TEI P4. • TEI P5 (2007): • implements more fundamental changes to the schemas, in line with current practice and identified problems, e.g. uses namespaces • no longer backward compatible with P3, P4 • Relax NG becomes the main schema language • continuous improvement
TEI modules • TEI is too general to be supported by a single schema • Rather, TEI is composed of modules, and which modules the user selects is determined by the project needs • Some examples of modules: • Transcription of spoken texts • Dictionaries and lexica • Varieties of linguistic annotation • Nonstandard characters and glyphs • Linking, alignment, non-hierarchic structures • Detailed metadata (the TEI Header) • Manuscript description • Text-critical apparatus
Support offered by TEI • Web interface to make XML schemas from a TEI parametrisation • A set of XSLT stylesheets to convert TEI/XML to HTML or PDF • Mailing list tei-l • Various tutorials available from the TEI pages • Yearly conference and members’ meeting
Examples of applications Mostly work done by me in collaboration with other people and institutions: • Annotated corpora • Machine readable dictionaries • Text-critical editions • Biographical databases
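One practical consequence of P5 using namespaces is that processing code has to ask for elements by their namespace-qualified names. The sketch below, using Python's standard `xml.etree.ElementTree`, assumes the TEI P5 namespace `http://www.tei-c.org/ns/1.0`; the tiny document is a deliberately incomplete fragment invented for illustration (it has no header and is not a valid TEI document).

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used only inside this script.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative fragment, not a complete or valid TEI document.
DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg><l>Whenas in silks my Julia goes,</l></lg>
  </body></text>
</TEI>"""

root = ET.fromstring(DOC)

# A query without the namespace finds nothing in a namespaced document ...
unqualified = root.findall(".//l")
# ... while the namespace-qualified query finds the verse line.
qualified = root.findall(".//tei:l", TEI_NS)

print(len(unqualified), len(qualified))
```

Scripts written for un-namespaced P4 documents therefore tend to need their element queries adapted before they work on P5 data.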
JOS corpus
  <s xml:id="F0203.557.2">
    <w xml:id="F0203.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/>
    <w xml:id="F0203.557.2.2" lemma="biti" msd="Gp-ste-n">je</w><S/>
    <term type="sloWNet" sortKey="kraj" key="ENG20-08114200-n">
      <w xml:id="F0203.557.2.3" lemma="turističen" msd="Ppnmein">turističen</w><S/>
      <w xml:id="F0203.557.2.4" lemma="kraj" msd="Somei">kraj</w>
    </term>
    <c xml:id="F0203.557.2.5">.</c><S/>
  </s>
  <linkGrp type="syntax" targFunc="head argument" corresp="#F0203.557.2">
    <link type="ena" targets="#F0203.557.2.2 #F0203.557.2.1"/>
    <link type="modra" targets="#F0203.557.2 #F0203.557.2.2"/>
    <link type="dol" targets="#F0203.557.2.4 #F0203.557.2.3"/>
    <link type="dol" targets="#F0203.557.2.2 #F0203.557.2.4"/>
    <link type="modra" targets="#F0203.557.2 #F0203.557.2.5"/>
  </linkGrp>
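The JOS sample stores syntactic relations as stand-off links that point at token identifiers. A minimal sketch of resolving those pointers in Python follows. It assumes, from `targFunc="head argument"`, that the first target of each link is the head and the second the argument; the `<div>` wrapper, the function name, and the `ROOT` label for links that point at the sentence itself are assumptions added for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The slide's fragment, wrapped in a hypothetical <div> to give it one root.
FRAGMENT = """<div>
<s xml:id="F0203.557.2">
<w xml:id="F0203.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/>
<w xml:id="F0203.557.2.2" lemma="biti" msd="Gp-ste-n">je</w><S/>
<term type="sloWNet" sortKey="kraj" key="ENG20-08114200-n">
<w xml:id="F0203.557.2.3" lemma="turističen" msd="Ppnmein">turističen</w><S/>
<w xml:id="F0203.557.2.4" lemma="kraj" msd="Somei">kraj</w>
</term>
<c xml:id="F0203.557.2.5">.</c><S/>
</s>
<linkGrp type="syntax" targFunc="head argument" corresp="#F0203.557.2">
<link type="ena" targets="#F0203.557.2.2 #F0203.557.2.1"/>
<link type="modra" targets="#F0203.557.2 #F0203.557.2.2"/>
<link type="dol" targets="#F0203.557.2.4 #F0203.557.2.3"/>
<link type="dol" targets="#F0203.557.2.2 #F0203.557.2.4"/>
<link type="modra" targets="#F0203.557.2 #F0203.557.2.5"/>
</linkGrp>
</div>"""


def dependency_pairs(xml_text):
    """Return (relation, head form, argument form) for every <link>."""
    root = ET.fromstring(xml_text)
    # Map each xml:id to a printable form: token text for <w> and <c>,
    # and the placeholder "ROOT" (our own label) for the sentence element.
    forms = {}
    for el in root.iter():
        ident = el.get(XML_ID)
        if ident is not None:
            forms[ident] = el.text if el.tag in ("w", "c") else "ROOT"
    pairs = []
    for link in root.iter("link"):
        head, argument = (t.lstrip("#") for t in link.get("targets").split())
        pairs.append((link.get("type"), forms[head], forms[argument]))
    return pairs


pairs = dependency_pairs(FRAGMENT)
for relation, head, argument in pairs:
    print(relation, head, argument)
```

Keeping the links outside the sentence leaves the token markup untouched, so the same tokens can also carry the word-sense annotation in `<term>` without the two layers interfering.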
jaSlo dictionary
  <entry id="jaslo.55">
    <form type="hw">
      <orth type="roma">ainiku</orth>
      <orth type="kana">あいにく</orth>
      <orth type="kanji">生憎</orth>
    </form>
    <gramGrp>
      <pos>N/Ana/Adv</pos>
    </gramGrp>
    <trans>
      <tr>nesrečen</tr>
      <tr>nepričakovan</tr>
      <tr>nesluten</tr>
      <tr>žal</tr>
    </trans>
    <eg>
      <q>おあいにくさまです。</q>
      <tr>Žal mi je za vas.</tr>
    </eg>
    <usg type="level">2</usg>
  </entry>
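Because the dictionary entry names its parts, a program can pull out exactly the fields it needs. The sketch below, in Python with the standard library only, reads the headword in its three scripts and the headword translations from the entry shown on the slide; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The jaSlo entry from the slide.
ENTRY = """<entry id="jaslo.55">
<form type="hw">
<orth type="roma">ainiku</orth>
<orth type="kana">あいにく</orth>
<orth type="kanji">生憎</orth>
</form>
<gramGrp> <pos>N/Ana/Adv</pos> </gramGrp>
<trans>
<tr>nesrečen</tr> <tr>nepričakovan</tr> <tr>nesluten</tr> <tr>žal</tr>
</trans>
<eg>
<q>おあいにくさまです。</q>
<tr>Žal mi je za vas.</tr>
</eg>
<usg type="level">2</usg>
</entry>"""

entry = ET.fromstring(ENTRY)

# Headword spellings keyed by script type.
orths = {orth.get("type"): orth.text for orth in entry.iter("orth")}

# Only <tr> directly inside <trans>; the <tr> inside <eg> translates the example.
translations = [tr.text for tr in entry.findall("trans/tr")]

print(orths["roma"], orths["kana"], orths["kanji"])
print(translations)
```

The path `trans/tr` matters here: a plain search for every `<tr>` would mix the example translation in with the headword translations.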
eZISS text-critical editions
  <l>Na
    <app>
      <lem wit="#Drobt_1846">cesti</lem>
      <rdg wit="#UKM_123 #UKM_553">zeſti</rdg>
    </app>
    popotnik
    <app>
      <lem wit="#Drobt_1846">zdihuje. –</lem>
      <rdg wit="#UKM_123">sdihuje. <add>–</add></rdg>
      <rdg wit="#UKM_553">sdihuje</rdg>
    </app>
  </l>
Three readings:
  Drobt_1846: Na cesti popotnik zdihuje. –
  UKM_123: Na zeſti popotnik sdihuje. –
  UKM_553: Na zeſti popotnik sdihuje
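The three readings can be derived mechanically from the apparatus markup. The following is a minimal sketch in Python (standard library only) that, for a given witness, picks the `<lem>` or `<rdg>` whose `wit` attribute names that witness inside each `<app>` and joins the result with the surrounding text. The function name and the whitespace normalisation are assumptions made for this example, and it handles only the flat structure shown on the slide, not nested apparatus entries.

```python
import xml.etree.ElementTree as ET

# The verse line from the slide.
LINE = (
    '<l>Na <app> <lem wit="#Drobt_1846">cesti</lem> '
    '<rdg wit="#UKM_123 #UKM_553">zeſti</rdg></app> popotnik '
    '<app><lem wit="#Drobt_1846">zdihuje. –</lem>'
    '<rdg wit="#UKM_123">sdihuje. <add>–</add></rdg>'
    '<rdg wit="#UKM_553">sdihuje</rdg></app> </l>'
)


def reading_for(line_xml, witness):
    """Reconstruct the text of one witness from a line with <app> entries."""
    line = ET.fromstring(line_xml)
    parts = [line.text or ""]
    for child in line:
        if child.tag == "app":
            # Take the first <lem>/<rdg> that lists this witness.
            for variant in child:
                if "#" + witness in variant.get("wit", "").split():
                    parts.append("".join(variant.itertext()))
                    break
        else:
            parts.append("".join(child.itertext()))
        # Text that follows the child element belongs to every witness.
        parts.append(child.tail or "")
    # Collapse runs of whitespace introduced by the markup layout.
    return " ".join("".join(parts).split())


for witness in ("Drobt_1846", "UKM_123", "UKM_553"):
    print(witness + ":", reading_for(LINE, witness))
```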
SBL biographical database
  <person>
    <sex value="1"/>
    <persName xml:lang="lat">
      <forename>Johannes</forename>
      <surname>Aquila de <placeName>Rakerspurga</placeName></surname>
    </persName>
    <persName>
      <forename>Janez</forename>
      <surname>Akvila iz <placeName>Radgone</placeName></surname>
    </persName>
    <persName>
      <forename xml:lang="hun">János</forename>
      <surname>Aquila</surname>
    </persName>
    <occupation>slikar</occupation>
    <floruit notAfter="1392" notBefore="1378">
      <placeName>
        <region>Prekmurje</region>
      </placeName>
    </floruit>
  </person>
Conclusions • Gave a brief introduction to TEI • For more, visit the TEI web pages!