Phonetic characters in digital editions
Tomaž Erjavec (1) & Matija Ogrin (2)
tomaz.erjavec@ijs.si, matija.ogrin@zrc-sazu.si
(1) Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
(2) Institute of Slovenian Literature and Literary Sciences, Scientific Research Centre of the Slovenian Academy of Sciences and Arts, Ljubljana
SloFon, 21 April 2006
Overview of the talk • IPA • PUA • TEI
The problem • provide standardised encoding (XML) and Web viewing (HTML) of complex digital editions • in particular, the Freising manuscripts (e-BS) • work in progress in the project “Scholarly Digital Editions of Slovenian Literature” http://nl.ijs.si/e-zrc/
Focus of the talk • e-BS, a very complex document: facsimile, commentary, diplomatic and critical transcriptions, translations, dictionary, bibliography, name index, … • but also: • phonetic transcription in IPA • (recording)
IPA • International Phonetic Alphabet (International Phonetic Association) • contains poorly supported characters, e.g. ɐ, ɕ, ɚ, ɷ • heavy use of diacritics: • unusual diacritical marks: ˀ ˒ ˤ • more than one diacritic: ǡ • diacritics spanning digraphs:
Computer representation of IPA
SAMPA (for HLT) • transliteration to ASCII • SAMPA for contemporary Slovenian: • http://www.phon.ucl.ac.uk/home/sampa/sloven-uni.htm • ZEMLJAK, Melita, KAČIČ, Zdravko, DOBRIŠEK, Simon, ŽGANEC GROS, Jerneja, WEISS, Peter. Računalniški simbolni fonetični zapis slovenskega govora. Slav. rev., apr.-jun. 2002, 50/2, 159-169.
Unicode (for humans) • universal character set, increasingly well supported • contains the blocks “IPA Extensions” and “Combining Diacritical Marks” • various good Unicode IPA fonts available, e.g. Doulos SIL • for non-standardised characters: Private Use Area (PUA) • not to be used lightly!
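The difference between standard IPA characters and PUA code points can be checked mechanically. The following is a minimal Python sketch using the standard `unicodedata` module; the function names `describe` and `is_private_use` and the sample string are illustrative assumptions, not part of ZRCola or e-BS.

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, name) for one character."""
    # PUA code points have general category "Co" and no standard name,
    # so a placeholder is passed as the default.
    name = unicodedata.name(ch, "<no standard name>")
    return f"U+{ord(ch):04X}", unicodedata.category(ch), name


def is_private_use(ch):
    """True if the character lies in a Unicode Private Use Area."""
    return unicodedata.category(ch) == "Co"


if __name__ == "__main__":
    # Hypothetical sample: an IPA letter, a combining mark, a PUA code point.
    for ch in "\u0250\u0329\ue31b":
        print(describe(ch), is_private_use(ch))
```

A check like this only says that a code point is private; what it means still has to be documented elsewhere, which is why PUA use needs care.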
ZRCola • developed at ZRC SAZU (Peter Weiss) • Unicode input system for linguistic use in the WinWord program: • decomposed and composed characters • keyboard input • font which covers historical characters as well as IPA & (now) some specifics of e-BS • ideal for use in e-BS
Why PUA? ZRCola font uses PUA mostly for • defining new Slovene (related) historical characters • composed characters with diacritics (+ digraphs), for better diacritic placement • Unicode offers Combining diacritical marks, but complex stacks can cause problems for font rendering
Some comparisons
• PUA EB25: ZRCola mapping to r+0300+0329; Times NR r̩̀; MS Tahoma r̩̀; Doulos SIL r̩̀
• PUA EEC8: ZRCola ~mapping to t+j+032E; Times NR tj̮; MS Tahoma tj̮; Doulos SIL tj̮
• PUA E31B: ZRCola mapping to 0105+0307; Times NR ą̇; MS Tahoma ą̇; Doulos SIL ą̇
• PUA E35E: ZRCola mapping to 00E6+0303+0300; Times NR æ̃̀; MS Tahoma æ̃̀; Doulos SIL æ̃̀
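The mappings in the comparison above are plain code point sequences, so they can be inspected and normalised in software. A minimal sketch, assuming Python's `unicodedata.normalize`; the table name `MAPPINGS` and the helper names are illustrative, and the sequences are copied from the slide.

```python
import unicodedata

# Standard-Unicode mappings from the comparison, keyed by PUA code point.
MAPPINGS = {
    0xEB25: "r\u0300\u0329",
    0xEEC8: "tj\u032e",
    0xE31B: "\u0105\u0307",
    0xE35E: "\u00e6\u0303\u0300",
}


def code_points(text):
    """Format a string as hexadecimal code points joined by '+'."""
    return "+".join(f"{ord(ch):04X}" for ch in text)


def canonical(text):
    """Normalise to NFC, which also puts combining marks in canonical order."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for pua, seq in MAPPINGS.items():
        print(f"{pua:04X}", code_points(seq), "->", code_points(canonical(seq)))
```

Normalisation reorders marks above and below the base letter into a fixed order, so r+0300+0329 and r+0329+0300 compare equal after `canonical`. It does not help with how a font stacks the marks on screen.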
Problem • PUA = Private Use Area but • e-ZRC = standardised & interchangeable • How to retain the benefits of ZRCola, yet make e-BS interchangeable? • How to enable reading e-BS on platforms without the ZRCola font?
Text Encoding Initiative • e-ZRC editions encoded in XML • using the Text Encoding Initiative Guidelines, TEI P4 • TEI P5 makes provisions for encoding PUA characters and glyphs • in TEI P4 user extensions are necessary to achieve the same effect
PUA in TEI P5 • TEI P5 chapter 25, “Representation of non-standard characters and glyphs” • markup in text to identify PUA characters or glyphs • link these elements to their TEI header definition • TEI header can give, for each new character: • a name (text description à la Unicode), e.g. LATIN SMALL LETTER A • mapping to standard Unicode • character properties • rendering software (e.g. XSLT stylesheet for conversion to HTML) can then use the PUA version, or the standard version
Markup in the document
• text: b:ʒɛ g:spɔdi miłɔstíwi :tɛ b:ʒɛ tɛbǽ ispɔwǽdæ
• in XML:
<line n="2" id="bsPT.1.002">
  b<g corresp="zrcolaE656"/>:ʒɛ g<g corresp="zrcolaE656"/>:spɔdi miłɔstíwi <g corresp="zrcolaE656"/>:t<g corresp="zrcolaEECC"/>ɛ b<g corresp="zrcolaE656"/>:ʒɛ tɛbǽ ispɔwǽdæ
</line>
Markup in the header
PUA characters are defined in teiHeader/encodingDesc:
<charDesc>
  <desc>PUA characters as defined by <xref url="http://zrcola.zrc-sazu.si/">ZRCola</xref>.
    Character descriptions taken from and based on The Unicode Standard 4.1, U41M050317.lst
  </desc>
  <char id="zrcolaE31B">
    <charName>LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE</charName>
    <charProp><localName>font</localName><value>ZRCola</value></charProp>
    <charProp><localName>mapping</localName><value>exact</value></charProp>
    <mapping type="PUA"></mapping>
    <mapping type="standard">ą<!--LATIN SMALL LETTER A WITH OGONEK-->̇<!--COMBINING DOT ABOVE--></mapping>
  </char>
  <!-- more chars -->
</charDesc>
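A header like the one above can be read into a lookup table by any XML-aware program. The following Python sketch uses the standard `xml.etree.ElementTree` module. The function name `load_char_table` and the shape of the resulting dictionary are assumptions made for illustration. The sample is a trimmed copy of the header in which both mappings are written as numeric character references, and the PUA value U+E31B is an assumption inferred from the id `zrcolaE31B`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the header fragment. The PUA mapping is written as a
# numeric character reference; its value is an assumption based on the
# id "zrcolaE31B".
HEADER = """
<charDesc>
  <char id="zrcolaE31B">
    <charName>LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE</charName>
    <charProp><localName>font</localName><value>ZRCola</value></charProp>
    <charProp><localName>mapping</localName><value>exact</value></charProp>
    <mapping type="PUA">&#xE31B;</mapping>
    <mapping type="standard">&#x105;&#x307;</mapping>
  </char>
</charDesc>
"""


def load_char_table(xml_text):
    """Map each char id to its name, properties and mappings."""
    table = {}
    for char in ET.fromstring(xml_text).iter("char"):
        # Each charProp holds one localName/value pair.
        props = {p.findtext("localName"): p.findtext("value")
                 for p in char.iter("charProp")}
        # One mapping per type ("PUA", "standard").
        mappings = {m.get("type"): (m.text or "")
                    for m in char.iter("mapping")}
        table[char.get("id")] = {
            "name": char.findtext("charName"),
            "props": props,
            "mappings": mappings,
        }
    return table


if __name__ == "__main__":
    print(load_char_table(HEADER)["zrcolaE31B"]["name"])
```

Keeping both mappings in one table lets the same text be rendered either with the ZRCola font or with standard Unicode only.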
Standardisation of ZRCola PUA • ZRCola very well documented “visually”, i.e. for humans • but lacking machine-processable metadata: • Unicode-compliant name • mapping to standard Unicode (identity, similarity) • we only implemented the 50+ characters that actually appear in e-BS • substantial work to describe all PUA characters in the ZRCola distribution • maybe better to abandon the precomposed PUA characters that can be expressed in standard Unicode?
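Part of the missing metadata could perhaps be drafted automatically once the standard mapping is known. The sketch below is only a heuristic and not the procedure used for e-BS: it builds a Unicode-style name from the names of the base letter and its combining marks. The function name `proposed_name` is hypothetical, and sequences with more than one base letter (digraphs) are deliberately rejected.

```python
import unicodedata


def proposed_name(standard_mapping):
    """Draft a Unicode-style name for one base letter plus combining marks."""
    # NFC gives a deterministic mark order and folds marks into the base
    # letter where a precomposed character exists.
    base, *marks = unicodedata.normalize("NFC", standard_mapping)
    name = unicodedata.name(base)
    for mark in marks:
        if unicodedata.combining(mark) == 0:
            raise ValueError("only one base letter is supported")
        mark_name = unicodedata.name(mark)
        if mark_name.startswith("COMBINING "):
            mark_name = mark_name[len("COMBINING "):]
        # The first addition uses WITH, later ones use AND.
        joiner = " AND " if " WITH " in name else " WITH "
        name += joiner + mark_name
    return name


if __name__ == "__main__":
    print(proposed_name("\u0105\u0307"))
```

Names produced this way would still need review by a person, since they only describe the composition and say nothing about the intended linguistic use.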
TEI to HTML
<xsl:template match="g">
  <xsl:variable name="glyph" select="id(@corresp)/mapping[@type=$ENCODING]"/>
  <SPAN>
    <xsl:if test="$ENCODING = 'standard'">
      <xsl:attribute name="class">
        <xsl:value-of select="id(@corresp)/charProp[localName='mapping']/value"/>
      </xsl:attribute>
    </xsl:if>
    <xsl:attribute name="title">
      <xsl:value-of select="id(@corresp)/charProp[localName='font']/value"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="id(@corresp)/charName"/>
    </xsl:attribute>
    <xsl:value-of select="$glyph"/>
  </SPAN>
</xsl:template>
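For readers less familiar with XSLT, the logic of the template above can be restated as a short Python sketch. This is an illustration under stated assumptions, not the stylesheet used for e-BS: the table `CHARS`, its single entry (including the PUA code point U+E31B) and the sample line are hypothetical, and the parameter `encoding` plays the role of `$ENCODING`.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical lookup table with one entry, shaped after the header slide.
CHARS = {
    "zrcolaE31B": {
        "name": "LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE",
        "font": "ZRCola",
        "mapping": "exact",
        "PUA": "\ue31b",            # assumed from the id
        "standard": "\u0105\u0307",
    },
}


def render_g(corresp, encoding):
    """Render one <g> element as an HTML span, like the template does."""
    entry = CHARS[corresp]
    attrs = ""
    if encoding == "standard":
        # The class records how close the standard mapping is.
        attrs += f' class="{escape(entry["mapping"])}"'
    attrs += f' title="{escape(entry["font"])}: {escape(entry["name"])}"'
    return f"<span{attrs}>{escape(entry[encoding])}</span>"


def render_line(xml_text, encoding="standard"):
    """Convert a <line> with <g> children to an HTML fragment."""
    line = ET.fromstring(xml_text)
    parts = [escape(line.text or "")]
    for child in line:
        if child.tag == "g":
            parts.append(render_g(child.get("corresp"), encoding))
        # Text following the child belongs to the line itself;
        # children other than <g> are ignored in this sketch.
        parts.append(escape(child.tail or ""))
    return "".join(parts)


if __name__ == "__main__":
    print(render_line('<line n="1">b<g corresp="zrcolaE31B"/>c</line>'))
```

Calling `render_line(..., "PUA")` emits the private-use character without a `class` attribute, which mirrors the `xsl:if` test in the template.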
Conclusions • introduced IPA, PUA & TEI • showed how PUA characters can, via TEI, be made: • interchangeable • documented • flexibly presented • this does require an investment of time by the designers of PUA characters