The Construction of the Anglo-Norman Text Corpus • Joint Project of the University of Wales, Swansea and the University of Wales, Aberystwyth. • AHRC-funded. • Anglo-Norman Online Dictionary • Anglo-Norman Text Corpus • http://www.anglo-norman.net
Goal of the Anglo-Norman Hub Text Digitisation Project • To provide mediaeval linguists and historians with a set of digitised texts and articles that is searchable and fully cross-referenced, both within itself and to and from the Anglo-Norman Online Dictionary
Main Challenges facing the Anglo-Norman Hub Project • Image to text migration for maximum throughput at minimum cost • Application of markup suitable for rendering and full cross-referencing • Handling of non-standard character sets (mediaeval abbreviations)
Image to Text Migration Strategies • Optical Character Recognition • Re-keying • Both require subsequent proofreading • Both allow insertion of appearance metadata as provisional markup
Advantages of Alternative Image to Text Migration Strategies • OCR: rapid processing; can be performed by students on-site and can be supervised • Re-keying: less error-prone; cheap if outsourced; non-standard characters can be represented by combinations; more consistent output quality; image quality less critical
Economic Image to Text Migration: Conclusions • Re-keying is more economical for the bulk of the mediaeval-language material • OCR is competitive for modern languages (critical material) • OCR can also be used for mediaeval-language material when required by workflows, provided that: • good image quality can be easily achieved • the material consists of standard characters
Markup requirements: must • Conform to widely-accepted standards • Be capable of encapsulating diverse document structures • Allow for automation • Enable internal and external referencing • Preserve as much appearance metadata as possible • Not be tied to any one approach to rendering
Document types requiring a variety of XML Structures • Texts • Verse • Prose • Lists & Tables • Critical material • Introductions (conform to prose structures) • Notes (do not conform to any of the above structures)
Cross-referencing of Critical Matter • Need to navigate from pointer to note • Need to navigate cross-references from critical material to specific points in the text or elsewhere in critical material • Achieved by use of target-id pairs
Markup Density and Automation • Verse: medium density; can be automated • Prose: variable density; can be automated if footnote pointers present • Lists & tables: medium density; can be automated • Critical material: high-density; many cross-references; limited scope for automation
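The claim that verse markup can be automated can be illustrated with a minimal Python sketch. It assumes the digitised text arrives as one plain line per verse and that the poem runs in regular four-line stanzas numbered from line 1 (an assumption consistent with a stanza 316 that ends at line 1264). The function name mark_up_verse and its parameters are hypothetical and are not part of the project's tooling.

```python
from xml.sax.saxutils import escape


def mark_up_verse(lines, first_line=1, stanza_len=4):
    """Wrap plain verse lines in <lg>/<l> elements with sequential ids.

    Assumes first_line is the first line of a stanza and that every
    stanza has stanza_len lines (both are assumptions of this sketch).
    """
    out = []
    for offset in range(0, len(lines), stanza_len):
        stanza = lines[offset:offset + stanza_len]
        start = first_line + offset
        # Stanza number derived from the line number of its first line.
        stanza_no = (start - 1) // stanza_len + 1
        body = "\n".join(
            f'<l id="L{start + i}">{escape(text)}</l>'
            for i, text in enumerate(stanza)
        )
        out.append(f'<lg n="{stanza_no}">\n{body}\n</lg>')
    return "\n".join(out)


verse = [
    "A Deu del cel ad graciéd",
    "E al martir suvent a voéd",
    "Que si bel l'at delivréd",
    "De ço qu'esteit ainz encumbrét.",
]
print(mark_up_verse(verse, first_line=1261))
```

Page breaks and irregular stanzas would still need manual attention or additional provisional codes; the sketch covers only the regular case.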
Extract from XML version of “La Passiun de St. Edmund” • <lg n="316"><l id="L1261">A Deu del cel ad graciéd</l> • <l id="L1262">E al martir suvent a voéd</l> • <l id="L1263">Que si bel l'at delivréd</l> • <pb ed="folio" n="123a"/><l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l></lg>
Extract from XML version of “La Passiun de St. Edmund” • <note id="N1261-4" target="L1261" targetEnd="L1264">These lines present several problems: (a) <q lang="AN" rend="b">A Deu. . .ad graciéd</q> <ref target="L1261">1261</ref>. The verb <term lang="AN" rend="i">gracier</term>, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: <ref target="L826 L943 L1132">ll. 826, 943, 1132</ref>.
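The target-id pairing shown in the two extracts can be checked mechanically. The following Python sketch uses only the standard library and collects every target and targetEnd value, then reports which ones resolve to an element id in the same file. The wrapping text element and the shortened note are assumptions made so that the fragment parses on its own; check_targets is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <text> wrapper is an assumption of this sketch.
SAMPLE = """<text>
<lg n="316"><l id="L1261">A Deu del cel ad graciéd</l>
<l id="L1262">E al martir suvent a voéd</l>
<l id="L1263">Que si bel l'at delivréd</l>
<l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l></lg>
<note id="N1261-4" target="L1261" targetEnd="L1264">See
<ref target="L1261">1261</ref> and <ref target="L826 L943">ll. 826, 943</ref>.</note>
</text>"""


def check_targets(root):
    """Return (resolved, dangling) lists of target ids, in document order."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    resolved, dangling = [], []
    for el in root.iter():
        for attr in ("target", "targetEnd"):
            # A target attribute may hold several space-separated ids.
            for ident in (el.get(attr) or "").split():
                (resolved if ident in by_id else dangling).append(ident)
    return resolved, dangling


root = ET.fromstring(SAMPLE)
resolved, dangling = check_targets(root)
print("resolved:", resolved)
print("dangling:", dangling)
```

In this sample, L826 and L943 are reported as dangling only because the fragment omits those lines; in a full text they would resolve.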
Additional Markup for Critical Material • <term>: Terms discussed may need to be linked to the Anglo-Norman Dictionary • <q>: Citations: may need to be linked to their sources within the text base • <bibl>, <title> etc.: Bibliographical information needs to be encoded to link citations with their sources • Much of the above can be extrapolated from the appearance metadata embedded in the provisional markup • <hi>: to encode embedded appearance metadata whose significance is not apparent
“La Passiun de St. Edmund” Rendered for a Web Browser • These lines present several problems: (a) A Deu. . .ad graciéd 1261. The verb gracier, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: ll. 826, 943, 1132. T.-L. 4,502 cites one instance of gracier with indirect object, but in the construction gracier. qc. a qn. If this construction were applied here, ll. 1263-4 would have to be taken as the direct object of gracier and also, presumably, of voer 1262. The use here of gracier with indirect object may have been influenced by the construction rendre graces a qn. employed at ll. 995, 1046, 1512.
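One way such a rendering could be produced from the note markup is sketched below in Python. It is not the project's actual rendering pipeline. It assumes that the rend values "b" and "i" stand for bold and italic, and that a ref with several targets links to the first one only; render and REND_TO_HTML are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed meaning of the rend attribute values.
REND_TO_HTML = {"b": "b", "i": "i"}


def render(el):
    """Recursively turn an element and its children into an HTML fragment."""
    inner = escape(el.text or "", quote=False)
    for child in el:
        inner += render(child) + escape(child.tail or "", quote=False)
    if el.tag == "ref":
        targets = (el.get("target") or "").split()
        # Simplification: link to the first target only.
        href = targets[0] if targets else ""
        return f'<a href="#{href}">{inner}</a>'
    tag = REND_TO_HTML.get(el.get("rend"))
    return f"<{tag}>{inner}</{tag}>" if tag else inner


note = ET.fromstring(
    '<note id="N1">The verb <term lang="AN" rend="i">gracier</term>: '
    '<ref target="L826 L943">ll. 826, 943</ref>.</note>'
)
print(render(note))
```

Because the rendering rules live in a small lookup table outside the XML, the same markup can be rendered differently for another medium, in line with the requirement that the markup not be tied to any one approach to rendering.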
Markup Requirements: Application • 1,000 to 100,000 XML tags per document • Automation essential for high throughput • Digitisers can embed appearance metadata in provisional markup • Well-designed provisional markup schemes facilitate automation
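How provisional markup can feed automation is sketched below in Python. The provisional scheme shown, with codes such as {i}...{/i} and {b}...{/b} keyed by the digitiser, is entirely hypothetical, since the slides do not specify the scheme used. The sketch converts those codes into hi elements, leaving their significance to be decided later.

```python
import re

# Hypothetical provisional codes: {i}...{/i} for italic, {b}...{/b} for bold.
PROVISIONAL = re.compile(r"\{([ib])\}(.*?)\{/\1\}", re.S)


def provisional_to_hi(text):
    """Turn provisional appearance codes into <hi rend="..."> elements.

    Simplifications: no nesting of the same code, and the text is assumed
    to be XML-safe already.
    """
    return PROVISIONAL.sub(r'<hi rend="\1">\2</hi>', text)


print(provisional_to_hi("The verb {i}gracier{/i} occurs with {b}A Deu{/b}."))
```

A later pass could promote selected hi elements to term, q or title once their function is known.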
Extract from the explanation published with the Statutes, exemplifying the two forms resembling 9s.
Handling of Non-Unicode Characters: 1) Transcription • Transcription is the one-to-one encapsulation of character appearance metadata • Transliteration is the expansion of abbreviated characters into an intelligible sequence of letters • Transliteration requires transcription as a starting point • Transcription codes must resemble originals to facilitate re-keying
Examples of the "per", "pro" and "pre" contractions as represented by the agency
Transcription to Transliteration: Rekeyed Version & XML File <p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos & consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum & stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>
Handling of Non-Unicode Characters: 2) Transliteration • Manual transliteration would take too long • Blanket replacement is not possible because of ambiguous abbreviations • Semi-automated transliteration can be achieved using a list of words for block-replacement, derived from a concordance • The appearance metadata from the transcription should remain embedded
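The semi-automated approach can be sketched in Python as follows. The sketch assumes that a rekeyed word carries its transcription code inline (for example "nup-" for the word expanded to "nuper"); the exact rekeyed forms are not shown in the slides, so the entries in WORD_LIST are illustrative guesses built from the abbr values in the XML extract. A concordance counts each word form so that an editor can decide the expansions, and only whole words present in the list are replaced.

```python
import re
from collections import Counter

# Hypothetical word list: rekeyed word form -> expansion with embedded code.
WORD_LIST = {
    "nup-": 'nu<expan abbr="p-">per</expan>',
    "p-": '<expan abbr="p-">per</expan>',
    "p$ut": '<expan abbr="p$">pro</expan>ut',
    "legalibz$": 'legalib<expan abbr="z$">us</expan>',
}

# A word is any run of characters other than whitespace and punctuation.
WORD = re.compile(r"[^\s,.;:]+")


def concordance(text):
    """Count every word form so that frequent forms can be reviewed first."""
    return Counter(WORD.findall(text))


def transliterate(text, word_list):
    """Block-replace whole words found in word_list; leave the rest as keyed."""
    return WORD.sub(lambda m: word_list.get(m.group(0), m.group(0)), text)


sample = "Cum nup- p- nos, p$ut in statuto"
print(concordance(sample).most_common(3))
print(transliterate(sample, WORD_LIST))
```

Matching on whole words rather than on bare codes is what avoids the blanket-replacement problem: an ambiguous code is expanded only in word forms an editor has already approved, and the abbr attribute keeps the original transcription code embedded in the output.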
Transliteration: XML File & Rendered Output <p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos & consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum & stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>