Introduction to Computational Methods for Classical Philology

Introduction to Computational Methods for Classical Philology. David Bamman The Perseus Project, Tufts University http://nlp.perseus.tufts.edu/docs/xxisnec/slides/1.intro.pdf. Homer Multitext. 39-megapixel scans of the 10th-century Marcianus Graecus Z. 454 (= 822) manuscript of the Iliad.


Presentation Transcript


  1. Introduction to Computational Methods for Classical Philology David Bamman The Perseus Project, Tufts University http://nlp.perseus.tufts.edu/docs/xxisnec/slides/1.intro.pdf

  2. Homer Multitext • 39-megapixel scans of the 10th-century Marcianus Graecus Z. 454 (= 822) manuscript of the Iliad. • Publicly released under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License by the Biblioteca Nazionale Marciana and the Center for Hellenic Studies http://chs.harvard.edu/chs/manuscript_images

  3. Physical Access • Perseus Digital Library (http://www.perseus.tufts.edu) • Latin Library (http://www.thelatinlibrary.com/) • LASLA (http://www.cipl.ulg.ac.be/lsl.htm) • Index Thomisticus (http://www.corpusthomisticum.org/it) • Documenta Catholica Omnia (http://www.documentacatholicaomnia.eu/) • TLG (http://www.tlg.uci.edu/) • Brepols corpora [BTL etc.] (http://www.brepols.net/publishers/cd-rom.htm) • Google Books (http://books.google.com/) • Internet Archive (http://www.archive.org)

  4. Perseus Digital Library

  5. “Open” Access: XML

  6. Philologic (Chicago)

  7. Archimedes (Harvard)

  8. Diogenes (Durham)

  9. Hestia (Open University)

  10. Open Source Perseus http://www.perseus.tufts.edu/hopper/opensource • 4.5 million words of Classical Latin • 4.9 million words of Ancient Greek • TEI-compliant XML

  11. Internet Archive www.archive.org 27,000+ works in Latin; 1 billion words.

  12. Intellectual Access • Large-scale linguistic analysis • Tracking language change in 2000 years of Latin • Downstream computational tasks • Automatically creating dynamic bilingual dictionaries • Discovering textual allusions

  13. Tracking Language Change • Lexical change (new vocabulary, shift in the meanings of words) • Syntactic change (including the influence of the author’s first language on Latin syntax) • Topical change (the rise of new genres) • Identifying the flow of information. E.g., Cicero and Augustine influencing Petrarch; Petrarch influencing Leonardo Bruni.

  14. 6,385 Latin works in the Internet Archive, charted by date of publication.

  15. 6,385 Latin works in the Internet Archive, charted by date of composition.

  16. “America” (1,006)

  17. “de” (2,955,462)

  18. “ad” (3,655,191)

  19. “in” (8,126,487)

  20. “et” (9,317,773)

  21. Vocabulary density in Latin authors from 200 BCE to 1900 CE (Type-Token Ratio)
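The type-token ratio named on this slide is the number of distinct word forms (types) divided by the number of running words (tokens). The following is a minimal sketch, not the code behind the chart: the tokenizer, the function name `type_token_ratio`, and the sample string are illustrative assumptions. It counts inflected forms as separate types, with no lemmatization.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by running words (tokens)."""
    # Assumed tokenizer: lowercase, keep runs of Unicode letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 10 tokens, 8 distinct forms ("arma" occurs three times).
sample = "arma virumque cano Troiae qui primus ab oris arma arma"
print(type_token_ratio(sample))
```

Because the ratio falls as a text gets longer, comparing authors fairly usually means computing it over samples of equal size.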

  22. Intellectual Access • Large-scale linguistic analysis • Tracking language change in 2000 years of Latin • Computational tasks to extract information from texts • Automatically creating dynamic bilingual dictionaries • Discovering textual allusions

  23. Use #1: Automatically Building Bilingual Dictionaries • Based on parallel text analysis: aligning source texts (here, in Greek and Latin) to translations (English, Spanish, etc.) • Driven mainly by statistical machine translation for modern languages.

  24. Parallel Text Data The Internet Archive alone contains editions of Horace’s Odes in eight different languages. • Latin: carpe diem quam minimum credula postero (Horace, Ode 1.11) • English: Seize the present; trust tomorrow e’en as little as you may (Conington 1872) • French: Cueille le jour, et ne crois pas au lendemain (De Lisle 1887) • Early Modern French: Jouissez donc en repos du jour present, & ne vous attendez point au lendemain (Dacier 1681) • Italian: tu l’oggi goditi: e gli stolti al domani s’affidino (Chiarini 1916) • Spanish: Coge este dia, dando muy poco credito al siguiente (Campos and Minguez 1783) • Portuguese: colhe o dia, do de amanhã mui pouco confiando (Duriense 1807) • German: Pflücke des Tag’s Blüten, und nie traue dem morgenden (Schmidt 1820)

  25. Sense Discovery • SMT based on Brown et al (1990) • Different senses for a word in one language are translated by different words in another. • “Bank” (English) • financial institution = French “banque” • side of a river = French “rive” (e.g., la rive gauche)

  26. Progressive Alignment • Sentence level: Moore’s Bilingual Sentence Aligner (Moore 2002) • aligns sentences that are 1-1 translations of each other w/ high precision (98.5% on a corpus of 10K English-Hindi sentences) • Word level: MGIZA++ (Gao and Vogel 2008) • parallel version of: GIZA++ (Och and Ney 2003) - implementation of IBM Models 1-5.
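The word-level step rests on the IBM models. The simplest of them, IBM Model 1, can be sketched in a few lines of expectation-maximization. This is a toy illustration under stated assumptions, not GIZA++ or MGIZA++: the function names and the three Latin-English sentence pairs are hypothetical, and the NULL source word used in the full model is omitted for brevity.

```python
def train_ibm_model1(pairs, iterations=20):
    """Estimate t(target_word | source_word) by EM over aligned sentence pairs."""
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start every co-occurring (target, source) word pair at a uniform probability.
    prob = {(tw, sw): uniform for src, tgt in pairs for tw in tgt for sw in src}
    for _ in range(iterations):
        count = {}
        total = {}
        # E-step: split each target word's unit of mass across the source words.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(prob[(tw, sw)] for sw in src)
                for sw in src:
                    share = prob[(tw, sw)] / norm
                    count[(tw, sw)] = count.get((tw, sw), 0.0) + share
                    total[sw] = total.get(sw, 0.0) + share
        # M-step: renormalize the expected counts per source word.
        prob = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    return prob


def best_translation(prob, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in prob.items() if sw == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus of sentence-aligned Latin and English.
pairs = [
    (["puella", "cantat"], ["girl", "sings"]),
    (["puella", "currit"], ["girl", "runs"]),
    (["puer", "currit"], ["boy", "runs"]),
]
table = train_ibm_model1(pairs)
for word in ["puella", "cantat", "currit", "puer"]:
    print(word, "->", best_translation(table, word))
```

The full distribution stored for each source word is what makes sense discovery possible: a word that is translated in several distinct ways keeps probability mass on several target words.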

  27. Multilingual Alignment Word-level alignment of Homer’s Odyssey

  28. Interlinear translations

  29. Interlinear translations

  30. Latin/Greek → English Senses

  31. English → Greek/Latin Senses

  32. Automatic Bilingual Dictionaries http://nlp.perseus.tufts.edu/lexicon

  33. Use #2: Allusion detection • Given a large collection of texts, we can apply computational techniques to look at all pairs of sentences in a collection and determine which are most similar (however we define similarity). --- • “Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation ...” (Martin Luther King, Jr. 1963). • “Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal” (Abraham Lincoln, 1863).
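The all-pairs comparison described here can be sketched as a loop over sentence pairs with a pluggable similarity function. In this illustrative sketch the Jaccard measure over word sets is an assumed stand-in for "however we define similarity", the function names are hypothetical, and the two quotations are shortened.

```python
import re
from itertools import combinations


def word_set(sentence):
    """Lowercase a sentence and return its set of word forms."""
    return set(re.findall(r"[^\W\d_]+", sentence.lower()))


def jaccard(a, b):
    """Shared words divided by all words that occur in either sentence."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def rank_pairs(sentences, similarity=jaccard):
    """Score every unordered pair of sentences; most similar pair first."""
    scored = [(similarity(a, b), a, b) for a, b in combinations(sentences, 2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


sentences = [
    "Five score years ago, a great American signed the Emancipation Proclamation",
    "Four score and seven years ago our fathers brought forth on this continent a new nation",
    "Arma virumque cano",
]
ranked = rank_pairs(sentences)
score, first, second = ranked[0]
print(round(score, 3), "|", first, "|", second)
```

The number of pairs grows quadratically with the number of sentences, so a large collection generally needs some filtering or indexing before exhaustive comparison is practical.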

  34. Classical allusion • Arma virumque cano (Vergil, Aeneid 1.1) • (Arms and the man I sing) • μῆνιν ἄειδε θεὰ (Homer, Iliad 1.1) • (Rage sing, goddess) • ἄνδρα μοι ἔννεπε, μοῦσα (Homer, Odyssey 1.1) • (Man me tell, Muse) • Of man’s first disobedience, and the fruit / Of that forbidden tree, whose mortal taste / Brought death into the world, and all our woe, / With loss of Eden, till one greater Man / Restore us, and regain the blissful seat, / Sing, heavenly Muse (Milton, Paradise Lost 1.1-6)

  35. Allusion in Latin poetry Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”) Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) • First, we need to identify the variables to look for: what defines similarity?

  36. #1: Identical words Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”)

  37. #2: Word order Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”)

  38. #3: Syntax Arma -que bella edere (Ovid) Arma virumque cano (Vergil)

  39. #4: Meter/phonetic similarity • Ārmă grăvī nŭmĕrō || … • Ārmă vĭrūmqŭe cănō || …

  40. #5: Semantic similarity Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”) Both are about war (violenta bella) and the instruments of war (arma).

  41. Translate traditional variables into computational terms • Identical words = token similarity • Word order = ngram similarity • Syntax = dependency tree similarity
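The first two variables can be sketched directly. Token similarity ignores word order, while n-gram similarity rewards words that recur in the same sequence. This is a minimal illustration with hypothetical function names, using the Ovid and Catullus lines quoted later in the deck. Dependency tree similarity requires parsed input and is not sketched here.

```python
import re


def tokenize(line):
    """Lowercase a line and split it into word forms, dropping punctuation."""
    return re.findall(r"[^\W\d_]+", line.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_tokens(a, b):
    """Token similarity: word forms that occur in both lines, in any order."""
    return set(tokenize(a)) & set(tokenize(b))


def shared_ngrams(a, b, n=2):
    """N-gram similarity: runs of n words that occur in both lines."""
    return set(ngrams(tokenize(a), n)) & set(ngrams(tokenize(b), n))


ovid = "nulli illum iuvenes, nullae tetigere puellae"
catullus = "nulli illum pueri, nullae optavere puellae"

print(sorted(shared_tokens(ovid, catullus)))
print(sorted(shared_ngrams(ovid, catullus)))
```

The two lines share four word forms but only one bigram, because the substituted words (iuvenes for pueri, tetigere for optavere) interrupt every other two-word sequence. The sketch matches exact surface forms only, so in an inflected language like Latin it misses matches between different forms of the same lemma.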

  42. Allusion Discovery • Test corpus of Latin poets from the Perseus digital library. • Data syntactically parsed using McDonald et al’s MSTParser (2005), trained on data from the Latin Dependency Treebank.

  43. Discovery • nulli illum iuvenes, nullae tetigere puellae (Ov., Met. 3.353) • “No youths, no girls touched him.” • idem cum tenui carptus defloruit ungui, / nulli illum pueri, nullae optavere puellae (Cat., Carm. 62) • “This same one withered when plucked by a slender nail; no boys, no girls hope for it.”

  44. Discovery

  45. Arma gravi numero ... • Arma gravi numero violentaque bella parabam / Edere ... (Ov., Amores 1.1) • 1. Arma procul currusque virum miratur inanes (.059) (Verg., Aen. 6.651) - “At a distance he marvels at the arms and the shadowy chariots of men” • 2. Quid tibi de turba narrem numeroque virorum (.042) (Ov., Ep. 16.183) - “What could I tell you of the crowd and the number of men?” • 11. Arma virumque cano, Troiae qui primus ab oris Italiam, fato profugus, Laviniaque venit litora, multum ille et terris iactatus et alto vi superum saevae memorem Iunonis ob iram (.025) (Aen. 1.1) - “I sing of arms and the man ...”

  46. Summary: elements of computational philology

  47. Tomorrow II. Linguistic Annotation of Classical Texts • how traditional (non-computational) scholars in Classical Studies can get involved in digital philological projects.
