Lost Language Decipherment Kovid Kapoor – 08005037 Aashimi Bhatia – 08D04008 Ravinder Singh – 08005018 Shaunak Chhaparia – 07005019
Outline • Examples of ancient languages which were lost • Motivation : Why should we bother about such languages? • The Manual Process of Decipherment • Motivation for a Computational Model • A Statistical Method for Decipherment • Conclusions
What is a "lost" language? • A language is said to be “lost” when modern scholars cannot reconstruct text written in it. • Slightly different from a “dead” language – a language which people can still translate to/from, but which no one uses in everyday life any more. • Generally happens when one language gets replaced by another. • For example, native American languages were replaced by English, Spanish etc.
Examples of Lost Languages • Egyptian Hieroglyphs • A formal writing system used by the ancient Egyptians, consisting of logographic and alphabetic symbols. • Finally deciphered in the early 19th century, following the lucky discovery of the “Rosetta Stone”. • Ugaritic Language • Tablets with engravings found in the lost city of Ugarit, Syria. • Researchers recognized that the language was related to Hebrew, and could identify some parallel words.
Examples of Lost Languages (cont.) • Indus Script • Written in and around present-day Pakistan, around 2500 BC • Over 4000 samples of the text have been found. • Still not deciphered successfully! • What makes it difficult to decipher? http://en.wikipedia.org/wiki/File:Indus_seal_impression.jpg
Motivation for Decipherment of Lost Languages • Historical knowledge expansion • Very helpful in learning about the history of the place where the language was written. • Alternative sources of information : coins, drawings, buried tombs. • These sources are not as precise as reading the literature of the region, which gives a much clearer picture. • Learning about the past explains the present • A lot of the culture of a place is derived from ancient cultures. • Boosts our understanding of our own culture.
Motivation for Decipherment of Lost Languages (cont.) • From a linguistic point of view • We can figure out how certain languages developed through time. • The origins of some words can be explained.
The Manual Process • Similar to a cryptographic decryption process • Frequency-analysis-based techniques are used • First step : identify the writing system • Logographic, alphabetic, or syllabic? • Usually determined by the number of distinct symbols (see the sketch below). • Identify whether there is a closely related known language • Hope for finding bitexts : translations of a text of the language into a known language, like Latin, Hebrew etc. http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
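A minimal Python sketch of the symbol-counting step mentioned above; the thresholds are rough rules of thumb invented for illustration, not established cutoffs.

```python
def guess_writing_system(words):
    """Count distinct symbols in a transcribed corpus and guess the script type."""
    symbols = {ch for word in words for ch in word}
    n = len(symbols)
    if n < 40:
        return n, "likely alphabetic"
    if n < 100:
        return n, "likely a syllabary"
    return n, "likely logographic"

# Ugaritic, with about 30 distinct symbols, was guessed to be alphabetic.
print(guess_writing_system(["15234", "1525", "4352"]))  # (5, 'likely alphabetic')
```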
Examples of Manual Decipherment : Egyptian Hieroglyphs • Earliest attempt made by Horapollo in the 5th century. • However, his explanations were mostly wrong! • They proved to be an impediment to the process for 1000 years! • Arab historians were able to partly decipher the script in the 9th and 10th centuries. • Major breakthrough : discovery of the Rosetta Stone by Napoleon’s troops. http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
Examples of Manual Decipherment : Egyptian Hieroglyphs • The stone carries a decree issued by the king in three scripts : hieroglyphs, Demotic, and ancient Greek! • Finally deciphered in 1822 by Jean-François Champollion. • Note that even with the availability of a bitext, full decipherment took more than 20 years! http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Rosetta_Stone_BW.jpeg/200px-Rosetta_Stone_BW.jpeg
Examples of Manual Decipherment : Ugaritic • The inscribed words consisted of only 30 distinct symbols. • Very likely to be alphabetic. • The location of the tablets suggested that the language was closely related to the Semitic languages • Some words in Ugaritic had the same origin as words in Hebrew • For example, the Ugaritic word for king is the same as the Hebrew word. http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
Examples of Manual Decipherment : Ugaritic (cont.) • Lucky discovery : Hans Bauer guessed that the writing on an excavated axe was the word “axe”! • This led to revision of some earlier hypotheses, and resulted in decipherment of the entire script! http://knp.prs.heacademy.ac.uk/images/cuneiformrevealed/scripts/ugaritic.jpg
Conclusions on the Manual Process • A very time-consuming exercise; successful decipherments have taken years, even centuries. • Even when some basic information about the language is known, such as its syntactic structure or a closely related language, a long time is required to produce character and word mappings.
Need for a Computerised Model • Once some knowledge about the language has been learnt, is it possible to use a program to produce word mappings? • Can the knowledge of a closely related language be used to decipher a lost language? • If possible, this would save a lot of effort and time. • “Successful archaeological decipherment has turned out to require a synthesis of logic and intuition…that computers do not (and presumably cannot) possess.” – Andrew Robinson
Recent attempts : A Statistical model • Notice that manual efforts have some guiding principles • A common starting point is to compare letter and word frequencies with a known language • Morphological analysis plays a crucial role as well • Highly frequent morpheme correspondences can be particularly revealing. • The model tries to capture these letter/word level mappings and morpheme correspondences. http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Problem Formulation • We are given a corpus in the lost language, and a non-parallel corpus in a related language from the same family. • Our primary goals : • Finding the mapping between the alphabets of the lost and the known language. • Translating words in the lost language into corresponding cognates in the known language http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Problem Formulation • We make several assumptions in this model : • That the writing system is alphabetic in nature • Can be easily verified by counting the number of symbols in the found record. • That the corpus has been transcribed into an electronic format • Means that each character is uniquely identified. • About the morphology of the language : • Each word consists of a stem, prefix and suffix, where the latter two may be omitted • Holds true for a large variety of human languages
Problem Formulation • The morpheme inventories and their frequencies in the known language are given. • In essence, the input consists of two parts : • A list of unanalyzed words in the lost language • A morphologically analyzed lexicon in a known related language (a hypothetical representation is sketched below)
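A minimal sketch of what these two inputs might look like, assuming a hypothetical Python representation; the words, frequencies and field names below are invented for illustration.

```python
from typing import NamedTuple

class AnalyzedWord(NamedTuple):
    prefix: str
    stem: str
    suffix: str
    stem_pos: str  # part of speech of the stem

# Input 1: unanalyzed, transcribed words in the lost language.
lost_corpus = ["15234", "1525", "4352"]

# Input 2: morphologically analyzed lexicon of the known language,
# with (hypothetical) corpus frequencies.
known_lexicon = {
    AnalyzedWord("", "ask", "ed", "VERB"): 120,
    AnalyzedWord("", "ask", "s", "VERB"): 80,
    AnalyzedWord("", "desk", "", "NOUN"): 45,
}
```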
Intuition : A toy example • Consider the following example, consisting of words in a lost language closely related to English, but written using numerals. • 15234 – asked • 1525 – asks • 4352 – desk • Notice the pair of endings, -34 and -5, with the same initial sequence 152- • Might correspond to –ed and –s respectively. • Thus, 3=e, 4=d and 5=s
Intuition : A toy example • Now, we can say that 435=des, and using our knowledge of English, we can suppose that this word is very likely to be desk. • As this example illustrates, we proceed by discovering both character- and morpheme-level mappings. • Another intuition the model should capture is the sparsity of the mapping. • A correct mapping will preserve phonetic relations between the two related languages • Each character in the unknown language will map to a small number of characters in the related language.
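The intuition above can be mimicked with a small hand-rolled Python sketch (not the actual model); the suffix pair, the English vocabulary and the helper names are invented for illustration.

```python
import re

lost_words = ["15234", "1525", "4352"]
known_suffix_pairs = [("ed", "s")]             # e.g. ask-ed vs. ask-s
english_vocab = {"asked", "asks", "desk", "desks"}

def common_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

# Step 1: find two lost words sharing a stem, and align their differing
# endings with a known suffix pair (-34 -> "ed", -5 -> "s").
mapping = {}
for i, w1 in enumerate(lost_words):
    for w2 in lost_words[i + 1:]:
        stem = common_prefix(w1, w2)
        if len(stem) < 2:
            continue
        end1, end2 = w1[len(stem):], w2[len(stem):]
        for s1, s2 in known_suffix_pairs:
            if len(end1) == len(s1) and len(end2) == len(s2):
                mapping.update(zip(end1, s1))
                mapping.update(zip(end2, s2))

print(mapping)                                  # {'3': 'e', '4': 'd', '5': 's'}

# Step 2: partially decode the remaining word and let the known vocabulary
# fill in the unmapped characters.
def candidates(lost_word):
    pattern = "".join(mapping.get(c, ".") for c in lost_word)
    return [w for w in english_vocab if re.fullmatch(pattern, w)]

print(candidates("4352"))                       # ['desk']: 435 decodes to "des"
```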
Model Structure • We assume that each morpheme in the lost language is probabilistically generated jointly with a latent counterpart in the known language • The challenge: each level of correspondence can completely describe the observed data, so using a mechanism based on one leaves no room for the other. • The solution: using a Dirichlet process to model the probabilities (explained below). http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Model Structure (cont…) • There are four basic layers in the generative process • Structural Sparsity • Character-edit Distribution • Morpheme-pair Distributions • Word Generation
Model Structure (cont…) Graphical overview of the Model http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Step 1 : Structural Sparsity • We need to control the sparsity of the edit-operation probabilities, encoding the linguistic intuition that the character-level mapping should be sparse. • The set of edit operations includes character substitutions, insertions and deletions. We assign a variable λe corresponding to every edit operation e. • The set of character correspondences with the variable set to 1, { (u,h) : λ(u,h) = 1 }, conveys a set of phonetically valid correspondences. • We define a joint prior over these variables to encourage sparse character mappings.
Step 1 : Structural Sparsity (cont.) • This prior can be viewed as a distribution over binary matrices and is defined to encourage every row and column to sum to low integer values (typically 1) • For a given matrix, define a count c(u), which is the number of corresponding letters that u has in that matrix. Formally, c(u) = ∑h λ(u,h) • We now define a function fi = max(0, |{u : c(u) = i}| - bi). For any i other than 1, fi should be as low as possible. • Now the probability of this matrix is given by P(λ) = (1/Z) exp( ∑i wi fi )
Step 1 : Structural Sparsity (cont…) • Here Z is the normalization factor and w is the weight vector. • wi is either zero or negative, to ensure that the probability is high for a low value of f. • The values of bi and wi can be adjusted depending on the number of characters in the lost language and the related language.
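A minimal sketch of how this prior could be evaluated, assuming hypothetical thresholds b and weights w (the exact values used in the paper are not reproduced here).

```python
import numpy as np

def log_sparsity_prior(lam, b, w):
    """Unnormalized log P(lam) = sum_i wi * fi, where c(u) = sum_h lam[u, h]
    and fi = max(0, |{u : c(u) = i}| - bi)."""
    c = lam.sum(axis=1)                       # c(u): mappings per lost character
    log_p = 0.0
    for i in set(int(x) for x in c):
        f_i = max(0, int((c == i).sum()) - b.get(i, 0))
        log_p += w.get(i, 0.0) * f_i
    return log_p

# Toy example: 3 lost characters vs. 3 known characters.
lam = np.array([[1, 0, 0],
                [0, 1, 0],
                [0, 1, 1]])                   # the last character maps to 2 letters
b = {1: 3, 2: 0}                              # allow up to 3 one-to-one mappings
w = {1: 0.0, 2: -5.0}                         # penalize characters with 2 mappings
print(log_sparsity_prior(lam, b, w))          # -5.0
```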
Step 2 : Character-Edit Distribution • We now draw a base distribution G0 over character edit sequences. • The probability of a given edit sequence, P(e), depends on the indicator variables of the individual edit operations λe, and on a factor q(#ins(e), #del(e)) that depends on the number of insertions and deletions in the sequence. • This insertion/deletion factor is set according to the average word lengths of the lost language and the related language.
Step 2 : Character-Edit Distribution (cont.) • Example : the average Ugaritic word is 2 letters longer than the average Hebrew word. We therefore set q so as to disallow any deletions and allow 1 insertion per sequence, with probability 0.4 • The part depending on the λes makes the distribution spike at 0 if the indicator is 0 and leaves it unconstrained otherwise (a spike-and-slab prior)
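A minimal sketch of this edit-sequence scoring, under assumed data structures; the symbols, probabilities and the exact form of q are invented for illustration, and only the spike-and-slab gating and the insertion/deletion factor follow the description above.

```python
def q(n_ins, n_del, p_ins=0.4):
    """Hypothetical length factor: no deletions, at most one insertion."""
    if n_del > 0 or n_ins > 1:
        return 0.0
    return p_ins if n_ins == 1 else 1.0 - p_ins

def edit_sequence_prob(edits, lam, sub_prob):
    """edits is a list of ('sub', u, h), ('ins', u) or ('del', h) operations."""
    n_ins = sum(1 for e in edits if e[0] == "ins")
    n_del = sum(1 for e in edits if e[0] == "del")
    p = q(n_ins, n_del)
    for op in edits:
        if op[0] == "sub":
            _, u, h = op
            # spike-and-slab gating: probability collapses to 0 when lambda is 0
            p *= sub_prob.get((u, h), 0.0) if lam.get((u, h), 0) == 1 else 0.0
    return p

# Toy usage with made-up symbols and probabilities:
lam = {("3", "e"): 1, ("4", "d"): 1}
sub_prob = {("3", "e"): 0.5, ("4", "d"): 0.5}
edits = [("sub", "3", "e"), ("sub", "4", "d"), ("ins", "9")]
print(edit_sequence_prob(edits, lam, sub_prob))   # 0.4 * 0.5 * 0.5 = 0.1
```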
Step 3 : Morpheme-Pair Distributions • The base distribution G0, along with a fixed concentration parameter α, defines a Dirichlet process, which provides a distribution over morpheme-pair distributions. • The resulting distributions are likely to be skewed in favor of a few frequently occurring morpheme pairs, while remaining sensitive to the character-level probabilities of the base distribution. • Our model distinguishes between the 3 kinds of morphemes – prefixes, stems and suffixes. We therefore use different values of α for each.
Step 3 : Morpheme-Pair Distributions (cont.) • Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution Gstm for stems, but maintain separate distributions Gsuf|stm and Gpre|stm for each possible stem part-of-speech.
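A minimal sketch of the Dirichlet-process layer through its Chinese restaurant process view (a standard simplification, not the paper's inference code; the morpheme pairs and α value are invented).

```python
import random
from collections import Counter

def sample_morpheme_pair(counts, alpha, draw_from_G0):
    """counts: Counter of previously generated (lost, known) morpheme pairs."""
    total = sum(counts.values())
    if random.random() < alpha / (alpha + total):
        pair = draw_from_G0()                  # new pair from the base distribution
    else:
        # reuse an existing pair with probability proportional to its count
        pairs, weights = zip(*counts.items())
        pair = random.choices(pairs, weights=weights)[0]
    counts[pair] += 1
    return pair

# Toy base distribution over two hypothetical morpheme pairs.
def draw_from_G0():
    return random.choice([("152", "ask"), ("435", "desk")])

counts = Counter()
for _ in range(20):
    sample_morpheme_pair(counts, alpha=1.0, draw_from_G0=draw_from_G0)
print(counts)   # skewed toward a few frequent pairs ("rich get richer")
```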
Step 4 : Word Generation • Once the morpheme-pair distributions have been drawn, actual word pairs may now be generated. • Based on some prior, we first decide whether a word in the lost language has a cognate in the known language. • If it does, then a cognate word pair (u, h) is produced from the morpheme-pair distributions. • Otherwise, a lone word u is generated.
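A minimal sketch of this generative step, with hypothetical sampler functions standing in for the distributions drawn in the previous steps.

```python
import random

def generate_word(p_cognate, sample_stem_pair, sample_prefix_pair,
                  sample_suffix_pair, sample_lone_word):
    """Decide whether the lost word has a cognate; if so, build the pair
    from jointly drawn prefix/stem/suffix morphemes."""
    if random.random() < p_cognate:
        u_stm, h_stm = sample_stem_pair()
        u_pre, h_pre = sample_prefix_pair(h_stm)   # conditioned on the stem
        u_suf, h_suf = sample_suffix_pair(h_stm)
        return u_pre + u_stm + u_suf, h_pre + h_stm + h_suf
    return sample_lone_word(), None                # lone lost-language word

# Toy usage with made-up samplers:
print(generate_word(
    p_cognate=0.8,
    sample_stem_pair=lambda: ("152", "ask"),
    sample_prefix_pair=lambda stem: ("", ""),
    sample_suffix_pair=lambda stem: ("34", "ed"),
    sample_lone_word=lambda: "9999",
))   # ('15234', 'asked') with probability 0.8, else ('9999', None)
```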
Summarizing the Model • This model captures both character and lexical level correspondences, while utilizing morphological knowledge of the known language. • An additional feature of this multi-layered model structure is that each distribution over morpheme pairs is derived from the single character-level base distribution G0. • As a result, any character-level mappings learned from one correspondence will be propagated to other morpheme distributions. • Also, the character-level mappings obey sparsity constraints
Results of the process • Applied to the Ugaritic language • The undeciphered corpus contains 7,386 unique word types. • The Hebrew Bible was used as the known-language corpus; Hebrew is closely related to ancient Ugaritic. • Morphological and POS annotations are assumed to be available for the Hebrew lexicon. http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Results of the process • The method identifies Hebrew cognates for 2,155 words, covering almost one-third of the Ugaritic vocabulary. • The baseline method correctly maps 22 out of 30 characters to their Hebrew counterparts, and correctly translates only 29% of all cognates • This method correctly translates 60.4% of all cognates. • This method yields correct mappings for 29 out of 30 characters.
Future Work • Even with correct character mappings, many words can be correctly translated only by examining their context. • The model currently fails to take contextual information into account. http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Conclusions • We saw how language decipherment is an extremely complex task. • Years of effort are required for the successful decipherment of each lost language. • Success depends on the amount of corpus available in the unknown language. • But availability alone does not make it easy. • The statistical model has shown promise. • It can be developed further and applied to more languages.
References • Wikipedia article on Decipherment of Hieroglyphs http://en.wikipedia.org/wiki/Decipherment_of_hieroglyphic_writing • Lost Languages: The Enigma of the World's Undeciphered Scripts, Andrew Robinson (2009) http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/books/non-fiction/article5859173.ece • A Statistical Model for Lost Language Decipherment, Benjamin Snyder, Regina Barzilay, and Kevin Knight, ACL (2010) http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
References • A staff talk from Straight Dope Science Advisory Board – How come we can’t decipher the Indus Script? (2005) http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script • Wade Davis on Endangered Cultures (2008) http://www.ted.com/talks/wade_davis_on_endangered_cultures.html