Building and Using an Inuktitut-English Parallel Corpus
Joel Martin, Howard Johnson, Benoit Farley & Anna Maclachlan
<firstname>.<lastname>@nrc.gc.ca
Agglutinative written form
qaisaaliniaqquunngikkaluaqpuq
Root - suffixes - grammatical suffix: qai-, -saali-, -niaq-, -qquu-, -nngit-, -galuaq-, -puq
“Actually, he probably won’t come early today.”
Nunavut Hansards
• 155 days of Nunavut Legislative Assembly
• April 1, 1999 to November 1, 2002
Difficulties Aligning Inuktitut Hansards
• No spelling checkers
• Many dialects (translators)
  • “School”: ilinniarvik, ilisavik, ilinniaqvik, ilitarvik, ilinniavik
• Words
  • 1:1 word alignment is not usually possible
  • No root dictionary for Eastern Canada
• Lengths
  • Aligning by length in words is not a good idea
  • Aligning by length in characters: average length ratio = 1.05
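The contrast between word lengths and character lengths can be checked on the example sentence from the earlier slide. This is only an illustrative sketch: the helper names `word_len` and `char_len` are hypothetical, counting non-space characters is an assumption, and a single sentence pair will not reproduce the corpus-wide average of 1.05.

```python
# Example sentence pair taken from the "Agglutinative written form" slide.
inuktitut = "qaisaaliniaqquunngikkaluaqpuq"
english = "Actually, he probably won't come early today."


def word_len(text):
    """Length in whitespace-separated words."""
    return len(text.split())


def char_len(text):
    """Length in non-space characters (an assumed way of counting)."""
    return sum(1 for ch in text if not ch.isspace())


# Ratio of Inuktitut length to English length under each measure.
word_ratio = word_len(inuktitut) / word_len(english)
char_ratio = char_len(inuktitut) / char_len(english)

print(f"words: {word_len(inuktitut)} vs {word_len(english)} (ratio {word_ratio:.2f})")
print(f"chars: {char_len(inuktitut)} vs {char_len(english)} (ratio {char_ratio:.2f})")
```

One Inuktitut word stands against seven English words here, while the character counts are much closer, which is why character length is the more usable signal for length-based alignment.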
Alignment Techniques
• Length Alignment (Gale and Church, 1993)
  • Gaussian to estimate matching probability
  • Dynamic programming to optimize the match
• Lexical Alignment
  • Non-alphabetic sequences (9:00, 42-1(1) and 1999)
  • 8 reliable word correspondences
    • speaker/uqaqti
    • motion/pigiqati
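The length-alignment step can be sketched as follows. This is a simplified illustration in the style of Gale and Church (1993), not the authors' implementation. The constants are assumptions: `C` reuses the 1.05 character ratio from the previous slide (its direction is assumed), while `S2` and `PRIORS` are the values reported by Gale and Church for their own data and serve here only as placeholders. Inputs are sentence lengths in characters.

```python
import math

# Assumed parameters, not given on the slides.
C = 1.05   # expected target characters per source character
S2 = 6.8   # variance of the length difference per source character
# Prior probability of each bead type: (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def match_cost(len_src, len_tgt, prior):
    """Negative log of prior times the two-sided Gaussian tail probability."""
    mean = (len_src + len_tgt / C) / 2.0
    if mean == 0:
        return -math.log(prior)
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # 2 * (1 - Phi(|delta|)) expressed with the complementary error function.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300)) - math.log(prior)


def align(src_lens, tgt_lens):
    """Dynamic programming over bead types; returns lists of sentence indices."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Two source sentences merged into one target sentence, then a 1:1 match.
print(align([40, 40, 60], [84, 63]))
```

In practice the lexical cues listed above (numbers, times, and the reliable word pairs) would be used alongside these length costs; how the two are combined is not specified on the slide and is left out of this sketch.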
Is the alignment useful?
• Term Dictionary
  • Few contemporary dictionaries
  • Few with roots and suffixes (Eastern Arctic)
  • Spelling differences, dialectal differences
• Examples:
  • -kiaq “don’t know”
  • tukisi- “understand”
  • -juma- “want”
  • maligaliur(vi)- “assembly”
  • piita “Peter”
  • kanata- “Canada”
  • makalain “McLean”
What is a term?
• Inuktitut Terms
  • Words, phrases of 2 to 4 words
  • Prefixes, internal substrings, final substrings < 10 characters
• English Terms
  • Words, phrases of 2 to 4 words
  • Prefixes
All against all
• Compare every Inuktitut term with every English term
• Slow, with big files of partial results
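The candidate terms described above can be enumerated roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the "< 10 characters" limit is applied to every substring type (the slide is ambiguous on this), and the hyphen marking (`qai-`, `-niaq-`, `-puq`) simply mirrors the notation used on the slides.

```python
def word_ngrams(tokens, max_n=4):
    """Single words plus phrases of 2 to max_n consecutive words."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def substrings(word, max_len=9):
    """Prefixes, internal substrings and final substrings under 10 characters.

    A trailing hyphen marks material cut off on the right, a leading hyphen
    marks material cut off on the left. The whole word is skipped because
    word_ngrams already covers it.
    """
    out = set()
    for start in range(len(word)):
        for end in range(start + 1, min(len(word), start + max_len) + 1):
            if start == 0 and end == len(word):
                continue
            left = "" if start == 0 else "-"
            right = "" if end == len(word) else "-"
            out.add(left + word[start:end] + right)
    return out


def prefixes(word):
    """Proper prefixes of a word, used here for the English side."""
    return {word[:k] + "-" for k in range(1, len(word))}


inuktitut_terms = substrings("qaisaaliniaqquunngikkaluaqpuq")
english_terms = word_ngrams("he probably will not come".split()) | prefixes("coming")
print(len(inuktitut_terms), len(english_terms))
```

Because each long Inuktitut word yields many substrings, pairing every Inuktitut candidate with every English candidate grows quickly, which is consistent with the slide's remark that the all-against-all comparison is slow.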
Consistent Translations
$$\mathrm{PMI}(I, E) = \log \frac{\Pr(I \,\&\, E)}{\Pr(I)\,\Pr(E)}$$
Confidence interval around ratios (95%)
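The PMI score can be computed from co-occurrence counts over aligned sentence pairs. The sketch below assumes each probability is estimated as the fraction of aligned pairs in which a term occurs; that estimator, the function name `pmi`, and the toy data are illustrative assumptions. The 95% confidence interval is not specified in enough detail on the slide to reproduce, so it is omitted.

```python
import math


def pmi(pairs, inuktitut_term, english_term):
    """Pointwise mutual information of two terms over aligned sentence pairs.

    pairs: list of (inuktitut_terms, english_terms), each a set of the
    candidate terms found on that side of one aligned sentence pair.
    """
    n = len(pairs)
    n_i = sum(1 for i_terms, _ in pairs if inuktitut_term in i_terms)
    n_e = sum(1 for _, e_terms in pairs if english_term in e_terms)
    n_ie = sum(1 for i_terms, e_terms in pairs
               if inuktitut_term in i_terms and english_term in e_terms)
    if n_ie == 0:
        return float("-inf")  # the terms never co-occur
    return math.log((n_ie / n) / ((n_i / n) * (n_e / n)))


# Invented toy co-occurrence data, using term spellings from the slides.
toy_pairs = [
    ({"tukisi-"}, {"understand"}),
    ({"tukisi-", "-juma-"}, {"understand", "want"}),
    ({"-juma-"}, {"want"}),
    ({"-kiaq"}, {"know"}),
]

print(pmi(toy_pairs, "tukisi-", "understand"))  # terms that always co-occur
print(pmi(toy_pairs, "tukisi-", "want"))        # terms that co-occur by chance
```

A pair that co-occurs more often than independence would predict gets a positive score, while a pair that co-occurs at the chance rate scores zero.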
Glossary Results
• 4362 term pairs
• 72.3% of English word occurrences (but…)
• Exact matches (43%): a) half were uninflected proper nouns; b) inuup and person’s.
• Good (more in the Inuktitut) matches (44%): pigiaqtitara and deal (“I deal with him”).
Summary
http://www.InuktitutComputing.ca/NunavutHansard/en/
• Sentence alignment of an agglutinative language.
• Use of the sentence alignment to build a glossary.
  • -lauqsimanngit- “have never”
  • inuliriji- “social worker”
  • -kiaq “don’t know”
  • nuu juak “New York”
  • tusaumajjutilirinirmut kanngunaqtulirinirmullu (kamis-) “Information and Privacy Commissioner”