Building and Using an Inuktitut-English Parallel Corpus

Joel Martin, Howard Johnson, Benoit Farley & Anna Maclachlan <firstname>.<lastname>@nrc.gc.ca

Presentation Transcript


  1. Building and Using an Inuktitut-English Parallel Corpus
     Joel Martin, Howard Johnson, Benoit Farley & Anna Maclachlan
     <firstname>.<lastname>@nrc.gc.ca

  2. Agglutinative written form
     qaisaaliniaqquunngikkaluaqpuq
     Root - suffixes - grammatical suffix: qai-, -saali-, -niaq-, -qquu-, -nngit-, -galuaq, -puq
     “Actually, he probably won’t come early today.”

  3. Nunavut Hansards
     • 155 days of Nunavut Legislative Assembly proceedings
     • April 1, 1999 to November 1, 2002

  4. Difficulties Aligning Inuktitut Hansards
     • No spelling checkers
     • Many dialects (translators)
       “School”: ilinniarvik, ilisavik, ilinniaqvik, ilitarvik, ilinniavik
     • Words
       1:1 word alignment is not usually possible
       No root dictionary for Eastern Canada
     • Lengths
       Aligning by length in words is not a good idea
       Aligning by length in characters: average = 1.05

  5. Alignment Techniques
     • Length alignment (Gale and Church, 1993)
       • Gaussian to estimate matching probability
       • Dynamic programming to optimize the match
     • Lexical alignment
       • Non-alphabetic sequences (9:00, 42-1(1) and 1999)
       • 8 reliable word correspondences
         • speaker/uqaqti
         • motion/pigiqati
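The length-based step on slide 5 can be sketched as follows. This is a minimal illustration of the Gale and Church (1993) idea, not the authors' implementation: the bead priors and the variance of 6.8 are the values published by Gale and Church, while the function names `match_cost` and `align` and the `mean_ratio` parameter are assumptions made for this example. Slide 4 reports an average character-length figure of 1.05 but does not state its direction, so the ratio is left as a parameter defaulting to 1.0.

```python
import math

# Prior probability of each bead type (source sentences, target sentences),
# as published by Gale and Church (1993).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def match_cost(len_src, len_tgt, mean_ratio=1.0, variance=6.8):
    """Negative log of the two-tailed Gaussian probability of this length difference."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / mean_ratio) / 2.0
    delta = (len_tgt - len_src * mean_ratio) / math.sqrt(mean * variance)
    tail = math.erfc(abs(delta) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|delta|))
    return -math.log(max(tail, 1e-300))            # floor avoids log(0)


def align(src_lens, tgt_lens, mean_ratio=1.0, variance=6.8):
    """Dynamic programming over sentence lengths (in characters).

    Returns a list of beads, each a pair (source indices, target indices).
    """
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                c = (cost[i - di][j - dj]
                     + match_cost(sum(src_lens[i - di:i]),
                                  sum(tgt_lens[j - dj:j]),
                                  mean_ratio, variance)
                     - math.log(prior))
                if c < cost[i][j]:
                    cost[i][j] = c
                    back[i][j] = (di, dj)
    # Walk the back-pointers from the end to recover the best path.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]


# Usage: lengths are character counts of consecutive sentences in each language.
beads = align([40, 45, 100], [86, 101])
```

In a pipeline like the one the slides describe, the lexical cues (numbers, speaker/uqaqti and similar pairs) would presumably constrain or re-score this search; that combination is not shown here because the slides do not specify it.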

  6. Initial Alignment Results

  7. Is the alignment useful?
     • Term dictionary
       • Few contemporary dictionaries
       • Few with roots and suffixes (Eastern Arctic)
       • Spelling differences, dialectal differences
     • Examples:
       • -kiaq “don’t know”
       • tukisi- “understand”
       • -juma- “want”
       • maligaliur(vi)- “assembly”
       • piita “Peter”
       • kanata- “Canada”
       • makalain “McLean”

  8. What is a term?
     • Inuktitut terms
       • Words, phrases of 2 to 4 words
       • Prefixes, internal substrings, final substrings < 10 characters
     • English terms
       • Words, phrases of 2 to 4 words
       • Prefixes
     • All against all
       • Consider every Inuktitut term against every English term
       • Slow, with big files of partial results
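The candidate terms listed on slide 8 could be enumerated roughly as below. This is a sketch under stated assumptions: the function names are hypothetical, whitespace tokenization is assumed, and the "< 10 characters" limit is read as applying to all three substring types (the slide is ambiguous on this point). English prefixes are left unbounded because the slide gives no limit for them.

```python
def word_ngrams(tokens, min_n=2, max_n=4):
    """Phrases of min_n to max_n consecutive words."""
    return {" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def substrings(word, max_len=9):
    """Prefixes, internal substrings and final substrings of at most max_len characters."""
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + 1, min(len(word), i + max_len) + 1)}


def inuktitut_terms(sentence):
    tokens = sentence.split()            # assumption: whitespace tokenization
    terms = set(tokens) | word_ngrams(tokens)
    for token in tokens:
        terms |= substrings(token)
    return terms


def english_terms(sentence):
    tokens = sentence.lower().split()    # assumption: case-folded, whitespace tokens
    terms = set(tokens) | word_ngrams(tokens)
    for token in tokens:
        terms |= {token[:k] for k in range(1, len(token) + 1)}
    return terms
```

Because every Inuktitut term is then compared with every English term, the substring sets grow quickly with sentence length, which is consistent with the slide's remark that the all-against-all step is slow.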

  9. Consistent Translations
     $\mathrm{PMI} = \log \dfrac{\Pr(I \,\&\, E)}{\Pr(I) \cdot \Pr(E)}$
     Confidence interval around ratios (95%)
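The PMI score on slide 9 can be estimated from co-occurrence counts over aligned sentence pairs. The sketch below assumes probabilities are estimated as the fraction of aligned pairs in which a term (or a term pair) occurs; the slides do not say how the authors estimated them, and the name `pmi_table` is made up for this example. The 95% confidence interval is omitted because the slide does not say how it was computed.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """Score term pairs by pointwise mutual information.

    aligned_pairs: iterable of (inuktitut_terms, english_terms), one entry
    per aligned sentence pair, each side being a collection of terms.
    """
    n = 0
    count_i, count_e, count_ie = Counter(), Counter(), Counter()
    for i_terms, e_terms in aligned_pairs:
        n += 1
        i_terms, e_terms = set(i_terms), set(e_terms)
        count_i.update(i_terms)
        count_e.update(e_terms)
        count_ie.update((i, e) for i in i_terms for e in e_terms)

    scores = {}
    for (i, e), joint in count_ie.items():
        # PMI = log( Pr(I & E) / (Pr(I) * Pr(E)) )
        scores[(i, e)] = math.log((joint / n) /
                                  ((count_i[i] / n) * (count_e[e] / n)))
    return scores


# Toy data for illustration only (the counts are invented, not from the corpus).
toy = [({"uqaqti"}, {"speaker"}), ({"uqaqti"}, {"speaker"}),
       ({"pigiqati"}, {"motion"}), ({"pigiqati"}, {"motion"})]
scores = pmi_table(toy)
```

Pairs that never co-occur are simply absent from the result, and rare pairs can receive inflated scores, which is presumably why the slide pairs PMI with a confidence interval.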

  10. Glossary Results
     • 4362 term pairs
     • 72.3% of English word occurrences (but…)
     • Exact matches (43%):
       a) half were uninflected proper nouns
       b) inuup and person’s
     • Good matches, with more in the Inuktitut (44%):
       pigiaqtitara and deal (“I deal with him”)

  11. Summary
     http://www.InuktitutComputing.ca/NunavutHansard/en/
     • Sentence alignment of an agglutinative language.
     • Use of the sentence alignment to build a glossary.
     • -lauqsimanngit- “have never”
     • inuliriji- “social worker”
     • -kiaq “don’t know”
     • nuu juak “New York”
     • tusaumajjutilirinirmut kanngunaqtulirinirmullu (kamis-) “Information and Privacy Commissioner”