
The 2003 TIDES Surprise Language Exercise



Presentation Transcript


  1. The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland CLEF 2003

  2. Outline • Thinking out of the box • Some results • Lessons learned

  3. Surprise Language Framework • Zero-resource start (treasure hunt) • Time constrained (10 or 29 days) • English Users / Documents in language X • Character-coded text • Research-oriented • Intensely collaborative (team-based)

  4. Schedule • Cebuano: Announce Mar 5; Stop Work Mar 14; Newsletter April; Talks May 30 (HLT) • Hindi: Announce Jun 1; Test Data Jun 27; Stop Work Jun 30; Newsletter August; Talks Aug 5 (TIDES PI); Papers Aug 15 (TALIP)

  5. 16 Participating Teams • Cebuano and Hindi: ISI, Maryland, NYU, Johns Hopkins, Sheffield, LDC, CMU, UC Berkeley, MITRE • Hindi only: U Mass, Alias-i, BBN, IBM, CUNY, KAT, SPAWAR

  6. Five evaluated tasks • Automatic CLIR (English queries) • Topic tracking (English examples, event-based) • Machine translation into English • English “Headline” generation • Entity tagging (five MUC types) • Several useful components: POS tags, morphology, time expressions, parsing • Several demonstration systems: Interactive CLIR (two systems); Cross-language QA (English Q, translated A); Machine translation (+ translation elicitation); Cross-document entity tracking

  7. Hindi Participants

  8. Innovation Cycle [Diagram: research results, systems, and coordination driven by a resource-harvesting strategy (push, organize, talk, capture), processing knowledge for translation, detection, extraction, and summarization from people, corpora, the web, lexicons, and books over time]

  9. The Synchronization Challenge

  10. Cebuano MT Results [Chart: MT performance by training resource: Bible, Cebuano book, dictionary, Melamed, news]

  11. Cebuano Interactive CLIR • Starting Point: iCLEF 2002 system (German) • Interface: “synonyms”/examples (parallel)/MT • Back end: InQuery/Pirkola’s method • 3-day porting effort • Cebuano indexing (no stemming) • One-best gloss translation (bilingual term list) • Informal Evaluation • 2 Cebuano native speakers (at ISI)
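Pirkola’s method, mentioned as the back end above, treats all translation alternatives of an English query term as synonyms, so term and document frequencies are pooled across alternatives and ambiguous terms are not over-weighted. A minimal sketch of rendering such a query in InQuery-style syntax; the bilingual entries and helper name are illustrative, not the actual term list used in the exercise:

```python
# Pirkola's method: wrap each source term's translation alternatives
# in a synonym operator so they share one document-frequency estimate.
BILINGUAL = {  # hypothetical English -> Cebuano bilingual term list entries
    "election": ["eleksyon", "piniliay"],
    "result": ["resulta"],
}

def structured_query(english_terms, term_list):
    """Render an InQuery-style structured query: one #syn node per English term."""
    nodes = []
    for term in english_terms:
        alts = term_list.get(term, [term])  # untranslatable terms pass through
        nodes.append("#syn(" + " ".join(alts) + ")" if len(alts) > 1 else alts[0])
    return "#sum(" + " ".join(nodes) + ")"

print(structured_query(["election", "result"], BILINGUAL))
# -> #sum(#syn(eleksyon piniliay) resulta)
```

Because the #syn node is scored as a single pseudo-term, a term with five dictionary translations gets no more weight than one with a single translation.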

  12. Hindi syntax is generally very “regular” • Subject – Object – Verb is the preferred order • John saw Mary. = जॉन ने मेरी को देखा। • Presence of (occasionally deleted) case markers often permits reordering • John saw Mary. = मेरी को जॉन ने देखा। • English (or Western) punctuation is pervasive in many modern texts • John said, “I am here.” = जॉन ने कहा, “मैं यहाँ हूँ।” • The subject may be omitted in some contexts • A: Where is John? B: [He] went home. • अ: जॉन कहाँ है? ब: [वह] घर चला गया।

  13. Hindi Encoding • Text encoding for storage and transmission is separated from text rendering for display and printing • Which syllable constituents get their own code-points? • Several 8-bit encodings: after assigning a code point to each stand-alone vowel and full consonant, and to half-consonants and vowels within a syllable, spare code-points are used for assorted frequent consonant clusters • Unicode UTF-16: only stand-alone vowels, full consonants, and vowels within syllables have their own code-points; all half-consonants are realized by a “full consonant + halant” sequence • Choice of the “grammar” for syllable construction and rendering? • Several 8-bit encodings write the code-points in display order, simplifying the rendering program • Unicode writes them in pronunciation order, making for a considerably more complex display program
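The Unicode point above can be shown directly: a Devanagari conjunct such as क्ष (ksha) has no code point for the half consonant; it is stored in pronunciation order as full consonant + halant (virama) + consonant. A small sketch (the helper name is an assumption, not from the original):

```python
import unicodedata

# Unicode stores Devanagari syllables in pronunciation order; a half
# consonant is written as full consonant followed by halant (virama, U+094D).
def decompose(syllable):
    """List the Unicode character name of each code point in a syllable."""
    return [unicodedata.name(ch) for ch in syllable]

# The conjunct "ksha" (क्ष) = half KA + SSA, encoded as KA + VIRAMA + SSA.
ksha = "\u0915\u094d\u0937"
for name in decompose(ksha):
    print(name)
```

A renderer must therefore scan ahead past the virama to decide that क should be drawn in its half form, which is why pronunciation-order display logic is more complex than display-order 8-bit schemes.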

  14. Hindi Week 1: Porting • Monday • 2,973 BBC documents (UTF-8) • Batch CLIR (no stemming; 2/3 of known items at rank 1) • Tuesday • MIRACLE (“ITRANS”, gloss) • Stemmer (implemented from a paper) • Wednesday • BBC CLIR collection (19 topics, known item) • Friday • Parallel text (Bible: 900k words, Web: 4k words) • Devanagari OCR system

  15. Hindi Weeks 2/3/4: Exploration • N-grams (trigrams best for UTF-8) • Relative Average Term Frequency (Kwok) • Scanned bilingual dictionary (Oxford) • More topics for test collection (29) • Weighted structured queries (IBM lexicon) • Alternative stemmers (U Mass, Berkeley) • Blind relevance feedback • Transliteration • Noun phrase translation • MIRACLE integration (ISI MT, BBN headlines)
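The n-gram approach listed above sidesteps stemming and segmentation by using overlapping character sequences as index terms. A minimal sketch of character trigram extraction; the function name and the boundary-marking choice are assumptions, not the exercise’s actual implementation:

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams as language-independent index terms."""
    text = text.replace(" ", "_")  # mark word boundaries so grams can span them
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# "sarkaar" (government) yields three overlapping trigrams
print(char_ngrams("सरकार"))
```

Because every surface form of a word shares most of its trigrams with related forms, this gives stemming-like matching for free, at the cost of a larger index.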

  16. Formative Evaluation

  17. Transliteration • Importance: names, loan words • दक्षिण कोरिया (Dakshin Korea) • Pronunciation crosswalk English -> Hindi • English pronunciation (Festival) • Overgenerate Hindi characters (hand-built rules) • Doctor => d aa k t ax r OR d ao k t ax r • Rank n-best using bigrams (Hindi name list) • Treat as alternate translations for CLIR • Pirkola’s method
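The overgenerate-and-rank pipeline above can be sketched as follows; the phone-to-character map, the bigram scores, and all function names are hypothetical stand-ins for the hand-built rules and the bigram model trained on a Hindi name list:

```python
from itertools import product
from math import log

# Hypothetical hand-built phone -> Devanagari options (illustrative subset).
PHONE_MAP = {"d": ["द", "ड"], "aa": ["ा"], "k": ["क"],
             "t": ["ट", "त"], "ax": [""], "r": ["र"]}

def overgenerate(phones):
    """Cross product of per-phone character options: all candidate spellings."""
    return ["".join(chars) for chars in product(*(PHONE_MAP[p] for p in phones))]

def bigram_score(cand, bigram_logprob):
    """Score a candidate with a character-bigram model; unseen pairs get a floor."""
    return sum(bigram_logprob.get(pair, log(1e-6)) for pair in zip(cand, cand[1:]))

def transliterate(phones, bigram_logprob, n_best=3):
    cands = overgenerate(phones)
    return sorted(cands, key=lambda c: bigram_score(c, bigram_logprob),
                  reverse=True)[:n_best]

# "doctor" -> d aa k t ax r (phone string from the slide)
phones = ["d", "aa", "k", "t", "ax", "r"]
bigrams = {("क", "ट"): log(0.2), ("क", "त"): log(0.05)}  # toy name-list counts
print(transliterate(phones, bigrams))
```

The n-best spellings are then treated as alternate translations and folded into the query with Pirkola’s method, exactly as dictionary translations are.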

  18. Some Challenges • Formative evaluation • Synchronize variable-rate efforts • Soccer, not football • Integration • Capturing lessons learned • See the forest, not just the trees

  19. For More Information • TIDES Newsletter • Cebuano: April • Hindi: August • Papers • NAACL/HLT Short paper • MT Summit (late Sep) • ACM TALIP Special Issue • Demonstration systems • Contact individual sites
