Context: Project NICE • Very low-density languages (e.g. Mapudungun, Inupiaq, Siona, …) • Minimal amount of parallel text (< 100K words) • No standard orthography/spelling • No available trained linguists • Access to native informants possible • Minimize development time and cost • Target: functional but rudimentary MT
Two Technical Approaches • Generalized EBMT: parallel text of 50K–2MB (uncontrolled corpus); rapid implementation; proven for major languages with reduced data • Transfer-rule learning: elicitation (controlled) corpus to extract grammatical properties; seeded version-space learning
Architecture Diagram [diagram: the Run-Time Module (SL Parser, EBMT Engine, Transfer Engine, Unifier Module, TL Generator) maps SL Input to TL Output; the Learning Module (Elicitation Process with the user/native informant, SVS Learning Process) produces the Transfer Rules consumed by the Transfer Engine]
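To make the diagram concrete, here is one plausible wiring of the run-time side as a minimal Python sketch. The interfaces are hypothetical (the slide names the modules but not their APIs), and the EBMT-as-fallback routing is an assumption for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class TransferRule:
    sl_pattern: str                                # e.g. "NP -> DET ADJ N" (hypothetical)
    tl_pattern: str                                # corresponding TL constituent order
    agreement: dict = field(default_factory=dict)  # feature-agreement constraints

@dataclass
class RunTimeModule:
    parse: Callable[[str], Optional[object]]       # SL Parser
    transfer: Callable[[object, List[TransferRule]], object]  # Transfer Engine + Unifier
    generate: Callable[[object], str]              # TL Generator
    ebmt_translate: Callable[[str], str]           # EBMT Engine
    rules: List[TransferRule] = field(default_factory=list)   # from the Learning Module

    def translate(self, sl_input: str) -> str:
        tree = self.parse(sl_input)
        if tree is None:                 # no analysis available: fall back to EBMT
            return self.ebmt_translate(sl_input)
        return self.generate(self.transfer(tree, self.rules))
```

In this sketch the Learning Module's output (the transfer rules produced offline by elicitation and SVS learning) is simply the `rules` list handed to the run-time module.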
EBMT Example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
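The recombination step above can be illustrated with a toy greedy matcher. The fragment pairs below are hypothetical entries distilled from the two example sentence pairs, not the actual EBMT engine's data structures:

```python
# Toy EBMT recombination: find the longest known source fragments in the
# new input and splice their stored translations together.
example_base = {
    ("i", "would", "like", "to", "meet"): "Ayükefun trawüael",
    ("the", "tallest", "man"): "Chi doy fütra chi wentru",
}

def ebmt_translate(sentence: str) -> str:
    words = sentence.lower().rstrip(".").split()
    output, i = [], 0
    while i < len(words):
        # greedily match the longest stored fragment starting at position i
        for j in range(len(words), i, -1):
            frag = tuple(words[i:j])
            if frag in example_base:
                output.append(example_base[frag])
                i = j
                break
        else:
            output.append(f"<{words[i]}>")  # word not covered by any fragment
            i += 1
    return " ".join(output)

print(ebmt_translate("I would like to meet the tallest man"))
# -> "Ayükefun trawüael Chi doy fütra chi wentru"
# i.e. the raw recombination, which an informant then corrects as on the slide.
```

The gap between the raw recombination and the corrected Mapudungun (agreement marking, the dropped "chi", the added "engu") is exactly what motivates learning transfer rules on top of EBMT.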
Version Space Learning • Symbolic learning from + and – examples • Invented by Mitchell, refined by Hirsh • Builds the generalization lattice implicitly • Bounded by G and S sets • Worst-case exponential complexity (in the size of G and S) • Slow convergence rate
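For reference, this is the textbook candidate elimination scheme the slide describes, sketched for simple conjunctive hypotheses over attribute vectors (a slot is either a concrete value or the wildcard "?"). It is a simplified illustration, not the NICE transfer-rule learner itself; the branching of G under negative examples is where the worst-case exponential blow-up comes from:

```python
def matches(h, x):
    """h covers x: every slot is a wildcard or agrees with x."""
    return all(hv in ("?", xv) for hv, xv in zip(h, x))

def generalize(s, x):
    """Least generalization of s that covers the positive example x."""
    return tuple(sv if sv == xv else "?" for sv, xv in zip(s, x))

def specialize(g, x, domains):
    """Minimal specializations of g that exclude the negative example x."""
    for i, gv in enumerate(g):
        if gv == "?":
            for v in domains[i]:
                if v != x[i]:
                    yield g[:i] + (v,) + g[i + 1:]

def candidate_elimination(examples, domains):
    positives = [x for x, y in examples if y]
    S = {positives[0]}                      # most specific boundary
    G = {("?",) * len(domains)}             # most general boundary
    for x, is_positive in examples:
        if is_positive:
            G = {g for g in G if matches(g, x)}
            S = {generalize(s, x) for s in S}
        else:
            S = {s for s in S if not matches(s, x)}
            G = ({g2 for g in G if matches(g, x)
                      for g2 in specialize(g, x, domains)
                      if any(matches(g2, s) for s in S)}
                 | {g for g in G if not matches(g, x)})
    return S, G

domains = [("sunny", "rainy"), ("warm", "cold")]
examples = [(("sunny", "warm"), True),
            (("rainy", "cold"), False),
            (("sunny", "cold"), True)]
print(candidate_elimination(examples, domains))
# -> ({('sunny', '?')}, {('sunny', '?')}): S and G have converged.
```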
Seeded Version Spaces • Generate concept seed from first + example • Generalization-level hypothesis (POS + feature agreement for T-rules in NICE) • Generalization/specialization level bounds: up to k levels of generalization and up to j levels of specialization • Implicit lattice explored outwards from the seed
Complexity of SVS • O(g^k) upward search, where g = # of generalization operators • O(s^j) downward search, where s = # of specialization operators • Since j and k are constants, SVS runs in polynomial time of order max(j, k) • Convergence rate bounded by F(j, k)
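A schematic of the bounded, seed-outward exploration behind these figures. The operators here are hypothetical placeholders (the real NICE operators generalize or specialize POS and feature constraints of transfer rules); the point is that stopping after k upward and j downward levels caps the explored region at O(g^k + s^j):

```python
def bounded_frontier(seed, operators, max_levels):
    """All hypotheses reachable from the seed in at most max_levels operator steps."""
    frontier, seen = {seed}, {seed}
    for _ in range(max_levels):
        frontier = {op(h) for h in frontier for op in operators} - seen
        if not frontier:            # lattice region exhausted early
            break
        seen |= frontier
    return seen

def wildcard(i):
    """Hypothetical generalization operator: relax the constraint in slot i."""
    return lambda h: h[:i] + ("?",) + h[i + 1:]

seed = ("DET", "ADJ", "N")                    # seed from the first + example
gen_ops = [wildcard(i) for i in range(3)]     # g = 3 generalization operators
print(sorted(bounded_frontier(seed, gen_ops, max_levels=2)))  # k = 2
# 7 hypotheses: the seed, 3 one-slot and 3 two-slot generalizations,
# versus 8 in the full lattice if the search were unbounded.
```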
Next Steps in SVS • Implementation of the transfer-rule interpreter (partially complete) • Implementation of SVS to learn transfer rules (under way) • Extension of the elicitation corpus for evaluation (under way) • First evaluation on Mapudungun MT (next)
DARPA Redirection for NICE • Focus on technology for rapid deployment of MT for new (low-density) languages • Not interested in indigenous endangered languages • Somali, Kyrgyz, Bahasa => yes • Siona, US indigenous languages, Mapudungun => no • First focus on limited-data evaluation for major languages such as Chinese and Arabic • Statistical methods favored over linguistic ones