Where do we stand? Harold Somers Centre for Computational Linguistics, UMIST, Manchester, England Panel session, MT Summit VIII, September 2001
Part I: the other 6,000+ languages • Language engineering (LE) R&D has focussed on a dozen or so languages of major commercial interest • Many other languages equally “deserving” • Not just MT but a large range of LE resources needed
Which languages? • “Minority” languages • NIMLs (non-indigenous minority languages), spoken by: immigrants, refugees, asylum seekers • E.g. languages of the Indian subcontinent and of Africa • Hardly “minority” languages
Example of Hindi • 180 million speakers in India • Spoken as first language in Northern States • 400-700 million speakers worldwide • 450,000 speakers in Britain • Hindi-Urdu, if taken together (!), is #2 in the world, ahead of English
Translation software - what would you expect? • Word processing • Fonts • Hyphenation • Spell checker • Style checker • Mono-/bilingual on-line dictionary • Multi-lingual on-line dictionary • Thesaurus (i.e. synonym dictionary) • Terminology • Translation memory • Computer-aided translation • MT
Translation software - what is available? • Word processing • Fonts • But not much else
What can we do about it? • Long term: computational linguistics research on a wider variety of languages • Short term: make use of existing resources (corpora, MRDs, web pages) and extract linguistic data from them
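As an illustration of the short-term route, the sketch below extracts a frequency-ranked word list from raw text, the kind of basic data a spell checker or dictionary project could start from. It is a minimal sketch, not a method described in the talk: the function names `tokenize` and `word_frequencies` are hypothetical, and it assumes words are delimited by spaces or punctuation, so it would not suit scripts written without word boundaries. Tokens are built from Unicode letters and combining marks, so that vowel signs in scripts such as Devanagari stay attached to their words.

```python
import unicodedata
from collections import Counter


def tokenize(text):
    """Split raw text into word tokens using Unicode character categories.

    Letters (L*) and combining marks (M*) are kept inside a token;
    every other character (spaces, digits, punctuation) ends the token.
    """
    # Normalise so that equivalent character sequences compare equal.
    text = unicodedata.normalize("NFC", text)
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LM":
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def word_frequencies(text):
    """Return a Counter mapping each word form to its number of occurrences."""
    # casefold() only affects scripts that have a case distinction.
    return Counter(token.casefold() for token in tokenize(text))


if __name__ == "__main__":
    sample = "यह घर है। यह घर बड़ा है।"
    for word, count in word_frequencies(sample).most_common():
        print(word, count)
```

The same function can be run over a collection of web pages or a machine-readable dictionary dump once the text has been decoded to Unicode; the ranked list it returns is only a starting point and would still need manual checking.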
Part II: India - the forgotten jewel • Three visits to India earlier this year • MT workshop in Kanpur • NLP workshop in Kolkata • Anglo-Indian summit in Mumbai • Several major groups working on NLP, including MT • Government initiatives
India’s problem • 13 official languages • Using 6 different writing systems • Special status of English • Widespread low levels of literacy • Introspective focus vis-à-vis interlingual communication
Problems being addressed • Agreed exchange formats for writing systems • OCR for writing systems • Speech recognition (including Indian English) • Word processing and related packages (dictionaries, spell checkers)
Problems being addressed (contd.) • Terminology • Corpus collection • MT and CAT tools • English <> Hindi • Between Indian languages