Linguistic Resources needed by Nuance • Jan Odijk • 060528 • Cocosda/Write Workshop
Overview • Nuance History • Nuance Technologies • Nuance Language Coverage • Which Languages are needed • Which data are needed • Advantages
Nuance History • ScanSoft (Digital Imaging) acquired: • Lernout & Hauspie speech divisions (2001) • Philips Speech Processing embedded and network divisions (2002) • Telelogue (2003) • LocusDialog (2003) • SpeechWorks (2004) • Talks (2004) • ART (2005) • Phonetic Systems (2005) • Rhetorical (2005) • MedRemote (2005) • Nuance (2005), after which the company was renamed Nuance • Dictaphone (2006)
Nuance Technologies • Digital Imaging • Speech Technologies • Text-to-Speech (TTS) • Automatic Speech Recognition (ASR) • Dictation • Speaker Verification • Audiomining • Speech Applications/Solutions • Automated Attendant Systems • Directory Assistance Systems • Dictation end-user application • Multimodal applications
Nuance Technologies • Platforms • Server • Desktop • Embedded • Automotive • Mobile Phones • Domains • Horizontal • Vertical • Medical • Legal • Navigation • ...
Nuance Language Coverage • Broad language coverage • OCR supports 114 languages • Desktop Dictation in 8 languages • TTS > 23 languages • Telephony ASR > 40 languages • Embedded ASR > 11 languages • Broad language coverage is necessary • Most business customers operate internationally • They want a single provider of language and speech technologies
Nuance Language Coverage • Language Coverage must be further broadened! • Data are needed for that, but ... • Costs are high • No single company can afford the investments
Which Languages? • Priority 1 • Arabic, Chinese (Mandarin, Cantonese), Danish, Dutch, English (UK), English (US), Farsi, Finnish, French, French (Canadian), German, Hindi, Indonesian, Italian, Malaysian, Pilipino (Tagalog), Polish, Portuguese, Portuguese (Brazil), Russian, Spanish, Spanish (American), Swedish, Thai, Turkish, Vietnamese, ... • Priority 2 • Bulgarian, Croatian, Czech, Estonian, Greek, Gujarati, Hebrew, Hungarian, Icelandic, Japanese, Kannada, Kazakh, Khmer, Latvian, Lithuanian, Macedonian, Malayalam, Marathi, Norwegian, Punjabi, Romanian, Serbian, Sesotho, Sinhalese, Slovak, Slovenian, Swahili, Tamil, Telugu, Ukrainian, Urdu, Uzbek, Xhosa, Zulu, ...
Which Data? • There’s no data like more data • but... • Given time and cost constraints, a minimal set is needed to develop technologies/applications for new languages
Which Data? • Network ASR: SpeechDat family • SpeechDat-II, OrienTel, SALA (I and II), LILA • Embedded ASR • Automotive: SpeechDat-Car • Consumer Apps: SPEECON • Pronunciation and Grammatical Lexicons: LC-STAR • TTS: TC-STAR • See: • http://www.speechdat.org • http://www.tc-star.org • http://www.lc-star.com
Which Data? • Desktop Office data • Large Text Corpora (>300 million tokens plain text) • news • business / finance • traffic messages, weather messages • e-mail • SMS • ...
Advantages • Research can be done in your own language • Part of the costs can be recovered by licensing data via ELRA to companies • Companies can develop technologies/applications for your languages • Contributes to securing the position of your language in the Internet era • Ask your government for funding and support • Some good examples: • STEVIN Programme (Netherlands/Flanders) • UPC databases for Catalan (Asunción Moreno)