Language and Speech Technology: Introduction

Jan Odijk • January 2011 • LOT Winter School 2011

Overview • What is language and speech technology (LST)? • Major Subfields of LST • Characterization of the last 30 years • 80s, 90s, 00s • Current Status • CLARIN infrastructure • This week’s programme

Language Technology • Language Technology is the study of computational systems that process natural language • Alternative names: • Human Language Technology (HLT) • Natural Language Processing (NLP)

Speech Technology • Speech Technology is the study of computational systems that process speech • It is a part of Language Technology • Often, however, the term “Language Technology” is reserved for the study of computational systems that process written language

Computational Linguistics • Computational Linguistics (CL) is the study of language from a computational perspective • Often used interchangeably with language technology • Often grouped under Artificial Intelligence (AI), although CL predates AI • AI: the study and design of intelligent systems

Computational Systems • Computational systems to process natural language do not exist naturally (except in the human brain) • They must be designed, implemented, and evaluated • Therefore LST is a kind of engineering

Computational Systems • LST is NOT the study of how humans process natural language, as in • cognition • (cognitive) psychology • (psycho)linguistics • phonetics

Language Technology Subfields • Orthographic processing • Text = sequence of characters • Tokenization • Text => sequence of tokens • Token = occurrence of a word form • Relatively simple for languages that use spaces and punctuation (dot, comma, etc.) to separate tokens • More difficult for languages such as Chinese and Thai, which are written without spaces between words
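
A minimal sketch of tokenization for a language that separates tokens with spaces and punctuation. The function name `tokenize` and the regular expression are illustrative assumptions, not a standard tokenizer.

```python
import re

# Illustrative pattern: a run of word characters, or one punctuation mark
# (any single character that is neither a word character nor whitespace).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split a text (sequence of characters) into a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("The man bought a present, for his wife."))
```

For text written without spaces, this pattern returns a whole run of characters as a single token, which is why such languages need other methods (for example lexicon-based or statistical segmentation).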

Language Technology Subfields • Orthographic processing • Orthographic normalization • Token => (token, normalized token) • Normalized token = canonical orthographic representation for a set of orthographic variants • Examples: • Contemporary spelling variants: aktie => actie • Older spelling variants: vleesch => vlees • Typos: actei => actie • OCR errors: raarn => raam
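
The mapping token => (token, normalized token) can be sketched as a table lookup. The table `VARIANTS` below is hypothetical and contains only the four examples from the slide; a real system would need a much larger resource or a learned model.

```python
# Hypothetical variant table, filled with the examples from the slide.
VARIANTS = {
    "aktie": "actie",    # contemporary spelling variant
    "vleesch": "vlees",  # older spelling variant
    "actei": "actie",    # typo
    "raarn": "raam",     # OCR error
}

def normalize(token):
    """Return (token, normalized token); unknown tokens map to themselves."""
    return token, VARIANTS.get(token.lower(), token)

print([normalize(t) for t in ["aktie", "vleesch", "raam"]])
```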

Language Technology Subfields • Morphological processing • Lemmatization: token => (token, lemma) • Lemma = canonical orthographic representation for an inflectional paradigm • Often ambiguous • Examples: • lemma(walked) = walk; lemma(men) = man • lemma(graven) = {graf, graaf, graven} (Dutch)

Language Technology Subfields • Morphological processing • Inflection analysis/generation • Word form <=> (lemma, inflectional features) • Examples: • graven <=> (graf, PoS=Noun, number=plural) • graven <=> (graaf, PoS=Noun, number=plural) • graven <=> (graven, PoS=Verb, form=infinitive) • graven <=> (graven, PoS=Verb, form=indicative, tense=present, number=plural)
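
The two directions (analysis and generation) can be sketched over one shared table. The table `ANALYSES` and the function names are assumptions for illustration; the entries are the four analyses of Dutch "graven" listed above.

```python
# Hypothetical analysis table; entries copied from the slide's examples.
ANALYSES = {
    "graven": [
        ("graf", {"PoS": "Noun", "number": "plural"}),
        ("graaf", {"PoS": "Noun", "number": "plural"}),
        ("graven", {"PoS": "Verb", "form": "infinitive"}),
        ("graven", {"PoS": "Verb", "form": "indicative",
                    "tense": "present", "number": "plural"}),
    ],
}

def analyse(word_form):
    """Analysis: word form => all (lemma, features) pairs (empty if unknown)."""
    return ANALYSES.get(word_form.lower(), [])

def generate(lemma, features):
    """Generation: (lemma, features) => all word forms with that analysis."""
    return [form for form, entries in ANALYSES.items()
            if (lemma, features) in entries]

def lemmas(word_form):
    """Lemmatization falls out of analysis: keep only the lemmas."""
    return {lemma for lemma, _ in analyse(word_form)}

print(lemmas("graven"))
print(generate("graaf", {"PoS": "Noun", "number": "plural"}))
```

Out of context the word form stays ambiguous; choosing among the analyses is the job of a tagger.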

Language Technology Subfields • Morphological processing • Compound processing • word form <=> ((word form, affix?)+, word form) • lemma <=> ((word form, affix?)+, lemma) • Examples: • vleeskoeienhouders <=> ([vlees, koeien], houders) ‘meat cow farmers’ • gebiedsbepaling <=> ([(gebied, s)], bepaling)
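
A minimal sketch of compound analysis by recursive lexicon lookup. The lexicon, the set of linking affixes (here only the Dutch linking "s") and the function name `split_compound` are assumptions for illustration; the sketch returns the first split it finds and does not rank alternatives.

```python
def split_compound(word, lexicon, linkers=("", "s")):
    """Return ([(modifier, linking affix), ...], head), or None if no split.

    A word that is itself in the lexicon is returned unsplit as the head.
    """
    if word in lexicon:
        return [], word
    for i in range(1, len(word)):
        left = word[:i]
        if left not in lexicon:
            continue
        rest = word[i:]
        for link in linkers:
            if not rest.startswith(link):
                continue
            tail = rest[len(link):]
            if not tail:
                continue
            result = split_compound(tail, lexicon, linkers)
            if result is not None:
                modifiers, head = result
                return [(left, link)] + modifiers, head
    return None

LEXICON = {"vlees", "koeien", "houders", "gebied", "bepaling"}
print(split_compound("vleeskoeienhouders", LEXICON))
print(split_compound("gebiedsbepaling", LEXICON))
```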

Language Technology Subfields • Morphological processing • Derivational morphology processing • word form <=> (prefix*, lemma, suffix*) • Example: • characterization <=> ([], characterize, [ation])

Language Technology Subfields • (PoS-)tagging • Assignment of a grammatical tag to a token in context (tag=label for grammatical properties) • Token => (token, tag) in context • Usually assignment of PoS-tags • Often more detailed grammatical (inflectional) tags

Language Technology Subfields • (PoS-)tagging • Context, usually: • Some words and/or tags preceding • Some words following • Examples: • (graven, Zij __ een graf) => Vindprespl • (graven, De __ zijn boos) => Npl

Language Technology Subfields • Chunking • identifying major phrases in a sentence • Example • The man bought a present for his wife => • [NP The man] bought [NP a present] [PP for his wife]
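
A minimal sketch of a chunker that works on PoS-tagged input. The tag names (`DT`, `NN`, `IN`, `PRP$`, `VBD`) and the two patterns (NP = optional determiner or possessive plus nouns; PP = preposition plus NP) are simplifying assumptions, chosen only to cover the example sentence.

```python
def chunk(tagged):
    """Bracket NP and PP chunks in a list of (word, tag) pairs."""
    out = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        label = "NP"
        if tagged[j][1] == "IN":              # a preposition may open a PP
            label = "PP"
            j += 1
        if j < n and tagged[j][1] in ("DT", "PRP$"):
            j += 1                            # optional determiner/possessive
        k = j
        while k < n and tagged[k][1] == "NN":
            k += 1                            # one or more nouns
        if k > j:                             # found at least one noun
            words = " ".join(word for word, _ in tagged[i:k])
            out.append(f"[{label} {words}]")
            i = k
        else:                                 # no chunk here: emit the word
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

TAGGED = [("The", "DT"), ("man", "NN"), ("bought", "VBD"), ("a", "DT"),
          ("present", "NN"), ("for", "IN"), ("his", "PRP$"), ("wife", "NN")]
print(chunk(TAGGED))
```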

Language Technology Subfields • Parsing • Assign a syntactic structure to a sentence • Example: The man bought a present for his wife => [S [subj/NP The man] [pred/VP bought [obj/NP a present] [pobj/PP for [obj/NP his wife]] ] ]

Language Technology Subfields • Machine Translation • Automatic translation of an input text • Example • The man bought a present for his wife => • L’homme a acheté un cadeau pour sa femme

Language Technology Subfields • Content extraction and processing • Named entity recognition • Question-answering • Information retrieval • Information extraction • Sentiment/opinion mining • Reasoning/inference on semantic representations • …

Speech Technology Subfields • Speech Synthesis • Artificial production of human speech • Text => speech • Often called Text-To-Speech (TTS) • TTS system usually contains two components • Grapheme to Phoneme (G2P) component • Text => symbolic speech representation (phonetic representation) • Speech Synthesis component • Symbolic speech representation => speech

Speech Technology Subfields • Speech Synthesis (cont.) • The term “Speech Synthesis” is often reserved for the second component (symbolic speech representation => speech) • Meaning => speech • Usually called Speech Generation, Concept-To-Speech, or Data-to-Speech

Speech Technology Subfields • Speech Recognition • Recognition of human speech • Audio containing speech => text • Often called automatic speech recognition (ASR) • Speech Understanding • Understanding of human speech • Audio containing speech => meaning or action

Speech Technology Subfields • Speaker Recognition • Recognition of a speaker given a speech signal • Speech => person identity • Speaker Verification • Verification of the identity of a person • Speech + claimed identity => Boolean
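
The difference between the two tasks is mainly one of signature, which a small sketch can make concrete. It assumes each utterance has already been reduced to a fixed-length feature vector and that vectors are compared with cosine similarity against enrolled speakers; the feature representation, the threshold value and all names are illustrative assumptions, not taken from the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recognize(features, enrolled):
    """Speaker recognition: speech features => person identity."""
    return max(enrolled, key=lambda name: cosine(features, enrolled[name]))

def verify(features, claimed, enrolled, threshold=0.8):
    """Speaker verification: speech features + claimed identity => Boolean."""
    return claimed in enrolled and cosine(features, enrolled[claimed]) >= threshold

# Hypothetical enrolled speakers and a new utterance.
ENROLLED = {"anna": [0.9, 0.1, 0.2], "bert": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]

print(recognize(sample, ENROLLED))
print(verify(sample, "anna", ENROLLED), verify(sample, "bert", ENROLLED))
```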

Speech Technology Subfields • Speech Compression • Reduction of the size of speech representations (speech encoding), or • Time-compression of speech representations (so that they sound faster to the listener)

Related fields • Speech is often used in dialogues • Study of spoken dialogues (human-human, human-machine) • Speech is often combined with other modalities • Study of Multimodal Interaction • Speech is part of a man-machine interface • Study of Human-Machine Interaction

Introduction • Three decades: • “80s”= 1980-1994 • “90s”= 1990-2005 • “00s” = 2000-2011

Overview • 80s: Language Technology • 80s: Speech Technology • 90s: Language and Speech Technology • 90s: Commercial Activity • 90s: Importance of Data • 00s: Language and Speech Technology

80s: Language Technology • Focus on MT (in Europe) • Eurotra (Europe) • Rosetta (Philips, Netherlands) • Distributed Translation (BSO, Netherlands)

80s: Language Technology • Linguistic “Research Approach” • Focus on Research • not/less on Technology Development • Knowledge-based approach • hand-crafted lexicons and rules • based on a theory / grammatical formalism • Focus on linguistically interesting complex phenomena • less on phenomena that occur often • not strongly data-driven

80s: Language Technology • Focus on an idealized language • not on actual language use • no focus on robustness • Computational approach seen (in research) as a way to gain insight into language, grammar and grammar formalisms • no focus on developing a working system • no pragmatic solutions

80s: Language Technology • Little formal (quantitative) evaluation • only with test suites • constructed sentences illustrating linguistic phenomena • E.g. the HP Test Suite (Flickinger et al. 1987) • computational linguistics rather than language technology

80s: Language Technology • Major problems (from a technology point of view): • Ambiguity • Real • Temporary • Computational Complexity • computation-intensive grammar formalisms • Complexity of language • handcrafting lexicons and rules • requires linguistic and computational expertise • requires a lot of effort and time

80s: Language Technology • Major problems (cont.): • Idealized language vs. actual language use • Large and rich lexicons are required, suited to the application domain: they take a large effort to build and to tune (adapt) to specific domains

80s: Speech Technology • Automatic Speech Recognition (ASR) • Statistical “Engineering Approach” • approach based on Noisy Channel Model • derive acoustic models from a lot of annotated speech examples • derive statistical language models from large text corpora (n-gram probabilities)
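
How n-gram probabilities are derived from a text corpus can be sketched for bigrams with plain relative-frequency (maximum-likelihood) estimates. The toy corpus, the sentence-boundary symbols `<s>` and `</s>`, and the absence of smoothing are simplifying assumptions.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1 as a history)."""
    histories = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return {(w1, w2): count / histories[w1]
            for (w1, w2), count in bigrams.items()}

CORPUS = ["the man bought a present", "the man left"]
probs = bigram_probabilities(CORPUS)
print(probs[("the", "man")], probs[("man", "bought")])
```

In the noisy channel setting, such language model probabilities are combined with acoustic model scores to pick the most probable word sequence for an audio signal.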

80s: Speech Technology • Focus on making (small) working systems • Statistical approach: system uses probabilities derived from data • Focus initially on limited, “simple” tasks (e.g. digit recognition), and increasingly on more complex tasks

80s: Speech Technology • Focus on real language use under realistic conditions • Progress made by making concrete systems and evaluating them rigorously

90s: Language Technology • Statistical MT • derive language models from monolingual corpora (probabilities of words and word sequences) • align “sentences” with their translations • derive translation model from parallel corpora: • estimate translation probabilities for words and word sequences from the aligned “sentences” • use these probabilities to compute translations for new “sentences”
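
Estimating word translation probabilities from aligned sentences can be sketched with expectation-maximization in the style of IBM Model 1. The slides do not name a specific model, so this choice is an assumption; the sketch also omits the NULL word and uses a three-sentence toy English-German corpus that is purely illustrative.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t(f | e) from aligned (source tokens, target tokens) pairs."""
    # Start from a uniform (unnormalized) table over co-occurring word pairs.
    t = {(f, e): 1.0 for es, fs in pairs for e in es for f in fs}
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of f aligned to e
        total = defaultdict(float)   # expected count of e being aligned at all
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)   # normalizer for this f
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # Re-estimate the table from the expected counts.
        t = {(f, e): count[(f, e)] / total[e] for (f, e) in t}
    return t

PAIRS = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
t = train_model1(PAIRS)
print(max((p, f) for (f, e), p in t.items() if e == "the"))
```

Although no word alignments are given, the co-occurrence pattern alone pushes the probability mass for "the" towards "das".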

90s: Language Technology • Ambiguity: resolved by probabilities based on statistics • Computational Complexity • computationally feasible formalisms • proven in speech recognition • Complexity of language • language and translation model automatically derived from data • Strong focus on actual language use • Highly data driven • Lexicons can be simpler and are derived automatically from the data; adaptation to specific domains easy once the data are available

90s: Language Technology • Rise of Internet • increasing need for information retrieval • approximated by search for word and word sequence strings • Information Retrieval • strongly statistically based • Limited linguistics • formal evaluation (recall, precision, F-score)
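
The evaluation measures named above can be computed as follows. This sketch treats the retrieved and relevant documents as sets of document identifiers and uses the balanced F-score (F1); the identifiers are made up for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two sets of document identifiers."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical result: 4 documents retrieved, 3 relevant, 2 in common.
print(precision_recall_f1({1, 2, 3, 4}, {2, 4, 5}))
```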

90s: Language Technology • Resulted in • strongly data-driven approach in language technology • increasing use of machine learning techniques • explicit focus on formal, esp. quantitative evaluation • re-examination of simpler/computationally less intensive formalisms (finite-state) for syntax

90s: Speech Technology • Continued working under the established paradigm • increasingly improving performance and extending environments and application areas

90s: Companies • many companies active in Speech technology • IBM, Microsoft, Siemens, Nokia, Philips, Motorola, Matra Nortel, Nortel, ... • Dragon, Kurzweil, Lernout & Hauspie, SpeechWorks, Nuance, Babel, Loquendo, Rhetorical, Vocalis, Telisma, Elan, ...

90s: Companies • many companies in Language technology • IBM, Microsoft, INSO, Novell, ... • GMS, Apptek, Globalink, Lernout & Hauspie, Systran, LANT (Xplanation), ...

90s: Companies • MT systems: • knowledge-based systems, • developed under an engineering approach • simple grammatical formalism, or pruning of the search space • to reduce ambiguity • to reduce computational resource requirements • to reduce hand-crafting of rules

90s: Companies • resulted in low quality MT systems • still useful in many circumstances • Differentiating factors • rapid adaptation to (multi-word) terms / vocabulary of new domain • good performance on named entity recognition

90s: Data • Knowledge-based NLP: realization that cooperation on lexicons was required • ASR methodology requires a lot of data: • “There is no data like more data” • This led to • Data creation projects • Set-up of data distribution centers • Projects for developing standards for data

90s: Data • Projects • Lexicon projects • Multilex • Genelex • Acquilex • Parole • WordNet, EuroWordNet • SpeechDat projects • SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON, OrienTel • National / Local projects • Spoken Dutch Corpus (Netherlands and Flanders)

90s: Data • Data distribution Centers are set up • LDC (1993) • ELRA (1995) • Standards: • TEI for text corpora • CES, XCES • Eagles, ISLE for grammatical properties

Automating Data Production • Usually existing (imperfect) tools are used to create data (semi-)automatically • G2P for creating phonetic dictionaries • PoS-tagging for PoS-tagged text corpora • Parsers for treebanks • For bootstrapping annotations • Faster and more consistent results • Followed by (partial) manual correction

00s • Early 00s • Many data and research initiatives, nationally • Netherlands • IMIX 2001-2008 • STEVIN 2004-2011 • TST-Centrale (HLT Agency) 2005-.. • France • EVALDA • Technolangue
