LANGUAGE TECHNOLOGIES: Linguistics in the Computer World

Darinka Verdonik

Digital/computer age
• Living in a digital, computer age:
  • we are surrounded by digital machines: PCs, audio/video devices (MP3, DVD, CD), e-banking, domestic appliances (microwave oven, washing machine)...
• How can a linguist use a PC?
  • searching for information and knowledge on the Internet
  • buying books, making reservations (e.g. at a library)...
  • contacts: e-mail, mailing lists, messenger, forums, chat rooms...
  • tools: writing and designing texts, preparing presentations and posters; e-dictionaries, corpora, spell-checkers, grammar checkers, automatic summarization...

Products
• Speech synthesiser: Plattos
  • pronounces written text with a human-like voice

Products
• Speech recogniser: Broadcast News subtitle
  • writes down the speech in a TV Broadcast News show as text

Products
• Dialogue system: Auto Attendant
  • human-machine communication in an automatic telephone responder
  System: Welcome to the FERI portal. Choose the directory or a department.
  Caller: Directory.
  System: You have chosen the directory. Say the first name and surname of the person.
  Caller: Darinka Verdonik.
  System: You have chosen Darinka Verdonik. Please wait a moment.

Products
• Dialogue system: Klepec

Products
• Machine translation:
  • text-to-text (translation only): Presis
  • speech-to-speech (recognition – translation – synthesis): Babilon, VoiceTran

Products
• National corpora: Fida/FidaPlus (www.fidaplus.net), Nova beseda (bos.zrc-sazu.si), BNC (http://www.natcorp.ox.ac.uk/)...

Products
• National corpus: FidaPlus (www.fidaplus.net)

Products
• Parallel corpus: Evrokorpus (http://www.gov.si/evrokor/)

Algorithms
• The heart of the technologies: programming, modeling, coding...

Algorithms
Examples:
• Speech synthesis:
  • grapheme-to-phoneme conversion
  • modeling prosodic features
  • search algorithm(s)
• Speech recognition:
  • acoustic modeling: calculating the probabilities of phonemes (triphones)
  • language modeling: calculating the probabilities of word sequences
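The language-modeling step above can be illustrated with a very small bigram model. This is only a sketch under stated assumptions: the training sentences are invented toy data, the function names are hypothetical, and the probabilities are plain maximum-likelihood estimates without the smoothing a real recogniser would need. It is not the method used in any of the products listed here.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in tokenised sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # <s> and </s> mark sentence start and end (an assumed convention).
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return unigrams, bigrams


def bigram_probability(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen history."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


# Invented toy corpus, for illustration only.
corpus = ["the clocks were striking", "the clocks stopped", "the day was cold"]
unigrams, bigrams = train_bigram_model(corpus)
p = bigram_probability(unigrams, bigrams, "the", "clocks")
```

Here "the" occurs three times and is followed by "clocks" twice, so `p` is 2/3. A recogniser would combine such word-sequence scores with the acoustic scores of the candidate words.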

Language resources – spoken
• Databases of spoken language:
  • defining the type and number of texts to include: read phrases and/or sentences, speech in the media (TV, radio), conversational speech...
  • recording
  • defining contextual tags: speakers (gender, dialect...), acoustic environment (channel, background, noises...), non-speech sounds (breathing, laughing...)...
  • defining linguistic tags: phonetic/orthographic transcription, lemma, POS and other morpho-syntactic tags...
  • segmentation, transcription (phonetic or orthographic)
  • annotation
  • coding (computer-readable form, e.g. XML)
  • optional: developing a user interface for searching through the database

Language resources – spoken
<?xml version="1.0" encoding="ISO-8859-2"?>
<!DOCTYPE Trans SYSTEM "trans-13.dtd">
<Trans scribe="Darinka" audio_filename="HOha50" version="25" version_date="051201">
  <Topics>
    <Topic id="to1" desc="jedro"/>
    <Topic id="to2" desc="uvod"/>
    <Topic id="to3" desc="zakljucek"/>
  </Topics>
  <Speakers>
    <Speaker id="spk1" name="Habakuk_receptor1" check="yes" type="female" dialect="native" accent="p-mariborsko" scope="local"/>
    <Speaker id="spk2" name="klicatelj39" check="yes" type="female" dialect="native" accent="p-celjsko" scope="local"/>
  </Speakers>
  <Episode>
    <Section type="report" topic="to2" startTime="0" endTime="6.26">
      <Turn startTime="0" endTime="2.246" speaker="spk1" mode="spontaneous" fidelity="medium" channel="telephone">
        <Sync time="0"/>
        dobro hotel Habakuk [ime] pri telefonu
      </Turn>
      <Turn speaker="spk2" mode="spontaneous" fidelity="medium" channel="telephone" startTime="2.246" endTime="5.424">
        <Sync time="2.246"/>
        ja
        <Event desc="marker" type="lexical" extent="previous"/>
        dober dan želim [priimek] [ime] je moje ime
      </Turn>
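A coded transcription like the excerpt above can be read back by a program. The sketch below uses Python's standard `xml.etree.ElementTree` module. The `SAMPLE` string is a shortened copy of the excerpt to which the closing tags have been added so that it parses (an assumption, since the slide shows only the beginning of the file), and `read_turns` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt; closing tags added here so it is well-formed.
SAMPLE = """<Trans scribe="Darinka" audio_filename="HOha50">
<Speakers>
<Speaker id="spk1" name="Habakuk_receptor1" type="female"/>
<Speaker id="spk2" name="klicatelj39" type="female"/>
</Speakers>
<Episode>
<Section type="report" topic="to2" startTime="0" endTime="6.26">
<Turn startTime="0" endTime="2.246" speaker="spk1">
<Sync time="0"/>
dobro hotel Habakuk [ime] pri telefonu
</Turn>
<Turn speaker="spk2" startTime="2.246" endTime="5.424">
<Sync time="2.246"/>
ja
<Event desc="marker" type="lexical" extent="previous"/>
dober dan želim [priimek] [ime] je moje ime
</Turn>
</Section>
</Episode>
</Trans>"""


def read_turns(xml_text):
    """Return one dict per <Turn>: speaker name, start/end time, plain text."""
    root = ET.fromstring(xml_text)
    # Map speaker ids (spk1, spk2...) to the names given in <Speakers>.
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    turns = []
    for turn in root.iter("Turn"):
        # itertext() collects the text between <Sync/> and <Event/> markers;
        # split/join normalises the line breaks and spaces.
        text = " ".join("".join(turn.itertext()).split())
        turns.append({
            "speaker": names.get(turn.get("speaker"), turn.get("speaker")),
            "start": float(turn.get("startTime")),
            "end": float(turn.get("endTime")),
            "text": text,
        })
    return turns


turns = read_turns(SAMPLE)
```

Each item in `turns` then pairs the contextual tags (speaker, timing) with the orthographic transcription, which is the form a search interface over the database would work with.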

Spoken corpora of the Slovenian language
• BNSI Broadcast News (36 hours)
• Slovenian Broadcast News Database (30 hours = 255,000 words)
• Korpus govorjene slovenščine (90 min. = 15,000 words) – pilot corpus
• Turdis (100 min. = 15,000 words)
• ...

Language resources – written
• Corpora: huge e-collections of different texts (books, journals...):
  • defining the type and number of texts to include: national/reference corpora, domain-specific corpora, parallel corpora...
  • defining contextual tags: source, year of publication, language...
  • defining linguistic tags: lemma, POS, morpho-syntactic tags, phonetic transcription...
  • annotating
  • coding
  • optional: developing a user interface for searching through the corpus

Language resources – written
<text lang="en-sl" id="orwl.T">
  <body>
    <tu lang="en-sl" id="orwl.1">
      <seg lang="en">
        <s id="Oen.1.1.1.1"><w>It</w> <w>was</w> <w>a</w> <w>bright</w> <w>cold</w> <w>day</w> <w>in</w> <w>April</w><c>,</c> <w>and</w> <w>the</w> <w>clocks</w> <w>were</w> <w>striking</w> <w>thirteen</w><c>.</c></s>
      </seg>
      <seg lang="sl">
        <s id="Osl.1.2.2.1"><w lemma="biti" function="Vcps-sma">Bil</w> <w lemma="biti" function="Vcip3s--n">je</w> <w lemma="jasen" function="Afpmsnn">jasen</w><c>,</c> <w lemma="mrzel" function="Afpmsnn">mrzel</w> <w lemma="aprilski" function="Aopmsn">aprilski</w> <w lemma="dan" function="Ncmsn">dan</w> <w lemma="in" function="Ccs">in</w> <w lemma="ura" function="Ncfpn">ure</w> <w lemma="biti" function="Vcip3p--n">so</w> <w lemma="biti" function="Vmps-pfa">bile</w> <w lemma="trinajst" function="Mcnpnl">trinajst</w><c>.</c></s>
      </seg>
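The translation unit above stores an English segment and its Slovenian counterpart side by side, with a lemma and a morpho-syntactic tag on each Slovenian word. A minimal sketch of reading such a unit follows. It assumes a shortened, closed copy of the excerpt (only the first six words of each sentence, with the closing tags added), and `tokens_by_language` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of one translation unit; closing tags added so it parses.
SAMPLE = """<tu lang="en-sl" id="orwl.1">
<seg lang="en"><s id="Oen.1.1.1.1"><w>It</w> <w>was</w> <w>a</w> <w>bright</w>
<w>cold</w> <w>day</w></s></seg>
<seg lang="sl"><s id="Osl.1.2.2.1"><w lemma="biti" function="Vcps-sma">Bil</w>
<w lemma="biti" function="Vcip3s--n">je</w>
<w lemma="jasen" function="Afpmsnn">jasen</w><c>,</c>
<w lemma="mrzel" function="Afpmsnn">mrzel</w>
<w lemma="aprilski" function="Aopmsn">aprilski</w>
<w lemma="dan" function="Ncmsn">dan</w></s></seg>
</tu>"""


def tokens_by_language(xml_text):
    """Map each segment language to a list of (form, lemma, tag) triples.

    Punctuation (<c>) is skipped; lemma and tag are None when the word
    carries no annotation, as in the English segment of the excerpt.
    """
    tu = ET.fromstring(xml_text)
    result = {}
    for seg in tu.iter("seg"):
        result[seg.get("lang")] = [
            (w.text, w.get("lemma"), w.get("function")) for w in seg.iter("w")
        ]
    return result


aligned = tokens_by_language(SAMPLE)
slovenian_lemmas = [lemma for _, lemma, _ in aligned["sl"]]
```

Because both segments belong to the same `tu` element, a search tool can show the Slovenian sentence next to its English source, and the lemmas let a query for "biti" find both "Bil" and "je".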

Language resources – written
• Lexica: e-collections of words, usually with linguistic information added:
  • selecting word entries and preparing a word list
  • defining the types of information included: lemma, POS, morpho-syntactic tags, phonetic transcription, semantic nets/word nets...
  • annotating
  • coding
  • optional: developing a user interface for searching through the lexicon

Language resources – written
<ENTRYGROUP orthography="Abitanti">
  <ENTRY>
    <NOM class="CIT" />
    <LEMMA>Abitanti</LEMMA>
    <PHONETIC>a - b i - " t a: n - t i</PHONETIC>
  </ENTRY>
  <ENTRY>
    <NOM class="STR" />
    <LEMMA>Abitanti</LEMMA>
    <PHONETIC>a - b i - " t a: n - t i</PHONETIC>
  </ENTRY>
</ENTRYGROUP>

Corpus linguistics
Uses corpora for its research. Advantages:
• Analysis of real texts that were actually written/spoken.
• Ability to handle huge amounts of data: automatic searching, counting, sorting...
• Statistical reliability: e.g. results of an analysis can be expressed in %.
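Automatic counting, sorting and percentages can be sketched in a few lines. The example below is an illustration on an invented ten-word sample, not a real corpus, and the function name is hypothetical.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return each word's share of all tokens, in percent."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    return {word: 100.0 * n / total for word, n in counts.items()}


# Invented sample of ten tokens, for illustration only.
tokens = "the day was cold and the clocks were striking thirteen".split()
frequencies = relative_frequencies(tokens)

# Sort from most to least frequent, as a frequency list would.
ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
```

In this sample "the" accounts for 2 of the 10 tokens, so `frequencies["the"]` is 20.0 and it heads the `ranked` list. On a real corpus the same counts are what make claims about usage statistically checkable.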

Corpus linguistics
Includes:
• Building corpora:
  • what types and what amount of texts to include
  • what linguistic information to include
• Developing tools for automatic searching, sorting and counting.
• Corpus analysis.

Corpus linguistics Example of corpus analysis (Gorjanc, V., 2005. Uvod v korpusno jezikoslovje. Domžale, Izolit.)

Corpus linguistics
• Usefulness of a corpus in everyday work: similar to a dictionary, with the advantage of being up to date:
  • when writing or correcting texts, we can search for a word/phrase and see:
    • how often it is used
    • in what types of texts it is used
    • how it is usually used
    • what meaning it has in a given context
    • what its most common collocations are
    • what its most common translations are
    • etc.
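Two of the look-ups listed above, how a word is used and what it collocates with, can be sketched as a keyword-in-context (KWIC) search and a simple neighbour count. This is a toy illustration on an invented sentence. The function names and window sizes are assumptions, and real corpus tools usually rank collocations with statistical measures rather than raw counts.

```python
from collections import Counter


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words found within `window` positions of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(word.lower() for word in neighbours)
    return counts


# Invented sample text, for illustration only.
text = "a cold day in April and a bright day in May and a cold night"
tokens = text.split()

hits = concordance(tokens, "day")
neighbours = collocates(tokens, "day")
```

`hits` contains one line per occurrence of "day" with its immediate context, and `neighbours` shows that "in" and "a" each appear twice next to it in this sample.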

Conclusions
• Linguistics in a computer world:
  • takes part in technological development whose results (if successful) will affect our everyday lives in the future (machine-mediated communication, human-machine communication, help for people with disabilities)
  • uses the products of technological development to make research more reliable and to develop new research methods and new linguistic tools

Thank you for your attention. Questions? Slides available on: http://www.elektronika.uni-mb.si/Elektronika/Slo/staff/Staff_slo.php