Estonian language resources as source material for linguistics

Kadri Muischnek, University of Tartu (TÜ)

Overview of the talk
• What are language resources
• Language corpora
  • what corpora are and what they are good for
  • some corpora of Estonian at UT: www.cl.ut.ee/korpused
  • what should be borne in mind while using these corpora
• User interfaces and tools, such as
  • user interface: www.cl.ut.ee/kasutajaliides
  • morphology-aware user interface: www.keeleveeb.ee
  • collocation extraction tool: www.rabauti.ee/clc
  • frequency lists: www.cl.ut/sagedused1

Motto
What is the difference between a dialect and a language?
Some time ago: A language is a dialect with an army and a navy (attributed to Max Weinreich)
Nowadays: A language is a dialect with a dictionary, grammar, parser and a multi-million corpus (Lars Borin)

Language resources
Language resources: all knowledge sources based on language, including text corpora, lexicons, databases, formal grammar descriptions, etc.
In Estonian (keeleressursid): electronic data collections, including text collections (corpora), lexicons, databases and formal grammar descriptions.
For Estonian language resources, see www.keeleressursid.ee

Language corpora
What is a language corpus?
Easy answer: an electronic text collection.
Ideal: a polyfunctional electronic text collection whose texts have been chosen purposefully, so that together they give a faithful picture of the whole language (or of some sublanguage) over a certain time span.

Language corpora 2
Size: depends on sublanguage, annotation, etc.
• Eesti keele Koondkorpus (Reference Corpus of Estonian): ca 245 million words
• Deutsches Referenzkorpus (DeReKo), or Mannheim corpus: ca 5.4 billion words http://www.ids-mannheim.de/kl/projekte/korpora/

Representativeness and text classes
... texts that are chosen purposefully to give a full representation of a language (or, at least, of a certain part of the language) during a certain time span.
A corpus meeting this condition is called a representative corpus.
Let's have a look at the text classes of the Brown/LOB-style corpus of Estonian: http://www.cl.ut.ee/korpused/baaskorpus/1980/
Closed vs open corpora
Synchronic vs diachronic corpora

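Sublanguages
Two contrasting examples:
• Määruse alusel § 3 lõikes 1 nimetatud taotleja kaudu põllumajandustoodete töötlejatele ning taotleja liikmetele, kes ei ole põllumajandustootjad ega põllumajandustoodete töötlejad, antav toetus on vähese tähtsusega abi komisjoni määruse nr 1998/2006, milles käsitletakse asutamislepingu artiklite 87 ja 88 kohaldamist vähese tähtsusega abi suhtes (ELT L 379, 28.12.2006, lk 5–10), mõistes;
  (roughly: 'The support given under the regulation, via the applicant named in § 3(1), to processors of agricultural products and to members of the applicant who are neither agricultural producers nor processors of agricultural products is de minimis aid within the meaning of Commission Regulation No 1998/2006 on the application of Articles 87 and 88 of the Treaty to de minimis aid (OJ L 379, 28.12.2006, pp. 5–10);')
• a olks, sorri offtopicu eest.
  (roughly: 'but okay, sorry for the off-topic.')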

Corpora of written Estonian at CL.UT
• www.cl.ut.ee/korpused
• Baaskorpus: written Estonian of the 1980s
  • closed, representative, 1 million words
• Written Estonian from the period 1890-1990
  • closed, partly(??) representative, fewer text classes than in the previous corpus
• Koondkorpus (Reference corpus), 1990 onwards
  • open, new texts added constantly, ca 245 million words at the moment
  • a subcorpus of the Reference corpus, the Balanced corpus: 5 million words of fiction, 5 million words of newspaper texts, 5 million words of science texts

Corpora of written Estonian at CL.UT
How these corpora can be used:
1) concordancer http://www.cl.ut.ee/korpused/kasutajaliides/ (a bit slow; regular expressions can be used)
2) download the TEI XML versions http://www.cl.ut.ee/korpused/segakorpus/
We'll talk about these later:
3) morphology-aware user interfaces
4) collocation extraction tool
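To give a feel for what a regular-expression concordance query returns, here is a minimal keyword-in-context sketch. It is not the code behind the UT concordancer; the function name `concordance`, the `width` parameter and the sample sentence (borrowed from a later slide) are chosen purely for illustration.

```python
import re

def concordance(text, pattern, width=30):
    """Return one keyword-in-context line for every regex match in text."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the hits line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = ("Triibuline kass lasi koerakuudi juurest ruttu jalga, "
          "aga suur koer ajas teda haukudes taga.")

# All word-forms beginning with "ko".
for line in concordance(sample, r"\bko\w+"):
    print(line)
```

A regular expression matches character strings only, so one pattern rarely covers all forms of an Estonian lemma; that limitation is what the morphology-aware interfaces address.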

Corpus annotation
Annotation: adding explicit information to the corpus, e.g.
1) Structure of the text: paragraphs, sentences, non-text (tables, formulae, etc.)
2) Morphological information: lemmas, parts of speech, grammatical categories
3) Syntactic information: syntactic functions, phrases, the relations between words/phrases
4) Semantic information: word senses, semantic roles, etc.
5) ... and much more

Morphological annotation
Mees
    + mees+0 //_S_ sg n, //
      mesi+s //_S_ sg in, //
peeti
      peet+0 //_S_ adt, sg p, //
    + pida+ti //_V_ main indic impf imps af //
kinni
      kinni+0 //_D_ //
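For readers who download the annotated files, an analysis line of this shape can be split into its parts with a few lines of Python. The sketch below assumes the format is exactly what the example shows (lemma, `+`, ending, then a part-of-speech tag between `//_` and `_`, then grammatical categories closed by `//`), and it assumes that a comma separates alternative category sets. The real corpus files may differ, and `parse_analysis` is a hypothetical helper.

```python
import re

# Assumed line shape, inferred only from the example on the slide:
#   lemma+ending //_POS_ categories, categories, //
ANALYSIS_RE = re.compile(
    r"^(?P<lemma>[^+\s]+)\+(?P<ending>\S+)\s+"
    r"//_(?P<pos>[A-Za-z]+)_\s*(?P<cats>.*?)\s*//$"
)

def parse_analysis(line):
    """Split one analysis line into lemma, ending, POS and category sets."""
    match = ANALYSIS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised analysis line: {line!r}")
    # Assumption: commas separate alternative sets of grammatical categories.
    alternatives = [alt.split()
                    for alt in match.group("cats").split(",")
                    if alt.strip()]
    return {
        "lemma": match.group("lemma"),
        "ending": match.group("ending"),
        "pos": match.group("pos"),
        "categories": alternatives,
    }

for line in ["mees+0 //_S_ sg n, //",
             "peet+0 //_S_ adt, sg p, //",
             "pida+ti //_V_ main indic impf imps af //",
             "kinni+0 //_D_ //"]:
    print(parse_analysis(line))
```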

Morphologically annotated corpora of Estonian
• http://www.cl.ut.ee/korpused/morfkorpus/
  • 613 000 words, manually double-checked
  • can be used via the user interface or downloaded (500 000 words)
  • user interface: regular expressions can be used
  • Let's search for an impersonal verb + the postposition poolt
• www.keeleveeb.ee
  • the whole Koondkorpus (Reference Corpus), 245 million words
  • tagged automatically with a statistical, HMM trigram-based tagger
  • several systematic errors in the annotation; we are fixing them gradually

Corpus query system at www.keeleveeb.ee
• The corpus can be queried for a word-form, a lemma, a grammatical category or a combination of those
• A combination can be adjacency, co-occurrence within a sentence or co-occurrence within a clause
• One can ask for occurrences of a word/lemma that do not co-occur with another word/lemma/grammatical category
Some examples (based on the fiction subcorpus):
• Can the noun kala really be used with the sid-ending in the plural partitive?
• Can the word-form plehku be used without the verbs pistma and panema?
• Let's again search for a clause containing an impersonal verb and the postposition poolt

Collocations
Definitions:
In computational linguistics: a collocation in the statistical sense is the occurrence of words (or word-forms) in each other's neighbourhood more often than would be expected from their individual frequencies, assuming that words in general occur in text at random.
In linguistics, collocations are also understood more narrowly: they are combinations of words frequently used together that do not fit the definition of an idiom or of a particle verb.
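The statistical definition can be made concrete by comparing the observed number of co-occurrences with the number expected if the two words were independent. The sketch below uses pointwise mutual information, which is one common association measure among many and is not necessarily the one used by the tools presented here. All counts are invented toy numbers.

```python
import math

def expected_cooccurrences(freq1, freq2, n_pairs):
    """Co-occurrences expected if the two words were distributed independently."""
    return freq1 * freq2 / n_pairs

def pmi(observed, freq1, freq2, n_pairs):
    """Pointwise mutual information: log2 of observed over expected."""
    return math.log2(observed / expected_cooccurrences(freq1, freq2, n_pairs))

# Toy numbers, invented for illustration only (not taken from any corpus):
n_pairs = 1_000_000   # candidate word pairs in the corpus
freq_word1 = 200      # pairs containing the first word
freq_word2 = 500      # pairs containing the second word
observed = 50         # pairs containing both words

print("expected:", expected_cooccurrences(freq_word1, freq_word2, n_pairs))
print("PMI:", round(pmi(observed, freq_word1, freq_word2, n_pairs), 2))
```

A score well above zero means the pair occurs together more often than chance would predict; a score near zero means the pair behaves as if the words were independent.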

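Collocations 2
A problem when learning or using foreign languages:
• make a decision vs take a decision ??
• have a drink vs have an eat ??
• tähelepanu pöörama/osutama/suunama/keerama/panema ?? (which verb goes with tähelepanu 'attention': 'turn', 'point', 'direct', 'twist' or 'put'?)
This is why collocations are needed by, for example, lexicographers.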

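Collocations
Collocations can be further divided based on semantic compositionality:
1) Idioms (in Estonian grammatical description, idiomatic verb combinations are called väljendverbid): laseme jalga 'let's run for it', löövad lokku 'they sound the alarm', miski seisab savist jalgadel 'something stands on feet of clay'
2) Non-idioms, collocations in the linguistic sense: kange kohv 'strong coffee', tähelepanu pöörama 'to pay attention', kartuleid võtma 'to harvest potatoes' (literally 'to take potatoes'), marju korjama 'to pick berries'
Of course, borderline cases exist: laulu lööma 'to strike up a song', tünga tegema 'to play a trick on someone'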

Collocations
Collocations can be further divided based on their syntactic structure, e.g.:
1) Noun phrase: kange kohv (cf. English strong coffee), ere näide, helge tulevik (so-called fixed epithets)
2) Verb + complement: pöörasin viimaks tähelepanu asjaolule, et …; laseme siit kähku jalga; ajalehed löövad lokku
3) Particle verb: autod põrkasid kokku; sõdur jooksis vaenlase poole üle

How to extract collocations from a text corpus?
1. Frequency
1.1. Frequent word pairs (or trigrams etc.): kange kohv
1.2. Frequent word pairs (or trigrams etc.) separated by several intervening words: Kass laskis koerakuudi lähedusest kiiresti jalga.
But the most frequent word pairs in Estonian text are ei ole, see on, ta on, etc.
2. In addition to the frequency of the word combination, the frequencies of the words outside the combination can also be taken into account -> co-occurrence statistics
http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf

How to extract collocations from a text corpus? 2
In practice: Triibuline kass lasi koerakuudi juurest ruttu jalga.
1. Word pairs:
Variant A: adjacent pairs: triibuline kass; kass lasi; lasi koerakuudi etc.
Variant B: words can be separated by up to n words, e.g. n=3: triibuline kass; triibuline lasi; triibuline koerakuudi; triibuline juurest etc.
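Variants A and B can be sketched with a single function in which the maximum number of intervening words is a parameter. The name `window_pairs` and the lowercased whitespace tokenisation are simplifications made for this illustration, not a description of the actual extraction tool.

```python
def window_pairs(tokens, max_gap=0):
    """Ordered candidate pairs with at most max_gap words between the members.

    max_gap=0 gives adjacent pairs (Variant A); max_gap=n gives Variant B.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The partner may sit up to max_gap + 1 positions to the right.
        last = min(i + max_gap + 1, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            pairs.append((left, tokens[j]))
    return pairs

tokens = "triibuline kass lasi koerakuudi juurest ruttu jalga".split()

variant_a = window_pairs(tokens)             # adjacent pairs only
variant_b = window_pairs(tokens, max_gap=3)  # up to 3 intervening words

print(variant_a[:3])
print(variant_b[:4])
print(len(variant_a), len(variant_b))
```

Widening the window quickly multiplies the number of candidates, which is one reason why the candidates are usually filtered or scored afterwards.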

How to extract collocations from a text corpus? 2 (continued)
Variant C: candidate pairs are formed by combining all words in a clause, if information about the clause boundaries is available.
A whole sentence of written language is too long a context: Triibuline kass lasi koerakuudi juurest ruttu jalga, aga suur koer ajas teda haukudes taga.
Word pairs or lemma pairs?
Morphology-based filtering: e.g. only verb-particle combinations are considered.
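The effect of using clause boundaries can be sketched as follows. The split of the example sentence into two clauses is done by hand here and is an assumption of the illustration; in practice it would come from clause-boundary annotation. `clause_pairs` is a hypothetical helper.

```python
from itertools import combinations

def clause_pairs(clauses):
    """Combine every two words within a clause, never across clause boundaries."""
    pairs = []
    for clause in clauses:
        pairs.extend(combinations(clause, 2))
    return pairs

# Hand-made clause split of the example sentence (assumed, not tool output).
clauses = [
    "triibuline kass lasi koerakuudi juurest ruttu jalga".split(),
    "aga suur koer ajas teda haukudes taga".split(),
]
whole_sentence = [word for clause in clauses for word in clause]

within_clauses = clause_pairs(clauses)
within_sentence = clause_pairs([whole_sentence])

print(len(within_clauses), "candidate pairs within clauses")
print(len(within_sentence), "candidate pairs within the whole sentence")
print(("lasi", "jalga") in within_clauses)   # members of one expression, same clause
print(("kass", "taga") in within_clauses)    # words from different clauses
```

Restricting candidates to the clause keeps a pair such as lasi + jalga while discarding combinations of words that only happen to share a sentence.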

Collocation extraction tool www.rabauti.ee/clc
Examples:
• find verbs that collocate with the word eile
• find verbs that collocate with the verb võima
• find verbs that collocate with the adverb üle
• find words that collocate with the word plehku

Corpus-based frequency lists Based on 1 million word corpus http://www.cl.ut.ee/ressursid/sagedused/ Based on 15 million word corpus http://www.cl.ut.ee/ressursid/sagedused1/

Most frequent lemmas in the two frequency lists

Rank  1 million words           15 million words
      (fiction + newspapers)    (fiction + newspapers + science)
1     44904 olema               639802 olema
2     27232 ja                  403162 ja
3     21850 tema                263713 see
4     18441 see                 233960 tema
5     14011 mina                180248 mina
6     13813 ei                  172779 ei
7     12318 et                  171666 et
8     8600 kui                  127728 kui
9     8230 mis                  124942 mis
10    6194 ka                   100058 ka
11    5894 saama                78156 saama
12    5738 oma                  76123 ning
13    5276 aga                  66694 oma
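Raw counts from corpora of different sizes are not directly comparable, so a usual first step is to normalise them to occurrences per million words. The sketch below does this for a few lemmas from the two lists, reading the smaller counts as belonging to the 1 million word list. It assumes the nominal corpus sizes (1 million and 15 million words) are exact token totals, which is a simplification.

```python
def per_million(count, corpus_size):
    """Relative frequency expressed as occurrences per million words."""
    return count * 1_000_000 / corpus_size

SMALL_CORPUS = 1_000_000    # nominal size, assumed exact
LARGE_CORPUS = 15_000_000   # nominal size, assumed exact

# lemma: (count in the 1 million list, count in the 15 million list)
counts = {
    "olema": (44904, 639802),
    "tema": (21850, 233960),
    "see": (18441, 263713),
    "oma": (5738, 66694),
}

for lemma, (small, large) in counts.items():
    print(f"{lemma:6} {per_million(small, SMALL_CORPUS):9.0f} "
          f"{per_million(large, LARGE_CORPUS):9.0f}")
```

After normalisation, tema is clearly less frequent per million words in the larger list, while see stays at a similar level. That fits the two lemmas swapping ranks between the lists, and the added science texts are one plausible explanation.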

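Estonian language technology projects, including those whose results were discussed here, are funded by the Estonian language technology programme (Eesti Keeletehnoloogia Sihtprogramm) www.keeletehnoloogia.ee
In the future, Estonian language resources will be managed and distributed by the Center of Estonian Language Resources (Eesti Keeleressursside Keskus) http://www.keeleressursid.ee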
