CorpEus: A Tool for Basque Language Research

CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque. I. Leturia, A. Gurrutxaga (1), I. Alegria, A. Ezeiza (2). WAC3 – September 15-16, 2007 – Louvain-la-Neuve. (1) Elhuyar R&D, Usurbil, Basque Country. (2) IXA Group, University of the Basque Country, Donostia, Basque Country

Motivation Problems with Basque language Our approach CorpEus, a ‘web as corpus’ tool for Basque EusBila, a search service for Basque Evaluation Contents

No doubt corpora are necessary: for linguistic research for language normalization for developing language technologies But many corpora are exclusively used for these purposes They are not made publicly available and searchable through the Internet Motivation

For Basque, it is essential to have corpora available for querying Standardization of Basque started only in 1968 Many rules, words and spellings have been changing since; still, every now and then new rules are released by the Academy of Basque Language It was not taught in schools until the seventies and in universities until the eighties No decision as to the correct word or spelling has yet been taken in many areas or words Even written production abounds with misspellings, errors, uncertainties, etc. Motivation

Basque speaking community needs corpora Teachers Writers Technical text producers Dictionary makers Translators Students Academics in the field of standardization Basque is not a language rich in corpora Few, small and not updated Motivation

Only corpora available (I): XX. mendeko euskararen corpusa: Academy of the Basque language 4.6 million words Balanced Literary texts Twentieth century http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html Motivation

Only corpora available (II): Ereduzko prosa gaur: University of the Basque Country 23.8 million words Literary and press texts regarded as “reference” 2000 - 2005 http://www.ehu.es/euskara-orria/euskara/ereduzkoa/araka.html Motivation

Only corpora available (III): Zientzia eta teknologiaren corpusa: Elhuyar Foundation and the IXA Group of the University of the Basque Country 7.6 million words Texts on science and technology 1990 - 2002 http://www.ztcorpusa.net Motivation

Only corpora available (IV): Klasikoen gordailua: Susa publishing house 10.7 million words Non-tagged Classic texts http://klasikoak.armiarma.com/corpus.htm Motivation

But we do have the Internet Huge repository of texts Constantly updated A tool for querying the Internet as if it were a Basque corpus would be very interesting Motivation

Also disadvantages: Not linguistically tagged: Always some uncertainty Variants and misspellings will not appear when looking for a word It will never show all, only what there is in the first results returned by search engines The Internet is often considered non-representative The Internet is full of redundancy Motivation

Nevertheless, we thought that the benefits far exceeded the disadvantages We embarked on a project to build a ‘web as corpus’ tool for Basque Motivation

Similar services exist: WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi) WebCorp (http://www.webcorp.org.uk/) KWiCFinder (http://www.kwicfinder.com) But these rely on search engines Search engines don’t work well for Basque Problems with Basque language

Looking for conjugations and inflections Basque is an agglutinative language A given lemma makes many different word forms lan (“work”): lana (“the work”), lanak (“works” or “the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”), lanen (“of the works”)… Looking only for the exact given word, or the word plus an “s” for the plural, is not enough Wildcards are not an appropriate solution Looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)… Problems with Basque language
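The wildcard problem on this slide can be reproduced in a few lines. This is a minimal sketch that uses shell-style pattern matching as a stand-in for a search engine wildcard; the word lists come from the slide, and the function name `wildcard_hits` is hypothetical.

```python
from fnmatch import fnmatchcase

# Forms of the lemma "lan" listed on the slide, plus two unrelated lemmas.
lan_forms = ["lan", "lana", "lanak", "lanari", "lanei", "lanaren", "lanen"]
unrelated = ["lanabes", "lanbro"]

def wildcard_hits(pattern, words):
    """Return the words that a wildcard pattern such as 'lan*' would match."""
    return [w for w in words if fnmatchcase(w, pattern)]

hits = wildcard_hits("lan*", lan_forms + unrelated)
# Words matched by the wildcard that are not forms of "lan" at all.
false_positives = [w for w in hits if w not in lan_forms]
print(false_positives)
```

The exact-word query has the opposite problem: `wildcard_hits("lan", ...)` matches only the bare lemma and misses every inflected form.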

Language discrimination No search engine offers the possibility of returning only pages in Basque Big problem when looking for: Technical words that exist also in other languages: anorexia, sulfuroso, byte, allegro, sistema, energia… Short words: katu (“cat”), ur (“water”)… Proper nouns: Egipto, Newton, Pluton… Many non-Basque results are returned, often no Basque results at all Problems with Basque language

Lack of knowledge about the language Status of language: Late standardization Still many changes in words and rules Late teaching in schools and universities Many non-standardised areas or words Many misspellings and errors in written production A word might be incorrect but appear often in the web The user might think it is correct, without knowing that a more appropriate word exists Problems with Basque language

Looking for conjugations and inflections: Morphological query expansion (I) Morphological generator created by the IXA Group of the University of the Basque Country We obtain all the forms of a given lemma We ask the search engine for all of them using an OR operator etxe (“house”) => etxe OR etxea OR etxeak OR etxeari OR etxeek OR … Our approach

Looking for conjugations and inflections: Morphological query expansion (II). Minor problems: each search engine API limits the number of words or the length of the search phrase, and we had to discover these limits by trial and error. Because of these limits, a truly lemmatised search is impossible, so we looked in a corpus for the most frequent cases, numbers, tenses, etc. of the declensions and inflections of words; these are the word forms sent in the query. Our approach
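The expansion described in these two slides might be sketched as follows. The function name, the `max_terms` and `max_chars` parameters and their values are assumptions for illustration; the slides do not give the real limits, and the order of the example forms is not taken from real frequency counts.

```python
def build_or_query(forms_by_frequency, max_terms=None, max_chars=None):
    """Join word forms with OR, most frequent first, within engine limits.

    forms_by_frequency: word forms already sorted from most to least frequent.
    max_terms / max_chars: hypothetical limits; the real ones were found by
    trial and error and are not stated in the slides.
    """
    chosen = []
    for form in forms_by_frequency:
        candidate = " OR ".join(chosen + [form])
        if max_terms is not None and len(chosen) + 1 > max_terms:
            break  # adding this form would exceed the term limit
        if max_chars is not None and len(candidate) > max_chars:
            break  # adding this form would exceed the length limit
        chosen.append(form)
    return " OR ".join(chosen)

# Forms of "etxe" from the slide; the ordering here is illustrative only.
forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]
print(build_or_query(forms))
print(build_or_query(forms, max_terms=3))
```

Because the list is cut from the tail, the forms that are dropped are the least frequent ones, which is the trade-off the slide describes.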

Language discrimination: Language-filtering words (I). We looked in a corpus for the most frequent words in Basque. We include them in the search phrase using an AND operator. Our approach

Language discrimination: Language-filtering words (II). Minor problems (I): the most frequent words in Basque exist in other languages too, so several language-filtering words had to be used. The more filtering words, the higher the precision (fewer non-Basque pages returned) but the lower the recall (more Basque pages left out), and vice versa. We chose precision and include four filtering words; if few results are returned, the user can try again with increased recall. Our approach
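A minimal sketch of the filtering step, under stated assumptions: the four words below are common Basque function words chosen for illustration (the slides do not say which words CorpEus actually uses), and the parenthesised AND/OR query syntax is generic rather than that of any particular search engine API.

```python
# Common Basque function words, used here only for illustration; the slides
# do not name the four words CorpEus actually sends.
FILTER_WORDS = ["eta", "da", "ez", "ere"]

def add_language_filter(or_query, n_filter_words=4):
    """AND the expanded query with the first n language-filtering words."""
    filters = FILTER_WORDS[:n_filter_words]
    return " AND ".join([f"({or_query})"] + filters)

strict = add_language_filter("katu OR katua OR katuak")
relaxed = add_language_filter("katu OR katua OR katuak", n_filter_words=2)
print(strict)   # four filtering words: higher precision, lower recall
print(relaxed)  # two filtering words: lower precision, higher recall
```

Lowering `n_filter_words` corresponds to the retry with increased recall that the slide offers when few results come back.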

Language discrimination: Language-filtering words (III). Minor problems (II): in bilingual pages, the searched word can be in a piece of text that is not in Basque. LangId, a free language identifier developed by the IXA Group of the University of the Basque Country, is applied to some context around the word to see whether it is in a piece of text in Basque. It does not work well with small contexts, but if the context is too big, pieces in other languages can be included. We start with quite a broad context and progressively reduce its length until the minimal length at which LangId works properly is reached; if at any point LangId says the context is in Basque, we stop and show the occurrence. Our approach
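The shrinking-context check could look roughly like this. It is a sketch under assumptions: `identify` is a stand-in for LangId, whose real interface is not described in the slides, the context widths are hypothetical, and `toy_identify` is a deliberately crude marker-word vote used only to make the example runnable.

```python
def occurrence_is_basque(text, start, end, identify, widths=(400, 200, 100)):
    """Check whether the match text[start:end] sits in a Basque passage.

    identify: callable returning a language code for a string (stand-in
    for LangId). widths: context sizes in characters on each side, from
    broad down to the smallest size the identifier can handle (hypothetical).
    """
    for width in widths:
        context = text[max(0, start - width):end + width]
        if identify(context) == "eu":
            return True  # stop at the first context judged to be Basque
    return False

def toy_identify(context):
    """Toy identifier: majority vote over a few marker words (illustration only)."""
    words = context.lower().split()
    basque = sum(w in {"eta", "da", "ez", "ere"} for w in words)
    spanish = sum(w in {"el", "la", "de", "que"} for w in words)
    return "eu" if basque > spanish else "es"

# Toy bilingual page: a long Spanish block followed by a short Basque one.
spanish = "el gato de la casa que " * 10
basque = "katua etxean da eta txakurra ez da etxean "
text = spanish + basque + "lanak " + basque
start = text.index("lanak")
end = start + len("lanak")

print(occurrence_is_basque(text, start, end, toy_identify, widths=(300,)))
print(occurrence_is_basque(text, start, end, toy_identify, widths=(300, 40)))
```

With only the broad context, the Spanish block outvotes the Basque passage; once the window shrinks to the Basque passage around the word, the occurrence is accepted.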

Lack of knowledge about the language: Variant suggestion (I). EDBL, lexical database created by the IXA Group of the University of the Basque Country. Each word is linked to its variants, common errors, old spellings, etc. When a user enters a word, its standard form or variants are suggested. Our approach

Lack of knowledge about the language: Variant suggestion (II). This somewhat alleviates one of the problems of the non-linguistically-tagged nature of the web: in a tagged corpus, variants would be assigned the correct lemma and would appear when looking for the lemma; with our approach, the user can obtain the variants too. Our approach

System architecture: User enters word; query the EDBL for variants; query the morphological generator to obtain conjugations and inflections; query the APIs of search engines; download pages; find occurrences of the forms of the word; query LangId for the language the occurrences are in; show KWiCs and counts. CorpEus
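The sequence of steps on this slide can be sketched as a single function that wires the components together. Every component is passed in as a callable because the real interfaces of EDBL, the morphological generator, the search engine APIs and LangId are not given in the slides; all names, the `span` parameter and the stub data below are hypothetical.

```python
import re
from collections import Counter

def corpeus_pipeline(word, get_variants, generate_forms, search, download,
                     is_basque, span=30):
    """Run the slide's steps in order and return (variants, KWiCs, counts)."""
    variants = get_variants(word)              # EDBL lookup (stand-in)
    forms = generate_forms(word)               # morphological generation (stand-in)
    urls = search(" OR ".join(forms))          # search engine API (stand-in)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, forms)) + r")\b")
    kwics, counts = [], Counter()
    for url in urls:
        page = download(url)
        for m in pattern.finditer(page):
            # Keep only occurrences whose context is judged to be Basque.
            if is_basque(page, m.start(), m.end()):
                left = page[max(0, m.start() - span):m.start()]
                right = page[m.end():m.end() + span]
                kwics.append((left, m.group(), right))
                counts[m.group()] += 1
    return variants, kwics, counts

# Stub components and a one-page toy "web" so the sketch runs offline.
pages = {"u1": "etxea handia da eta etxeak ere bai"}
variants, kwics, counts = corpeus_pipeline(
    "etxe",
    get_variants=lambda w: [],
    generate_forms=lambda w: ["etxe", "etxea", "etxeak"],
    search=lambda query: ["u1"],
    download=pages.get,
    is_basque=lambda page, start, end: True,
)
print(kwics)
print(counts)
```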

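[Diagram: CorpEus system architecture. The user's word is sent to EDBL (IXA), which returns variants; the word and its variants go to the morphological generator (IXA), which returns inflections and conjugations; the resulting search phrase is sent to the search engines' APIs, which return URLs; the web pages are fetched from the WWW; occurrence contexts are sent to LangId (IXA), which returns the language; KWiCs and counts are shown to the user.] CorpEus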

Features (I): Lemma-based search Language-filtered search Variant suggestion CorpEus

Features (II): Ambiguous or unrecognised words: The user chooses the analysis upon which to base the morphological generation CorpEus

Features (III): Search for more than one word: Lemma-based search performed for all of them Occurrences of any of the words are shown CorpEus

Features (IV): Noun phrase or term searching: Enclosing various terms in double quotes Morphological generation applied to last word Thus, proper lemma-based search for whole noun phrases or terms (in Basque, only the last component of the noun phrase or term is inflected) CorpEus
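Phrase search as described here, with generation applied only to the last word, might be sketched like this. `generate_forms` is a hypothetical stand-in for the IXA morphological generator, and the quoted-phrase OR syntax is generic rather than tied to a specific search engine.

```python
def expand_phrase(phrase, generate_forms):
    """Expand a noun phrase or term by inflecting only its last word."""
    *head, last = phrase.split()
    prefix = " ".join(head)
    quoted = []
    for form in generate_forms(last):
        words = f"{prefix} {form}" if prefix else form
        quoted.append(f'"{words}"')  # each alternative is an exact phrase
    return " OR ".join(quoted)

# Stub generator returning a few forms of "talde" ("group"), for illustration.
def toy_forms(lemma):
    return [lemma, lemma + "a", lemma + "ak"]

query = expand_phrase("lan talde", toy_forms)
print(query)
```

The first component (`lan`) is left untouched in every alternative, matching the slide's point that only the last component of a Basque noun phrase is inflected.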

Features (V): Different ordering criteria: order of arrival of the pages (default); form of the searched word; context after the word; context before the word. Results are ordered on the fly as they arrive. CorpEus
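The four ordering criteria can be illustrated on a small list of KWiC lines. This is a sketch only: the tuple layout, the function name and the criterion labels are assumptions, sorting the left context from the word nearest the keyword outwards is one plausible reading of "context before the word", and incremental ordering as results arrive is not shown.

```python
def sort_kwics(kwics, criterion="arrival"):
    """Order KWiC lines given as (left_context, form, right_context) tuples."""
    if criterion == "arrival":
        return list(kwics)  # default: keep the order in which pages arrived
    if criterion == "form":
        return sorted(kwics, key=lambda k: k[1].lower())
    if criterion == "after":
        return sorted(kwics, key=lambda k: k[2].lower())
    if criterion == "before":
        # Compare the words nearest the keyword first (assumed behaviour).
        return sorted(kwics, key=lambda k: k[0].lower().split()[::-1])
    raise ValueError(f"unknown criterion: {criterion}")

kwics = [
    ("hau da", "etxea", "handia da"),
    ("bi", "etxeak", "erosi ditu"),
    ("gure", "etxe", "berria"),
]
for criterion in ("arrival", "form", "after", "before"):
    print(criterion, [k[1] for k in sort_kwics(kwics, criterion)])
```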

Features (VI): Analysis of the words: Possible lemmas and POSs of the forms of the searched word are shown in a floating box Different colours: Light green: correct word, unambiguous Dark green: variant, unambiguous Light yellow: correct word, ambiguous Dark yellow: variant, ambiguous Red: unrecognised word CorpEus

Features (VII): Count charts: Word forms Possible lemma or POS Word before or after Lemma of word before or after … CorpEus
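The simpler of these counts (word forms, word before, word after) can be derived directly from the KWiC lines. The sketch below assumes a hypothetical tuple layout of (left context, form, right context); counts by lemma or POS would additionally need a morphological analyser and are not shown.

```python
from collections import Counter

def kwic_counts(kwics):
    """Count word forms and the words immediately before and after them."""
    forms = Counter(form for _, form, _ in kwics)
    before = Counter(left.split()[-1] for left, _, _ in kwics if left.split())
    after = Counter(right.split()[0] for _, _, right in kwics if right.split())
    return forms, before, after

kwics = [
    ("gure", "etxea", "handia da"),
    ("zure", "etxea", "handia da"),
    ("gure", "etxeak", "berriak dira"),
]
forms, before, after = kwic_counts(kwics)
print(forms.most_common())
print(before.most_common())
print(after.most_common())
```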

Features (VIII): Many textual content file types: HTML XML RSS TXT PDF DOC RTF PPT XLS … Parallel downloading of pages to avoid blocking CorpEus
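Parallel downloading so that one slow page does not hold up the rest might be sketched with a thread pool. The `fetch` callable, the worker count and the stub used for the demonstration are assumptions; the slides do not say how CorpEus implements this.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, fetch, max_workers=8):
    """Yield (url, content) pairs as downloads finish.

    fetch: callable that takes a URL and returns its text (hypothetical;
    in practice this would wrap an HTTP client and file-type converters).
    A URL whose fetch raises is reported with content None instead of
    stopping the other downloads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result()
            except Exception:
                yield url, None

# Offline stub standing in for a real downloader.
def fake_fetch(url):
    if url == "bad":
        raise IOError("unreachable")
    return f"content of {url}"

results = dict(download_all(["a", "b", "bad"], fake_fetch))
print(results)
```

Because results are yielded in completion order, occurrences can be shown to the user as soon as each page arrives.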

Demo: http://www.corpeus.org CorpEus
