CorpEus: A Tool for Basque Language Research

CorpEus is a ‘web as corpus’ tool designed for the agglutinative nature of Basque. It addresses the scarcity of publicly available, searchable Basque corpora by letting users query the web as if it were a Basque corpus. Such a resource serves linguistic research, language normalization and the development of language technologies, as well as teachers, writers, translators and other members of the Basque-speaking community. Because search engines do not work well for Basque, CorpEus adds morphological query expansion, language filtering and variant suggestion on top of them.

  1. CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque. I. Leturia, A. Gurrutxaga (1); I. Alegria, A. Ezeiza (2). WAC3, September 15-16, 2007, Louvain-la-Neuve. (1) Elhuyar R&D, Usurbil, Basque Country. (2) IXA Group, University of the Basque Country, Donostia, Basque Country.

  4. Motivation. No doubt corpora are necessary: for linguistic research, for language normalization, and for developing language technologies. But many corpora are used exclusively for these purposes; they are not made publicly available and searchable through the Internet.

  5. For Basque, it is essential to have corpora available for querying. Standardization of Basque started only in 1968. Many rules, words and spellings have been changing since; every now and then, new rules are still released by the Academy of the Basque Language. Basque was not taught in schools until the seventies, or in universities until the eighties. In many areas and for many words, no decision has yet been taken as to the correct word or spelling. Even written production abounds with misspellings, errors, uncertainties, etc.

  6. The Basque-speaking community needs corpora: teachers, writers, technical text producers, dictionary makers, translators, students, and academics in the field of standardization. But Basque is not a language rich in corpora: they are few, small and not updated.

  7. Only corpora available (I): XX. mendeko euskararen corpusa (Academy of the Basque Language). 4.6 million words. Balanced. Literary texts. Twentieth century. http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html

  8. Only corpora available (II): Ereduzko prosa gaur (University of the Basque Country). 23.8 million words. Literary and press texts regarded as “reference”. 2000-2005. http://www.ehu.es/euskara-orria/euskara/ereduzkoa/araka.html

  9. Only corpora available (III): Zientzia eta teknologiaren corpusa (Elhuyar Foundation and the IXA Group of the University of the Basque Country). 7.6 million words. Texts on science and technology. 1990-2002. http://www.ztcorpusa.net

  10. Only corpora available (IV): Klasikoen gordailua (Susa publishing house). 10.7 million words. Non-tagged. Classic texts. http://klasikoak.armiarma.com/corpus.htm

  11. But we do have the Internet: a huge repository of texts, constantly updated. A tool for querying the Internet as if it were a Basque corpus would be very interesting.

  12. There are also disadvantages. The web is not linguistically tagged: there is always some uncertainty, and variants and misspellings will not appear when looking for a word. A search will never show everything, only what is in the first results returned by search engines. The Internet is often considered non-representative. The Internet is full of redundancy.

  13. Nevertheless, we thought that the benefits far exceeded the disadvantages, so we embarked on a project to build a ‘web as corpus’ tool for Basque.

  16. Problems with Basque language. Similar services exist: WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi), WebCorp (http://www.webcorp.org.uk/), KWiCFinder (http://www.kwicfinder.com). But these rely on search engines, and search engines don’t work well for Basque.

  17. Looking for conjugations and inflections. Basque is an agglutinative language: a given lemma makes many different word forms, e.g. lan (“work”): lana (“the work”), lanak (“works” or “the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”), lanen (“of the works”)… Looking only for the exact given word, or the word plus an “s” for the plural, is not enough. Wildcards are not an appropriate solution: looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)…
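
The wildcard problem described on this slide can be illustrated with a small sketch. The word forms come from the slide; the sample word list and the function names are assumptions made up for the illustration.

```python
# Forms of the lemma "lan" as listed on the slide.
LAN_FORMS = {"lan", "lana", "lanak", "lanari", "lanei", "lanaren", "lanen"}


def wildcard_match(prefix, words):
    """Emulate a lan* wildcard: keep every word that starts with the prefix."""
    return [w for w in words if w.startswith(prefix)]


def form_match(forms, words):
    """Keep only the words that are known forms of the lemma."""
    return [w for w in words if w in forms]


# Hypothetical sample of words found on a page.
page_words = ["lana", "lanabes", "lanbro", "lanen", "etxea"]

print(wildcard_match("lan", page_words))  # also returns lanabes and lanbro
print(form_match(LAN_FORMS, page_words))  # returns only forms of lan
```

The prefix match over-generates because unrelated lemmas share the same first letters, which is why an explicit list of forms is needed.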

  18. Language discrimination. No search engine offers the possibility of returning only pages in Basque. This is a big problem when looking for: technical words that also exist in other languages (anorexia, sulfuroso, byte, allegro, sistema, energia…); short words (katu (“cat”), ur (“water”)…); proper nouns (Egipto, Newton, Pluton…). Many non-Basque results are returned, and often no Basque results at all.

  19. Lack of knowledge about the language. Status of the language: late standardization; still many changes in words and rules; late teaching in schools and universities; many non-standardised areas or words; many misspellings and errors in written production. A word might be incorrect but appear often on the web, so the user might think it is correct, without knowing that a more appropriate word exists.

  22. Our approach. Looking for conjugations and inflections: morphological query expansion (I). Using a morphological generator created by the IXA Group of the University of the Basque Country, we obtain all the forms of a given lemma. We ask the search engine for all of them using an OR operator: etxe (“house”) => etxe OR etxea OR etxeak OR etxeari OR etxeek OR …
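
As a minimal sketch of the OR expansion, assuming the forms have already been produced by a morphological generator (here they are simply the ones listed on the slide), the query string could be assembled like this. The function name is hypothetical.

```python
def build_or_query(forms):
    """Join word forms with the OR operator, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(forms))
    return " OR ".join(unique)


# Forms of "etxe" shown on the slide; a real generator would return many more.
etxe_forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]
print(build_or_query(etxe_forms))
```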

  23. Looking for conjugations and inflections: morphological query expansion (II). Minor problems: the APIs of the search engines each have a limit on the number of words or the length of the search phrase, and we had to discover these limits by trial and error. Due to these limits, real lemmatised search is impossible. We looked in a corpus for the most frequent cases, numbers, tenses, etc. among the declensions and inflections of words; these are the forms sent in the query.
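
One way to picture the selection of forms under an API limit is the sketch below. The frequency counts and the limits are invented for the example, and ranking individual forms by count is a simplification: the slide describes choosing the most frequent cases, numbers and tenses from a corpus, not this exact procedure.

```python
def select_forms(forms_by_frequency, max_terms, max_chars, operator=" OR "):
    """Keep the most frequent forms that fit within both limits.

    forms_by_frequency: list of (form, frequency) pairs.
    Stops at the first form that would exceed either limit.
    """
    ranked = sorted(forms_by_frequency, key=lambda pair: pair[1], reverse=True)
    selected = []
    length = 0
    for form, _freq in ranked:
        extra = len(form) + (len(operator) if selected else 0)
        if len(selected) >= max_terms or length + extra > max_chars:
            break
        selected.append(form)
        length += extra
    return selected


# Hypothetical frequencies, for illustration only.
forms = [("etxe", 50), ("etxea", 120), ("etxeak", 80), ("etxeari", 10), ("etxeek", 5)]
print(select_forms(forms, max_terms=3, max_chars=100))
```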

  24. Language discrimination: language-filtering words (I). We looked in a corpus for the most frequent words in Basque, and we include them in the search phrase using an AND operator.

  25. Language discrimination: language-filtering words (II). Minor problems (I): the most frequent words in Basque exist in other languages too, so several language-filtering words had to be used. The more of these we used, the more we gained in precision (fewer non-Basque pages returned) but the more we lost in recall (more Basque pages left out), and vice versa. We chose precision and include four filtering words; if few results are returned, the user can try again with increased recall.
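
The precision/recall trade-off of the filtering words can be sketched as follows. The four filter words are an assumption (common Basque function words chosen for illustration, not necessarily the ones CorpEus uses), and the query syntax with parentheses and AND is likewise hypothetical.

```python
# Illustrative assumption: four very frequent Basque words.
FILTER_WORDS = ["eta", "da", "ez", "du"]


def build_filtered_query(or_query, n_filters=4, filter_words=FILTER_WORDS):
    """AND the expanded word query with the first n_filters filtering words."""
    parts = [f"({or_query})"] + list(filter_words[:n_filters])
    return " AND ".join(parts)


# Default: four filtering words, favouring precision.
print(build_filtered_query("etxe OR etxea"))
# Retry with fewer filtering words to increase recall.
print(build_filtered_query("etxe OR etxea", n_filters=2))
```

Lowering `n_filters` corresponds to the retry with increased recall mentioned on the slide: fewer required words means more pages match, at the cost of more non-Basque results.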

  26. Language discrimination: language-filtering words (III). Minor problems (II): in bilingual pages, the searched word can be in a piece of text that is not in Basque. LangId, a free language identifier developed by the IXA Group of the University of the Basque Country, is applied to some context around the word to see whether it is in a piece of text in Basque. LangId does not work well with small contexts, but if the context is too big, pieces in other languages can be included. We start with quite a broad context and progressively reduce its length until the minimal length for LangId to work properly is reached; if at any point LangId says the context is in Basque, we stop and show the occurrence.
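
The shrinking-context check can be sketched as a loop. Everything concrete here is an assumption: the radii are measured in words and their values are invented, and `toy_identify` is a crude stand-in that does not reflect how LangId actually works.

```python
def occurrence_is_basque(words, index, identify, start_radius=40, min_radius=10, step=10):
    """Shrink the context around words[index] until the identifier says Basque.

    identify: callable taking a string and returning a language code.
    Returns True as soon as one context is identified as Basque ("eu"),
    and False if the minimal radius is passed without that happening.
    """
    radius = start_radius
    while radius >= min_radius:
        lo = max(0, index - radius)
        hi = min(len(words), index + radius + 1)
        context = " ".join(words[lo:hi])
        if identify(context) == "eu":
            return True
        radius -= step
    return False


def toy_identify(text, basque_markers=frozenset({"eta", "da", "ez", "du"})):
    """Stand-in identifier: 'eu' if at least a fifth of the tokens are marker words."""
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in basque_markers)
    return "eu" if tokens and hits / len(tokens) >= 0.2 else "other"
```

With a short Basque fragment embedded in a long run of text in another language, the broad contexts are rejected and a narrower one is accepted, which is the behaviour the slide describes.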

  27. Lack of knowledge about the language: variant suggestion (I). EDBL is a lexical database created by the IXA Group of the University of the Basque Country, in which each word is linked to its variants, common errors, old spellings, etc. When a user enters a word, its standard form or variants are suggested.

  28. Lack of knowledge about the language: variant suggestion (II). This somewhat alleviates one of the problems of the non-linguistically-tagged nature of the web: in a tagged corpus, variants would be assigned the correct lemma and would appear when looking for the lemma; with our approach, the user can obtain the variants too.
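
Variant suggestion amounts to a lookup in both directions between a standard form and its variants. The sketch below uses placeholder strings rather than real Basque words or real EDBL entries, and the dictionary layout is an assumption, not EDBL's actual structure.

```python
# Hypothetical entries: placeholder strings, not real EDBL data.
VARIANT_TO_STANDARD = {"variant_a": "standard_x", "variant_b": "standard_x"}


def suggest(word, variant_to_standard=VARIANT_TO_STANDARD):
    """Return the standard form for a variant, or the known variants of a standard form."""
    if word in variant_to_standard:
        return {"standard": variant_to_standard[word], "variants": []}
    variants = sorted(v for v, s in variant_to_standard.items() if s == word)
    return {"standard": word if variants else None, "variants": variants}


print(suggest("variant_a"))   # suggests the standard form
print(suggest("standard_x"))  # suggests the variants to search for as well
```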

  31. CorpEus. System architecture: the user enters a word; EDBL is queried for variants; the morphological generator is queried to obtain conjugations and inflections; the APIs of the search engines are queried; the pages are downloaded; occurrences of the forms of the word are found; LangId is queried for the language the occurrences are in; KWiCs and counts are shown.

  32. [Architecture diagram of CorpEus. Components: user, EDBL (IXA), morphological generator (IXA), search engines’ APIs, WWW, LangId (IXA). Data exchanged: word; variants; inflections and conjugations; search phrase; URLs; web pages; occurrence contexts; language; KWiCs and counts.]
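
The flow of the architecture can be sketched as one function that wires the components together. All five callables are hypothetical stand-ins (for EDBL, the morphological generator, the search engines’ APIs, the downloader and LangId), and the fixed character-width context is an assumption made to keep the example short.

```python
import re


def run_query(word, variants_of, forms_of, search, fetch, is_basque, width=30):
    """Return (suggestions, KWiC tuples) for a word, following the steps on the slides.

    variants_of(word) -> list of suggested variants
    forms_of(word)    -> list of word forms to look for
    search(query)     -> iterable of URLs
    fetch(url)        -> page text
    is_basque(text)   -> bool
    """
    suggestions = variants_of(word)
    forms = forms_of(word)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, forms)) + r")\b")
    kwics = []
    for url in search(" OR ".join(forms)):
        text = fetch(url)
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Keep the occurrence only if its context is identified as Basque.
            if is_basque(left + match.group() + right):
                kwics.append((left, match.group(), right, url))
    return suggestions, kwics
```

Passing the components in as callables keeps the sketch runnable without the real services, and mirrors the way the diagram treats them as separate modules.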

  33. Features (I): lemma-based search; language-filtered search; variant suggestion.

  34. Features (II): ambiguous or unrecognised words. The user chooses the analysis upon which to base the morphological generation.

  35. Features (III): search for more than one word. A lemma-based search is performed for all of them, and occurrences of any of the words are shown.

  36. Features (IV): noun phrase or term searching, by enclosing various terms in double quotes. Morphological generation is applied to the last word, giving a proper lemma-based search for whole noun phrases or terms (in Basque, only the last component of the noun phrase or term is inflected).
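
Phrase search with inflection of the last word only could be sketched as below. The function name is hypothetical, the forms are passed in rather than generated, and the two-word combination in the example is purely illustrative, not a real Basque term.

```python
def build_phrase_query(phrase, forms_of_last):
    """Build an OR query of quoted phrases, varying only the last word.

    phrase: non-empty string of one or more words.
    forms_of_last: forms of the last word, e.g. from a morphological generator.
    """
    *head, _last = phrase.split()
    prefix = " ".join(head)
    phrases = [f'"{prefix} {form}"' if prefix else f'"{form}"' for form in forms_of_last]
    return " OR ".join(phrases)


# Illustrative word combination; the forms of "etxe" are those shown in the slides.
print(build_phrase_query("ur etxe", ["etxe", "etxea", "etxeak"]))
```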

  37. Features (V): different ordering criteria: order in which the pages arrive (default); form of the searched word; context after the word; context before the word. Results are ordered on the fly as they arrive.
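
The four ordering criteria map naturally onto sort keys. This sketch assumes a simple record type for an occurrence and re-sorts the whole list on demand; it does not attempt the on-the-fly insertion mentioned on the slide. Comparing the context before the word from the nearest word outwards is a common concordancer convention and is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class Kwic:
    before: str   # context before the word
    form: str     # matched word form
    after: str    # context after the word
    arrival: int  # order in which the occurrence arrived


SORT_KEYS = {
    "arrival": lambda k: k.arrival,
    "form": lambda k: k.form.lower(),
    "after": lambda k: k.after.lower(),
    # Compare the context before the word starting from the nearest word.
    "before": lambda k: [w.lower() for w in reversed(k.before.split())],
}


def order_kwics(kwics, criterion="arrival"):
    """Return the occurrences sorted by one of the four criteria."""
    return sorted(kwics, key=SORT_KEYS[criterion])
```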

  38. Features (VI): analysis of the words. Possible lemmas and POSs of the forms of the searched word are shown in a floating box, with different colours: light green for a correct, unambiguous word; dark green for an unambiguous variant; light yellow for a correct but ambiguous word; dark yellow for an ambiguous variant; red for an unrecognised word.
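
The colour scheme is a small decision table, which can be written out as a sketch. The function and parameter names are hypothetical; the colour assignments are the ones listed on the slide.

```python
def colour_for(recognised, is_variant=False, is_ambiguous=False):
    """Map the analysis of a word form to the colour used to display it."""
    if not recognised:
        return "red"
    shade = "dark" if is_variant else "light"     # variants get the darker shade
    hue = "yellow" if is_ambiguous else "green"   # ambiguity switches green to yellow
    return f"{shade} {hue}"
```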

  39. Features (VII): count charts of word forms, possible lemma or POS, word before or after, lemma of word before or after, …
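
The data behind such count charts can be gathered with simple counters. This sketch covers only word forms and the neighbouring words; counts by lemma or POS would need a morphological analyser and are left out. The triple layout of an occurrence is an assumption.

```python
from collections import Counter


def count_charts(kwics):
    """Count word forms and neighbouring words.

    kwics: iterable of (before, form, after) string triples.
    """
    forms, prev_words, next_words = Counter(), Counter(), Counter()
    for before, form, after in kwics:
        forms[form] += 1
        before_tokens, after_tokens = before.split(), after.split()
        if before_tokens:
            prev_words[before_tokens[-1]] += 1
        if after_tokens:
            next_words[after_tokens[0]] += 1
    return {"forms": forms, "word_before": prev_words, "word_after": next_words}
```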

  40. Features (VIII): many textual content file types (HTML, XML, RSS, TXT, PDF, DOC, RTF, PPT, XLS, …); parallel downloading of pages to avoid blocking.
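
Parallel downloading can be sketched with a thread pool, so that one slow page does not hold up the others. The slides do not say how CorpEus implements this; the use of threads, the worker count and the `fetch` callable are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Fetch URLs in parallel and yield (url, content) as each download finishes.

    fetch: callable taking a URL and returning its content.
    A download that raises an exception is yielded as (url, None).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result()
            except Exception:
                yield url, None
```

Because results are yielded in completion order, occurrences can be extracted and displayed as soon as each page arrives.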

  41. Demo: http://www.corpeus.org

