CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque
I. Leturia, A. Gurrutxaga (1), I. Alegria, A. Ezeiza (2)
WAC3, September 15-16, 2007, Louvain-la-Neuve
(1) Elhuyar R&D, Usurbil, Basque Country
(2) IXA Group, University of the Basque Country, Donostia, Basque Country
Contents
Motivation; Problems with Basque language; Our approach; CorpEus, a ‘web as corpus’ tool for Basque; EusBila, a search service for Basque; Evaluation
Motivation
No doubt corpora are necessary: for linguistic research, for language normalization and for developing language technologies. But many corpora are used exclusively for these purposes: they are not made publicly available and searchable through the Internet.
Motivation
For Basque, it is essential to have corpora available for querying. Standardization of Basque started only in 1968. Many rules, words and spellings have changed since then, and the Academy of the Basque Language still releases new rules every now and then. Basque was not taught in schools until the seventies or in universities until the eighties. In many areas, and for many words, no decision has yet been taken as to the correct word or spelling. Even written production abounds with misspellings, errors and uncertainties.
Motivation
The Basque-speaking community needs corpora: teachers, writers, technical text producers, dictionary makers, translators, students, and academics in the field of standardization. But Basque is not a language rich in corpora: they are few, small and not updated.
Motivation
Only corpora available (I): XX. mendeko euskararen corpusa. Academy of the Basque Language; 4.6 million words; balanced; literary texts; twentieth century. http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html
Motivation
Only corpora available (II): Ereduzko prosa gaur. University of the Basque Country; 23.8 million words; literary and press texts regarded as “reference”; 2000 - 2005. http://www.ehu.es/euskara-orria/euskara/ereduzkoa/araka.html
Motivation
Only corpora available (III): Zientzia eta teknologiaren corpusa. Elhuyar Foundation and the IXA Group of the University of the Basque Country; 7.6 million words; texts on science and technology; 1990 - 2002. http://www.ztcorpusa.net
Motivation
Only corpora available (IV): Klasikoen gordailua. Susa publishing house; 10.7 million words; non-tagged; classic texts. http://klasikoak.armiarma.com/corpus.htm
Motivation
But we do have the Internet: a huge repository of texts, constantly updated. A tool for querying the Internet as if it were a Basque corpus would be very interesting.
Motivation
There are also disadvantages. The Internet is not linguistically tagged, so there is always some uncertainty, and variants and misspellings will not appear when looking for a word. A search will never show everything, only what is in the first results returned by the search engines. The Internet is often considered non-representative. The Internet is full of redundancy.
Motivation
Nevertheless, we thought that the benefits far exceeded the disadvantages, so we embarked on a project to build a ‘web as corpus’ tool for Basque.
Problems with Basque language
Similar services exist: WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi), WebCorp (http://www.webcorp.org.uk/) and KWiCFinder (http://www.kwicfinder.com). But these rely on search engines, and search engines don’t work well for Basque.
Problems with Basque language
Looking for conjugations and inflections. Basque is an agglutinative language, so a given lemma produces many different word forms. For lan (“work”): lana (“the work”), lanak (“works” or “the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”), lanen (“of the works”)… Looking only for the exact given word, or the word plus an “s” for the plural, is not enough. Wildcards are not an appropriate solution either: looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)…
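The wildcard problem can be illustrated with a minimal sketch. The word list is just the handful of forms quoted on the slide, and `wildcard_matches` is a hypothetical helper name, not part of any tool described here.

```python
import fnmatch

# Forms of the lemma "lan" quoted on the slide, plus the two unrelated
# words that happen to share the same prefix.
LAN_FORMS = {"lan", "lana", "lanak", "lanari", "lanei", "lanaren", "lanen"}
VOCABULARY = sorted(LAN_FORMS | {"lanabes", "lanbro"})


def wildcard_matches(pattern, words):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatch.fnmatchcase(w, pattern)]


matches = wildcard_matches("lan*", VOCABULARY)
# Words matched by the wildcard that are not forms of "lan" at all.
false_hits = [w for w in matches if w not in LAN_FORMS]
print(false_hits)
```

The wildcard cannot tell a suffix of lan from the rest of a different word, which is why the forms have to be generated explicitly.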
Problems with Basque language
Language discrimination. No search engine offers the possibility of returning only pages in Basque. This is a big problem when looking for technical words that also exist in other languages (anorexia, sulfuroso, byte, allegro, sistema, energia…), short words (katu (“cat”), ur (“water”)…) and proper nouns (Egipto, Newton, Pluton…). Many non-Basque results are returned, and often no Basque results at all.
Problems with Basque language
Lack of knowledge about the language. Status of the language: late standardization; still many changes in words and rules; late teaching in schools and universities; many non-standardised areas or words; many misspellings and errors in written production. A word might be incorrect but still appear often on the web, so the user might think it is correct without knowing that a more appropriate word exists.
Our approach
Looking for conjugations and inflections: morphological query expansion (I). We use a morphological generator created by the IXA Group of the University of the Basque Country to obtain all the forms of a given lemma, and we ask the search engine for all of them using an OR operator: etxe (“house”) => etxe OR etxea OR etxeak OR etxeari OR etxeek OR …
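A minimal sketch of the OR expansion, assuming the generated forms are already available as a list. `build_or_query` is a hypothetical name, and the forms are only the ones quoted on the slide; the real generator returns many more.

```python
def build_or_query(forms):
    """Join word forms with OR, dropping duplicates but keeping their order."""
    unique_forms = list(dict.fromkeys(forms))
    return " OR ".join(unique_forms)


# Forms of etxe ("house") taken from the slide.
etxe_forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]
query = build_or_query(etxe_forms)
print(query)  # etxe OR etxea OR etxeak OR etxeari OR etxeek
```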
Our approach
Looking for conjugations and inflections: morphological query expansion (II). Minor problems: the APIs of the search engines each have a limit on the number of words or the length of the search phrase, and we had to discover these limits by trial and error. Because of these limits, a truly lemmatised search is impossible. Instead, we looked in a corpus for the most frequent cases, numbers, tenses, etc. of the declensions and inflections of words, and these are the word forms sent in the query.
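One way to fit the expansion into such limits is sketched below. It assumes the forms are already sorted from most to least frequent. The limit values are placeholders, since the actual limits of each API are not given here, and `select_forms` is a hypothetical name.

```python
def select_forms(forms_by_frequency, max_terms, max_chars):
    """Keep the most frequent forms that still fit the (assumed) API limits."""
    selected = []
    for form in forms_by_frequency:  # most frequent first
        candidate = " OR ".join(selected + [form])
        # Stop as soon as either the term limit or the length limit is exceeded.
        if len(selected) + 1 > max_terms or len(candidate) > max_chars:
            break
        selected.append(form)
    return selected


forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]  # assumed frequency order
by_term_limit = select_forms(forms, max_terms=3, max_chars=100)
by_length_limit = select_forms(forms, max_terms=10, max_chars=20)
print(by_term_limit, by_length_limit)
```

Stopping at the first form that does not fit keeps the query a strict prefix of the frequency ranking, so coverage degrades from the rarest forms first.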
Our approach
Language discrimination: language-filtering words (I). We looked in a corpus for the most frequent words in Basque, and we include them in the search phrase using an AND operator.
Our approach
Language discrimination: language-filtering words (II). Minor problems (I): the most frequent words in Basque exist in other languages too, so several language-filtering words had to be used. The more filtering words we use, the more we gain in precision (fewer non-Basque pages returned) but the more we lose in recall (more Basque pages left out), and vice versa. We chose precision and include four filtering words; if few results are returned, the user can try again with increased recall.
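The filtering-word idea can be sketched as follows. The filter words below are common Basque function words chosen for illustration only; the actual list used by the tool is not given here. The parenthesised `AND`/`OR` syntax and the name `build_filtered_query` are likewise assumptions.

```python
# Illustrative filter words (assumption): frequent Basque function words.
FILTER_WORDS = ["eta", "da", "ez", "ere", "bat"]


def build_filtered_query(or_query, filter_words, n_filters=4):
    """AND the expanded query with the first n_filters language-filtering words."""
    terms = [f"({or_query})"] + list(filter_words[:n_filters])
    return " AND ".join(terms)


strict = build_filtered_query("etxe OR etxea", FILTER_WORDS)
# Fewer filter words: lower precision, higher recall.
relaxed = build_filtered_query("etxe OR etxea", FILTER_WORDS, n_filters=2)
print(strict)
print(relaxed)
```

Lowering `n_filters` corresponds to the retry with increased recall mentioned on the slide.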
Our approach
Language discrimination: language-filtering words (III). Minor problems (II): in bilingual pages, the searched word can be in a piece of text that is not in Basque. LangId, a free language identifier developed by the IXA Group of the University of the Basque Country, is applied to some context around the word to check whether it lies in a piece of Basque text. LangId does not work well with small contexts, but if the context is too big, pieces in other languages can be included. We therefore start with quite a broad context and progressively reduce its length until the minimal length at which LangId works properly is reached; if at any point LangId says the context is in Basque, we stop and show the occurrence.
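The shrinking-context check can be sketched like this. LangId itself is not reproduced: `is_basque` is any callable that takes a text and returns a boolean, and `toy_is_basque` is a crude stand-in written only for this example. The window sizes are assumptions, not the values used by the tool.

```python
def occurrence_is_basque(tokens, index, is_basque,
                         start_window=40, min_window=10, step=10):
    """Shrink the context around tokens[index] until is_basque accepts it
    or the minimum window size is reached."""
    window = start_window
    while window >= min_window:
        lo = max(0, index - window // 2)
        hi = min(len(tokens), index + window // 2 + 1)
        if is_basque(" ".join(tokens[lo:hi])):
            return True  # stop at the first context judged to be Basque
        window -= step
    return False


# Toy stand-in for a language identifier (assumption, not LangId).
BASQUE_HINTS = {"etxea", "handia", "eta", "ederra", "da"}


def toy_is_basque(text):
    words = text.lower().split()
    return bool(words) and sum(w in BASQUE_HINTS for w in words) / len(words) >= 0.6


# A bilingual "page": 30 Spanish tokens followed by 10 Basque tokens.
spanish = ["la", "casa", "es", "grande", "y", "bonita"] * 5
basque = ["etxea", "handia", "eta", "ederra", "da"] * 2
tokens = spanish + basque

in_basque_part = occurrence_is_basque(tokens, 35, toy_is_basque)
in_spanish_part = occurrence_is_basque(tokens, 5, toy_is_basque)
print(in_basque_part, in_spanish_part)
```

In this example the widest contexts around token 35 are rejected because they still contain too much Spanish; only the 20-token window is accepted.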
Our approach
Lack of knowledge about the language: variant suggestion (I). EDBL is a lexical database created by the IXA Group of the University of the Basque Country in which each word is linked to its variants, common errors, old spellings, etc. When a user enters a word, its standard form or variants are suggested.
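A minimal sketch of variant suggestion as a two-way lookup. The single entry below is an illustrative assumption and is not taken from EDBL; `suggest` is a hypothetical name.

```python
from collections import defaultdict

# Illustrative entry only (assumption); the real EDBL data is not reproduced here.
VARIANT_TO_STANDARD = {"zientzi": "zientzia"}

# Reverse index: standard form -> list of known variants.
STANDARD_TO_VARIANTS = defaultdict(list)
for variant, standard in VARIANT_TO_STANDARD.items():
    STANDARD_TO_VARIANTS[standard].append(variant)


def suggest(word):
    """Return the standard form and known variants for a word, or None if unknown."""
    standard = VARIANT_TO_STANDARD.get(word, word)
    if standard not in STANDARD_TO_VARIANTS:
        return None
    return {"standard": standard, "variants": list(STANDARD_TO_VARIANTS[standard])}


print(suggest("zientzi"))
print(suggest("zientzia"))
```

Looking up either the variant or the standard form leads to the same entry, so the user sees the alternatives whichever one was typed.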
Our approach
Lack of knowledge about the language: variant suggestion (II). This somewhat alleviates one of the problems of the non-linguistically-tagged nature of the web: in a tagged corpus, variants would be assigned the correct lemma and would appear when looking for the lemma; with our approach, the user can obtain the variants too.
CorpEus
System architecture: the user enters a word; EDBL is queried for variants; the morphological generator is queried to obtain conjugations and inflections; the APIs of the search engines are queried; the pages are downloaded; occurrences of the forms of the word are found; LangId is queried for the language the occurrences are in; KWiCs and counts are shown.
CorpEus
[Architecture diagram. Components: User, EDBL (IXA), Morphological generator (IXA), Search engines’ APIs, WWW, LangId (IXA). Data flows labelled in the diagram: word; variants; word and variants; inflections and conjugations; search phrase; URLs; web pages; occurrence contexts; language; KWiCs and counts.]
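The step that finds occurrences and builds KWiC lines can be sketched as below. This is a simplified illustration on plain text with a regular expression, not the tool's implementation; `kwic` and its `width` parameter are hypothetical.

```python
import re


def kwic(text, forms, width=30):
    """Yield (left context, matched form, right context) for each occurrence."""
    # Longest forms first so that "etxeak" is preferred over "etxea".
    alternatives = "|".join(re.escape(f) for f in sorted(forms, key=len, reverse=True))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        yield left, m.group(0), right


page_text = "Etxea handia da eta etxeak ederrak dira."
rows = list(kwic(page_text, {"etxe", "etxea", "etxeak"}, width=10))
for left, form, right in rows:
    print(f"{left:>10} [{form}] {right}")
```

In the full pipeline each context would then be passed to the language check before being shown.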
CorpEus
Features (I): lemma-based search; language-filtered search; variant suggestion.
CorpEus
Features (II): ambiguous or unrecognised words. The user chooses the analysis on which to base the morphological generation.
CorpEus
Features (III): search for more than one word. Lemma-based search is performed for all of them, and occurrences of any of the words are shown.
CorpEus
Features (IV): noun phrase or term searching. Several words can be enclosed in double quotes, and morphological generation is applied to the last word only. This gives a proper lemma-based search for whole noun phrases or terms, because in Basque only the last component of the noun phrase or term is inflected.
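A sketch of phrase expansion, in which only the last word is passed to the generator. `toy_generate` is a naive suffix-concatenating stand-in for the real morphological generator (an assumption for illustration), and `expand_phrase` is a hypothetical name. The phrase is assumed to be non-empty.

```python
def expand_phrase(phrase, generate_forms):
    """Return quoted phrase variants, inflecting only the last word."""
    *head, last = phrase.split()
    prefix = " ".join(head)
    expanded = []
    for form in generate_forms(last):
        words = f"{prefix} {form}" if prefix else form
        expanded.append(f'"{words}"')
    return expanded


def toy_generate(lemma):
    # Stand-in for the morphological generator: a few suffixes, naively appended.
    return [lemma + suffix for suffix in ("", "a", "ak", "ari")]


phrases = expand_phrase("etxe handi", toy_generate)
print(" OR ".join(phrases))
```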
CorpEus
Features (V): different ordering criteria. Results can be ordered by the order in which pages arrive (default), by the form of the searched word, by the context after the word or by the context before the word. They are ordered on the fly as they arrive.
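Ordering on the fly can be sketched with a sorted insert for each arriving KWiC row. The row layout `(left, form, right)`, the class name and the exact sort keys are assumptions; in particular, "context before" is read here as sorting on the preceding words, nearest first.

```python
import bisect
from itertools import count

# Sort keys per criterion, applied to a (left, form, right) row.
SORT_KEYS = {
    "arrival": lambda row: 0,  # ties are broken by arrival order below
    "form": lambda row: row[1].lower(),
    "after": lambda row: row[2].lower(),
    "before": lambda row: row[0].lower().split()[::-1],
}


class SortedKwics:
    """Keep rows sorted by one criterion while they arrive one by one."""

    def __init__(self, criterion="arrival"):
        self._key = SORT_KEYS[criterion]
        self._entries = []
        self._arrival = count()

    def add(self, row):
        # The arrival counter keeps equal keys in arrival order.
        bisect.insort(self._entries, (self._key(row), next(self._arrival), row))

    def rows(self):
        return [entry[2] for entry in self._entries]


a = ("zure", "etxeak", "berriak dira")
b = ("gure", "etxea", "zaharra da")
c = ("bi", "etxe", "erosi ditu")

ordered = {}
for criterion in SORT_KEYS:
    table = SortedKwics(criterion)
    for row in (a, b, c):  # rows arrive in this order
        table.add(row)
    ordered[criterion] = table.rows()
print(ordered["after"])
```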
CorpEus
Features (VI): analysis of the words. The possible lemmas and POSs of the forms of the searched word are shown in a floating box, with different colours: light green for a correct, unambiguous word; dark green for an unambiguous variant; light yellow for a correct but ambiguous word; dark yellow for an ambiguous variant; red for an unrecognised word.
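The colour scheme amounts to a small decision table, sketched here. The function name and its parameters are hypothetical; "ambiguous" is taken to mean more than one possible analysis.

```python
def colour_for(recognised, is_variant, n_analyses):
    """Map the analysis of a word form to the colour used in the floating box."""
    if not recognised:
        return "red"
    ambiguous = n_analyses > 1
    if is_variant:
        return "dark yellow" if ambiguous else "dark green"
    return "light yellow" if ambiguous else "light green"


print(colour_for(recognised=True, is_variant=False, n_analyses=1))  # light green
print(colour_for(recognised=True, is_variant=True, n_analyses=2))   # dark yellow
```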
CorpEus
Features (VII): count charts of word forms, possible lemma or POS, word before or after, lemma of word before or after…
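The counts behind such charts can be sketched with simple frequency tables over KWiC rows. The row layout `(left, form, right)` and the sample rows are assumptions for illustration; counts by lemma or POS would additionally need the morphological analysis, which is not reproduced here.

```python
from collections import Counter


def count_tables(rows):
    """Count word forms, the word just before and the word just after each hit."""
    forms = Counter(form.lower() for _, form, _ in rows)
    before = Counter(left.split()[-1].lower() for left, _, _ in rows if left.split())
    after = Counter(right.split()[0].lower() for _, _, right in rows if right.split())
    return forms, before, after


sample_rows = [
    ("gure", "etxea", "handia da"),
    ("zure", "etxea", "zaharra da"),
    ("gure", "etxeak", "handiak dira"),
]
forms, before, after = count_tables(sample_rows)
print(forms.most_common())
print(before.most_common(1))
```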
CorpEus
Features (VIII): many textual content file types are handled: HTML, XML, RSS, TXT, PDF, DOC, RTF, PPT, XLS… Pages are downloaded in parallel to avoid blocking.
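Parallel downloading can be sketched with a thread pool. How the tool actually fetches pages is not described here, so the fetch function is passed in as a parameter; `fake_fetch` and the example.org URLs are stand-ins that keep the example runnable without a network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Fetch pages concurrently, yielding (url, content or None) as each finishes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result()
            except Exception:
                # One failing page must not block the others.
                yield url, None


def fake_fetch(url):
    # Stand-in for a real HTTP download (assumption).
    if "broken" in url:
        raise IOError("download failed")
    return f"content of {url}"


urls = ["http://example.org/a", "http://example.org/broken", "http://example.org/b"]
pages = dict(download_all(urls, fake_fetch))
print(pages)
```

Because results are yielded as they complete, slow or failing pages do not hold up the display of occurrences from pages that have already arrived.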
CorpEus
Demo: http://www.corpeus.org