Corpus lexicography in Russia: recent trends and perspectives

Maria Khokhlova
St.Petersburg State University, Philological Faculty
khokhlova.marie@gmail.com

Prehistory of Russian Corpus Linguistics
Frequency Dictionary of Russian (L.N. Zasorina, 1977)
The text database contained about 1 million units.
Many issues that are now standard in corpus building were discussed during its compilation:
• representativeness;
• tokenization;
• lemmatization...
It can therefore be regarded as the earliest computer corpus of Russian.
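The tokenization and counting steps behind a frequency dictionary can be sketched as follows. This is only a minimal illustration, not the procedure used for the 1977 dictionary: the function names `tokenize` and `frequency_list` and the tokenization rule (runs of letters, with hyphenated words kept whole) are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical tokenization rule: a token is a run of letters,
# optionally joined by hyphens (so "кто-то" stays one token).
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def tokenize(text):
    """Lowercase the text and return its word-form tokens."""
    return TOKEN_RE.findall(text.lower())

def frequency_list(text):
    """Return (word form, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()

if __name__ == "__main__":
    sample = "Кто-то читал книгу, и книгу читал я."
    for form, count in frequency_list(sample):
        print(form, count)
```

The sketch counts word forms, not lemmas. A frequency dictionary additionally needs lemmatization, so that forms such as "книгу" are counted under the headword "книга"; that step requires a morphological analyser and is left out here.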

Prehistory of Russian Corpus Linguistics
"Computer Fund of the Russian Language"
Idea: Academician Andrey Petrovich Yershov (1931-1988)

Yershov A.P., "On methodology of constructing dialogue systems: the phenomenon of business prose" (1978)
The idea was formulated as follows:
"Any progress in the field of constructing models and algorithms will remain a purely academic exercise, unless a most important problem of creating a Computer fund of the Russian language is solved. We hope that creation of such a Computer fund by linguists, qualified for the task, will precede construction of large systems for application purposes. This would minimize labour costs and simultaneously would protect the Russian language from arbitrary and incompetent intervention."

Russian Corpora (1)
• The Uppsala Russian Corpus (1960s), the earliest corpus.
• The Tübingen Russian Corpus (University of Tübingen, 1999-2004, under the guidance of T. Berger).
• The HANCO corpus (Helsinki Annotated Corpus), Helsinki University, Slavic and Baltic Languages Department (2001-2004, A. Mustajoki, M. Kopotev). It is a small teaching corpus with morphological and syntactic annotation.

Russian Corpora (2)
Three big corpora of Russian:
• The National Corpus of Russian Language (NCRL, about 364 million words) (http://ruscorpora.ru)
• Corpora at the University of Leeds created by S. Sharoff (about 2000 million words) (http://corpus.leeds.ac.uk/ruscorpora.html)
• A corpus of Russian fiction compiled by the Automatic Text Processing initiative team (AOT), 680 million words (http://aot.ru)

Russian National Corpus (1)
Over 364 million words
Based on Yandex Search:
• search by exact form(s);
• lexico-grammatical search.
See www.yandex.ru – Advanced Search and www.ruscorpora.ru – Search in the Corpus
Additional options:
• morphological features;
• semantic features;
• metadata.

Russian National Corpus (2)
Subcorpora:
• Modern Russian corpus
• Diachronic corpus (the Church Slavonic language)
• Syntactic corpus
• Spoken corpus
• News corpus
• Parallel corpora
• Poetic corpus
• Dialect corpus
• Speech corpus
• Multimodal corpus

Dictionaries based on the Russian National Corpus
• Grammatical Dictionary of Russian Neologisms
• New Frequency Dictionary of Russian
• The Combinatory Dictionary of Russian Intensifiers
• The Verbal Combinatory Dictionary of Russian Abstract Nouns
http://dict.lang.ru

AOT (1)

AOT (2)

Russian Corpora (Leeds University, Serge Sharoff)
• Russian Reference Corpus
• Russian Reference Corpus, another version
• Russian Fiction (disambiguated)
• Russian Newspapers
• Russian Internet Corpus
• Russian National Corpus
• …

Collocations

St.Petersburg Corpus of Hagiographic Texts
• Biographies of saints and holy people
• 50 manuscripts
• 500 000 tokens
http://project.phil.spbu.ru/scat/page.php?page=project

The Fundamental Digital Library of Russian Literature and Folklore
FEB-web accumulates information in text, audio, visual, and other forms on 11th-20th-century Russian literature, Russian folklore, and the history of Russian literary scholarship and folklore studies.

Conference "Corpus Linguistics"
2002, 2004, 2006, 2008, 2011, 2013 (late June)
Saint Petersburg, St.Petersburg State University, Department of Mathematical Linguistics

Thank you for your attention!
