Accessing the Czech National Corpus. Michal Křen michal.kren@ff.cuni.cz Institute of the Czech National Corpus Charles University, Prague. SLAVICORP Warszawa , 22 November 2010. Outline of the talk. The Czech National Corpus (CNC) Available CNC corpora Accessing the CNC Demonstration.
Accessing the Czech National Corpus
Michal Křen
michal.kren@ff.cuni.cz
Institute of the Czech National Corpus
Charles University, Prague
SLAVICORP
Warszawa, 22 November 2010
Outline of the talk
• The Czech National Corpus (CNC)
• Available CNC corpora
• Accessing the CNC
• Demonstration
The Czech National Corpus
• long-term project aiming (not only) at continuous mapping of contemporary Czech
• compilation, maintenance and provision of public access to various corpora:
  synchronic / diachronic
  written / spoken
  monolingual / multilingual
  balanced / not balanced (large)
  general / specialised corpora
  CNC-compiled / hosted corpora
corpus hosting is a service provided by the ICNC to institutions that compile corpora but lack the capacity and/or the appropriate know-how for
• technical processing (format conversion, unification, quality control etc.)
• public release, server maintenance etc.
Available CNC corpora
Synchronic written corpora (the SYN-series)
• the size is given in words proper (excluding numbers and punctuation)
• the balanced SYN-series corpora: cover consecutive time periods, aim to represent the written language of that period, emphasis on variability of sources
General features
• disjoint, i.e. any document can be included in only one of them
• invariable entities: once published, identical queries always give identical results
• processing differences - lemmatisation, tagging, segmentation etc.
=> super-corpus SYN: unification of all the SYN-series corpora, updated when needed, consistently re-processed with state-of-the-art versions of available tools; the total size of SYN will thus soon reach 1.3 billion words proper
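The two properties above (sizes counted in words proper, and disjoint corpora unified into one super-corpus) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not give the CNC's tokenisation rules, so a "word proper" is approximated here as any token containing at least one letter, and the corpus names and document IDs are hypothetical.

```python
from itertools import combinations


def is_word_proper(token):
    # Assumption: a token counts as a "word proper" if it contains a letter.
    # Pure numbers ("2010") and punctuation (".") are therefore excluded.
    return any(ch.isalpha() for ch in token)


def count_words_proper(tokens):
    """Corpus size in words proper for an already tokenised text."""
    return sum(1 for token in tokens if is_word_proper(token))


def are_disjoint(corpora):
    """True if no document ID occurs in more than one corpus.

    `corpora` maps a (hypothetical) corpus name to a set of document IDs.
    """
    return all(a.isdisjoint(b) for a, b in combinations(corpora.values(), 2))


def build_super_corpus(corpora):
    """Union of all document IDs; refuses input that is not disjoint."""
    if not are_disjoint(corpora):
        raise ValueError("a document appears in more than one corpus")
    return set().union(*corpora.values())


if __name__ == "__main__":
    tokens = ["V", "roce", "2010", "měl", "korpus", "100", "milionů", "slov", "."]
    print(count_words_proper(tokens))  # 6 tokens contain a letter

    corpora = {"corpus_a": {"doc1", "doc2"}, "corpus_b": {"doc3"}}
    print(sorted(build_super_corpus(corpora)))
```

Because the component corpora are disjoint, the size of the union is simply the sum of the component sizes, which is what makes a single total figure for SYN meaningful.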
Available CNC corpora
Synchronic written corpora continued - specialised and hosted corpora
Synchronic spoken corpora
Diachronic corpora
InterCorp
• aims at building a large parallel synchronic corpus covering a number of languages: bg da de en es fi fr hr hu it lt lv nl no pl pt ro ru sk sl sr sv
• Czech is the pivot language
• mostly fiction with manually corrected alignment
• supplemented by automatically aligned political commentaries published by Project Syndicate (de en es fr ru); more sources in the future (Presseurop.eu)
• lemmatisation and/or tagging where possible: bg de en es fr hu it lt nl no pl ru sk
• incremental (not invariable), its size and the number of languages are growing
• currently searchable: 49 million foreign-language words in aligned segments
• another 19 million words prepared for publication
The CNC tasks
• project administration, central coordination and funding
• central data storage, standardisation, data processing, quality assurance
• support to the coordinators of individual languages (manuals, tutorials etc.); the coordinators choose and supervise their own collaborators (mostly students)
• development of special software, mainly the search interface (Park) and the central database including text alignment and administration tools (InterText)
Credits: Alexandr Rosen, Michal Štourač, Martin Vavřín, Pavel Vondřička et al.
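To illustrate what a pivot language implies: if every foreign-language text is aligned only with its Czech counterpart, two foreign languages can still be paired through the Czech segments they share. The sketch below is an assumption about how such a pairing could be derived, not a description of how InterCorp or InterText actually stores alignments; the segment IDs and the function name are hypothetical.

```python
def align_via_pivot(src_to_pivot, tgt_to_pivot):
    """Pair source and target segments that map to the same pivot segment.

    Both arguments map a foreign segment ID to the Czech (pivot) segment ID
    it is aligned with. Returns a list of (source, target) pairs.
    """
    # Invert the target mapping: pivot segment -> all target segments on it.
    pivot_to_tgt = {}
    for tgt_seg, pivot_seg in tgt_to_pivot.items():
        pivot_to_tgt.setdefault(pivot_seg, []).append(tgt_seg)

    pairs = []
    for src_seg, pivot_seg in src_to_pivot.items():
        for tgt_seg in pivot_to_tgt.get(pivot_seg, []):
            pairs.append((src_seg, tgt_seg))
    return pairs


if __name__ == "__main__":
    # Hypothetical alignments of English and Polish segments with Czech.
    en_to_cs = {"en:1": "cs:1", "en:2": "cs:2"}
    pl_to_cs = {"pl:1": "cs:1", "pl:2": "cs:2", "pl:3": "cs:2"}
    for pair in align_via_pivot(en_to_cs, pl_to_cs):
        print(pair)
```

With a pivot, each new language needs only one alignment (against Czech) rather than one against every other language already in the corpus.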
Accessing the CNC
server
Manatee (Pavel Rychlý, FI MU Brno)
+ fast, powerful query language, GNU GPL
- no documentation
monolingual clients
Bonito 1 (Pavel Rychlý, FI MU Brno) - stand-alone application, Tcl/Tk
+ various functions, very popular, GNU GPL
- old architecture: requires installation, uses port 5016, no Unicode support
Bonito 2 or The Sketch Engine (Pavel Rychlý, FI MU Brno) - web-based, Python
+ does not require installation, http, supports Unicode
- lacks some functionality, confusing interface, unclear licensing
multilingual client
Park (Michal Štourač, ICNC) - web-based, Python
+ does not require installation, http, supports Unicode
- being developed, lacks a lot of functionality
(=> non-parallel versions of the InterCorp texts are also made accessible via Bonito 2)
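The slides do not show any queries, so the following is only a sketch of what the query language looks like in practice. It assumes the CQP-style syntax used by Manatee-based tools, in which each token is written as a bracketed set of attribute="regular expression" conditions. The attribute names used here (lemma, tag) and their values are assumptions that depend on how a particular corpus is annotated, and the helper functions are hypothetical, not part of Manatee or Bonito.

```python
def cql_token(**attrs):
    """Build one token expression, e.g. [lemma="korpus" & tag="N.*"].

    Each keyword argument is an attribute name with a regular expression
    as its value. With no arguments, "[]" (any token) is returned.
    """
    if not attrs:
        return "[]"
    conditions = " & ".join(
        # Escape double quotes so the value stays inside its quoted string.
        '{}="{}"'.format(name, value.replace('"', '\\"'))
        for name, value in attrs.items()
    )
    return "[" + conditions + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


if __name__ == "__main__":
    # A token whose tag starts with "A", followed by the lemma "korpus".
    query = cql_sequence(cql_token(tag="A.*"), cql_token(lemma="korpus"))
    print(query)
```

A query string built this way would be pasted into the client's query field; the clients listed above differ in interface, not in the query language of the underlying server.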