NooJ international Conference, Komotini, May 2010. Portability of Armenian Corpus by Nooj. Anaid Donabedian & Victoria Khurshudian, Institut National des Langues et Civilisations Orientales (INALCO), Paris
Armenian: preliminaries • an Indo-European language • right-branching • of the accusative type • typically SOV word order • predominantly agglutinative morphology
Periodization • prealphabetical period • alphabetical period (405 A.D. to the present): 1. Old Armenian or Grabar (V-XI centuries); 2. Middle Armenian (XII-XVI centuries); 3. Modern Armenian (XVII century to the present) • Modern Armenian: Western (based on the Constantinople dialect), with dialects…; Eastern (based on the Ararat dialect), with dialects…
Objective • Provide data compatibility and portability between Nooj and the Eastern Armenian National Corpus (EANC) platform
What is the Eastern Armenian National Corpus? • www.eanc.net • Corpus Technologies • Michael Daniel, Victoria Khurshudian, Dmitri Levonian, Vladimir Plungian, Alexey Polyakov, Sergey Rubakov
[Diagram: annotation pipeline] Source texts + Grammatical dictionary → PARSER (annotation algorithm) → Annotated texts
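The pipeline on this slide (source texts and a grammatical dictionary feeding a parser that outputs annotated texts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy dictionary entries, the transliterated wordforms, the tag strings, and the function names `tokenize` and `annotate` are hypothetical and are not taken from EANC or Nooj.

```python
import re
from typing import Dict, List, Tuple

Analysis = Tuple[str, str]  # (lexeme, grammatical tags)

# Toy grammatical dictionary keyed by wordform.
# Entries and tag names are illustrative only, not EANC data.
TOY_DICTIONARY: Dict[str, List[Analysis]] = {
    "tun": [("tun", "N,sg,nom")],      # 'house'
    "tner": [("tun", "N,pl,nom")],     # 'houses'
    "mardik": [("mard", "N,pl,nom")],  # 'people'
}


def tokenize(text: str) -> List[str]:
    """Split raw text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def annotate(
    text: str, dictionary: Dict[str, List[Analysis]]
) -> List[Tuple[str, List[Analysis]]]:
    """Attach every dictionary analysis to each token.

    Tokens missing from the dictionary get an empty analysis list,
    and homonymous wordforms would simply carry several analyses.
    """
    return [(tok, dictionary.get(tok, [])) for tok in tokenize(text)]


if __name__ == "__main__":
    for token, analyses in annotate("Tner, tun xyz", TOY_DICTIONARY):
        print(token, analyses)
```

Keeping all analyses per token, rather than choosing one, leaves homonymy resolution to a later step or to the search interface.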
EANC History • Moscow, Russia • March 2006: project launch • July 2007: 1st release • May 2008: 2nd release • March 2009: 3rd release
The Eastern Armenian National Corpus (EANC): • about 110 million tokens • morphological and other markup • English translations for frequent tokens • covers Standard Eastern Armenian (SEA) from the mid-19th century to the present • both written and oral discourse • full-text view for over 100 Armenian classic titles • open internet access
Written Discourse • over 106 mln. tokens • 510 authors (1841-2009) • 1039 fiction texts (including 206 translated texts) • 7858 press issues • non-fiction (scientific and other) texts
Oral Discourse (3.5 mln. tokens) • Spontaneous discourse • Polylogues • Task-oriented discourse • TV show transcripts • Movies … • The entire EANC oral corpus was recorded and transcribed by the project.
Search Functionality • Token queries • Context queries • Subcorpus selection
Search Functionality: simple token queries • lexeme search • wordform search • gram search • translation search • lexeme + gram search
Search Functionality: advanced options for token queries • case-sensitivity • punctuation marks • position in the sentence • wildcard (*) • logical functions (e.g. 'or': |) • negated features • grammatical/lexical homonymy inclusion/exclusion
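As a rough illustration of how some of these options combine (wildcard, 'or', negated features, case-sensitivity), here is a minimal sketch of a token matcher. The token layout, the parameter names, and the `matches` function are hypothetical and do not reproduce the EANC query engine. Note that `fnmatchcase` also treats `?` and `[...]` as wildcards, which goes beyond the single `*` listed on the slide.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable


def matches(
    token: Dict,
    wordform: str = "*",
    grams: Iterable[str] = (),
    not_grams: Iterable[str] = (),
    case_sensitive: bool = False,
) -> bool:
    """Return True if the token satisfies a simple query.

    wordform  -- one or more wildcard patterns joined by '|' (logical 'or')
    grams     -- grammatical features the token must carry
    not_grams -- negated features the token must not carry
    """
    form = token["form"]
    patterns = wordform.split("|")
    if not case_sensitive:
        form = form.lower()
        patterns = [p.lower() for p in patterns]
    if not any(fnmatchcase(form, p) for p in patterns):
        return False
    tags = set(token["grams"])
    return set(grams) <= tags and not (set(not_grams) & tags)


if __name__ == "__main__":
    # Illustrative token; the tag names are assumptions.
    tok = {"form": "Tner", "lexeme": "tun", "grams": ["N", "pl", "nom"]}
    print(matches(tok, wordform="tn*"))
    print(matches(tok, wordform="tun|tner", grams=["N"], not_grams=["sg"]))
```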
Search Functionality: subcorpus selection by • time • author(s) / title(s) • genres • types of texts (translated vs. original) • superposition of any of the above
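Subcorpus selection by superposed criteria amounts to applying several metadata filters at once. The sketch below assumes a hypothetical `TextMeta` record and helper names (`by_years`, `by_genre`, `originals_only`, `select`); the sample metadata is invented for the example and does not describe real EANC texts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TextMeta:
    """Hypothetical per-text metadata record."""
    title: str
    author: str
    year: int
    genre: str
    translated: bool


Filter = Callable[[TextMeta], bool]


def by_years(start: int, end: int) -> Filter:
    return lambda t: start <= t.year <= end


def by_genre(*genres: str) -> Filter:
    return lambda t: t.genre in genres


def originals_only() -> Filter:
    return lambda t: not t.translated


def select(texts: Iterable[TextMeta], *filters: Filter) -> List[TextMeta]:
    """Keep the texts that satisfy every filter (superposition of criteria)."""
    return [t for t in texts if all(f(t) for f in filters)]


if __name__ == "__main__":
    # Invented sample metadata, for illustration only.
    corpus = [
        TextMeta("Text A", "Author 1", 1890, "fiction", False),
        TextMeta("Text B", "Author 2", 1950, "fiction", True),
        TextMeta("Text C", "Author 3", 2005, "press", False),
    ]
    print(select(corpus, by_years(1850, 1960), by_genre("fiction"), originals_only()))
```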
Search Functionality: display options • context expanding • 'sort by' (time, lexeme, wordform etc.) • Latin transliteration • glossed display • KWIC (keyword in context)
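A KWIC display lines up each hit with a fixed window of context on either side. The following is a generic sketch of that idea over a pre-tokenized text, not the EANC implementation; the function name `kwic` and the `width` parameter are assumptions.

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, width: int = 2) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit.

    width is the number of tokens shown on each side of the keyword.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    for left, key, right in kwic("a b c d c e".split(), "c", width=1):
        # Right-align the left context so the keywords form a column.
        print(f"{left:>10} [{key}] {right}")
```

Widening `width` corresponds to the "context expanding" option in the list above.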
Main Current Tasks: • Make the Nooj-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure • Make the EANC and Nooj Western Armenian platforms mutually portable • Achieve full mutual coverage of Nooj and EANC capabilities (e.g. the syntactic annotation offered by Nooj)
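One way to picture the first task is as a tag-mapping step between the two annotation schemes. The sketch below is purely illustrative: the input string layout (`form,lemma,CATEGORY+feature+...`), the tag correspondences in `NOOJ_TO_EANC`, and the function `convert_analysis` are all assumptions, not the actual Nooj or EANC formats.

```python
from typing import Dict, List

# Hypothetical tag correspondences between the two schemes.
NOOJ_TO_EANC: Dict[str, str] = {
    "N": "N",
    "s": "sg",
    "p": "pl",
    "Nom": "nom",
}


def convert_analysis(analysis: str, mapping: Dict[str, str]) -> Dict[str, object]:
    """Convert 'form,lemma,CAT+feat+...' into a dict with mapped tags.

    Raises KeyError for tags that have no counterpart, so gaps in
    coverage between the two schemes surface instead of being dropped.
    """
    form, lexeme, tags = analysis.split(",", 2)
    source_tags: List[str] = tags.split("+")
    unknown = [t for t in source_tags if t not in mapping]
    if unknown:
        raise KeyError(f"no mapping for tags: {unknown}")
    return {
        "form": form,
        "lexeme": lexeme,
        "grams": [mapping[t] for t in source_tags],
    }


if __name__ == "__main__":
    print(convert_analysis("tner,tun,N+p+Nom", NOOJ_TO_EANC))
```

Failing loudly on unmapped tags is a deliberate choice here: it gives a direct list of the features that still need a counterpart before the two platforms cover each other fully.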