BUILDING BULGARIAN NooJ RESOURCES
SVETLA KOEVA, SVETLOZARA LESEVA, BORISLAV RIZOV
• The project: Automatic information extraction based on semantic relations (RILA, a bilateral co-operation programme)
• Objectives:
  • Reliable (exhaustive and precise) multilingual lexical resources for a variety of purposes, such as machine translation, information extraction and information retrieval.
• Prerequisites for carrying out such a task:
  • Large-coverage linguistic resources, such as comprehensive multilingual and monolingual dictionaries, designed according to certain criteria and stored in a format that ensures accessibility and manageability.
  • Ancillary resources, especially for disambiguation and recognition.
  • An appropriate system for storing and managing multilingual linguistic data and for implementing task-related procedures.
• Methodology:
  • Systematization and unification of the existing INTEX resources, and their conversion to the established NooJ format.
  • Expansion and enhancement of the resources, aiming at ever higher precision and recall.
  • Creation of various new resources using the experience, resources and tools developed along the first two lines.
• Conversion of the lexical resources from DELA format to the .nod format:
  • Conversion of the BGD (Bulgarian Grammar Dictionary) [1] automata underlying the DELAF dictionaries to the .flx automata description.
  • Creation of automata for the existing dictionaries of compounds, since they have been stored in DELACF format.
[1] Koeva, S. Grammar Dictionary of Bulgarian. Description of the concept of organization of the linguistic data. Bulgarian Language 6, pp. 49-58.
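One small step of such a conversion could look roughly like the sketch below. It assumes a DELAS-style lemma line of the form `lemma,CODE+features` and a NooJ-style target line of the form `lemma,POS+FLX=NAME+features`. The function name, the regular expression, the sample entries and the class number `N41` are illustrative assumptions, not taken from the BGD, and the conversion of the automata themselves is not covered.

```python
import re

# Assumed input shape (DELAS-like): "lemma,CODE+Feat1+Feat2", where CODE is an
# upper-case part-of-speech tag optionally followed by an inflection class number.
DELAS_LINE = re.compile(
    r"^(?P<lemma>[^,]+),(?P<pos>[A-Z]+)(?P<cls>\d*)(?P<feats>(?:\+[^+\s]+)*)$"
)


def delas_to_nooj(line: str) -> str:
    """Rewrite one lemma line so the inflection class becomes a +FLX= property."""
    match = DELAS_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    pos, cls = match["pos"], match["cls"]
    properties = [pos]
    if cls:
        # Hypothetical naming scheme: reuse the old code as the paradigm name.
        properties.append(f"FLX={pos}{cls}")
    # Carry over any remaining "+Feature" markers unchanged.
    properties.extend(f for f in match["feats"].split("+") if f)
    return f"{match['lemma']},{'+'.join(properties)}"


if __name__ == "__main__":
    # Sample entries are invented for illustration only.
    for entry in ["маса,N41", "bear market,N+Inv"]:
        print(delas_to_nooj(entry))
```

Entries without a class number, such as invariable transcribed terms, pass through with no `FLX` property in this sketch.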
• Conversion of the INTEX graphs into the NooJ format:
  • Preprocessing graphs:
    • Compound conjunction graphs.
    • Abbreviation and elision graphs (with possible treatment in a dictionary), etc.
  • Recognition graphs developed for tasks involving the automatic treatment of syntactic phenomena.
• Expanding the compound-word dictionaries with new entries in a systematic way, covering large and diverse areas of the lexicon's inventory of compounds.
  • Establishing the resources to be used:
    • The available specialised online dictionaries.
    • The lexical-semantic database, the Bulgarian WordNet.
  • Developing automata for the inflection types in the established format.
• Specifics:
  • Restricted paradigms for certain types of compounds (especially domain-specific terms): pluralia tantum, singularia tantum, count forms, plural endings.
  • Invariable forms, or forms not established in Bulgarian, especially ones introduced as transcriptions of mainly English terms (hedge, swap, bear market, bull market, etc.).
• Compound extraction from the above-mentioned resources (used to complement one another):
  • Extraction of thematic compound dictionaries of terms, named entities and other compound lexemes, using the semantic relations encoded in the database and exploiting inheritance.
  • Using NooJ as the environment for compound extraction: processing the obtained material with the already designed dictionaries and encoding the appropriate candidates among the unrecognized tokens.
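The idea of extracting a thematic compound dictionary through inherited semantic relations can be sketched as a traversal of the hyponymy hierarchy below a chosen domain synset, keeping every multi-word literal found on the way. The data layout (a dictionary of synset ids with `literals` and `hyponyms` keys), the ids and the toy entries are assumptions made for illustration. They do not reflect the actual Bulgarian WordNet storage format.

```python
from collections import deque
from typing import Dict, List

# Hypothetical in-memory layout: synset id -> literals and hyponym ids.
Synsets = Dict[str, Dict[str, List[str]]]


def collect_compounds(synsets: Synsets, root: str) -> List[str]:
    """Breadth-first walk over hyponyms of `root`, returning multi-word literals."""
    seen = {root}
    queue = deque([root])
    compounds: List[str] = []
    while queue:
        synset_id = queue.popleft()
        node = synsets[synset_id]
        for literal in node["literals"]:
            if " " in literal:  # crude test for a compound: contains a space
                compounds.append(literal)
        for child in node["hyponyms"]:
            if child not in seen:  # guard against revisiting shared hyponyms
                seen.add(child)
                queue.append(child)
    return compounds


if __name__ == "__main__":
    toy = {
        "s1": {"literals": ["market"], "hyponyms": ["s2", "s3"]},
        "s2": {"literals": ["bear market"], "hyponyms": []},
        "s3": {"literals": ["bull market"], "hyponyms": []},
        "s4": {"literals": ["small dog"], "hyponyms": []},
    }
    print(collect_compounds(toy, "s1"))
```

Every compound reached this way inherits the thematic domain of the root synset, so `small dog` in the toy data is left out of the `market` dictionary.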
• Dictionary generation enhancement
  • Exploring large databases and identifying the inflection types of the different head words using the existing automata:
    • Using chiefly the Bulgarian WordNet, where the head words of compounds are marked unambiguously.
    • Using simple syntactic grammars (identifying NPs) to spot head words in the available domain-specific dictionaries of concepts and terms, which are more comprehensive in their coverage of inflection types.
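A very simple version of head-word spotting might look like the sketch below. It is a stand-in heuristic written in Python, not a NooJ syntactic grammar, and it assumes the compound has already been tagged with coarse part-of-speech labels (`A`, `N`, `PREP` are an assumed tag set). The working assumption is that in NP patterns such as adjective + noun or noun + preposition + noun the head is the first noun.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]  # (word form, coarse POS tag); the tag set is assumed


def head_word(np_tokens: List[Tagged]) -> Optional[str]:
    """Return the first noun of the phrase, or None if it contains no noun."""
    for form, tag in np_tokens:
        if tag == "N":
            return form
    return None


if __name__ == "__main__":
    # A N pattern: "ценни книжа" (securities)
    print(head_word([("ценни", "A"), ("книжа", "N")]))
    # N PREP N pattern: "пазар на труда" (labour market)
    print(head_word([("пазар", "N"), ("на", "PREP"), ("труда", "N")]))
```

The heuristic would pick the wrong word for transcribed English compounds such as bear market, where the head is the last noun, so such entries would need separate handling.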
• Recognition enhancement
  • Development of morphological grammars covering certain classes of words not currently present in any dictionary, provided the source words are in the dictionary:
    • Personal feminine nouns: приятел (friend) - приятелка (girlfriend).
    • Diminutive nouns: детенце (a small child), кученце (a small dog), etc.
    • Verbal nouns, etc.
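The condition "provided the source words are in the dictionary" can be illustrated with a small suffix-stripping sketch: an unknown token is accepted as a derived form only if removing a derivational suffix yields a known lemma. This is a Python approximation, not a NooJ morphological grammar. The two rules cover only the examples on this slide, and the rule table, labels and function name are assumptions. Real derivation also involves stem alternations that this sketch ignores.

```python
from typing import NamedTuple, Optional, Set


class Analysis(NamedTuple):
    source: str  # dictionary lemma the token is derived from
    label: str   # kind of derivation (labels are illustrative)


# (suffix, label) pairs; only the two patterns from the examples are covered.
RULES = [
    ("ка", "feminine"),     # приятел -> приятелка
    ("нце", "diminutive"),  # дете -> детенце, куче -> кученце
]


def analyse(token: str, lexicon: Set[str]) -> Optional[Analysis]:
    """Return the source lemma and derivation type, or None if no rule applies."""
    for suffix, label in RULES:
        if token.endswith(suffix) and len(token) > len(suffix):
            stem = token[: -len(suffix)]
            if stem in lexicon:  # accept only if the source word is known
                return Analysis(stem, label)
    return None


if __name__ == "__main__":
    lexicon = {"приятел", "дете", "куче"}
    for word in ["приятелка", "детенце", "кученце", "масичка"]:
        print(word, analyse(word, lexicon))
```

Checking the stem against the lexicon keeps the rules from over-generating: a token that merely happens to end in one of the suffixes is rejected when its stem is not a known lemma.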
• Present-day and future directions:
  • Information retrieval, machine translation, etc.
  • Facilitating linguistic tasks by supplying the prerequisites (large resources as input data) for exploring linguistic phenomena and validating linguistic hypotheses on language material.
  • Education: facilitating the acquisition of knowledge and skills in NLP.