STO: A Lexical Database of Danish for Language Technology Applications
Anna Braasch
Center for Sprogteknologi, Copenhagen
SPINN Seminar, October 27, 2001

Background
• EU-funded international projects
  • EAGLES: recommendations for morphological and syntactic specifications for 9 languages
  • GENELEX: development of a generic lexicon model
  • PAROLE: development of harmonized WL resources (lexicon, corpus) for 12 languages
  • SIMPLE: development of an ontology and model of semantic description for 12 languages
• Follow-up
  • Danish, nationally funded co-operative lexicon project: STO

Aims of the project
• Monolingual aim
  • to eliminate the usual ’bottleneck problem’: lack of a large-size Danish lexical database for
    • language technology applications
    • computational language research purposes
• Multilingual aim
  • to provide an elaborated Danish lexical database for
    • linked bi- or multilingual databases for LT/NLP applications
    • contrastive CL and lexicology research …

STO development objectives
• Requirements of monolingual applications
  • tailor the linguistic specifications for Danish
  • add more language-specific features
  • extend the linguistic and lexical coverage
  • refine the lexicon structure
  • develop customized, user-friendly interfaces...
• but also requirements of multilingual linking
  • keep the basic, harmonised lexicon structure
  • keep the principles and language of lexical description
  • be attentive to similar follow-up projects
→ ’more Danish’ but still consistent with the other lexicons

The three linguistic layers of description
• Main info types: 3 independent but linked layers
  • Morphology
    • Inflection (pattern-based)
    • Spelling
    • Compounding
  • Syntax (totally pattern-based)
    • Syntactic frame (complementation structures & functional properties, etc.)
    • Control, raising (constructional properties)
  • Semantics (the layer of multilingual linking)
    • Domain (= sublanguage, source area)
    • Semantic relations (qualia)
    • Specification of meaning (SIMPLE model + core ontology)
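
The layered description above can be pictured as three separate tables whose records point at each other. The following is only a minimal sketch under stated assumptions, not the actual STO schema: the class names, identifiers, pattern labels and the chain of links (morphology to syntax to semantics) are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MorphUnit:
    lemma: str
    pos: str
    inflection_pattern: str          # hypothetical pattern identifier
    syntax_ids: list = field(default_factory=list)


@dataclass
class SyntaxUnit:
    frame_pattern: str               # hypothetical syntactic frame label
    semantics_ids: list = field(default_factory=list)


@dataclass
class SemanticUnit:
    ontology_type: str               # hypothetical ontology node
    domain: str = "general"
    qualia: dict = field(default_factory=dict)


# Each layer lives in its own table; only identifiers connect the layers.
morphology = {"m:hus": MorphUnit("hus", "noun", "N-neuter-1", ["syn:hus_1"])}
syntax = {"syn:hus_1": SyntaxUnit("noun-no-complement", ["sem:hus_1"])}
semantics = {"sem:hus_1": SemanticUnit("Building", qualia={"telic": "bo"})}


def meanings_of(morph_id):
    """Follow the links morphology -> syntax -> semantics for one entry."""
    result = []
    for syn_id in morphology[morph_id].syntax_ids:
        for sem_id in syntax[syn_id].semantics_ids:
            result.append(semantics[sem_id])
    return result
```

Because the layers share nothing but identifiers, a record in one table can be revised without touching the others, which is one way to read "independent but linked".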

Between syntax and semantics
• No clear-cut borderline: difficult to represent mutual dependencies in a strictly modular description.
→ Syntactic or semantic units?
  • Collocations: combine features of complex structure, (morpho)syntactic constraints and slightly restricted compositionality (meaning transparency); strong subcategorisation and selectional restrictions ...
  • Phrasal verbs: combine features of complex syntactic structure and compositional/non-compositional semantics …
→ Different representation strategies: ’early’ vs. ’late’

Linking lexicons at the semantic level
• Basic method:
  • link between L1-meaning and L2-meaning
• Basic requirement:
  • harmonized semantics (ontology, model & method)
• Advantages:
  • proper treatment of all lexical units, including
    • homonyms
    • polysemes
    • complex lexical units (collocations, idioms)
  • independent treatment of L1 and L2 with respect to morphology and syntax
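
The linking method above can be sketched as follows. This is an illustrative toy, not the project's implementation: the `Meaning` class, the sense identifiers and the ontology labels are assumptions. A link is made between two meanings, never between two word forms, and it is accepted only if both meanings sit on the same node of the shared (harmonized) ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meaning:
    sense_id: str        # hypothetical identifier
    lemma: str
    language: str
    ontology_type: str   # node in the shared ontology (illustrative label)


def link(m1, m2):
    """Link an L1 meaning to an L2 meaning if the shared ontology allows it."""
    if m1.language == m2.language:
        raise ValueError("a link must connect two different languages")
    if m1.ontology_type != m2.ontology_type:
        raise ValueError("meanings are not on the same ontology node")
    return (m1.sense_id, m2.sense_id)


# The Danish homonym 'fyr' is split into separate meanings,
# and each meaning is linked on its own.
fyr_person = Meaning("da:fyr_1", "fyr", "da", "Human")
fyr_tree = Meaning("da:fyr_2", "fyr", "da", "Plant")
guy = Meaning("en:guy_1", "guy", "en", "Human")
pine = Meaning("en:pine_1", "pine", "en", "Plant")

links = [link(fyr_person, guy), link(fyr_tree, pine)]
```

Nothing in `link` looks at inflection or syntactic frames, which reflects the last advantage listed: each language keeps its own morphology and syntax.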

About the STO lexical database (V.1)
• Point of departure: PAROLE material
  • linguistic specifications elaborated (incl. Danish)
  • modular lexicon architecture developed
  • information structure developed
  • 20,000 general language lexicon entries encoded
• Main STO development steps:
  • tailor and refine the LingSpecs for Danish
  • improve the information structure (DB)
  • add new entry types (complex lexical units, etc.)
  • extend the vocabulary to 50,000 entries (~35,000 GL and ~15,000 LSP from 6-8 domains)

Progress report for 2001 (1)
• New status: nationally funded co-operative project, requiring
  • more thorough project planning (incl. ’logistics’)
  • more detailed information (guidelines, specifications, cross-checks, evaluation…)
• Continuously ongoing supporting processes
  • Updating and refinement of LingSpecs
  • Elaboration of an Encoding Manual
  • Elaboration of various additional documentation (evaluation sheets, etc.)
  • Revision of the database/info structure

Progress report for 2001 (2)
• New supporting tools for lexicographers developed
  • Encoding tools for morphological and syntactic info
  • Browsers for retrieval of encoded info...
• Number of entries encoded with
  • morphological information: ~50,000
  • syntactic information: ~23,000
  • semantic information: ~8,500 (from SIMPLE)
• Other tasks (ongoing/finished)
  • selected entries downloaded (at a customer’s request)
  • work on principles of statistically based selection of lemmas and syntactic constructions to be encoded
  • corpus-related work

Progress report for 2001 (3)
• Treatment of new entry types
  • domain-specific (LSP) entries
  • compounds (decomposition and linking elements implemented)
  • geographical proper nouns (inflectional and agreement properties investigated; the results are implemented)
  • collocations (information structure designed)
  • revision of the treatment of phrasal verbs
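
To make the compound item above concrete, here is a small sketch of decomposition with linking elements. It is not the STO procedure, which the slides do not describe: the function name, the tiny word list and the restriction to two-part compounds with the linking elements -s- and -e- are all assumptions made for illustration.

```python
# Toy lexicon of simple Danish nouns (illustrative only).
LEXICON = {"land", "by", "hus", "båd", "hest", "vogn"}

# Candidate linking elements; the empty string means "no linking element".
LINKERS = ("", "s", "e")


def split_compound(word, lexicon=LEXICON, linkers=LINKERS):
    """Return all (first, linker, second) analyses of a two-part compound."""
    analyses = []
    for i in range(1, len(word)):
        first, rest = word[:i], word[i:]
        if first not in lexicon:
            continue
        for linker in linkers:
            # The remainder must start with the linker and end in a known word.
            if rest.startswith(linker) and rest[len(linker):] in lexicon:
                analyses.append((first, linker, rest[len(linker):]))
    return analyses
```

With this word list, `split_compound("landsby")` yields the analysis land + s + by, `split_compound("hestevogn")` yields hest + e + vogn, and `split_compound("husbåd")` yields hus + båd with no linking element.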

Summing up the goals
• STO will
  • conform to ’general’ linguistic knowledge
  • meet the demands of a broad application and research area (size, selection of domains and vocabulary, detail of linguistic description…)
  • satisfy monolingual, language-specific requirements
  • be potentially compatible with other lexical databases for future linking
  • be reasonably easy to access, customize/use...
  • fulfil the development contract and meet the production deadlines