Infrastructural Language Resources & Standards for Multilingual Computational Lexicons

Nicoletta Calzolari … with many others, Istituto di Linguistica Computazionale - CNR - Pisa, glottolo@ilc.cnr.it


Presentation Transcript


  1. Infrastructural Language Resources & Standards for Multilingual Computational Lexicons
     Nicoletta Calzolari … with many others
     Istituto di Linguistica Computazionale - CNR - Pisa
     glottolo@ilc.cnr.it
     Pisa, September 2004

  2. The ENABLER Mission
     • Language Resources (LRs) & Evaluation: central component of the “linguistic infrastructure”
     • LRs supported by national funding in National Projects
     • Availability of LRs is also a “sensitive” issue, touching the sphere of linguistic and cultural identity, but also with economic and political implications
     The ENABLER Network of National initiatives aims at “enabling” the realisation of a cooperative framework:
     • formulate a common agenda of medium- & long-term research priorities
     • contribute to the definition of an overall framework for the provision of LRs

  3. towards …
     Only by
     • combining the strengths of different initiatives & communities
     • making the best use of the ‘modus operandi’ of the national funding authorities in different national situations
     • responding to/anticipating the needs and priorities of R&D & industrial communities
     • promoting the adoption of [de facto] standards and best practices
     • keeping a clear distinction of tasks & roles for different actors
     can we produce the synergies, economy of scale, convergence & critical mass necessary to provide the infrastructural LRs needed to realise the full potential of a multilingual global information society

  4. Lexicon and Corpus: a multi-faceted interaction
     • L → C tagging
     • C → L frequencies (of different linguistic “objects”)
     • C → L proper nouns, acronyms, …
     • L → C parsing, chunking, …
     • C → L training of parsers
     • C → L lexicon updating
     • C → L “collocational” data (MWE, idioms, gram. patterns ...)
     • C → L “nuances” of meanings & semantic clustering
     • C → L acquisition of lexical (syntactic/semantic) knowledge
     • L → C semantic tagging/word-sense disambiguation (e.g. in Senseval)
     • C → L more semantic information on LE
     • C → L corpus-based computational lexicography
     • C → L validation of lexical models
     • C → L …
     • L → C ...
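Three of the interactions listed above (L → C tagging, C → L frequencies, C → L lexicon updating) can be illustrated with a minimal sketch. The toy lexicon, the toy corpus and all function names are hypothetical and chosen only for illustration; real taggers and acquisition systems are far richer.

```python
from collections import Counter

# Hypothetical toy lexicon: word form -> part of speech.
lexicon = {"the": "DET", "dog": "NOUN", "barks": "VERB", "cat": "NOUN"}

# Hypothetical toy corpus, already tokenised.
corpus = ["the", "dog", "barks", "the", "cat", "sleeps"]


def tag(tokens, lex):
    """L -> C: annotate each token with the lexicon's POS, or 'UNK' if absent."""
    return [(tok, lex.get(tok, "UNK")) for tok in tokens]


def frequencies(tokens):
    """C -> L: count how often each word form occurs in the corpus."""
    return Counter(tokens)


def unknown_forms(tagged):
    """C -> L: forms missing from the lexicon, i.e. candidates for updating it."""
    return sorted({tok for tok, pos in tagged if pos == "UNK"})


tagged = tag(corpus, lexicon)
freqs = frequencies(corpus)
candidates = unknown_forms(tagged)
```

The lexicon drives the annotation of the corpus, and the annotated corpus in turn yields counts and gaps that feed back into the lexicon.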

  5. ... Language as a “Continuum”
     Lexicon & Corpus as two viewpoints on the same linguistic object … even more so in a multilingual context
     Interesting - and intriguing - aspects of corpus use:
     • impossibility of descriptions based on a clear-cut boundary between what is admitted and what is not
     • in actual usage, language displays a large number of properties that behave as a continuum, not as properties of a “yes/no” type
     • the same is true for the so-called “rules”: corpus evidence shows a “tendency” towards rules rather than precise rules
     • it is difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary
     BUT

  6. Extraction from texts vs. formal representation in lexicons
     • It is difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary
     • The rigour and lack of flexibility of formal representation languages cause difficulties when mapping natural-language word meaning, ambiguous and flexible by nature, into them
     • No clear-cut boundary when analysing many phenomena: it is more a continuum
     • The same impression arises when looking at examples of types of alternations:
       • no clear-cut classes across languages
       • or within one language

  7. Correlation between different levels of linguistic description in the design of a lexical entry
     To understand word meaning:
     • Focus on the correlation between syntactic and semantic aspects
     • But other linguistic levels - such as morphology, morphosyntax, lexical cooccurrence, collocational data, etc. - are closely interrelated/involved
     • These relations must be captured when accounting for meaning discrimination
     • The complexity of these interrelationships is what makes semantic disambiguation such a hard task in NLP
     • Textual corpora as a device to discover and reveal the intricacy of these relationships
     • Frame/SIMPLE semantics as a device to unravel and disentangle the complex situation into elementary and computationally manageable pieces

  8. towards Corpus-based Semantic Lexicons … at least in principle
     • both in the design of the model, &
     • in the building of the lexicon (at least partially)
     • with (semi-)automatic means
     Design of the lexical entry with a combined approach:
     • theoretical: e.g. Fillmore Frame Semantics / Pustejovsky Generative Lexicon, …
     • empirical: corpus evidence
     • even if there are not always sound and explicit criteria for classification according to “frame elements”/qualia relations/...

  9. Infrastructure of Language Resources ...
     ... static
     • Semantic networks: Euro-/ItalWordNet
     • Lexicons: PAROLE/SIMPLE/CLIPS
     • TreeBanks
     International Standards
     But … they will never be “complete”
     ... dynamic
     • Lexical acquisition systems (syntactic & semantic) from corpora
     • Infrastructure of tools
     • Robust morphosyntactic & syntactic analysers
     • Word-sense disambiguation systems
     • Sense classifiers
     • ...

  10. ItalWordNet Semantic Network [Italian module of EuroWordNet]
      • ~50,000 lemmas organized in synonym groups (synsets), structured in hierarchies & linked by ~130,000 semantic relations
        • ~50,000 hyperonymy/hyponymy relations
        • ~16,000 relations among different POS (role, cause, derivation, etc.)
        • ~2,000 part-whole relations
        • ~1,500 antonymy relations, … etc.
      • Synsets linked to the InterLingual Index (ILI = Princeton WordNet)
      • Through the ILI, linked to all the European WordNets (de facto standard)
      • & to the common Top Ontology
      • Possibility of plugging in domain terminological lexicons (legal, maritime)
      • Usable in IR, CLIR, IE, QA, ...

  11. EuroWordNet Multilingual Data Structure

  12. Synsets linked by Semantic Relations in ItalWordNet
      Central synset: {casa, abitazione, dimora} (English: home, domicile, ..; house)
      TOP Concepts: Object, Artifact, Building
      Hyperonym: {edificio, ..}
      Hyponym: {villetta} {catapecchia, bicocca, ..} {cottage} {bungalow}
      Role_location: {stare, abitare, ...}
      Role_target_direction: {rincasare}
      Role_patient: {affitto, locazione}
      Mero_part: {vestibolo} {stanza}
      Holo_part: {casale} {frazione} {caseggiato}
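The synset-and-relation structure on this slide can be sketched as a small labelled graph. This is only an illustration under stated assumptions: the class `SynsetNetwork` and its methods are hypothetical and do not reproduce the actual ItalWordNet data format; the synsets and relation names are the ones shown on the slide.

```python
from collections import defaultdict


class SynsetNetwork:
    """Stores labelled relations between synsets (sets of synonymous lemmas)."""

    def __init__(self):
        # source synset -> relation name -> list of target synsets
        self._relations = defaultdict(lambda: defaultdict(list))

    def add(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self._relations[frozenset(source)][relation].append(frozenset(target))

    def related(self, source, relation):
        """Return the synsets linked to `source` by `relation` (possibly empty)."""
        by_relation = self._relations.get(frozenset(source), {})
        return list(by_relation.get(relation, []))


casa = {"casa", "abitazione", "dimora"}
net = SynsetNetwork()
net.add(casa, "hyperonym", {"edificio"})
for hyponym in ({"villetta"}, {"catapecchia", "bicocca"}, {"cottage"}, {"bungalow"}):
    net.add(casa, "hyponym", hyponym)
net.add(casa, "role_location", {"stare", "abitare"})
net.add(casa, "mero_part", {"vestibolo"})
net.add(casa, "mero_part", {"stanza"})

hyponyms = net.related(casa, "hyponym")
```

Using frozen sets as keys makes a synset identifiable by its members regardless of the order in which the lemmas are listed.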

  13. Jur-WordNet
      With ITTG-CNR (Istituto di Teoria e Tecniche dell’informazione Giuridica)
      • Jur-WordNet → Extension for the juridical domain of ItalWordNet
      • Knowledge base for multilingual access to sources of legal information
      • Source of metadata for semantic mark-up of legal texts
      • To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.

  14. Terminological Lexicon of Navigation & Sea Transportation → Nolo
      • Synsets: 1,614
      • Lemmas: 2,116
      • Senses: 2,232
      • Nouns: 1,621
      • Verbs: 205
      • Adjectives: 35
      • Proper Nouns: 236

  15. PAROLE/SIMPLE: 12 harmonised computational lexicons
      http://www.ilc.cnr.it/clips/
      • PAROLE Corpus
      • PAROLE Italian Syntactic Lexicon (’96-’98), SGML: morphology: 20,000 entries; syntax: 20,000 words
      • SIMPLE Italian Semantic Lexicon (’98-2000), SGML: semantics: 10,000 senses
      • CLIPS (2000-2004), XML: phonology, morphology, syntax: 55,000 words; semantics: 55,000 senses

  16. machine language learning

  17. linguistic learning / machine language learning
      • development of conceptual networks
      • linguistic change models
      • language usage models
      • adaptive classification systems
      • information extraction
      • bootstrapping of lexical information
      • bootstrapping of grammars

  18. Architecture for linguistic knowledge acquisition ...
      [Diagram; labels: structured knowledge, terminology, unstructured text data, annotation tools, LKG, annotated data, lexica, lexicon model, user needs, machine learning for linguistic knowledge acquisition, cross-lingual information retrieval, multi-lingual information extraction, multi-lingual text mining]
      … towards “dynamic” lexicons, able to auto-enrich

  19. Harmonisation: More & more Need of a Global View for Global Interoperability
      Integration/sharing of data & software/tools
      • Need of compatibility among various components
      • An “exemplary cycle”: Formalisms, Grammars, Software (Taggers, Chunkers, Parsers, …), Representation, Annotation, Lexicon, Corpora, Terminology, Software (Acquisition Systems), I/O Interfaces, Languages

  20. A short guide to ISLE/EAGLES
      http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm
      Multilingual Computational Lexicon Working Group

  21. Target: … the Multilingual ISLE Lexical Entry (MILE)
      • General methodological principles (from EAGLES):
        • high granularity: factor out the (maximal) set of primitive units of lexical info (basic notions) with the highest degree of inter-theoretical agreement
        • modular and layered: various degrees of specification possible
        • explicit representation of info
        • allow for underspecification (& hierarchical structure)
      • leading principle: edited union of existing lexicons/models (redundancy is not a problem)
      • open to different paradigms of multilinguality
      • oriented to the creation of large-scale & distributed lexicons

  22. Paths to Discover the Basic Notions of MILE
      Goal: a list of critical information types that will compose each module of the MILE
      • clues in dictionaries to decide on target equivalent
      • guidelines for lexicographers
      • clues (to disambiguate/translate) in corpus concordances
      • lexical requirements from various types of transfer conditions & actions in MT systems
      • lexical requirements from interlingua-based systems
      • …

  23. Designing MILE
      Steps towards MILE:
      • Creating entries (Bertagna, Reeves, Bouillon)
      • Identifying the MILE Basic Notions (Bertagna, Monachini, Atkins, Bouillon)
      • Defining the MILE Lexical Model (Lenci, Calzolari, etc.)
      • Formalising MILE (Ide)
      • Development of the ISLE Lexical Tool (Bel)
      • ISLE & spoken language & multimodality (Gibbon)
      • Metadata for the lexicon (Peters, Wittenburg)
      • A case study: MWEs in MILE (Quochi, Lenci, Calzolari)
      • the MILE Basic Notions
      • the MILE Lexical Model

  24. The MILE Basic Notions (the EAGLES/ISLE CLWG)
      • Basic lexical dimensions & info-types relevant to establish multilingual links
      • Typology of lexical multilingual correspondences (relevant conditions & actions)
      Identified by:
      • creating sample multilingual lexical entries (Bertagna, Reeves)
      • investigating the use of sense indicators in traditional bilingual dictionaries (Atkins, Bouillon)
      • …

  25. The MILE Lexical Classes - Data Categories for Content Interoperability
      Francesca Bertagna*, Alessandro Lenci°, Monica Monachini*, Nicoletta Calzolari*
      * ILC-CNR - Pisa
      ° Pisa University

  26. Overview
      • MILE Lexical Model with Lexical Objects and Data Categories
      • Mapping of existing lexicons onto MILE
      • RDF schema and DC Registry for some pre-instantiated lexical objects, together with a sample entry from the PAROLE-SIMPLE lexicons in MILE
      • Future …

  27. The MILE Lexical Model (Computational Lexicon Working Group)
      • Guidelines (syntactic, semantic lexicons)
      • GENELEX Model
      • PAROLE-SIMPLE Lexicons
      • Multilingual Lexicons (EuroWordNet, etc.)
      … where after? MILE Lexical Model

  28. The MILE Main Features
      • A general architecture devised as a common representational layer for multilingual Computational Lexicons
      • both for hand-coded and corpus-driven lexical data
      Key features:
      • Modularity
      • Granularity
      • Extensibility and “openness” - User-adaptability
      • Resource Sharing
      • Content Interoperability
      • Reusability
      Semantic Web technologies & standards applied to lexicon modelling

  29. The MILE Lexical Model (MLM)
      • The MLM core is the Multilingual ISLE Lexical Entry (MILE)
        • a general schema for multilingual lexical resources
        • a lexical meta-entry as a common representational layer for multilingual lexicons
      • Computational lexicons can be viewed as different instances of the MILE schema
      [Diagram: MILE Lexical Model instantiated as lexicon#1, lexicon#2, lexicon#3]

  30. MILE: the building-block model
      • The MILE architecture is designed according to the building-block model:
        • Lexical entries are obtained by combining various types of lexical objects (atomic and complex)
      • Users design their lexicon by:
        • selecting and/or specifying the relevant lexical objects
        • combining the lexical objects into lexical entries
      • Lexical objects may be shared:
        • within the same lexicon (intra-lexicon reusability)
        • among different lexicons (inter-lexicon reusability)
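The building-block idea, where entries are assembled from lexical objects that can be shared, might be sketched as follows. The names `LexicalObject`, `LexicalEntry` and `ObjectRegistry` are hypothetical and not part of MILE; the sketch only shows how a single stored object can be reused by several entries.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalObject:
    """An atomic building block, e.g. a semantic feature or a slot."""
    kind: str   # e.g. "SemFeature", "SynFeature", "Slot"
    value: str


@dataclass
class LexicalEntry:
    """A lexical entry obtained by combining lexical objects."""
    lemma: str
    objects: list = field(default_factory=list)


class ObjectRegistry:
    """Stores each lexical object once so that entries can share it."""

    def __init__(self):
        self._objects = {}

    def get(self, kind, value):
        key = (kind, value)
        if key not in self._objects:
            self._objects[key] = LexicalObject(kind, value)
        return self._objects[key]


registry = ObjectRegistry()

# Two entries of one (hypothetical) lexicon reuse the same feature object.
teacher = LexicalEntry("teacher", [registry.get("SemFeature", "+HUMAN")])
student = LexicalEntry("student", [registry.get("SemFeature", "+HUMAN")])
shared = teacher.objects[0] is student.objects[0]
```

If two lexicons draw on the same registry, the same mechanism gives inter-lexicon reuse; with one registry per lexicon it gives intra-lexicon reuse only.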

  31. MILE: the building-block model
      [Diagram: Lexical entry 1, Lexical entry 2 and Lexical entry 3 built from shared Lexical Objects: Sem feature, Syn feature, syntactic frame, slot, phrase]

  32. Modularity in MILE
      Horizontal organization, where independent but interlinked modules allow different dimensions of lexical entries to be expressed
      Multiple levels of modularity:
      • mono-MILE: morphological layer, syntactic layer, semantic layer, linking conditions
      • multi-MILE: multilingual correspondence conditions

  33. The Mono-MILE
      Each monolingual layer within Mono-MILE identifies a basic unit of lexical description:
      • semantic layer, SemU: basic unit to describe the semantic properties of the MU
      • syntactic layer, SynU: basic unit to describe the syntactic behaviour of the MU
      • morphological layer, MU: basic unit to describe the inflectional and derivational morphological properties of the word

  34. The Mono-MILE
      Within each layer, a basic linguistic information unit is identified
      [Diagram: one MU linked to several SynUs, which are in turn linked to several SemUs]
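The three basic units can be sketched as nested records, with one MU pointing to its SynUs and each SynU to its SemUs. This is a simplification under stated assumptions: in the model itself SynUs and SemUs are related through correspondence objects rather than direct references, and the field names below are hypothetical. The identifiers reuse the migliorare labels that appear on a later slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemU:
    """Semantic layer: semantic properties of the MU."""
    ident: str


@dataclass
class SynU:
    """Syntactic layer: syntactic behaviour of the MU."""
    ident: str
    frames: List[str] = field(default_factory=list)
    sem_units: List[SemU] = field(default_factory=list)


@dataclass
class MU:
    """Morphological layer: inflectional and derivational properties."""
    lemma: str
    pos: str
    syn_units: List[SynU] = field(default_factory=list)

    def sem_units(self):
        """All SemUs reachable through the SynUs, without duplicates."""
        seen, result = set(), []
        for syn in self.syn_units:
            for sem in syn.sem_units:
                if sem.ident not in seen:
                    seen.add(sem.ident)
                    result.append(sem)
        return result


sem1 = SemU("SemU1_migliorare")
sem2 = SemU("SemU2_migliorare")
syn = SynU("SynU_migliorare", ["intransitive", "transitive"], [sem1, sem2])
mu = MU("migliorare", "verb", [syn])
reachable = [s.ident for s in mu.sem_units()]
```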

  35. Granularity in MILE
      • Concerns the vertical dimension: within a given lexical layer, varying degrees of depth of lexical description are allowed, from shallow to deep lexical representations

  36. Defining the MLM
      • The MLM is designed as an E-R model (MILE Entry Schema)
        • defines the lexical objects and the ways they can be combined into a lexical entry
      • The MLM includes 3 types of lexical objects:
        • MILE Lexical Classes (MLC)
        • MILE Lexical Data Categories (MDC)
        • MILE Lexical Operations (MLO)

  37. The MILE Lexical Objects
      • Within each layer, basic lexical notions are represented by lexical objects:
        • MILE Lexical Classes (MLC)
        • MILE Data Categories (MDC)
        • Lexical operations
      • They form an ontology of lexical objects, as an abstraction over different lexical models and architectures

  38. The MILE E-R diagrams
      • The lexical objects are described with E-R diagrams, which define them and the ways they can be combined into a lexical entry

  39. MILE Lexical Objects: Syntactic Layer
      MLC:SynU
      • hasSyntacticFrame → MLC:SyntacticFrame (1..*)
      • hasFrameSet → MLC:FrameSet (*)
      • composedby → MLC:Composition (*)
      • correspondTo → MLC:SemU (*), via MLC:CorrespSynUSemU

  40. … expanding one node …
      [Diagram expanding the SynU node; labels: SynU, SyntacticFrame, Construction, Self, Slot, Slot, Function, Phrase]

  41. MILE Lexical Objects: Semantic Layer
      MLC:SemU
      • belongsToSynset → MLC:Synset (*)
      • hasSemFrame → MLC:SemanticFrame (0..1)
      • hasSemFeature → MLC:SemanticFeature (*)
      • hasCollocation → MLC:Collocation (*)
      • semanticRelation → MLC:SemU (*), via MLC:SemanticRelation
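The cardinalities in the syntactic-layer and semantic-layer diagrams (1..* for hasSyntacticFrame, 0..1 for hasSemFrame, * elsewhere) can be made concrete with a minimal sketch. The Python field names and the `cardinality_errors` helper are hypothetical; the actual MILE schema is expressed with E-R diagrams and RDF, not with these classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SynU:
    ident: str
    syntactic_frames: List[str] = field(default_factory=list)   # 1..*
    frame_sets: List[str] = field(default_factory=list)         # *


@dataclass
class SemU:
    ident: str
    semantic_frame: Optional[str] = None                         # 0..1
    synsets: List[str] = field(default_factory=list)             # *
    semantic_features: List[str] = field(default_factory=list)   # *
    collocations: List[str] = field(default_factory=list)        # *


def cardinality_errors(unit):
    """Report violations of the lower bounds stated in the diagrams."""
    errors = []
    if isinstance(unit, SynU) and not unit.syntactic_frames:
        errors.append(f"{unit.ident}: hasSyntacticFrame needs at least one frame")
    return errors


incomplete = cardinality_errors(SynU("SynU#1"))
complete = cardinality_errors(SynU("SynU#1", ["subj_NP obj_NP obl_PP_to"]))
bare_semu = cardinality_errors(SemU("SemU#1"))
```

The 0..1 and * relations need no run-time check here: an optional field and plain lists already allow exactly those numbers of values.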

  42. MILE Lexical Objects: Synt-Sem Linking
      MLC:CorrespSynUSemU
      • hasSourceSynu → MLC:SynU (1)
      • hasTargetSemu → MLC:SemU (1)
      • hasPredicativeCorresp → MLC:PredicativeCorresp (1)
      • IncludesSlotArgCorresp → MLC:SlotArgCorresp (0..*)

  43. Syntax-Semantics Linking
      • SynU: Frame with Slot0, Slot1
      • SemU: Predicate with Arg_0, Arg_1
      • CorrespSynUSemU: PredCorresp with Slot0:Arg1, Slot1:Arg0, plus filters & conditions

  44. Syntax-Semantics Linking
      • SynU#1 (“John gave the book to Mary”): subj_NP, obj_NP, obl_PP_to
      • SynU#2 (“John gave Mary the book”): subj_NP, obj_NP, obj_NP
      • SemU#1: Semantic_Frame:GIVE with Arg1 Agent, Arg2 Theme, Arg3 Goal
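The way two syntactic units can be linked to one semantic frame may be sketched as two slot-to-argument tables. The frame, roles and slot labels come from the slide; the individual slot-to-argument pairings are an assumption based on the usual analysis of the two sentences, since the slide text does not spell them out, and the `link` function is hypothetical.

```python
# Semantic frame GIVE with its argument roles, as listed on the slide.
GIVE_ARGS = {"Arg1": "Agent", "Arg2": "Theme", "Arg3": "Goal"}

# Slot -> argument correspondences for each SynU (assumed pairings).
# Slots are (position, label) pairs because SynU#2 has two obj_NP slots.
CORRESP = {
    "SynU#1": {(0, "subj_NP"): "Arg1", (1, "obj_NP"): "Arg2", (2, "obl_PP_to"): "Arg3"},
    "SynU#2": {(0, "subj_NP"): "Arg1", (1, "obj_NP"): "Arg3", (2, "obj_NP"): "Arg2"},
}


def link(syn_u, fillers):
    """Map slot fillers (given in slot order) to semantic roles."""
    roles = {}
    for (position, _label), arg in CORRESP[syn_u].items():
        roles[GIVE_ARGS[arg]] = fillers[position]
    return roles


prepositional = link("SynU#1", ["John", "the book", "Mary"])
double_object = link("SynU#2", ["John", "Mary", "the book"])
```

Both calls yield the same role assignment, which is the point of linking the two syntactic units to a single SemU.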

  45. Syntax-Semantic Linking in SIMPLE
      SynU_migliorare, with a Frameset of two structures:
      • Intransitive structure: Slot0, Ø
      • Transitive structure: Slot0, Slot1
      CorrespSynUSemU (isomorphic / non-isomorphic), each with SlotArgCorresp
      PRED_migliorare: ARG0:Agent, ARG1:Patient
      SemU1_migliorare, SemU2_migliorare
      CAUSE_CHANGE_OF_STATE, CHANGE_OF_STATE

  46. The Multilingual layer
      MultiCorresp
      • hasMUMUCorr → MUMUCorresp (1..0)
      • hasSynUSynuCorr → SynUSynUCorresp (1..0)
      • hasSemUSemUCorr → SemUSemUCorresp (1..0)
      • hasSynsetMultCorr → SynsetMultCorresp (1..0)
      • hasSemFrameCorr → SemanticFrameMultCorresp (1..0)

  47. MILE approach to multilinguality
      • Open to various approaches:
        • transfer-based: monolingual descriptions are used to state correspondences (tests and actions) between source and target entries
        • interlingua-based: monolingual entries are linked to language-independent lexical objects (e.g. semantic frames, “primitive predicates”, etc.)

  48. The Multi-MILE
      • Multi-MILE specifies a formal environment to express multilingual correspondences between lexical items
      • Source and target lexical entries can be linked by exploiting (possibly combined) aspects of their monolingual descriptions
      • Monolingual lexicons act as pivot lexical repositories, on top of which language-to-language multilingual modules can be defined

  49. The Multi-MILE
      • Multi-MILE may include:
        • Multilingual operations to establish transfer links between source and target mono-MILE
        • Multilingual lexical objects, which
          • enrich the source and target lexical descriptions, but
          • do not belong to the monolingual lexicons
        • Language-independent lexical objects:
          • Primitive semantic frames, “interlingual synsets”, etc.
          • Relevant for interlingua approaches to multilinguality

  50. Multi-MILE: IT-to-EN multi-MILE
      • Italian mono-MILE: MU_1; SynU_1, SynU_2; SemU_1, SemU_2
      • English mono-MILE: MU_1; SynU_1; SemU_1
      Correspondences:
      • IT_SemU_2 → EN_SemU_1
      • IT_SynU_2 → EN_SynU_1
      • IT_Slot_0 → EN_Slot_1
      • IT_Slot_1 → EN_Slot_0
      Operations:
      • AddFeature to source SemU: +HUMAN
      • AddSlot to target SynU: MODIF [PP_with]
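A transfer-style correspondence of this kind could be sketched as a record holding the linked units, the slot mapping and the two operations. The class `MultiCorresp` below and the `transfer` function are hypothetical, and two readings are assumptions: AddFeature is treated as a test the source must satisfy, and AddSlot as an action that adds an empty slot on the target side.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MultiCorresp:
    source_semu: str
    target_semu: str
    source_synu: str
    target_synu: str
    slot_map: Dict[str, str]
    required_source_features: List[str] = field(default_factory=list)  # AddFeature
    added_target_slots: List[str] = field(default_factory=list)        # AddSlot


def transfer(corr, source_fillers, source_features):
    """Return target slot fillers, or None if the source fails the feature test."""
    if not set(corr.required_source_features) <= set(source_features):
        return None
    target = {corr.slot_map[slot]: filler for slot, filler in source_fillers.items()}
    for slot in corr.added_target_slots:
        target.setdefault(slot, None)   # new slot, still to be filled
    return target


corr = MultiCorresp(
    "IT_SemU_2", "EN_SemU_1", "IT_SynU_2", "EN_SynU_1",
    slot_map={"IT_Slot_0": "EN_Slot_1", "IT_Slot_1": "EN_Slot_0"},
    required_source_features=["+HUMAN"],
    added_target_slots=["MODIF[PP_with]"],
)

ok = transfer(corr, {"IT_Slot_0": "x", "IT_Slot_1": "y"}, ["+HUMAN"])
blocked = transfer(corr, {"IT_Slot_0": "x", "IT_Slot_1": "y"}, [])
```

Keeping the correspondence outside both monolingual entries mirrors the slide's point that multilingual objects enrich the source and target descriptions without belonging to either monolingual lexicon.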
