
Nicoletta Calzolari Istituto di Linguistica Computazionale - CNR - Pisa glottolo@ilc.cnr.it

Words about words and about data: lexicons and metadata, language resources and infrastructures … with some historical notes to show a path along an evolving vision. Nicoletta Calzolari, Istituto di Linguistica Computazionale - CNR - Pisa, glottolo@ilc.cnr.it


Presentation Transcript


  1. Words about words and about data: lexicons and metadata, language resources and infrastructures … with some historical notes to show a path along an evolving vision Nicoletta Calzolari Istituto di Linguistica Computazionale - CNR - Pisa glottolo@ilc.cnr.it Copenhagen, December 2010

  2. Old slide with Antonio Zampolli (’80s/early ’90s) Why are such needed LRs lacking after 30 years of R&D in the field? 1) Because the main trend until the mid-’80s was to privilege the processing of so-called “critical” phenomena, studied by the dominating linguistic theories, rather than focusing on the deep analysis of the real uses of a language • As a result CL was focusing on: • few examples, often artificially built • lexicons made of few entries (toy lexicons) • grammars with poor coverage 2) Because large-scale LRs are costly & their production requires a big organizing effort

  3. Historical notes … back from the late ’70s/’80s Automatic acquisition of lexical information from MRDs: pioneering research It became evident that: • Part of the results of meaning extraction, e.g. many meaning distinctions which could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level and had to be blurred into unique features and values • Unfortunately, it is still difficult today to constrain word meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries • This was my first research & became central in the Pisa group (ACQUILEX) • Also Amsler, Briscoe, Boguraev, Wilks’ group, IBM, then Japanese groups, … • The trend was: “large-scale computational methods for the transformation of machine readable dictionaries into machine tractable dictionaries” • Instead of relying on linguists’ introspection

  4. Historical notes … back from the late ’80s Automatic acquisition of info from texts: after acquisition from MRDs, this trend has today become a consolidated & pervasive fact • From acquisition of “linguistic information” • To acquisition of “general knowledge”, with more data-intensive, robust, reliable methods

  5. PAROLE-SIMPLE-[CLIPS]: 4 levels of linguistic description • Phonological Unit: accent position, vowel openness, consonant pronunciation • Correspondence Ufon.-Umorf. • Morphological Unit: grammatical cat., subcat., inflectional paradigm • Correspondence Umorf.-Usint. • Syntactic Unit: syntactic structure 1 and syntactic structure 2 (each with a. entry properties, b. subcat frame, complement properties), link • Syntax-Semantics Correspondence (Usint.-Usem): predicate, arguments, semantic restrictions • Semantic Unit: ontological type, event type, semantic features, semantic relations, Extended Qualia Structure, regular polysemy, link type, predicative representation

  6. Semantic entry (from Nilda Ruimy)
     Template: ontological type, example, free definition, event type, domain information, qualia features, semantic relations, Extended Qualia Structure, regular polysemy, predicative representation
     semantic entry: vaporizzatore (spray)
     semantic type: instrument
     supertype: artifact
     eventype: =====
     domain: cleaning, gardening, cosmetics
     semantic class: apparatus
     synonymy: nebulizzatore
     morpho. derivation: eventverb vaporizzare
     formal: vaporizzatore isa apparecchio
     constitutive: vaporizzatore has_as_part pulsante
     agentive: vaporizzatore created_by fabbricare
     telic: vaporizzatore used_for atomizzare
     regular polysemy: =====
     semantic predicate: PRED_vaporizzare-1
     type of link: instrument nominalization
     arguments description: range, semantic role, select. restrictions
       arg0_vaporizzare_1  Protoagent  Human/Instrument
       arg1_vaporizzare_1  Protopatient  +liquid
       arg2_vaporizzare_1  Location  Concrete_entity
     Encoded form of the same entry:
       USem3527vaporizzatore
       semantic type: Instrument
       unification_path: [Concrete_entity | ArtifactAgentive | Telic]
       free definition: apparecchio usato per vaporizzare (device used for spraying)
       example: un vaporizzatore per piante (a plant sprayer)
       eventype: =====
       domain: cleaning, gardening, cosmetics
       USem3527vaporizzatore synonymy USem72288nebulizzatore
       USem3527vaporizzatore instrumentverb Usem5239vaporizzare
       USem3527vaporizzatore isa Usem3479apparecchio
       USem3527vaporizzatore has_as_part Usem61633pulsante
       USem3527vaporizzatore created_by UsemD387fabbricare
       USem3527vaporizzatore used_for UsemD66019nebulizzare
       regular polysemy: =====
     Very innovative… too much??

  7. The SIMPLE Ontology: multidimensionality TOP CAUSE • PART • GROUP • AMOUNT • Location • Material • Artifact • Food • Physical Object • Organic Object • Living Entity • Substance • Quality • Psych Property • Physi Property • Social Property • Domain • Time • Moral Standards • Cognitive Fact • Mvmt of Thought • Institution • Convention • Abstract Location • Language • Sign • Information • Number • Unit of measure • Metalanguage • Human • Animal • Vegetal Entity Phenomenon Aspectual State Act Psychological_event Change Cause_change • Cognitive Event • Experience Event • Rel. Change • Change Possession • Change Location • Natural Transition • Acquire Knowledge • Cause Rel. Change • Cause Change Location • Cause Natural Transition • Creation • Give Knowledge • Weather verbs • Disease • Stimuli Cause Aspect. • Exist • Rel. State • Non Rel. Act • Relational Act • Move • Cause Act • Speech Act Top-level nodes: TELIC, CONSTITUTIVE, AGENTIVE, ENTITY, CONCRETE_ENTITY, PROPERTY, ABSTRACT_ENTITY, REPRESENTATION, EVENT • Artifact Material • Furniture • Clothing • Container • Artwork • Instrument • Money • Vehicle • Semiotic Artifact from Nilda Ruimy

  8. Meaning dimensions expressed by Qualia relations botte (barrel), traditional dictionary definition: “recipiente di legno fatto di doghe arcuate tenute unite da cerchi di ferro che serve per la conservazione e il trasporto di liquidi, specialmente vino” (a wooden container made of curved staves held together by iron hoops, used for storing and transporting liquids, especially wine) Qualia relations marked on the definition: Formal: isa • Constitutive: made_of • Agentive: created_by • Constitutive: made_of • Telic: used_for • Constitutive: contains From Nilda Ruimy
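
A minimal sketch of how the qualia relations for "botte" could be held as data and grouped by qualia role. The pairing of each relation with a word from the definition is one plausible reading of the slide, not something the transcript states; the agentive `created_by` relation is left out because its target cannot be recovered here. `BOTTE_QUALIA` and `group_by_role` are hypothetical names.

```python
from collections import defaultdict

# (qualia role, relation, target) triples for "botte" (barrel).
# Targets are informal Italian glosses taken from the definition,
# NOT identifiers from the SIMPLE lexicon (assumption).
BOTTE_QUALIA = [
    ("formal", "isa", "recipiente"),
    ("constitutive", "made_of", "legno"),
    ("constitutive", "made_of", "doghe"),
    ("telic", "used_for", "conservazione"),
    ("telic", "used_for", "trasporto"),
    ("constitutive", "contains", "liquidi"),
]


def group_by_role(triples):
    """Group (role, relation, target) triples by qualia role, keeping order."""
    grouped = defaultdict(list)
    for role, relation, target in triples:
        grouped[role].append((relation, target))
    return dict(grouped)


by_role = group_by_role(BOTTE_QUALIA)
for role, relations in by_role.items():
    print(role, relations)
```

Grouping by role makes the "meaning dimensions" explicit: each role collects the part of the dictionary definition that answers a different question about the word.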

  9. “Extended” Qualia Structure Constitutive Agentive Telic Formal made_of is_a_follower_of has_as_member is_a_member_of has_as_part instrument kinship is_a_part_of resulting_state relates uses causes concerns affects constitutive_activity contains has_as_colour has_as_effect has_as_property measured_by measures produces produced_by property_of quantifies related_to successor_of precedes typical_of feeling is_in lives_in typical_location used_for used_as used_by used_against result_of agentive_prog agentive_cause agentive_experience caused_by source created_by derived_from AGENTIVE CONSTITUTIVE is_a antonym_comp antonym_grad mult_opposition INSTRUMENTAL indirect_telic purpose TELIC ARTIFACTUAL AGENTIVE is_the_activity_of is_the_ability_of is_the_habit_of ACTIVITY DIRECT TELIC object_of_activity PROPERTY LOCATION NEW! regulates is_regulated_by … T-cell, Blood Stem Cell Ribose, Nucleotide Catalyze, Enzyme

  10. Ontologisation of SIMPLE • Automatically converting and enriching a computational lexicon into a formal Ontology • For NLP semantic tasks • Potential of ontologies in NLP as Backbone in LKBs • Pivot in multilingual architectures (e.g. KYOTO) • Reasoning capabilities • Ontologisation of SIMPLE into OWL • Conversion of the SIMPLE ontology • Bottom-up enrichment: promoting lexicon knowledge to the ontology level • Language independent knowledge from Italian lexico-semantic information from Antonio Toral

  11. Use of SIMPLE Lexicon & Ontology for Time and Event detection/annotation (TimeML) Different PoS may realise an event: verbs, nouns, adjectives, prep. phrases The SIMPLE Lexicon helps in identifying & classifying Events (eventive nouns & adjectives) → in a 10K-word annotation experiment • each event is associated with an Ontological Type • the Event-Type from the SIMPLE Ontology can be used as default value to provide event composition, and consequently to instantiate a temporal representation for each Event • improvement both in identification & classification of Events by annotators: 81.17% accuracy (vs. 72.35%) and K-coefficient = 0.84 (vs. 0.7) Diagram: SIMPLE Lexicon, Morpho-Syntactic Analysis, Event Detection & Classification from Tommaso Caselli
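
The slide does not say which agreement statistic the "K-coefficient" is. Assuming it is Cohen's kappa for two annotators, a minimal sketch of the computation follows. The labels in the usage example are invented for illustration and are not data from the annotation experiment.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one single identical label
    return (observed - expected) / (1 - expected)


# Invented toy labels (hypothetical event types), four items.
annotator_1 = ["Act", "Act", "State", "State"]
annotator_2 = ["Act", "Act", "State", "Act"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # observed 0.75, chance 0.5, so kappa is 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually reported alongside accuracy figures such as those on the slide.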

  12. GLML – Generative Lexicon Markup Language with James Pustejovsky, Olga Batiukova, Anna Rumshisky, Marc Verhagen • Annotate texts with Argument Selection, Argument Coercion, & Qualia Roles • The corpus brings reality to the model and provides statistical cues to improve language models • Lexical semantic info, like type coercion/selection, is required for applications such as WSD, categorisation, IR (query reformulation, filtering…), IE (coreference resolution, relation extraction…), entailment, … • Predicate-Argument constructions • Predicate Sense Disambiguation • Argument selection: type selection/coercion • Qualia role/relation selection • Modification constructions • Noun Sense Disambiguation • Qualia role/relation selection in Adjectival Modification • Qualia role/relation selection in Nominal Modification • Complex Types • Type selection in modification of Dot Objects from Valeria Quochi

  13. Between LRs and Linguistics: Consequence of the corpus-based approach Lesson learned: • Break hypotheses too easily taken for granted in mainstream linguistics • Many properties behave as a continuum, not as “yes/no” properties • So-called “rules” are simplifications or idealisations dispelled by real usage, more frequently “tendencies” towards a rule

  14. Looking into the past After the “Grosseto Workshop” (1985): a turning point All started with the situation we had in the late ’80s – early ’90s, with all the Xxx-LEX projects: GeneLex, AcquiLex, MultiLex, Xxx-Lex A. Zampolli: Let’s be coherent: • EAGLES • ISLE Standards, Best Practices, ...

  15. MILE – Modularity: the building-block model • Allows expressing different dimensions of lexical entries • Enables modular specification of lexical entries • Creates ready-to-use packages to be combined in different ways Diagram: Lexical entry 1, Lexical entry 2, Lexical entry 3 built from Lexical Objects (Sem feature, Syn feature, syntactic frame, slot, phrase)

  16. MILE Lexical Data Category Registry: a library of pre-instantiated objects • Define (an ontology of) lexical objects • represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc. • specify the relevant attributes • define the relations with other classes • hierarchically structured • Can be used “off the shelf” or as a departure point for the definition of new or modified categories
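
The "off the shelf or as a departure point" idea can be illustrated with a small registry of immutable objects from which modified categories are derived. This is only a sketch: `LexicalObject`, `REGISTRY`, `derive` and all attribute names are hypothetical and do not reproduce the actual MILE data model.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LexicalObject:
    """One pre-instantiated entry of a (hypothetical) data category registry."""
    name: str
    kind: str                       # lexical notion, e.g. "semantic relation"
    parent: Optional[str] = None    # position in the hierarchy
    attributes: Tuple[str, ...] = ()


# Library of ready-made objects that can be used as they are.
REGISTRY = {
    "SemanticRelation": LexicalObject(
        "SemanticRelation", "semantic relation", None, ("source", "target")
    ),
    "SyntacticFrame": LexicalObject(
        "SyntacticFrame", "syntactic frame", None, ("slots",)
    ),
}


def derive(registry, base, new_name, **changes):
    """Create a modified category that starts from an existing one."""
    return replace(registry[base], name=new_name, parent=base, **changes)


# Off-the-shelf use versus departure point for a new category.
off_the_shelf = REGISTRY["SyntacticFrame"]
custom = derive(
    REGISTRY, "SemanticRelation", "QualiaRelation",
    attributes=("source", "target", "qualia_role"),
)
print(custom)
```

Keeping the registry objects frozen means a derived category never alters the shared original, which is what lets several lexicons reuse the same library safely.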

  17. ISO LMF: Lexical Markup Framework Builds on EAGLES/ISLE The field is mature Structural skeleton, with the basic hierarchy of information in a lexical entry + various extensions • Modular framework • LMF specs comply with UML modelling principles • an XML DTD allows implementation New initiatives … LIRICS, ICT KYOTO, NEDO Asian Languages, LexInfo, NICT Language-Grid Service Ontology

  18. Principles of LMF: from very simple lexicons …

  19. … to very rich ones … DCR

  20. BioLexicon: SIMPLE model & ISO-LMF standard A unique large-scale computational lexicon in the biomedical domain in terms of coverage & typology of information BL: • Populated with info from available biomedical resources • Including both domain-specific & general language words • Semi-automatically populated from corpora from Monica Monachini

  21. KYOTO www.kyoto-project.nl A search environment that uses semantic technologies A “compass” for the web2.0 • Interdisciplinarity • scientific community (LRT, web technologies, knowledge engineers), companies, domain experts • Multilingualism • 7 languages (2 Asiatic languages) Kyoto Core System is distributed under free open source (GPL) licence The lexical resource perspective: • needs to share lexical & knowledge bases • both general & domain-related • in the form of lexical repositories and ontologies from Monica Monachini

  22. KYOTO SYSTEM (architecture diagram) Components shown: Source Documents; Linear MAF/SYNAF; Term extraction (Tybot); Semantic annotation; Generic TMF; Linear SEMAF; Fact extraction (Kybot); Domain editing (Wikyoto); Fact User; Concept User; LMF API; OWL API; Linear Generic FACTAF; Domain Wordnet; Domain ontology; Wordnet; Ontology from Piek Vossen

  23. Annotation Format (KAF) Multi-level Annotation Format • stand-off annotation • uniform representation for 7 languages • Text: tokenisation, sentences, paragraphs with reference to the sources • Terms: words & multi-words, parts-of-speech, etc. • Chunks: constituents & syntagmatic realization • Dependencies: grammatical functions • L1 – Semantic modules: Multiword tagging, Sense Tagging, Named Entity Recognition, OntoTagging • L2 – Semantic module: event/fact extraction • Shared across the languages
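
A minimal sketch of the stand-off principle behind such a layered format: the source text is never modified, the text layer records tokens as character offsets, and the term layer points at token ids. The dictionary keys and the helper names (`tokenise`, `add_term`, `surface`) are hypothetical and do not reproduce the actual KAF schema.

```python
import re


def tokenise(text):
    """Text layer: tokens with ids and character offsets into the source."""
    return [
        {"id": f"w{i}", "start": m.start(), "end": m.end()}
        for i, m in enumerate(re.finditer(r"\w+", text), start=1)
    ]


def add_term(terms, token_ids, lemma, pos):
    """Term layer: a word or multi-word that refers to tokens by id only."""
    term = {
        "id": f"t{len(terms) + 1}",
        "span": list(token_ids),
        "lemma": lemma,
        "pos": pos,
    }
    terms.append(term)
    return term


def surface(text, tokens, term):
    """Recover the surface string of a term from the untouched source text."""
    by_id = {token["id"]: token for token in tokens}
    return " ".join(
        text[by_id[tid]["start"]:by_id[tid]["end"]] for tid in term["span"]
    )


text = "Water pollution harms fish"
tokens = tokenise(text)
terms = []
multiword = add_term(terms, ["w1", "w2"], "water pollution", "N")
print(surface(text, tokens, multiword))
```

Because each layer only references the one beneath it, further layers (chunks, dependencies, semantic tags) can be added by different tools without rewriting earlier annotations.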

  24. Collaborative Platform • Automatic extraction of concepts • Validation & enrichment of domain WordNets • Anchored to the ontology by domain experts • In a collaborative platform Diagram: KYOTO Multilingual Knowledge Base (Italian, English, Dutch, Chinese, Spanish, Basque, Japanese); KYOTO Central Ontology; Wikyoto Knowledge Editor; Linguistic Processors; Term Extractor; Domain Expert Users

  25. A common representation format for WordNets Seven WordNets (WnIT, WnNL, WnEN, WnES, WnEU, WnCH, WnJP) • similar but not identical → hampered interoperability • to be accessed both intra- and inter-linguistically → to support easier integration • endow WordNet with a representation format allowing easy access, integration & interoperability among resources

  26. A common representation format: WordNet-LMF (class diagram, linked to Data Categories) Classes shown: LexicalResource, GlobalInformation, Lexicon, SenseAxes, SenseAxis, LexicalEntry, Lemma, Sense, Synset, Definition, Statement, SynsetRelations, SynsetRelation, MonolingualExternalRefs, MonolingualExternalRef, InterlingualExternalRefs, InterlingualExternalRef, Meta from Monica Monachini

  27. Centralized WordNet DC Registry A list of 85 sem. rels as a result of a mapping of the KYOTO WordNet grid: Inter-WN, Intra-WN

  28. WordNet-LMF Multilingual level - Cross-lingual Relations
      <!ELEMENT SenseAxes (SenseAxis+)>
      <!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
      <!ATTLIST SenseAxis
          id ID #REQUIRED
          relType CDATA #REQUIRED>
      <!ELEMENT Target EMPTY>
      <!ATTLIST Target
          ID CDATA #REQUIRED>
      <!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
      <!ELEMENT InterlingualExternalRef (Meta?)>
      <!ATTLIST InterlingualExternalRef
          externalSystem CDATA #REQUIRED
          externalReference CDATA #REQUIRED
          relType (at|plus|equal) #IMPLIED>
      Callouts on the slide: groups monolingual synsets corresponding to each other and sharing the same relations to English WN3.0; specifies the type of correspondence; link to ontology/(ies)
      Example synsets: IWN <fuoco_1, fiamma_1> 00001251-n; SWN <fuego_3, llama_1> 09686541-n; WN3.0 <fire_1 flame_1 flaming_1> 13480848-n
      from Monica Monachini
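
A sketch of building one `SenseAxis` that follows the DTD fragment on this slide, using Python's standard `xml.etree.ElementTree`. Element and attribute names come from the DTD; the axis id `sa_001`, the `relType` value `eq_synonym`, and the external system and reference are placeholder assumptions, and using the bare synset numbers as `Target` ids is also an assumption.

```python
import xml.etree.ElementTree as ET

ALLOWED_REF_TYPES = {"at", "plus", "equal"}  # enumerated in the DTD


def build_sense_axis(axis_id, rel_type, target_ids, external_refs=()):
    """Build a SenseAxis element: Target+ then optional external refs."""
    if not target_ids:
        raise ValueError("the DTD requires at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        ET.SubElement(axis, "Target", {"ID": target_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            if ref_type not in ALLOWED_REF_TYPES:
                raise ValueError(f"relType must be one of {ALLOWED_REF_TYPES}")
            ET.SubElement(refs, "InterlingualExternalRef", {
                "externalSystem": system,
                "externalReference": reference,
                "relType": ref_type,
            })
    return axis


axes = ET.Element("SenseAxes")
axes.append(build_sense_axis(
    "sa_001",                        # placeholder axis id
    "eq_synonym",                    # placeholder correspondence type
    ["00001251-n", "09686541-n"],    # Italian and Spanish synsets from the slide
    [("ExampleOntology", "Fire", "equal")],  # placeholder ontology link
))
xml_text = ET.tostring(axes, encoding="unicode")
print(xml_text)
```

The sketch omits the optional `Meta` element and does not validate against the DTD; a validating parser would be needed for that.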

  29. Mining facts

  32. Complex picture! Is there anything we need to do for Interoperability? ISO SC 4/WG 4 – Lexicon-Ontology relations, PWI 24622 Work within ISO: • LMF: abstract meta-model for lexical representation • Ontology Group or more Groups? • Language Resource Ontologies: ontology of data categories Real life: • Lexicons (e.g. WordNets) that are called Ontologies • Lexicons linked to Ontologies: to be used in applications, in multilingual systems, domains, … • Work on “ontologising” Lexicons: to allow exploiting various relations, to make inferences, … • Semantic Lexicons, with many types of relations among semantic units: these are often of “conceptual/world-knowledge” nature. Do we want DCs for these?

  33. To explore the need to do something within ISO about the relations between Lexicon and Ontology Do we/ISO need to address another (lexical) layer? • How lexicons and ontologies are linked and information mapped from one to the other • The ontological layer in, or connected to, a lexicon Possible issues/questions: • Is LMF enough to represent Ontological links? • How to connect work being done in the ISO Lexical group and ISO Ontology groups? • Lexicon and Ontologies: separation? or lexicalised ontologies? or ontologised lexicons? • Lexicon, Ontologies and Domains • On a very different dimension: Ontology of lexical/semantic/conceptual categories? Standardised semantic categories, ontology labels? • Relation to multilinguality • ...

  34. A new paradigm of R&D in LRs & LT For a few years now: adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs, tools & results, e.g. initiatives to achieve international consensus on annotation guidelines • Ability to build on each other’s achievements, allowing controlled & effective cooperation of many groups on common tasks (see the Human Genome Project)

  35. Some steps for a “new generation” of LRs

  36. Lexical WEB: Global WordNet GRID, WordNets, NomLex, ComLex, SIMPLE, SIMPLE-WEB, BioLexicon, FrameNet, Lex_x, Lex_y, LMF, with intelligent agents Standards for Content Interoperability: Enough??

  37. Distributed Language Services A scenario implying: Enabling:

  38. Which Communities? Many LRs & LTs exist, but a global vision, policy & strategy is needed for • Language Resources • Language Technologies • Standardisation • Content/Ontologies • System developers • Integrators • SSH • Many applications/domains: MT, CLIR, …, e-government, content industry, intelligence, e-culture, e-health, domotics… Labels on the slide: core; EU Forum; CLARIN for SSH; FLaReNet Network; META-NET NoE • Need to consider together technical, organisational, strategic, economic, social, cultural, legal, political issues wrt LRs & LTs, with • EC • National funding agencies • Industry

  39. Fostering Language Resources Network: FLaReNet at a glance http://www.flarenet.eu An international Forum to facilitate interaction, to • Overcome the fragmentation in LR & LT & recreate a community • Anticipate the needs of new types of LR & LT & Language Infrastructures • Create a shared policy for the next years • Foster a European strategy for consolidating the sector Ambitious! • A “roadmap”: a plan of actions as input to policy development • A (Eu) model for the LRs/LTs area of the next years Essential: Community mobilisation (also to prepare the ground for a RI)

  40. Results from Vienna & Barcelona Forum: Shaping the Future of the Multilingual Digital Europe • Create a shared repository of data formats, annotations, etc. as a major help to achieve standardisation • Common repositories for tools & language data should be established that are universally and easily accessible by everyone • Coordinate input to ISO/W3C standardisation work Access to LRs is critical & should involve all the community Need to create the means to plug together different LR & LT, in a web-based resource and technology “grid”

  41. 2nd Blueprint http://www.flarenet.eu/sites/default/files/D8.2b.pdf • Result of a permanent and cyclical consultation • Inside the community it represents • Outside it, through connections with neighbouring projects, associations, initiatives, funding agencies • Organised along three main “directions”: • Infrastructural Aspects • Research and Development • Political and Strategic Issues • Reflect three major development factors that can boost or hinder the growth of the field of LRT

  42. Infrastructural Aspects

  43. 2nd FLaReNet Forum in Barcelona – February 2010 The European Language Resources and Technologies Forum: Language Resources of the Future, The Future of Language Resources Important role in defining recommendations 120 Participants, 22 Countries Proceedings & Reports on the web Blueprint to FLaReNet Institutional Members for adoption & endorsement

  44. The Project META-NET META-NET is a Network of Excellence dedicated to fostering the technological foundations of the European multilingual information society Objectives: • Prepare the ground for a large-scale concerted effort by building a strategic alliance of national and international research programmes, corporate users and commercial technology providers and language communities • Strengthen the European research community through research networking and by creating new schemes and structures for sharing resources and efforts • Build bridges by approaching open problems in collaboration with other research fields such as machine learning, social computing, cognitive systems, knowledge technologies and multimedia content Final goal: META – The Multilingual Europe Technology Alliance

  45. Data has become a key factor in LT R&D • A few indicators: • Increasing size and importance of the LREC conference, corpora mailing list etc. • Citation ranks of publications on language resources • Language research and language technology belong to the Data Intensive Sciences • Expensive data become valuable through sharing • However, the long demanded and well-contemplated instruments for managing and sharing this data are still missing

  46. META-SHARE: Key Features • META-SHARE is an open, integrated, secure, interoperable exchange infrastructure for language data and tools for the Human Language Technologies domain • A marketplace where language data and tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded, exchanged, aiming to support a data economy (includes free and for-a-fee LRs/LTs and also services) • Standards-compliant, overcoming format, terminological and semantic differences • Based on distributed networked repositories accessible through common interfaces

  47. What we’re offering • A channel to share and distribute language data and tools • Technical solutions for building your own repositories • Protocols and mechanisms for making the descriptions of your resources (and the actual resources) harvestable • Guidelines and recommendations on standards used in the LR production and documentation processes • Recommendations on data and tools licensing issues • Access to large catalogues of documented, high-quality resources, as well as the actual data and tools

  48. META-SHARE is a big step that needs a real Paradigm shift • LR building as collaborative “common shared task” • New methodology of work • Assemble a comprehensive “map of language data and mechanisms” for the planet’s languages (LRE Map) • Interoperability acquires even more value • Needs consensual planning of common strategies towards shared objectives • Not just the sum of many individual efforts • But an organised, well-structured, collective enterprise • Similar to more mature sciences: physicists’/astronomers’ experiments … of X,000 people working on the same big enterprise

  49. LRE Map: Why?? • www.resourcebook.eu Problem: • Lack of information & documentation about resources is, in the e-science paradigm, a very critical issue • Undocumented resources don’t exist!! The Map as an answer to start filling this gap, but also: • To encourage the needed “change in culture” • A collective enterprise: each researcher must become aware of the importance of his/her personal engagement in documenting resources • A task as important as creating new resources, not an accessory to be disregarded • A necessary service to the whole community • Will become an essential instrument to monitor the field

  50. How many LRs & Types at LREC? Languages: 170! How many LRs & Types at COLING?
