
Nicoletta Calzolari Istituto di Linguistica Computazionale - CNR - Pisa glottolo@ilc.cnr.it

An Infrastructure of Language Resources & Language Technologies: Why we need it? Priorities & Challenges


Presentation Transcript


  1. An Infrastructure of Language Resources & Language Technologies: Why we need it? Priorities & Challenges
     Nicoletta Calzolari, Istituto di Linguistica Computazionale - CNR - Pisa, glottolo@ilc.cnr.it
     Nicoletta Calzolari - Emerging technologies for Digital Libraries, Poland, November 2006

  2. What have we (LT & LR) been assembling for many years?
     • Lexicons & their Ontologies: Written, Spoken, ItalWordNets, PAROLE/SIMPLE, …
     • Annotated corpora/Treebanks
     • Basic Tools
     • Integrated Architecture for:
       • Annotation at various levels (from morphological to conceptual)
       • Acquisition/learning
       • Classification
       • Ontology creation
       • …
     • Standards
     • Methodologies
     • Know-how & expertise
     • Infrastructural bodies (on which to build)
     … a very large infrastructure of LRs & LT

  3. History: some international LR initiatives
     • EuroWordNet • MATE • NITE • Cluster 488 (Italian) • TAL (Italian) • ISLE • ENABLER • INTERA • …
     • SENSEVAL • WRITE • Forum TAL (Italian) • …
     • LIRICS • ISO • ELRA • LREC • LRE Journal • NEDO • …
     • ACQUILEX [since ’88] • MULTILEX • ET-7 • ET-10 • TEI • NERC • RELATOR • ONOMASTICA • MULTEXT • COLSIT • LSGRAM • DELIS • EAGLES • PAROLE • SIMPLE • SPARKLE • ELSNET
     The EU was at the forefront in the areas of LRs and standards in the ’90s.
     The EC played an essential role in starting a basic infrastructure, and established a model.

  4. Today: a broad “potential” infrastructure
     Signs of vitality & success for LRs:
     • RELATOR • EAGLES/ISLE • ENABLER • ELSNET • TELRI • INTERA • …
     • LIRICS • ELRA • BLARK • Unified Lexicon (W/S) • LREC • LRE journal • …
     • ERANET-LangNet • …
     Cooperative initiatives, with links to (diagram labels): EU; International; LDC & others; ISO; COCOSDA/WRITE; US Cyberinfrastructure; Japan COE21; NEDO; …; National …; CLARIN (ESFRI proposal)

  5. WordNets: synsets linked by semantic relations
     Central synset: {casa, abitazione, dimora}; English synsets shown alongside: {home, domicile, ..}, {house}
     • TOP Concepts: Object, Artifact, Building
     • Hyperonym: {edificio, ..}
     • Hyponym: {villetta}, {catapecchia, bicocca, ..}; English synsets shown alongside: {cottage}, {bungalow}
     • Role_location: {stare, abitare, ...}
     • Role_target_direction: {rincasare}
     • Role_patient: {affitto, locazione}
     • Mero_part: {vestibolo}, {stanza}
     • Holo_part: {casale}, {frazione}, {caseggiato}
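
The synset-and-relations structure on this slide can be sketched as a small data structure. This is only an illustrative sketch in Python, not the EuroWordNet or ItalWordNet data format: the class `Synset`, the helper `synset` and the methods `link` and `related` are hypothetical names, and only words and relation labels from the slide are used.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words plus labelled links to other synsets."""
    words: frozenset
    relations: dict = field(default_factory=dict)  # relation label -> list of Synset

    def link(self, label, target):
        """Add a labelled relation (e.g. 'Hyponym') from this synset to target."""
        self.relations.setdefault(label, []).append(target)

    def related(self, label):
        """Return the word lists of all synsets reached through the given label."""
        return [sorted(s.words) for s in self.relations.get(label, [])]


def synset(*words):
    """Hypothetical convenience constructor."""
    return Synset(frozenset(words))


# The example synset from the slide, with a few of its relations.
casa = synset("casa", "abitazione", "dimora")
casa.link("Hyperonym", synset("edificio"))
casa.link("Hyponym", synset("villetta"))
casa.link("Hyponym", synset("catapecchia", "bicocca"))
casa.link("Role_location", synset("stare", "abitare"))
casa.link("Mero_part", synset("vestibolo"))
casa.link("Mero_part", synset("stanza"))

print(casa.related("Hyponym"))
```

Keeping relations as a mapping from label to target synsets means new relation types (Role_patient, Holo_part, …) need no change to the class.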

  6. Terminological WordNets: e.g. Jur-WordNet
     • Jur-WordNet → extension of ItalWordNet for the juridical domain (with ITTIG-CNR - Istituto di Teoria e Tecniche dell’Informazione Giuridica)
     • Knowledge base for multilingual access to sources of legal information
     • Source of metadata for semantic markup of legal texts
     • To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.

  7. PAROLE-SIMPLE-CLIPS Lexicon: … harmonised model for 12 European languages

  8. Semantic Relations (about 100 relations)
     [Diagram: a Top node with the four types Formal, Constitutive, Agentive and Telic; relation labels shown: Is_a, Is_a_part_of, Property, Created_by, Agentive_cause, Indirect_telic, Purpose, Contains, Instrumental, Is_the_habit_of, Used_for, Used_as, Activity, ...]
     The targets of relations identify:
     • prototypical semantic information associated with a SemU
     • elements of dictionary definitions of SemUs
     • typical corpus collocates of the SemU
     For a BioLexicon

  9. Domain - Semantic class (example: mangiare)

  10. Domain - Semantic class
     [Diagram: lexical entries (zucchero, alloro, tartufo, carne, mela, carota, coniglio, pesce, arrosto, mestolo, pentola, friggitrice, bollitore, pesciera, forchetta, posata, tavola, ristorante, cuoco) assigned to semantic classes (NATURAL_SUBSTANCE, FLAVOURING, VEGETAL_ENTITY, FOOD, FRUIT, VEGETABLES, SUBSTANCE_FOOD, ARTIFACT_FOOD, INSTRUMENT, CONTAINER, FURNITURE, BUILDING, PROFESSION) and linked to verbs (mangiare, cucinare, cuocere, arrostire, bollire, lessare, stufare, friggere, rosolare, grigliare, …) through the feature +edible and the relations Used_for, Object_of_the_activity, Is_the_activity_of and Created_by, with the qualia labels TELIC and AGENTIVE]

  11. Which new challenges for a mature infrastructure of LRs & LT?
     In the ’90s: there was a global vision of the field & its main components: Standards, Creation of LRs, Automatic acquisition, Distribution
     Today: the wealth of data & basic technology is such that we should look again at the field as a whole & ask whether these are still “the” important components, or how they have changed/must change
     • Content interoperability
     • Collaborative creation & management
     • Dynamic LRs
     • Sharing
     • Distributed architectures
     Technology exists + need tools
     These dimensions could be at the basis of a new paradigm for LRs & LT & of a new infrastructure

  12. Challenges & Priorities for LRs, with technological and/or organisational/political aspects
     • Basic LR coverage for all languages (BLARK/ELARK)
     • Specific (new) types of LRs: opinion, sentiment, emotion, subjectivity
     • “Example-based” context-sensitive LRs; Lexicon & Corpus together, dynamically created; new ways to extract value from large linguistic repositories: the Web exploited as a multilingual corpus
     • Tools to quickly develop LRs (acquisition, annotation, porting between domains/languages); coordinating the development of LTs & LRs (also across languages)
     • Knowledge transfer across languages; maintenance of LRs
     • Cooperation between the communities of HLT & Semantic Web/Ontologists
     • “Open Source” concept for LRs & LT; open & distributed architectures for LRs and LT; wiki-mode?
     • GRID technology
     • …
     Side labels: Collaborative Infrastructures; Interoperability & Standards; Multilinguality; Unifying frameworks

  13. LT & “new” topics
     • Subjectivity, opinion, sentiment, emotion: an issue orthogonal to objective content. Detection/separation of subjective from objective content, opinion mining and extraction of positive & negative perceptions have an obvious and large impact on many applications, e.g. business intelligence
     • Commonsense understanding, with major implications:
       • allows commonsense reasoning/inference: plausible vs logical, for fail-soft applications
       • can be pursued in a distributed and collaborative fashion by the community as a whole
       • relates to how an agent might put together SW services to accomplish high-level goals for the user
     • Temporal structure, for which de facto standards are emerging (TimeML)
     • Integration of text, speech and gesture
     • Strategies for handling miscommunication
     • Hybrid approaches, interdisciplinary approaches
     • …
     Side label: Multimodality

  14. In the Semantic Web vision ...
     … need to tackle the twofold challenge of
     • content availability &
     • multilinguality
     Natural convergence with HLT:
     • multilingual semantic processing
     • ontologies
     • semantic-syntactic computational lexicons

  15. Issues in the LR & LT research agenda converging with Semantic Web needs
     From LT:
     • Meaning & content → Knowledge
     • Semantic markup: concept-based text representation
     • Semantic lexicons/Terminologies/Ontologies
     To create a web of metadata
     Vice versa, from SW:
     • LRs as web services
     • Ontologies for LRs & LT
     • Collaborative & distributed infrastructure; open access
     • Interoperability & standards
     to add meaning to Web data & make it usable for processing and mining, add spatial & temporal metadata, …

  16. Computational Lexicons: challenges from the Semantic Web
     The Semantic Web vision: turning the WWW into a machine-understandable knowledge base
     [Diagram labels: Ontologies; Computational Lexicons; Knowledge Markup; Linguistic Markup; Documents; Databases; Intelligent Agents; Semantic Web Applications]

  17. Ontologies and Computational Lexicons
     [Diagram labels: Ontology (Concept Space); Computational Lexicon (Language/s): Semantics (polysemy, context-sensitiveness, etc.), Syntax, Morphology; Multilinguality]

  18. Ontology Learning
     • Term extraction from text: {museo, quadro, pinacoteca, biblioteca, sito_archeologico, museo_archeologico, museo_etrusco, scultura, affresco, …}
     • Conceptual clustering of terms (TL+ML):
       • C_MUSEO: {museo, pinacoteca, …}
       • C_MUSEO_ARCHEOLOGICO: {museo_archeologico, museo_etrusco, …}
       • C_OPERA_ARTISTICA: {quadro, scultura, affresco, …}
     • Concept structuring: C_MUSEO_ARCHEOLOGICO is_a C_MUSEO
     → Ontology
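
The last step of this pipeline, going from clustered terms to an is_a hierarchy, can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the method used by the author: the slide does not say how concepts are structured, so the sketch swaps in a simple head-word heuristic (a multiword term is treated as more specific than its first word), and the function names `head`, `term_is_a` and `concept_is_a` are hypothetical. Terms and clusters are copied from the slide.

```python
def head(term):
    """Head word of an underscore-joined term (ASSUMPTION: head-first, as in Italian)."""
    return term.split("_")[0]


def term_is_a(terms):
    """Propose (specific, general) pairs: a multiword term is_a its head, if the head is itself a term."""
    vocabulary = set(terms)
    return [(t, head(t)) for t in terms if head(t) != t and head(t) in vocabulary]


def concept_is_a(term_pairs, clusters):
    """Lift term-level pairs to concept level using the concept -> terms clustering."""
    concept_of = {t: c for c, members in clusters.items() for t in members}
    return {
        (concept_of[s], concept_of[g])
        for s, g in term_pairs
        if s in concept_of and g in concept_of and concept_of[s] != concept_of[g]
    }


# Step 1 output (extracted terms) and step 2 output (clusters), as listed on the slide.
terms = ["museo", "quadro", "pinacoteca", "biblioteca", "sito_archeologico",
         "museo_archeologico", "museo_etrusco", "scultura", "affresco"]
clusters = {
    "C_MUSEO": ["museo", "pinacoteca"],
    "C_MUSEO_ARCHEOLOGICO": ["museo_archeologico", "museo_etrusco"],
    "C_OPERA_ARTISTICA": ["quadro", "scultura", "affresco"],
}

pairs = term_is_a(terms)
hierarchy = concept_is_a(pairs, clusters)
print(hierarchy)
```

With the slide's data the heuristic proposes a single concept-level link, from C_MUSEO_ARCHEOLOGICO to C_MUSEO; "sito_archeologico" yields nothing because its head "sito" is not among the extracted terms.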

  19. Ontology Learning in T2K: from thesaurus to conceptual map
     • Identification of horizontal relations among terms through the events that best characterise them
     • Events: situations involving domain entities

  20. For Applications: Semantic/Conceptual Annotation of Texts
     [Diagram labels: Tools for terminology extraction; Module of analysis of Italian; Tools for annotation of the logical structure; Reference Lexical Resources; Structured Knowledge; LOGICAL FORM]

  21. Language Tech … & … Knowledge, Content: Ready???
     How to cooperate??
     [Diagram labels: LT & LRs; Semantic Web; Hum&SS; Knowledge Markup; Content; Interoperable LRs & LT]

  22. A new paradigm of R&D in LRs & LT
     • Open & distributed linguistic infrastructures for LRs & LT
       • adopting the paradigm of accumulation of knowledge that has been so successful in more mature disciplines, based on sharing LRs & tools
       • ability to build on each other's achievements, with results accessible to various systems, allowing controlled & effective cooperation of many groups on common tasks (see HGP → HLP)
     • Emerging concept of collective intelligence
     • Emphasize interoperability among LRs, LT & knowledge bases
       • e.g. initiatives aimed at achieving international consensus on annotation guidelines: to merge annotation efforts and produce coherent, comprehensive linguistic annotations that can be readily disseminated throughout the community

  23. ISO & LIRICS: Meta-model & Data Categories
     Also builds on EAGLES/ISLE
     E.g. proposal for an ISO standard for NLP lexica:
     • Define a Lexical Markup Framework, a general & abstract meta-model & a set of structural nodes relevant for linguistic description
     • Define a flexible environment, enabling specific implementations of user-defined mark-up languages (called LML) on the basis of common DCs
     Objectives:
     • Design of the abstract lexical meta-model
     • Definition of the common set of related Data Categories
     The field is mature

  24. MILE Lexical Model: Data Categories for Content Interoperability
     MILE = Multilingual ISLE Lexical Entry
     [Diagram labels: MILE Entry Schema; MILE Lexical Classes; RDF/S Descriptions; MDC Registry; User Defined MDC; Monolingual/Multilingual Lexicon; LIRICS; ISO TC37 SC4/WG4; NEDO]

  25. Beyond MILE: towards an open & distributed Lexicon Infrastructure
     … towards the Semantic Web
     [Diagram: Language Knowledge repositories, each identified by a URI:
     • Ontology, URI = http://www.zzz…
     • Semantic Lexicon, URI = http://www.xxx…
     • Syntactic Constructions, URI = http://www.yyy…
     Lexical objects referenced by URI:
     • Lex_object: semFeature, URI = http://www.xxx…#HUMAN
     • Lex_object: syntagmaNT, URI = http://www.zzz…#NP
     Other labels: Corpora/Web; Monolingual/Multilingual Lexicons]
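
The idea of a lexical entry that points to objects held in separate, URI-addressed repositories can be sketched as follows. This is a hedged illustration only: the slide elides the real addresses ("http://www.xxx…"), so the two `.example` base URIs below are hypothetical stand-ins, as are the lemma chosen for the entry and the function name `group_by_repository`. Only `urllib.parse.urldefrag` from the Python standard library is used.

```python
from urllib.parse import urldefrag

# HYPOTHETICAL stand-ins for the repository addresses elided on the slide.
SEMANTIC_LEXICON = "http://semantic-lexicon.example/"
SYNTACTIC_CONSTRUCTIONS = "http://syntactic-constructions.example/"

# An illustrative entry: feature values are references, not local copies.
entry = {
    "lemma": "cuoco",
    "semFeature": SEMANTIC_LEXICON + "#HUMAN",
    "syntagmaNT": SYNTACTIC_CONSTRUCTIONS + "#NP",
}


def group_by_repository(entry):
    """Group an entry's URI-valued attributes by the repository that holds their target."""
    grouped = {}
    for attribute, value in entry.items():
        if not value.startswith("http"):
            continue  # plain literal such as the lemma, not a reference
        repository, fragment = urldefrag(value)  # split "base#FRAGMENT"
        grouped.setdefault(repository, []).append((attribute, fragment))
    return grouped


for repository, references in group_by_repository(entry).items():
    print(repository, references)
```

Because the entry stores only references, each repository can be maintained by a different group, which is the distributed setting the slide points towards.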

  26. Lexical WEB & Standards for Content Interoperability
     … still open
     • as a critical step for semantic mark-up in the SemWeb
     [Diagram: lexical resources (NomLex, ComLex, WordNets, SIMPLE, MILE, FrameNet, Lex_x, Lex_y) connected through Standards for Interoperability, with intelligent agents]

  27. Open distributed architectures for LRs and LT, interoperability, GRID technology, … & standards
     Towards: large online “open source” collaborative projects
     e-Science:
     • GRID technology for large-scale distributed collaborative processing of huge quantities of “facts & their relations” (development of large-scale annotated LRs, linking them across different sources, …) → problem of how to coordinate different information sources
     • new ways of extending large-scale LRs and knowledge bases relying on volunteer labour; wiki-mode?
     Side label: interoperability

  28. Need for tools to make this vision operational & concrete
     E.g. a new prototype built in Pisa (http://xmlgroup.iit.cnr.it:98/MILE/lexflow/demo.xhtml):
     • LeXFlow, a web-based collaborative environment for semi-automatic management of lexical resources
     • It is intended to fulfil the requirements posed by innovative types of LRs by supporting:
       • Dynamic language resources, integrating tools for automatic acquisition of information from corpora and cross-fertilization of lexicons
       • Content interoperability of resources, by supporting ISLE/ISO standards
       • Cooperative & collective creation and management of LRs, by providing a web-based environment for the collaboration and interaction of distributed agents and resources

  29. Why an infrastructure of LRs?
     • Because what is special about language data is also what makes it more difficult than in the hard sciences, i.e. “language” and its “ambiguity”
     Already in the ENABLER mission: putting together technical, organisational, strategic and political issues of LRs
     Availability of LRs is also a “sensitive” issue, touching the sphere of linguistic & cultural identity, but also with economic & political implications

  30. Why an infrastructure of LRs?
     Many dimensions around the notion of language:
     • Political issues, e.g. a commonly agreed list of minimal requirements for “national” LRs: BLARK
     • Cultural issues
       • Language … and cultural identity
       • Language … and the Humanities
     • Economic, social issues
       • Applications
       • Services
     • Technical issues
     • Multilingualism
     • Interdisciplinarity & Multidisciplinarity
     Putting together technical, organisational, strategic and political issues of LRs
     Need for bodies for a broad research agenda & strategic actions for LT & LRs (W/S/MM)

  31. Technologies exist, but the infrastructure that puts them together and sustains them is still missing
     Which communities? Focus on cooperation
     • Core: Language Resources • Language Technology • Standardisation
     • for: Humanities • Social Sciences • Digital Libraries • Cultural Heritage • …
     • on: Grid • Semantic Web • Ontologists • ICT • …
     • for: many application domains (eculture, egovernment, ehealth, …)
     Side labels: Enabling infrastructure; Multilinguality
