An Infrastructure of Language Resources & Language Technologies: Why we need it? Priorities & Challenges Nicoletta Calzolari Istituto di Linguistica Computazionale - CNR - Pisa glottolo@ilc.cnr.it Nicoletta Calzolari - Emerging technologies for Digital Libraries, Poland, November 2006
What have we (LT & LR) been assembling for many years?
• Lexicons & their ontologies: written, spoken, ItalWordNets, PAROLE/SIMPLE, …
• Annotated corpora/treebanks
• Basic tools
• Integrated architecture for:
  • annotation at various levels (from morphological to conceptual)
  • acquisition/learning
  • classification
  • ontology creation
  • …
• Standards
• Methodologies
• Know-how & expertise
• Infrastructural bodies (on which to build)
… a very large infrastructure of LRs & LT
History: some international LR initiatives
• EuroWordNet • MATE • NITE • Cluster 488 (Italian) • TAL (Italian) • ISLE • ENABLER • INTERA • … • SENSEVAL • WRITE • Forum TAL (Italian) • … • LIRICS • ISO • ELRA • LREC • LRE Journal • NEDO • … • ACQUILEX [since ’88] • MULTILEX • ET-7 • ET-10 • TEI • NERC • RELATOR • ONOMASTICA • MULTEXT • COLSIT • LSGRAM • DELIS • EAGLES • PAROLE • SIMPLE • SPARKLE • ELSNET
The EU was at the forefront in the areas of LRs and standards in the ’90s.
The EC played an essential role in starting a basic infrastructure and established a model.
Today: a broad “potential” infrastructure
Vitality & success signs for LRs:
• RELATOR • EAGLES/ISLE • ENABLER • ELSNET • TELRI • INTERA • … • LIRICS • ELRA • BLARK • Unified Lexicon (W/S) • LREC • LRE journal • … • ERANET-LangNet • …
[Diagram labels: EU; International; LDC & others; ISO; COCOSDA/WRITE; US Cyberinfrastructure; Japan COE21; NEDO; …; National …; Cooperative initiatives, links to …; CLARIN (ESFRI proposal)]
WordNets: synsets linked by semantic relations
{home, domicile, ..} {house} {Casa, abitazione, dimora}
• TOP concepts: Object, Artifact, Building
• Hyperonym: {edificio, ..}
• Hyponym: {villetta} {catapecchia, bicocca, ..} {cottage} {bungalow}
• Role_location: {stare, abitare, ...}
• Role_target_direction: {rincasare}
• Role_patient: {affitto, locazione}
• Mero_part: {vestibolo} {stanza}
• Holo_part: {casale} {frazione} {caseggiato}
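The synset example above can be pictured as a small data structure. The following is a minimal sketch, not the ItalWordNet format: the `Synset` class, the `link` method and the `related_words` helper are hypothetical names introduced here, and only a few of the relations from the slide are encoded.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous word forms plus typed links to other synsets."""
    words: frozenset
    # relation name -> list of target Synset objects
    relations: dict = field(default_factory=dict)

    def link(self, relation, target):
        """Add a typed semantic relation from this synset to `target`."""
        self.relations.setdefault(relation, []).append(target)


# Synsets taken from the slide (Italian word forms).
casa = Synset(frozenset({"casa", "abitazione", "dimora"}))
edificio = Synset(frozenset({"edificio"}))
villetta = Synset(frozenset({"villetta"}))
stanza = Synset(frozenset({"stanza"}))

casa.link("hyperonym", edificio)
casa.link("hyponym", villetta)
casa.link("mero_part", stanza)


def related_words(synset, relation):
    """Return all word forms reachable from `synset` via one relation type."""
    targets = synset.relations.get(relation, [])
    return sorted(word for target in targets for word in target.words)


print(related_words(casa, "hyperonym"))
```

Keeping relations as named links between whole synsets, rather than between individual words, is what lets one lookup serve every synonym in the set.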
Terminological WordNets: e.g. Jur-WordNet
• Jur-WordNet: extension of ItalWordNet for the juridical domain (with ITTIG-CNR, Istituto di Teoria e Tecniche dell’Informazione Giuridica)
• Knowledge base for multilingual access to sources of legal information
• Source of metadata for semantic markup of legal texts
• To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.
PAROLE-SIMPLE-CLIPS lexicon: … harmonised model for 12 European languages
Semantic relations (100 relations)
[Diagram labels: Top; Telic; Formal; Constitutive; Agentive; Is_a; Is_a_part_of; Property; Created_by; Agentive_cause; Indirect_telic; Purpose; ...; Contains; ...; Instrumental; Is_the_habit_of; Used_for; Used_as; .. Activity ..]
The targets of relations identify:
• prototypical semantic information associated with a SemU
• elements of dictionary definitions of SemUs
• typical corpus collocates of the SemU
For a BioLexicon
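One way to read the diagram above is as typed relations between semantic units (SemUs), each relation belonging to one qualia role. The sketch below is illustrative only: the grouping of relations under roles, the `TRIPLES` entries and the function names are assumptions made for this example, not the content of the PAROLE-SIMPLE-CLIPS lexicon.

```python
# Assumption: this grouping of relation names under the four qualia roles is
# illustrative; the flattened diagram does not state it unambiguously.
QUALIA_RELATIONS = {
    "Formal": {"Is_a"},
    "Constitutive": {"Is_a_part_of", "Contains"},
    "Agentive": {"Created_by", "Agentive_cause"},
    "Telic": {"Used_for", "Used_as", "Indirect_telic", "Is_the_habit_of"},
}

# Hypothetical entries: (source SemU, relation, target SemU).
TRIPLES = [
    ("forchetta", "Used_for", "mangiare"),
    ("pentola", "Used_for", "cucinare"),
    ("arrosto", "Created_by", "arrostire"),
]


def qualia_role(relation):
    """Return the qualia role a relation belongs to, or None if unknown."""
    for role, relations in QUALIA_RELATIONS.items():
        if relation in relations:
            return role
    return None


def targets_by_role(semu):
    """Group the relation targets of one SemU by qualia role."""
    grouped = {}
    for source, relation, target in TRIPLES:
        if source == semu:
            grouped.setdefault(qualia_role(relation), []).append(target)
    return grouped


print(targets_by_role("forchetta"))
```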
Domain - Semantic class: mangiare
[Figure: semantic network for the domain/semantic class of mangiare. Feature: +edible. Qualia roles: TELIC, AGENTIVE. Relations: Used_for, Object_of_the_activity, Is_the_activity_of, Created_by. Semantic classes: NATURAL_SUBSTANCE, FLAVOURING, VEGETAL_ENTITY, FOOD, FRUIT, VEGETABLES, SUBSTANCE_FOOD, ARTIFACT_FOOD, INSTRUMENT, CONTAINER, FURNITURE, BUILDING, PROFESSION. Nouns: zucchero, alloro, tartufo, carne, mela, carota, coniglio, pesce, arrosto, mestolo, pentola, friggitrice, forchetta, posata, bollitore, pesciera, tavola, ristorante, cuoco. Verbs: mangiare, cucinare, cuocere, arrostire, bollire, lessare, stufare, friggere, rosolare, grigliare, …]
In the ’90s there was a global vision of the field & its main components: standards, creation of LRs, automatic acquisition, distribution.
Today the wealth of data & basic technology is such that we should look again at the field as a whole & ask whether these are still “the” important components, or how they have changed/must change.
Which new challenges for a mature infrastructure of LRs & LT?
• Content interoperability
• Collaborative creation & management
• Dynamic LRs
• Sharing
• Distributed architectures
Technology exists; tools are needed.
These dimensions could be at the basis of a new paradigm for LRs & LT & of a new infrastructure.
Challenges & priorities for LRs, with technological and/or organisational/political aspects
• Basic LR coverage for all languages (BLARK/ELARK)
• Specific (new) types of LRs: opinion, sentiment, emotion, subjectivity
• “Example-based” context-sensitive LRs: lexicon & corpus together, dynamically created; new ways to extract value from large linguistic repositories; the Web exploited as a multilingual corpus
• Tools to quickly develop LRs (acquisition, annotation, porting between domains/languages)
• Coordinating the development of LTs & LRs (also across languages)
• Knowledge transfer across languages
• Maintenance of LRs
• Cooperation between the communities of HLT & Semantic Web/ontologists
• “Open source” concept for LRs & LT; open & distributed architectures for LRs and LT; wiki-mode?
• GRID technology
• …
Recurring themes: collaborative infrastructures; interoperability & standards; multilinguality; unifying frameworks
LT & “new” topics
• Subjectivity, opinion, sentiment, emotion: an issue orthogonal to objective content. Detecting and separating subjective from objective content, opinion mining, and extracting positive & negative perceptions have an obvious and large impact on many applications, e.g. business intelligence
• Commonsense understanding, with major implications:
  • allows commonsense reasoning/inference (plausible vs logical) for fail-soft applications
  • can be pursued in a distributed and collaborative fashion by the community as a whole
  • relates to how an agent might put together SW services to accomplish high-level goals for the user
• Temporal structure, for which de facto standards are emerging (TimeML)
• Multimodality: integration of text, speech and gesture
• Strategies for handling miscommunication
• Hybrid approaches, interdisciplinary approaches
• …
In the Semantic Web vision ...
… need to tackle the twofold challenge of:
• content availability
• multilinguality
Natural convergence with HLT:
• multilingual semantic processing
• ontologies
• semantic-syntactic computational lexicons
Issues in the LR & LT research agenda converging with Semantic Web needs
From LT, to create a web of metadata:
• Meaning & content, knowledge
• Semantic markup: concept-based text representation
• Semantic lexicons/terminologies/ontologies
Vice versa, from SW:
• LRs as web services
• Ontologies for LRs & LT
• Collaborative & distributed infrastructure; open access
• Interoperability & standards
… to add meaning to Web data & make it usable for processing, mining, adding spatial & temporal metadata, …
Computational lexicons: challenges from the Semantic Web
The Semantic Web vision: turning the WWW into a machine-understandable knowledge base
[Diagram labels: Ontologies; Computational Lexicons; Knowledge Markup; Linguistic Markup; Documents; Databases; Intelligent Agents; Semantic Web Applications]
Ontologies and computational lexicons
[Diagram labels: Ontology; Concept Space; Semantics (polysemy, context-sensitiveness, etc.); Syntax; Morphology; Multilinguality; Language/s; Computational Lexicon]
Ontology learning
• Term extraction from text: {museo, quadro, pinacoteca, biblioteca, sito_archeologico, museo_archeologico, museo_etrusco, scultura, affresco, …}
• Conceptual clustering of terms (TL+ML):
  • C_MUSEO: {museo, pinacoteca, …}
  • C_MUSEO_ARCHEOLOGICO: {museo_archeologico, museo_etrusco, …}
  • C_OPERA_ARTISTICA: {quadro, scultura, affresco, …}
• Concept structuring (ontology): C_MUSEO_ARCHEOLOGICO is_a C_MUSEO
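The three stages above produce, in turn, a term list, a set of concept clusters and a concept hierarchy. The sketch below only encodes those outputs as plain Python data and shows how they can be queried; it does not perform extraction or clustering. The function name `concepts_for` is hypothetical, and the direction of the is_a link (the archaeological museum concept as a child of the museum concept) is an assumption read off the flattened diagram.

```python
# Output of conceptual clustering, copied from the slide (truncated lists).
CLUSTERS = {
    "C_MUSEO": {"museo", "pinacoteca"},
    "C_MUSEO_ARCHEOLOGICO": {"museo_archeologico", "museo_etrusco"},
    "C_OPERA_ARTISTICA": {"quadro", "scultura", "affresco"},
}

# Output of concept structuring: child concept -> parent concept.
# Assumption: the is_a link points from the more specific concept upward.
IS_A = {"C_MUSEO_ARCHEOLOGICO": "C_MUSEO"}


def concepts_for(term):
    """Return the concept containing `term`, followed by its is_a ancestors."""
    for concept, members in CLUSTERS.items():
        if term in members:
            chain = [concept]
            while chain[-1] in IS_A:
                chain.append(IS_A[chain[-1]])
            return chain
    return []


print(concepts_for("museo_etrusco"))
```

With this layout a text mention of museo_etrusco can be annotated with its own concept and, through the hierarchy, with the more general museum concept.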
Ontology learning in T2K: from thesaurus to conceptual map
Identification of horizontal relations among terms through the events which best characterise them (events: situations involving domain entities)
For applications: semantic/conceptual annotation of texts
[Diagram labels: Tools for terminology extraction; Module of analysis of Italian; Tools for annotation of the logical structure; Structured Knowledge; Reference Lexical Resources; LOGICAL FORM]
Language Tech … & … Knowledge, Content: ready???
How to cooperate??
[Diagram labels: Knowledge Markup; Hum & SS; LT & LRs; Semantic Web; Content; Interoperable LRs & LT]
A new paradigm of R&D in LRs & LT
• Open & distributed linguistic infrastructures for LRs & LT
  • adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & tools
  • ability to build on each other's achievements, with results accessible to various systems, allowing controlled & effective cooperation of many groups on common tasks (see HGP HLP)
• Emerging concept of collective intelligence
• Emphasis on interoperability among LRs, LT & knowledge bases
  • e.g. initiatives aimed at achieving international consensus on annotation guidelines, in order to merge annotation efforts and produce coherent, comprehensive linguistic annotations that can be readily disseminated throughout the community
ISO & LIRICS: meta-model & Data Categories
Builds also on EAGLES/ISLE. The field is mature.
E.g. proposal for an ISO standard for NLP lexica:
• Define a Lexical Markup Framework, a general & abstract meta-model & a set of structural nodes relevant for linguistic description
• Define a flexible environment, enabling specific implementations of user-defined mark-up languages (called LML) on the basis of common DCs
Objectives:
• Design of the abstract lexical meta-model
• Definition of the common set of related Data Categories
MILE lexical model: Data Categories for content interoperability
MILE: Multilingual ISLE Lexical Entry
[Diagram labels: MILE Entry Schema; MILE Lexical Classes; RDF/S Descriptions; MDC Registry; User Defined MDC; Monolingual/Multilingual Lexicon; LIRICS; ISO TC37 SC4/WG4; NEDO]
Beyond MILE: towards an open & distributed lexicon infrastructure
… towards the Semantic Web
• Ontology: URI = http://www.zzz…
• Semantic Lexicon: URI = http://www.xxx…
• Syntactic Constructions: URI = http://www.yyy…
• Lex_object: semFeature, URI = http://www.xxx…#HUMAN
• Lex_object: syntagmaNT, URI = http://www.zzz…#NP
[Other diagram labels: Language Knowledge; Corpora/Web; Monolingual/Multilingual Lexicons]
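The idea on this slide is that lexical objects live in separate repositories and are referenced by URI rather than copied into each lexicon. The following is a minimal sketch of that idea using plain strings. The `example.org` base URIs are hypothetical placeholders (the slide's own URIs are elided), the entry for cuoco is invented for illustration, and which repository hosts which object is an assumption.

```python
# Hypothetical repository base URIs; the slide's real URIs are elided.
SEMANTIC_LEXICON = "http://example.org/semantic-lexicon"
SYNTACTIC_CONSTRUCTIONS = "http://example.org/syntactic-constructions"


def lex_object_uri(repository, name):
    """Build the URI of a shared lexical object inside a repository."""
    return f"{repository}#{name}"


def split_uri(uri):
    """Split a lexical object URI into (repository, object name)."""
    repository, _, name = uri.partition("#")
    return repository, name


# A lexical entry that points to shared objects instead of embedding them.
entry = {
    "lemma": "cuoco",
    "semFeature": lex_object_uri(SEMANTIC_LEXICON, "HUMAN"),
    "syntagmaNT": lex_object_uri(SYNTACTIC_CONSTRUCTIONS, "NP"),
}

for key in ("semFeature", "syntagmaNT"):
    repository, name = split_uri(entry[key])
    print(f"{key}: object {name} hosted at {repository}")
```

Because entries hold references only, two lexicons that point to the same URI are guaranteed to mean the same object, which is the interoperability property the slide is after.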
Lexical Web & standards for content interoperability
… still open, as a critical step for semantic mark-up in the SemWeb
[Diagram labels: NomLex; ComLex; WordNets; SIMPLE; MILE; FrameNet; Lex_x; Lex_y; with intelligent agents; Standards for Interoperability]
Open distributed architectures for LRs and LT, interoperability, GRID technology, … & standards
Towards large online “open source” collaborative projects. e-Science:
• GRID technology for large-scale distributed collaborative processing of huge quantities of “facts & their relations” (development of large-scale annotated LRs, linking them across different sources, …); problem of how to coordinate different information sources; interoperability
• new ways of extending large-scale LRs and knowledge bases relying on volunteer labour; wiki-mode?
Tools are needed to make this vision operational & concrete
E.g. a new prototype built in Pisa (http://xmlgroup.iit.cnr.it:98/MILE/lexflow/demo.xhtml):
• LeXFlow, a web-based collaborative environment for semi-automatic management of lexical resources
• It is intended to fulfil the requirements posed by innovative types of LRs by supporting:
  • dynamic language resources, integrating tools for automatic acquisition of information from corpora and cross-fertilization of lexicons
  • content interoperability of resources, by supporting ISLE/ISO standards
  • cooperative & collective creation and management of LRs, by providing a web-based environment for the collaboration and interaction of distributed agents and resources
Why an infrastructure of LRs?
• Because what is special in language data …
• … is what is more difficult with respect to the hard sciences, i.e. “language” and its “ambiguity”
Already in the ENABLER mission: putting together technical, organisational, strategic, political issues of LRs
The availability of LRs is also a “sensitive” issue, touching the sphere of linguistic & cultural identity, but also with economic & political implications
Why an infrastructure of LRs?
Many dimensions around the notion of language. Putting together technical, organisational, strategic, political issues of LRs:
• Political issues: e.g. a commonly agreed list of minimal requirements for “national” LRs (BLARK)
• Cultural issues
  • Language … and cultural identity
  • Language … and the Humanities
• Economic, social issues
  • Applications
  • Services
• Technical issues
Multilingualism; interdisciplinarity & multidisciplinarity
Need for bodies for a broad research agenda & strategic actions for LT & LRs (W/S/MM)
Technologies exist, but the infrastructure that puts them together and sustains them is still missing
Which communities? Focus on cooperation. Enabling infrastructure, multilinguality:
• for: Humanities • Social Sciences • Digital Libraries • Cultural Heritage • …
• core: Language Resources • Language Technology • Standardisation
• on: Grid • Semantic Web • Ontologists • ICT • …
• for: many application domains (eculture, egovernment, ehealth, …)