eContentplus
Fostering Language Resources Network, http://www.flarenet.eu
Standards: strength and limitations … LMF
Nicoletta Calzolari, glottolo@ilc.cnr.it
NEERI Workshop, Helsinki, September 2009
Historical notes
• After the “Grosseto Workshop” (1985): a turning point
• In Europe the so-called X-LEX projects: ACQUILEX, MULTILEX, GENELEX
• and other lexical and text annotation/representation projects: NERC, ET-7, ET-10, DELIS
• that saw the participation of many EU groups, linked by sharing similar approaches and visions
• EAGLES, ISLE
• Start: Zampolli breakfast meeting; EAGLES acronym … by Cencioni
Key issues: Do conditions exist for a standardisation effort?
• Reusability as key concept (true also today)
• To avoid duplication of efforts, costs, etc.
• To allow synergies, integration, exchange of data, ...
• To provide a model for new data creation & acquisition
• Decide on “feasible” areas & state priorities (this is changing over time)
• The feasibility of formulating consensual standards is a strong sign of maturity in the field: we can’t propose standards if there are not enough results on which to base them
• EAGLES was launched in ’93
Some standard-related projects & initiatives
Defining standards/best practice:
• TEI: creating standards for text annotation
• NERC: creating the basis for bottom-up empirical harmonisation, based on extensive best-practice analysis
• EAGLES: introducing a methodological model for standard work
• ISLE: extending in topics & communities
• LIRICS: preparing for international standards
• ISO/TC 37/SC 4/WG 4: going to international standards: LMF … & many others
• NEDO: porting to Asian languages
• MultilingualWeb: new Thematic Network for relation with W3C
Some standard-related projects & initiatives (cont.)
Using standards/best practice:
• MULTEXT & MULTEXT-EAST: applying to lexicons & text annotation, with EAGLES-compliant specs
• PAROLE-SIMPLE lexicons: morphology, syntax & semantics: operational specs & constraints between lexical descriptors (12 languages)
• EuroWordNets: a de-facto best practice
• BOOTStrep: terminologies in the bio-domain: BioLexicon
• KYOTO: in the environment domain
• PANACEA: in a platform for LR acquisition
Some standard-related projects & initiatives (cont.)
Promoting standards/best practice:
• INTERA: for an EU repository of language data
• ENABLER: to link EU & national initiatives
• ELRA: the EU LR association
• LanguageGrid: Japanese infrastructure for LR services
• CLARIN: LR standards for the Humanities & Social Sciences
• FLaReNet: LR standards for Human Language Technologies
• T4ME NoE: for an Open Resource Infrastructure
Main Results in Lexicon & Corpus WGs, First Phase (www.ilc.pi.cnr.it/EAGLES96/home.html)
• Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages
• Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure
• Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT
• Corpus Encoding Standard (CES) from TEI
• Standard for morphosyntactic annotation of corpora, to ensure compatibility/interchangeability of concrete annotation schemata
• Preliminary recommendations for syntactic annotation of corpora
• Dialogue annotation, for integration of written and spoken annotation
Content vs. Format/Representation
Work on lexical description deals with two aspects:
• Linguistic description of lexical items (content)
• Formal representation of lexical descriptions (format)
EAGLES concentrated on linguistic content, without disregarding the formal representation of the proposal
TEI: more on format/representation issues
LMF: on the abstract meta-model
Flexibility in the Recommendations, e.g. Morphosyntax
Level | Information Type | Recommendation
L-0 | Part-of-Speech | Obligatory
L-1 | Morphosyntactic agreement features | Recommended
L-2 | Language-specific (or refined) features | Optional
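The three recommendation levels above can be read as a simple conformance check. The following is a minimal sketch, not part of the EAGLES specifications: the feature names (`pos`, `number`, `gender`, `person`) and the function `check_entry` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Levels from the slide: L-0 obligatory, L-1 recommended, L-2 optional.
# The concrete feature names below are hypothetical, chosen for illustration.
OBLIGATORY = {"pos"}                           # L-0: part of speech
RECOMMENDED = {"number", "gender", "person"}   # L-1: agreement features


@dataclass
class Report:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def compliant(self) -> bool:
        # Only a missing obligatory (L-0) feature breaks compliance.
        return not self.errors


def check_entry(features: dict) -> Report:
    """Check one morphosyntactic description against the three levels."""
    report = Report()
    present = set(features)
    for name in sorted(OBLIGATORY - present):
        report.errors.append(f"missing obligatory feature: {name}")
    for name in sorted(RECOMMENDED - present):
        report.warnings.append(f"missing recommended feature: {name}")
    # Any other feature is treated as L-2 (language-specific) and accepted.
    return report


report = check_entry({"pos": "noun", "number": "singular", "animacy": "animate"})
print(report.compliant, report.warnings)
```

The design point is that only L-0 is enforced, L-1 produces warnings, and L-2 features pass through untouched, which is one way to mirror the flexibility the slide describes.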
MERITS: Strengths (from EAGLES-ISLE)
• Standardisation as a necessary component of any strategic programme to create a coherent market
• Leading industrials & academics participated (> 150 EU groups)
• Bottom-up, community-created standards
• To avoid wasting time reinventing basic/consolidated knowledge
• May be true also for many “humanities” users, not interested in debates on specific lexical approaches
• Work otherwise duplicated among many projects is done just once in a collaborative manner (overall cost-effectiveness)
• Allows the field to be more competitive: concentrate efforts on innovative areas; engage in new/advanced technology
Why Standards for Language Resources? (from EAGLES-ISLE)
To ensure:
• interoperability of systems (& data), through compatible interfaces
• reusability and integrability of components
• training based on consensual technical specifications and models (“gold standards”)
• evaluation & validation based on agreed criteria
• transition from prototypes to HLT products
Side annotations on the slide: important for workflows; essential for a LR Infrastructure; for evaluation campaigns
The applications: requirements for systems & enabling technologies
• Systems: Machine Translation, Information Extraction, Information Retrieval, Summarisation, Natural Language Generation
• Enabling technologies: Word Clustering, Multiword Recognition + Extraction, Word Sense Disambiguation, Proper Noun Recognition, Parsing, Coreference, …
For HLT, knowledge of applications’ requirements is essential
The Multilingual ISLE Lexical Entry (MILE)
General methodological principles (from EAGLES)
Basic requirements for the design of the MILE:
• Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level is standardisation feasible?)
• Granularity
• The leading principle: the edited union of existing lexicons/models (redundancy is not a problem)
• Modular & layered
• Allow for underspecification (& hierarchical structure)
MILE – Modularity: the building-block model
[Figure: Lexical entry 1, Lexical entry 2 and Lexical entry 3 drawing on shared Lexical Objects: Sem feature, syntactic frame, slot, Syn feature, phrase]
Independent, but interlinked, modules make it possible to express different dimensions of lexical entries
MILE Lexical Classes & Lexical Objects vs ISO LMF
• Lexical Classes as the main building blocks of the lexical architecture (done in LMF)
• Building blocks allow two kinds of reusability: intra-lexicon reusability (within the same lexicon) and inter-lexicon reusability (among different lexicons)
• Define an ontology of lexical objects (to be done … in ISOCat?):
• represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
• specify the relevant attributes
• define the relations with other classes
• hierarchically structured
The MILE Data Categories: user-adaptability and extensibility
[Figure: instances (instance_of) of MLC:SemanticFeature. Core: HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP. UserDefined: AGE, MAMMAL. Annotation: OK in ISOCat]
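The split between core and user-defined values can be sketched as a small registry. This is an illustrative sketch only, not the MILE or ISOCat implementation: the class `DataCategoryRegistry` and its methods are hypothetical, and the grouping of values into Core and UserDefined follows one reading of the flattened figure.

```python
class DataCategoryRegistry:
    """Toy registry of values for one lexical class (here: SemanticFeature)."""

    def __init__(self, lexical_class: str, core: set):
        self.lexical_class = lexical_class
        self.core = frozenset(core)      # shared, standardised values
        self.user_defined = set()        # local extensions

    def register(self, name: str) -> None:
        """Add a user-defined value; core values cannot be redefined."""
        if name in self.core:
            raise ValueError(f"{name} is already a core value")
        self.user_defined.add(name)

    def origin(self, name: str) -> str:
        """Return 'Core' or 'UserDefined'; raise KeyError for unknown values."""
        if name in self.core:
            return "Core"
        if name in self.user_defined:
            return "UserDefined"
        raise KeyError(name)


semantic_feature = DataCategoryRegistry(
    "MLC:SemanticFeature", {"HUMAN", "ARTIFACT", "EVENT", "ANIMAL", "GROUP"}
)
semantic_feature.register("AGE")
semantic_feature.register("MAMMAL")
print(semantic_feature.origin("HUMAN"), semantic_feature.origin("MAMMAL"))
```

Keeping the core set immutable while allowing additions is one simple way to get extensibility without letting local lexicons silently redefine shared categories.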
MILE Lexical Data Category Registry: a library of pre-instantiated objects
• Enables modular specification of lexical entities:
• eliminate redundancy
• identify lexical entries or sub-entries with shared properties
• create ready-to-use packages that can be combined in different ways
• Can be used “off the shelf” or as a departure point for the definition of new or modified categories
• ISOCat
• ISO Profiles
ISO - LMF: Lexical Markup Framework
Designed to accommodate many models of lexical representation. Its pros:
• Meta-model: abstract high-level specification (ISO 24613)
• Data Category Registry: low-level specifications (ISO 12620)
• Not a monolithic model, rather a modular framework
• LMF library provides the hierarchy of lexical objects (with structural relations among them)
• Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)
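The division of labour between a structural skeleton and a registry of descriptors can be sketched as follows. This is a deliberately reduced illustration, not the ISO 24613 class model: the two classes, the `DATA_CATEGORIES` mapping, its category names and values, and the function `undeclared` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


# Structural side: a fixed, very small skeleton of lexical objects.
@dataclass
class Sense:
    id: str
    feats: dict = field(default_factory=dict)


@dataclass
class LexicalEntry:
    lemma: str
    feats: dict = field(default_factory=dict)
    senses: list = field(default_factory=list)


# Descriptor side: hypothetical mini-registry, category name -> allowed values.
DATA_CATEGORIES = {
    "partOfSpeech": {"noun", "verb", "adjective"},
    "semanticFeature": {"SF_chemistry", "SF_process"},
}


def undeclared(entry: LexicalEntry) -> list:
    """List (object, category, value) triples not licensed by the registry."""
    problems = []
    objects = [("LexicalEntry", entry.feats)]
    objects += [(sense.id, sense.feats) for sense in entry.senses]
    for label, feats in objects:
        for category, value in feats.items():
            if value not in DATA_CATEGORIES.get(category, set()):
                problems.append((label, category, value))
    return problems


entry = LexicalEntry(
    lemma="activate",
    feats={"partOfSpeech": "verb"},
    senses=[Sense("activate_2", {"semanticFeature": "SF_process"})],
)
print(undeclared(entry))
```

The skeleton never changes when a new descriptor is needed; only the registry grows, which is the modularity the slide attributes to the framework.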
ISO LMF
Builds on EAGLES/ISLE: the field is mature
Structural skeleton, with the basic hierarchy of information in a lexical entry + various extensions
• Modular framework
• LMF specs comply with UML modelling principles
• An XML DTD allows implementation
New initiatives …: LIRICS, ICT KYOTO, NEDO Asian Lang., LexInfo, NICT Language-Grid Service Ontology
Mapping experiment (from Monica Monachini)
Entries from major existing lexicons mapped to LMF
• To prove that the model is able to represent many best practices
• To test the expressive potentialities, the adequacy of the architectural model & linguistic objects
Major best practices:
• OLIF
• PAROLE/SIMPLE
• LC-Star (Speech Lexicon)
• WordNet - EuroWordNet
• FrameNet
• BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
BioLexicon (BL): SIMPLE model & ISO-LMF standard (from Monica Monachini)
• A unique large-scale computational lexicon in the biomedical domain in terms of coverage & typology of information
• Designed to meet bio-Text Mining requirements
• Populated with info from available biomedical resources
• Including both domain-specific & general language words
• Semi-automatically populated from corpora: population toolkit available
• Rich linguistic information ranging over different levels of linguistic description
• Conformant to international lexical representation standards
Sense Representation
Synset: activate
    <Sense rdf:ID="activate_2">
      <belongsToSynset rdf:resource="#activate"/>
      <hasSemanticRelation rdf:resource="#is_a_1"/>
      <hasSemanticRelation rdf:resource="#has_as_part_1"/>
      <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
      <hasSemanticFeature rdf:resource="#SF_chemistry"/>
      <hasSemanticFeature rdf:resource="#SF_process"/>
    </Sense>
[Figure: Sense activate_2 linked to PredicativeRepresentation, SemanticFeature (SF_chemistry, SF_process), Collocation and SemanticRelation (is_a: [SenseID], Typical_of: [SenseID]); further label: S_protein]
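To show how such a sense description could be consumed, here is a minimal sketch that reads the fragment above with Python's standard `xml.etree.ElementTree`. The slide gives no namespace declarations, so the wrapping `rdf:RDF` element and the default namespace `http://example.org/biolexicon#` are assumptions added to make the fragment parseable; `describe_senses` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LEX = "http://example.org/biolexicon#"  # hypothetical namespace, not from the slide

DOC = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns="{LEX}">
  <Sense rdf:ID="activate_2">
    <belongsToSynset rdf:resource="#activate"/>
    <hasSemanticRelation rdf:resource="#is_a_1"/>
    <hasSemanticRelation rdf:resource="#has_as_part_1"/>
    <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
    <hasSemanticFeature rdf:resource="#SF_chemistry"/>
    <hasSemanticFeature rdf:resource="#SF_process"/>
  </Sense>
</rdf:RDF>
"""


def describe_senses(xml_text: str) -> dict:
    """Map each sense ID to its properties and the local IDs they point to."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for sense in root.findall(f"{{{LEX}}}Sense"):
        links = {}
        for child in sense:
            prop = child.tag.split("}", 1)[1]            # drop the namespace
            target = child.get(f"{{{RDF}}}resource").lstrip("#")
            links.setdefault(prop, []).append(target)
        result[sense.get(f"{{{RDF}}}ID")] = links
    return result


senses = describe_senses(DOC)
print(senses["activate_2"]["hasSemanticFeature"])
```

A generic XML parser is enough for this flat fragment; a full RDF toolkit would be the more robust choice for real data.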
KYOTO SYSTEM (from Piek Vossen)
[Figure: architecture diagram. Source Documents feed a pipeline of linear annotation layers (Linear MAF/SYNAF, Linear SEMAF, Linear Generic FACTAF, Generic TMF) through Term extraction (Tybot), Semantic annotation and Fact extraction (Kybot); Domain editing (Wikyoto) serves the Fact User and the Concept User; an LMF API gives access to Wordnet and Domain Wordnet, an OWL API to Ontology and Domain ontology]
A common representation format: WordNet-LMF (from Monica Monachini)
[Figure: UML class diagram of WordNet-LMF. Classes shown: LexicalResource, GlobalInformation, Lexicon, SenseAxes, SenseAxis, LexicalEntry, Synset, Meta, Lemma, Sense, Definition, Statement, SynsetRelations, SynsetRelation, MonolingualExternalRefs, MonolingualExternalRef, InterlingualExternalRefs, InterlingualExternalRef, with Data Categories attached; the cardinalities (0..1, 1..1, 0..*, 1..*) cannot be assigned to individual links from the flattened text]
Centralized WordNet DC Registry (from Monica Monachini)
A list of 85 semantic relations resulting from a mapping of the KYOTO WordNet grid, covering both Intra-WN and Inter-WN relations
WordNet-LMF multilingual level: cross-lingual relations (from Monica Monachini)
    <!ELEMENT SenseAxes (SenseAxis+)>
    <!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
    <!ATTLIST SenseAxis
        id ID #REQUIRED
        relType CDATA #REQUIRED>
    <!ELEMENT Target EMPTY>
    <!ATTLIST Target
        ID CDATA #REQUIRED>
    <!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
    <!ELEMENT InterlingualExternalRef (Meta?)>
    <!ATTLIST InterlingualExternalRef
        externalSystem CDATA #REQUIRED
        externalReference CDATA #REQUIRED
        relType (at|plus|equal) #IMPLIED>
Example synsets:
• IWN <fuoco_1, fiamma_1> 00001251-n
• SWN <fuego_3, llama_1> 09686541-n
• English WN3.0 <fire_1 flame_1 flaming_1> 13480848-n
Annotations:
• SenseAxis groups monolingual synsets corresponding to each other and sharing the same relations to English
• relType specifies the type of correspondence
• InterlingualExternalRefs: link to ontology/(ies)
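A SenseAxis conforming to the DTD fragment above could be assembled as in the following sketch, which uses Python's standard `xml.etree.ElementTree`. The helper `build_sense_axis`, the axis id `sa_1` and the relType value `eq_synonym` are hypothetical; the slide does not say which relType values are used or how synset IDs from different wordnets are distinguished in the `ID` attribute, so the bare IDs from the example are reused as-is.

```python
import xml.etree.ElementTree as ET

ALLOWED_REF_TYPES = ("at", "plus", "equal")  # enumerated in the DTD


def build_sense_axis(axis_id, rel_type, synset_ids, external_refs=()):
    """Build one SenseAxis element: Target+ then optional InterlingualExternalRefs."""
    if not synset_ids:
        raise ValueError("the DTD requires at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for synset_id in synset_ids:
        ET.SubElement(axis, "Target", {"ID": synset_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            if ref_type not in ALLOWED_REF_TYPES:
                raise ValueError(f"relType must be one of {ALLOWED_REF_TYPES}")
            ET.SubElement(
                refs,
                "InterlingualExternalRef",
                {
                    "externalSystem": system,
                    "externalReference": reference,
                    "relType": ref_type,
                },
            )
    return axis


axes = ET.Element("SenseAxes")
axes.append(
    build_sense_axis(
        "sa_1",          # hypothetical axis id
        "eq_synonym",    # hypothetical relType value
        ["00001251-n", "09686541-n", "13480848-n"],  # IWN, SWN, English WN3.0
    )
)
print(ET.tostring(axes, encoding="unicode"))
```

Checking the enumerated relType in code duplicates what a validating parser would do against the DTD; it is included only because `ElementTree` does not validate.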
LexInfo & Previous Models (from Paul Buitelaar)
• LingInfo: modeling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]
• LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
• Lexical Markup Framework (LMF): ISO standardised model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]
• LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
LexInfo: Lexical Entry, Sub-Categorization Frames (from Paul Buitelaar)
Beyond MILE: future work
• MILE Lexical Model oriented towards an Open Distributed Lexical Infrastructure
• Lexical Information Servers for multiple access to lexical information repositories
• Enhance user-adaptivity, resource sharing, cooperative creation of LR & LT
• Develop integration and interchange tools
Some steps for a “new generation” of LRs
• From huge efforts in building static, large-scale, general-purpose LRs
• To dynamic LRs rapidly built on-demand, tailored to specific user needs
• From closed, locally developed and centralized resources
• To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
Interoperability
• From Language Resources
• To Language Services
• BUT: need of tools to make this vision operational & concrete
Lexical WEB & Content Interoperability
As a critical step for semantic mark-up in the SemWeb
[Figure: a web of interlinked resources, with intelligent agents: Global WordNet GRID (WordNets), NomLex, ComLex, SIMPLE-WEB, SIMPLE, LMF, BioLexicon, FrameNet, Lex_x, Lex_y]
Standards for Interoperability: enough??
A new paradigm of R&D in LRs & LT: Distributed Language Services
Open & distributed infrastructures for LRs & LT
• Adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & tools
• Ability to build on previous achievements, with results accessible to various systems, allowing effective cooperation of many groups on common tasks
• Exchange and integrate information across repositories
• Create new resources on the basis of existing ones
• Compose new services on demand
• …
A new scenario implying:
• content interoperability standards
• development of architectures enabling accessibility
• supra-national cooperation
A few issues for discussion: “content”, guidelines, tools, priorities, ...
• For Semantic Web and “content” interoperability: is the field ‘mature’ enough to converge also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?
• For the standards to have impact, ensure their usability & gain industry support, focusing on requirements of industrial applications
• To have Guidelines which are a “usable product” (to assist in creation or adaptation of lexicons, to share resources, …)
• Facilitate acceptance of the standards by providing an open-source reference implementation platform & tools, related web services and test suites
• Relation with the Spoken language community
• Define further steps necessary to converge on common priorities
Limits observed & needs of further work
For usability & operability of LMF:
• Data Categories (DC) & others:
• From Japanese NEDO: DC not defined in LMF & LMF non operational; Asian, African DCs
• Need of DCs organised in profiles (easy to use): IsoCat & Profiles
• Need of an ontology of DCs with structure/dependencies, and constraints
• Otherwise the model remains too abstract, and doesn’t say anything on how to implement concretely the different layers
• Link with Ontologies: relations Lexicons-Ontologies
• Need of easy, user-friendly guidelines
• Need of tools to make it operational, also for creating standard-compliant resources: more important than the model!
• More dissemination, also with industry
• Linguists may be (rightly for certain purposes) not interested
• Younger colleagues not aware of the past work on standards
• Need of operational definitions of interoperability
• Need of stimuli also from the EC to produce standard-compliant resources (unless differently motivated)
Strengths
• Good set of methodological principles: granularity of basic notions, …
• Many languages already compliant with EAGLES morpho-syntax, etc.
• Many projects today using LMF
• Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES compliant)
• Web-services to access LRs based on standards
• Web-based platforms for LR integration
• An open infrastructure of LRT needs standards
• New topics being constantly added: Time, Space, …
Future requirements & planning
To make LMF usable and operational:
• LMF User Guidelines with examples
• Mapping of commonly used lexicons into LMF
• Data categories for LMF lexicons
• Tools related to LMF, with particular reference to the Lexus tool
Need to address another layer:
• The ontological layer in a lexicon
• How lexicons and ontologies are linked and information mapped from each other
An open space in a wiki environment:
• to store guidelines, examples
• to allow broad discussion on these topics
• to ease dissemination of LMF
FLaReNet Mission: structure the area of LR & LT of the future
• Worldwide Forum for LRs & LTs
• Consolidate methods, approaches, common practices, architectures
• Integrate so far partial solutions into broader infrastructures
• A “roadmap”: a plan of coherent actions as input to policy development
• For the EU, national organisations & industry
• As a model for the LRs/LTs of the next years
• Strengthening the language product market, e.g. for new products & innovative services
• Identifying areas where consensus is achieved/emerging vs. areas where more discussion & testing is required
• Indicating priorities
• 221 Individual Subscribers
• 81 Institutional Members from 31 countries
Some results from FLaReNet Vienna Forum: Interoperability Session
• Promote knowledge of standards in the community
• Define specifications for tools supporting standards
• Support workshops/tutorials on how to use standards
• Start focusing on standards for more consensual areas & develop for these a toolkit that can be used off-the-shelf, so that we can move on to tackling the larger problems
• Identify “best practices” in standards wrt usability, usefulness, viability, outreach etc.
• Adopt a model for tool & resource development based on open & collaborative development, where the community as a whole contributes components, modules, etc. to a common framework
Some results from FLaReNet Vienna Forum: International Cooperation
Standards & Interoperability: topics for cooperation
• A metadata catalogue should involve every party
• Common repositories for LRT, universally & easily accessible
• Try to connect ongoing work done by many groups
• A shared repository of data formats and annotations, where to find the most frequently used and preferred schemes: a major help to achieve standardisation
For a new world-wide language infrastructure
• Create the means to plug together different LR & LT, in a web-based resource and technology grid
• Access to LRT is critical: it involves, and has impact on, all the community
• With the possibility to easily create new workflows
• Create conditions to easily share and re-use technologies, to have more open (source) tools available for use also to under-funded groups
FLaReNet & the ORI (Open Resource Infrastructure) … at LREC
Special Highlight: Contribute to building the LREC2010 Map!
• The time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation.
• The Map will be a collective enterprise of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure.
• First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.
• When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense, also technologies, standards, evaluation kits) either used for the work described or a new result of your research.
• The Map will be disclosed at LREC, where some event(s) will be organised around this initiative.
Join FLaReNet!
• We invite all interested players in the field to express their interest in becoming part of the Network
• How to join? To be part of the FLaReNet Network, fill in the form available on the project website (http://www.flarenet.eu)