NLP Interchange Format José M. García
Outline • What is NIF? • Design requirements • URI schemes • NIF ontologies • Use cases • Relationship with ELRA • Roadmap for NIF 2.0 • Conclusions
What is NIF? • Natural Language Processing Interchange Format • NIF is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. • Building blocks • URI scheme for identifying elements in texts • Ontology for describing common NLP terms • Created and maintained by the AKSW group at the University of Leipzig, during the LOD2 EU project. • Community project: http://persistence.uni-leipzig.org/nlp2rdf/
URI schemes • Text needs to be referenceable by URIs • With URI references, text can be used as a resource in RDF statements • NIF distinguishes: • Documents • Text of the document • Substrings of the text • A URI scheme is an algorithm to create identifiers for texts and substrings • URI elements • Document URI • Separator • Character indices
RFC 5147 • The canonical URI scheme for NIF is based on RFC 5147 • RFC 5147 standardizes fragment identifiers for the text/plain media type • Document: http://www.w3.org/DesignIssues/LinkedData.html • Text of the document: http://www.w3.org/DesignIssues/LinkedData.html#char=0,26610 • Substring of the text: http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218
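The URI elements listed earlier (document URI, separator, character indices) can be combined in a few lines. The following is a minimal sketch, not the reference implementation: the function names `nif_uri` and `substring_uri` are hypothetical, and it assumes that offsets are counted in Unicode code points, which is how Python indexes strings.

```python
def nif_uri(document_uri: str, begin: int, end: int) -> str:
    """Build an RFC 5147-style URI: document URI + '#char=' + 'begin,end'."""
    if begin < 0 or end < begin:
        raise ValueError("need 0 <= begin <= end")
    return f"{document_uri}#char={begin},{end}"


def substring_uri(document_uri: str, text: str, needle: str) -> str:
    """URI for the first occurrence of `needle` in `text` (hypothetical helper)."""
    begin = text.find(needle)
    if begin == -1:
        raise ValueError("substring not found")
    return nif_uri(document_uri, begin, begin + len(needle))


# The document URI and offsets below are the ones shown on the slide.
doc = "http://www.w3.org/DesignIssues/LinkedData.html"
whole_text_uri = nif_uri(doc, 0, 26610)
span_uri = nif_uri(doc, 1206, 1218)
```

Because the identifier is derived only from the document URI and the offsets, two tools that process the same text should arrive at the same URI for the same substring without coordinating with each other.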
NIF Core Ontology • Classes and properties to describe the relations between • Documents • Text • Substrings • Corresponding URI schemes
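To make the relation between text and substrings concrete, the sketch below emits triples as plain Python tuples. It assumes the NIF 2.0 core names `nif:Context`, `nif:String`, `nif:isString`, `nif:referenceContext`, `nif:beginIndex`, `nif:endIndex` and `nif:anchorOf`; none of these are listed on the slide, so check them against the ontology before relying on them. The function name `describe` and the example document URI are hypothetical.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe(document_uri: str, text: str, begin: int, end: int) -> list[tuple[str, str, str]]:
    """Return (subject, predicate, object) triples for a text and one substring.

    Objects of isString, beginIndex, endIndex and anchorOf stand for RDF
    literals; all other objects are URIs. Property names are assumptions.
    """
    context = f"{document_uri}#char=0,{len(text)}"   # the whole text
    span = f"{document_uri}#char={begin},{end}"      # one substring
    return [
        (context, RDF_TYPE, NIF + "Context"),
        (context, NIF + "isString", text),
        (span, RDF_TYPE, NIF + "String"),
        (span, NIF + "referenceContext", context),
        (span, NIF + "beginIndex", str(begin)),
        (span, NIF + "endIndex", str(end)),
        (span, NIF + "anchorOf", text[begin:end]),
    ]


triples = describe("http://example.org/doc", "Hello NIF", 6, 9)
```

Plain tuples keep the sketch free of dependencies; a real pipeline would more likely hand the same data to an RDF library and serialize it as Turtle.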
NIF Core Ontology • Additional classes and properties (unstable/testing) • More URI schemes • Text structure (words, sentences, paragraphs…) • Part of Speech (POS) • Annotations with Stanbol • Confidence
Workflows, Modularity and Extensibility of NIF • Workflows for NLP integration • Normalization • Tokenization • Merge RDF annotations
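The three workflow steps can be sketched as follows. This is an illustration under stated assumptions rather than the NIF tooling itself: normalization is taken to mean Unicode NFC, tokenization is a simple whitespace split, and annotations are modelled as sets of (subject, predicate, object) tuples. All function names are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Normalize to Unicode NFC so that every tool counts the same characters."""
    return unicodedata.normalize("NFC", text)


def tokenize(text: str) -> list[tuple[int, int]]:
    """Return (begin, end) character offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def merge(*annotation_sets: set[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    """Merge triple sets from several tools by set union.

    Triples about the same substring share a subject URI, so annotations
    from different tools line up without any extra alignment step.
    """
    merged: set[tuple[str, str, str]] = set()
    for triples in annotation_sets:
        merged |= triples
    return merged


text = normalize("Cafe\u0301 NIF")   # decomposed accent becomes one character
tokens = tokenize(text)
```

Normalizing before tokenizing matters because character offsets are part of the identifier: if one tool sees a decomposed accent as two characters and another sees one, their URIs for the same word would differ and the merge would not line up.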
Workflows, Modularity and Extensibility of NIF • NIF ontology logical modules • Terminological model • Inference model • Validation model • Vocabulary modules • FISE • ITS • OLiA • NERD • …
Workflows, Modularity and Extensibility of NIF • Granularity profiles
ITS Use Case • The Internationalization Tag Set (ITS) 2.0 is a W3C working draft that is becoming a Recommendation. • ITS standardizes HTML and XML attributes which can be used to annotate nodes with processing information for language service providers (i18n, l10n) • The ITS 2.0 RDF ontology was developed using NIF, including a round-trip conversion algorithm from ITS to NIF. • NIF is expected to receive wide adoption by translation & language service providers • The ITS 2.0 RDF ontology provides properties which can be used to provide best practices for NLP annotations.
OLiA Use Case • The Ontologies of Linguistic Annotation (OLiA) provide stable identifiers for morpho-syntactical annotation tag sets, so that NLP tools can use these identifiers for better interoperability. • OLiA provides Annotation Models and a Reference Model, comprising more than 110 OWL ontologies for over 34 tag sets in 69 languages • Features • Documentation • Flexible Granularity • Language Independence • NIF provides two properties • nif:oliaIndividual (links a nif:String to an OLiA Annotation Model) • nif:oliaCategory (links to the Reference Model)
RDFaCE Use Case • The RDFa Content Editor (RDFaCE) is a rich text editor that supports WYSIWYM authoring, including various views of the semantically enriched textual content. • It combines the results of different NLP APIs for automatic content annotation • Problem: the APIs are heterogeneous in access, URI generation and output data structure • Initial solution: a server-side proxy with hard-coded input and connection settings for each API • NIF simplified the integration by adding an interoperability layer
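The two properties can be illustrated with a small sketch that annotates a substring with a tool-specific tag and its tool-independent categories. The property names are the ones given on the slide. The OLiA namespace URIs and the tag-to-category table are assumptions made for illustration only and are not taken from the OLiA ontologies themselves; `olia_triples` is a hypothetical helper.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
OLIA = "http://purl.org/olia/olia.owl#"   # assumed Reference Model namespace
PENN = "http://purl.org/olia/penn.owl#"   # assumed Penn Annotation Model namespace

# Illustrative mapping from Penn tags to Reference Model categories.
TAG_TO_CATEGORIES = {
    "NNP": ["Noun", "ProperNoun"],
    "VB": ["Verb"],
}


def olia_triples(string_uri: str, penn_tag: str) -> list[tuple[str, str, str]]:
    """Link a substring URI to an Annotation Model tag and Reference Model categories."""
    categories = TAG_TO_CATEGORIES[penn_tag]
    triples = [(string_uri, NIF + "oliaIndividual", PENN + penn_tag)]
    triples += [(string_uri, NIF + "oliaCategory", OLIA + c) for c in categories]
    return triples


result = olia_triples("http://example.org/doc#char=24,39", "NNP")
```

A consumer that only understands the Reference Model categories can ignore the tool-specific tag, which is what lets tools with different tag sets be queried in the same way.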
What is ELRA? • European Language Resources Association • http://www.elra.info • Effort to make Language Resources (LR) available for language engineering and to evaluate language engineering technologies. • LR marketplace • Related organizations • ELDA (ELRA’s operational body) • LREC conferences
Relationship with NIF • Different objectives • Written LRs (esp. corpora) can be annotated with NIF for further interoperability and integration with NLP tools • ADVANTAGE: Large test data collection to evaluate NLP tools • DISADVANTAGE: Cost of LRs (though there are free ones)
Roadmap for NIF 2.0 • Release of NIF 1.0 • DONE (Nov 2011) • Release of NIF 2.0 Draft • CURRENT effort on solving pending issues • Adoption in ITS 2.0 W3C (soon-to-be) Recommendation • NIF-Core ontology is becoming stable • RLOG - an RDF Logging Ontology • NIF Validator software available • Release of NIF 2.0 Core • Release of NIF 2.0 Extensions • ITS ontology, PROV ontology, Lemon Ontology, NERD, UIMA, MARL opinion ontology…
Conclusions • NIF allows NLP tools to be integrated using Linked Data • Ongoing effort • Many adopters and supporters • LOD2 EU project • Several W3C working groups • Named Entity Recognition and Disambiguation (NERD) • Ontologies of Linguistic Annotation (OLiA) • … • 27 different implementations and use cases • Some available at http://persistence.uni-leipzig.org/nlp2rdf/
Thanks for your attention Questions?
References • http://persistence.uni-leipzig.org/nlp2rdf/ • Integrating NLP using Linked Data, by Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer, in 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia