Describing and Discovering Language Resources. David Illsley, Ewan Klein, Steve Renals School of Informatics University of Edinburgh. Overview. Goals: availability and interoperability Service oriented architecture and workflow NLP Components Service description and discovery
Describing and Discovering Language Resources David Illsley, Ewan Klein, Steve Renals School of Informatics University of Edinburgh
Overview • Goals: availability and interoperability • Service oriented architecture and workflow • NLP Components • Service description and discovery • NLP and the Grid
What are Language Resources? • Language Resources (LRs) of two kinds: • Static resources: • corpora (text, speech, multimodal) • lexicons, terminologies, ontologies • grammars, declarative rule-sets • Processing resources: • segmenters, tokenizers, zoners, taggers, entity classifiers, chunkers, parsers, …
Goals • Maximize availability of static LRs for automatic processing • Maximize interoperability of processing LRs
LRs on the WWW, 1 • Can use the WWW to locate corpora • Example: OLAC (Open Language Archive Community) • Provides query interface to search for corpora across multiple repositories • Requires standard metadata record for harvesting. • Does not provide access to corpora.
LRs on the WWW, 2 • Can use the WWW to directly search corpora • Many examples • BNC Online Search • words (with regular expressions) • tag strings • Typically search is limited (expressiveness, number of results)
LRs on the WWW, 3 • Can use the WWW to download tools • Some tools offer a demo web interface • No interoperability: • you cannot take the output of one web-interfaced tool and feed it as input to another tool
LRs on the WWW, 4 • Challenges for accessing static LRs for automatic processing: • licensing restrictions • file (or database) structure • data format • data transfer • What about processing LRs? • can download, • but not execute in an interoperable manner
Web Services (WS) • A WS is a self-contained software resource • Can be located and invoked across the web: • identified by a URL • public interfaces and bindings are defined and described using XML • Other applications interact with it in a prescribed manner • XML-based messages conveyed by internet protocols (e.g. HTTP) • Web services can be composed into complex, distributed applications
Service Oriented Architecture (SOA) • [Figure: over the WWW, a Service Provider publishes its service description to Discovery Agencies; a Service Requester (client) locates the description there and then interacts with the service. Source: Berners-Lee]
Web Service: Key Ideas • Interaction with Web Services is both described by and conducted using XML documents exchanged over the internet • SOAP protocol • describes the form of messages and how to process them • a way of representing Remote Procedure Calls over HTTP
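As a rough illustration of "a Remote Procedure Call represented as an XML message", the sketch below builds and reads back a SOAP 1.1 envelope with Python's standard library. The operation name `tokenize`, its `text` argument, and the namespace `http://example.org/nlp` are hypothetical and not taken from the slides; only the SOAP 1.1 envelope namespace is standard. The message is not actually sent over HTTP here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
NLP_NS = "http://example.org/nlp"  # hypothetical namespace for an NLP service


def build_request(operation: str, text: str) -> bytes:
    """Wrap a procedure call (operation name plus one argument) in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{NLP_NS}}}{operation}")
    arg = ET.SubElement(call, f"{{{NLP_NS}}}text")
    arg.text = text
    return ET.tostring(envelope, encoding="utf-8")


def read_request(message: bytes) -> tuple[str, str]:
    """Recover the operation name and its argument, as a receiving service would."""
    envelope = ET.fromstring(message)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    return operation, call[0].text


message = build_request("tokenize", "Time flies.")
print(read_request(message))
```

In a real deployment the bytes returned by `build_request` would form the body of an HTTP POST to the service URL.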
The Appeal of Web Services • A means of building distributed systems • virtualization — not dependent on any one programming language, OS, development environment • based on well-understood underlying protocols • components can be developed independently • decentralized (apart from DNS)
NLP Services • Fairly easy to wrap legacy code as web services • Allows us to deploy tools across the web as part of a larger application • Corpora can also be deployed as services • Helps with availability and interoperability • But still many challenges
Building NLP Applications • Many NLP applications involve relatively few ‘conceptual’ components: • tokenizers, taggers, named entity recognizers, parsers, etc • often different versions of the same components • much repeated (and messy) labour in wiring the components together to interoperate
Issues in Component Approach • Granularity • What is appropriate ‘grain size’ of functionality? • Too fine: heavy overheads in communication, lose ease of use • Too gross: loss of flexibility • Hierarchical decomposition is possible • Compatibility • informational, functional, formal
Linguistic Annotation • Makes information in raw text explicit: • Classification of words and phrases • Detection of structural relationships • Annotation with general and domain-specific semantic labels • Usually proceeds from more concrete to more abstract • Earlier stages of annotation feed into the later stages • Assumed that annotation is represented as XML
Compatible NLP Services: Substitution • [Figure: pipeline tokenize → POS tag → parse, in which one POS tagger is substituted for another]
Compatible NLP Services: Sequencing • [Figure: the components parse, POS tag, and tokenize arranged into the sequence tokenize → POS tag → parse]
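Substitution and sequencing can be sketched in a few lines, assuming every component maps a document to a document of the same general shape. The toy `tagger_a`, `tagger_b`, and `parse` functions and the list-of-dicts document representation are illustrative assumptions, not the components used in the talk.

```python
from typing import Callable

Document = list[dict[str, str]]          # one dict of annotations per token
Component = Callable[[Document], Document]


def tokenize(text: str) -> Document:
    return [{"word": w} for w in text.split()]


def tagger_a(doc: Document) -> Document:
    # Toy tagger: capitalised words are proper nouns, everything else a noun.
    return [{**t, "pos": "NNP" if t["word"][0].isupper() else "NN"} for t in doc]


def tagger_b(doc: Document) -> Document:
    # A different toy tagger with the same input and output shape.
    return [{**t, "pos": "CD" if t["word"].isdigit() else "NN"} for t in doc]


def parse(doc: Document) -> Document:
    # Toy "parser": attaches every token to a flat root node.
    return [{**t, "head": "ROOT"} for t in doc]


def pipeline(*stages: Component) -> Component:
    """Sequencing: feed the output of each stage into the next."""
    def run(doc: Document) -> Document:
        for stage in stages:
            doc = stage(doc)
        return doc
    return run


# Substitution: either tagger can fill the same slot before the parser.
with_a = pipeline(tagger_a, parse)(tokenize("Edinburgh has 3 parsers"))
with_b = pipeline(tagger_b, parse)(tokenize("Edinburgh has 3 parsers"))
```

The two taggers are interchangeable only because they agree on what they consume and produce, which is the compatibility question the following slides take up.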
WSDL File • XML document, usually on same machine as server • Describes everything involved in calling a web service: • The service URL and namespace • The type of web service • List of available functions • Arguments for each function • Data type of each argument • Return value of each function and data type of each return value
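To make the list above concrete, here is a minimal sketch that reads the functions, arguments, and return values out of a small WSDL 1.1 document using only the standard library. The `TaggerService` description, its `tag` operation, and the `example.org` namespace are invented for illustration; only the WSDL and XML Schema namespaces are standard.

```python
import xml.etree.ElementTree as ET

# A hypothetical, heavily abbreviated WSDL 1.1 description (no bindings or service URL).
WSDL = """<definitions name="TaggerService"
    targetNamespace="http://example.org/tagger"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:tns="http://example.org/tagger"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <message name="tagRequest"><part name="text" type="xsd:string"/></message>
  <message name="tagResponse"><part name="tagged" type="xsd:string"/></message>
  <portType name="TaggerPort">
    <operation name="tag">
      <input message="tns:tagRequest"/>
      <output message="tns:tagResponse"/>
    </operation>
  </portType>
</definitions>"""

NS = {"w": "http://schemas.xmlsoap.org/wsdl/"}


def describe(wsdl_text: str) -> dict:
    """Map each operation name to its argument and return-value (name, type) pairs."""
    root = ET.fromstring(wsdl_text)
    messages = {
        m.get("name"): [(p.get("name"), p.get("type")) for p in m.findall("w:part", NS)]
        for m in root.findall("w:message", NS)
    }
    operations = {}
    for op in root.findall("w:portType/w:operation", NS):
        # Message references look like "tns:tagRequest"; keep the local name.
        inp = op.find("w:input", NS).get("message").split(":")[-1]
        out = op.find("w:output", NS).get("message").split(":")[-1]
        operations[op.get("name")] = {"args": messages[inp], "returns": messages[out]}
    return operations


print(describe(WSDL))
```

Note that both the argument and the return value come out as plain `xsd:string`: the description says nothing about whether the string holds raw text or tagged text.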
Processor Input and Output Types • Composition of NL processors constrained by input and output types • Candidates for types? • WSDL provides simple data types: • strings, integers, booleans • not expressive enough • Can we build on notion of metadata for LRs?
IMDI Catalogue Specification • Catalogue.Title: Arabic Treebank • Catalogue.Subject-Language: ara • Catalogue.Content-Type: written • Catalogue.Format.Text: UTF-8 • Catalogue.Smallest Annotation Unit: word • Catalogue.Publisher: LDC • Catalogue.Size: 266 Mb
LR Metadata Standards • Advantages • consistency • software knows what to expect • can be designed according to agreed principles • Challenges • no generally agreed ontology for LRs • hard to get agreement (and who gets to decide?) • categorizations of LRs influenced by favourite linguistic theory • Other people are addressing this issue
What’s missing: tool metadata • What kind of metadata would enable us to ensure tool interoperability? • Neither OLAC nor IMDI provides an answer.
Discovering Resources • Who cares about discovering LRs? • researchers who are searching for LRs that meet specific research criteria • information providers • teachers, journalists, casual browsers • … • Current focus: automatic discovery by software agents
Service Description & Discovery • What LRs can be discovered depends on how the LRs are described. • How LRs are described depends on the requirements for discovery. • Composability: • If an agent (human or software) has already selected component P, what other components Q can provide well-formed input to P ? • Query for all Q such that Q’s output type is compatible with P’s input type
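The composability query can be sketched as a filter over a registry of service descriptions. The `Service` record, the type labels, and the registry contents below are hypothetical, and "compatible" is simplified to exact equality of type labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    input_type: str
    output_type: str


# Hypothetical registry of described components.
REGISTRY = [
    Service("tokenizer", "raw-text", "tokens"),
    Service("tagger-a", "tokens", "pos-tagged"),
    Service("tagger-b", "tokens", "pos-tagged"),
    Service("parser", "pos-tagged", "parsed"),
]


def providers_for(p: Service, registry=REGISTRY) -> list[Service]:
    """All Q whose output type matches P's input type (exact match only)."""
    return [q for q in registry if q.output_type == p.input_type]


parser = REGISTRY[3]
print([q.name for q in providers_for(parser)])  # ['tagger-a', 'tagger-b']
```

Exact matching is the weakest useful notion of compatibility; a type hierarchy would let a more specific output satisfy a more general input requirement.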
Some Versions of BNC • name: British National Corpus, Version 1.0; type: text; size: 2866 MB • name: British National Corpus, Version 1.0, marked up in XML; type: text; size: 815 MB • name: British National Corpus, Version 1.0, parsed with Charniak parser; type: text; size: 419 MB • name: British National Corpus, Version 1.0, parsed with IMS parser; type: text; size: 2088 MB • name: British National Corpus, Version 1.0, parsed with Minipar; type: text; size: 448 MB
Corpus Request Scenario • Agent A requests corpus C with property [key = val]. • If C with [key = val] exists, serve it to A. • Otherwise, • find processor P such that output of P(C) satisfies [key = val] • apply P to C • serve result to A • store result for future requests
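The scenario above amounts to serve-or-generate-then-store. A minimal sketch follows, assuming corpora are plain strings and each processor is indexed by the property its output satisfies; the `CorpusService` class, the `PROCESSORS` table, and the property names are all hypothetical.

```python
from typing import Callable

# Hypothetical processors, indexed by the [key = val] property their output satisfies.
PROCESSORS: dict[tuple[str, str], Callable[[str], str]] = {
    ("annotation", "pos"): lambda corpus: corpus + " [pos-tagged]",
    ("annotation", "parse"): lambda corpus: corpus + " [parsed]",
}


class CorpusService:
    def __init__(self, base_corpora: dict[str, str]):
        self.base = dict(base_corpora)                    # name -> raw corpus
        self.store: dict[tuple[str, str, str], str] = {}  # derived versions
        self.processor_runs = 0

    def request(self, name: str, key: str, val: str) -> str:
        wanted = (name, key, val)
        if wanted in self.store:              # C with [key = val] already exists
            return self.store[wanted]
        processor = PROCESSORS.get((key, val))  # find P whose output satisfies it
        if processor is None:
            raise LookupError(f"no processor yields {key} = {val}")
        result = processor(self.base[name])   # apply P to C
        self.processor_runs += 1
        self.store[wanted] = result           # store result for future requests
        return result


service = CorpusService({"BNC": "British National Corpus text"})
first = service.request("BNC", "annotation", "pos")
second = service.request("BNC", "annotation", "pos")  # served from the store
```

From the requesting agent's side the two calls look identical, even though only the first one triggers processing.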
Service Description • Standard approach • WSDL: describes service inputs/outputs in terms of simple data types • Doesn’t support semantically-based service discovery • Alternatives from Semantic Web • inputs and outputs specified in an ontology language • OWL and RDF both possible
NLP as Document Annotation • NL Processor • takes a partially annotated document as input • yields a more richly annotated document as output
Tagging as document annotation • Part of Speech Tagger • takes in a document with markup of words • yields a document with additional markup of part of speech
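Viewed this way, a tagger is a function from one XML document to a more richly annotated one. The sketch below assumes words are marked up as `<w>` elements and that part of speech is recorded in a `pos` attribute; the element names, the attribute name, and the three-word lexicon are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon standing in for a real tagging model.
TOY_LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ"}


def pos_tag(document_xml: str) -> str:
    """Take a document with word markup; return it with part-of-speech markup added."""
    root = ET.fromstring(document_xml)
    for w in root.iter("w"):
        w.set("pos", TOY_LEXICON.get(w.text.lower(), "UNK"))
    return ET.tostring(root, encoding="unicode")


tokenized = "<doc><s><w>The</w><w>dog</w><w>barks</w></s></doc>"
tagged = pos_tag(tokenized)
print(tagged)
```

The existing markup is left untouched; the tagger only adds information, so a downstream parser can rely on both the word and the part-of-speech annotation being present.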
Document Class • NB: This is just corpus metadata!
Grid & NLP • Parallelism • distribute processes over many machines • use parallel algorithms within process • redundancy and fault tolerance • Distributed data • multiple corpora • distributed annotation of single corpus • Distributed processing pipeline • different components hosted at different sites
Implementation • Based on Globus Toolkit 3.2 middleware • Corpus Services and Transformation Services provide interfaces for corpora and tools • Service Data Elements describe properties of services • properties are aggregated by Index Service, can be queried by clients • Index Service extended by Model Service • provides richer description of services using RDF triples • Backward chaining used to construct pipelines that will produce a requested resource
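The backward-chaining step can be sketched without any Grid middleware. The example below assumes service descriptions are held as subject-predicate-object triples, that each service has exactly one input type, and that the predicates are called `hasInput` and `hasOutput`; these names and the planner itself are illustrative, not the Model Service's actual vocabulary or algorithm.

```python
# Hypothetical service descriptions as (subject, predicate, object) triples.
TRIPLES = {
    ("tokenizer", "hasInput", "RawText"),
    ("tokenizer", "hasOutput", "Tokens"),
    ("tagger", "hasInput", "Tokens"),
    ("tagger", "hasOutput", "PosTagged"),
    ("parser", "hasInput", "PosTagged"),
    ("parser", "hasOutput", "Parsed"),
}


def objects(subject: str, predicate: str) -> set[str]:
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate: str, obj: str) -> list[str]:
    return sorted(s for s, p, o in TRIPLES if p == predicate and o == obj)


def plan(goal: str, available: set[str], seen: frozenset = frozenset()):
    """Work backwards from the requested type to a type that is already available.

    Returns the services to run, in order, or None if no pipeline exists.
    """
    if goal in available:
        return []
    if goal in seen:  # guard against cyclic descriptions
        return None
    for service in subjects("hasOutput", goal):
        (needed,) = objects(service, "hasInput")  # assumes a single input type
        prefix = plan(needed, available, seen | {goal})
        if prefix is not None:
            return prefix + [service]
    return None


print(plan("Parsed", {"RawText"}))  # ['tokenizer', 'tagger', 'parser']
```

If a tokenized version of the corpus is already stored, `plan("Parsed", {"Tokens"})` returns the shorter pipeline `['tagger', 'parser']`.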
Summary • Corpus query • for user, no obvious distinction between raw and processed data • Corpus service • either provide existing resource, or generate it • Need to have metadata for tools which allows automatic composition • Metadata needs to allow subsumption matching • using shared controlled vocabulary
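Subsumption matching over a shared controlled vocabulary can be sketched as a walk up a type hierarchy. The vocabulary below (`PennPosTagged`, `C5PosTagged`, and so on) is a hypothetical example, not an agreed ontology.

```python
# Hypothetical controlled vocabulary: each type maps to its more general parent.
PARENT = {
    "PennPosTagged": "PosTagged",
    "C5PosTagged": "PosTagged",
    "PosTagged": "Tokenized",
    "Tokenized": "Document",
}


def subsumes(general: str, specific: str) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


def compatible(offered_output: str, required_input: str) -> bool:
    """An output can feed an input if the input type subsumes the output type."""
    return subsumes(required_input, offered_output)


print(compatible("PennPosTagged", "PosTagged"))  # True
print(compatible("Tokenized", "PosTagged"))      # False
```

A parser that asks for any `PosTagged` document can therefore accept output from either tagger, while a merely tokenized document is rejected.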