A case study on interoperability for language resources and applications. Marta Villegas, Núria Bel, Santiago Bel, Víctor Rodríguez.
Index
• Use case
• Requirements & problems
• Corpus collection / distributed search
• Corpus integration
• Services interoperability
• Common interfaces
• Shared type system
• Semantic description
• Conclusions
Use case
Historical linguistics: diachronic & comparative study involving different Romance languages
Use case: current scenario
• Catalan reference corpus (fully annotated data since 1833): XML database
• Old Catalan (annotation in progress): MySQL database
• Old Catalan (fully annotated): MySQL database
• Historical digital library: Pandora/PANDAS system
• Technical corpus (fully annotated): Corpus Work Bench (CWB)
Use case: desired scenario (distributed search)
[Diagram: request/response exchange; labels: CLARIN, ISO, metadata registry]
Use case: Requirements
• Data collection (distributed search)
• Data integration
• Interoperability of services
Corpus collection / Distributed search
• Metadata interoperability
  • Language (català, cat, catalan) → ISO
  • Date (s XV, 1400-1499, 1400/) → ISO
  • Genre (!!)
• Common search protocol
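The metadata bullets above can be made concrete with a small normalisation sketch. It is only an illustration: the alias table, the function names, and the choice of `..` for an open-ended interval (an EDTF-style convention) are assumptions, not part of the original slides. Following the slide, "s XV" is treated as equivalent to 1400-1499.

```python
import re

# Hypothetical lookup table: local labels seen in catalogues -> ISO 639-3 code.
LANGUAGE_ALIASES = {"català": "cat", "catalan": "cat", "cat": "cat"}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XV' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if nxt in ROMAN and ROMAN[nxt] > value else value
    return total


def normalize_language(label: str) -> str:
    """Map a free-text language label to an ISO 639-3 code."""
    return LANGUAGE_ALIASES[label.strip().lower()]


def normalize_date(expr: str) -> str:
    """Map a local date expression to a 'start/end' interval string."""
    expr = expr.strip()
    century = re.fullmatch(r"s\.?\s*([IVXLCDM]+)", expr)
    if century:  # 's XV' -> 1400/1499, as equated on the slide
        start = (roman_to_int(century.group(1)) - 1) * 100
        return f"{start:04d}/{start + 99:04d}"
    span = re.fullmatch(r"(\d{4})\s*[-/]\s*(\d{4})?", expr)
    if span:  # '1400-1499' or the open-ended '1400/'
        return f"{span.group(1)}/{span.group(2) or '..'}"
    raise ValueError(f"unrecognised date expression: {expr!r}")
```

With this sketch, the three language labels on the slide collapse to `cat`, and the first two date expressions collapse to `1400/1499`.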
Corpus collection / Distributed search
[Diagram: a client sends a request to several servers and receives a response]
SRU: Web Service-based protocol for querying Internet indexes or databases
• Syntax: CQL
• Semantics: Context Sets & Profiles
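As a rough sketch of what a distributed SRU search looks like from the client side, the snippet below builds the same `searchRetrieve` request for several servers. The endpoint URLs are hypothetical, and the index names in the CQL query assume that the servers support the Dublin Core context set; no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical SRU endpoints, one per corpus server; replace with real base URLs.
ENDPOINTS = [
    "http://corpus-a.example.org/sru",
    "http://corpus-b.example.org/sru",
]


def build_sru_url(base_url: str, cql_query: str, max_records: int = 10) -> str:
    """Return a searchRetrieve URL for one SRU server."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


def fan_out(cql_query: str) -> list[str]:
    """Build the same request for every endpoint (the distributed part)."""
    return [build_sru_url(url, cql_query) for url in ENDPOINTS]


# CQL query: index names (dc.language, dc.date) assume the Dublin Core context set.
urls = fan_out('dc.language = "cat" and dc.date within "1400 1499"')
```

Because every server receives the same CQL query, the client only has to merge the responses; the mapping from the shared indexes to each local database stays on the server side.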
Corpus integration
[Diagram: DATA sources; labels: MAF annotated data, MAF2CQP, CWB]
Corpus integration
[Diagram: DATA sources; labels: wrappers (format, tags), MAF2CQP, CWB]
Corpus integration
[Diagram: decision "Annotated?" (yes/no) over the DATA sources; labels: format wrappers, PoS tags wrappers, PoS tagger (Freeling, Apertium), N-grams, CWB]
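The integration flow can be sketched roughly as follows. This is not the MAF2CQP converter from the slides: it assumes tokens are already available as (word, PoS, lemma) triples and shows only the last step, writing the one-token-per-line, tab-separated "vertical" text that CWB indexes. The function names and the toy tagger are hypothetical.

```python
from typing import Callable, Iterable

Token = tuple[str, str, str]  # (word form, PoS tag, lemma)


def to_vertical(text_id: str, tokens: Iterable[Token]) -> str:
    """Serialise tokens as one tab-separated line each, wrapped in a <text> region."""
    lines = [f'<text id="{text_id}">']
    lines += ["\t".join(token) for token in tokens]
    lines.append("</text>")
    return "\n".join(lines) + "\n"


def integrate(text_id, raw_text, annotations, tagger: Callable[[str], list[Token]]):
    """Use existing annotations if present, otherwise run a PoS tagger first."""
    tokens = annotations if annotations is not None else tagger(raw_text)
    return to_vertical(text_id, tokens)


def toy_tagger(text: str) -> list[Token]:
    """Placeholder standing in for a wrapped tagger; tags everything as 'X'."""
    return [(word, "X", word.lower()) for word in text.split()]
```

The `annotations is not None` test plays the role of the "Annotated?" decision in the diagram; a real setup would also need wrappers that map each source's format and tagset onto the common one.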
Services interoperability
• Deployment of NLP tools as SOAP Web Services:
  • Definition of common interfaces
  • Definition of shared types to model standard request & response messages
• Explore the semantic description of WS, not only for discovering services but also for invoking them
Services interoperability
Command line:
$ TagText -text -numlines -tagonly -prepronly -tagblanks -notagurl -notagemail -notagip -notagdns -encoding -errors
WSDL message (SOAP WS):
name="TagText"
• part name="numlines"
• part name="Tagonly"
• part name="Prepronly"
• part name="Tagblanks"
• part name="notagurl"
• part name="Notagemail"
• part name="Notagip"
• part name="Notagdns"
• part name="Encoding"
• part name="Errors"
Services interoperability
<wsdl:message name="CommandLineRequest">
  <wsdl:part name="numlines" element="numlines"></wsdl:part>
  <wsdl:part name="Tagonly" element="Tagonly"></wsdl:part>
  <wsdl:part name="Prepronly" element="Prepronly"></wsdl:part>
  <wsdl:part name="Tagblanks" element="Tagblanks"></wsdl:part>
  <wsdl:part name="Notagurl" element="Notagurl"></wsdl:part>
  <wsdl:part name="Notagemail" element="Notagemail"></wsdl:part>
  <wsdl:part name="Notagip" element="Notagip"></wsdl:part>
  <wsdl:part name="Notagdns" element="Notagdns"></wsdl:part>
  <wsdl:part name="Encoding" element="Encoding"></wsdl:part>
  <wsdl:part name="Errors" element="Errors"></wsdl:part>
</wsdl:message>
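The one-part-per-option mapping shown above is mechanical, which a short sketch makes visible. It uses the standard-library `xml.etree.ElementTree` and the WSDL 1.1 namespace; the lowercase option names come from the command line on the earlier slide, and the helper name `build_message` is hypothetical.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace
ET.register_namespace("wsdl", WSDL_NS)

# Option names taken from the TagText command line (leading '-' dropped).
OPTIONS = ["numlines", "tagonly", "prepronly", "tagblanks", "notagurl",
           "notagemail", "notagip", "notagdns", "encoding", "errors"]


def build_message(name: str, options: list[str]) -> ET.Element:
    """Create a wsdl:message with one wsdl:part per command-line option."""
    message = ET.Element(f"{{{WSDL_NS}}}message", {"name": name})
    for option in options:
        ET.SubElement(message, f"{{{WSDL_NS}}}part",
                      {"name": option, "element": option})
    return message


message = build_message("CommandLineRequest", OPTIONS)
xml_text = ET.tostring(message, encoding="unicode")
```

A message generated this way mirrors one tool's command line exactly, so two taggers with different flags end up with different messages.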
Services interoperability / common interfaces
• Interoperability is achieved by separating interfaces from implementations
• Common interfaces need:
  • An agreed set of operations
  • Compatibility of elements in I/O messages
  • Compatibility of schema structures in message elements
Services interoperability (wrapped document style)
<wsdl:types>
  (Shared !!) type declaration
</wsdl:types>
<wsdl:message name="CommandLineRequest">
  <wsdl:part name="parameters" element="parameters"></wsdl:part>
</wsdl:message>
• Type sharing
• Type reuse
• Type extension
Services interoperability
<wsdl:message name="POSTaggerRequest">
  <wsdl:part name="POSTaggerParams" element="POSTaggerParams"></wsdl:part>
</wsdl:message>
VALID SOAP MESSAGE
<POSTaggerParams>
  <MainParams>
    <language>spa</language>
    <text>
      <file>http://somewhere/somefile</file>
    </text>
  </MainParams>
  <optParams></optParams>
</POSTaggerParams>
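For illustration, a client could assemble the message body shown above along these lines. The element names are those on the slide; the function name is hypothetical, and the SOAP envelope and any namespaces are left out because the slides do not show them.

```python
import xml.etree.ElementTree as ET


def build_pos_tagger_params(language: str, file_url: str) -> ET.Element:
    """Build a POSTaggerParams element with the structure shown on the slide."""
    root = ET.Element("POSTaggerParams")
    main = ET.SubElement(root, "MainParams")
    ET.SubElement(main, "language").text = language
    text = ET.SubElement(main, "text")
    ET.SubElement(text, "file").text = file_url
    ET.SubElement(root, "optParams")  # tool-specific options would go here
    return root


params = build_pos_tagger_params("spa", "http://somewhere/somefile")
payload = ET.tostring(params, encoding="unicode")
```

Only `MainParams` is common to every tagger in this sketch; `optParams` is the place where a shared type could be extended for a particular tool.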
Services interoperability
[Diagram: XProcess workflow; a language guesser feeds an IF step that routes to one of three POStagger services]
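The workflow in the diagram relies on every tagger exposing the same interface, so that the conditional step only has to choose which one to call. The toy sketch below shows that idea in-process; the stop-word language guesser and the stub taggers are placeholders, not the services used by the authors.

```python
from typing import Callable

Tagged = list[tuple[str, str]]  # (token, tag) pairs

# Toy stop-word lists; a real language guesser would be a separate service.
STOPWORDS = {
    "spa": {"el", "la", "los", "y", "de", "que"},
    "cat": {"el", "la", "els", "i", "de", "que", "amb"},
    "eng": {"the", "and", "of", "that", "with"},
}


def guess_language(text: str) -> str:
    """Pick the language whose stop-word list overlaps most with the text."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))


def make_stub_tagger(lang: str) -> Callable[[str], Tagged]:
    """Return a placeholder tagger; all taggers share the same signature."""
    return lambda text: [(tok, f"{lang}:UNK") for tok in text.split()]


TAGGERS = {lang: make_stub_tagger(lang) for lang in STOPWORDS}


def run_workflow(text: str) -> Tagged:
    lang = guess_language(text)   # language guesser step
    tagger = TAGGERS[lang]        # the IF step: pick the matching tagger
    return tagger(text)
```

Because the taggers are interchangeable behind one signature, adding a language means adding an entry to `TAGGERS`, not changing the workflow.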
Services interoperability
What is shared is the format of the SOAP message (the message moving around between services), NOT the structure of the message content.
VALID SOAP MESSAGE
<POSTaggerParams>
  <MainParams>
    <language>spa</language>
    <text>
      <file>http://somefile</file>
    </text>
  </MainParams>
  <optParams></optParams>
</POSTaggerParams>
Command lines shown alongside the message:
$ TagText -lang -file
$ analyzer -f config/en.cfg
Services interoperability
VALID SOAP MESSAGE
<POSTaggerParams>
  <MainParams>
    <language>spa</language>
    <text>
      <file>http://somefile</file>
    </text>
  </MainParams>
  <optParams></optParams>
</POSTaggerParams>
Value types: ISO-639-3 code (language); URI, MIME types (file)
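A minimal sketch of what checking those value types could look like is given below. It verifies shape only: that the language value looks like a three-letter ISO 639-3 identifier and that the file value is an absolute URI. It does not consult the ISO 639-3 code table or inspect MIME types, and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse


def is_iso639_3_shaped(code: str) -> bool:
    """ISO 639-3 identifiers are three lowercase ASCII letters; shape check only."""
    return re.fullmatch(r"[a-z]{3}", code) is not None


def is_absolute_uri(value: str) -> bool:
    """True if the value has both a scheme and a network location."""
    parsed = urlparse(value)
    return bool(parsed.scheme and parsed.netloc)


def check_main_params(language: str, file_uri: str) -> list[str]:
    """Return a list of human-readable problems; empty if both values look fine."""
    problems = []
    if not is_iso639_3_shaped(language):
        problems.append(f"language {language!r} is not a three-letter code")
    if not is_absolute_uri(file_uri):
        problems.append(f"file {file_uri!r} is not an absolute URI")
    return problems
```

For the values on the slide, `check_main_params("spa", "http://somefile")` reports no problems.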
Services interoperability
VALID SOAP MESSAGE
<POSTaggerResponse>
  <MainParams>
    <POSAnnotatedText>
      <file>http://somefile</file>
    </POSAnnotatedText>
  </MainParams>
</POSTaggerResponse>
Annotated text ??
NOT everything is XML, so NOT everything has an XSD type
Services interoperability
1- Identification of basic operations & I/O
Services interoperability
2- Taxonomy & Domain elements
[Figure: taxonomy and domain elements]
Services interoperability
3- Web services descriptions (MyGRID)
• Service Ontology and Domain Ontology (no invocation details)
• The Service Ontology acts as a service model and the Domain Ontology acts as a controlled vocabulary for the model
• 'External' XML annotation of services compliant with the model
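The split between a service model and a controlled vocabulary can be illustrated with a toy data structure. This is not the MyGRID ontology: the term names and the fields of `ServiceDescription` are invented for the example, and a real annotation would live in an external XML file rather than in code.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary standing in for the Domain Ontology.
DOMAIN_TERMS = {"PlainText", "POSAnnotatedText", "LanguageCode", "POSTagging"}


@dataclass
class ServiceDescription:
    """Fields standing in for the Service Ontology (the service model)."""
    name: str
    task: str
    inputs: list[str]
    outputs: list[str]

    def unknown_terms(self) -> list[str]:
        """Return terms used in the annotation that the vocabulary lacks."""
        used = [self.task, *self.inputs, *self.outputs]
        return [term for term in used if term not in DOMAIN_TERMS]


tagger = ServiceDescription(
    name="POSTagger",
    task="POSTagging",
    inputs=["PlainText", "LanguageCode"],
    outputs=["POSAnnotatedText"],
)
```

An annotation is compliant in this sketch when `unknown_terms()` is empty, which is the sense in which the vocabulary constrains the model.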
Conclusions
Metadata & data interoperability:
• Standards are a must, but sometimes:
  • Not well documented
  • Lack of tools
  • Too weak
• Different approaches/methods (what is a token?)
• Network effect will improve the situation
Conclusions
Services interoperability
• Common interfaces and shared types
• Type sharing, type reuse and type extension make it possible to model messages according to a common schema
• This does not mean that all objects (I/O) moving around need to adhere to a schema:
  • Many I/O objects are not XML objects
Conclusions
• In the best case: I/O types come from a common type system (different type systems can coexist)
• These types may simply identify the existence of a particular data type or may further describe the internal structure of the data type
• In the worst case: types are local and remain 'underspecified' as far as their content is concerned