Experiences with UIMA from a User’s Perspective Dietmar Rösner, Manuela Kunze, Hany Mahgoub University of Magdeburg C Knowledge Based Systems and Document Processing
Overview • Introduction • GATE • UIMA • Conclusion Rösner, Kunze, Mahgoub: Experiences with UIMA from a User’s Perspective
Introduction "IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies." • November 2005; Version 1.2.3 of UIMA is available
Introduction really?
Introduction • similarity/comparison of GATE and UIMA • frameworks • results are documents + annotations • pipeline processing • steps: • task definition • one corpus
Evaluation Topics/Points • ease of getting acquainted with the system? • quality of documentation: completeness, clarity, up-to-date, …? • tutorials, use cases, …? • processing and linguistic resources? • lexica, gazetteer lists, tools • tools for resource maintenance and extension? • quality: self-explanatory, robust, comfortable • speed of processing? • single docs vs. large corpora? • limitations, suggestions for improvement? • support for im-/export of a variety of document formats?
Task of the Experiment • process a corpus of web sites • to detect and extract information relevant for tourists • opening times of museums, prices of hotels, … • corpus: • 30 tourism web sites of Egypt • additional 20 web sites of Washington, New York, London • output: • Prolog facts for a reasoner • questions: • Which museum is now open? • …
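A question such as "Which museum is now open?" could be answered over the extracted facts roughly as follows. This is a minimal sketch in Python rather than Prolog, assuming each fact carries a museum name plus an ISO-8601 start and end time; the fact list, the date and the function name `open_at` are illustrative assumptions, not the authors' reasoner.

```python
from datetime import datetime

# Hypothetical facts (name, start, end); the date is chosen only for illustration.
FACTS = [
    ("Egyptian Museum", "2005-12-01T09:00:00", "2005-12-01T17:00:00"),
    ("Military Museum", "2005-12-01T08:00:00", "2005-12-01T16:30:00"),
]

def open_at(facts, moment):
    """Return the sorted names of all museums whose interval contains moment."""
    return sorted({
        name
        for name, start, end in facts
        if datetime.fromisoformat(start) <= moment < datetime.fromisoformat(end)
    })

# Query the facts for one point in time.
print(open_at(FACTS, datetime(2005, 12, 1, 16, 45)))
```

The closing time is treated as exclusive here; whether a museum counts as open at the exact closing minute is a design choice the slides do not address.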
Excerpts from the Corpus • The Egyptian Museum is open the hours: 9am-5pm daily • The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm • Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter) • 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri • …
Overview • Introduction • GATE • UIMA • Conclusion
GATE: General Architecture for Text Engineering • a suite of tools for language processing and information extraction • rule-based modular IE system (ANNIE) • language- and domain-independent processing resources • open and extensible architecture • aims to provide uniform access to various linguistic and ontological resources • http://gate.ac.uk/
GATE: General Architecture for Text Engineering • a software infrastructure for NLP researchers; based on three main elements: • an architecture • describing the components composing a language processing system • a framework • could be used as a basis for building such systems • a graphical development environment • a set of tools and • components for language engineers
GATE: General Architecture for Text Engineering • GATE is distributed with an IE system called ANNIE • relies on finite state algorithms and the Java Annotation Pattern Engine (JAPE) language • comprises a set of core Processing Resources (PRs): • Tokeniser • Gazetteers • POS tagger • Sentence Splitter • Semantic Tagger (JAPE transducer) • Orthomatcher (orthographic coreference) • …
GATE: ANNIE [Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]
GATE Application • several Processing Resources: Tokenizer, Hash Gazetteer (with new/extended Gazetteer lists), JAPE Transducer • example input: ... *The Military Museum* Summer: 8am-5:30pm; Winter: 9pm-5pm … • pipeline: ANNIE English Tokenizer, Gazetteer lists, JAPE Transducer • JAPE rules annotate: • intervals of times and restrictions • museums: names of museums, fragments of times and restrictions
Museum information in JAPE
Rule: egyptmuseums
(
  ({SpaceToken})
  ({Token.kind == word})
  ({SpaceToken})
  {Lookup.majorType == org_base} // from gazetteer lists
  ({SpaceToken})?
  (({Token.kind == punctuation})|({Token.kind == word})|({SpaceToken}))*
  ({timeinfo}) // annotation by JAPE transducer
) :museum
-->
:museum.sight = {rule = "egyptmuseums"}
• timeinfo, defined by JAPE rules, detects patterns like: • 9am-5pm, 6pm-9pm • 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm • 5:00PM-7:00PM, 10:00am-5:00pm • …
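The time patterns that timeinfo is said to detect can be approximated with a single regular expression. The following is a minimal Python sketch, not the authors' JAPE rules; the names `TIME`, `INTERVAL` and `find_time_intervals` are assumptions introduced for illustration.

```python
import re

# One clock time such as "9am", "4:30pm" or "5:00PM".
TIME = r"\d{1,2}(?::\d{2})?\s?[ap]m"
# Two clock times joined by a hyphen, e.g. "8:30am-4:30pm".
INTERVAL = re.compile(rf"({TIME})\s?-\s?({TIME})", re.IGNORECASE)

def find_time_intervals(text):
    """Return (start, end) string pairs for every time interval found in text."""
    return [(m.group(1), m.group(2)) for m in INTERVAL.finditer(text)]

# One of the corpus excerpts; weekday ranges such as "Sat-Wed" are not matched.
print(find_time_intervals("10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri"))
```

Unlike the JAPE rule, this sketch works on raw characters rather than on Token and Lookup annotations, so it cannot make use of gazetteer information.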
GATE: Presentation of Results • type and location of every extracted annotation in the document • screenshot labels: Annotations, Museums Information
GATE: Results • information annotated in the documents: • names of museums, hotels • names of tourist places in Egypt • times, time intervals • time restrictions • prices, intervals of prices (hotel prices and museum prices) • names of pharaohs, queens
GATE: Evaluation (documentation) • good • illustrative examples (tutorial), but not enough of them, especially about JAPE rules • can be used without knowledge of Java programming • but experience with Java programming is an advantage for using Java in JAPE rules
GATE: Evaluation (processing and linguistic resources) • many processing resources available (ANNIE): • tokenisers • POS taggers • parsers • gazetteers • sentence splitter • … • additional PRs: • gazetteer collector • PRs for Machine Learning • various exporters • annotation set transfer etc.
GATE: Evaluation (tools for resource maintenance and extension) • editor for gazetteer lists • corpus manager • text editor and debugger for JAPE rules
GATE: Evaluation (speed of processing) • the GATE tool offers no measurement of processing time
GATE: Evaluation (single docs vs. large corpora) • corpus pipeline vs. document pipeline
GATE: Evaluation (limitations, suggestions for improvement) • no limitations: • everything is possible, and it is not necessary to implement everything yourself • for getting started: • processing and linguistic resources are available within the distribution
GATE: Evaluation (im-/export of document formats) • import: • supports a variety of document formats: HTML, RTF, email, SGML and plain text • in all cases the format is analysed and converted into a single unified model of annotation • export: • documents, corpora and annotations in databases of various sorts • required: Java application (CREOLE)
Overview • Introduction • GATE • UIMA • Conclusion
UIMA: Unstructured Information Management Architecture • a software architecture for developing and deploying unstructured information management (UIM) applications • UIM application: a software system that analyses large volumes of unstructured information to • discover, • organize, and • deliver relevant knowledge to the end user • software architecture which specifies • component interfaces, data representations, … • http://www.research.ibm.com/UIMA/
UIMA: Unstructured Information Management Architecture • … may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS. • … takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers. • … interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata. • … consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database. • CAS: Common Analysis Structure • CPM: Collection Processing Manager [Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
UIMA: Unstructured Information Management Architecture • Analysis Engine (AE): • a component that analyzes artifacts (e.g. documents) and infers information about them • consists of two parts: • Java classes (typically packaged as one or more JAR files) and • AE descriptors (one or more XML files), which contain • the configuration settings for the Analysis Engine as well as • a description of the AE’s input and output requirements. [Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
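The flow between the component roles described above (a Collection Reader producing CASes, Analysis Engines enriching them, a CAS Consumer turning them into an application-specific result) can be illustrated with a toy model. This is a sketch in plain Python and deliberately not the UIMA Java API; every class and function name here is a hypothetical stand-in.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str
    begin: int
    end: int

@dataclass
class Cas:
    """Toy stand-in for a CAS: the document text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

def collection_reader(documents):
    """Yield one fresh CAS per document in the collection."""
    for text in documents:
        yield Cas(text)

def museum_annotator(cas):
    """Toy analysis engine: annotate every occurrence of the word 'Museum'."""
    start = cas.text.find("Museum")
    while start != -1:
        cas.annotations.append(Annotation("Museum", start, start + len("Museum")))
        start = cas.text.find("Museum", start + 1)
    return cas

def cas_consumer(cases):
    """Collect the covered text of all annotations into a plain list."""
    return [cas.text[a.begin:a.end] for cas in cases for a in cas.annotations]

def run_pipeline(documents, engines):
    """Read each document, pass its CAS through all engines, then consume."""
    cases = []
    for cas in collection_reader(documents):
        for engine in engines:
            cas = engine(cas)
        cases.append(cas)
    return cas_consumer(cases)

print(run_pipeline(["The Egyptian Museum is open", "no match here"], [museum_annotator]))
```

In the real framework the wiring between components is declared in XML descriptors rather than in code; the sketch only mirrors the order in which data moves through the pipeline.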
UIMA Application • several annotators (like a pipeline) • example input: ... *Fraunces Tavern Museum* 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm; … • annotators (each based on regular expressions): time pattern, restrictions, museum pattern • resulting annotations: interval of times, museum information • windows: one covering two time intervals and a restriction, one covering a museum and opening hours • Prolog facts:
museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
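The step from a detected opening interval to a Prolog fact can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' CAS Consumer: the helper names `to_24h` and `museum_fact` are hypothetical, the day is passed in explicitly, and the weekday restriction (e.g. Tuesday-Friday) is not evaluated. The sketch also strips surrounding whitespace from the museum name, whereas the facts on the slide keep a trailing space.

```python
import re
from datetime import date, datetime

CLOCK = re.compile(r"(\d{1,2})(?::(\d{2}))?\s?([ap])m", re.IGNORECASE)

def to_24h(clock):
    """Convert '12pm' or '5:30pm' to an (hour, minute) pair on a 24-hour clock."""
    m = CLOCK.fullmatch(clock.strip())
    if m is None:
        raise ValueError(f"not a clock time: {clock!r}")
    hour = int(m.group(1)) % 12      # 12am and 12pm both start from 0
    minute = int(m.group(2) or 0)
    if m.group(3).lower() == "p":
        hour += 12                   # afternoon and evening hours
    return hour, minute

def museum_fact(name, day, opens, closes):
    """Format one museumopen/3 fact for the given day (hypothetical helper)."""
    start = datetime(day.year, day.month, day.day, *to_24h(opens))
    end = datetime(day.year, day.month, day.day, *to_24h(closes))
    quoted = name.strip().replace("'", "''")  # double quotes inside a Prolog atom
    return f"museumopen('{quoted}', '{start.isoformat()}', '{end.isoformat()}')."

print(museum_fact("Fraunces Tavern Museum", date(2005, 12, 1), "12pm", "5pm"))
```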
UIMA: Results • information annotated in the documents: • names of museums, hotels • times, time intervals • time restrictions • prices, intervals of prices (hotel prices) • keywords for museum category • names of pharaohs (annotated with a correction of misspellings) • hotel and museum information is exported into Prolog facts and into a short textual summary • templates filled with the detected information • hotels: Price information about Cosmopolitan Hotel : $157 • museums: *** *Fraunces Tavern Museum* *** Open from 12:00:00 to 17:00:00; Restriction: Tuesday-Friday
UIMA: Evaluation (documentation) • good • illustrative examples (tutorial) • completeness: some parts are described only very briefly • prior knowledge of Java and Eclipse is helpful
UIMA: Evaluation (processing and linguistic resources) • annotators only from the tutorial: • sentence annotation • word annotation • date/time annotators • examples for using regular expressions etc. • external resources can be integrated: • lexical resources as external resources (text files) • existing processing resources • implementation of an interface is necessary
UIMA: Evaluation (tools for resource maintenance and extension) • specific Eclipse component editors or • simple text editors
UIMA: Evaluation (speed of processing) • faster than GATE? • the CPE gives detailed information about processing time for each module
UIMA: Evaluation (single docs vs. large corpora) • Collection Reader • document(s) from a directory • adaptations and extensions go into preprocessing (CAS Initializer) • e.g., extraction of text fragments from an HTML document
UIMA: Evaluation (limitations, suggestions for improvement) • no limitations: • everything is possible, but implementation or interfacing is left to the user • wish: • more processing and linguistic resources within the distribution
UIMA: Evaluation (im-/export of document formats) • import: CAS Initializer • export: CAS Consumer • transform annotations into any other format • export of • document + annotations • only annotations • required: Java application
Overview • Introduction • GATE • UIMA • Conclusion
Conclusion • intended use • GATE: academic/scientific application • tools available • comfortable GUI • UIMA: more commercial • plain framework • simplified definition of (complex) results structures • simplified pre- and postprocessing of annotations • in sum: incommensurable
Conclusion • both are extensible • no final judgement about whether to use GATE or UIMA • depends on • your task • task description • expected results • which processing resources are necessary • your preferences for interface • prefer the Eclipse environment (or other Java editors) • prefer a comfortable GUI • or use both
Conclusion • found in the UIMA Forum: "I see UIMA and GATE as complementary rather than competitive, and each can gain from the strengths of the other. GATE was originally developed as a research tool, and has features suited to rapid prototyping of text processing code, like JAPE (a language for defining finite-state transducers over annotations on a document). UIMA is more targeted at robust deployment of applications, with strong typing of feature structures and better support for distributed processing. We're currently working on writing a translation layer to allow UIMA analysis components to be used in GATE and vice-versa. It's not in a releasable state just yet, but we hope to release something in the near future. Keep your eye on http://gate.ac.uk/ for details." Ian Roberts (GATE developer)