
GATE, a General Architecture for Text Engineering gate.ac.uk/

GATE, a General Architecture for Text Engineering http://gate.ac.uk/ Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard, Yorick Wilks. Department of Computer Science, University of Sheffield. UMIST, Friday November 29th 2002.



  1. GATE, a General Architecture for Text Engineering http://gate.ac.uk/ Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard, Yorick Wilks Department of Computer Science, University of Sheffield UMIST Friday November 29th 2002

  2. Motivation for Software Infrastructure for Language Engineering • Need for scalable, reusable, and portable HLT solutions • Support for large data, in multiple media, languages, formats, and locations • Lowering the cost of creation of new language processing components • Promoting quantitative evaluation metrics via tools and a level playing field

  3. Motivation (II): software lifecycle in collaborative research • Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to. • Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg. • Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator. • Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype..."). • Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).

  4. GATE, a General Architecture for Text Engineering • An architecture: a macro-level organisational picture for LE software systems. • A framework: for programmers, GATE is an object-oriented class library that implements the architecture. • A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction. • Some free components, and wrappers for other people's components • Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc. • Free software (LGPL). Download at http://gate.ac.uk/download/

  5. Architectural principles • Non-prescriptive, theory neutral (strength and weakness) • Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka) • (Almost) everything is a component, and component sets are user-extendable • Component-based development • An OO way of chunking software: Java Beans • GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering) • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

  6. GATE Language Resources • GATE LRs are documents, ontologies, corpora, lexicons, … • Documents / corpora: • GATE documents loaded from local files or the web... • Diverse document formats: text, HTML, XML, email, RTF, SGML. • Processing Resources • Algorithmic components known as PRs – beans with execute methods. • All PRs can handle Unicode data by default. • Clear distinction between code and data (simple repurposing). • 20-30 freebies with GATE • e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
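The "beans with execute methods" idea on this slide can be illustrated with a minimal sketch. Everything named below (`SketchDocument`, `SketchProcessingResource`, `WhitespaceTokeniser`, `SketchPipeline`) is a hypothetical stand-in invented for illustration and is not GATE's actual class library. The sketch only shows the shape: data lives in a language resource, an algorithm lives in a bean that receives the data through a setter and acts on it in `execute()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a language resource: the data being processed. */
final class SketchDocument {
    final String content;
    final List<String> tokens = new ArrayList<>();

    SketchDocument(String content) {
        this.content = content;
    }
}

/** Hypothetical stand-in for a processing resource: a bean with an execute method. */
interface SketchProcessingResource {
    void setDocument(SketchDocument document);

    void execute();
}

/** A whitespace tokeniser written as a bean: default constructor, setter, execute. */
final class WhitespaceTokeniser implements SketchProcessingResource {
    private SketchDocument document;

    @Override
    public void setDocument(SketchDocument document) {
        this.document = document;
    }

    @Override
    public void execute() {
        // Split on runs of whitespace and store the pieces on the document.
        for (String piece : document.content.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                document.tokens.add(piece);
            }
        }
    }
}

/** Runs the same components over any text, keeping code and data separate. */
final class SketchPipeline {
    static SketchDocument run(String text, List<SketchProcessingResource> resources) {
        SketchDocument doc = new SketchDocument(text);
        for (SketchProcessingResource pr : resources) {
            pr.setDocument(doc);
            pr.execute();
        }
        return doc;
    }
}
```

Because the tokeniser holds no data of its own beyond the document it is handed, the same instance can be pointed at a different document without modification, which is one reading of the "simple repurposing" bullet.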

  7. Visual Resources

  8. Displaying Coreference Information

  9. Displaying Syntactic Information

  10. Lexicon Support – WordNet example

  11. A Language Analysis Example [Diagram: HTML, RTF and XML documents pass through GATE format handlers, which separate document content, document metadata, document format data and linguistic data. ANNIE components (POS tagger, named entity, coreference, event extraction, …) and custom applications process the documents. Storage is in files or a relational database (Oracle/PostgreSQL).]

  12. Building IE Components in GATE (1) The ANNIE system – a reusable and easily extendable set of components

  13. Building IE Components in GATE (2) • JAPE: a Java Annotation Patterns Engine • Light, robust regular-expression-based processing • Cascaded finite state transduction • Low-overhead development of new components • Example rule:
      Rule: Company1
      Priority: 25
      (
        ( {Token.orthography == upperInitial} )+
        {Lookup.kind == companyDesignator}
      ):companyMatch
      -->
      :companyMatch.NamedEntity = { kind = company, rule = "Company1" }
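To make the Company1 pattern concrete, here is a hand-written Java sketch of the matching it describes: one or more upper-initial tokens followed by a company designator. This is not how JAPE is implemented (JAPE compiles rules into finite state transducers over annotations). The class name, the designator list, and the definition of "upper initial" (first letter uppercase, remaining letters lowercase) are assumptions made for illustration. The sketch also returns the matched text rather than creating a `NamedEntity` annotation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hand-written equivalent of the Company1 pattern; an illustration, not JAPE itself. */
final class CompanyRuleSketch {
    // Hypothetical gazetteer entries standing in for Lookup.kind == companyDesignator.
    private static final Set<String> DESIGNATORS =
            new HashSet<>(Arrays.asList("Ltd", "Ltd.", "Inc", "Inc.", "plc", "GmbH"));

    /** Assumed meaning of upperInitial: first char uppercase, the rest lowercase. */
    static boolean isUpperInitial(String token) {
        if (token.isEmpty() || !Character.isUpperCase(token.charAt(0))) {
            return false;
        }
        for (int k = 1; k < token.length(); k++) {
            if (!Character.isLowerCase(token.charAt(k))) {
                return false;
            }
        }
        return true;
    }

    /** Returns the text of each span that the rule would bind to companyMatch. */
    static List<String> findCompanies(List<String> tokens) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int j = i;
            // ( {Token.orthography == upperInitial} )+
            // Simplification: designator tokens are never consumed as name tokens.
            while (j < tokens.size()
                    && isUpperInitial(tokens.get(j))
                    && !DESIGNATORS.contains(tokens.get(j))) {
                j++;
            }
            // {Lookup.kind == companyDesignator}
            if (j > i && j < tokens.size() && DESIGNATORS.contains(tokens.get(j))) {
                matches.add(String.join(" ", tokens.subList(i, j + 1)));
                i = j + 1;
            } else {
                // No match starting here; resume after the failed run.
                i = Math.max(j, i + 1);
            }
        }
        return matches;
    }
}
```

For the token sequence "Shares in Acme Widgets Ltd fell", `findCompanies` returns the single span "Acme Widgets Ltd". The lone capitalised "Shares" is rejected because no designator follows it.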

  14. Performance Evaluation • At document level – annotation diff • At corpus level – corpus benchmark tool – tracking system’s performance over time
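The slide does not say which figures the annotation diff reports. A common way to score a system's annotations against a hand-annotated key is precision, recall and F1 under strict matching, and the sketch below assumes exactly that. The class name and the "start-end:type" string encoding of annotations are hypothetical conveniences, and real tools may also give credit for partially overlapping annotations, which this sketch does not.

```java
import java.util.HashSet;
import java.util.Set;

/** Strict-match scoring of system annotations (response) against a gold standard (key). */
final class AnnotationDiffSketch {
    /**
     * Annotations are encoded as "start-end:type" strings purely for brevity.
     * Returns {precision, recall, f1}.
     */
    static double[] score(Set<String> key, Set<String> response) {
        // An annotation is correct only if span and type both match exactly.
        Set<String> correct = new HashSet<>(response);
        correct.retainAll(key);

        double precision = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }
}
```

Running the same scoring over every document in a corpus and comparing the totals between software versions is, in spirit, what a corpus-level benchmark for tracking performance over time would do.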

  15. Regression Testing – Corpus Benchmark Tool

  16. The Semantic Web and GATE • GATE is being used for development of (semi-)automatic methods for: • linking web pages to Ontologies using Information Extraction; • learning and evolving Ontologies via IE and lexical semantic network traversal.

  17. Populating Ontologies with IE

  18. Protégé and Ontology Management

  19. Information Retrieval Support Based on the Lucene IR engine

  20. Editing Multilingual Data • GATE Unicode Kit (GUK) • Java provides no special support for text input (this may change) • Support for defining additional Input Methods (IMs) • currently 30 IMs for 17 languages • Pluggable in other applications

  21. Processing Multilingual Data All the visualisation and editing tools for ML LRs use enhanced Java facilities:

  22. Dialogue Systems • GATE is being used in the Amities project for automating call centres • Creation of dialogue processing server components to run in the Galaxy Communicator architecture • Easy adaptation of the portable IE components to work on noisy ASR output • Robustness and speed of GATE components vital for real-time dialogue systems

  23. Applications • GATE has been used for a variety of applications, including: • MUMIS: automatic creation of semantic indexes for multimedia programme material • MUSE: a multi-genre IE system • EMILLE: a 70 million word corpus of Indic languages • Metadata for Medline (at Merck) • ACE: participation in the Automatic Content Extraction programme • HSE: summarisation of health and safety information from company reports • OldBaileyIE: NE recognition on 17th century Old Bailey Court reports. • AKT: language technology in knowledge management • AMITIES: call centre automation • Various Medical Informatics and database technology projects • IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)

  24. Some users… At the time of writing, a representative fraction of GATE users includes: • Longman Pearson publishing, UK; • Merck KGaA, Germany; • Canon Europe, UK; • Knight Ridder (the second biggest US news publisher); • BBN; • Sirma AI Ltd., Bulgaria; • the American National Corpus project, US; • Imperial College, London, the University of Manchester, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities; • the Perseus Digital Library project, Tufts University, US.

  25. The MUMIS project • Multimedia Indexing and Searching Environment • Composite index of a multimedia programme from multiple sources in different languages • ASR, video processing, information extraction (Dutch, English, German), merging, user interface • University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA • Yorick Wilks, Hamish Cunningham, Horacio Saggion, Kalina Bontcheva, Diana Maynard, Oana Hamza, Cristian Ursu

  26. The Whole Picture [Diagram: text sources in EN, NL and DE each pass through IE to produce formal texts and annotations. Video and audio/speech signals pass through ASR to produce transcriptions and ASR results. Merging, supported by the ontology and lexicon, yields the final annotations, which are stored in the multimedia database and queried through the user interface.]

  27. User Interface

  28. Play

  29. Conclusion • GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components • Further information: http://gate.ac.uk/ • Online demos, tutorials and documentation • Software downloads • Talks and papers
