A Unicode-based Environment for the Creation and use of LRs Valentin Tablan, Cristian Ursu, Kalina Bontcheva, Hamish Cunningham , Diana Maynard, Oana Hamza, Tony McEnery 1 , Paul Baker 1 , Mark Leisher 2 Department of Computer Science, University of Sheffield 1 University of Lancaster
A Unicode-based Environment for the Creation and Use of LRs
• Valentin Tablan, Cristian Ursu, Kalina Bontcheva, Hamish Cunningham, Diana Maynard, Oana Hamza, Tony McEnery (1), Paul Baker (1), Mark Leisher (2)
• Department of Computer Science, University of Sheffield
• (1) University of Lancaster
• (2) New Mexico State University
• GATE (a General Architecture for Text Engineering) and multilingual (ML) language resources (LRs):
  • Motivation (history of men's underwear)
  • Short definition of GATE
  • GATE, Unicode and Java
  • EMILLE

Motivation for Software Infrastructure for Language Engineering
• Analogy with the recent history of men's underwear, which is also supportive infrastructure:
  • The bad old days: Y-fronts. Supportive, yes, but tended to be too constrictive.
  • The brave new world: boxer shorts. Still supportive, but less constraining.
• The purpose of our work (the boxer shorts ideal): freedom within a supportive environment.

GATE is:
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Some free components... and wrappers for other people's components.
• Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

Architectural principles
• Non-prescriptive, theory-neutral (both a strength and a weakness)
• Re-use and interoperation, not reimplementation (e.g. v1 used LT-NSL for SGML input; v2 talks to other XML-based systems, APIs and standards)
• (Almost) everything is a component, and component sets are user-extendable
• Component-based development:
  • An OO way of chunking software: Java Beans
  • GATE components: CREOLE (Collection of REusable Objects for Language Engineering) = modified Java Beans
  • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

GATE Language Resources
• GATE LRs are documents, ontologies, corpora, lexicons.
• Documents / corpora:
  • GATE documents are loaded from local files or the web.
  • Diverse document formats: text, HTML, XML, email, RTF, SGML.
• Multilinguality:
  • New internationalised versions of the JVM support >100 different encodings.
  • Other encodings: a system for user entry of mapping tables is under development.
• LR persistence through XML, file datastore or databases (Oracle, PostgreSQL).

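The multilinguality point above rests on the JVM's built-in charset support. The following is a minimal sketch of that mechanism using only the standard `java.nio.charset` API; it does not use GATE's own document-loading classes, and the class name `EncodingSketch` and the sample bytes are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how the JVM turns bytes in a legacy encoding
// into Unicode text. The class and method names are hypothetical.
class EncodingSketch {

    /** Decodes raw bytes stored in the given encoding into a Java (Unicode) string. */
    public static String decode(byte[] raw, Charset encoding) {
        return new String(raw, encoding);
    }

    public static void main(String[] args) {
        // "café" as it would be stored in an ISO-8859-1 file: one byte per character.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        String text = decode(latin1, StandardCharsets.ISO_8859_1);

        // Once decoded, the text can be written back out in any other encoding.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text.length() + " chars, " + utf8.length + " UTF-8 bytes");

        // The number of encodings available depends on the JVM in use.
        System.out.println(Charset.availableCharsets().size() + " encodings on this JVM");
    }
}
```

Once text has been decoded, every later processing step sees ordinary Unicode strings, whatever encoding the source file used.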
Processing Resources
• Algorithmic components known as PRs: beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
• Unicode Tokeniser:
  • splits text into typed tokens based on an FSM
  • the FSM is dynamically constructed from a set of rules based on the character categories defined by the Unicode standard
  • example rule: UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
  • output can be localised by a later module (e.g. “don’t” … “do” “n’t”)
  • current status:
    • 23 rules seem able to handle Indo-European languages without changes.
    • the English tokeniser: Unicode tokeniser + pattern grammar FST.

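The example rule above can be sketched directly in Java, because `Character.getType` reports the same Unicode general categories that the rule names. This is a hand-written matcher for that single rule, not GATE's tokeniser (which compiles its rule set into an FSM); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hand-codes one tokeniser rule,
//   UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)*
// using the Unicode categories exposed by java.lang.Character.
class TokeniserRuleSketch {

    /** Returns the length (in chars) of the match starting at 'start', or 0 if the rule does not apply. */
    public static int matchUpperInitial(String text, int start) {
        if (start >= text.length()) return 0;
        int first = text.codePointAt(start);
        if (Character.getType(first) != Character.UPPERCASE_LETTER) return 0;
        int pos = start + Character.charCount(first);
        while (pos < text.length()) {
            int cp = text.codePointAt(pos);
            int type = Character.getType(cp);
            if (type != Character.LOWERCASE_LETTER && type != Character.DASH_PUNCTUATION) break;
            pos += Character.charCount(cp);
        }
        return pos - start;
    }

    /** Collects every span matched by the rule; all other text is skipped. */
    public static List<String> upperInitialTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = matchUpperInitial(text, pos);
            if (len > 0) {
                tokens.add(text.substring(pos, pos + len));
                pos += len;
            } else {
                pos += Character.charCount(text.codePointAt(pos));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(upperInitialTokens("Sheffield-based tools handle Unicode"));
        // The same rule applies unchanged to a Greek word written with escapes.
        System.out.println(matchUpperInitial("\u0391\u03b8\u03ae\u03bd\u03b1", 0));
    }
}
```

Because the rule is written against character categories rather than specific letters, the same code accepts an upper-initial word in any script that distinguishes case, which is the property the slide attributes to the rule-based tokeniser.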
Displaying Multilingual Data (1)
• GATE uses the standard (and imperfect) Java rendering engine for displaying text.

Displaying Multilingual Data (2)
• All the visualisation and editing tools for ML LRs use the same facilities.

Editing Multilingual Data
• Java provides no special support for text input (this may change).
• GATE Unicode Kit (GUK) plugs this hole:
  • Support for defining additional Input Methods (IMs); currently 30 IMs for 17 languages
  • Pluggable in other applications (e.g. MPI’s EUDICO)
  • Can use a virtual keyboard or standard layouts over QWERTY
  • IMs defined in plain text files
  • GUK comes with a standalone Unicode editor

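To make the idea of an input method defined in a plain text file concrete, here is a minimal sketch of a key-sequence-to-Unicode mapper. The file format (`keys U+XXXX` per line, `#` for comments), the class name `InputMethodSketch` and the three Devanagari mappings are all assumptions made for illustration; they are not GUK's actual format or layouts.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy input method driven by a plain-text definition.
// The "keys U+XXXX" line format is hypothetical, not GUK's real file format.
class InputMethodSketch {
    private final Map<String, String> mappings = new HashMap<>();
    private int longestKey = 0;

    /** Parses the definition text; blank lines and lines starting with '#' are skipped. */
    public InputMethodSketch(String definition) {
        for (String line : definition.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] parts = line.split("\\s+");
            if (parts.length < 2 || !parts[1].startsWith("U+")) continue;
            int codePoint = Integer.parseInt(parts[1].substring(2), 16);
            mappings.put(parts[0], new String(Character.toChars(codePoint)));
            longestKey = Math.max(longestKey, parts[0].length());
        }
    }

    /** Converts typed QWERTY keys to output text, preferring the longest matching key sequence. */
    public String convert(String typed) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < typed.length()) {
            boolean matched = false;
            for (int len = Math.min(longestKey, typed.length() - pos); len > 0; len--) {
                String replacement = mappings.get(typed.substring(pos, pos + len));
                if (replacement != null) {
                    out.append(replacement);
                    pos += len;
                    matched = true;
                    break;
                }
            }
            // Keys with no mapping pass through unchanged.
            if (!matched) out.append(typed.charAt(pos++));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mappings: k -> KA, kh -> KHA, g -> GA (Devanagari letters).
        String definition = "# demo layout\nk U+0915\nkh U+0916\ng U+0917\n";
        InputMethodSketch im = new InputMethodSketch(definition);
        System.out.println(im.convert("khg"));
    }
}
```

Keeping the mappings in a text file rather than in code means a new layout can be added without recompiling, which matches the slide's point about user-definable input methods.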
EMILLE: Enabling Minority LE
• 3-year EPSRC project at Lancaster University and Sheffield University.
• Corpus development:
  • written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
  • spoken corpora of at least 500,000 words per language.
• Unicode developments for GATE:
  • Indic keyboard layouts.
  • encodings for Indic languages.
• Development of basic LE tools:
  • POS tagging.
  • alignment tools for parallel corpora.

Encore
• http://gate.ac.uk/
• Other GATE-related stuff at LREC:
  • Saggion et al.: Extraction Information for MM Indexing [Weds, 19.05]
  • Baker et al.: EMILLE [Thurs, 10.25]
  • Demo and poster [Thurs, 11.00-12.20, session D1]
  • Pastra et al.: Reuse of NE pattern grammars [Thurs, 16.20]
• Fliers