GATE: an AKT success story
• [GATE: open source language technology component architecture and many tools, with a number of AKT roles]
• http://gate.ac.uk/
• http://nlp.shef.ac.uk/
• Hamish Cunningham
• Kalina Bontcheva
• Yorick Wilks
• Southampton, January 2004
• New GATE-related projects
• Current state of the system
• Future plans
New Projects 2(12)
• SEKT: €9m IP with BT, AIFB, JSI, Empolis, SAI, OntoPrise, ISOCO, UB, Kea-Pro
• PrestoSpace: €9m IP with BBC, RAI, ORF, INA, ...: preservation of audio-visual media
• KnowledgeWeb: NoE successor to OntoWeb
• ETCSL: GATE for humanities scholars
• hTechSight: petrochem tech oversight
• SWAN: large-scale semantic annotation
SEKT: large-scale DM + robust HLT for NGKM 3(12)
KEY
• MNLG: Multilingual Natural Language Generation
• OBIE: Ontology-Based Information Extraction
• (MI)IE: Mixed-Initiative IE
• CLIE: Controlled Language IE
[Diagram labels: Formal Knowledge (ontologies and instance bases); Human Language; Controlled Language; OBIE; (MI)IE; CLIE; (M)NLG; Semantic Web; Semantic Grid; Semantic Web Services]
SEKT: Evaluating Semantic Tagging 4(12)
• Need for new metrics when evaluating hierarchy/ontology-based NE tagging
• Need to take into account distance in the hierarchy
• Tagging a company as a charity is less wrong than tagging it as a person
• Several SEKT-related initiatives (workshop at ECAI; Pascal network)
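The idea of taking hierarchy distance into account can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy class hierarchy, the function names, and the scoring formula 1 / (1 + distance) are not taken from SEKT or GATE. The sketch only shows how a distance-aware score ranks a company tagged as a charity above a company tagged as a person.

```python
# Toy class hierarchy (hypothetical): child -> parent, root has parent None.
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Organization": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Return the chain [cls, parent, ..., root]."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def tree_distance(a, b):
    """Number of edges on the path from a to b via their lowest common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_up_a, cls in enumerate(chain_a):
        if cls in chain_b:
            return steps_up_a + chain_b.index(cls)
    raise ValueError("classes are not in the same hierarchy")


def tag_score(gold, predicted):
    """Illustrative partial-credit score: 1.0 for an exact match, less for distant classes."""
    return 1.0 / (1.0 + tree_distance(gold, predicted))


if __name__ == "__main__":
    for predicted in ("Company", "Charity", "Person"):
        print(predicted, tag_score("Company", predicted))
```

An exact-match metric would score both wrong tags as zero; a distance-aware score keeps them apart, which is the distinction the slide calls for.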
PrestoSpace 5(12)
• Cultural Heritage / Digital Libraries IP
• BBC, RAI, ORF, INA, B&G, USFD, and 23 others (!)
• 20th Century Rot: rapid disappearance of audio-visual media
• Preservation and digitisation are high-cost
• Therefore we need rich metadata and semantic access
• Little training data, open domain: FSTs for users
• Follows MUMIS and other projects
• Evaluation: TRECVID, OBIE
GATE Status (version 2½) 6(12)
• Stable core since end 2002
• Increasing numbers of users (next slide)
• Increasing numbers of languages (most recently: Chinese, Arabic, Russian, German system from DotKom)
• Increasing numbers of 3rd party components (e.g. Medline and UMLS work, OBIE/KIM, QA, summarisation, ...)
• Embedded in KM applications
A bit of a nuisance (GATE users) 7(12)
GATE team projects
Past:
• MUMIS: semantic index of sports video
• MUSE: cross-genre entity finder
• HSL: health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• EMILLE: S. Asian languages corpus
• ACE/TIDES: Arabic, Chinese NE
Present:
• Advanced Knowledge Technologies
• SEKT: next-generation KM
• PrestoSpace: audiovisual preservation
• KnowledgeWeb: semantic web network
• h-TechSight: technology oversight
• ETCSL: Sumerian language corpus
• SWAN: Semantic Web Annotator
• MiAKT: medical informatics KM
Thousands of users at hundreds of sites (based on survey of 4,700 downloaders). A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Greenstone digital library, NZ
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, Vassar College, the University of Southern California and a large number of other UK, US and EU universities
• UK and EU projects inc. MyGrid, CLEF, DotKom, AMITIES, Cub Reporter, EMILLE, Poesia...
Some new stuff 8(12)
• Johns Hopkins workshop on Semantic Annotation: BNC-based corpus, ME expts
• WEKA 2 release (JSI library integration soon)
• papers: RANLP, ISWC, Journal of Digital Libraries, Journal of Data and Knowledge Eng.
• JWS editorial board; co-editor JNLE special
• RANLP IE tutorial, tutorial on HLT/SW at ESWS
• HLT/SW evaluation workshop at ECAI
• OBIE in Multiflora, hTechsight
• SW NLG in MiAKT (below)
MIAKT: NLG for SW 9(12)
• RDF input from image annotation GUI... ...generated text
• MIAKT has important productivity and accuracy implications
hTechSight tech oversight 10(12)
• Ontology-Based IE (OBIE) for semantic tagging of job adverts, news and reports in the chemical engineering domain
• Aim is to track technological change over time
• Centred around a domain-specific ontology
• Terminological gazetteer lists are linked to classes in the ontology
• Rules classify the mentions in the text with respect to the domain ontology
• Annotations output to DB or RDF
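The pipeline described on this slide (gazetteer terms linked to ontology classes, mentions classified against the ontology, annotations written out as RDF) can be sketched roughly as follows. This is a plain-Python illustration, not GATE's API and not the hTechSight implementation: the gazetteer entries, the class names, the `http://example.org/` namespace and the function names are all hypothetical, and the "rule" is simply a longest-match gazetteer lookup.

```python
import re

# Hypothetical gazetteer: each term is linked to a class in a (made-up) domain ontology.
GAZETTEER = {
    "catalytic cracking": "Process",
    "polymer": "Material",
    "process engineer": "JobRole",
}
ONTOLOGY_NS = "http://example.org/ontology#"  # assumed namespace, illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def annotate(text):
    """Return (start, end, surface, ontology_class) for each gazetteer mention.

    Longer terms are matched first and overlapping matches are skipped.
    """
    found = []
    for term in sorted(GAZETTEER, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if all(m.end() <= s or m.start() >= e for s, e, _, _ in found):
                found.append((m.start(), m.end(), m.group(0), GAZETTEER[term]))
    return sorted(found)


def to_ntriples(doc_uri, annotations):
    """Yield N-Triples lines typing each mention with its ontology class."""
    for i, (_, _, surface, cls) in enumerate(annotations):
        mention = f"<{doc_uri}#mention{i}>"
        yield f"{mention} <{RDF_TYPE}> <{ONTOLOGY_NS}{cls}> ."
        yield f'{mention} <{RDFS_LABEL}> "{surface}" .'


if __name__ == "__main__":
    advert = "Process engineer wanted: experience with catalytic cracking of polymer feedstocks."
    for line in to_ntriples("http://example.org/doc/1", annotate(advert)):
        print(line)
```

Running the same lookup over adverts from different dates and counting mentions per ontology class would give a crude view of change over time, which is the stated aim of the slide.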
OBIE in MultiFlora
• Combining Information Extraction and Knowledge Representation for Biodiversity Informatics
• [Figure labels: varying plant taxa; merged RDF]
• BBSRC project led by Mary McGee Wood, U. Manchester
GATE 4: the Final Conflict 12(12)
(GATE 3 release happening soonish)
• Continuity guaranteed for AKT phase 2 (€2 million GATE-related work 2004-2007)
• Some future elements:
• more and better OBIE, inc. cross-doc co-reference
• pluggable OWL repository support (now only Sesame; soon 3Store, KAON)
• large- and huge-scale processing
• standardisation of the component integration model (ECLIPSE)
• service-based integration (“SDK” SW API)
• This talk: http://gate.ac.uk/sale/talks/akt-jan04.ppt
• What else? You tell us...