GATE, SWAN and Semantic TV http://gate.ac.uk/ Hamish Cunningham Department of Computer Science, University of Sheffield
Contents 2(52)
• Language Technology and the Knowledge Economy
• Information Extraction and Ontology-Based IE
• GATE: infrastructure and IE in practice
• Three examples: parallel data mining, digital libraries, video indexing
• SWAN: OBIE meets the Web
• Semantic TV
The Knowledge Economy and Human Language 3(52)
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction:
• to deal with the information deluge we need formal knowledge in semantics-based systems
• our communication culture is in informal and ambiguous natural language
The challenge: to reconcile these two phenomena
HLT: Closing the Loop 4(52)
[Diagram labels: Human Language; Controlled Language; Formal Knowledge (ontologies and instance bases); Semantic Web, Semantic Grid, Semantic Web Services; arrows labelled OIE, (A)IE, CLIE and (M)NLG]
KEY
• MNLG: Multilingual Natural Language Generation
• OIE: Ontology-aware Information Extraction
• AIE: Adaptive IE
• CLIE: Controlled Language IE
Information Extraction 6(52)
• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
• Progress driven by quantitative measures
  • MUC: Message Understanding Conferences
  • ACE: Automatic Content Extraction
  • CoNLL: Conference on Natural Language Learning
  • Pascal (2005): ontology-based IE
Conventional IE Example 7(52)
“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”
• NE (named entities): "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
• CO (coreference): "it" = rocket; "Dr. Head" = "Dr. Big Head"
• TE (template elements): the rocket is "shiny red" and Head's "brainchild".
• TR (template relations): Dr. Head works for We Build Rockets Inc.
• ST (scenario template): rocket launch event with various participants
Ontology-based IE 8(52)
Text: “XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …”
[Diagram: Ontology & KB. Classes: Location, Company, City, Country. Instances: XYZ, London, UK, Bulgaria. Edges labelled: type, HQ, partOf, establOn (“03/11/1978”).]
Ontology-Based IE (OBIE) 9(52)
• Conventional IE tags selected segments of text whenever that text represents the name of an entity
• OBIE: view entities as mentions of the underlying instances from the ontology
• Identify which mentions in the text refer to which instances in the ontology
• Add new instances if needed
• Identify instances of attributes and relations
  • take into account what is allowed given the ontology, using domain and range as constraints
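The domain and range constraint mentioned on this slide can be illustrated with a minimal Python sketch. It is not GATE or KIM code. The class hierarchy and instance names come from the XYZ example slide, but the domain and range assigned to each property (for example that HQ links a Company to a City) and every function name are assumptions made for illustration.

```python
# Toy ontology, loosely modelled on the XYZ example slide (assumed structure).
SUBCLASS_OF = {
    "City": "Location",
    "Country": "Location",
    "Location": "Entity",
    "Company": "Entity",
}

# Assumed (domain, range) pair for each property.
PROPERTIES = {
    "HQ": ("Company", "City"),
    "partOf": ("Location", "Location"),
}

# Assumed class of each known instance.
INSTANCE_TYPES = {
    "XYZ": "Company",
    "London": "City",
    "UK": "Country",
    "Bulgaria": "Country",
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a transitive subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def allowed(prop, subject_class, object_class):
    """Accept a relation only if it respects the property's domain and range."""
    if prop not in PROPERTIES:
        return False
    domain, range_ = PROPERTIES[prop]
    return is_a(subject_class, domain) and is_a(object_class, range_)


# Candidate relations proposed by some hypothetical extraction step.
candidates = [
    ("XYZ", "HQ", "London"),
    ("London", "partOf", "UK"),
    ("London", "HQ", "XYZ"),  # violates the assumed domain of HQ
]

accepted = [
    (s, p, o)
    for (s, p, o) in candidates
    if allowed(p, INSTANCE_TYPES[s], INSTANCE_TYPES[o])
]
print(accepted)
```

Filtering after extraction, as here, is only one option; the same check could equally be used to prune candidates while they are being generated.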
Classes, instances & metadata 10(52)
“Gordon Brown met George Bush during his two day visit.”
<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
[Diagram: class hierarchy with Entity, Person and Job-title (president, minister, chancellor, …); panels "Classes+instances before" and "Classes+instances after"; instance labels G.Brown and Bush]
Classes, instances & metadata (2) 11(52)
“Gordon Brown met Tony Blair to discuss the university tuition fees.”
<metadata>
  <DOC-ID>http://… 2.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 30 </e_offset>
    <string>Tony Blair</string>
    <class>…#Person</class>
    <inst>…#Person26389</inst>
  </Annotation>
</metadata>
[Diagram: class hierarchy with Entity, Person and Job-title (president, minister, chancellor, …); panels "Classes+instances before" and "Classes+instances after"; instance labels T. Blair, G. Brown and G. Bush]
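The metadata on these two slides is standoff annotation: each mention is recorded as a character span plus a class and an instance identifier, and the same instance (Gordon Brown) is reused across documents. The following Python sketch shows the idea under stated assumptions. The `Annotation` type, the `annotate` function and the plain string lookup are hypothetical and stand in for a real IE pipeline. The namespace prefix that the slides elide with "…#" is left out rather than guessed. Offsets are computed from the string (0-based, end-exclusive), so they need not equal the illustrative values printed on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # 0-based offset of the first character
    end: int     # offset just past the last character
    string: str  # the annotated text span
    cls: str     # ontology class (namespace prefix omitted)
    inst: str    # instance identifier in the knowledge base


def annotate(text, known_instances):
    """Mark every occurrence of a known name; return annotations by offset."""
    found = []
    for name, (cls, inst) in known_instances.items():
        pos = text.find(name)
        while pos != -1:
            found.append(Annotation(pos, pos + len(name), name, cls, inst))
            pos = text.find(name, pos + 1)
    return sorted(found, key=lambda a: a.start)


# Instance identifiers as shown on the slides.
KB = {
    "Gordon Brown": ("Person", "Person12345"),
    "Tony Blair": ("Person", "Person26389"),
}

text = "Gordon Brown met Tony Blair to discuss the university tuition fees."
anns = annotate(text, KB)
for a in anns:
    # Each span can be checked against the source text.
    print(a.start, a.end, text[a.start:a.end], a.cls, a.inst)
```

Because the document text is never modified, several overlapping annotation layers can refer to the same characters without interfering with one another.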
Challenges for IE for SemWeb 12(52)
• Portability – different and changing ontologies
• Different text types – structured, free, etc.
• Utilise ontology information where available
• Train from small amount of annotated text
• Output results with respect to the given ontology
  • bridge the gap demonstrated in S-CREAM
• Learn/Model at the right level
  • ontologies are hierarchical and data will get sparser the lower we go
Deploying IE 13(52)
Domain specificity vs. task complexity: a necessary trade-off
[Chart: performance level (ticks at 30%, 80%, 90% and 100%, with an "acceptable accuracy" marker) plotted against specificity (general to domain-specific) and complexity (simple to complex), with points labelled bag-of-words, entities, relations and events]
Software lifecycle in collaborative research 15(52)
1. Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
2. Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
3. Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
4. Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
5. Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).
Infrastructure and Science 16(52)
• Physicists have supercolliders; medics have MRI scanners; HLT researchers have.... Perl?
• Other relevant trends:
  • EU funds multi-site collaborative projects
  • Realisation of role of engineering in scalability, reusability, and portability
  • Support for large data, in multiple media, languages, formats, and locations
  • Promotion of quantitative evaluation metrics
• Hence GATE, a General Architecture for Text Engineering (est. 1995)
GATE, a General Architecture for Text Engineering is... 17(52)
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, a graphical development environment.
• GATE comes with...
  • Free components, and wrappers for other people's
  • Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL) at http://gate.ac.uk/download/
• Used by thousands of people at hundreds of sites
A bit of a nuisance (GATE users) 18(52)
GATE team projects.
Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging
• EMILLE: S. Asian languages corpus
• hTechSight: chemical eng. K. portal
Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• SEKT Semantic Knowledge Technology
• PrestoSpace MM Preservation/Access
• KnowledgeWeb Semantic Web
• ETCSL Sumerian Digital Library
• ENIRAF, MMKM networks
Future:
• New eContent project LIRICS
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• DERI, Stanford, Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
GATE – components and services 19(52)
• HLT systems composed of components
• GATE versions:
  • v1: dynamic loading of shared object libraries with Tcl wrappers
  • v2, v3: Java beans with URL loading, XML metadata, produce web services externally
  • v4: core web services (both produce and consume), new LIRICS project out of ISO TC37/SC4 (link up with SWS in SDK?)
GATE – infrastructure for semantic metadata extraction 20(52)
• Combines learning and rule-based methods (new work on mixed-initiative learning)
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
• Supports ontologies as part of IE applications - Ontology-Based IE
• Supports languages from Hindi to Chinese, Italian to German
• Used in OntoText KIM, SDK, Text2Onto, ...
Example 1: Massively Parallel Clustering and Classification 22(52)
• D2K (Data 2 Knowledge): data mining / machine learning with visual programming development tool
• T2K: library of text processing modules built on D2K
• Integrates data mining methods for prediction, discovery, and deviation detection, with information visualization tools
• Offers a visual programming environment.
• Distributed computing / parallel processing facilities.
• From NCSA: http://alg.ncsa.uiuc.edu/do/tools/t2k
Email classification results 24(52)
Example 2: Digital Libraries 29(52)
• Greenstone:
  • Digital Library with automated ingestion, structuring and indexing
  • Full text and fielded search (Dublin Core)
  • GATE-based entity tagging
  • From Maori to Arabic, Russian to Chinese
  • UNESCO’s Information for All Programme
• Perseus:
  • One of the oldest and biggest humanities DLs
  • Provides rich interlinking of related resources
  • Models time and space via materials' dates and locations
  • GATE-based automated hyperlinking etc.
Greenstone 30(52)
Perseus 31(52)
Time-line and geographic visualisation
http://www.perseus.tufts.edu/
Example 3: the MUMIS project 32(52)
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• An important experimental result: multiple sources for same events can improve extraction quality
• PrestoSpace applications in news and sports archiving
Semantic Query 33(52)
• Not “goal Beckham” (includes e.g. missed goals, or “this was not a goal”)
• Instead: “goal events with scorer David Beckham”
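The contrast on this slide between a keyword query and a semantic query can be sketched in Python. Everything in the sketch is invented toy data, not MUMIS output. The transcript lines, the event records, the role names (`scorer`, `assist`) and both search functions are assumptions chosen only to show why querying typed events with roles is more precise than matching words.

```python
# Made-up commentary segments, indexed by position.
transcript = [
    "Beckham shoots but it is not a goal",
    "Goal! Beckham scores from the free kick",
    "Owen scores a goal after a Beckham cross",
]

# Made-up events that an IE system might have extracted from those segments.
events = [
    {"type": "missed_shot", "player": "David Beckham", "segment": 0},
    {"type": "goal", "scorer": "David Beckham", "segment": 1},
    {"type": "goal", "scorer": "Michael Owen",
     "assist": "David Beckham", "segment": 2},
]


def keyword_search(segments, *terms):
    """Return indices of segments containing every term (case-insensitive)."""
    return [
        i for i, seg in enumerate(segments)
        if all(term.lower() in seg.lower() for term in terms)
    ]


def semantic_search(event_list, event_type, **roles):
    """Return segments of events with the given type and role fillers."""
    return [
        e["segment"] for e in event_list
        if e["type"] == event_type
        and all(e.get(role) == value for role, value in roles.items())
    ]


keyword_hits = keyword_search(transcript, "goal", "Beckham")
semantic_hits = semantic_search(events, "goal", scorer="David Beckham")
print(keyword_hits)   # every segment mentions both words
print(semantic_hits)  # only the segment where Beckham is the scorer
```

In this toy data the keyword query returns the missed shot and the assist as well as the goal, while the event query returns only the segment in which Beckham fills the scorer role.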
The results: England win! 34(52)
SWAN: a Semantic Web Annotator 36(52)
• Collaboration between DERI/NUIG, OntoText and USFD, hosted at DERI
• Large heap of IBM hardware in your server room
• Objective: make the cooling fans run flat-out
• Conceptual indexing of news or other web fractions
• Quantitative media reporting
• Annotated web workbench service
• Custom knowledge services
• Demo and poster at ESWS
SWAN Scenarios (1) 37(52)
Financial Analysts
• Indications of how a company is viewed:
  • How many instances predicting strong performance for a particular company are out there?
  • Over the past year how has the profile of predictions for this company changed?
  • How many positive/negative sentiments were expressed for the company?
Marketing Strategists
• Support campaign tuning today based on yesterday's results: In this morning's IT press 7% of articles discussed your company. The average proportion of the article directly relating to your company was 33%. The figures for the other key players in your sector are summarised in the following table....
• Extent of media coverage relative to spend events: Company Y exhibited at Comdex. In the week following the exhibition 20% of the press that covered Comdex mentioned Y.
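The two figures in the Marketing Strategists scenario (the share of articles that discuss a company, and the average proportion of each such article devoted to it) could be computed from annotated articles roughly as in the Python sketch below. The data layout is an assumption: each article is taken to carry a sentence count and, per company, a count of sentences mentioning it. The articles, the numbers and the `coverage` function are invented for illustration and do not describe SWAN's actual metrics.

```python
# Invented annotated articles: total sentences, and sentences per company.
articles = [
    {"id": "a1", "sentences": 10, "mentions": {"Y": 5}},
    {"id": "a2", "sentences": 20, "mentions": {}},
    {"id": "a3", "sentences": 8, "mentions": {"Y": 2, "Z": 4}},
    {"id": "a4", "sentences": 12, "mentions": {"Z": 3}},
]


def coverage(article_list, company):
    """Return (share of articles mentioning company,
    mean fraction of sentences about it within those articles)."""
    hits = [a for a in article_list if a["mentions"].get(company, 0) > 0]
    share = len(hits) / len(article_list) if article_list else 0.0
    if not hits:
        return share, 0.0
    mean_fraction = sum(
        a["mentions"][company] / a["sentences"] for a in hits
    ) / len(hits)
    return share, mean_fraction


share, mean_fraction = coverage(articles, "Y")
print(f"{share:.0%} of articles discussed Y; "
      f"on average {mean_fraction:.1%} of each was about Y")
```

Counting sentences is only one way to approximate "proportion of the article"; token counts or annotation spans would serve equally well.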
SWAN Scenarios (2) 38(52)
PR Workers
• Identify negative reporting events (to issue denials, obfuscations, bribes etc.): The table below summarises 12 negative reporting events concerning your company in the last 24 hours of IT news....
Media Analysts
• A range of media metrics, e.g. the "media distance" between concepts and products/companies: The media distance between your company and the subject of XML is 0.09; for IBM the value is 0.2.
SWAN Scenarios (3) 39(52)
Sales
• Generate "black books" - lists of contacts in the organisations for sales staff.
• Business structures are continually changing and reported in the news.
• Track works-for and joining and leaving reporting events
Public Interest Services
• In order to generate interest and to prototype the system we may wish to provide a free public service, for example about sport, or theatre and cinema alerts.
KIM 40(52)
• Ontology (KIMO) + 200K instances KB (5m stmts)
• Lookup phase marks mentions from the ontology
• Combined with rule-based IE system to recognise new instances of concepts and relations
• High ambiguity of instances with the same label – uses disambiguation step
• Special KB enrichment stage where some of these new instances are added to the KB
• Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
Popov et al. KIM. ISWC’03
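The slide describes disambiguation as a priority ordering of same-label entities based on corpus statistics. The Python sketch below shows one simple reading of that idea, in which the most frequently mentioned candidate is preferred. It is not KIM's actual Entity Ranking algorithm, whose details are not given here. The candidate identifiers, the mention counts and the function names are all made up for illustration.

```python
from collections import Counter

# Hypothetical KB instances that share the label "Paris".
candidates = {
    "Paris": ["Paris_France", "Paris_Texas"],
}

# Invented corpus statistics: how often each instance was mentioned.
corpus_mentions = Counter({"Paris_France": 950, "Paris_Texas": 30})


def rank(label):
    """Order candidates for a label by descending corpus frequency,
    breaking ties alphabetically so the result is deterministic."""
    return sorted(
        candidates.get(label, []),
        key=lambda inst: (-corpus_mentions[inst], inst),
    )


def disambiguate(label):
    """Pick the top-ranked candidate, or None for an unknown label."""
    ranked = rank(label)
    return ranked[0] if ranked else None


print(rank("Paris"))
print(disambiguate("Paris"))
```

A frequency prior of this kind ignores the surrounding text, so a fuller system would presumably combine it with contextual evidence.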
OBIE in KIM 41(52) Popov et al. KIM. ISWC’03
SWAN Logical Architecture 42(52)
[Diagram components: Web; focussed crawling (many parallel instances); IE (32 bit); IE (64 bit); Annotation (Oracle); Knowledge base (Sesame); Web UI, Web services; UI Users; Service Users]
Cluster Controller 43(52)
SWAN: Status, Future 44(52)
Now
• Hardware working, crawling and annotating news sites
• IE tuning and evaluation in progress
Next steps
• Public demonstration service
• More news, sports domain
• More languages (parallel corpus, align, project markup, learn recogniser for new language)
• Negative reporting events
Trend 1: DRM: end of civilisation as we know it 46(52)
• Digital Rights Management (DRM) controls how you consume media you buy
• Has the potential to be linked with censorship and with invasive behaviour logging
• You can't make digital objects behave like physical objects - unless you totally control the hardware and the operating system
• If someone does gain control, then we may end up finding that someone has given the contract for news and culture to Halliburton, for example
Seconds out, round 5: file sharing is about to go social 47(52)
• Round 1: Napster's explosion
• Round 2: Napster's demise
• Round 3: P2P, Kazaa, BitTorrent
• Round 4: RIAA sues the punters
• Round 5: OSN + P2P, trust as referral
Trend 2: the Biggest Innovation in Conversation Since the Table 48(52)
• Social software hits the mainstream:
  • Friendster, LinkedIn, Orkut (On-line Social Networking, OSN)
  • Blogs, Wikis, chat/IM, RSS/Atom
• How to run a better teleconference: add Wiki and IM
Trend 3: Wintel vs. Consumer Electronics in the Home 49(52)
• The TV, cable and satellite, DVD, hi-fi, radio and TiVo of several years' time will probably run from a single PC (which will also do web, email, ...)
• There will be a battle between Wintel, offering high-quality gaming and full-blown Windows, and more conventional consumer electronics approaches based on Linux and cheap hardware
• The latter can probably capture some significant market share, having advantages such as: no viruses; better stability; cheap hardware; multi-user functions; fast boot; quiet running...
What if...? 50(52)
• What if these three trends combine?
• What if we get widespread open platform consumer electronics + OSN + P2P file sharing?
  • Ubiquitous on-line communities centred on shared content, with a working model of trust as referral
• What if semantic technology provides the means of organising and interlinking the cross-over between TV and the web?
  • Killer application for OSN
  • Bandwidth sales for cable companies
  • Antidote to DRM