National Centre for Text Mining: Activities and Plans

UK National Centre for Text Mining: Activities and Plans
Dr. Robert Sanderson
Dept. of Computer Science, University of Liverpool
azaroth@liv.ac.uk
http://www.nactem.ac.uk
CNI, 3rd April 2006 Slide 1

Overview
• Text Mining?
• NaCTeM Consortium
• Components
• Service Infrastructure
• Future Work

Centre for ... National Centre for ... what was that? TEXT Ticks Mining!

... Text Mining?
• Text Mining: No canonical definition
• Commonly used definition based on Data Mining:
  • Data Mining: “The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.”
  • Text Mining: “The non-trivial extraction of previously unknown, interesting facts from an invariably large collection of texts.”

... Text Mining?
• Typical Data Mining Functions:
  • Classification
  • Association Rule Mining
  • Clustering
• These are useful when applied to texts, but they don't fulfill the definition, as they don't discover “facts”.
• Information Retrieval also doesn't discover facts.

... Text Mining?
Need to understand the meaning of the text:
• Part of Speech tagging
• Clauses
• Named Entity Recognition
• Find correlations of entities
• Infer information from logical chains
Result: New Knowledge
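The "find correlations of entities" step can be illustrated with a minimal sketch that counts how often two entities appear in the same sentence. This is an assumption-laden toy, not the NaCTeM pipeline: the gazetteer, the sentences, and the function names are all hypothetical, and a simple word lookup stands in for real part of speech tagging and named entity recognition.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy gazetteer standing in for a real named entity recognizer.
KNOWN_ENTITIES = {"aspirin", "headache", "ibuprofen"}

def entities_in(sentence):
    """Return the known entities mentioned in one sentence (toy lookup, not real NER)."""
    words = {w.strip(".,;:!?").lower() for w in sentence.split()}
    return sorted(words & KNOWN_ENTITIES)

def cooccurrences(sentences):
    """Count how often each pair of entities appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        for pair in combinations(entities_in(sentence), 2):
            counts[pair] += 1
    return counts

# Made-up example sentences, for illustration only.
sentences = [
    "Aspirin relieves headache.",
    "Ibuprofen also relieves headache.",
    "Aspirin and ibuprofen are both used for headache.",
]
pair_counts = cooccurrences(sentences)
```

Pairs with high counts are candidates for a correlation worth inspecting. Inferring new knowledge from chains of such links would need considerably more than raw counts.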

Other Benefits
Plus a lot more:
• Improved document classification
• Automatic semantic annotation of documents
• Improved access -- search by semantics and concepts
• Improved clustering of documents by concept
• Summarization
• Visualization techniques

Event Extraction
• Extract events from the text along with information about the participants
• Events can be modeled as relationships between named entities
• Extracting events allows discovery of hidden temporal correlations, e.g.: Google refuses to announce plans. Google's stock falls.
• Improves understanding of the semantics, which in turn improves the functions based on those semantics
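One way to picture "events as relationships between named entities" is a small record type plus a function that pairs consecutive events sharing a participant. This is only a sketch under stated assumptions: the `Event` class, its fields, and `followed_by` are hypothetical names, the events are entered by hand rather than extracted, and position in the text is used as a crude stand-in for time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """One event: a predicate linking named entities, with its position in the text."""
    predicate: str
    participants: tuple
    order: int  # position of the event in the text, a stand-in for time

def followed_by(events, entity):
    """Pair each event involving `entity` with the next event that also involves it."""
    involved = sorted(
        (e for e in events if entity in e.participants),
        key=lambda e: e.order,
    )
    return list(zip(involved, involved[1:]))

# Hand-built events mirroring the slide's Google example.
events = [
    Event("refuses to announce", ("Google", "plans"), order=0),
    Event("falls", ("Google", "stock"), order=1),
]
pairs = followed_by(events, "Google")
```

Each pair is a candidate temporal correlation. Whether the first event actually explains the second is a separate question that this sketch does not attempt to answer.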

NaCTeM
• Hosted at University of Manchester
• Participants: Universities of Manchester, Liverpool, Salford
• Plus: San Diego Supercomputer Centre, University of Tokyo, University of Geneva, University of California Berkeley
• Six full-time posts for 3 years (2005-2007)
• Plus active board of directors and experts
• Current Director: Professor Jun'ichi Tsujii from U.Tokyo
• Funding: JISC, BBSRC, EPSRC

NaCTeM Aims
• Provide text mining oriented services
• Facilitate access to text mining resources
• User support, advice, training and consultancy
• Participate in international research
• Formulate best practice guidelines
• Increase awareness of text mining in all domains
• Develop links with industrial partners involved in text mining

Components
• Liverpool: Cheshire3 (Information framework)
• Manchester: CAFETIERE (Entity recognition, event extraction)
• Salford: TerMine (Automatic term recognition)
• SDSC: Storage Resource Broker (Data grid)
• UC Berkeley: Cheshire, TM/IR expertise
• U.Tokyo: GENIA, ENJU (Text analysis tools)
• U.Geneva: User studies and evaluation

Cheshire3
Information Processing Framework
Liverpool and UC Berkeley
• Standards based: XML, SRU, Unicode, etc.
• Scalable: Single machine to Grid (PVM, MPI, SRB)
• Extensible: Python + C, Object Oriented with stable API
• Work ongoing to integrate Data Mining tools and other information processing applications
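Because SRU is a URL-based search protocol, a standards-based server can be queried with nothing more than an HTTP GET. The sketch below builds an SRU 1.1 searchRetrieve URL using the standard request parameters. The endpoint, the `dc.title` index, and the function name are assumptions for illustration, and nothing here is specific to Cheshire3's own API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the base URL of a real SRU server.
BASE_URL = "http://sru.example.org/services/medline"

def sru_search_url(cql_query, maximum_records=10, start_record=1):
    """Build an SRU 1.1 searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return BASE_URL + "?" + urlencode(params)

# The index name dc.title is an assumption; a server advertises the
# indexes it supports in its explain response.
url = sru_search_url('dc.title = "text mining"')
```

Fetching the resulting URL from a real server returns an XML response containing the matching records.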

Cheshire3 Examples
• Integrated tools from other participants in preparation for NaCTeM service infrastructure.
• Medline: 4350 records/second using 60 concurrent processes on SDSC's Teragrid cluster
• 440 seconds to index 1 field from 16 million MARC records
• Distributed network of Archival Descriptions in the UK
• NARA ERA prototype system with SDSC
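The two throughput figures above can be put on a common footing with a little arithmetic. The inputs are taken from the slide and the variable names are illustrative. Note that the slide gives no process count for the MARC run, and indexing a single field is a lighter task than whatever was done per Medline record, so the two rates are not directly comparable.

```python
# Medline: aggregate rate and the average share per process.
medline_rate = 4350            # records/second, from the slide
medline_processes = 60         # concurrent processes, from the slide
per_process_rate = medline_rate / medline_processes

# MARC: total records and elapsed time give an aggregate rate.
marc_records = 16_000_000      # from the slide
marc_seconds = 440             # from the slide
marc_rate = marc_records / marc_seconds

print(f"Medline: {per_process_rate:.1f} records/second per process on average")
print(f"MARC: about {marc_rate:,.0f} records/second for one indexed field")
```

That works out to 72.5 Medline records per second per process on average, and roughly 36,000 MARC records per second overall.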

CAFETIERE
Entity Recognition and Annotation
University of Manchester
• Discovers named entities in part of speech tagged text
• Discovers temporal events referring to those entities
• Integration of ontologies and term processing
• Rule-based

CAFETIERE Example

TerMine
Automatic Term Recognition
University of Salford/Manchester
• Discovers important terms
• Assigns 'C-value' score to rank terms
• Interaction with terminology databases for term management
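The C-value score can be sketched from its published formula: for a candidate term a made of |a| words with frequency f(a), the score is log2|a| · f(a) when a is not nested in any longer candidate, and log2|a| · (f(a) − mean frequency of the longer candidates containing a) otherwise. The code below is a simplified illustration of that formula only, not TerMine's implementation. It omits the linguistic filtering that selects candidates, and the frequencies and function names are made up.

```python
from math import log2

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_values(freq):
    """Map each candidate term (a tuple of words) to its C-value.

    `freq` maps candidate terms to their corpus frequency.
    """
    scores = {}
    for term, f in freq.items():
        longer = [other for other in freq if contains(other, term)]
        if longer:
            # Nested term: discount by the mean frequency of the longer
            # candidates that contain it.
            nested_mean = sum(freq[o] for o in longer) / len(longer)
            scores[term] = log2(len(term)) * (f - nested_mean)
        else:
            scores[term] = log2(len(term)) * f
    return scores

# Made-up frequencies, for illustration only.
freq = {
    ("soft", "contact", "lens"): 4,
    ("contact", "lens"): 10,
}
scores = c_values(freq)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The discount means a short term that mostly appears inside a longer term is ranked lower than its raw frequency would suggest. Single-word candidates score zero under this formula because log2(1) = 0.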

TerMine Example

U. Tokyo Tools
Natural Language Parsing
University of Tokyo
• Tagger, Chunker, ENJU, GENIA
• Necessary for any text mining application
• Fast and accurate
http://www-tsujii.is.s.u-tokyo.ac.jp/hiiragi/
http://www-tsujii.is.s.u-tokyo.ac.jp/CytoSailing/

Tokyo Tools Example

Tokyo Tools Example 2

Service Infrastructure
• NaCTeM will allow UK researchers to perform text mining on their own data in combination with other accessible resources (e.g. other data sets, ontologies, etc.)
• Requirements:
  • Lots of processing power
  • Lots of storage capacity
  • Easily extensible/configurable service framework
  • Access to cutting edge TM, DM and IR tools

Service Infrastructure
• Processing provided by UK National Grid Service
• Data Storage via SDSC's Storage Resource Broker
  • Important to store multiple versions of each document
• Cheshire3 provides the Grid enabled information infrastructure
  • Plus information retrieval and data mining tools
• Manchester and Tokyo provide the text mining tools
  • Stable tools integrated into Cheshire3 already

Service Infrastructure
• Initial NaCTeM services will be focused on the bio domain:
  • Bio-informatics is a growing field
  • Interest from both academic and corporate sectors
  • Large datasets/services available (MeSH, Medline, ...)
  • Web portal interaction
• Then expand into other areas, such as Social Sciences and Historical text analysis.

Future Work
• Services for other domains
• GUI Workflow configuration
• Integration of user developed services and applications
• Maximizing workflow potential with 'smart' components
• Standardizing annotation schemas
• Conference/Workshop
• Other?

Thank You
Questions?
... Reception!
