The National Centre for Text Mining

The National Centre for Text Mining …and its ramifications for e-Science and the other way round
Anne E Trefethen
Deputy Director, e-Science Core Programme
National Centre for Text Mining

A Definition of e-Science
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
John Taylor, Director General of Research Councils, Office of Science and Technology

Licklider’s Vision for the Internet
“Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job.”
Larry Roberts – Principal Architect of the ARPANET

UK e-Science Programme
• £250m government investment over 5 years
• Technical Advisory Group
• Director’s Awareness and Co-ordination Role
• Director’s Management Role
• Generic Challenges: EPSRC (£15m) £16.2m, DTI (£15m)
• Pilot Application Programme:
  PPARC (£26m) £31.6m
  BBSRC (£8m) £10.0m
  MRC (£8m) £13.1m
  NERC (£7m) £8.0m
  ESRC (£3m) £10.6m
  EPSRC (£17m) £18.0m
  CLRC (£5m) £5.0m
  Research Councils (£74m), £96.3m
  DTI (£5m)
• Collaborative projects
• Industrial Collaboration

Powering the Virtual Universe
http://www.astrogrid.ac.uk
(Edinburgh, Belfast, Cambridge, Leicester, London, Manchester, RAL)
AstroGrid slides courtesy of Nick Walton, Cambridge
Multi-wavelength view showing the jet in M87: from top to bottom – Chandra X-ray, HST optical, Gemini mid-IR, VLA radio.
AstroGrid will provide advanced, Grid-based federation and data mining tools to facilitate better and faster scientific output.
Picture credits: “NASA / Chandra X-ray Observatory / Herman Marshall (MIT)”, “NASA/HST/Eric Perlman (UMBC)”, “Gemini Observatory/OSCIR”, “VLA/NSF/Eric Perlman (UMBC)/Fang Zhou, Biretta (STScI)/F Owen (NRA)”

Gamma Ray Bursts
SWIFT satellite observes gamma ray burst (D. Ducros, ESA). Image from ESO. Image + IRIS data.
• Interaction with observatory pipelines
• Localise GRB alert in minutes, as GRBs fade rapidly
• Collate data from multiple telescopes over months: metadata issues
• Large computational photometric redshift calculations on multi-λ data > gives distance
• Cross-reference multi-λ data: ID precursor and/or environment
• Compare against SN light curves: bump shows evidence for a SN in the GRB (Price et al, 2002)
• Reprocessing of ionospheric STP data: change coordinates from Earth to celestial

Dark Matter + Large Scale Structure
X-ray cluster: Chandra X-ray (Mullis) overlaid on a deep BRI image (Clowe & Luppino). Image from ESO.
• Multi-TB λCDM models, e.g. Millennium Sim
• Automatic cluster finding techniques
• Multiple large image sources: registration & association
• Generate shear maps, c.f. CDM models > DM distribution with redshift
• Remove stars, correlate gals with z
• Source ID from multiplexed spectral data
• Colour-colour relationships: classification in multi-phase space

Some facts on Astronomy data
• Virtual observatories
• Many national virtual observatories containing data at different wavelengths. Estimated:
  • US NVO project alone will store 500 Terabytes/year
  • Laser Interferometer Gravitational-Wave Observatory (LIGO) generates 250 Terabytes/year
  • VISTA, the Visible and Infrared Survey Telescope, is estimated to generate 250 Gigabytes of raw data/night – 10 Terabytes of stored data/year
• Data analysis needs to be combined with previously published knowledge on the same astronomical time/space events

The eDiaMoND Project
eDiaMoND slides courtesy of David Gavaghan, Oxford University
Diagram labels: Relations; Life Sciences; Worldwide Grid; Hardware, Software and People Skills; eDiaMoND; Breast Screening Programmes; People Skills; Engineering and Physical Sciences Research Council; Medical Research Council

UK Breast Screening – Today
Paper and film
• Began in 1988
• Women 50-64
• Screened every 3 years
• 1 view/breast
• ~100 Breast Screening Programmes: Scotland, Wales, Northern Ireland, England
• 1,300,000 screened in 2001-02
• 65,000 recalled for assessment
• 8,545 cancers detected
• 300 lives saved per year
• 230 radiologists (double reading)
Statistics from NHS Cancer Screening web site

UK Breast Screening – Challenges
Digital
• Began in 1988
• Women 50-70
• Screened every 3 years
• 2 views/breast
• + Demographic increase
• ~100 Breast Screening Programmes: Scotland, Wales, Northern Ireland, England
• 2,000,000 screened every year
• 120,000 recalled for assessment
• 10,000 cancers
• 1,250 lives saved
• 230 radiologists (double reading)
• 50% workload increase

UK Breast Screening – Workflow
• Call: 1000
• Screening: 960 all clear, 40 recalled
• Assessment: 34 all clear, 6 cancer
• Missed: 1 (interval cancers)
• Other diagram labels: Previous; Current; Training; Epidemiology; ~100 Breast Screening Programmes

eDiaMoND – Scope
• Screening, Diagnosis, Teaching, Training, Epidemiology
• Data: 32 MB / image, 256 TB / year
• Compute: Standard Mammo Format, CADe, CADi, Data Mining
• Workstation, Grid
• ~4 Breast Screening Programmes
• Other diagram labels: Previous; Current
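The 256 TB / year figure can be sanity-checked against the numbers on the "Challenges" slide. The sketch below is only a back-of-envelope check, not the project's own calculation: it assumes two breasts imaged per woman, decimal terabytes, and no compression or repeat images, none of which the slides state.

```python
# Back-of-envelope check of the 256 TB/year storage figure.
# Assumptions (not stated on the slides): two breasts imaged per woman,
# decimal units (1 TB = 1,000,000 MB), no compression, no repeat images.
WOMEN_SCREENED_PER_YEAR = 2_000_000  # from the "Challenges" slide
VIEWS_PER_BREAST = 2                 # from the "Challenges" slide
BREASTS_PER_WOMAN = 2                # assumption
MB_PER_IMAGE = 32                    # from the "Scope" slide
MB_PER_TB = 1_000_000                # assumption: decimal terabytes


def yearly_storage_tb(women, views_per_breast, breasts, mb_per_image):
    """Return the yearly image volume in terabytes."""
    images = women * views_per_breast * breasts
    return images * mb_per_image / MB_PER_TB


images_per_year = WOMEN_SCREENED_PER_YEAR * VIEWS_PER_BREAST * BREASTS_PER_WOMAN
storage_tb = yearly_storage_tb(
    WOMEN_SCREENED_PER_YEAR, VIEWS_PER_BREAST, BREASTS_PER_WOMAN, MB_PER_IMAGE
)
print(f"{images_per_year:,} images/year -> {storage_tb:.0f} TB/year")
```

Under these assumptions the two slides agree: 8,000,000 images at 32 MB each come to 256 TB.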

eDiaMoND – Compute
• Mammograms have different appearances, depending on image settings and acquisition systems
• Standard Mammo Format
• Temporal mammography
• Computer Aided Detection
• 3D View

eDiaMoND – Data
• Images: DICOM
• Compute: Standard Mammo Format, CADe, CADi, Data Mining
• Data Grid: Logical View is One Resource

Patient   Age   …   Image
107258    55    …   1.dcm
236008    62    …   2.dcm
…         …     …   …
700266    59    …   3.dcm
…         …     …   …
895301    58    …   4.dcm
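The idea that the "logical view is one resource" can be illustrated with a small sketch: records held at several sites are queried through a single object, so the caller never sees where a row lives. Everything here is hypothetical (the class names `SiteCatalogue` and `LogicalView`, the site names, and the assignment of rows to sites); only the four patient rows come from the slide, and this is not the eDiaMoND implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ImageRecord:
    patient_id: int
    age: int
    image: str  # DICOM file name as shown on the slide


class SiteCatalogue:
    """Hypothetical per-site store of image records."""

    def __init__(self, name: str, records: Iterable[ImageRecord]):
        self.name = name
        self._records = list(records)

    def query(self, min_age: int, max_age: int) -> List[ImageRecord]:
        return [r for r in self._records if min_age <= r.age <= max_age]


class LogicalView:
    """Presents several site catalogues as one queryable resource."""

    def __init__(self, sites: Iterable[SiteCatalogue]):
        self._sites = list(sites)

    def query(self, min_age: int, max_age: int) -> List[ImageRecord]:
        results: List[ImageRecord] = []
        for site in self._sites:  # fan the query out to every site
            results.extend(site.query(min_age, max_age))
        return sorted(results, key=lambda r: r.patient_id)


# Rows from the slide; which site holds which row is an assumption.
site_a = SiteCatalogue("site_a", [ImageRecord(107258, 55, "1.dcm"),
                                  ImageRecord(700266, 59, "3.dcm")])
site_b = SiteCatalogue("site_b", [ImageRecord(236008, 62, "2.dcm"),
                                  ImageRecord(895301, 58, "4.dcm")])

view = LogicalView([site_a, site_b])
matches = view.query(min_age=58, max_age=60)
print([(r.patient_id, r.image) for r in matches])
```

A real data grid would also have to handle access control, remote transport and partial failures; the sketch only shows the single-resource interface.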

myGrid: Directly Supporting the e-Scientist
myGrid slides courtesy of Carole Goble
Partners: Manchester, EBI, Southampton, Nottingham, Newcastle, Sheffield, AstraZeneca, GlaxoSmithKline, Merck KGaA, Epistemics Ltd, GeneticXchange, Network Inference, IBM, SUN Microsystems
http://mygrid.man.ac.uk

myGrid Project (courtesy of Carole Goble, Manchester)
• Imminent ‘deluge’ of genomics data
• Highly heterogeneous
• Highly complex and inter-related
• Convergence of data and literature archives

Information Weaving (courtesy of Carole Goble, Manchester)
• Large amounts of different kinds of data & many applications.
• Highly heterogeneous.
  • Different types, algorithms, forms, implementations, communities, service providers
• High autonomy.
• Highly complex and inter-related, & volatile.

An in silico experiment = a web of interconnected information and components (courtesy of Carole Goble, Manchester)
• People
• Provenance record of workflow runs
• Literature
• Notes
• Data in and out
• Services used
• Provenance of the workflow template
• Related workflows
• Ontologies describing workflows

The eBank Project
• Building links between e-research data, from the CombeChem project, with scholarly communication and other on-line sources
• Investigating the role of aggregator services in linking data-sets from Grid-enabled projects to open data archives contained in digital repositories through to peer-reviewed articles as resources in portals
• JISC-funded project led by UKOLN in partnership with the Universities of Southampton and Manchester

Entire E-Science Cycle
Encompassing experimentation, analysis, publication, research, learning
Diagram labels: Virtual Learning Environment; Reprints; Peer-Reviewed Journal & Conference Papers; Technical Reports; Local Web; Preprints & Metadata; Institutional Archive; Publisher Holdings; Certified Experimental Results & Analyses; Data, Metadata & Ontologies; Undergraduate Students; Digital Library; Graduate Students; E-Scientists; Grid; E-Experimentation

Generic Issues
• In the next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history
NSF “Atkins” report on Cyberinfrastructure
• ‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’.
• ‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’.

Generic Issues cont.
• The data deluge from e-Science projects requires Grid technologies to facilitate discovery, analysis and curation of data
• The sheer volume of text published and new results appearing is impossible for researchers to read and correlate
• Effective automated processing is required to research, locate, gather and make use of knowledge encoded electronically in the available literature

Bioscience and biomedicine
• Bioscience and biomedicine have resulted in a huge volume of domain literature
• Open Access publishers such as BioMed Central have a growing number of full-text articles
• Integration of literature and data analysis is of increasing importance: linking factual biodatabases to literature, using publishers to check, complete or complement contents of such databases

NaCTeM
• Establishing high-quality service provision in text mining for the academic community – focus initially on biological and biomedical science
• Enabling e-Science applications!

Grid Technologies enabling Text Mining
• Text mining process involves many steps
• Potentially many tools
• Large amounts of text and data to be analysed
• Requiring temporary storage of intermediate results
• Access to large resources, ontologies, document collections etc.
• Compute-intensive algorithms
• Portal access to data and compute resources
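The points above (many steps, several tools, intermediate results held in temporary storage, access to an ontology) can be made concrete with a toy pipeline. This is a minimal sketch using only the Python standard library; the step functions, file names, sample documents and the two-term "ontology" are all illustrative assumptions, not NaCTeM tools or data, and everything runs locally rather than on a Grid.

```python
import json
import re
import tempfile
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Step 1: lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag_terms(tokens, ontology):
    """Step 2: keep only tokens that appear in the (toy) ontology."""
    return [t for t in tokens if t in ontology]


def run_pipeline(documents, ontology, workdir):
    """Run every step over all documents, saving each intermediate result."""
    tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    (workdir / "01_tokens.json").write_text(json.dumps(tokens))

    terms = {doc_id: tag_terms(toks, ontology) for doc_id, toks in tokens.items()}
    (workdir / "02_terms.json").write_text(json.dumps(terms))

    # Step 3: aggregate term frequencies across the whole collection.
    counts = Counter()
    for doc_terms in terms.values():
        counts.update(doc_terms)
    (workdir / "03_counts.json").write_text(json.dumps(counts))
    return dict(counts)


# Illustrative inputs (assumptions, not real corpus data).
docs = {
    "d1": "BRCA1 mutations raise breast cancer risk.",
    "d2": "The p53 protein and BRCA1 interact.",
}
ontology = {"brca1", "p53"}

with tempfile.TemporaryDirectory() as tmp:
    counts = run_pipeline(docs, ontology, Path(tmp))
    saved = sorted(p.name for p in Path(tmp).iterdir())

print(counts)
print(saved)
```

On a Grid, the working directory would typically be shared or remote storage and each step could run on a different resource; the sketch only shows why intermediate results need somewhere to live between steps.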

Conclusions
• The vastness of the amount of electronic literature and digital text demands automatic capabilities for effective analysis
• Combining this capability with data analysis is of growing importance for some research areas
• The future services provided by NaCTeM will form a significant piece of the toolset for e-Science applications

Acknowledgements
Thanks to
• Carole Goble and the myGrid team
• Liz Lyon and the eBank team
• Dave Gavaghan and the eDiaMoND Project
• Nick Walton and AstroGrid
