
Potential applications in CLRC/RAL collaborations


Presentation Transcript


  1. Potential applications in CLRC/RAL collaborations
     Julian Gallop, October 2002
     SDMIV workshop – Julian Gallop

  2. Commercial / scientific
     • Data mining well known in commercial applications
       • should the own brand cornflakes be located next to the beer
     • Less well known in scientific applications
     • Among scientists, it’s common to find
       • “not sure that what I need is data mining, but instead ….”
     • Perhaps data mining is regarded too narrowly

  3. Definitions
     • An early (1991) definition of knowledge discovery in databases (KDD) was given as:
       • "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al. 1991).
     • This was subsequently (1996) revised to:
       • "the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data" (Fayyad et al. 1996).
     • Data mining is one step in the KDD process, concerned with applying computational techniques to find patterns in data

  4. CLRC scientific fields and collaborations
     • Sciences: space, earth observation, particle physics, microstructures, synchrotron radiation . . .
     • Holds (or provides access to) significant data collections
     • Partnerships between E-science centre, BITD, computational science and science departments
     • E-science projects include:
       • Ones that are mainly CLRC (e.g. Data Portal)
       • UK e-science collaborations (e.g. Astrogrid, NERC Data Grid, gViz)
       • EU collaborations (e.g. DataGrid)
     • And also the UK Grid Support Centre

  5. Sample CLRC e-science project – Data Portal
     • Data Portal project – pilot project within CLRC:
       • To enable a scientist to discover, explore and retrieve disparate datasets through one interface, independent of the data location.
       • CLRC sciences (space science, synchrotron science and neutron science) as well as e-science and IT.
     • Part of the work is the development of a scientific metadata model

  6. Sample e-science projects involving CLRC
     • Astrogrid (UK)
       • Building a virtual observatory
       • Ideas on data mining:
         • Finding: association rules; deviations from a rule; similarity; clustering and classification
     • DataGrid (EU): aims to enable next-generation scientific exploration, which requires intensive computation and analysis of shared large-scale databases (millions of gigabytes) across widely distributed scientific communities.
       • Applications are: biomedical, earth observation, particle physics
     • NERC Data Grid (UK)

  7. NERC Data Grid
     • Funded by NERC & UK e-science core programme
     • Involves:
       • CLRC (RAL & DL, including British Atmospheric Data Centre)
       • Program for Climate Model Diagnosis and Intercomparison (PCMDI) (U.S. Lawrence Livermore National Lab)
     • Relevant to:
       • energy; water management; food chain; health; weather risk

  8. NERC Data Grid – relevance to knowledge discovery
     • Aims to address the problem that
       • at present, searching metadata to discover and retrieve what you want is a manual process
       • datasets in multiple locations involve multiple logins and retrieval in multiple formats
     • Indicators of success:
       • that it will be possible to find, reformat and visualize disparate datasets from disparate organisations within one organisation
       • ability to test data and comparison ideas without learning foreign formats and establishing personal relationships every time
     • Clearly will provide a basis for knowledge discovery if successful

  9. Earth observation instruments
     • For example ENVISAT
       • Instrument AATSR
       • Low orbit, 14 orbits/day
       • Returns to same place every 3 days
     • Picture shows plume from Mt Etna in 2001 (previous instrument ATSR2)
     • NASA AQUA: TBs/day

  10. Earth observation patterns
      • For a particular location, what patterns emerge on:
        • A daily basis
        • Or a yearly basis
      • Knowing the conventional pattern day by day, one can observe out-of-the-ordinary events, e.g. an oil slick
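The idea of comparing a new observation with the conventional pattern for that day can be sketched as follows. This is only an illustration, not the method used by any CLRC project: the function name, the dictionary layout and the three-standard-deviation threshold are all assumptions made for the example.

```python
from statistics import mean, stdev

def flag_anomalies(history, today, threshold=3.0):
    """Return the days whose new value departs from the usual pattern.

    history: dict mapping day-of-year -> list of past values at one location
             (hypothetical layout, chosen for this sketch).
    today:   dict mapping day-of-year -> newly observed value.
    threshold: number of standard deviations treated as "out of the ordinary"
               (an assumed default, not taken from the slides).
    """
    flagged = []
    for day, value in sorted(today.items()):
        past = history.get(day, [])
        if len(past) < 2:
            continue  # not enough history to define a conventional pattern
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # Past values were identical, so any difference is unusual.
            if value != mu:
                flagged.append(day)
            continue
        if abs(value - mu) / sigma > threshold:
            flagged.append(day)
    return flagged
```

A real earth observation pipeline would also need to handle cloud cover, missing passes and seasonal trends, which this sketch ignores.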

  11. climateprediction.net
      • Makes use of spare compute capacity on office and home PCs to run a climate prediction model
      • Different PCs run different parameters and collectively run a Monte Carlo simulation
      • Results will be studied to find out which subsets of the parameter space correspond to observation
      • Better understanding of uncertainties
      • Public understanding of climate change
      • Oxford U, CLRC RAL, Reading U, with Met Office and OU

  12. Data in climateprediction.net
      • Base
        • Latitude 96
        • Longitude 72
        • Levels 19
        • Timesteps calculated every 30mins / 1hr and output for every day over a period of 50 years
        • 17000 registered in advance of launch
      • Variables
        • Horizontal velocity
        • Temperature
        • Surface pressure
        • Water vapour (atmosphere)
        • Salinity (ocean)
        • Possible others, such as ocean carbon content and atmospheric ozone and sulphates
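To give a feel for the scale implied by the grid dimensions quoted above, here is a rough, back-of-envelope calculation. The grid size, level count and 50-year daily output come from the slide; the 365-day year and the 4 bytes per value are assumptions, and the estimate treats every variable as a full three-dimensional field (surface pressure, for one, is not), so it is an upper-bound sketch rather than the project's actual storage figure.

```python
# Figures taken from the slide.
LATITUDE_POINTS = 96
LONGITUDE_POINTS = 72
LEVELS = 19
YEARS = 50

# Assumptions made for this sketch only.
DAYS_PER_YEAR = 365   # the model calendar is not stated in the slides
BYTES_PER_VALUE = 4   # e.g. single-precision floats

def values_per_daily_field():
    """Number of grid values in one 3-D field for one output day."""
    return LATITUDE_POINTS * LONGITUDE_POINTS * LEVELS

def daily_fields_per_run():
    """Number of daily outputs over the whole simulated period."""
    return YEARS * DAYS_PER_YEAR

def bytes_per_variable_per_run():
    """Uncompressed size of one full 3-D variable for one complete run."""
    return values_per_daily_field() * daily_fields_per_run() * BYTES_PER_VALUE

if __name__ == "__main__":
    gigabytes = bytes_per_variable_per_run() / 1e9
    print(f"{values_per_daily_field()} values per daily field")
    print(f"{daily_fields_per_run()} daily fields per run")
    print(f"about {gigabytes:.1f} GB per variable per run")
```

Under these assumptions a single variable from a single run comes to roughly 9.6 GB, which suggests why only a subset of each run's data can be returned to a central server.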

  13. Parameters in climateprediction.net
      • Physics parameters that may be varied between one run and another:
        • Representation of cloud variability
        • Rate at which water droplets collide and cohere
        • Number of nucleation particles for cloud droplet formation
        • Light scattering in the atmosphere
        • Cloud convection
        • Surface processes such as rate of transpiration by plants
      • Also, runs will be duplicated to detect tampering
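One way duplicated runs could expose tampering is to group returned results by parameter set and flag any set whose duplicates disagree. The sketch below assumes that two runs with the same parameters should return the same summary value to within a small tolerance; that assumption, the tuple layout and the function name are all hypothetical and are not taken from the climateprediction.net design.

```python
from collections import defaultdict

def find_suspect_parameter_sets(results, tolerance=1e-6):
    """Return parameter sets whose duplicated runs disagree.

    results: iterable of (param_key, client_id, summary_value) tuples,
             a hypothetical layout chosen for this sketch.
    tolerance: largest spread between duplicates still treated as agreement
               (an assumed value).
    """
    by_params = defaultdict(list)
    for param_key, client_id, value in results:
        by_params[param_key].append((client_id, value))

    suspects = {}
    for param_key, runs in by_params.items():
        if len(runs) < 2:
            continue  # no duplicate to compare against
        values = [value for _, value in runs]
        if max(values) - min(values) > tolerance:
            suspects[param_key] = runs
    return suspects
```

A disagreement only shows that at least one of the duplicates is unreliable; deciding which one would need a further run or an independent check.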

  14. Data distribution in climateprediction.net
      • Results dataset will be distributed across several (possibly 20) climate modelling institutions
      • A subset of data is returned from a PC to a data server. The remainder is therefore kept on the (home or office) PC and is available if the owner so chooses.
      • A program attempting to data mine needs to be isolated from these details by an appropriate portal, metadata and/or catalogue
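The isolation described above can be pictured as a catalogue that the mining program queries by what it wants, never by where the data lives. The following is a minimal sketch under that reading; the class names, fields and location strings are hypothetical and do not describe any actual climateprediction.net or CLRC interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetRecord:
    run_id: str
    variable: str
    location: str    # hypothetical, e.g. an institution's server or a volunteer's PC
    available: bool  # False if, say, a PC owner has chosen not to share

class Catalogue:
    """Hypothetical catalogue: answers "which runs hold this variable?"."""

    def __init__(self, records):
        self._records = list(records)

    def find(self, variable):
        """Return available records for a variable, wherever they are held."""
        return [r for r in self._records
                if r.variable == variable and r.available]

def runs_with(catalogue, variable):
    """A mining program asks by variable only; location never appears here."""
    return [record.run_id for record in catalogue.find(variable)]
```

The design point is that `runs_with` keeps working unchanged if data moves between servers and PCs, because only the catalogue records change.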

  15. Climateprediction.net questions
      • Some questions that need to be askable:
        • What features of the response are robust as we change the physics?
        • What kinds of changes have similar effects to each other?
        • Which models that are consistent with current observations give changes in extreme events in the future?
      • Unclear whether this is data mining in the strict sense, but certainly multivariate statistical techniques
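The third question above, restricting attention to models consistent with current observations and then looking at what they predict, can be sketched as a simple filter over an ensemble. The record fields (`hindcast`, `future_change`), the single scalar observation and the tolerance are assumptions made for illustration; a real analysis would use the multivariate statistical techniques the slide mentions.

```python
def consistent_runs(runs, observed, tolerance):
    """Keep runs whose simulated present-day value is close to observation.

    runs: list of dicts with hypothetical keys "hindcast" (simulated value
          for the observed period) and "future_change" (projected change).
    """
    return [run for run in runs
            if abs(run["hindcast"] - observed) <= tolerance]

def projection_range(runs):
    """Smallest and largest projected change among the given runs."""
    changes = [run["future_change"] for run in runs]
    if not changes:
        return None
    return min(changes), max(changes)
```

For example, `projection_range(consistent_runs(ensemble, observed=14.0, tolerance=0.5))` would give the spread of projections among the runs that match the observed value to within 0.5, where `ensemble` is a list of such records.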

  16. Summing up
      • NERC Data Grid project, for example, exposes current difficulties of doing data mining on large scientific datasets
        • In commercial situation, data is warehoused under single operational control
        • In science, access is needed to different datasets which are under different managements
        • Multiple logins, multiple metadata systems
      • Current e-science projects are providing a mechanism which future data mining could use
      • Applications include: earth observation; particle physics; astronomy; biology; . . .
