Discovery Net. Yike Guo [1], John Darlington (Dept. of Computing), John Hassard (Depts. of Physics and Bioengineering), Bob Spence (Dept. of Electrical Engineering), Tony Cass (Department of Biochemistry), Sevket Durucan (T. H. Huxley School of Environment)
Discovery Net
• Yike Guo [1], John Darlington (Dept. of Computing)
• John Hassard (Depts. of Physics and Bioengineering)
• Bob Spence (Dept. of Electrical Engineering)
• Tony Cass (Department of Biochemistry)
• Sevket Durucan (T. H. Huxley School of Environment)
[1] Contact address: yg@doc.ic.ac.uk
Meeting the LTTR challenge: Sharing Data & Information to Create Knowledge
• Sensors; Data to Information; Information to Knowledge
• Modelling & Simulation; Experimental Facilities; Instrumentation Systems; Enabling Technologies
• Multidisciplinary programmes; Partnership between academia & industry
• Acquisition; Maintenance of audit trail; Representation; Knowledge Re-use
• Meta data; Warehousing & Cleaning; Intuitive user interfaces; Structure & Organisation
• IR & Knowledge Discovery; Domain ontologies; Re-usable s/w components
• Data mining & analysis tools; Intelligent IR; Advanced visualisation; Intelligent agents
High Throughput Sensing Characteristics
(Diagram labels: Distributed Reference DBs; Distributed Users; Collaborative applications; Distributed Devices; Distributed warehousing)
• Different devices but same computational characteristics
• Data intensive & data dispersive: large-scale, heterogeneous, distributed data
• Real-time data manipulation
• Need to calibrate, integrate, analyse
• Discovery issues: Distributed Knowledge Discovery, Management; Incremental, Interactive Discovery & Collaborative Discovery
• Information issues: annotations semantics, reference, integrated view of data
• Data issues: different measurements for same object: data registration, normalisation, calibration & quality control
• GRID issues: wide area, high volume, scalability (data, users), collaboration
DNet Architecture
• High Throughput Sensing (HTS) Applications: Large-scale Dynamic Real-time Decision Support; Large-scale Dynamic System Knowledge Discovery
• Grid-based Knowledge Discovery (based on Kensington Discovery Platform): Grid-based Data Mining, Collaborative Visualisation
• Information Structuring: Information Integration & Composition, Semantics & Domain-based Ontologies, Sharing
• Distributed Data Engineering: Data Registration, Data Normalisation, Data Quality
• High Throughput Computing Services (based on Globus & ORB Infrastructure): Utilising Grid Infrastructure for HT Computing
• Grid Basic Infrastructure: Globus/Condor/SRB
Test bed Performance Requirements
• Giga Byte mining: > 100,000,000,000 GBytes
• Mega column feature: > 1,000,000 columns
• Tera Byte warehousing: > 20 TB
• Tera Flop processing: > 1 Tflops
• Real-time deployment: < 1 ms (power grid reaction time)
• Device scalability: > 10,000 HTD (e.g. sensors)
• User scalability: > 100 scientists performing concurrent analysis
The IC Advantage
The IC infrastructure: microgrid for the testbed
(Network diagram labels: End devices; Floor switches; Building Router Switches; Core Router Switches; Central Computing Facilities with workstation cluster, SMP and storage; wireless; Proposed Firewall; London MAN/JANET)
• Access to disparate off-campus sites: IC hospitals, Wye College etc.
• More than 12,000 end devices
• 10 Mb/s to 1 Gb/s to end devices
• 1 Gb/s between floors
• 10 Gb/s to backbone
• 10 Gb/s between backbone router matrix and wireless capability
• 2 x 1 Gb/s to LMAN II (10 Gb/s scheduled 2004)
ICPC Resource
• 150 Gflops processing
• > 100 GB memory
• 5 TB of disk storage
£3m SRIF funding
• Network upgrade
• +20 TB of disk storage
• +25 TB of tape storage
• 3 clusters (> 1 Tera Flops)
Testbed Applications
HTS Applications: Large-scale Dynamic Real-time Decision Support; Large-scale Dynamic System Knowledge Discovery
• Renewable energy applications
  Tidal energy; connections to other renewable initiatives (solar, biomass, fuel cells), and to CHP and baseload stations
  Throughput: 1-10 GB/s. Size: 1-10 petabytes. Node number: > 20,000. Operations: structuring, mining, optimisation, RT decisions
• Remote sensing applications
  Air sensing (GUSTO); geological, geohazard analysis
  Throughput: 1-100 GB/s. Size: 10-100 petabytes. Node number: > 50,000. Operations: image registration, visualisation, predictive modelling, RT decisions
• Bio chip applications
  Protein-folding chips: SNP chips, Diff. Gene chips using LFII; protein-based fluorescent micro arrays
  Throughput: 1-1000 GB/s. Size: 10-1000 petabytes. Node number: > 10,000. Operations: data quality, visualisation, structuring, clustering
Distributed Dynamic Knowledge Management
Biotechnology and Discovery Net
High Throughput systems (Functional genomics, Proteomics, Bioinformatics)
• Gene function: 100k genes, 320 cell types, 5000 stimuli, 8 concentrations, 2 replicates: 8 x 10^12 data points = 10 petabytes. This exceeds LHC data written per year. Proteomics will produce much more data!
• LFII approach will produce about 50 times the bandwidth of conventional techniques. This approaches LHC trigger data in bandwidth!
• Protein Data Bank (PDB), which maintains data on the three-dimensional (3D) structure of biological macromolecules, is doubling in size every 18 months.
• Genbank, the DNA sequence database maintained by the National Center for Biotechnology Information (NCBI), is doubling in size every 21 months.
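The gene-function estimate on this slide is a product of experimental factors, so it can be checked with a few lines of arithmetic. The sketch below is an illustration only: the dictionary keys and function name are hypothetical, and a decimal petabyte (10^15 bytes) is assumed. Under those assumptions the five listed factors multiply to 2.56 x 10^12, roughly a third of the quoted 8 x 10^12 data points; the slide does not say which further factor accounts for the difference. The quoted totals imply about 1.25 kB per data point.

```python
# Experimental factors as listed on the slide (key names are illustrative).
FACTORS = {
    "genes": 100_000,
    "cell_types": 320,
    "stimuli": 5_000,
    "concentrations": 8,
    "replicates": 2,
}

QUOTED_POINTS = 8 * 10**12        # data points quoted on the slide
QUOTED_BYTES = 10 * 10**15        # 10 petabytes, assuming 1 PB = 10**15 bytes


def data_points(factors):
    """Multiply all experimental factors to get the number of measurements."""
    total = 1
    for count in factors.values():
        total *= count
    return total


listed_points = data_points(FACTORS)
ratio_quoted_to_listed = QUOTED_POINTS / listed_points
bytes_per_point = QUOTED_BYTES / QUOTED_POINTS

print(f"product of listed factors: {listed_points:.3e}")
print(f"quoted / listed:           {ratio_quoted_to_listed}")
print(f"implied bytes per point:   {bytes_per_point}")
```

Keeping the factors in one mapping makes it easy to add a further dimension (for example time points) and see how it moves the total.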
Protein and gene databases
• Our LFII approach will enlarge the number and size (x100?) of these DBs.
• Our goal will be to establish QC and backward compatibility with legacy DBs.
Geo-hazard prediction
Each pixel of a radar image contains information on the phase of the signal backscattered from the underlying surface. By utilizing the geometry provided by two marginally displaced, coherent observations of the surface, the phase difference between the two observations can be related to surface height. Furthermore, by repeated observation, it is possible to measure surface displacements of scattering features that have been slightly shifted (due to an earthquake, for example), or that are moving continuously but relatively slowly (such as ice sheets and glaciers).
• Sensors at different wavelengths: blue, red, infrared, thermal infrared
• Reference data (e.g. elevation models, map data); Environment Agency; multi-spectral data
• Monitoring geo-hazards: we analyse temporal changes in soil erosion to predict landslides and floods
• 5-6 Gbytes/day of image data sensed at different wavelengths at 30 metres/pixel for a 180 x 180 km area
• Terabytes at 1 metre/pixel to cover the UK
• The useful information comes from time-resolved correlations from other sensors, and with other environmental data sets: LANDSAT, ASTER, IKONOS, ERS (5 Gbytes/scene), Airborne Radar, INSAR
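The displacement measurement described above rests on a simple relation: in repeat-pass interferometry the signal travels to the surface and back, so a phase change of one full cycle corresponds to half a wavelength of motion along the line of sight. The sketch below is a minimal illustration, not the project's processing chain. It assumes the phase has already been unwrapped, uses the ERS C-band wavelength of about 5.66 cm, and ignores sign conventions, which differ between processors. The function name is hypothetical. The height relation for two displaced observations also depends on the baseline geometry and is not sketched here.

```python
import math

# ERS C-band radar wavelength, about 5.66 cm (assumed default for this sketch).
ERS_WAVELENGTH_M = 0.0566


def los_displacement(delta_phase_rad, wavelength_m=ERS_WAVELENGTH_M):
    """Line-of-sight displacement from an unwrapped interferometric phase change.

    The two-way path means a phase change of 2*pi equals wavelength / 2
    of motion, hence the factor 4*pi in the denominator.
    """
    return wavelength_m * delta_phase_rad / (4 * math.pi)


# One full fringe and a quarter fringe, in metres.
one_fringe_m = los_displacement(2 * math.pi)
quarter_fringe_m = los_displacement(math.pi / 2)

print(f"one fringe:     {one_fringe_m * 1000:.2f} mm")
print(f"quarter fringe: {quarter_fringe_m * 1000:.2f} mm")
```

Because one fringe is under 3 cm at this wavelength, slow movements such as glacier flow or post-earthquake shifts become measurable from repeated scenes.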
Large-scale urban air sensing applications: GUSTO
• Each GUSTO air pollution system produces 1 kbit per second, or of order 10^10 bits per year.
• We expect to increase the number (from the present 2 systems) to over 20,000 over the next 3 years, to reach a total of 0.6 petabytes of data within the 3-year ramp-up.
• The useful information comes from time-resolved correlations among remote stations, and with other environmental data sets.
(Figure: NO simulant, 6.7.2001; map marker "You are here")
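The GUSTO data volumes follow from the per-system bit rate, so a short calculation shows the scale involved. The sketch below is an illustration under stated assumptions: a constant 1 kbit/s per system, a 365-day year, no protocol or storage overhead, and all 20,000 systems running for the whole year. The function name is hypothetical. Under those assumptions one system yields about 3 x 10^10 bits per year, and the full fleet about 6.3 x 10^14 bits (roughly 79 TB) per year. The slide does not state how its 0.6 petabyte total is derived, so the ramp-up profile is not modelled here.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year assumed


def bits_per_year(bit_rate_bps, n_systems=1):
    """Total bits produced in one year by n_systems at a constant bit rate."""
    return bit_rate_bps * SECONDS_PER_YEAR * n_systems


single_system_bits = bits_per_year(1_000)       # one GUSTO system at 1 kbit/s
fleet_bits = bits_per_year(1_000, 20_000)       # planned fleet of 20,000 systems
fleet_terabytes = fleet_bits / 8 / 10**12       # decimal terabytes

print(f"one system: {single_system_bits:.2e} bits/year")
print(f"fleet:      {fleet_bits:.2e} bits/year")
print(f"fleet:      {fleet_terabytes:.2f} TB/year")
```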
Electrical grid
• There is large potential in embedded generation renewable sources: they will dominate in new build over usual baseline (nuclear, hydro and carbon) power stations. Decentralised power is the new paradigm.
• Renewable sources include solar, wind, tidal and biomass, and must be combined with baseload, CHP etc.
• Renewables are characterised by a large number of small units, often in remote areas; wireless connectivity; and fluctuating, unpredictable loading.
• As the total exceeds 12%, grid control becomes very difficult without an RT e-grid.
• Grid structure and the current regulatory and charging regimes for the electricity supply industry were set up to cater for centralised generation and are often not appropriate for smaller plant connected directly into the distribution network.
• Deregulation, pollution control standards, and the need for greater power quality have severe implications for 'seven nines' users, e.g. ISPs, requiring 99.99999% uptime (3 s down p.a.)! Sun Microsystems: $1m per minute of power downtime. EPRI shows that 100 transients per month cost US $30-$50bn per year.
• Needed: active management, dispatch metering, RT monitoring, RT control, minute-to-minute security, pan-network optimisation. This requires very high bandwidth RT remote station data acquisition, warehousing and analysis.
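The 'seven nines' figure quoted on this slide can be reproduced directly: an availability of N nines leaves a fraction 10^-N of the year as permitted downtime. The sketch below is a small illustration assuming a 365-day year; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year assumed


def annual_downtime_seconds(nines):
    """Permitted downtime per year for an availability of `nines` nines.

    For example nines=7 means 99.99999% uptime, i.e. an unavailable
    fraction of 10**-7.
    """
    unavailable_fraction = 10.0 ** (-nines)
    return unavailable_fraction * SECONDS_PER_YEAR


seven_nines_s = annual_downtime_seconds(7)
five_nines_s = annual_downtime_seconds(5)

print(f"seven nines: {seven_nines_s:.2f} s of downtime per year")
print(f"five nines:  {five_nines_s / 60:.2f} min of downtime per year")
```

The seven-nines result is about 3 seconds per year, matching the slide, which is why sub-second grid reaction times matter for such users.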
Deliverables I
Testbed:
• High Throughput Computing Services
• Transparent utilisation of distributed processing and storage facilities
• High volume computation based on grid software (Globus + Condor)
• Object abstraction framework
• Resource pooling and sharing
• Efficient resource discovery and utilisation (brokering, application-level scheduling)
• Utilisation of Grid services
• Condor cycle-stealing
• Globus
Deliverables II
(Diagram labels: Users / Client Visualisation Tools; Knowledge Discovery Services; Sensors / Devices; Information Structuring Services; Reference Data Sources; Data Engineering Services; Computing Servers)
Testbed:
• Knowledge Discovery Services
• Distributed data engineering
• Data registration, normalisation, quality & control
• Information structuring & composition
• Application-oriented information structuring
• Domain-specific ontologies and information composition
• Large-scale distributed mining
• Grid-based data mining algorithm
• Knowledge management and auditing
• Collaborative visualisation
Deliverables III: Service Applications
• Virtual Cell: cell function modelling based on functional genomics (gene chip), protein expression and protein-protein interaction data (John Hassard, Tony Cass, Jeff Harford).
• Environment Vista: remote sensing analysis environment; real-time pollution analysis, modelling and visualisation for the urban (London) milieu, with upgrade path mapped for pan-Europe roll-out (Ray Wrigley).
• Global Geohazard e-Grid: optimised high bandwidth analysis and visualisation framework for European hazard monitoring (Steve Durucan).
• UK Regional Power Quality e-Grid: load balancing and grid optimisation for simulated increasing renewable power loading (Geoff Rochester).
The Consortium
• Industry connection: 4 spin-off companies + > 100 related companies (AstraZeneca, Pfizer, GSK, Cisco, IBM, HP, Fujitsu, Gene Logic, Applera, Evotec, International Power, Hydro Quebec, BP, British Energy, ….)
• Innovation: > 30 patents + world-class cross-disciplinary research
• Wide coverage of LLTR fields
• Long-lasting working relationship: close collaboration for making deliverables
Milestones
• 9 months: basic middleware, DAQ, collection data registration, information structuring for at least two fields (demo at Supercomputing 2002)
• 18 months: scalable middleware, data normalisation, data integration, information composition, distributed mining and visualisation structure; integrated with USA infrastructure
• 27 months: online data quality control, ontologies and data reference, scalable mining, demo of Virtual Cell and Environment Vista
• 36 months: packaged D-Net as an e-science platform for national grid deployment, demo of all applications
Industrial Contribution
• Hardware: sensors (photodiode arrays, hybrid photodiodes, PMTs), systems (optics, mechanical systems, DSPs, FPGAs)
• Software: analysis packages, algorithms, data warehousing and mining systems
• Intellectual property: access to IP portfolio suite at no cost (starting with 32 international patents)
• Data: raw and processed data from biotechnology, pharmacogenomic, remote sensing (GUSTO installations, satellite data from geo-hazard programmes) and renewable energy data (from our own remote tidal power systems)
• People: > 8 scientists
Discovery Net Project Management
• Project PI and co-ordinator: Yike Guo
• Project director and strategist: John Darlington
• Applications co-ordinator: John Hassard
• LFII biotechnology: J Harford
• Protein chips: T Cass
• Geohazard: S Durucan
• Remote sensing: R Wrigley
• Renewable energy: G Rochester
• Project operation manager: Moustafa Ghanem
• We will establish a Scientific Advisory Board by month 6.