e-Science and Datacentric Frameworks • Hyunseung Choo • Sungkyunkwan University • http://monet.skku.ac.kr • choo@skku.ac.kr
e-Science • ‘e-Science’ is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it. ‘e-Science’ will change the dynamic of the way science is undertaken. • John Taylor, Director General of Research Councils, Office of Science and Technology
GRID vs. e-Science <KIPS Review, May, 2003>
From Networking to Grid Computing • Exponential Growth of Network Technology • Network vs. Computer Performance • Computer speed doubles every 18 months • Network speed doubles every 9 months • Difference = order of magnitude per 5 years • 1986 to 2000 • Computers: x 500 • Networks: x 340,000 • 2001 to 2010 • Computers: x 60 • Networks: x 4000
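The "order of magnitude per 5 years" figure follows directly from the two doubling periods on this slide. A minimal sketch of the arithmetic, assuming clean exponential growth (the function and variable names are illustrative, not from the original deck):

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` if capacity doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


FIVE_YEARS = 60  # months

computer = growth_factor(FIVE_YEARS, 18)  # computer speed doubles every 18 months
network = growth_factor(FIVE_YEARS, 9)    # network speed doubles every 9 months
gap = network / computer                  # how much faster networks grow than computers

print(f"computers: x{computer:.1f}, networks: x{network:.1f}, gap: x{gap:.1f}")
```

Under these assumptions computers grow roughly tenfold and networks roughly a hundredfold in five years, so the gap between them widens by about a factor of ten.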
The Driver for e-Science • More and more data • Instrument resolution doubling / 12 months • Instrument and telemetry speeds increasing • Mobile sensors & radio digital networks • Storage capacity doubling / 12 months • More and more computation • Computations available doubling / 18 months • Faster networks can change methods • Raw bandwidth doubling / 9 months • These integrate and enable • More interplay between computation and data • More collaboration: scientists, medics, engineers, etc. • More international collaboration
The New Behavior • Shared Infrastructure • Intrinsically distributed • Intrinsically multi-organizational • Multiple uses interwoven • Shared Software • A new attempt at making distributed computing economic, dependable and accessible • Scientists from all disciplines share in its design and use • Shared & Automated System Administration • Replicated farms of replicated systems • Autonomic management • Immediate Benefits • Faster transfer of ideas and techniques between disciplines • Amortization of development, operation and education
Examples of e-Science • Earth Observation Systems: severe weather prediction, climate variation, flood monitoring, earthquakes, and tsunamis • Virtual Observatories • Robotic Telescopes • Bioinformatics / Functional genomics • Collaborative Engineering • Medical / Healthcare informatics • TeleMicroscopy, and so on
Example 1 – Earthquake Simulation • NEESgrid • National infrastructure to couple earthquake engineers with experimental facilities, databases, computers, and each other • Argonne, Michigan, NCSA, UIUC, USC
Example 2 – Airspace Simulation • NASA Information Power Grid (IPG) • Aircraft, flight paths, airport operations, and the environment are combined into a virtual national airspace
e-Science (USA) • Cyberinfrastructure program, comparable to an “e-Science community”, for federal offices, supercomputing centers, and research institutes • Budget in 2003: US$ 1.1 billion • e-Science Cases • Telescience Portal: X-ray related applications including Microbioanalysis • NASA IPG (Information Power Grid): Aircraft simulation and analysis to reduce design processing time • BIRN (Biomedical Informatics Research Network): Study of human and animal brains for a new era in medical science
BIRN (Biomedical Informatics Research Network) • Processing Pipelines for Morphometric Analysis • Medical Applications for HPC • Non-linear registrations • Biomechanical simulations • Statistical analysis of large populations
AccessGrid: always-on video walls • e-Science Centre (UK)
e-Science Pilot Project (UK) (1/2) • Many to one project • Particle Physics and Astronomy Research Council (PPARC) • GridPP: A prototype Grid infrastructure for the CERN Large Hadron Collider • AstroGrid: A Grid-based Virtual Observatory • Biotechnology and Biological Sciences Research Council (BBSRC) • Medical Research Council (MRC) • Natural Environment Research Council (NERC) • Grid for Environmental Systems Diagnostics and Visualization • Climateprediction.com: Distributed computing for global climate research • Environment from the Molecular Level: Modeling the atomistic processes involved in environmental issues
e-Science Pilot Project (UK) (2/2) • Economic and Social Research Council (ESRC) • Engineering and Physical Sciences Research Council (EPSRC) • The Reality Grid: a tool for investigating condensed matter and materials • Comb-e-chem: Structure-Property Mapping: Combinatorial Chemistry and the Grid • DAME: Distributed Aircraft Maintenance Environment • GEODISE: Grid Enabled Optimization and Design Search for Engineering • Discovery Net: An e-Science Testbed for High Throughput Informatics • MyGrid: Directly Supporting the e-Scientist • Council for the Central Laboratory of the Research Councils (CLRC)
e-Science (JP) • IT-based laboratory (ITBL), Grid-based fundamental Informatics (A05), 100 Teraflop high performance computing (NAREGI) • All led by the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) • e-Science Cases • ITBL: Project for virtual research environments • A05: Grid computing project • NAREGI: Integrating distributed computing resources over high performance networks for 100 Teraflop HPC
ITBL (IT-Based Laboratory) • 6 Organizations at ITBL • Japan Atomic Energy Research Institute (JAERI) • RIKEN (The Institute of Physical and Chemical Research) • National Institute for Materials Science (NIMS) • National Aerospace Laboratory of Japan (NAL) • National Research Institute for Earth Science and Disaster Prevention (NIED) • Japan Science and Technology Corporation (JST) • Massive collaborative research environment for remote researchers, provided over SuperSINET and based on IT infrastructure
e-Science (CN) • Grid Projects in China (2002-2005) • The Ministry of Science & Technology 863 Grid Project • Grid Enabling Cluster (>4 Tflop/s) • Grid Nodes (Total 6-10 Tflop/s) • Grid Software (Grid OS, Developer and User Environment) • Grid Applications in Science, Manufacturing, Service industry, and Environment/Resource sector • The “Next Internet” Project (led by Chinese NSF) • Upgrade network infrastructure • Basic research in computing, data and access grids • The Chinese Academy of Sciences e-Science Grid • The Beijing City Manufacturing Grid
Three different kinds of grids • Computational grids • These represent the natural extension of large parallel and distributed systems, and exist to provide high-performance computing • Access grids • These require managing access to many specific, small resources that are actually located inside large, complex, organizational computer systems and networks • Data grids • These exist to allow large datasets to be stored in repositories and moved about with the same ease that small public files can be moved today ☼ Datacentric grids
Facts about online data • They are big and growing fast • Data stored online quadruples every 18 months • Processing power ‘only’ doubles every 18 months • They are naturally distributed • Data is captured via multiple channels • Operating systems struggle to handle files larger than a few GB • They are hard to move • Pragmatics: Few sites have enough swap space to handle the arrival of a terabyte dataset for temporary use • Performance • Politics: Data about individuals cannot be moved out of jurisdictions with strong privacy rules
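To make the "hard to move" point concrete, here is a rough back-of-the-envelope estimate of how long a terabyte dataset takes to transfer. The link speeds below are hypothetical round numbers chosen for illustration, not figures from the slides, and the calculation ignores protocol overhead unless an `efficiency` below 1 is passed in.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move `size_bytes` over a link, assuming a steady rate.

    `efficiency` is the assumed fraction of raw bandwidth actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


TERABYTE = 10 ** 12  # bytes (decimal terabyte)

# Hypothetical link speeds, for illustration only.
HYPOTHETICAL_LINKS = {
    "100 Mbit/s": 100e6,
    "1 Gbit/s": 1e9,
    "10 Gbit/s": 10e9,
}

for name, bits_per_second in HYPOTHETICAL_LINKS.items():
    hours = transfer_hours(TERABYTE, bits_per_second)
    print(f"1 TB over {name}: {hours:.2f} hours")
```

Even on an ideal, uncontended link the transfer takes from minutes to most of a day, before the storage and privacy constraints listed on the slide are considered.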
Implications of datasets that are large, distributed, and immovable • It’s much more effective to divide programs into separate pieces and send them to the data • This requires a datacentric view of computation, rather than the conventional processor-centric view • A new programming model is needed • Applications must be decomposable • The results of (partial) computations must be small enough to move around • These condensed forms are worth keeping • Execution nodes must be able to provide both computing cycles and high-performance data access
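The idea of sending a decomposable program to the data and moving only small partial results can be sketched as follows. This is an illustrative toy, not a framework from the talk: the site names, the `Summary` type, and the in-memory lists standing in for remote repositories are all assumptions. Each site condenses its local data into three numbers, and only those numbers travel.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Summary:
    """Condensed form of a dataset: small enough to move, worth keeping."""
    count: int
    total: float
    total_sq: float


def summarize(values: Iterable[float]) -> Summary:
    """Runs at the data site: one pass over local data, constant-size result."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return Summary(count, total, total_sq)


def merge(parts: Iterable[Summary]) -> Summary:
    """Runs at the client: combines partial results without seeing raw data."""
    count, total, total_sq = 0, 0.0, 0.0
    for p in parts:
        count += p.count
        total += p.total
        total_sq += p.total_sq
    return Summary(count, total, total_sq)


def mean_and_variance(s: Summary) -> Tuple[float, float]:
    """Population mean and variance recovered from the merged summary."""
    mean = s.total / s.count
    return mean, s.total_sq / s.count - mean * mean


# Hypothetical sites; in a real grid these lists would never leave their hosts.
sites = {"site_a": [1.0, 2.0, 3.0], "site_b": [4.0, 5.0], "site_c": [6.0]}

partials = [summarize(values) for values in sites.values()]
combined = merge(partials)
mean, variance = mean_and_variance(combined)
print(f"n={combined.count}, mean={mean:.2f}, variance={variance:.4f}")
```

The sum-of-squares formula is used here only because it merges trivially; for data with a large mean relative to its spread it can lose precision, and a numerically stabler merge would be preferable.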
Some properties • Users can be productive even from a thin client • Applications require only thin pipes within the internet • Code mobility is essential • The format and content of a data repository will often be unknown to an application until it actually starts accessing it • Applications will tend to be standardized • Applications will often be built from templates, perhaps even expressed using a query language • Re-execution of an application on a different or updated dataset will be common • There will be increased sensitivity about information leakage
Conclusion • e-Science and datacentric grids are strongly coupled • Meteorology data will require datacentric grid computing in the future • Typical e-Science characteristics • Huge data size • Poor data site accessibility • Experts are spread over the country/world • Basically all are based on reliable networks • Exact computation of probabilistic network connectivity (one aspect of reliability measures) is theoretically hard • Fast approaches and good-enough approximation algorithms have been developed (to be published)
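The slides do not describe the approximation algorithms they mention, so the following is not the author's method. It is a plain Monte Carlo sketch of one standard way to approximate probabilistic connectivity: estimate the probability that two nodes stay connected when each link is independently up with a given probability. The graph, node names, and link probabilities are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str, float]  # (node, node, probability that the link is up)


def connected(up_links: Sequence[Tuple[str, str]], source: str, target: str) -> bool:
    """Depth-first search over the links that survived in one sample."""
    adjacency: Dict[str, List[str]] = {}
    for u, v in up_links:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    seen = {source}
    stack = [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def estimate_reliability(edges: Sequence[Edge], source: str, target: str,
                         samples: int = 10_000, seed: int = 0) -> float:
    """Fraction of random link-failure samples in which source reaches target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        up = [(u, v) for u, v, p in edges if rng.random() < p]
        if connected(up, source, target):
            hits += 1
    return hits / samples


# Hypothetical network: two independent two-hop paths from "a" to "d".
edges = [("a", "b", 0.9), ("b", "d", 0.9), ("a", "c", 0.9), ("c", "d", 0.9)]
estimate = estimate_reliability(edges, "a", "d")
exact = 1 - (1 - 0.9 * 0.9) ** 2  # closed form exists only because the graph is tiny
print(f"estimated: {estimate:.4f}, exact: {exact:.4f}")
```

For this small series-parallel graph the exact value is easy to write down, which makes it a convenient check; on general graphs no such shortcut is available, which is why sampling or other approximations are attractive.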