This article discusses the advanced cyber-infrastructure of the Chinese Academy of Sciences (CAS) and its applications in data-intensive e-Science. It covers the challenges and opportunities of managing the scientific data deluge, the national scientific data sharing initiatives, and the advanced cyber-infrastructure for the data lifecycle in CAS.
The 4th China-US Roundtable on Scientific Data Cooperation
Advanced Cyber-infrastructure for Scientific Data Applications in CAS
Tieniu TAN, Deputy Secretary-General, Chinese Academy of Sciences (CAS)
29 Mar. 2010, Irvine, USA
Outline • Background • Advanced Cyber-Infrastructure in CAS • Typical Data-Intensive e-Science Applications in CAS • Conclusion
Scientific Data Deluge • Scientists face a data deluge • Vast volumes of scientific data captured by large scientific facilities, ubiquitous sensors, new instruments and computer models • Science and engineering research have become increasingly data-intensive • New scientific opportunities are emerging from increasingly effective data organization, access and usage (NSF, 2007)
Data-intensive Scientific Discovery: e-Science • The fourth paradigm: data-intensive scientific discovery (Microsoft, 2009) • A transformed scientific method • e-Science is a synthesis of information technology and science, giving priority to the scientific data lifecycle and data exploration (Jim Gray) • Data captured by instruments or generated by simulators; processed by software; information/knowledge stored in computers; scientists analyze databases/files using data management and statistics
China National Scientific Data Sharing Initiatives • The Ministry of Science and Technology (MOST) started the implementation of the Scientific Data Sharing Program (SDSP) in 2002 • Supporting almost 20 projects to promote scientific data sharing • The National Science & Technology Infrastructure (NSTI) was launched in 2005 by MOST and the Ministry of Finance (http://www.escience.gov.cn) • Supporting 38 projects for promoting science and technology resources, data and information sharing, and open access • Total funding ~2 billion RMB
Advanced CI for the Data Lifecycle in CAS
• Generation & collection: field observation stations, large scientific facilities, others
• Transmission: high-speed networks (CSTNET, CSTNET-CNGI, GLORIAD)
• Storage & curation: data centers for storage & preservation, curation, sharing and service
• Computing & analysis: supercomputing grid for computing, analysis, mining and visualization
• Application: data-intensive e-Science applications
Information streams flow between all stages of the lifecycle.
Data Generation • Large scientific facilities produce huge volumes of data • 20+ in operation • 20+ under construction • Long-term field observation stations • 100+ stations covering ecology, environment, space, etc. • Other research data, including experiments, modeling, computing, etc. • 100 institutes, more than 50,000 researchers in CAS
Networked Field Observation • Networks expanded to link field observation stations for real-time data collection • CERN: Chinese Ecosystem Research Network • Disaster and environment observation • Astronomy and space observation
Meridian Space Weather Monitoring Program • More than 10 TB of data will be generated and transmitted to Beijing per year • Data analysis needs ~20 Tflops • A data system and processing infrastructure is being built
Cosmic-ray Observatory: ARGO/ASγ • Cosmic-ray observatory at Yangbajing in Tibet • ARGO: China-Italy collaboration • ASγ: China-Japan collaboration • ~200 TB raw data per year • Data transferred from YBJ-ARGO and processed at IHEP and INFN • Reconstructed data accessible by collaborators
BEPCII / BESIII • BEPC: Beijing Electron-Positron Collider • Upgrade: BEPCII/BESIII, operational in 2008 • 2.0~4.6 GeV/c • Luminosity (3~10)×10³² cm⁻²s⁻¹ • 36 institutions from China, the US, Germany, Russia and Japan • 4000+ kSI2K for data processing and physics analysis • 5+ PB in five years
Data Transmission: High-Speed Network • China Science and Technology Network (CSTNET) • A non-profit academic and research network in China supporting advanced science applications and research on the next-generation Internet • Connects some 200 institutes and 1,000,000 end users
CSTNET Backbone
[Backbone map: Beijing hub linked to cities including Harbin, Changchun, Shenyang, Dalian, Tianjin, Shijiazhuang, Qingdao, Shanxi, Xining, Lanzhou, Xinjiang, Xian, Nanjing, Shanghai, Hefei, Wuhan, Chengdu, Lhasa, Yangbajing, Ningbo, Changsha, Fuzhou, Guiyang, Kunming, Shenzhen, Guangzhou, Xishuangbanna, Hong Kong and Taiwan; link speeds of 2.5 Gb/s, 1 Gb/s, 155 Mb/s and below]
Interconnecting with Other Networks
[Interconnection map: CSTNET in Beijing peers with GLORIAD (10G/2.5G toward the USA, Russia and the Netherlands), KISTI Korea (1G), NICT Japan (1G), CERNET (155M), ChinaNet/China Telecom (2.5G/2G), China169/China Unicom (1G/700M); via Hong Kong with HKOEP, AS Hong Kong, HKIX, CUHK and Google Hong Kong (2G/2.5G/1G/155M); domestic exchange through BJNAP to the Internet]
CSTNET-CNGI • An IPv6 network for science based on CSTNET; construction starts this year • 10 Gbps backbone with a 10G international link • Nodes include Beijing, Shanghai, Nanjing, Hefei, Wuhan, Xi'an, Lanzhou, Chengdu, Kunming, Guangzhou, Jilin, Liaoning, Xinjiang and Yangbajing • Connecting 100+ institutes, 40+ field stations and big science facilities, plus computing and storage facilities
Data Storage and Curation • A general scientific data center • Common data infrastructure construction and operation • Data archiving and preservation • Some domain-specific scientific data centers • Discipline data curation and sharing services • A CAS scientific data application project • Multi-discipline data sharing and applications • A series of domain-based scientific data sharing systems and institute-level data sharing infrastructure
Data Resource Center
A general scientific data center: a new organization responsible for data preservation, curation and access services in CAS
• Long-term preservation of important data
• Network storage space and mass data backup
• Mass data analysis and processing environment
• Technology, application and online data services
• Management system and staff, working with collaborators
Massive Storage System in the Data Resource Center
• Scientific data archive system (5 PB tape): archiving and curation
• Online data storage system (1 PB disk array): online data access and analysis
• Internet-based service (cloud service)
• Data backup
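The two storage tiers above (online disk for active analysis, tape for long-term archiving) imply a placement policy. A minimal sketch of such a policy, with all names and the 90-day threshold purely hypothetical rather than taken from the CAS system:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """A dataset tracked by the storage system (illustrative only)."""
    name: str
    size_tb: float
    days_since_access: int

def choose_tier(ds: Dataset, archive_after_days: int = 90) -> str:
    """Hot data stays on the disk array for online access and analysis;
    cold data migrates to the tape archive for long-term preservation."""
    if ds.days_since_access > archive_after_days:
        return "tape-archive"
    return "online-disk"

hot = Dataset("argo_2010_raw", 200.0, days_since_access=5)
cold = Dataset("argo_2008_raw", 180.0, days_since_access=400)
print(choose_tier(hot))   # online-disk
print(choose_tier(cold))  # tape-archive
```

A real curation system would also consider dataset size, replication and access patterns; the sketch only shows the disk/tape split.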
Domain-Specific Scientific Data Centers • World Data Center (World Data System) in CAS • Natural Resource Environment Data Center • Astronomy Data Center • Space Data Center • Geophysics Data Center • Glacier and Frozen Earth Data Center
Scientific Databases (SDB)
• A long-term mission started in 1986, funded by CAS
• "Data from research, for research"
• Collecting multi-discipline research data and promoting data sharing
• More than 350 research databases and 400 datasets from 61 institutes
• Over 60 TB of data available for open access and download
• http://www.csdb.cn
Scientific Databases (cont.) • 2 reference databases • China Species • Compounds • 4 application-oriented databases • High Energy (ITER) • Western Environment Research • Ecology Research • Qinghai Lake Research • 8 resource databases • Geo-Science • Biodiversity • Chemistry • Astronomy • Space Science • Microbiology and Virus • Materials Science • Environment
CAS Scientific Data Grid
• Integrating distributed scientific data into a comprehensive service and application environment
• Linking all data centers as a data net
Architecture: Scientific Data Grid applications → domain gateways (Chemistry, Bioscience, Geosciences and other gateways) → Scientific Data Grid middleware → scientific data and databases
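The gateway/middleware split above can be sketched as a small registry: each domain gateway resolves a query through common middleware that knows which member databases serve that discipline. All names below are hypothetical, not the actual SDB catalog:

```python
# Hypothetical registry mapping domain gateways to member databases.
REGISTRY = {
    "chemistry": ["compound_db"],
    "bioscience": ["species_db", "microbe_db"],
    "geosciences": ["resource_env_db"],
}

def route_query(domain: str, query: str) -> list:
    """Fan a gateway query out to every member database registered
    for that domain, as a data-grid middleware layer might."""
    sources = REGISTRY.get(domain)
    if sources is None:
        raise KeyError(f"no gateway registered for domain {domain!r}")
    return [f"{db}:{query}" for db in sources]

print(route_query("bioscience", "genus=Panthera"))
# ['species_db:genus=Panthera', 'microbe_db:genus=Panthera']
```

The point of the middleware layer is exactly this indirection: applications talk to a gateway, never to individual databases.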
Scientific Computing Grid (CNGrid environment)
• SCCAS: 120+ Tflops computing capacity (Lenovo 7000, peak 143 TFlops)
• 8+ branch centers: ~50 Tflops common computing capacity
• Institute computing resources: ~50 Tflops common computing capacity
• Uniform regulations; uniform system operating, supporting and service
• Resources interconnected and accessed through the network by local/remote users, alongside other network resources and environments
• Applications: databases, e-Science, ARP, websites, TRP
Scientific Computing Grid stack: Windows/Linux clients → Web portal → Grid middleware → HPC, clusters, workstations, storage
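The portal-to-middleware path above hides resource selection from the user. A toy sketch of that dispatch step, with resource names and load figures invented for illustration:

```python
# Hypothetical backend resources with current load (0.0 = idle, 1.0 = full).
RESOURCES = {
    "hpc_cluster": 0.8,
    "workstation_pool": 0.3,
    "branch_center": 0.5,
}

def submit_job(job_name: str) -> str:
    """Dispatch a job submitted via the web portal to the least-loaded
    backend, as uniform grid middleware might."""
    target = min(RESOURCES, key=RESOURCES.get)
    return f"{job_name} -> {target}"

print(submit_job("besiii_reco_0042"))  # besiii_reco_0042 -> workstation_pool
```

Production middleware (e.g. gLite, mentioned on the next slide) uses far richer matchmaking, but the principle of the single uniform entry point is the same.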
HEP Grid in China • Access to LHC data for scientific research: a grid computing system built in CAS • WLCG MoU signed with CERN in 2006 to build a Tier-2 center at IHEP for both the ATLAS and CMS experiments • Participating sites: IHEP, PKU, SDU, USTC, NJU
Tier-2 Site at IHEP • WLCG site based on EGEE/gLite • Associated with CC-IN2P3 in Lyon • Worker nodes with 1600 cores • 400 TB disk space
Typical Data-Intensive e-Science Applications • Developing a series of pilot e-Science applications • Most are data-intensive
HEP Grid Applications: ATLAS Monte Carlo Study
[Plots: tracks with pT > 20 GeV/c; ttH two-lepton selection; ttH(2l2b4j2) full-simulation event display; ttbar background mimicking ttH→WW]
HEP Grid Application: Protein Structure Prediction • Explore the non-natural protein sequence space • Set up a massive protein structure prediction environment • Develop web tools for the biology community • Result of the EUChinaGrid project (EU FP6)
[Figure: Rosetta early/late-stage prediction for a sample 70-residue sequence]
ChinaFLUX Built in 2002 for climate change and environment research
ChinaFLUX e-Science Environment: observation system → data transmission → data system → modeling and visualization
Cyberinfrastructure for Data Collection • Data flows in real time from sensors to field stations, then to institutes, and finally to data centers, where it is processed and shared
Data-Intensive Application Environment • Data synthesis and integration • Data analysis and modeling • Visualization
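The four-hop collection path described above (sensor → field station → institute → data center) can be sketched as a simple relay; the station name and flux value below are made up for illustration:

```python
# Hypothetical hops in the ChinaFLUX-style collection path.
HOPS = ["sensor", "field_station", "institute", "data_center"]

def relay(reading: dict) -> dict:
    """Annotate a reading with each hop it passes through before it is
    processed and shared at the data center."""
    reading["path"] = list(HOPS)
    reading["status"] = "shared"
    return reading

obs = relay({"site": "flux-forest-01", "co2_flux": 0.42})
print(obs["path"])
# ['sensor', 'field_station', 'institute', 'data_center']
```

Each real hop would add buffering, quality control and retransmission; the sketch only captures the topology.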
Conclusion
Building an Open Science Cloud serving not only CAS researchers but also the wider scientific community!
• SaaS: software and tools for data curation, analysis, mining and visualization
• PaaS: data-intensive application environment
• DaaS: scientific data and database services
• IaaS: network, computing and storage services