Cyberinfrastructure: Promises and Challenges
Paul Messina, Argonne National Laboratory, CERN, and USC-ISI
October 14, 2003
(Cyber) infrastructure • The term infrastructure has been used since the 1920s to refer collectively to the roads, bridges, rail lines, and similar public works that are required for an industrial economy to function • The recent term cyberinfrastructure refers to an infrastructure based upon computer, information, and communication technology that is (increasingly) required for the discovery, dissemination, and preservation of knowledge • Traditional infrastructure is required for an industrial economy • Cyberinfrastructure is required for an information economy Paul Messina
21st Century Science & Engineering • The threefold way • theory • experiment • computational simulation • Supported by • multimodal collaboration systems • distributed, multi-petabyte data archives • leading-edge computing systems • distributed experimental facilities • internationally distributed multidisciplinary teams [Diagram: Theory, Experiment, Simulation triangle] Paul Messina
e-Science and Information Utilities (John Taylor, Head of the UK Research Councils) e-Science • science is increasingly done through distributed global collaborations between people, enabled by the internet • using very large data collections, terascale computing resources and high-performance visualisation • derived from instruments and facilities controlled and shared via the infrastructure • Scaling ×1000 in processing power, data, bandwidth Paul Messina
On-demand creation of powerful virtual computing and information systems (CI) through Grids • Web: uniform access to HTML documents and software catalogs • Grid: flexible, high-performance access to all significant resources: computers, data archives, sensor nets, colleagues Paul Messina
Components of cyberinfrastructure-enabled science & engineering • High-performance computing for modeling, simulation, data processing/mining • Instruments for observation and characterization • Facilities for activation, manipulation and construction • Knowledge management institutions for collection building and curation of data, information, literature, digital objects • People (individuals and groups), linked to these resources and to the physical world through global connectivity, interfaces & visualization, and collaboration services Paul Messina
Information-Rich Astronomy • As in most sciences, the amount and complexity of data in astronomy is increasing exponentially, roughly following Moore's law • Increasingly, most data are taken through large, digital sky surveys (multi-TB, soon multi-PB), over the entire available electromagnetic spectrum • Current holdings in "organized" archives are > 100 TB, with much more data distributed and stored in a disorganized fashion • This is changing the way observational astronomy is done: • pure survey science • optimal target selection for large telescopes and space observatories • etc. Paul Messina
Ongoing Mega-Surveys: MACHO, 2MASS, SDSS, DPOSS, GSC-II, COBE, MAP, NVSS, FIRST, GALEX, ROSAT, OGLE, ... • Large number of new surveys • Multi-terabyte in size, 100 million objects or larger • In databases • Individual archives planned and under way • Multi-wavelength view of the sky • Coverage in more than 13 wavelengths within 5 years Paul Messina
Crab Nebula in 4 spectral regions: X-ray, optical, infrared, radio Paul Messina
The Changing Style of Observational Astronomy Paul Messina
What Will a Virtual Observatory Do? • The VO represents both the core information infrastructure for the new astronomy and a general research environment in the era of information abundance • Types of VO-based astronomy may include: • Statistical astronomy done right (e.g., precision studies of Galactic structure, with large numbers of sources making the Poissonian errors unimportant) • Exploration of new domains of the observable parameter space (e.g., the time-variable universe, the low surface brightness universe, etc.) • Searches for rare or new, previously unknown types of objects or phenomena (e.g., brown dwarfs, high-redshift quasars, … ??) Paul Messina
The VO Vision • One can think of the VO as a genuine observatory that astronomers use from their desks. • It will supply the • digital archives, • metadata management tools, • data discovery and access services, • application programming interfaces, • ancillary information services, and • computational services. Paul Messina
Discoveries made possible by federated data collections • Easy access to data from different observations of the same object enables detection of objects an order of magnitude fainter than currently possible • We can go fainter in image space because the combined images contain more photons and because the multiple detections can be used to enhance the reliability of sources at a given threshold • A group of faint pixels may register in a single wavelength at the two-sigma level (it might be noise) • If the same pixels are at two sigma in other surveys, the overall significance may be boosted to five sigma Paul Messina
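To make the last point concrete, here is a minimal sketch of how independent detections combine, under the simplifying assumption that each survey's measurement is an independent Gaussian estimate of the same source, so significances add in quadrature; the numbers are illustrative only.

import math

def combined_significance(sigmas):
    # Independent Gaussian detections of the same source add in quadrature.
    return math.sqrt(sum(s * s for s in sigmas))

print(combined_significance([2.0]))        # 2.0 sigma: likely just noise
print(combined_significance([2.0, 2.0]))   # ~2.8 sigma: still marginal
print(combined_significance([2.0] * 6))    # ~4.9 sigma: several surveys together approach a 5-sigma detection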
Virtual Sky: Image Federation • http://virtualsky.org/ • From Caltech CACR, Caltech Astronomy, and Microsoft Research • Virtual Sky has 240,000,000 tiles (250 GB) • Users can change scale and theme, e.g., view the Coma cluster in the optical (DPOSS) or X-ray (ROSAT) theme Paul Messina
VO Interoperability • Directory white/yellow pages • VO schema repository • How to publish, how to define content • Action: document schemas for • archive content • web service capability • table/image/spectrum • How to publish algorithms • Plug-and-play data services • SOAP, UDDI, WSDL, Jini? • Semantic web • “Question and answer in the language of the client” • Topic maps Paul Messina
VO and Interoperability • What is the standard interaction with a catalog service? • The client application (e.g., OASIS) asks the service for the available catalogs, then for their attributes, makes a query, and then displays the result • We want this to be dynamic: it should be possible to add catalogs and have existing applications still work Paul Messina
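As a sketch of that interaction, the fragment below walks a hypothetical catalog web service through the same four steps (list catalogs, describe one, query, display). The endpoint, method names, and response format are invented for illustration; they are not an actual VO standard or the OASIS tool's API.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://example.org/vo-catalog-service"   # hypothetical service endpoint

def call(method, **params):
    # Issue a simple HTTP GET and parse the (assumed) JSON response.
    url = f"{BASE}/{method}"
    if params:
        url += "?" + urlencode(params)
    with urlopen(url) as resp:
        return json.load(resp)

catalogs = call("listCatalogs")                                  # step 1: which catalogs exist?
attributes = call("describeCatalog", name=catalogs[0]["name"])   # step 2: what attributes do they have?
rows = call("query", catalog=catalogs[0]["name"],                # step 3: run a query built from those attributes
            constraint="ra > 180 AND dec < 0")
for row in rows:                                                 # step 4: display the result
    print(row)

Because the client only discovers catalogs and attributes at run time, a new catalog added to the service becomes usable without changing the application, which is the dynamic behavior the slide calls for.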
Effective communication is crucial • Interoperation of resources, whether by people or by systems, requires a consistent, shared understanding of what the information they contain means • “... people [and machines] can’t share knowledge if they don’t speak a common language” (Davenport) Paul Messina
Metadata and Terminology • Metadata is important • Data describing the content and meaning of resources • But everyone must speak the same language… • Terminologies provide • Shared and common vocabularies • For search engines, agents, curators, authors and users • But everyone must mean the same thing… • Projects and subdisciplines must develop common terminologies • Especially if the team is multidisciplinary Paul Messina
Ontologies are needed to communicate and understand • Dictionary definition of ontology: deals with the question of how many fundamentally distinct sorts of entities compose the universe • Ontologies provide shared and common understanding of a domain • Are essential for search, exchange and discovery Paul Messina
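A minimal sketch of the idea behind the last three slides: two hypothetical surveys use different column names for the same physical quantities, and a small shared vocabulary (a very shallow stand-in for a real ontology) lets records from both be federated. All names and values here are illustrative.

# Survey-specific column names mapped onto one shared vocabulary.
SHARED_TERMS = {
    "survey_a": {"RAJ2000": "pos.ra", "DEJ2000": "pos.dec", "Rmag": "mag.R"},
    "survey_b": {"ra": "pos.ra", "dec": "pos.dec", "r_band": "mag.R"},
}

def normalize(survey, record):
    # Translate a survey-specific record into the shared terms so that
    # downstream tools see one consistent schema.
    mapping = SHARED_TERMS[survey]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(normalize("survey_a", {"RAJ2000": 187.7, "DEJ2000": 12.4, "Rmag": 19.2}))
print(normalize("survey_b", {"ra": 187.7, "dec": 12.4, "r_band": 19.3}))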
Long-term VO Vision • One can imagine that astronomers applying for observation time on physical observatories will first have to carry out observations in the Virtual Observatory • Astronomy will be democratized • “any” astronomer will have access to “all” the data Paul Messina
VOs are international • There are many astronomy data collections in the UK, Europe, Japan, etc. • And there are projects funded to create VOs Paul Messina
A Few Examples of Other Projects Paul Messina
BIRN is an NIH project to establish a Biomedical Informatics Research Network Paul Messina
Each brain represents a lot of data, and comparisons must be made between many brains • We need to get to one-micron resolution to know the location of every cell • We are just now starting to get to 10 microns Paul Messina
Digital Radiology (Hollebeek, U. Pennsylvania) • Hospital digital data: mammograms, X-rays, MRI, CAT scans, endoscopies, ... • Very large data sources; great clinical value in digital storage and manipulation, and significant cost savings • 7 terabytes per hospital per year, dominated by digital images • Why choose mammography • clinical need for film recall and computer analysis • large volume (4,000 GB/year, 57% of total) • storage and records standards exist • great clinical value in this application Paul Messina
Managing Large-Scale Data • Hierarchical storage and indexing • Highly distributed sources Paul Messina
Components of cyberinfrastructure-enabled science & engineering • High-performance computing for modeling, simulation, data processing/mining • Instruments for observation and characterization • Facilities for activation, manipulation and construction • Knowledge management institutions for collection building and curation of data, information, literature, digital objects • People (individuals and groups), linked to these resources and to the physical world through global connectivity, interfaces & visualization, and collaboration services Paul Messina
To create CIs that can support such applications, we need highly coordinated, persistent, major investment in… • Research and development • Base technology • CI components & systems • Science-driven pilots • Operational services • Distributed but connected (Grid) • Exploit commonality, interoperability • Advanced, leading-edge but… • Robust, predictable, responsive, persistent Paul Messina
Need highly coordinated, persistent, major investment in… • Domain science communities (CI in service of R&D) • Specific application of CI to revolutionizing research (pilot -> operational) • Required not optional. New things, new ways. • Education and broader engagement • Multi-use: education, public science literacy • Equity of access • Pilots of broader application: IT for Research Universities, industry, workforce & economic development Paul Messina
Data Repositories • Well-curated data repositories are increasingly important to science and engineering research, allowing data gathered and created at great expense to be preserved over time and accessed by researchers around the world, including researchers in other disciplines • Many, often very large, repositories • In different locations • For different disciplines • Curated by appropriate groups Paul Messina
Data Repositories R&D • Develop, distribute, and maintain • tools to organize and manage large repositories • appropriate standards that allow data to be self-documenting and discoverable through automated tools • Important to ensure the interoperability necessary to incorporate data acquired in one discipline into applications serving other disciplines Paul Messina
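As a sketch of what "self-documenting" could look like in practice, the record below carries its own descriptive metadata (identifier, units, provenance) in machine-readable form so automated tools can discover and reuse it. The field names and values are illustrative, not any specific repository's schema.

import json

record = {
    "identifier": "example-dataset-0042",              # hypothetical identifier
    "title": "Optical sky survey plate scans, field 42",
    "creator": "Example Survey Team",
    "subject": ["astronomy", "sky survey", "optical"],
    "format": "FITS",
    "units": {"flux": "Jy", "ra": "deg", "dec": "deg"},
    "provenance": "Reduced with pipeline v1.3 on 2003-10-01",
}

# Because the description travels with the data in machine-readable form,
# an automated harvester can index and federate it without human intervention.
print(json.dumps(record, indent=2))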
Digital Libraries • Will contain (much more so in the future than today) our intellectual legacy • a fundamental resource for scientific and engineering research and engineering practice • Devise and implement new mechanisms for sharing, annotating, reviewing, and disseminating knowledge Paul Messina
Computational Centers • Very powerful computers • On-demand computing • Co-scheduling with other resources (computers, data repositories, visualization servers) • Visualization servers • Data archives • Application software • Skilled people Paul Messina
Networking and Connections • High-speed networks are a critical infrastructure for facilitating access to the large, geographically distributed computing resources, data repositories, and digital libraries. • The commodity Internet is clearly not up to the task for high-end science and engineering applications • especially where there is a real-time element (e.g. remote instrumentation and collaboration) Paul Messina
NSF BRP (Blue Ribbon Panel) budget recommendations, million $/yr
• Fundamental and applied research to advance CI: $60
• Research into applications of IT to advance scientific and engineering research: $100
• Acquisition and development of cyberinfrastructure and applications: $200
• Provisioning and operations of CI & applications: $660
  • Computational centers: $375
  • Data repositories: $185
  • Digital libraries: $30
  • Networking and connections: $60
  • Application service centers: $10
• Total: $1,020
Paul Messina
Technical challenges • Data volume and the resulting difficulty in • sending copies of subsets to many sites • storing in a safe way • finding data (requires development of dauntingly complex metadata) • Data management • Need a consistent and complete mechanism for data management • tools to manage storage access, data transfer, replica management, and file access from jobs. • Workflow management • allow jobs to move across grids and receive status and output Paul Messina
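One of the data-management pieces named above, replica management, can be sketched in a few lines: a replica catalog maps a logical file name to its physical copies, and the client picks the copy it can transfer fastest. The catalog contents, URLs, and bandwidth figures below are hypothetical.

# Hypothetical replica catalog: logical file name -> known physical copies.
REPLICA_CATALOG = {
    "lfn://survey/field42.fits": [
        {"site": "site-a", "url": "gsiftp://site-a.example.org/data/field42.fits", "mbps": 400},
        {"site": "site-b", "url": "gsiftp://site-b.example.org/data/field42.fits", "mbps": 80},
        {"site": "site-c", "url": "gsiftp://site-c.example.org/data/field42.fits", "mbps": 250},
    ],
}

def best_replica(logical_name):
    # Pick the physical copy with the best measured bandwidth to this site.
    return max(REPLICA_CATALOG[logical_name], key=lambda r: r["mbps"])

print(best_replica("lfn://survey/field42.fits")["url"])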
Heterogeneity makes interoperability difficult • Computing resources • Storage resources • Applications • Network speeds • Management domains • Policies, especially security mechanisms and policies Paul Messina
Distributed “everything” issues • Identifying the best resources available for the task at hand, in real time • computing and data resources are heterogeneous; a more difficult task than sharing electricity generated by different plants • Global access and global management of massive and complex data • Monitoring, scheduling, and optimization of job execution on a heterogeneous grid of computing facilities and networks • End-to-end networking performance Paul Messina
Dealing with dynamic requirements • On-demand computing • Resource identification • Fault tolerance • Virtual data (retrieve instead of recompute, unless it will cost less to recompute) • Bookkeeping of what has been computed, and what has not, in a global environment Paul Messina
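A minimal sketch of the virtual-data decision above: fetch an existing derived product unless recomputing it locally is cheaper, and keep a bookkeeping record of what exists. The cost model (transfer time versus quoted recompute time) and the catalog contents are hypothetical.

COMPUTED = {"spectra_v2": ("site-a", 1200)}   # result name -> (location, size in GB)

def obtain(name, recompute_hours, transfer_gbps=1.0):
    # Decide whether to retrieve a stored result or recompute it locally.
    if name in COMPUTED:
        location, size_gb = COMPUTED[name]
        transfer_hours = size_gb * 8 / (transfer_gbps * 3600)   # GB -> gigabits -> hours
        if transfer_hours <= recompute_hours:
            return f"fetch {name} from {location}"
    COMPUTED[name] = ("local", None)   # bookkeeping: remember what we now hold
    return f"recompute {name} locally"

print(obtain("spectra_v2", recompute_hours=48))    # transfer ~2.7 h < 48 h of compute: fetch it
print(obtain("spectra_v2", recompute_hours=0.5))   # transfer ~2.7 h > 0.5 h of compute: recompute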
Research challenges • Workflow & DB integration, co-optimized • Distributed queries on a global scale • Maintaining a global view of resources and system state • Real-time monitoring and error detection • Supporting dynamic workload and environment • Authorization, Resources, Data & Schema • Performance • Metadata for discovery, automation, repetition, … • Provenance tracking Paul Messina
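Provenance tracking, the last research challenge listed above, amounts to recording enough about how a derived product was made that the derivation can be discovered, audited, or repeated. The sketch below uses invented field values and an invented helper to illustrate the idea.

provenance = {
    "output": "lfn://survey/field42_catalog.fits",
    "inputs": ["lfn://survey/field42.fits", "lfn://survey/field42_flat.fits"],
    "transformation": "object-extraction pipeline",    # hypothetical code name
    "version": "2.1.3",
    "parameters": {"detection_threshold": 2.0, "deblend": True},
    "executed_at": "2003-10-01T14:22:00Z",
    "executed_on": "compute-node-17.example.org",       # hypothetical grid node
}

def can_repeat(record, available_inputs):
    # A derivation is repeatable only if all of its inputs are still available.
    return all(name in available_inputs for name in record["inputs"])

print(can_repeat(provenance, {"lfn://survey/field42.fits",
                              "lfn://survey/field42_flat.fits"}))   # True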
Managerial challenges • Transition from research prototype software to production software • And the relationship between the groups who develop new software and the groups charged with supporting an operational grid • Establishing mechanisms for sharing resources that are funded in part for other applications • Balancing the development of application-specific solutions against the use of “community” or commercial software wherever possible Paul Messina
Making it happen • NSF NMI • MAGIC • (Hopefully) NSF’s implementation of the CI report • EU’s e-Infrastructure, EGEE project • Global Grid Forum • OMII • Other national and world-region efforts Paul Messina
NSF Middleware Initiative (NMI) • The purpose of the NSF Middleware Initiative (NMI) is to design, develop, deploy and support a set of reusable, expandable middleware functions and services that benefit many applications in a networked environment, and which will • a) facilitate scientific productivity, • b) increase research collaboration through shared data, digital libraries, computing, code, facilities and applications, • etc. • A great program, but it needs to be funded at a much higher level Paul Messina
NSF’s Extensible Terascale Facility (ETF) • ETF goal: build and deploy very powerful distributed computational infrastructure for general scientific research • Tens of teraflops • 1+ petabytes of data • Visualization • Software tools • Operations and policies • Argonne, Caltech, NCSA, Pittsburgh, SDSC are linked • NSF just announced several new nodes: Indiana and Purdue universities, Oak Ridge National Laboratory and The University of Texas • not only computing resources, but also scientific (e.g., neutron-scattering) instruments and data collections Paul Messina
Middleware And Grid Infrastructure Coordination Committee (MAGIC) • US multi-agency committee established by the Large Scale Networking (LSN) Coordinating Group of the Interagency Working Group for Information Technology Research and Development in February 2002 • Recently produced the report Blueprint for Future Science Middleware and Grid Research and Infrastructure Paul Messina
Global Grid Forum • Need standards to create a global grid/CI • The Global Grid Forum (GGF) is a community-initiated forum • GGF's primary objective is to promote and support the development, deployment, and implementation of Grid technologies and applications via the creation and documentation of "best practices" - technical specifications, user experiences, and implementation guidelines • ~ 50 Working and Research Groups • ~ 50 sponsor organizations (many commercial companies) • Please see www.ggf.org for more information and how to get involved Paul Messina
International considerations • Cyberinfrastructure and grids must have international scope • People, applications, instruments, and data collections are geographically distributed across the globe • In some cases, other countries/regions are ahead of the US • UK e-Science started several years ago, is funded at £250M, and involves 80 commercial companies • The EU has funded ~ 20 projects Paul Messina
EU e-Infrastructure • Concept driven by the vision to provide a “one-stop shopping” service to researchers in Europe for accessing, on demand, the necessary IT resources (connectivity, computing, data, instrumentation…) for their work - a utility-driven concept • Concept based on co-ordinated shared use of resources across administrative, technology and application domains; provision of integrated communication and information processing services to the user; global collaborations; operational support... Paul Messina
EGEE project about to start - the next-generation EU Grid research infrastructure • European integration of resources (Picture courtesy: EGEE Consortium) Paul Messina