eResearch at CSIRO within the National Collaborative Research Infrastructure Strategy IBS PDIW Workshop: Canberra 23 April 2010. Dr Darrell Williamson, eResearch Director. eResearch (AU) = eScience (EU) = Cyberinfrastructure (US). Overview of Presentation. eResearch Challenge - in IBS
eResearch at CSIRO within the National Collaborative Research Infrastructure Strategy
IBS PDIW Workshop: Canberra 23 April 2010
Dr Darrell Williamson, eResearch Director
Overview of Presentation
• eResearch Challenge - in IBS
• CSIRO & NCRIS Capabilities
• eResearch Challenge - in Geophysical Sciences
• AuScope - An eResearch capability in geosciences developed for a CSIRO Flagship & an NCRIS Capability
• Data Storage
eResearch Challenge: Integrated Biological Systems Research
Enable advances in biology-based scientific research by enabling researchers to:
• Preserve, with shared access to, long time series of scientific data spanning many biological systems science disciplines.
• Access, annotate & analyse large-scale, distributed datasets that conform to world-standard data formats & international discipline-based standard metadata schemas.
• Ingest, manage, annotate, analyse, share & publish their own data.
• Develop & integrate modelling, simulation & visualisation tools on high-end computing facilities.
• Use complex scientific workflows that automate research tasks.
• Remotely manage & operate facilities, instruments & sensor networks.
NCRIS: 12 Capabilities
• Evolving Biomolecular Platforms & Informatics
• Integrated Biological Systems
• Characterisation
• Fabrication
• Biotechnology Products
• Networked Biosecurity Framework
• Optical & Radio Astronomy
• Integrated Marine Observing System
• Structure & Evolution of the Australian Continent
• Terrestrial Ecosystem Research Network
• Population Health Research Network
• Platforms for Collaboration
CSIRO: Visualisation of Large Data Sets
• 100 Mpixel image
• 24 Mpixel image
• Around 4 Mpixels resolution per 30” screen (2560 x 1600)
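The per-screen figure on this slide can be checked with simple arithmetic. The sketch below is illustrative only: it assumes 1 Mpixel = 1,000,000 pixels and that an image is tiled at native resolution with no overlap or bezel compensation. The function name `screens_needed` is hypothetical, and the slide does not state how many screens the CSIRO display actually used.

```python
import math

# Screen resolution quoted on the slide (30-inch display).
SCREEN_WIDTH = 2560
SCREEN_HEIGHT = 1600
PIXELS_PER_SCREEN = SCREEN_WIDTH * SCREEN_HEIGHT  # 4,096,000 pixels, i.e. about 4 Mpixels


def screens_needed(image_megapixels: float) -> int:
    """Smallest number of screens whose combined pixel count covers the image.

    Assumption: pure pixel-count comparison, ignoring aspect ratio and layout.
    """
    return math.ceil(image_megapixels * 1_000_000 / PIXELS_PER_SCREEN)


if __name__ == "__main__":
    print(f"Pixels per screen: {PIXELS_PER_SCREEN / 1e6:.3f} Mpixels")
    for size in (24, 100):
        print(f"{size} Mpixel image: at least {screens_needed(size)} screens")
```

Under these assumptions a 24 Mpixel image needs at least 6 such screens and a 100 Mpixel image at least 25, which is why images of this size call for tiled displays or pan-and-zoom viewing.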
NCRIS: 11 + 1 = 12 Capabilities
• Evolving Biomolecular Platforms & Informatics
• Integrated Biological Systems
• Characterisation
• Fabrication
• Biotechnology Products
• Networked Biosecurity Framework
• Optical & Radio Astronomy
• Integrated Marine Observing System
• Structure & Evolution of the Australian Continent
• Terrestrial Ecosystem Research Network
• Population Health Research Network
• Platforms for Collaboration
NCRIS Capability: Platforms for Collaboration
• Australian Academic & Research Network (AARNet)
• Australian Access Federation (AAF)
• Australian National Data Service (ANDS)
• Australian Research Collaboration Service (ARCS)
• National eResearch Architecture Taskforce (NeAT)
• National Computational Infrastructure (NCI)
ANDS: CSIRO Meta-Data Integration Module
Collects & integrates meta-data from online sources.
Data collection supports various formats & source types:
• HTTP / FTP / SOAP / REST / JDBC / Filesystem
• XML-RDF / OAI-PMH / CSV / HTML / Proprietary formats
Collected data is mapped to a domain-specific ontology.
Original and mapped data are stored in a Fedora repository.
Data can be queried through:
• Full text queries
• Structured XML queries
• Semantic queries (SPARQL)
Data can be retrieved through:
• SOAP API
• REST API
ANDS: Application in Atlas of Living Australia Biodiversity Information Explorer
Diagram components: ALA Biodiversity Explorer; SOLR / Lucene; Triple Store; Data Integration Module; Fedora
ANDS WRON Data Management
To develop technologies to capture public domain science research data, such as data from the CSIRO Water Resources Observation Network (WRON), and to populate the Australian Research Data Commons (ARDC) so that the data resources can be accessed and re-used easily and efficiently.

Overview
CSIRO has key water research data holdings of national significance which are not well publicised within the science communities. CSIRO, in collaboration with the Australian National Data Service (ANDS), has made a commitment to populate the Australian Research Data Commons (ARDC). In this way, CSIRO will be able to ensure that the public data it produces is made available and becomes more readily accessible for greater scientific and general community collaboration.

Data Collections
The sample data collections for this project will be from the Sustainable Yields Project from the Water for a Healthy Country Flagship. This is a collaboration between the Land and Water and the Marine and Atmospheric Research Divisions of CSIRO. The Sustainable Yields Project comprises four significant data collections and a further collection populated from remote sensors. They are:
• Catchment Yield
• Groundwater Modelling
• Water Accounting and Environment
• River Modelling

Project outputs (tangible deliverables)
The ANDS WRON Data Management project will produce the following high-level outputs:
• Data cleansing support
• Licence tracking middleware
• Converted and pre-processed archive data
• OAI-PMH Capability
• Establishment of Access environment
• ANDS Persistent Identifier (PID) support
• Storage architecture implementation coordination
• Current archive copied to new architecture
• NETCDF metadata harvesting support
• GeoNetworks feed
• Embargo support
• Data collection and structure analysis
• Current Environment and Technology Analysis
• Integration with other ANDS services
• System and User Documentation
• Transition to Business as Usual

What are the expected project outcomes?
The ANDS WRON Data Management project outcomes are:
• Water research data of national significance will be discoverable and accessible to the broader research community for re-use.
• Continued access to and preservation of the data, and the ability to curate the data into the future.
• Technologies associated with the capture and translation of the data will be available for transfer to support other data capture.
• Provision of the data will support the ANDS initiative of Public Sector Data Access Infrastructure.

What is the approach?
The ANDS WRON Data Management project has a staged approach consisting of the following stages:
• Stage 1 - Initiation: Scoping, project plan and approval.
• Stage 2 - Discovery: Identification of key data sets, standard data formats, metadata schemas, and high-level requirements for translation tools and software.
• Stage 3 - Implementation: Development of software and translation tools, incorporation of new metadata schemas and data formats into the CSIRO metadata repository, enabling the interface between the ARDC and the CSIRO metadata repository, testing all components, and approval for release into production.
• Stage 4 - Access: Population of the CSIRO metadata repository with datasets from the Sustainable Yields project and the subsequent harvest and population of the ARDC and the WRON.
• Stage 5 - Re-use: Expand the data sets populated to ARDC and WRON and re-use software to cater for a wider range of WRON data sets.
• Stage 6 - Closure: Transfer of technologies and knowledge and sign-off by the project sponsor.

Who is involved?
• CSIRO Land and Water & CSIRO Marine and Atmospheric Research are prime Data Custodians and Data Re-use candidates. They will produce and consume research data, provide context and metadata to the data set or collections, and maintain the instruments and hardware to produce raw data.
• CSIRO Information Management & Technology (IM&T) is responsible for the technical development, testing, maintenance and support for ANDS WRON Data Management; for delivering the Data Management Service; and for the design, technical development, testing, maintenance and support for the CSIRO metadata repository. IM&T will liaise with ANDS, ARCS and NCRIS and will publish the data into ARDC.
• Research communities (internal and external to CSIRO) will consume data produced by WRON for their research and will also collaborate with WRON.

What technology and resources will be used?
The resources available within CSIRO that will be used with this project are:
• Existing Fedora Repository to store WRON metadata for harvesting
• Existing networks and data storage infrastructure
• Data Management and Technical resources experienced in CSIRO Data Management and in the CSIRO Fedora repository
• CSIRO resources experienced in collecting and working with WRON Data

V0.2 18/11/09
CSIRO: Capabilities & Path-to-Impact – ‘The Matrix’
10 x National Research Flagships - responsible for Path-to-Impact
5 x Capability Groups - responsible for domain knowledge & capabilities:
• Agribusiness
• Energy
• Environment
• Information & Communications
• Manufacturing, Materials & Minerals
CSIRO: National Flagships – Path-to-Impact
• Climate Adaptation
• Light Metals
• Sustainable Agriculture - IBS
• Energy Transformed
• Minerals Down Under
• Water for a Healthy Country
• Food Futures - IBS
• Preventative Health - IBS
• Wealth from Oceans
• Future Manufacturing
eResearch Challenge: Geophysical Sciences Research (Solution via AuScope)
Enable advances in geophysics-based scientific research by enabling researchers to:
• Preserve, with shared access to, long time series of scientific data spanning many geophysical science disciplines.
• Access, annotate & analyse large-scale, distributed datasets that conform to world-standard data formats & international discipline-based standard metadata schemas.
• Ingest, manage, annotate, analyse, share & publish their own data.
• Develop & integrate modelling, simulation & visualisation tools on high-end computing facilities.
• Use complex scientific workflows that automate research tasks.
• Remotely manage & operate facilities, instruments & sensor networks.
CSIRO: National Flagships – Path-to-Impact
• Climate Adaptation
• Light Metals
• Sustainable Agriculture - IBS
• Energy Transformed
• Minerals Down Under - AuScope
• Water for a Healthy Country
• Food Futures - IBS
• Preventative Health - IBS
• Wealth from Oceans
• Future Manufacturing
NCRIS: 12 Capabilities
• Evolving Biomolecular Platforms & Informatics
• Integrated Biological Systems
• Characterisation
• Fabrication
• Biotechnology Products
• Networked Biosecurity Framework
• Optical & Radio Astronomy
• Integrated Marine Observing System
• Structure & Evolution of the Australian Continent - AuScope
• Terrestrial Ecosystem Research Network
• Population Health Research Network
• Platforms for Collaboration
Scientific Workflow: Research developments in the Geosciences
Diagram labels: Geological information; 3D numerical model; Run simulation; Theory; Storage; Post processing
Activities:
- geological data integration
- scientific theory development
- technology development
- continuous improvement
AuScope: Infrastructure System
Diagram labels: Acquisition Groups; Analysis & Synthesis; Integration; Access
Instrument labels: Spectrometer; Linescan camera; Control computer; Telescope; Profilometer; Robotic x/y table; Cooler
1. Geophysical data: MT, seismic
2. GPS data
3. Geochem, Geochron data
4. Hyperspectral data
AuScope: Standardised Information Models
• Not a storage problem…
• Exchange
• Semantics and structure
• GeoSciML, OGC
• Tool support
• Creation and validation
Diagram label: Geography Markup Language
AuScope: Geoscience Network - Data types
Structured
Point
• GPS
• Mineral Occurrence
• Geochron
Curve (1D)
• Well log
• Geophys Profile
• Flight line
Surface (2D)
• Geological Map
• Cross section
• Swath
Solid (3D)
• 3D Geological Model
• Lidar cloud
Unstructured
Large Volume Binary Files
• Hyperspectral data
• Geophysical data
• Satellite data
• BLOBs
AuScope: Based on a Spatial Information Services Stack
Discovery Layer: Analysis Workflow; Discovery Portal
Exchange Layer: Community Agreed Service Interfaces and Information Models; Vocabulary Service; Service Registry; Web Feature Service (WFS) with Application Schemas; URN Resolver Service
Resources: Geological Survey; Government Department Data; AuScope service catalog; Standard Vocabularies
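To make the exchange layer more concrete, the sketch below shows how a client might build a Web Feature Service `GetFeature` request. It is an assumption-laden illustration rather than AuScope code: the endpoint `https://example.org/wfs` and the helper name `get_feature_url` are hypothetical, and `gsml:MappedFeature` is used only as a plausible GeoSciML feature type. The query parameters follow the OGC WFS 1.1.0 key-value encoding.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_feature_url(endpoint: str, type_name: str, max_features: int = 10) -> str:
    """Build a WFS 1.1.0 GetFeature request using key-value-pair encoding."""
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,         # feature type from the application schema
        "maxFeatures": max_features,   # cap on the number of features returned
    }
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical geological survey endpoint and illustrative feature type.
    url = get_feature_url("https://example.org/wfs", "gsml:MappedFeature", 5)
    print(url)
    # Decode the query string again to show the parameters the server receives.
    print(parse_qs(urlsplit(url).query))
```

Because every survey exposes the same service interface and information model, a portal can issue the same request to each registered endpoint and merge the responses, which is the point of agreeing on the exchange layer.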
Precision Agriculture: Spatial Information Services Stack – a NeAT Project
• Spatial prioritisation of catchment incentives
• Regional scale climate analyses
CSIRO: Data Storage - Consolidation
Promising technologies are:
• WAN Optimisation
• Desktop Virtualisation (e.g. Citrix)
Some key questions:
• How will desktop virtualisation work with 3D modelling and visualisation?
• How will data be accessed?
• How will complex workflows be managed in a virtual desktop?
• How will data be made available to a virtual desktop as well as to remote users?
Diagram labels: Data Centre; File Server; Virtual Desktop Server; WAN Optimization; CSIRO Office; WAN Optimization; User’s Workstation
Diagram caption: Processing is moved from the workstation back to a central server
NCRIS: Data Storage – Consolidation
The $50m dilemma!

Model 1 – New Peak National Capability
A ‘new’ national facility is created to be the Australian peak research data service.
• The envisaged service could be based (in a minimal configuration) around two physical sites supporting a single fully replicated data service open to all researchers.
• Because data is held remotely, some form of operating cost contribution would be required from data contributors, subscribers or sector participants.
• The cost and funding model would need sector agreement and some risk would be present for the operators. However, the cost for the volume of data envisaged would be significantly less than any possible in-house solution because of the substantive EIF funds and because of the large economy of scale factors.

Model 2 – Regional Strength
Regionally focussed services would be developed on the basis of existing regional associations. A particular advantage could be gained from building on associations in which state governments have an interest, as this may assist state government agreement to co-locate copies of state generated research related data.

Model 3 – Industry Partnerships
The sector would work with commercial suppliers to build all or part of the required infrastructure, whilst retaining the provision of an appropriate interface layer within the sector. This approach could contribute to either the new peak national capability model, or the regional store model, described as Model 2.
NCRIS: Data Storage – Consolidation
The $50m dilemma!

Essential Requirements
• The establishment of a sector based governance, management and implementation mechanism appropriate to the growing importance of research data retention across the sector that is capable of addressing longer term issues, beyond the life of this funding.
• The establishment of a process that has sector support, to identify data sets and collections that will be inputs to future research activities, and to focus and apportion resource allocation using the National Research Priorities and national research infrastructure priorities (as identified in the Strategic Roadmap for Australian Research Infrastructure).

Issues to consider in the development of criteria are:
• What data will be re-used by the research community?
• What data sets make up the inputs to research?
• Where are the relevant data sets sourced from and how?
• For what period and with what access rights should data be retained?
• What happens at the end of the retention period?
Questions?
Dr Darrell Williamson, eResearch Director
Email: darrell.williamson@csiro.au