The Australian National Data Service
Ross Wilkinson
For eRSA
Thursday, October 22, 2009
Training to Climb an Everest of Digital Data
October 11, 2009
MOUNTAIN VIEW, Calif. It is a rare criticism of elite American university students that they do not think big enough. But that is exactly the complaint from some of the largest technology companies and the federal government. At the heart of this criticism is data. Researchers and workers in fields as diverse as bio-technology, astronomy and computer science will soon find themselves overwhelmed with information. Better telescopes and genome sequencers are as much to blame for this data glut as are faster computers and bigger hard drives.
The Need
• Research data intensity is increasing
• Problems are at a larger scale
• A research group approach is inefficient and variable
• An institutional approach to data leads to greater research efficiency
• A national approach to data is an enabler for tackling larger problems, with more efficiency and greater synergy
• A combined approach provides Australia with a means of engaging with the world, and also gives Australia a research collaboration advantage
Who Cares?
• The Australian Government – billions for one-use data?
• The Cutler report – our innovation system
• The institutions
• The disciplines?
• …and researchers should care
The Opportunity
• NCRIS has allocated $24M to the Australian National Data Service (ANDS) - underway
• EIF funding of $48M for the Australian Research Data Commons (ARDC)
• Combined, this is the largest project devoted to research data in the world
• ANDS has to go very fast – the funding is for 2 years
• EIF funds are designed to establish infrastructure, not run it, so for ARDC the focus is on commissioning and installing software systems
The Decisions to date
• ANDS NCRIS and ARDC EIF will be run as a coordinated program
• This will favour an institutional approach
• ARDC will look for and support cross-institutional approaches to nationally significant data initiatives
• ARDC EIF will concentrate on data access
• ARDC will strongly favour the ready, willing and able
Activity to date
• ANDS was requested to commit the first $10M of this program by September 30th, 2009, based on a specification agreed by June 30th, 2009
• To do so, there was a strong bias in favour of discussions already underway through the ANDS NCRIS program
• These discussions are on track, but not all have started, and not all are concluded
• ANDS is about to embark on the next $20M of expenditure, focused on data capture
But why should researchers care – and why not?
• The Code requires them to
• The role of data citation
• The changing nature of research
…but the costs often outweigh the benefits, so ANDS needs to help change the equation
Research Data Intensity
• SKA will generate an exabyte per day
• LHC will generate a petabyte a month
• A current generation gene sequencer can generate a terabyte per day
• Sensors will routinely be deployed to generate enormous and varied data streams for all disciplines
• More data is being captured now that can never be captured again
• More data was created last year than can be stored – IDC - http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
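To put the three rates quoted above on one scale, here is a rough sketch in Python. It assumes decimal (SI) byte units and a 30-day month, neither of which the slide specifies, and the names are illustrative only.

```python
# Assumption: decimal (SI) units; the slide does not say SI vs binary.
TB = 10**12
PB = 10**15
EB = 10**18
DAYS_PER_MONTH = 30  # assumption used to convert "per month" to "per day"

# Rates as quoted on the slide, expressed in bytes per day.
rates_bytes_per_day = {
    "SKA": EB,                    # an exabyte per day
    "LHC": PB / DAYS_PER_MONTH,   # a petabyte a month
    "Gene sequencer": TB,         # a terabyte per day
}


def relative_to(baseline, rates):
    """Express every rate as a multiple of the baseline source's rate."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items()}


ratios = relative_to("Gene sequencer", rates_bytes_per_day)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.1f} x one gene sequencer")
```

On these assumptions the quoted SKA rate is a million times that of a single sequencer, while the quoted LHC rate is roughly that of a few dozen sequencers.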
Research Data as Research Product
• The Human Genome project is known for its data
• Currently linguistics is concentrating on data capture rather than analysis as languages expire
• Research data that is used collaboratively can be central to research, cf. TREC
• Hubble telescope data is an output
Mapping the Human Genome
• It took a large team of scientists 10 years to map the 30,000 genes that describe the human body
• In 2007, Craig Venter published his complete DNA sequence, unveiling the six-billion-letter genome of a single individual for the first time
• The work required a large team using new instruments to produce a large dataset – indeed 2 competing large teams!
• No single lab could have completed this project with available technology in a reasonable time
The Hubble Telescope
• The Hubble telescope launched in 1990
• Increasing focus on cross-disciplinary science
• Observations are proposed, and if accepted, data is collected and made available to the proposers – who then write a research paper
• Each year around 1,000 proposals are reviewed and approximately 200 are selected, for a total of 20,000 individual observations
• The data is stored at the Space Telescope Science Institute
• More research papers are written from “second use” of the research data than from the use initially proposed
Excellence
Sharing Detailed Research Data Is Associated with Increased Citation Rate
• 48% of 85 cancer microarray clinical trial publications with publicly available microarray data received 85% of the aggregate citations
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
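A back-of-envelope reading of the two percentages above, sketched in Python. It uses only the figures on the slide; the result is a raw ratio of average citations per publication, not an estimate adjusted for other factors, and the variable names are illustrative.

```python
# Figures as quoted on the slide.
total_publications = 85
share_with_public_data = 0.48   # fraction of publications with public data
citation_share_public = 0.85    # fraction of aggregate citations they received

# Approximate count of publications with publicly available data.
n_public = round(total_publications * share_with_public_data)

# Average citation share per unit of publication share, for each group.
per_pub_public = citation_share_public / share_with_public_data
per_pub_other = (1 - citation_share_public) / (1 - share_with_public_data)

# Raw (unadjusted) ratio of average citations per publication.
raw_ratio = per_pub_public / per_pub_other

print(f"Publications with public data: about {n_public} of {total_publications}")
print(f"Raw citations-per-publication ratio: {raw_ratio:.2f}")
```

The raw ratio overstates what can be attributed to data sharing alone, since it ignores factors such as journal and publication date.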
Why might you care about ANDS?
• Research has become more data intensive (data management)
• Data is increasingly a research output, rather than a research by-product (data infrastructure)
• Excellence in research is correlated with size of effort and with data outputs (data preservation)
• Effective response to the AVCC Code for the Responsible Conduct of Research may be best done collectively
What is ANDS?
• ANDS was created by government to implement the NCRIS ANDS project and the EIF ARDC project
• To enable more researchers to re-use research data more often
• By significantly lowering the costs of capturing and finding data, and by demonstrating and increasing the benefits of sharing data
• Through creating a populated data commons - the ARDC, with tools and processes to enable its effective use
ARDC Enablers:
• Shared desire to change
• Professional services – research data analysts, research data carers, professional programmers
• Change partners such as eResearch orgs
• Changed status of research data
What is the ARDC?
• The set of data collections that are shareable
• The descriptions of the collections
• The relationships between the data, the researchers, the problems, the instruments and the institutions
• The infrastructure that enables populating and exploiting the commons
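The idea of a commons as descriptions plus relationships can be sketched as a small data structure. This is a minimal illustration in Python, not the ARDC's actual schema: the class names, entity kinds, keys and relation names below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A described thing in the commons (hypothetical model)."""
    key: str
    kind: str          # e.g. "collection", "researcher", "institution"
    description: str


@dataclass
class Commons:
    """Descriptions of entities plus the relationships between them."""
    entities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (source, relation, target)

    def add(self, entity: Entity) -> None:
        self.entities[entity.key] = entity

    def relate(self, source: str, relation: str, target: str) -> None:
        # Only allow relationships between entities that have been described.
        for key in (source, target):
            if key not in self.entities:
                raise KeyError(f"unknown entity: {key}")
        self.relations.append((source, relation, target))

    def collections_related_to(self, key: str) -> list:
        """Keys of collections directly linked to `key`, in either direction."""
        found = set()
        for source, _relation, target in self.relations:
            if key in (source, target):
                other = target if source == key else source
                if self.entities[other].kind == "collection":
                    found.add(other)
        return sorted(found)


# Hypothetical example records; the data itself stays with the institution.
commons = Commons()
commons.add(Entity("coll:survey", "collection", "Example geological survey collection"))
commons.add(Entity("coll:corpus", "collection", "Example annotated text corpus"))
commons.add(Entity("person:a", "researcher", "Example researcher"))
commons.add(Entity("inst:x", "institution", "Example institution"))
commons.relate("person:a", "isCollectorOf", "coll:survey")
commons.relate("coll:survey", "isManagedBy", "inst:x")
commons.relate("coll:corpus", "isManagedBy", "inst:x")

print(commons.collections_related_to("person:a"))
print(commons.collections_related_to("inst:x"))
```

Only descriptions and links are held here; discovery works by following relationships from a researcher or institution to the collections connected to it.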
Data Flows – Data stays with Institutions
[Diagram labels: Institutions, ANDS, Search, Metadata, Collections, Repository, Data, Web Pages]
ANDS is being structured as seven co-ordinated inter-related service delivery programs:
• Frameworks and Capabilities
• Data Capture
• Seeding the Commons
• Public Data Access
• Metadata storage
• ARDC Core
• ARDC Applications
Projects underway:
• CSIRO, through AuScope, capturing collection descriptions from geological surveys and making them available through a map interface - http://auscope.org.au/
• The Australian Research Data Commons visible through Research Data Australia
• A persistent identifier service – http://ands.org.au/
• Australian literature scholars electronically annotating scholarly works - http://www.itee.uq.edu.au/~eresearch/projects/aus-e-lit/
South Australia:
• Engage directly with ANDS – locally with Andrew Williams
• Engage in partnership with eRSA
• Look for nationally and internationally significant data opportunities
Success: More researchers re-using research data more often…
• More data sharable
• More data discoverable
• Easier to share
• Easier to find
• More researchers engaged
• More institutions engaged