Biomedical Big Data Initiative (BD2K). Vivien Bonazzi, Ph.D., Program Director: Computational Biology (NHGRI); Co-Chair, Software Methods & Systems (BD2K)
Myriad Data Types: Genomic, Other ’Omic, Imaging, Phenotypic, Exposure, Clinical
Data and Informatics Working Group acd.od.nih.gov/diwg.htm
What Are the Big Problems to Solve? 1. Locating the data 2. Getting access to the data 3. Extending policies and practices for data sharing 4. Organizing, managing, and processing biomedical Big Data 5. Developing new methods for analyzing biomedical Big Data 6. Training researchers who can use biomedical Big Data effectively
Overarching Strategy and Goals: Two initiatives are being proposed to overcome roadblocks. Big Data to Knowledge (BD2K) – enable the biomedical research enterprise to maximize the value of biomedical data. InfrastructurePlus – create an adaptive environment at NIH to sustain world-class biomedical research.
Big Data to Knowledge (BD2K): Overview • Major trans-NIH initiative addressing an NIH imperative and key roadblock • Aims to be catalytic and synergistic • Overarching goal: • By the end of this decade, enable a quantum leap in the ability of the biomedical research enterprise to maximize the value of the growing volume and complexity of biomedical data
BD2K: Four Programmatic Areas • I. Facilitating Broad Use of Biomedical Big Data • II. Developing and Disseminating Analysis Methods and Software for Biomedical Big Data • III. Enhancing Training for Biomedical Big Data • IV. Establishing Centers of Excellence for Biomedical Big Data
Area 1: Data Sharing & Access • Facilitating usage and sharing of biomedical big data • New Policies to Encourage Data & Software Sharing • Index of Research Datasets to Facilitate Data Location & Citation • Community-based Development of Data & Metadata Standards A. Policies to Facilitate Data Sharing. B. Data Catalog: Data Discovery, Citation, Links to Literature. C. Frameworks for Community-Based Solutions to Developing Data Standards. D. Enabling Research Use of Clinical Data.
Area 2: Software and Systems Development • Development of analysis methods and software • Software to Meet Needs of the Biomedical Research Community • Facilitating Data Analysis: Access to Large-scale Computing • Dynamic Community Engagement of Users and Developers A. Grants for software development B. Software Registry: Making biomedical software findable and citable C. Cloud computing: Facilitating Data Analysis D. Dynamic Social Engagement via social media
Area 2: Software and Systems Development. Software Grants: current and emerging needs for using, managing, and analyzing the larger and more complex data sets inherent to biomedical Big Data • Compression/Reduction • Visualization • Provenance • Data Wrangling
Area 2: Software and Systems Development. Big Data needs Big Computing. Cloud Computing: • Leveraging the cloud • Storing and analyzing huge data sets • Collaborative environment • Developing appropriate policies for use of controlled access data in the cloud (dbGaP) • Developing working relationships with major cloud providers (AWS, Google, Microsoft Azure). HPC: • More exploration with supercomputing facilities
Area 3: Training • Enhancing computational training • Increase Number of Computationally Skilled Trainees • Strengthen the Quantitative Skills of All Researchers • Enhance NIH Review and Program Oversight
Area 4: Centers • Establishing centers of excellence • Collaborative environments & technologies • Data integration • Analysis & modeling methods • Computer science & statistical approaches A. Investigator-initiated Centers B. NIH-specified Centers
Big Data to Knowledge (BD2K) bd2k.nih.gov
Biomedical Research as Part of the Digital Enterprise. Philip E. Bourne, Ph.D., Associate Director for Data Science, National Institutes of Health
Myriad Data Types: Genomic, Other ’Omic, Imaging, Phenotypic, Exposure, Clinical
Components of The Academic Digital Enterprise • Consists of digital assets • E.g. datasets, papers, software, lab notes • Each asset is uniquely identified and has provenance, including access control • E.g. publishing simply involves changing the access control • Digital assets are interoperable across the enterprise
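One way to picture the asset model on this slide is a small Python sketch. It is illustrative only and not taken from the talk: the names `DigitalAsset`, `Access`, `record`, and `publish` are hypothetical, and the UUID identifier and two-level access model are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class Access(Enum):
    """Assumed two-level access model; a real system would likely be richer."""
    PRIVATE = "private"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    """Hypothetical record for one digital asset in the enterprise."""
    kind: str    # e.g. "dataset", "paper", "software", "lab notes"
    title: str
    access: Access = Access.PRIVATE
    # Assumption: a random UUID stands in for whatever identifier scheme is used.
    asset_id: str = field(default_factory=lambda: uuid4().hex)
    # Provenance kept as a simple ordered list of event descriptions.
    provenance: list = field(default_factory=list)

    def record(self, event: str) -> None:
        """Append an event to the asset's provenance trail."""
        self.provenance.append(event)

    def publish(self) -> None:
        """Publishing only changes the access control and logs that change."""
        self.access = Access.PUBLIC
        self.record("access changed to public")


if __name__ == "__main__":
    notes = DigitalAsset(kind="lab notes", title="Assay run, week 3")
    notes.record("created")
    notes.publish()
    print(notes.asset_id, notes.access.value, notes.provenance)
```

In this sketch the asset itself never moves or changes form when it is published; only its access field and provenance trail are updated, which is the point the slide makes.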
Let’s Break Down the Silos • New policies, regulations e.g. data sharing • Economic drivers • The promise of shared data
The NIH is Starting to Think About the Digital Enterprise Big Data to Knowledge (BD2K) bd2k.nih.gov
This is great, but BD2K is just a start. What will the end product look like?
To get to that end point we have to consider the complete research lifecycle
The Research Life Cycle will Persist IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Tools and Resources Will Continue To Be Developed Authoring Tools Data Capture Analysis Tools Scholarly Communication Lab Notebooks Software Visualization IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Those Elements of the Research Life Cycle will Become More Interconnected Around a Common Framework Authoring Tools Data Capture Analysis Tools Scholarly Communication Lab Notebooks Software Visualization IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
New/Extended Support Structures Will Emerge Authoring Tools Data Capture Analysis Tools Scholarly Communication Lab Notebooks Software Visualization IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION Discipline-Based Metadata Standards Community Portals Data Journals Git-like Resources By Discipline New Reward Systems Commercial & Public Tools Training Institutional Repositories Commercial Repositories
bonazziv@nih.gov Thank You Questions?