Managing the NextGen data pipeline

Presentation Transcript


  1. Managing the NextGen data pipeline
     Jim Denvir, Ph.D.

  2. NextGen data challenges
     • NextGen Sequencing produces very large data sets
       • Order of terabytes (10^12 bytes) per run
     • Data analysis requires considerable computing power and specialist management
     • Main challenge is in distilling useful information from raw data
     WV-INBRE West Virginia IDeA Network of Biomedical Excellence

  3. Core Facility support
     • Bioinformatics and Genomics core facilities provide support for investigators needing to have NextGen Sequencing data analyzed
     • Perform analysis from early part of pipeline
     • Perform downstream analysis, or provide support and software for individual investigators
       • Depending on needs and expertise of investigator

  4. NextGen Analysis Pipeline
     [Diagram: pipeline stages in order, with the software named for each]
     • Image Analysis, Base Calling: Real Time Analysis performed by RTA software on sequencer
     • Demultiplexing*, Alignment, SNP calling or RNA Read Counting: CASAVA (Illumina) or open source (Tuxedo Suite, R/Bioconductor)
     • Statistical Analysis: Partek or R/Bioconductor
     • Functional Analysis: IPA
     • * May require custom scripts
     • Responsibility labels shown on the diagram: Automated, Core Facility, Investigator
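
The stage order on this slide can be sketched as a small Python data structure. The grouping of tools to stages is one reading of the diagram; the `Stage` class, its field names, and the `stages_using` helper are illustrative names introduced here, not part of any of the tools mentioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the analysis pipeline (hypothetical helper type)."""
    name: str
    tools: tuple  # software the slide names for this step
    may_need_custom_scripts: bool = False


# Assumption: CASAVA or the open-source options cover all three middle steps.
MID_TOOLS = ("CASAVA", "Tuxedo Suite", "R/Bioconductor")

# Stages in the order the diagram lists them.
PIPELINE = (
    Stage("Image Analysis", ("RTA",)),
    Stage("Base Calling", ("RTA",)),
    Stage("Demultiplexing", MID_TOOLS, may_need_custom_scripts=True),
    Stage("Alignment", MID_TOOLS),
    Stage("SNP calling or RNA Read Counting", MID_TOOLS),
    Stage("Statistical Analysis", ("Partek", "R/Bioconductor")),
    Stage("Functional Analysis", ("IPA",)),
)


def stages_using(tool):
    """Return the names of the stages for which `tool` is listed."""
    return [stage.name for stage in PIPELINE if tool in stage.tools]


if __name__ == "__main__":
    for number, stage in enumerate(PIPELINE, start=1):
        note = " (may require custom scripts)" if stage.may_need_custom_scripts else ""
        print(f"{number}. {stage.name}: {', '.join(stage.tools)}{note}")
```

Calling `stages_using("R/Bioconductor")` lists every step where the open-source route is an option, which is a quick way to see how much of the pipeline could move off commercial software.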

  5. Commercial Tools
     • Examples: RTA, CASAVA, Partek, IPA
     • Pros:
       • Short learning curve
       • Potentially can be used by individual investigators
       • Usually come with technical support and training
     • Cons:
       • Expensive
       • Closed, proprietary source code

  6. Open Source
     • Examples: R/Bioconductor, Tuxedo suite
     • Pros:
       • Free
       • Open source
       • Enables rapid, community-led improvement
       • Potentially more academically reviewable
     • Cons:
       • Steeper learning curve
       • Typically prohibitive for individual investigators
       • Sparse technical support

  7. Tools developed on site
     • Pros:
       • Can fill in missing functionality from available tools
       • Customized exactly to our needs
       • Potential for a revenue source
     • Cons:
       • Development is very time consuming

  8. Roadmap
     • Experience from microarray data analysis suggests:
       • Start with commercial tools
         • Rapid start-up enables us to focus on learning the scientific basis for the analyses
       • Transition to open-source tools for some parts of the pipeline
         • Probably mid-2012 to mid-2014
         • Provides for financial saving further down the road
         • Sometimes better received by journal reviewers
         • Initial steps of the analysis pipeline and functional analysis will still be managed by commercial software
       • Develop custom solutions only when needed

  9. Storing Data
     • Archiving data from NextGen experiments requires a large amount of disc space
     • Once analysis is complete, some raw image data will be deleted
       • Storage of data is more expensive than re-running an experiment!
       • Will consider exceptions for experiments which cannot be repeated
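
One way to read the retention rule on this slide is as a simple decision: raw image data becomes a candidate for deletion once analysis is complete, unless the experiment cannot be repeated. The function name and its boolean parameters are hypothetical, and the slide only says that "some" raw image data will be deleted, so this is a sketch of the stated rule rather than the core facility's actual policy.

```python
def raw_images_deletable(analysis_complete: bool, repeatable: bool) -> bool:
    """Return True if raw image data is a candidate for deletion.

    Assumption: deletion is only considered after analysis has finished,
    and experiments that cannot be repeated are treated as exceptions.
    """
    return analysis_complete and repeatable


if __name__ == "__main__":
    # Finished analysis, experiment could be re-run: candidate for deletion.
    print(raw_images_deletable(analysis_complete=True, repeatable=True))
    # Finished analysis, but the experiment cannot be repeated: keep.
    print(raw_images_deletable(analysis_complete=True, repeatable=False))
```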

  10. NextGen analysis server
     • Genomics Core has a Linux server for managing analysis and storing data
     • Housed in Drinko library and managed by central campus IT staff
     • Has 42 Terabytes of usable disc space
     • Uses redundant system to allow for potential of drive failures without losing data
     • Additionally, IT will back up data off site
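
As a rough capacity check, the 42 terabytes of usable space can be set against the "order of terabytes per run" figure from slide 2. The per-run sizes below are assumptions chosen only to illustrate the arithmetic, and `runs_that_fit` is a hypothetical helper; real run sizes will vary.

```python
TB = 10**12  # one terabyte in bytes, as defined on slide 2

USABLE_BYTES = 42 * TB  # usable space stated on the slide


def runs_that_fit(usable_bytes: int, bytes_per_run: int) -> int:
    """Return how many complete runs of a given size fit in the usable space."""
    if bytes_per_run <= 0:
        raise ValueError("bytes_per_run must be positive")
    return usable_bytes // bytes_per_run


if __name__ == "__main__":
    # Assumed run sizes, for illustration only.
    for run_tb in (1, 2, 5):
        count = runs_that_fit(USABLE_BYTES, run_tb * TB)
        print(f"{run_tb} TB per run: {count} runs")
```

Even at the low end of these assumed sizes the server holds only a few dozen full runs, which is consistent with the plan to delete some raw image data after analysis.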

  11. Things to remember
     • Core facilities are there to help!
     • At experimental design stage, be sure you understand what analysis the core facility will perform
       • Would you prefer to have IPA done by the core, or would you prefer control over that stage?
       • If so, do you need training and/or support?

  12. Questions?
     Presentation available at http://users.marshall.edu/~denvir/presentations.html
