Nature Reviews, 2012
Next-Generation Sequencing (NGS): Data Generation
• NGS will generate more broadly applicable data for various novel functional assays
  • Protein-DNA binding
  • Histone modification
  • Transcript levels
  • Spatial interactions
• Combination of applications into larger studies, e.g. the 1000 Genomes Project
Next-Generation Sequencing (NGS): Data Interpretation
• Meaningful interpretation of sequencing data is essential
• Interpretation relies heavily on complex computation
• Major problems
  • Low adoption of existing practices
  • Difficulty of reproducibility
Problem 1: Low Adoption of Existing Practices
Example: variant discovery
• A set of accepted, accessible practices emerged from the 1000 Genomes Project
• 299 articles in 2011 cited this project
  • Only 10 studies used the recommended tools
  • Only 4 studies used the full workflow
• Not following tested practices undermines the quality of biomedical research
• Why is adoption low?
  • Complicated logistics (e.g. having to re-sort input data)
  • Limited applicability of the toolkit (e.g. only a handful of well-annotated genomes)
  • Little agreement on what constitutes "best practice"
Problem 2: Difficulty of Reproducibility
Example: read mapping
• Repeating a mapping experiment requires the primary data, the software and its version, the parameter settings, and the name of the reference genome
• Of 19 studies citing the 1000 Genomes Project, only 6 reported all of these details
• Of 50 randomly selected papers using the Burrows-Wheeler Aligner (BWA), only 7 provided all details
• Most results in today's publications cannot be accurately verified, reproduced, adopted or reused
• Why is this difficult?
  • Lack of mechanisms for documenting analytical steps
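The reproducibility checklist above (primary data, software and version, parameters, reference genome) can be captured as a simple machine-readable record. Below is a minimal sketch; the function name, field names, and the example values (tool, version, accession) are all hypothetical, not any framework's actual format:

```python
import json

def make_provenance_record(tool, version, parameters, reference, inputs):
    """Bundle every detail needed to repeat a mapping experiment."""
    return {
        "tool": tool,                  # the read mapper used
        "version": version,            # the exact release, not just the name
        "parameters": parameters,      # the full parameter settings
        "reference_genome": reference, # name of the reference genome
        "primary_data": inputs,        # accessions or checksums of the raw reads
    }

# Hypothetical example: a BWA run against GRCh37
record = make_provenance_record(
    tool="bwa",
    version="0.6.2",
    parameters={"command": "aln", "seed_length": 32},
    reference="GRCh37",
    inputs=["SRR000001.fastq"],
)
print(json.dumps(record, indent=2))
```

Publishing such a record alongside the results would let a reader satisfy all four requirements listed above without contacting the authors.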
Potential of Integrative Frameworks
• Combine diverse tools under the umbrella of a unified interface
  • E.g. BioExtract, Galaxy, GenePattern, GeneProf
• Advantages
  • Make data analysis transparent and reproducible
  • Make use of high-performance computing infrastructure
  • Improve long-term archiving
1. Promoting Transparency and Reproducibility
• Automatically track, record and disseminate all details of computational analyses
  • GenePattern: embeds analysis details into Microsoft Word documents during manuscript preparation
  • Galaxy: creates interactive Web-based supplements with analysis details
• Allows readers to inspect the described analysis in detail
2. Using High-Performance Computing Infrastructure
• High-performance computing resources
  • Computing clusters at institutions or through national efforts, e.g. XSEDE
  • Private and public clouds
• These are often not accessible to the broad biomedical community, requiring virtual machines or application programming interfaces
• With integrative frameworks, anyone can deploy a solution on any type of resource
  • E.g. CloudMan: a user interface for managing computing clusters on cloud resources
3. Improving Long-Term Archiving
• Centralized resources share a vulnerability: the longevity of hosted analysis services depends on external factors, e.g. the funding climate
• With integrative frameworks
  • Create snapshots of a particular analysis
  • Compose virtual machine images from an analysis, to be stored as an archival resource, e.g. in the Dryad system or figshare
  • Export the complete collection of analysis steps automatically for archiving
  • Anyone can recreate a new virtual instance from this archive
• Result: improved reproducibility
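The idea of exporting a complete analysis as a self-contained archive can be sketched as follows. The directory layout, manifest fields, and file names are hypothetical illustrations, not the export format of any of the systems named above:

```python
import json
import tarfile
import tempfile
from pathlib import Path

def snapshot_analysis(analysis_dir, archive_path):
    """Bundle an analysis directory plus a metadata manifest into one archive."""
    analysis_dir = Path(analysis_dir)
    # Record what the archive contains, so it is self-describing
    manifest = {
        "files": sorted(p.name for p in analysis_dir.iterdir()),
        "note": "recreate a virtual instance from this archive to rerun the analysis",
    }
    (analysis_dir / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
    # Pack the whole directory into a compressed, portable snapshot
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(analysis_dir, arcname="analysis")
    return Path(archive_path)

# Hypothetical usage with throwaway directories
workdir = Path(tempfile.mkdtemp())
(workdir / "results.txt").write_text("variant calls...\n")
archive = snapshot_analysis(workdir, Path(tempfile.mkdtemp()) / "analysis.tar.gz")
print(archive.name)
```

A real framework export would also include tool versions and virtual machine images, but the principle is the same: one artifact that a repository such as Dryad or figshare can store indefinitely.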
Future Directions: Tool Distribution
• Current practice
  • Tools need to be compiled, installed and supplied with associated data
  • E.g. a short-read mapper requires genome indices
• Better practice
  • Digital platforms provide sets of tools that are automatically installed into a user's integrative-framework environment
  • Pioneering work: e.g. GParc, Galaxy Tool Shed
  • These allow sharing of analysis workflows, data sets, visualizations and other analysis artifacts
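Automatic installation hinges on tools declaring their data dependencies, such as the genome indices a short-read mapper needs, so the platform can fetch them in the right order. A minimal sketch of that idea, with an entirely hypothetical tool catalogue and dependency resolver (no relation to the actual Tool Shed format):

```python
# Hypothetical catalogue: what an automated installer would need to know
TOOL_SPECS = {
    "short-read-mapper": {
        "version": "1.0",
        "data_dependencies": ["genome-index"],  # e.g. pre-built indices
    },
    "genome-index": {
        "version": "GRCh37",
        "data_dependencies": [],
    },
}

def install_order(tool, specs, seen=None):
    """Resolve data dependencies before the tool itself (topological order)."""
    seen = set() if seen is None else seen
    order = []
    for dep in specs[tool]["data_dependencies"]:
        if dep not in seen:
            order.extend(install_order(dep, specs, seen))
    if tool not in seen:
        seen.add(tool)
        order.append(tool)
    return order

print(install_order("short-read-mapper", TOOL_SPECS))
# → ['genome-index', 'short-read-mapper']
```

The point of the sketch: once dependencies are declared rather than described in a README, "compile, install and supply with associated data" becomes a single automated step.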
Future Directions: Integrating Analysis and Visualization
• Current practice
  • Visualization is the last step of an analysis
• Better practice
  • Visualization as an active component during analysis
• Advantages
  • Users can see directly, in real time, how parameter changes affect the final result
  • In the context of a publication, it helps readers evaluate and inspect the results
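The "active component" idea above amounts to an observer pattern: changing a parameter recomputes the result and refreshes any attached view immediately. A minimal sketch, with all class and parameter names invented for illustration:

```python
class LiveAnalysis:
    """Recompute a result on every parameter change and notify attached views."""

    def __init__(self, compute):
        self._compute = compute   # the analysis step, e.g. a quality filter
        self._params = {}
        self._observers = []      # views to refresh on every change
        self.result = None

    def on_change(self, callback):
        self._observers.append(callback)

    def set_param(self, name, value):
        self._params[name] = value
        self.result = self._compute(**self._params)
        for notify in self._observers:
            notify(self.result)   # e.g. redraw a plot in real time

# Hypothetical usage: filtering variant quality scores interactively
scores = [12, 35, 60, 8, 49]
analysis = LiveAnalysis(lambda threshold: [s for s in scores if s >= threshold])
analysis.on_change(lambda result: print("kept:", result))
analysis.set_param("threshold", 30)   # view updates as soon as the slider moves
analysis.set_param("threshold", 50)
```

In a real framework the callback would redraw a genome browser track or a plot rather than print, but the feedback loop (parameter change, recompute, redraw) is the same.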
Conclusion
• To sustain the growing application of NGS, data interpretation must become as accessible as data generation
• It is necessary to bridge the gap between experimentalists and computational scientists
  • Experimentalists must embrace the unavoidable computational components
  • Computational scientists must make their software appealing and easy to use
• Integrative frameworks are emerging to meet this need
  • Tracking details precisely
  • Ensuring transparency and reproducibility