10 likes | 145 Views
Pandora’s Box Data Discovery Pipeline Justin Winstead 1 , Denise P. Mu ñ oz 1 , Bogdan Pasaniuc 2 , Sing-Hoi Sze 3 , Eran Halperin 2 , Victoria Lunyak 1 1. Buck Institute, 2. International Computer Science Institute at UC Berkeley, 3. Texas A&M University. Introduction
E N D
Pandora’s Box Data Discovery Pipeline Justin Winstead1, Denise P. Muñoz1, Bogdan Pasaniuc2, Sing-Hoi Sze3, Eran Halperin2, Victoria Lunyak1 1. Buck Institute, 2. International Computer Science Institute at UC Berkeley, 3. Texas A&M University Introduction Next generation SOLiD sequencing technology is expanding the boundaries of traditional genetic analysis and enabling new applications such as whole genome resequencing, gene expression profiling, and CHiP and methylation analysis. The SOLiD system uses a process of stepwise cycled ligations that has been developed for high throughput DNA sequencing. Current enhancements to the SOLiD system include the development of library construction methods that afford genome sequencing of short fragment and mate-paired libraries. Recent improvements have demonstrated performance of >4 Gb per single tag (fragment library) and >6 Gb per dual tag (mate paired library). The data generated is used in completing primary data analysis (image acquisition and quality control of color space calls) and secondary analysis (alignment to reference genome, base calling and SNP identification). TIMELINE Library Preparation: 6-7 days Cell lysis, DNA/RNA extraction (application dependant), purification and amplification of sequencing library. Depending on the application goals, either a fragment or a mate paired library can be generated and sequenced on the SOLiD system. ePCR: 4-5days Input library titration based on insert size of DNA. Products are clonally amplified on magnetic beads. After ePCR, targeted template beads are separated from untemplated beads on which no reaction occurred. Successfully recovered DNA amplified beads are processed for slide deposition onto SOLiD for analysis. Sequencing: 7-10 days Parallel Sequencing takes place on glass slides to which the beads are attached. Sequence starts from a primer that anneals in one of the ends of the template DNA. Octameric probes anneal with the template DNA and ligate to the primer. Probe specificity is achieved by using di-base encoding and interrogating the 1st and 2nd base in color space. The ligated probe is imaged and the color associated with the templated bead is recorded. Phosphatase cleavage of the probe releases the flourophore. The ligation and cleavage steps are repeated. Once all cycles are complete for a primer, the extended strand is melted off and a complimentary primer offset by one base in the 5’ direction is hybridized for another round of ligation cycles. Primer denaturation and ligation rounds are repeated sequentially five times resulting in the sequencing read36 b.p. Data Assessment: 1-day Primary Analysis of images of fluorescent probes occurs on SOLiD module and results in assigned color call read assignments. Software on SOLiD Analyzer provides machine run performance assessment and ability to control or change run progress point (i.e. re run primer or ligation cycle within a primer). Secondary Analysis of sequence data compares color based calls of running sample against a partial reference genome that can be loaded and stored on the system software. (Human chromosome 17 default reference genome). Data Transfer: 1-day After each run, the 1-2 Tb of data generated can be backed up on an external storage device before being automatically transferred to the 12 Tb storage array Anselmo at the Buck Institute. Once transferred, collaborative bioinformatics teams at ICSI and Texas A&M University analyze the data. Data synced to 3 Tb storage unit Anselmo NCBI Database UCSC Genome Browser ICSI UC Berkeley • Data Analysis • secondary • tertiary Buck Institute Database Output Input • Applications • De novo whole genome sequencing • Transcriptome analysis: RNA-seq • CHiP analysis: CHiP-seq • DNA methylation analysis: Met-seq • ncRNA and miRNA discovery • Nucleosomal CHiP Texas A & M Onboard Linux 7 CPU with Windows interface and 3 Tb storage array Transfer to Anselmo 12 Tb array SOLiD Instrumentation module Data Analysis Cost per one sequencing run for SOLiD Library Preparation Application dependant, using standard molecular biological techniques. Fragment library oligo kit: (per library) $35 GeneAmp dNTP blend: (per sample) $40-80 Total Cost: $115 ePCR: One emulsion ePCR one reaction: $300 Enrichment one reaction: $210 Preparation buffers: $30 Consumables: $55 Labor: $800 Total Cost: $1395 Sequencing: One slide Bead deposition: $110 Sequencing probes: (35 cycles) $600 Fragment oligo-sequencing kit: (35 cycles) $1020 Instrument buffers: $283 Sequencing slide: $92 Other Consumables: $200 Service contracts: $1056 Labor: $100-200 Total Cost: $3381 Anselmo storage array Shared access with bioinformaticians for data analysis): $166 Total Cost: $166 • Acknowledgements: • We would like to express our appreciation to those who implemented the SOLiD instrument: • Buck Institute: Ralph O’rear, Kevin Kennedy, Peter Jolly • ABI: Rob G. David, Goli Shariat • Miriam Phillips & Justine Bock, Buck Institute, for assistance with this poster.