Bioinformatics for high-throughput DNA sequencing

Bioinformatics for high-throughput DNA sequencing. Gabor Marth, Boston College Biology. New grad student orientation, Boston College, September 8, 2009. DNA sequence variations. The Human Genome Project has determined a reference sequence of the human genome.

Presentation Transcript


  1. Bioinformatics for high-throughput DNA sequencing. Gabor Marth, Boston College Biology. New grad student orientation, Boston College, September 8, 2009

  2. DNA sequence variations • The Human Genome Project has determined a reference sequence of the human genome • However, every individual is unique and differs from others at millions of nucleotide locations

  3. Why do we care about variations? • They underlie phenotypic differences • They cause inherited diseases • They allow tracking of ancestral human history

  4. Human genetic variation

  5. The first “famous” genomes

  6. Genome sequencing [Figure labels: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb]

  7. New sequencing technologies…

  8. Next-gen sequencing: a revolution [Chart: bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp)] • Illumina/Solexa, AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads) • Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads) • ABI capillary sequencer

  9. The re-sequencing informatics pipeline • (i) base calling • (ii) read mapping • (iii) SNP and short INDEL calling • (iv) SV calling • (v) data viewing, hypothesis generation [Diagram labels: REF, IND]

  10. Tools

  11. Read mapping • Read mapping is like a jigsaw puzzle… • …you get the pieces… • …and they give you the picture on the box • Big and unique pieces are easier to place than others…

  12. The MOSAIK read mapping program • Reads from repeats cannot be uniquely mapped back to their true region of origin. Michael Strömberg (Wan-Ping Lee)
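The jigsaw analogy and the point about repeats can be illustrated with a toy exact-match mapper. This is only a minimal sketch and is not how MOSAIK works internally: the reference string, the seed length `K`, and the function names `build_kmer_index` and `map_read` are all made up for illustration. It shows a short read that falls entirely inside a repeat matching two places, while a longer read that reaches into unique flanking sequence matches one place.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-length substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return every reference position where the read matches exactly.

    The first k bases of the read are used as a seed lookup; each seed hit
    is then verified against the full read.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Hypothetical toy reference: two copies of the repeat "GATTACA"
# separated and flanked by unique sequence.
REFERENCE = "TTGCC" + "GATTACA" + "CGGTC" + "GATTACA" + "ATGCG"
K = 4  # seed length, chosen arbitrarily for this example
INDEX = build_kmer_index(REFERENCE, K)

repeat_read = "GATTACA"    # lies entirely inside the repeat
spanning_read = "ACACGGTC"  # starts in the repeat, extends into unique sequence

repeat_hits = map_read(repeat_read, REFERENCE, INDEX, K)
spanning_hits = map_read(spanning_read, REFERENCE, INDEX, K)

print("repeat read maps to:", repeat_hits)
print("spanning read maps to:", spanning_hits)
```

Real read mappers also have to tolerate mismatches and gaps caused by sequencing errors and true variation, which this exact-match sketch ignores.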

  13. SNP discovery. Marth et al. Nature Genetics 1999; Quinlan et al. in prep. (Amit Indap, Wen Fung Leong)

  14. Structural variation discovery [Viewer panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region] Stewart et al. in prep. (Deniz Kural, Jiantao Wu)

  15. Sequence alignment viewers Huang et al. Genome Research 2008 (Derek Barnett)

  16. Data mining

  17. Mutational profiling in deep 454 data [Image: Pichia stipitis, from JGI web site; Pichia stipitis reference sequence] • Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production) • one specific mutagenized strain had especially high conversion efficiency • goal was to determine where the mutations were that caused this phenotype • we analyzed 10 runs (~3 million reads) of 454 data (~20x coverage of the 15 Mb genome) • found 39 mutations • informatics analysis in < 24 hours (including manual checking of all candidates) Smith et al. Genome Research 2008
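The coverage figure on this slide follows from the usual back-of-the-envelope relation: depth of coverage ≈ (number of reads × mean read length) / genome size. The sketch below plugs in the slide's rounded numbers. The function names are illustrative, and the mean read length is not stated on the slide; it is only what the other three rounded figures imply.

```python
def depth_of_coverage(num_reads, mean_read_length_bp, genome_size_bp):
    """Average number of times each genome base is covered by a read."""
    return num_reads * mean_read_length_bp / genome_size_bp


def implied_mean_read_length(num_reads, coverage, genome_size_bp):
    """Mean read length implied by a read count, coverage and genome size."""
    return coverage * genome_size_bp / num_reads


# Rounded figures from the slide.
NUM_READS = 3_000_000         # ~3 million 454 reads
GENOME_SIZE_BP = 15_000_000   # ~15 Mb genome
COVERAGE = 20                 # ~20x

# Derived value (an inference from the rounded numbers, not from the slide).
read_length_bp = implied_mean_read_length(NUM_READS, COVERAGE, GENOME_SIZE_BP)
check = depth_of_coverage(NUM_READS, read_length_bp, GENOME_SIZE_BP)

print(f"implied mean read length: {read_length_bp:.0f} bp")
print(f"coverage recomputed from it: {check:.0f}x")
```

Because all of the inputs are approximate, the implied read length should be read as an order-of-magnitude consistency check rather than a measured value.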

  18. SNP calling in short-read coverage [Diagram: C. elegans reference genome (Bristol, N2 strain); reads from Bristol, N2 strain (3 ½ machine runs) and Pasadena, CB4858 (1 ½ machine runs)] • goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes • 5 runs (~120 million Illumina reads) were collected by Washington Univ. • we found 45,000 SNPs with a very high validation rate. Hillier et al. Nature Methods 2008

  19. Current focus

  20. 1000 Genomes Project • data quality assessment • project design (number of samples, depth of read coverage) • read mapping • SNP calling • structural variation discovery

  21. SV discovery in autism [Figure labels: deletion, amplification]

  22. Lab

  23. People

  24. Resources • computer cluster (72 servers) • 128 GB RAM server • ~200TB disk space • 2 R01 grants (NHGRI/NIH) • 1 R21 grant (NIAID/NIH) • a BC RIG grant • 2 RC2 grants (NHGRI/NIH) starting September 2009

  25. Collaborations • Genome Canada • Baylor HGSC • Wash. U. GSC • UBC GSC • UCSF • UCLA • UC Davis • Cornell • Pfizer • NCBI @ NIH • NCI @ NIH • Marshfield Clinic

  26. Graduate student rotations • Looking for new graduate students • Spots are available for all three rotations • Lots of projects • Caveat: you need to be able to program… • Check us out at: http://bioinformatics.bc.edu/marthlab/ • If you are interested, please talk to me
