Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

This presentation describes a bioinformatics pipeline for analyzing fosmid-based molecular haplotype sequencing data from the MHC. It covers targeted enrichment, detection of fosmids from aligned reads, phasing of molecular fosmid sequences, and SNP analysis. The experimental design slide lists 100 individuals, 100 libraries, and 5000 fosmids. SNP calling accuracy was assessed using Affymetrix genotype information for 1583 SNP positions as a reference standard.

Presentation Transcript


  1. Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing
     • Jorge Duitama (1,2), Thomas Huebsch (1), Gayle McEwen (1), Sabrina Schulz (1), Eun-Kyung Suk (1), Margret R. Hoehe (1)
     • (1) Max Planck Institute for Molecular Genetics, Berlin, Germany
     • (2) Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

  2. MHC: Key Region for Common Diseases & Transplant Medicine
     • [Figure: map of the MHC showing the class I, class III, and class II regions, with position labels 29.74, 31.59, 32.34, 33.21]

  3. MHC: Variation amongst Haplotypes
     • Variation amongst 8 MHC haplotypes:
       - 37,451 substitutions
       - 7,093 short indels
     • [Figure: variation of MHC haplotypes against the PGF reference sequence (7 further MHC haplotype sequences), with the RCCX CNV and HLA-DRB CNV marked in MHC class III and MHC class II]
     • Source: Variation and annotation map for eight MHC haplotypes, Horton et al., Immunogenetics (2008) 60, 1-18

  4. Experimental Approach
     • Input: 40 kb haploid molecules; 5000 fosmids; 100 individuals; 100 libraries; one pool
     • Complete fosmid pool: 3x96-well = 288 fosmid pools
     • SNP mapping for prioritization of MHC informative pools
     • Targeted enrichment
     • SOLiD NGS platform: shotgunning complete 40 kb fosmids
     • Data analysis pipeline: identification of 40 kb fosmid sequences; phasing molecular fosmid sequences (haplotype A, haplotype B)
     • Result: contiguous MHC haplotype sequence

  5. Data Analysis Pipeline
     • Components shown: Fosmid Detection Program; Read Alignment against Genome; Fosmid Specific Matching Algorithm Pairing; Fosmid Sequences Based Phasing; Consensus Calling; SNP Analysis; Visualization & MHC Database
     • Grouped into: In House Project Specific Analysis Pipeline; SOLiD Standard Pipeline

  7. Mapping real data
     • Pool of 15,000 fosmids
     • 22 million reads of 50 bp

  9. SNP calls: Haploid fosmids vs. genomic DNA

  10. SNP Calling Accuracy in the MHC
     • Affymetrix genotype information for 1583 SNP positions as reference standard:
       - Homozygous identical with reference: 957
       - Heterozygous: 562
       - Homozygous different from reference: 64
     • Compared to variants called from the SOLiD sequenced genomic DNA sample (15x average read coverage)
     • Percentage of error in genotype calling: 3.66%
     • False positive rate: 0.1%
     • False negative rate: 9.25%

  12. Fosmid Detection
     • Fosmid detection algorithm:
       - Assign each read to a single 1 kb bin. Select bins with more than 5 reads.
       - Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls.
       - Cluster adjacent bins as belonging to the same fosmid if the gap between them is less than 10 kb and there are no bins with heterozygous SNPs between them.
       - Keep fosmids with lengths between 3 kb and 60 kb.
     • [Figure: UCSC Genome Browser, http://genome.ucsc.edu/, Kent et al. 2002, Genome Res. 12(6):996-1006]
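The binning and clustering steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' program: the function name `detect_fosmids`, its inputs (read start positions on one chromosome and a set of bin indices already marked as heterozygous), and the choice to exclude marked bins from clusters are all assumptions made for the example. Only the thresholds come from the slide.

```python
from collections import Counter

BIN_SIZE = 1_000                  # 1 kb bins (from the slide)
MIN_READS = 5                     # keep bins with more than 5 reads
MAX_GAP = 10_000                  # merge bins separated by less than 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000  # accepted fosmid lengths

def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of putative fosmids.

    read_starts: read start positions on a single chromosome.
    het_bins: indices of bins marked as carrying heterozygous calls
              (hypothetical input; the allele calling itself is not shown).
    """
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    # Assumption: bins marked heterozygous are left out and act as breakpoints.
    covered = sorted(b for b, n in counts.items()
                     if n > MIN_READS and b not in het_bins)
    clusters = []
    for b in covered:
        if clusters:
            last = clusters[-1][1]
            gap = (b - last - 1) * BIN_SIZE
            blocked = any(last < h < b for h in het_bins)
            if gap < MAX_GAP and not blocked:
                clusters[-1][1] = b      # extend the current cluster
                continue
        clusters.append([b, b])          # start a new cluster
    intervals = [(s * BIN_SIZE, (e + 1) * BIN_SIZE) for s, e in clusters]
    return [(s, e) for s, e in intervals if MIN_LEN <= e - s <= MAX_LEN]

# Toy data: six reads in each of bins 100..139, plus a short island at 160..161.
reads = [b * BIN_SIZE + i
         for b in list(range(100, 140)) + [160, 161]
         for i in range(6)]
print(detect_fosmids(reads))
print(detect_fosmids(reads, het_bins={120}))
```

In the toy data the 2 kb island is dropped by the length filter, and marking bin 120 as heterozygous splits the 40 kb cluster in two, which is how overlapping fosmids from different haplotypes would be kept apart under this reading of the slide.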

  13. Fosmid Detection
     • [Figure: size distribution of read-contigs, showing 20-50 kb fosmid-sized contigs]

  15. Haplotyping
     • The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.

  16. Single Individual Haplotyping • Input: Matrix M of m fragments covering n loci

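  20. ReFHap Problem Formulation
     • For two alleles a1, a2: [formula not captured in the transcript]
     • For two rows i1, i2 of M: [formula not captured in the transcript]; example shown: s(M,1,2) = 1

  21. ReFHap Problem Formulation
     • For a cut I of rows of M: [formula not captured in the transcript]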
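The formulas on these two slides did not survive transcription. One scoring scheme that fits the notation used here (s(M, i1, i2) for rows, s(M, I) for a cut) and the Max-Cut reduction on the following slides is sketched below. It is an assumption offered for orientation, not text recovered from the slides; "-" stands for a locus not covered by a fragment.

```latex
% Assumed allele score: reward disagreement, penalize agreement, ignore gaps
s(a_1, a_2) =
\begin{cases}
-1 & \text{if } a_1 \neq \text{-},\ a_2 \neq \text{-},\ a_1 = a_2 \\
\phantom{-}1 & \text{if } a_1 \neq \text{-},\ a_2 \neq \text{-},\ a_1 \neq a_2 \\
\phantom{-}0 & \text{otherwise}
\end{cases}

% Assumed score for two rows (fragments) of M over the n loci
s(M, i_1, i_2) = \sum_{j=1}^{n} s\bigl(M[i_1, j],\, M[i_2, j]\bigr)

% Assumed score of a cut I: total score of row pairs separated by the cut
s(M, I) = \sum_{i_1 \in I} \; \sum_{i_2 \notin I} s(M, i_1, i_2)
```

Under these definitions, two fragments that disagree at most shared loci get a positive score and are likely to come from different chromosome copies, so maximizing s(M, I) amounts to a Max-Cut problem on a graph whose edge weights are the row scores.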

  22. ReFHap Algorithm
     • Reduce the problem to Max-Cut
     • Solve Max-Cut
     • Build haplotypes according to the cut
     • [Figure: example graph with edge weights 4, 1, -1, 3, 1, 2, 1, -1, 3 and resulting haplotypes h1 = 00110, h2 = 11001]

  23. ReFHap Algorithm
     • Build G=(V,E,w) from M
     • Sort E from largest to smallest weight
     • Init I with a random subset of V
     • For each e in the first k edges:
       - I’ ← GreedyInit(G,e)
       - I’ ← GreedyImprovement(G,I’)
       - If s(M, I) < s(M, I’) then I ← I’
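The loop on this slide can be sketched in Python as below. This is an illustrative reading, not the ReFHap implementation: the slides do not spell out `GreedyInit` or `GreedyImprovement`, so the versions here (place the most decisive vertex first, then flip single vertices while the cut improves) are assumptions, as are the allele score (+1 for disagreement, -1 for agreement, 0 for gaps) and the toy fragment matrix, which was built from the haplotypes 00110 and 11001 shown on the previous slide.

```python
import random

def score(a1, a2):
    # Assumed allele score: +1 if the alleles differ, -1 if equal, 0 on gaps.
    if a1 == "-" or a2 == "-":
        return 0
    return 1 if a1 != a2 else -1

def build_graph(M):
    # One vertex per fragment; weighted edge between overlapping fragments.
    w = {}
    for i in range(len(M)):
        for j in range(i + 1, len(M)):
            if any(a != "-" and b != "-" for a, b in zip(M[i], M[j])):
                w[(i, j)] = sum(score(a, b) for a, b in zip(M[i], M[j]))
    return w

def cut_score(w, side):
    return sum(wt for (i, j), wt in w.items() if side[i] != side[j])

def pull(w, side, v):
    # Total edge weight from v to already placed vertices on side 0 and side 1.
    to = [0, 0]
    for (i, j), wt in w.items():
        u = j if i == v else i if j == v else None
        if u is not None and u in side:
            to[side[u]] += wt
    return to

def greedy_init(n, w, seed_edge):
    # Assumed GreedyInit: split the seed edge, then place the most decisive vertex.
    side = {seed_edge[0]: 0, seed_edge[1]: 1}
    while len(side) < n:
        best_v, best_to = None, None
        for v in range(n):
            if v in side:
                continue
            to = pull(w, side, v)
            if best_v is None or abs(to[0] - to[1]) > abs(best_to[0] - best_to[1]):
                best_v, best_to = v, to
        # Putting v on side 1 cuts its edges to side 0, and vice versa.
        side[best_v] = 1 if best_to[0] >= best_to[1] else 0
    return side

def greedy_improvement(w, side):
    # Assumed GreedyImprovement: flip single vertices while the cut score grows.
    side = dict(side)
    improved = True
    while improved:
        improved = False
        for v in side:
            to = pull(w, side, v)
            if to[side[v]] > to[1 - side[v]]:
                side[v] = 1 - side[v]
                improved = True
    return side

def refhap_sketch(M, k=None, seed=0):
    rng = random.Random(seed)
    w = build_graph(M)
    edges = sorted(w, key=w.get, reverse=True)           # largest weight first
    best = {v: rng.randint(0, 1) for v in range(len(M))} # random initial cut
    for e in edges[:k]:                                  # k=None tries every edge
        cand = greedy_improvement(w, greedy_init(len(M), w, e))
        if cut_score(w, best) < cut_score(w, cand):
            best = cand
    return best

def build_haplotypes(M, side):
    # Majority vote per locus, with fragments on side 1 complemented.
    h1 = ""
    for j in range(len(M[0])):
        votes = 0
        for i, row in enumerate(M):
            if row[j] != "-":
                allele = int(row[j]) ^ side[i]
                votes += 1 if allele else -1
        h1 += "1" if votes > 0 else "0"
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2

# Hypothetical fragment matrix (rows = fragments, columns = loci, "-" = no call).
M = ["001--", "-011-", "--110", "110--", "-100-", "--001"]
print(build_haplotypes(M, refhap_sketch(M)))
```

On this error-free toy matrix the best cut separates the first three fragments from the last three, and the consensus step returns the haplotypes 00110 and 11001 (which one is labeled h1 depends on the cut orientation). The edge-flipping move shown on a later slide is not implemented in this sketch.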

  24. ReFHap Algorithm
     • Classical greedy algorithm
     • [Figure: two four-vertex graphs with vertices labeled 1, 4, 2, 3]

  25. ReFHap Algorithm
     • Edge flipping
     • [Figure: four-vertex graph shown in the orders 1 2 3 4 and 2 1 3 4]

  26. Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS

  27. Phasing MHC: Preliminary Results
     • Number of blocks: 8
     • N50 block length: 793 kb
     • Maximum block length: 1.6 Mb
     • Total extent of all blocks: 3.8 Mb
     • Fraction of MHC phased into haplotype blocks: 95%
     • Number of heterozygous SNPs: 8030
     • Fraction of SNPs phased: 86%
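For readers unfamiliar with the N50 statistic reported on this slide, it can be computed as in the short sketch below. It uses the usual definition (the length L such that blocks of length at least L together cover at least half of the total); whether the authors used exactly this convention is an assumption. The block lengths in the example are made-up toy values, not the study's data.

```python
def n50(lengths):
    """Length L such that blocks of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no blocks given")

# Toy block lengths in kb (hypothetical).
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

Because N50 weights long blocks more heavily than a plain median does, it is less affected by the many short blocks that phasing typically produces.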

  28. Acknowledgements
     • Anita Suk, Thomas Hübsch, Margret Hoehe, Roger Horton, Gayle McEwen, Steffi Palczewski, Britta Horstmann, Sabrina Schulz
     • The Life Tech Team: Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins
     • Thank You!

  29. Comparison of mapping algorithms: COX haplotype, simulated reads

  30. Phasing MHC
