
Understanding Sequence Variations in Bioinformatics

Learn about detecting single-nucleotide polymorphisms (SNPs) and other variations in genetic sequences. Explore methods for SNP discovery, sequence clustering, and Bayesian SNP detection. Understand the challenges in differentiating true variations from errors.


Presentation Transcript


  1. BI420 – Introduction to Bioinformatics: Sequence Variation Informatics
     Gabor T. Marth, Department of Biology, Boston College (marth@bc.edu)

  2. Sequence variations
     • Human Genome Project produced a reference genome sequence that is 99.9% common to each human being
     • sequence variations make our genetic makeup unique
     • Single-nucleotide polymorphisms (SNPs) are most abundant, but other types of variations exist and are important
     [Figure label: SNP]

  3. Why do we care about variations?
     • phenotypic differences
     • inherited diseases
     • demographic history

  4. Where do variations come from?
     • sequence variations are the result of mutation events
     • mutations are propagated down through generations
     • variation patterns permit reconstruction of phylogeny
     [Figure: genealogy descending from the most recent common ancestor (MRCA); each sequence reads either TAAAAAT or TAACAAT]

  5. SNP discovery
     • comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage)
     • diverse sequence resources can be used: EST, WGS, BAC

  6. Steps of SNP discovery
     • Sequence clustering
     • Cluster refinement
     • Multiple alignment
     • SNP detection

  7. Computational SNP mining – PolyBayes
     Two innovative ideas:
     1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources
     2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors
     [Figure labels: sequencing error, true polymorphism]
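The second idea depends on turning a base quality value into an error probability. A minimal sketch, assuming the quality values are Phred-scaled (Q = -10 log10 of the error probability); the function names and the 0.01 cutoff are illustrative and are not taken from PolyBayes:

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled base quality into the probability that the call is wrong."""
    return 10 ** (-q / 10)


def mismatch_is_credible(q_a: float, q_b: float, max_error: float = 0.01) -> bool:
    """Hypothetical filter: a mismatch between two aligned bases is only worth
    considering as a polymorphism if neither base is likely to be a sequencing error."""
    worst = max(phred_to_error_prob(q_a), phred_to_error_prob(q_b))
    return worst <= max_error
```

A mismatch between a Q40 and a Q30 base passes this filter, while one involving a Q10 base (10% error probability) does not.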

  8. Computational SNP mining – PolyBayes
     • sequence clustering simplifies to database search with genome reference
     • multiple alignment by anchoring fragments to genome reference
     • paralog filtering by counting mismatches weighted by quality values
     • SNP detection by differentiating true polymorphism from sequencing error using quality values

  9. SNP discovery with PolyBayes
     1. Fragment recruitment (database search)
     2. Anchored alignment
     3. Paralog identification
     4. SNP detection
     [Figure label: genome reference sequence]

  10. Sequence clustering
     • Clustering simplifies to search against sequence database to recruit relevant sequences
     • Clusters = groups of overlapping sequence fragments matching the genome reference
     [Figure: fragments placed along the genome reference, grouped into cluster 1, cluster 2, cluster 3]
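Once the database search has placed each fragment on the reference, grouping overlapping fragments reduces to merging intervals. A minimal sketch, assuming each hit is a hypothetical (fragment id, start, end) tuple with an exclusive end coordinate; it illustrates the idea and is not PolyBayes code:

```python
from typing import List, Tuple

# (fragment id, start, end) on the genome reference; end is exclusive (assumed format).
Hit = Tuple[str, int, int]


def cluster_hits(hits: List[Hit]) -> List[List[str]]:
    """Group fragments whose reference intervals overlap into clusters."""
    clusters: List[List[str]] = []
    cluster_end = -1
    # Sweep the hits from left to right along the reference.
    for frag_id, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if not clusters or start >= cluster_end:
            clusters.append([frag_id])        # no overlap: open a new cluster
            cluster_end = end
        else:
            clusters[-1].append(frag_id)      # overlaps the current cluster
            cluster_end = max(cluster_end, end)
    return clusters
```

Sorting by start coordinate means each fragment only needs to be compared with the right edge of the current cluster, so no all-against-all fragment comparison is required.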

  11. (Anchored) multiple alignment • The genomic reference sequence serves as an anchor • fragments pair-wise aligned to genomic sequence • insertions are propagated – “sequence padding” • Advantages • efficient -- only involves pair-wise comparisons • accurate -- correctly aligns alternatively spliced ESTs
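The "sequence padding" step can be sketched as follows: every fragment is aligned to the reference on its own, and wherever any fragment carries an insertion, pad characters are added to the reference and to all other fragments so the columns stay in register. This is an illustrative sketch, not the PolyBayes implementation; the input format (reference start, gapped reference row, gapped fragment row) and the `*` pad symbol are assumptions, and each pairwise alignment is assumed to begin with an aligned (non-insertion) column.

```python
from typing import Dict, List, Tuple

# One pairwise alignment: (reference start, gapped reference row, gapped fragment row).
# A '-' in the reference row marks a base inserted in the fragment.
Pairwise = Tuple[int, str, str]
Columns = Dict[int, Tuple[str, str]]  # ref position -> (aligned char, inserted bases after it)


def anchored_alignment(reference: str, pairwise: List[Pairwise], pad: str = "*") -> List[str]:
    """Merge pairwise alignments into one padded multiple alignment (reference row first)."""
    parsed: List[Columns] = []
    max_ins = [0] * len(reference)  # longest insertion seen after each reference base
    for ref_start, ref_row, frag_row in pairwise:
        cols: Columns = {}
        pos = ref_start - 1
        for r, f in zip(ref_row, frag_row):
            if r == "-":                       # insertion relative to the reference
                base, ins = cols[pos]
                cols[pos] = (base, ins + f)
            else:                              # match, mismatch, or deletion ('-' in fragment)
                pos += 1
                cols[pos] = (f, "")
        for p, (_, ins) in cols.items():
            max_ins[p] = max(max_ins[p], len(ins))
        parsed.append(cols)

    def render(cols: Columns, outside: str) -> str:
        out = []
        for p in range(len(reference)):
            base, ins = cols.get(p, (outside, ""))
            fill = pad if p in cols else outside
            out.append(base + ins + fill * (max_ins[p] - len(ins)))
        return "".join(out)

    ref_cols: Columns = {p: (b, "") for p, b in enumerate(reference)}
    return [render(ref_cols, pad)] + [render(c, " ") for c in parsed]
```

Only fragment-to-reference comparisons are made, which is the efficiency advantage the slide describes; fragments are never aligned to each other.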

  12. Paralog filtering -- idea
     • The “paralog problem”
       • unrecognized paralogs give rise to spurious SNP predictions
       • SNPs in duplicated regions may be useless for genotyping
     • Challenge
       • to differentiate between sequencing errors and paralogous difference
     [Figure labels: Sequencing errors, Paralogous difference]

  13. Paralog filtering -- probabilities
     • Pair-wise comparison between EST and genomic sequence
     • Model of expected discrepancies
       • Native: sequencing error + polymorphisms
       • Paralog: sequencing error + paralogous sequence difference
     • Bayesian discrimination algorithm
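The native-versus-paralog decision can be illustrated with a two-hypothesis Bayes calculation on the number of mismatches in a pairwise alignment. This is a simplified sketch: it uses one uniform error rate instead of per-base quality values, models the mismatch count as binomial, and all rates and the prior are illustrative assumptions rather than values given on the slides.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k mismatches among n aligned bases."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def prob_native(mismatches: int, aligned_bases: int,
                error_rate: float = 0.001,          # assumed sequencing error rate
                polymorphism_rate: float = 0.001,   # assumed rate of true polymorphism
                paralog_divergence: float = 0.02,   # assumed paralogous difference rate
                prior_native: float = 0.5) -> float:
    """Posterior probability that a fragment is native to the reference locus."""
    # Native: discrepancies come from sequencing error + polymorphisms.
    log_native = math.log(prior_native) + log_binom_pmf(
        mismatches, aligned_bases, error_rate + polymorphism_rate)
    # Paralog: discrepancies come from sequencing error + paralogous difference.
    log_paralog = math.log(1 - prior_native) + log_binom_pmf(
        mismatches, aligned_bases, error_rate + paralog_divergence)
    # Normalise in log space to avoid underflow for long alignments.
    top = max(log_native, log_paralog)
    native = math.exp(log_native - top)
    paralog = math.exp(log_paralog - top)
    return native / (native + paralog)
```

With these assumed rates, a 500-base alignment with one mismatch is scored as almost certainly native, while the same alignment with twelve mismatches is scored as almost certainly paralogous. Fragments falling below a chosen probability cutoff would be discarded before SNP detection.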

  14. Paralog filtering -- paralogs

  15. Paralog filtering -- selectivity
     [Figure labels: probability cutoff; 375 paralogous ESTs; 1,579 native ESTs]

  16. SNP detection
     • Goal: to discern true variation from sequencing error
     [Figure labels: sequencing error, polymorphism]

  17. Bayesian-statistical SNP detection
     • Base call + Base quality
     • Expected polymorphism rate
     • Base composition
     • Depth of coverage
     • Bayesian posterior probability
     [Figure: columns of A, C, G, T for each aligned read, illustrating polymorphic and monomorphic permutations]
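The permutation idea can be shown with a brute-force sketch for one alignment column: enumerate every assignment of true bases to the reads, weight each by how well it explains the observed calls and qualities, and sum the weight of the polymorphic assignments. The simplifications here are assumptions made for illustration (Phred-scaled qualities, uniform base composition, errors split evenly over the three other bases, a uniform prior over polymorphic permutations); the published PolyBayes formulation is more detailed.

```python
from itertools import product
from typing import Sequence

BASES = "ACGT"


def snp_probability(calls: str, quals: Sequence[float], poly_rate: float = 1 / 300) -> float:
    """Posterior probability that one alignment column is polymorphic."""
    n = len(calls)
    if n < 2 or n != len(quals):
        raise ValueError("need at least two reads and one quality per call")
    errors = [10 ** (-q / 10) for q in quals]      # Phred scale assumed
    prior_mono = (1 - poly_rate) / 4               # four monomorphic permutations
    prior_poly = poly_rate / (4 ** n - 4)          # the remaining polymorphic permutations
    total = poly = 0.0
    for truth in product(BASES, repeat=n):         # 4**n permutations: small depth only
        likelihood = 1.0
        for true_base, call, err in zip(truth, calls, errors):
            likelihood *= (1 - err) if true_base == call else err / 3
        if len(set(truth)) == 1:
            total += prior_mono * likelihood
        else:
            weight = prior_poly * likelihood
            total += weight
            poly += weight
    return poly / total
```

Two high-quality A calls and two high-quality C calls give a probability close to 1, whereas a single low-quality C among high-quality A calls is explained as a sequencing error and scores close to 0. Enumeration grows as 4^depth, so this form is only practical for shallow columns.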

  18. The SNP score
     [Figure labels: polymorphism, specific variation]

  19. SNP priors
     • Polymorphism rate in population -- e.g. 1 / 300 bp
     • Sample size (alignment depth)
     • Distribution of SNPs according to minor allele frequency
     • Distribution of SNPs according to specific variation

  20. Selectivity of detection
     [Figure labels: SNP probability threshold; 76,844]

  21. Validation by pooled sequencing
     [Figure labels: African, Asian, Caucasian, Hispanic, CHM 1]

  22. Validation by re-sequencing

  23. Rare alleles are hard to detect • frequent alleles are easier to detect • high-quality alleles are easier to detect

  24. The PolyBayes software http://genome.wustl.edu/gsc/polybayes • First statistically rigorous SNP discovery tool • Correctly analyzes alternative cDNA splice forms • Available for use (~70 licenses) Marth et al., Nature Genetics, 1999

  25. INDEL discovery
     • Sequencing chemistry context-dependent
     • There is no “base quality” value for “deleted” nucleotide(s)
     • No reliable prior expectation for INDEL rates of various classes

  26. INDEL discovery
     • Q(deletion) = average of Q(deletion flank)
     • Q(deletion flank) >= 35
     • Q(insertion flank) >= 35
     [Figure labels: Deletion, Deletion Flank, Insertion, Insertion Flank]
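Because a deleted base has no quality value of its own, the slide borrows one from the neighbouring bases. A minimal sketch of that rule, assuming the flank is given as lists of base qualities on either side of the gap and that the threshold of 35 must hold for every flank base (the slide does not say whether it applies per base or to the average):

```python
from statistics import mean
from typing import Sequence


def deletion_quality(left_flank: Sequence[int], right_flank: Sequence[int]) -> float:
    """Q(deletion): average quality of the bases flanking the deletion."""
    return mean(list(left_flank) + list(right_flank))


def flank_passes(flank_quals: Sequence[int], min_q: int = 35) -> bool:
    """Accept an INDEL candidate only if every flanking base reaches min_q (assumed rule)."""
    return all(q >= min_q for q in flank_quals)
```

For example, a deletion flanked by bases of quality 40 and 36 gets Q(deletion) = 38 and passes the filter, while flanks of 40 and 30 fail it.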

  27. INDEL discovery
     • 123,035 candidate INDELs (~ 25% of substitutions)
     • Majority 1-4 bp insertion length (1 bp: 68%, 2 bp: 13%)
     • Validation rate steeply increases with insertion length: 14.3% < 60.8% < 61.7%

  28. SNP discovery in diploid traces
     • usually, PCR products are sequenced from multiple individuals
     • sequence is guaranteed to originate from a single location: no alignment problem
     • sequence is the product of two chromosomes, hence can be heterozygous; base quality values are not applicable to heterozygous sequence

  29. SNP discovery in diploid traces
     [Figure labels: Homozygous trace peak, Heterozygous trace peak]

  30. SNP mining: genome BAC overlaps
     • overlap detection
     • SNP analysis
     • candidate SNP predictions
     • inter- & intra-chromosomal duplications
     • known human repeats
     • fragmentary nature of draft data

  31. BAC overlap mining results
     • ~ 30,000 clones
     • 25,901 clones (7,122 finished, 18,779 draft with base quality values)
     • 21,020 clone overlaps (124,356 fragment overlaps)
     • 507,152 high-quality candidate SNPs (validation rate 83-96%)
     Marth et al., Nature Genetics 2001
     [Figure: FASTA-style records >CloneX and >CloneY (ACGTTGCAACGT GTCAATGCTGCA); aligned overlap ACCTAGGAGACTGAACTTACTG / ACCTAGGAGACCGAACTTACTG]
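The two overlap sequences shown on this slide can be compared column by column to list candidate SNP positions. A minimal sketch, assuming the sequences are already aligned without gaps; the function name is illustrative, and a real pipeline would also check base quality at each mismatch:

```python
from typing import List, Tuple


def find_mismatches(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Return (0-based position, base in seq_a, base in seq_b) for every differing column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


candidates = find_mismatches("ACCTAGGAGACTGAACTTACTG", "ACCTAGGAGACCGAACTTACTG")
```

For the slide's pair this reports a single T/C difference, the kind of candidate that overlap mining collects across all clone overlaps.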

  32. SNP mining projects
     1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002)
     2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001)

  33. The current variation resource • The current public resource (dbSNP) contains over 2 million SNPs as a dense genome map of polymorphic markers 1. How are these SNPs structured within the genome? 2. What can we learn about the processes that shape human variability?

  34. New sequencers for SNP discovery
