Slides by Gabor T. Marth on polymorphism discovery informatics: types of sequence variation (SNPs and INDELs), the steps of SNP discovery (sequence clustering, paralog identification, multiple alignment, SNP detection), SNP mining across diverse sequence types, validation of SNP scores, genome-scale SNP mining projects, and SNP genotyping.
Polymorphism discovery informatics
Gabor T. Marth
Department of Biology, Boston College, Chestnut Hill, MA 02467
Types of sequence variations
• Substitution-type single-nucleotide polymorphisms (SNPs) are the most abundant form of sequence variation
• Various insertion-deletion type polymorphisms (INDELs) are also very common
Are all substitutions SNPs?
• systematic pattern of bi-allelism within the population examined
What is SNP discovery?
• comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage)
• includes the organization of sequences relative to each other, and determining if sequence differences are sequencing artifacts or true polymorphisms
Steps of SNP discovery
• Sequence clustering
• Paralog identification (cluster refinement)
• Multiple alignment
• SNP detection
SNP discovery in diverse sequences
• many different types of sequences are available for polymorphism discovery: genome, EST, WGS, BAC, BAC-end, restriction fragments
• early methods of SNP discovery focused on specific sequence types
• different sequence types are radically different in terms of their accuracy
  genome sequence: 99.9–99.99%
  single pass sequence: 98–99%
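To make the accuracy gap concrete, the sketch below converts the per-base accuracies quoted above into Phred-scaled quality values (Q = -10 log10 of the error probability) and into an expected number of errors per read. The function names and the 500-base read length are assumptions chosen for illustration; they do not come from the slides.

```python
import math


def phred_quality(accuracy: float) -> float:
    """Convert a per-base accuracy (e.g. 0.999) to a Phred-scaled quality value."""
    error_rate = 1.0 - accuracy
    if not 0.0 < error_rate < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return -10.0 * math.log10(error_rate)


def expected_errors(accuracy: float, read_length: int) -> float:
    """Expected number of miscalled bases in a read of the given length."""
    return (1.0 - accuracy) * read_length


if __name__ == "__main__":
    # Accuracy values are the ones quoted on the slide;
    # the read length of 500 bases is an assumed, illustrative value.
    for label, accuracy in [
        ("genome sequence (low end)", 0.999),
        ("genome sequence (high end)", 0.9999),
        ("single pass sequence (low end)", 0.98),
        ("single pass sequence (high end)", 0.99),
    ]:
        q = phred_quality(accuracy)
        errs = expected_errors(accuracy, 500)
        print(f"{label}: Q ~ {q:.1f}, ~{errs:.2f} expected errors per 500 bases")
```

On this scale, finished genome sequence sits around Q30 to Q40, while single-pass reads sit around Q17 to Q20, which is why apparent differences in single-pass reads need to be weighed against their quality values.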
General SNP mining – PolyBayes
• sequence clustering simplifies to database search with genome reference
• multiple alignment by anchoring fragments to genome reference
• paralog filtering by counting mismatches weighted by quality values
• SNP detection by differentiating true polymorphism from sequencing error using quality values
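The last step, separating true polymorphism from sequencing error using quality values, can be illustrated with a deliberately simplified Bayesian calculation on a single alignment column. This is a toy sketch and not the published PolyBayes algorithm: it assumes independent reads, Phred-scaled quality values, a prior polymorphism rate of 0.001 spread uniformly over all non-identical base configurations, and hypothetical function names.

```python
def error_probability(quality: float) -> float:
    """Phred-scaled quality value -> probability that the base call is wrong."""
    return 10.0 ** (-quality / 10.0)


def snp_posterior(bases, qualities, prior_snp=0.001):
    """Posterior probability that an alignment column is polymorphic.

    bases      -- observed base calls (upper-case A/C/G/T), one per read
    qualities  -- Phred-scaled quality value of each base call
    prior_snp  -- assumed prior probability that a site is polymorphic

    Simplification: monomorphic configurations (all true bases equal) share
    prior mass 1 - prior_snp; all other configurations share prior_snp equally.
    """
    n = len(bases)
    if n != len(qualities):
        raise ValueError("bases and qualities must have the same length")
    if n < 2:
        return 0.0  # a single read cannot show a polymorphism

    # Probability of the observed column if every read truly carries one base.
    mono_likelihood = 0.0
    for true_base in "ACGT":
        product = 1.0
        for observed, quality in zip(bases, qualities):
            e = error_probability(quality)
            product *= (1.0 - e) if observed == true_base else e / 3.0
        mono_likelihood += product

    # Per-read probabilities sum to 1 over the four bases, so the total over
    # all configurations is 1 and the polymorphic share is the remainder.
    poly_likelihood = max(0.0, 1.0 - mono_likelihood)

    poly_configs = 4 ** n - 4
    poly_term = prior_snp / poly_configs * poly_likelihood
    mono_term = (1.0 - prior_snp) / 4.0 * mono_likelihood
    return poly_term / (poly_term + mono_term)


if __name__ == "__main__":
    # Two high-quality reads support each allele: strong SNP candidate.
    print(snp_posterior(["A", "A", "C", "C"], [40, 40, 40, 40]))
    # A single low-quality mismatch: most likely a sequencing error.
    print(snp_posterior(["A", "A", "A", "C"], [40, 40, 40, 10]))
```

The design point is that a mismatch contributes evidence in proportion to its quality value, so one low-quality discrepancy barely moves the posterior while concordant high-quality discrepancies do.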
SNP validation
• Pooled sequencing
• Direct re-sequencing
• Validation experiments show that the SNP probability or SNP score is accurate
• The SNP score allows one to choose cutoff values that balance false positive rate and the recovery of rare SNPs
[Figure labels: African, Asian, Caucasian, Hispanic, CHM 1; discard / keep]
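The trade-off behind the cutoff choice can be sketched as follows. Given candidates that each carry a SNP score and a validation outcome, a cutoff determines which candidates are kept, what fraction of the kept candidates fail validation, and what fraction of the confirmed SNPs is recovered. The function name and the toy candidate list are invented for illustration only; they are not results from the validation experiments.

```python
def evaluate_cutoff(candidates, cutoff):
    """Summarize the effect of a score cutoff.

    candidates -- list of (snp_score, validated) pairs, where validated is
                  True if the candidate was confirmed experimentally
    cutoff     -- candidates with snp_score >= cutoff are kept

    Returns (false_positive_rate, recovery), where false_positive_rate is the
    fraction of kept candidates that failed validation and recovery is the
    fraction of all confirmed SNPs that were kept.
    """
    kept = [(score, ok) for score, ok in candidates if score >= cutoff]
    confirmed_total = sum(1 for _, ok in candidates if ok)
    kept_confirmed = sum(1 for _, ok in kept if ok)
    kept_failed = len(kept) - kept_confirmed

    false_positive_rate = kept_failed / len(kept) if kept else 0.0
    recovery = kept_confirmed / confirmed_total if confirmed_total else 0.0
    return false_positive_rate, recovery


if __name__ == "__main__":
    # Made-up toy values, NOT data from the slides.
    toy_candidates = [
        (0.95, True), (0.90, True), (0.80, True), (0.60, False),
        (0.55, True), (0.45, False), (0.42, True), (0.40, False),
    ]
    for cutoff in (0.4, 0.7):
        fpr, rec = evaluate_cutoff(toy_candidates, cutoff)
        print(f"cutoff {cutoff}: false positive rate {fpr:.3f}, recovery {rec:.3f}")
```

In the toy list, raising the cutoff removes the candidates that failed validation but also discards confirmed low-scoring candidates, which is the balance the slide describes.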
Genome-scale SNP mining projects
• Overlaps of large-insert clone sequences
• Random, shotgun reads from whole-genome libraries aligned to the genome reference sequence
SNP genotyping
• SNP discovery: which nucleotides in the genome are polymorphic?
  aacgtttatgtgatt[a/c]ccagtaaa[g/t]tacggca
• SNP genotyping: which alleles does an individual carry at a nucleotide locus that is known to be polymorphic?
  person 1.
  aacgtttatgtgattaccagtaaattacggca
  aacgtttatgtgattcccagtaaattacggca
  person 2.
  aacgtttatgtgattaccagtaaattacggca
  aacgtttatgtgattcccagtaaagtacggca
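The difference between the two questions can be shown directly on the four sequences from this slide. The sketch below assumes the sequences are already aligned and of equal length, and that each person's two sequences represent their two chromosome copies; the function names are hypothetical.

```python
def discover_snps(sequences):
    """SNP discovery: report every aligned position with more than one allele.

    Returns a dict mapping 0-based position -> sorted list of observed alleles.
    """
    sites = {}
    for position, column in enumerate(zip(*sequences)):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            sites[position] = alleles
    return sites


def genotype(chromosome_copies, positions):
    """SNP genotyping: alleles one individual carries at known polymorphic sites."""
    return {
        position: "".join(sorted(copy[position] for copy in chromosome_copies))
        for position in positions
    }


# Sequences copied from the slide; pairing them as two copies per person
# is an assumption based on the slide layout.
person_1 = [
    "aacgtttatgtgattaccagtaaattacggca",
    "aacgtttatgtgattcccagtaaattacggca",
]
person_2 = [
    "aacgtttatgtgattaccagtaaattacggca",
    "aacgtttatgtgattcccagtaaagtacggca",
]

if __name__ == "__main__":
    snp_sites = discover_snps(person_1 + person_2)
    print("polymorphic sites:", snp_sites)
    print("person 1:", genotype(person_1, snp_sites))
    print("person 2:", genotype(person_2, snp_sites))
```

Discovery finds two polymorphic positions (a/c and g/t). Genotyping then shows that both people are heterozygous at the first site, while only person 2 is heterozygous at the second.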
Genotyping by sequence
[Figure labels: heterozygous peak, homozygous peak]
Genome variation landscape
• nucleotide diversity on human chromosomes
• marker density: “dense” vs. “sparse”
• allele frequency: “common” vs. “rare”