
Alexis Dereeper



Presentation Transcript


  1. Detection and analysis of SNP polymorphisms Alexis Dereeper CIBA courses – Brasil 2011

  2. Objectives
     • To know and manipulate the available packages/tools for SNP and INDEL detection from NGS data (assembly of NGS data)
     • To think about the difficulties encountered when analysing next-generation sequencing data (differentiating sequencing errors, paralogs and allelic variation)
     • To detect SNPs and assign genotypes at every polymorphic position
     • To exploit polymorphism data simply via a Web-based application (genetic diversity, LD)
     • To obtain an exploitable dataset to send for the design of a high-throughput SNP chip (Illumina VeraCode technology)
     Diagram labels: Short reads (Solexa); Mapping (SAM); Allelic variations; List of SNPs (A/G 1998 T/C 2341 T/G); Assignment of genotypes; Exploitation of polymorphism data; Design of an Illumina SNP chip
     Example alignment with an IUPAC code at the polymorphic position (K = G or T):
       Ind1 ATTGTGTCGTAACGTATGTCATGTCGT
       Ind2 ATTGTGTCGGAACGTATGTCATGTCGT
       Ind3 ATTGTGTCGKAACGTATGTCATGTCGT

  3. Tablet
     • Graphical viewer for assemblies of NGS data
     • Accepts different formats: ACE, SAM, BAM

  4. Automatic detection of SNPs from a SAM assembly
     Example of a pipeline feasible with the Galaxy system: 3 alternatives
     Tools shown: PicardTools, SamTools, GATK, VarScan, SNiPlay Utilities
     Common steps: Fastq → FastQ Groomer → Mapping BWA → SAM assembly
     • SAM-to-BAM → Generate Pileup → Pileup file → Pileup2snp → SNP tabular file
     • SamToFastaAlignments → FASTA alignments with IUPAC
     • AddReadGroupIntoSam → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file → VCFToFastaAlignments

  5. VarScan
     • Program for SNP detection from a Pileup file: Pileup2snp
     • Another module, Pileup2indel, exists for indels but is not yet implemented in Galaxy SouthGreen
     Pileup format
     Text file describing, for each position: reference base, depth of coverage, variations, quality
       seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
       seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
       seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
       seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
       seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
       seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
       seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
       seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
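To make the pileup columns concrete, here is a minimal Python sketch that reads one pileup line and counts the bases seen at that position. It is not VarScan's code: the function names `count_bases` and `parse_pileup_line` are hypothetical, and the sketch assumes the six-column pileup layout shown on the slide (sequence, position, reference base, depth, read bases, qualities).

```python
import re

def count_bases(ref_base, read_bases):
    """Count the bases observed at one pileup position.

    '.' and ',' mean "same as the reference" (forward / reverse strand),
    '^X' marks a read start (X is a mapping-quality character),
    '$' marks a read end, '+nSEQ' / '-nSEQ' describe indels.
    """
    counts = {}
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":                      # read start: skip '^' and the quality char
            i += 2
        elif c in "$<>":                  # read end or reference skip: no base here
            i += 1
        elif c in "+-":                   # indel: skip sign, length digits and sequence
            m = re.match(r"\d+", read_bases[i + 1:])
            i += 1 if m is None else 1 + len(m.group()) + int(m.group())
        else:
            base = ref_base.upper() if c in ".," else c.upper()
            counts[base] = counts.get(base, 0) + 1
            i += 1
    return counts

def parse_pileup_line(line):
    """Split one six-column pileup line into a small dictionary."""
    chrom, pos, ref, depth, bases, quals = line.split()
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "depth": int(depth), "counts": count_bases(ref, bases)}

# Last line of the slide example: one A and one T among 23 reads.
record = parse_pileup_line("seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<")
print(record["counts"])
```

A SNP caller would then compare these counts with coverage and frequency thresholds; that decision step is left out here.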

  6. SamToFastaAlignments and AceToFastaAlignments: SNiPlay utilities for management of NGS data
     • Input: mapping (SAM format) or assembly (ACE format)
     • Threshold values per genotype (slide column labels: Depth, Frequency, Depth):
         CL1Contig1  genotype1  0    1  1
                     genotype2  0.3  4  2
                     genotype3  0.3  4  2
     • Diagram labels: "Depth threshold", "Depth threshold", "Heterozygosity", "For heterozygosity estimation", "For position", "For each contig"
     • Outputs:
         List of heterozygous positions
         Stats: estimation of average heterozygosity for each genotype
         FASTA alignments including IUPAC codes (e.g. A, Y, T, W): CL1Contig1.align.fa, CL1Contig2.align.fa, CL2Contig1.align.fa, …
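The idea of turning per-position base counts into a genotype symbol with a depth threshold and a frequency threshold can be sketched as follows. This is an illustration only, not the SNiPlay implementation: the function name `call_genotype` and the default values (depth 4, frequency 0.3, chosen to echo the numbers on the slide) are assumptions.

```python
# IUPAC codes for the six possible two-base combinations.
IUPAC_PAIRS = {"AG": "R", "CT": "Y", "GT": "K", "AC": "M", "CG": "S", "AT": "W"}

def call_genotype(counts, min_depth=4, min_freq=0.3):
    """Return a base, an IUPAC heterozygote code, or 'N' for one position.

    counts    -- dict mapping base -> number of reads carrying it
    min_depth -- hypothetical depth threshold below which no call is made
    min_freq  -- hypothetical minimum allele frequency for an allele to be kept
    """
    depth = sum(counts.values())
    if depth == 0 or depth < min_depth:
        return "N"                                    # not enough coverage
    kept = sorted(b for b, n in counts.items() if n / depth >= min_freq)
    if len(kept) == 1:
        return kept[0]                                # homozygous call
    if len(kept) == 2:
        return IUPAC_PAIRS.get("".join(kept), "N")    # heterozygous call
    return "N"                                        # ambiguous position

print(call_genotype({"G": 6, "T": 4}))   # both alleles pass the frequency threshold
print(call_genotype({"T": 9, "G": 1}))   # the rare G is treated as a sequencing error
```

With these thresholds a 6/4 split of G and T reads yields K, the symbol carried by Ind3 in the objectives slide, while a single discordant read out of ten is ignored.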

  7. GATK (Genome Analysis ToolKit)
     • Package for the analysis of NGS data
     • Developed for the analysis of human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)
     • Includes tools for depth analysis, quality score recalibration and SNP/InDel discovery
     • Complementary to 2 other packages: SamTools, PicardTools
     PREPROCESS:
       * Index the human genome (Picard); we used HG18 from UCSC
       * Convert Illumina reads to Fastq format
       * Convert Illumina 1.6 read quality scores to standard Sanger scores
     FOR EACH SAMPLE:
       1. Align samples to the genome (BWA); this generates SAI files
       2. Convert SAI to SAM (BWA)
       3. Convert SAM to BAM binary format (SAM Tools)
       4. Sort BAM (SAM Tools)
       5. Index BAM (SAM Tools)
       6. Identify target regions for realignment (Genome Analysis Toolkit)
       7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
       8. Reindex the realigned BAM (SAM Tools)
       9. Call Indels (Genome Analysis Toolkit)
       10. Call SNPs (Genome Analysis Toolkit)
       11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
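Steps 1 to 10 of the per-sample list can be sketched as a list of command lines built in Python. Nothing is executed; the function only returns strings. Several things here are assumptions rather than the course's actual commands: the file names, the jar name `GenomeAnalysisTK.jar`, the older `-T` GATK command style (GATK versions 1 to 3), the `samtools sort -o` syntax (samtools 1.x), and the use of UnifiedGenotyper for both indel and SNP calling. The reference is assumed to be already indexed and the reads to carry read groups, which GATK requires.

```python
def build_pipeline(sample, ref="ref.fa", gatk_jar="GenomeAnalysisTK.jar"):
    """Return the shell commands for one sample, in order (not executed)."""
    fq = f"{sample}.fastq"
    bam = f"{sample}.sorted.bam"
    realigned = f"{sample}.realigned.bam"
    gatk = f"java -jar {gatk_jar}"
    return [
        # 1-2: align with BWA, then convert SAI to SAM
        f"bwa aln {ref} {fq} > {sample}.sai",
        f"bwa samse {ref} {sample}.sai {fq} > {sample}.sam",
        # 3-5: SAM to BAM, sort, index
        f"samtools view -bS {sample}.sam > {sample}.bam",
        f"samtools sort -o {bam} {sample}.bam",
        f"samtools index {bam}",
        # 6-8: find realignment targets, realign, reindex
        f"{gatk} -T RealignerTargetCreator -R {ref} -I {bam} -o {sample}.intervals",
        f"{gatk} -T IndelRealigner -R {ref} -I {bam} "
        f"-targetIntervals {sample}.intervals -o {realigned}",
        f"samtools index {realigned}",
        # 9-10: call indels, then SNPs
        f"{gatk} -T UnifiedGenotyper -R {ref} -I {realigned} -glm INDEL -o {sample}.indels.vcf",
        f"{gatk} -T UnifiedGenotyper -R {ref} -I {realigned} -glm SNP -o {sample}.snps.vcf",
    ]

for command in build_pipeline("RC1"):
    print(command)
```

Printing the commands first makes it easy to check them against the installed tool versions before running anything.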

  8. Pipeline diagram: one mapping per sample, then merge
     For each of Fastq (RC1), Fastq (RC2), Fastq (RC3), Fastq (RC4), …:
       FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → SAM with read group
     Then: mergeSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file

  9. Pipeline diagram: samples pooled before mapping
     Fastq (RC1), Fastq (RC2), Fastq (RC3), Fastq (RC4) → Fastq global
     Then: FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file

  10. VCF format (Variant Call Format)
     Advantages: describes the variations for each position + genotype assignment
       ##fileformat=VCFv4.0
       ##fileDate=20090805
       ##source=myImputationProgramV3.1
       ##reference=1000GenomesPilot-NCBI36
       ##phasing=partial
       ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
       ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
       ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
       ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
       ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
       ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
       ##FILTER=<ID=q10,Description="Quality below 10">
       ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
       ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
       ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
       ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
       ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
       #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
       20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51
       20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
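To show how the genotype assignment is read from a VCF data line, here is a small Python sketch that decodes the GT field of each sample into actual alleles. The function name `parse_vcf_record` is hypothetical, and the sketch splits on any whitespace because the slide transcript lost the tabs; real VCF files are tab-separated.

```python
def parse_vcf_record(line, sample_names):
    """Decode one VCF data line into position, alleles and per-sample genotypes."""
    fields = line.split()
    chrom, pos, variant_id, ref, alt = fields[:5]
    alleles = [ref] + alt.split(",")          # index 0 = REF, 1.. = ALT alleles
    format_keys = fields[8].split(":")        # e.g. GT:GQ:DP:HQ
    genotypes = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        values = dict(zip(format_keys, sample_field.split(":")))
        # '|' = phased, '/' = unphased; both separate allele indexes
        indexes = values["GT"].replace("|", "/").split("/")
        genotypes[name] = tuple("." if i == "." else alleles[int(i)] for i in indexes)
    return {"chrom": chrom, "pos": int(pos), "id": variant_id,
            "ref": ref, "alt": alt.split(","), "genotypes": genotypes}

line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51")
record = parse_vcf_record(line, ["NA00001", "NA00002"])
print(record["pos"], record["genotypes"])
```

For the first data line of the slide, NA00001 (0|0) is homozygous for the reference G and NA00002 (1|0) is heterozygous A/G.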

  11. Other functionalities of GATK
     • DepthOfCoverage module: reports the sequencing depth of coverage for each gene, each position and each individual
     • ReadBackedPhasing module: determines, where possible, which alleles are associated (phase or haplotype) at heterozygous positions
     Slide illustration labels: "AGG", "GGA", "And not"

  12. SNiPlay: Web-based application for polymorphism analysis
     http://sniplay.cirad.fr

  13. Automatic detection of SNPs from a SAM assembly (the Galaxy pipeline diagram of slide 4 is shown again)

  14. Options of SNiPlay
     • Select the VCF format
     • Load the VCF file
     • Load the reference file
     • Select the rice genome as reference

  16. Design of Illumina chip
     • Submission file for Illumina
     • Genotyping file
     • Analysis with the BeadStudio software
     • Cartesian coordinates

  17. Allelic files
     • PED format
         cARB 1 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
         cSYR 2 0 0 1 0 1 1 1 1 3 3 1 3 4 4 2 2 2 2 1 1 4 4 2 4
         cARA 3 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
     • DARwin format
         @DARwin 5.0 - ALLELIC - 2
         33 20
         N° 50 50 122 122 218 218 245 245 261 261 290 290 356
         1 1 1 1 1 3 3 3 3 4 4 2 2 2
         2 1 1 1 1 3 3 1 3 4 4 2 2 2
         3 1 1 1 1 3 3 3 3 4 4 2 2 2
         4 1 1 1 1 3 3 3 3 4 4 2 2 2
     • .inp format for Phase
         33
         10
         P 49 121 217 244 260 289
         SSSSSSSSSS
         #cARB
         A A G G T C C A T T
         A A G G T C C A T T
         #cSYR
         A A G A T C C A T C
         A A G G T C C A T T
     • Format for TASSEL (association studies)
         33 10:2
         50 122 218 245 261 290 356 461 467 560
         cARB A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
         cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T
         cARA A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
         cORL A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
         cLAR A:G A:G A:G A:G C:T C:C C:C A:A T:T C:T
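The slide's examples show the same genotypes in several encodings: comparing the cSYR rows suggests that the PED file codes alleles as A=1, C=2, G=3, T=4. Under that assumption, a conversion from a TASSEL-style row to a PED row can be sketched in Python as below. The function name `tassel_row_to_ped` is hypothetical, and the fixed columns (individual number, 0 0 for parents, 1 for sex, 0 for phenotype) simply copy what the slide's PED lines contain.

```python
# Allele coding inferred from the slide's examples (A=1, C=2, G=3, T=4); 0 = missing.
ALLELE_CODE = {"A": "1", "C": "2", "G": "3", "T": "4"}

def tassel_row_to_ped(row, index, sex="1", phenotype="0"):
    """Convert 'name A:A A:G ...' into a PED line with numeric alleles."""
    name, *genotypes = row.split()
    alleles = []
    for genotype in genotypes:
        first, second = genotype.split(":")
        alleles.append(ALLELE_CODE.get(first, "0"))
        alleles.append(ALLELE_CODE.get(second, "0"))
    # PED columns: family, individual, father, mother, sex, phenotype, then alleles
    return " ".join([name, str(index), "0", "0", sex, phenotype] + alleles)

print(tassel_row_to_ped("cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T", 2))
# → cSYR 2 0 0 1 0 1 1 1 1 3 3 1 3 4 4 2 2 2 2 1 1 4 4 2 4
```

The printed line matches the cSYR row of the PED example on the slide.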

  18. Annotation of SNPs

  19. Annotation of SNPs

  20. Diversity analysis
     • SeqLib library

  21. Haplotype networks
     • Low frequency haplotype
     • High frequency haplotypes
     • Distance between 2 haplotypes (number of mutations)
     • Group distribution within this haplotype
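The distance between two haplotypes, counted as a number of mutations, can be illustrated with a short Python sketch. It assumes that both haplotypes are aligned and of equal length and that every differing position counts as one mutation; the function name `haplotype_distance` is hypothetical, and the two sequences are Ind1 and Ind2 from the objectives slide.

```python
def haplotype_distance(hap_a, hap_b):
    """Number of positions at which two aligned haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(1 for a, b in zip(hap_a, hap_b) if a != b)

ind1 = "ATTGTGTCGTAACGTATGTCATGTCGT"
ind2 = "ATTGTGTCGGAACGTATGTCATGTCGT"
print(haplotype_distance(ind1, ind2))  # → 1
```

In a haplotype network, such pairwise distances would label the edges between haplotypes.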

  22. Allele sharing between groups
     External file (optional):
       Individual, group
       Ind1, Table
       Ind2, Table
       Ind3, Table
       Ind4, East
       Ind5, East
       Ind6, East
       Ind7, East
       Ind8, West
