Detection and analysis of SNP polymorphisms Alexis Dereeper CIBA courses – Brasil 2011
Objectives
• To know and use the available packages/tools for SNP and INDEL detection from NGS data (assembly of NGS data)
• To consider the difficulties encountered when analysing next-generation sequencing data (distinguishing sequencing errors, paralogs and allelic variation)
• To detect SNPs and assign genotypes at every polymorphic position
• To exploit polymorphism data simply via a Web-based application (genetic diversity, LD)
• To obtain an exploitable dataset to send for the design of a high-throughput SNP chip (Illumina VeraCode technology)
[Figure labels: Short reads (Solexa); Mapping (SAM); Allelic variations; List of SNPs (A/G 1998 T/C 2341 T/G); Assignment of genotypes; Exploitation of polymorphism data; Design of an Illumina SNP chip]
Ind1 ATTGTGTCGTAACGTATGTCATGTCGT
Ind2 ATTGTGTCGGAACGTATGTCATGTCGT
Ind3 ATTGTGTCGKAACGTATGTCATGTCGT
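In the Ind1/Ind2/Ind3 example, the K in Ind3 is the IUPAC ambiguity code for a G/T heterozygote. The sketch below shows how such codes can be derived from two alleles. It is only an illustration: the function name `genotype_code` is hypothetical and not part of any tool named in these slides.

```python
# IUPAC ambiguity codes for unordered pairs of different nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def genotype_code(allele1, allele2):
    """Return a one-letter code for a diploid genotype (hypothetical helper)."""
    if allele1 == allele2:
        return allele1  # homozygous: the base itself
    return IUPAC[frozenset((allele1, allele2))]

ind1 = "ATTGTGTCGTAACGTATGTCATGTCGT"
ind2 = "ATTGTGTCGGAACGTATGTCATGTCGT"

# An individual carrying one copy of each sequence is heterozygous
# wherever Ind1 and Ind2 differ.
het = "".join(genotype_code(a, b) for a, b in zip(ind1, ind2))
print(het)
```

Combining Ind1 and Ind2 this way reproduces the Ind3 sequence from the slide.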
Tablet
• Graphical viewer for assemblies of NGS data
• Accepts different formats: ACE, SAM, BAM
Automatic detection of SNPs from a SAM assembly
Example of a pipeline feasible with the Galaxy system: 3 alternatives
Packages involved: PicardTools, SamTools, BWA, GATK, VarScan, SNiPlay Utilities
• Common steps: Fastq → FastQ Groomer → Mapping BWA → SAM assembly
• VarScan: SAM-to-BAM → Generate Pileup → Pileup file → Pileup2snp → SNP tabular file
• SNiPlay Utilities: SamToFastaAlignments → FASTA alignments with IUPAC
• GATK: AddReadGroupIntoSam → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file → VCFToFastaAlignments
VarScan
• Program for SNP detection from a Pileup file: Pileup2snp
• Another module, Pileup2indel, exists for indels but is not yet implemented in Galaxy SouthGreen
Pileup format
Text file describing, for each position: reference base, depth of coverage, variations, quality
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
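To make the pileup columns concrete, here is a minimal sketch that counts the bases observed at one position. It assumes the usual SAMtools pileup conventions (`.` and `,` match the reference, `^` plus one character marks a read start, `$` a read end, `+N`/`-N` an indel). The function name `count_bases` is hypothetical, and this is not how VarScan itself is implemented.

```python
from collections import Counter

def count_bases(ref, bases):
    """Count the bases seen in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":                 # read start: skip the mapping-quality char too
            i += 2
        elif c in "+-":              # indel: +<length><sequence>, skipped here
            i += 1
            digits = ""
            while i < len(bases) and bases[i].isdigit():
                digits += bases[i]
                i += 1
            i += int(digits)
        elif c in ".,":              # match to the reference (either strand)
            counts[ref.upper()] += 1
            i += 1
        elif c.upper() in "ACGTN":   # mismatch (either strand)
            counts[c.upper()] += 1
            i += 1
        else:                        # "$" (read end), "*" (deleted base), others
            i += 1
    return counts

line = "seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<"
chrom, pos, ref, depth, bases, quals = line.split()
counts = count_bases(ref, bases)
print(chrom, pos, dict(counts))
```

At position 277 the counts add up to the reported depth of 22, with one C and one G read against the reference T.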
SamToFastaAlignments and AceToFastaAlignments: SNiPlay utilities for management of NGS data
• Input: mapping (SAM format) or assembly (Ace format)
• Threshold values per genotype: depth threshold and frequency threshold, used for heterozygosity estimation
• For each contig: list of heterozygous positions
• Stats: estimation of average heterozygosity for each genotype
• Output: FASTA alignments including IUPAC codes (CL1Contig1.align.fa, CL1Contig2.align.fa, CL2Contig1.align.fa, …)
[Figure: example table for CL1Contig1 with column labels Depth, Frequency, Depth, Heterozygosity; rows genotype1 0 1 1, genotype2 0.3 4 2, genotype3 0.3 4 2; example alignment column A, A, Y, T, W]
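The sketch below shows one way depth and frequency thresholds could turn base counts into an IUPAC genotype and an average heterozygosity. The exact rules used by the SNiPlay utilities are not given in the slides. Here the depth threshold is assumed to be a minimum number of reads and the frequency threshold a minimum allele frequency, and both function names are hypothetical.

```python
# IUPAC codes keyed by the two alleles in alphabetical order.
IUPAC = {"AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M"}

def call_position(counts, min_depth, min_freq):
    """Return a base, an IUPAC code, or 'N' for one position (assumed rules)."""
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"                      # not enough reads to decide
    kept = sorted(b for b, n in counts.items() if n / depth >= min_freq)
    if len(kept) == 1:
        return kept[0]                  # homozygous
    if len(kept) == 2:
        return IUPAC["".join(kept)]     # heterozygous
    return "N"                          # zero or more than two alleles kept

def average_heterozygosity(calls):
    """Fraction of called (non-N) positions that are heterozygous."""
    called = [c for c in calls if c != "N"]
    if not called:
        return 0.0
    return sum(1 for c in called if c not in "ACGT") / len(called)

print(call_position({"T": 20, "C": 1, "G": 1}, min_depth=4, min_freq=0.3))
print(call_position({"C": 6, "T": 5}, min_depth=4, min_freq=0.3))
```

With these assumed thresholds, a single stray C or G read among 22 is ignored, while a balanced C/T position is written as Y.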
GATK (Genome Analysis ToolKit)
• Package for the analysis of NGS data
• Developed for the analysis of human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)
• Includes tools for depth analysis, quality score recalibration, SNP/InDel discovery
• Complementary to 2 other packages: SamTools, PicardTools
PREPROCESS:
* Index human genome (Picard); we used HG18 from UCSC
* Convert Illumina reads to Fastq format
* Convert Illumina 1.6 read quality scores to standard Sanger scores
FOR EACH SAMPLE:
1. Align samples to genome (BWA), generates SAI files
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAM Tools)
4. Sort BAM (SAM Tools)
5. Index BAM (SAM Tools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAM Tools)
9. Call Indels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
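The quality-score conversion in the PREPROCESS step can be sketched as follows. It assumes the Illumina quality string uses the Phred+64 encoding and that "Sanger" means Phred+33. The function name `illumina_to_sanger` is hypothetical. In Galaxy this kind of conversion is normally done by FastQ Groomer rather than by hand.

```python
def illumina_to_sanger(quality):
    """Convert a Phred+64 quality string to Phred+33 (assumed encodings)."""
    scores = [ord(ch) - 64 for ch in quality]      # decode Phred+64
    if any(q < 0 for q in scores):
        raise ValueError("quality string is not Phred+64 encoded")
    return "".join(chr(q + 33) for q in scores)    # re-encode as Phred+33

# 'h' encodes Q40 and 'B' encodes Q2 in Phred+64.
print(illumina_to_sanger("hhhB"))
```

Running the converter on a string that is already Sanger-encoded raises an error for low-quality characters, which helps catch files that were groomed twice.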
[Pipeline diagram, one mapping per sample]
• For each Fastq (RC1, RC2, RC3, RC4, …): FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → SAM with read group
• Then: mergeSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file
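Tagging each per-sample SAM with a read group is what keeps reads attributable to their sample after merging. A minimal sketch of the idea follows, using the `@RG` header line and the `RG:Z:` alignment tag from the SAM format. The function name `add_read_group` and the example records are hypothetical, and this is not the code of the AddReadGroupIntoSam Galaxy tool.

```python
def add_read_group(sam_lines, rg_id, sample):
    """Add an @RG header line and an RG:Z tag to every alignment record."""
    rg_header = f"@RG\tID:{rg_id}\tSM:{sample}"
    out = []
    header_written = False
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):            # existing header lines are kept
            out.append(line)
            continue
        if not header_written:              # @RG goes after the existing header
            out.append(rg_header)
            header_written = True
        out.append(f"{line}\tRG:Z:{rg_id}")  # tag the alignment record
    if not header_written:                  # file without alignments
        out.append(rg_header)
    return out

# Hypothetical two-line header and one alignment record.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:seq1\tLN:1000",
    "read1\t0\tseq1\t272\t60\t5M\t*\t0\t0\tACGTA\t<<<<<",
]
tagged = add_read_group(sam, rg_id="RC1", sample="RC1")
print("\n".join(tagged))
```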
[Pipeline diagram, one global mapping]
• Fastq (RC1, RC2, RC3, RC4) → Fastq global → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → Global SAM with read group
• Then: SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file
VCF format (Variant Call Format)
Advantages: describes the variations at each position + genotype assignment
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
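To show how the genotype assignment is read from a VCF record, the sketch below decodes the GT field of each sample into actual alleles. It splits on whitespace because the slide shows spaces, whereas real VCF files are tab-delimited. The function name `parse_genotypes` is hypothetical, and phase information (`|` versus `/`) is deliberately dropped.

```python
def parse_genotypes(header_line, data_line):
    """Map each sample name to its alleles, e.g. 'A/G' (phase is ignored)."""
    samples = header_line.lstrip("#").split()[9:]   # names follow FORMAT
    fields = data_line.split()
    alleles = [fields[3]] + fields[4].split(",")    # index 0 = REF, 1.. = ALT
    keys = fields[8].split(":")                     # e.g. GT:GQ:DP:HQ
    result = {}
    for name, column in zip(samples, fields[9:]):
        values = dict(zip(keys, column.split(":")))
        indices = values["GT"].replace("|", "/").split("/")
        result[name] = "/".join(
            "." if i == "." else alleles[int(i)] for i in indices
        )
    return result

header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002"
record = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
          "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51")
genotypes = parse_genotypes(header, record)
print(genotypes)
```

For the first record of the slide, NA00001 is homozygous for the reference G and NA00002 carries one A and one G.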
Other functionalities of GATK
• DepthOfCoverage module: reports the sequencing depth of coverage for each gene, each position and each individual
• ReadBackedPhasing module: determines, where possible, which alleles are associated (phase or haplotype) at heterozygous positions
[Figure: phasing example contrasting haplotypes, labelled "And not", with AGG and GGA]
SNiPlay: Web-based application for polymorphism analysis
http://sniplay.cirad.fr
Options of SNiPlay
• Select the VCF format
• Load the VCF file
• Load the reference file
• Select the Rice genome as reference
Design of Illumina chip
• Submission file for Illumina
• Genotyping file
• Analysis with the BeadStudio software (Cartesian coordinates)
Allelic files
• PED format
cARB 1 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
cSYR 2 0 0 1 0 1 1 1 1 3 3 1 3 4 4 2 2 2 2 1 1 4 4 2 4
cARA 3 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
• DARwin format
@DARwin 5.0 - ALLELIC - 2
33 20
N° 50 50 122 122 218 218 245 245 261 261 290 290 356
1 1 1 1 1 3 3 3 3 4 4 2 2 2
2 1 1 1 1 3 3 1 3 4 4 2 2 2
3 1 1 1 1 3 3 3 3 4 4 2 2 2
4 1 1 1 1 3 3 3 3 4 4 2 2 2
• .inp format for Phase
33
10
P 49 121 217 244 260 289
SSSSSSSSSS
#cARB
A A G G T C C A T T
A A G G T C C A T T
#cSYR
A A G A T C C A T C
A A G G T C C A T T
• Format for TASSEL (association studies)
33 10:2
50 122 218 245 261 290 356 461 467 560
cARB A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T
cARA A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cORL A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cLAR A:G A:G A:G A:G C:T C:C C:C A:A T:T C:T
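The PED and TASSEL examples describe the same genotypes: comparing the rows shows the numeric coding 1=A, 2=C, 3=G, 4=T. The sketch below converts one TASSEL-style row into a PED line. The function name `to_ped_line` is hypothetical, and the fixed values for father, mother, sex and phenotype (0 0 1 0) are simply copied from the rows shown on the slide.

```python
# Numeric allele coding inferred from the PED and TASSEL rows on the slide.
ALLELE_CODE = {"A": "1", "C": "2", "G": "3", "T": "4"}

def to_ped_line(name, index, genotypes):
    """Build a PED line from genotypes written as 'A:G' (hypothetical helper)."""
    # Family ID, individual ID, father, mother, sex, phenotype.
    fields = [name, str(index), "0", "0", "1", "0"]
    for genotype in genotypes:
        first, second = genotype.split(":")
        # Unknown alleles are written as 0, the usual PED missing value.
        fields.append(ALLELE_CODE.get(first, "0"))
        fields.append(ALLELE_CODE.get(second, "0"))
    return " ".join(fields)

tassel_row = "cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T"
name, *genotypes = tassel_row.split()
ped_line = to_ped_line(name, 2, genotypes)
print(ped_line)
```

The result is identical to the cSYR line of the PED example.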
Annotation of SNPs
Diversity analysis
• SeqLib library
Haplotype networks
• Low-frequency haplotype
• High-frequency haplotypes
• Distance between 2 haplotypes (number of mutations)
• Group distribution within this haplotype
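The two quantities drawn in a haplotype network, haplotype frequency and the number of mutations between two haplotypes, can be sketched in a few lines. The haplotype strings below are made-up illustration data and `haplotype_distance` is a hypothetical helper. Building the network itself is not shown.

```python
from collections import Counter

def haplotype_distance(hap1, hap2):
    """Number of positions at which two equal-length haplotypes differ."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have the same length")
    return sum(a != b for a, b in zip(hap1, hap2))

# Made-up haplotypes observed in four chromosomes.
observed = ["AGG", "AGG", "AGG", "GGA"]
frequencies = Counter(observed)   # node size in the network
distance = haplotype_distance("AGG", "GGA")   # edge length in the network

print(frequencies.most_common())
print(distance)
```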
Allele sharing between groups
External file (optional):
Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West