410 likes | 564 Views
Variant Calling Workshop. Overview. There will be two parts to the workshop: Variant calling analysis (on the cluster) Visualization (on the desktop) using IGV Command prompts (what you will type) will be in boxes preceded by ‘$’ . Output will be in red:. $ mkdir foo $ cd foo
E N D
Overview • There will be two parts to the workshop: • Variant calling analysis (on the cluster) • Visualization (on the desktop) using IGV • Command prompts (what you will type) will be in boxes preceded by ‘$’. Output will be in red: $ mkdir foo $ cd foo $ ls -la total 96 drwxrwxr-x 2 cjfieldscjfields 32768 Jun 23 22:51 . drwxr-x--- 39 cjfieldscjfields 32768 Jun 23 22:51 ..
Prelude : Variant Calling Setup • Log into the cluster using your classroom account. • Create a work folder (I call mine ‘mayo_test’): $ mkdirmayo_test $ cd mayo_test $ ll total 0
Part Ia : Variant Calling Setup • Link in all scripts from the main work folder to this directory: $ ln -s /home/mirrors/gatk_bundle/mayo_workshop/*.sh . $ ls annotate_snpeff.shcall_variants_ug.shhard_filtering.shpost_annotate.sh
Part Ia : Variant Calling Setup • Data for this workshop is from the 1000 Genomes project and is WGS, 60x coverage • The initial part of the GATK pipeline (alignment, local realignment, base quality score recalibration) has been done, and the BAM file has been reducedfor a portion of human chromosome 20 • Otherwise, we would not even finish the alignment within the next few days, let alone the other steps
Part Ia: Variant Calling • Start the variant calling job. Check the status of the job using ‘qstat’: $ qsubcall_variants_ug.sh 371875.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371875.biocluste cjfields default call_variants_ug 24285 1 4 10gb -- R 00:01
Part Ia: Variant Calling • Discussion: what did we just do? • We ran the GATK UnifiedGenotyper to call variants • Show the script…
Part Ia: Variant Calling • Job done yet? Should only be a few minutes… • What do the data look like? (anyone here use UNIX?) $ qstat -u <YOUR USER NAME> $ ll *vcf* -rw-rw-r-- 1 cjfieldscjfields 237060 Jun 23 23:10 raw_indels.vcf -rw-rw-r-- 1 cjfieldscjfields 2829 Jun 23 23:10 raw_indels.vcf.idx -rw-rw-r-- 1 cjfieldscjfields 3203447 Jun 23 23:08 raw_snps.vcf -rw-rw-r-- 1 cjfieldscjfields 107241 Jun 23 23:08 raw_snps.vcf.idx $ tail -n 2 raw_indels.vcf 20 26306897 rs200138621 CAGA C 1305.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.130;DB;DP=75;FS=0.936;MLEAC=1;MLEAF=0.500;MQ=57.75;MQ0=0;MQRankSum=0.407;QD=5.80;ReadPosRankSum=0.371 GT:AD:DP:GQ:PL 0/1:44,26:75:99:1343,0,2526 20 26314306 rs199619140 GT G 1502.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.814;DB;DP=83;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=57.12;MQ0=0;MQRankSum=-1.411;QD=18.11;ReadPosRankSum=1.387 GT:AD:DP:GQ:PL 0/1:33,36:76:99:1540,0,1253
Part Ia: Variant Calling • How many SNPs and Indels were called? • Any found in dbSNP? $ grep -c -v '^#' raw_snps.vcf 13621 $ grep -c -v '^#' raw_indels.vcf 1070 $ grep -c 'rs[0-9]*' raw_snps.vcf 12245 $ grep -c 'rs[0-9]*' raw_indels.vcf 1019
Part Ib : Hard filtering • We need to filter the variant calls • Generally, for human data we would use variant quality score recalibration, but we have a very small set of variants, so here we use hard filtering
Part Ib : Hard filtering • Start the hard filtering step. This will be fast: • You will have two new VCF files in a minute: • hard_filtered_snps.vcf • hard_filtered_indels.vcf $ qsubhard_filtering.sh 371886.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371886.biocluste cjfields default hard_filtering.s 24455 1 4 10gb -- R --
Part Ib : Hard filtering • What are we doing? • <Show the code!> • Questions: • Did we lose any variants? • How many PASS’ed the filter? • What is the difference in the filtered and raw output?
Part Ib : Hard filtering • What are we doing? • <Show the code!> • Questions: • Did we lose any variants? • How many PASS’ed the filter? • What is the difference in the filtered and raw output? $ grep -c 'PASS' hard_filtered_snps.vcf 8270 $ grep -c 'PASS' hard_filtered_indels.vcf 1041
Part Ic: Annotate the variants (SnpEff) • Run the next job, which uses SnpEff to add annotation to the VCF: • This takes a couple of minutes… • Two new VCF: • hard_filtered_snps_annotated.vcf • hard_filtered_indels_annotated.vcf $ qsubannotate_snpeff.sh 371894.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371894.biocluste cjfieldsdefaultannotate_snpeff. 18957 1 1 10gb -- R --
Part Ic : Annotate the variants (SnpEff) • SnpEff adds information about where the variants are in relation to specific genes • The IDs for the human assembly version we use are from Ensembl (ENSGXXXXXXXXXXX) • The Ensembl ID for FOXA2 is ENSG00000125798
Part Ic : Annotate the variants (SnpEff) • The Ensembl ID for FOXA2 is ENSG00000125798 • Are there any variants called for FOXA2?
Part Ic : Annotate the variants (SnpEff) • The Ensembl ID for FOXA2 is ENSG00000125798 • Are there any variants called for FOXA2? • SnpEff also creates some additional output files; we’ll see those in a bit $ grep -c 'ENSG00000125798' hard_filtered_snps_annotated.vcf 3 $ grep -c 'ENSG00000125798' hard_filtered_indels_annotated.vcf 0
Part Id : GATK VariantAnnotator • SnpEff adds a lot of information to the VCF. • GATK VariantAnnotator helps remove a lot of the extraneous information
Part Id : GATK VariantAnnotator • The last step: • This may take about 5-10 minutes $ qsubpost_annotate.sh 371905.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371905.biocluste cjfieldsdefaultpost_annotate.sh 24650 1 4 10gb -- R 00:01
While this is going on… • Let’s start a little tutorial on the Integrated Genome Viewer (also from Broad)
Prelude to Part II • We need to download the results from your user folders to the local desktop • We’ll use FileZilla for this
FileZilla • Transfer folder to the desktop
Part II : Viewing Results in IGV • Open IGV • Switch genome to ‘Human (b37)’
Part II : Viewing Results in IGV • Load the VCF files marked ‘final’ • Click on the ‘20’ (chr 20)
Part II : Viewing Results in IGV • Click and drag from the ‘20 mb’ mark to roughly the centromeric region on the chromosome (~2.6 mb)
Part II : Viewing Results in IGV • Click and drag from the ‘20 mb’ mark to roughly the centromeric region on the chromosome (~2.6 mb)
Part II : Viewing Results in IGV • Should look something like this:
Part II : Viewing Results in IGV • Right click on the track to bring up a menu, and then select ‘Set Feature Visibility Window’ • Set to ‘10000000’ (10 million, or 10 mb)
Part II : Viewing Results in IGV • In the search box, enter in ‘FOXA2’ • Browser jumps to that gene
Part II : Viewing Results in IGV • How many SNPs are here? • How many indels? • How many SNPs are ‘hets’?
Part II : Viewing Results in IGV • Now we’ll load in a ‘small’ BAM file. • This is the same BAM file you analyzed on the cluster • In the ‘gatk_bundle’ folder there is a BAM file named ‘NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20_arm1.bam’. Load that file. What is happening here?
Part II : Viewing Results in IGV • Right click on BAM track and select ‘Show coverage track’ • Note colors in coverage track
Part II : Viewing Results in IGV • Right click on BAM track and select ‘Show coverage track’ • Note colors in coverage track
Part II : Viewing Results in IGV • Zoom in to regions with SNPs • Mouse-over the SNP regions in the VCF tracks to get more information (including annotation from SnpEff and VariantAnnotator)
Browse around… • How are filtered SNPs are displayed? • How about low-quality bases? • Can you change the display to… • Sort BAM reads by strand?
Part III : SnpEff Results • SnpEff gives a nice summary HTML file that should have been downloaded with your result (one for SNP, one for indel results) • Open those up in a browser.