Variant Calling Workshop

Variant Calling Workshop

Overview • There will be two parts to the workshop: • Variant calling analysis (on the cluster) • Visualization (on the desktop) using IGV • Command prompts (what you will type) will be in boxes preceded by ‘$’. Output will be in red: $ mkdir foo $ cd foo $ ls -la total 96 drwxrwxr-x 2 cjfieldscjfields 32768 Jun 23 22:51 . drwxr-x--- 39 cjfieldscjfields 32768 Jun 23 22:51 ..

Prelude : Variant Calling Setup • Log into the cluster using your classroom account. • Create a work folder (I call mine ‘mayo_test’): $ mkdirmayo_test $ cd mayo_test $ ll total 0

Part Ia : Variant Calling Setup • Link in all scripts from the main work folder to this directory: $ ln -s /home/mirrors/gatk_bundle/mayo_workshop/*.sh . $ ls annotate_snpeff.shcall_variants_ug.shhard_filtering.shpost_annotate.sh

Part Ia : Variant Calling Setup • Data for this workshop is from the 1000 Genomes project and is WGS, 60x coverage • The initial part of the GATK pipeline (alignment, local realignment, base quality score recalibration) has been done, and the BAM file has been reducedfor a portion of human chromosome 20 • Otherwise, we would not even finish the alignment within the next few days, let alone the other steps

Part Ia: Variant Calling • Start the variant calling job. Check the status of the job using ‘qstat’: $ qsubcall_variants_ug.sh 371875.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371875.biocluste cjfields default call_variants_ug 24285 1 4 10gb -- R 00:01

Part Ia: Variant Calling • Discussion: what did we just do? • We ran the GATK UnifiedGenotyper to call variants • Show the script…

Part Ia: Variant Calling • Job done yet? Should only be a few minutes… • What do the data look like? (anyone here use UNIX?) $ qstat -u <YOUR USER NAME> $ ll *vcf* -rw-rw-r-- 1 cjfieldscjfields 237060 Jun 23 23:10 raw_indels.vcf -rw-rw-r-- 1 cjfieldscjfields 2829 Jun 23 23:10 raw_indels.vcf.idx -rw-rw-r-- 1 cjfieldscjfields 3203447 Jun 23 23:08 raw_snps.vcf -rw-rw-r-- 1 cjfieldscjfields 107241 Jun 23 23:08 raw_snps.vcf.idx $ tail -n 2 raw_indels.vcf 20 26306897 rs200138621 CAGA C 1305.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.130;DB;DP=75;FS=0.936;MLEAC=1;MLEAF=0.500;MQ=57.75;MQ0=0;MQRankSum=0.407;QD=5.80;ReadPosRankSum=0.371 GT:AD:DP:GQ:PL 0/1:44,26:75:99:1343,0,2526 20 26314306 rs199619140 GT G 1502.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.814;DB;DP=83;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=57.12;MQ0=0;MQRankSum=-1.411;QD=18.11;ReadPosRankSum=1.387 GT:AD:DP:GQ:PL 0/1:33,36:76:99:1540,0,1253

Part Ia: Variant Calling • How many SNPs and Indels were called? • Any found in dbSNP? $ grep -c -v '^#' raw_snps.vcf 13621 $ grep -c -v '^#' raw_indels.vcf 1070 $ grep -c 'rs[0-9]*' raw_snps.vcf 12245 $ grep -c 'rs[0-9]*' raw_indels.vcf 1019

Part Ib : Hard filtering • We need to filter the variant calls • Generally, for human data we would use variant quality score recalibration, but we have a very small set of variants, so here we use hard filtering

Part Ib : Hard filtering • Start the hard filtering step. This will be fast: • You will have two new VCF files in a minute: • hard_filtered_snps.vcf • hard_filtered_indels.vcf $ qsubhard_filtering.sh 371886.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371886.biocluste cjfields default hard_filtering.s 24455 1 4 10gb -- R --

Part Ib : Hard filtering • What are we doing? • <Show the code!> • Questions: • Did we lose any variants? • How many PASS’ed the filter? • What is the difference in the filtered and raw output?

Part Ib : Hard filtering • What are we doing? • <Show the code!> • Questions: • Did we lose any variants? • How many PASS’ed the filter? • What is the difference in the filtered and raw output? $ grep -c 'PASS' hard_filtered_snps.vcf 8270 $ grep -c 'PASS' hard_filtered_indels.vcf 1041

Part Ic: Annotate the variants (SnpEff) • Run the next job, which uses SnpEff to add annotation to the VCF: • This takes a couple of minutes… • Two new VCF: • hard_filtered_snps_annotated.vcf • hard_filtered_indels_annotated.vcf $ qsubannotate_snpeff.sh 371894.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371894.biocluste cjfieldsdefaultannotate_snpeff. 18957 1 1 10gb -- R --

Part Ic : Annotate the variants (SnpEff) • SnpEff adds information about where the variants are in relation to specific genes • The IDs for the human assembly version we use are from Ensembl (ENSGXXXXXXXXXXX) • The Ensembl ID for FOXA2 is ENSG00000125798

Part Ic : Annotate the variants (SnpEff) • The Ensembl ID for FOXA2 is ENSG00000125798 • Are there any variants called for FOXA2?

Part Ic : Annotate the variants (SnpEff) • The Ensembl ID for FOXA2 is ENSG00000125798 • Are there any variants called for FOXA2? • SnpEff also creates some additional output files; we’ll see those in a bit $ grep -c 'ENSG00000125798' hard_filtered_snps_annotated.vcf 3 $ grep -c 'ENSG00000125798' hard_filtered_indels_annotated.vcf 0

Part Id : GATK VariantAnnotator • SnpEff adds a lot of information to the VCF. • GATK VariantAnnotator helps remove a lot of the extraneous information

Part Id : GATK VariantAnnotator • The last step: • This may take about 5-10 minutes $ qsubpost_annotate.sh 371905.biocluster.igb.illinois.edu $ qstat -u <YOUR USER NAME> biocluster.igb.illinois.edu: Req'dReq'dElap Job ID Username Queue JobnameSessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 371905.biocluste cjfieldsdefaultpost_annotate.sh 24650 1 4 10gb -- R 00:01

While this is going on… • Let’s start a little tutorial on the Integrated Genome Viewer (also from Broad)

Prelude to Part II • We need to download the results from your user folders to the local desktop • We’ll use FileZilla for this

FileZilla

FileZilla • Transfer folder to the desktop

Part II : Viewing Results in IGV • Open IGV • Switch the genome to ‘Human (b37)’

Variant Calling Workshop