170 likes | 292 Views
ParSNP Hash. Pipeline to parse SNP data and output summary statistics across sliding windows. Objective. Parse VCF files Calculate summary statistics across sliding windows throughout the genome
E N D
ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows
Objective • Parse VCF files • Calculate summary statistics across sliding windows throughout the genome • Implement NTFreq module to calculate nucleotide frequencies for each population and combined populations • Implement TajimasD module to calculate Tajima’s D • Implement GO module to annotate identified SNPs
Data set • Simulated data set for chromosome 2R in Drosophila melanogaster • 1.4 Mbp • 2 populations • Pooled individuals per population • 75bp reads, error rate 1% • 10,000 simulated SNPs • 100x coverage per variant • At least 100bp apart • Allelic Frequencies ranging from .1 to .9 per population
FastQ -> sai -> SAM -> BAM - > .bcf -> VCF Data to Variant Call Format Index Reference Genome Only chromosome 2R of D. melanogaster -Genome build Dmel 3 from Flybase Use BWA to Align FastQ to Reference Genome Gap open penalty = 1 Disallowing deletion within 12bp of 3’UTR Gap extension max = 12 Maximum level of gap extensions = 12 Seeding length of 35 bp Use SAMTools to Remove Ambiguously mapped Regions (MAPQ >= 20) Use BCFToolsmpileup to Generate a Binary Code Format (BCF) BCF -> VCF
Formatting data: Parse VCF For each window: • Fetch the VCF rows from each BCF file • Convert the VCF rows into hashes of arrays • Compute the Theta, Pi, Tajima’s D for eachpopulation • Compute Fst for each window between each population
Sliding windows • Sliding window size is specified by user • Called modules are calculated across specified window size
Module 1: Calculate allele frequencies • Input is taken from parsed VCF file • Hashes are created for each population with the following structure • {SNP_location} {nucleotide} -> frequency; • Hashes created for full dataset • {SNP_location}{Population} -> {nucleotide} -> frequency
Output site frequency spectra • Site frequency spectrum (SFS) output as the following hash: • $SFS{frequency}->count; • Allows us to calculate a histogram for the non-reference allele frequencies • Send output to R to generate SFS graphs
Module 2: Calculate Summary Statistics and Tajima’s D • theta_pi (index of diversity) • theta_watterson (index of diversity)
Module 2: Calculate Summary Statistics and Tajima’s D • Tajima’s D (index of selection/population expansion)
Module 3: FST for DNA sequence • Calculate FST (index of differentiation) according to Hudson et al. 1992 1 – Hw/Hb Hw: average number of differences within each population Hb: average number of differences between the 2 populations
Module 4: GO annotations • Module takes SNP list as input • Outputs the following: • List of genes that have overlap with SNP positions • Gene Ontology (GO) IDs and terms associated with each SNP matched gene • List of genes for a selected window • Visualization using GOSlim
Data visualization • Integrated Genomics Viewer (IGV) • Broad Institute • http://www.broadinstitute.org/igv/
Sliding window for summary statistics Phist greater than 0.1 in window 1080001 - 1100000 Go Accession ID Ontology Specific GO:0000124 Cellular Component Spt-Ada-Gcn5-acetyltransferase complex GO:0005703 Cellular Component (Thought to be a site of active transcription) GO:0005634 Cellular Component (Nucleus) GO:0006911 Biological Process Phagosome biosynthesis/formation GO:0045747 Biological Process Up regulation of Notch signaling pathway GO:0006355 Biological Process Regulation of cellular transcription, DNA-dependent GO:0000910 Biological Process (Cytoplasm division) GO:0016773 Molecular Function (Intermolecular transfer of phosphorus group to an alcohol group) GO:0005700 Cellular Component (Polytene associated) GO:0005488 Molecular Function (Ligand, non-covalent partner) GO:0005737 Cellular Component (Ambiguous) GO:0035222 Biological Process (Patterning in wing imaginal disc) GO:0005875 Cellular Component (Microtubule associated) GO:0004672 Molecular Function Protaminekinase activity GO:0000123 Cellular Component Histoneacetylase complex
Identify differentiated genomic regions • For each window with a Fst > 0.1, print the name of the SNP and associated GO term Phist (Fst) greater than 0.1 in window 1080001 - 1100000 Go Accession ID Ontology Specific GO:0000124 Cellular Component Spt-Ada-Gcn5-acetyltransferase complex GO:0005703 Cellular Component (Thought to be a site of active transcription) GO:0005634 Cellular Component (Nucleus) GO:0006911 Biological Process Phagosome biosynthesis/formation GO:0045747 Biological Process Regulation of cellular transcription, DNA-dependent GO:0000910 Biological Process (Cytoplasm division) GO:0016773 Molecular Function (Intermolecular transfer of phosphorus group to an alcohol group)GO:0005700 Cellular Component (Polytene associated) GO:0005488 Molecular Function (Ligand, non-covalent partner) GO:0005737 Cellular Component (Ambiguous) GO:0035222 Biological Process (Patterning in wing imaginal disc) GO:0005875 Cellular Component (Microtubule associated) GO:0004672 Molecular Function Protaminekinase activity GO:0000123 Cellular Component Histoneacetylase complex
Thank You Use PERLordie,print“ (X_x)”; ##Hashes to Hashes## Print“ % 2 %”; __END__