SNP Scores
Overall Score • Overall Score = Coverage Score × 4 optional scores • Read Balance Score • = 1 if reads are balanced in each direction • Allele Balance Score • = 1 if the SNP count is balanced relative to the read count in each direction • Homopolymer Score • = 1 if the SNP is not an indel in a homopolymer • Mismatch Score • = 1 if fewer than 3 SNPs, each occurring in a minimum number of reads, are present within 10 bp on either side of the SNP • Maximum score = 30 × 1 × 1 × 1 × 1 = 30
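The multiplication on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function and parameter names are hypothetical and are not taken from the program.

```python
def overall_score(coverage_score, read_balance=1.0, allele_balance=1.0,
                  homopolymer=1.0, mismatch=1.0):
    """Multiply the Coverage Score by the four optional scores.

    Each optional score is assumed to lie between 0 and 1, so it can only
    keep or lower the Coverage Score (maximum 30).
    """
    return coverage_score * read_balance * allele_balance * homopolymer * mismatch


# No penalties: the overall score equals the Coverage Score.
best = overall_score(30)

# Two optional scores of 0.5 each cut the overall score to a quarter.
penalized = overall_score(30, read_balance=0.5, mismatch=0.5)
```

Because the optional scores are multiplied, a single low optional score is enough to pull down an otherwise well-covered SNP.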
Interpreting the Score • Scores are an empirical estimate of how likely it is that a given SNP is real and not an artifact of sequencing or alignment • The score uses the Phred scale • 30 = 1 in 1,000 SNPs with this score is not real • 20 = 1 in 100 is not real • 10 = 1 in 10 is not real
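The three values on this slide follow from the standard Phred relationship, Q = -10 · log10(P), where P is the probability that the call is wrong. A minimal sketch of that conversion (the function names are illustrative, not part of the program):

```python
import math


def phred_to_error_probability(score):
    """Probability that a call with this Phred-scaled score is wrong."""
    return 10 ** (-score / 10)


def error_probability_to_phred(probability):
    """Phred-scaled score for a given error probability."""
    return -10 * math.log10(probability)


for q in (30, 20, 10):
    p = phred_to_error_probability(q)
    print(f"score {q}: about 1 in {round(1 / p)} SNPs is not real")
```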
Interpreting the Score • A low score does not mean the mutation is more likely to be false; it only means the mutation cannot be confidently called as a true mutation. • Even real SNPs will have low scores if the coverage is low.
Optional Scores • The optional scores can be ignored (set equal to 1) in the final score calculation by adjusting the settings for the mutation report. • The homopolymer score is always ignored unless the data is Roche data.
Optional Scores • You may want to ignore certain optional scores depending on your data • For example: if your data is all (or nearly all) one-directional, you can choose to ignore the Read Balance score because even real SNPs will not be balanced • Homopolymer scores are automatically ignored unless Roche data is being analyzed
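Ignoring an optional score amounts to replacing it with 1 before the scores are multiplied. The sketch below illustrates that rule, including the automatic handling of the homopolymer score; the score keys, the `ignored` argument, and the `roche_data` flag are hypothetical names chosen for this example and do not correspond to the program's actual settings.

```python
OPTIONAL_SCORES = ("read_balance", "allele_balance", "homopolymer", "mismatch")


def apply_ignore_settings(scores, ignored=(), roche_data=False):
    """Return the optional scores with every ignored score set to 1.0."""
    ignored = set(ignored)
    if not roche_data:
        # The homopolymer score only applies to Roche data.
        ignored.add("homopolymer")
    return {
        name: 1.0 if name in ignored else scores.get(name, 1.0)
        for name in OPTIONAL_SCORES
    }


# One-directional, non-Roche data: ignore the Read Balance score.
raw = {"read_balance": 0.4, "allele_balance": 0.9,
       "homopolymer": 0.5, "mismatch": 1.0}
effective = apply_ignore_settings(raw, ignored=["read_balance"])
```

Here `effective` keeps the allele balance and mismatch scores, while the read balance and homopolymer scores no longer affect the product.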
Coverage Score • If the SNP count is greater than 50 (example: 50% SNP at 100 coverage), then the score is 30. • Otherwise the score is calculated according to a formula in which % = SNP allele percentage and # = SNP count
Coverage Score • This score is based on the Gompertz function where a, b, and c have been adjusted to achieve the desired distribution.
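For reference, the general Gompertz function is f(t) = a · exp(-b · exp(-c · t)), where a is the upper asymptote, b shifts the curve along the t axis, and c sets the growth rate. The sketch below only shows the shape of the curve. The values of a, b, and c are placeholders picked for illustration (a = 30 to match the maximum score); the program's tuned values and the exact way it combines SNP percentage and SNP count are not given here.

```python
import math


def gompertz(t, a, b, c):
    """General Gompertz function: a * exp(-b * exp(-c * t))."""
    return a * math.exp(-b * math.exp(-c * t))


# Placeholder parameters, NOT the program's values.
A, B, C = 30.0, 5.0, 0.1

for t in (0, 10, 25, 50, 100):
    print(t, round(gompertz(t, A, B, C), 2))
```

The curve starts near zero, rises steeply, and then flattens as it approaches a, which is the kind of saturating behavior a coverage-based score needs.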
Coverage Score Distribution • Chart: score vs read coverage for different SNP percentages • Higher-percentage SNPs are more reliable at low coverage
Coverage Score Distribution • Chart: score vs SNP % for different levels of coverage • Low coverage will limit the score even if a SNP occurs in a high percentage of reads
Read Balance Score • If the numbers of forward and reverse reads are within 1 of each other, then the score is 1. • If not, the score is calculated according to a formula in which #F = number of forward reads and C = coverage
Read Balance Score • Sequence data with reads in both directions is more reliable because the base quality is averaged out between the high-quality 5’ end and the low-quality 3’ end. • A score of 1 means there is no penalty. A score below 1 reduces the overall score below the Coverage Score.
Read Balance Score Distribution • Chart: levels of coverage vs percent of reads in the forward direction • Chart: percent of reads in one direction vs coverage • Lower coverage results in a higher penalty because the balance is more likely to be random
Allele Balance • The Allele Balance score penalizes SNPs that occur at different frequencies in the forward and reverse directions because they are more likely to be sequencing or alignment errors. • The score is based on Yates's chi-square test (a chi-square test with a continuity correction), which is less likely than an uncorrected chi-square test to reject the null hypothesis when there is little data (low coverage in this case).
Allele Balance • First a variable is calculated: • W = |(#F SNP) × (#R non-SNP) − (#R SNP) × (#F non-SNP)| − C/2 • If W is negative, then the score is 1. • Otherwise, the score is calculated according to an equation in which #F = number of forward reads, #R = number of reverse reads, and C = coverage
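W has the same form as the corrected numerator of the standard Yates chi-square statistic for a 2×2 table (forward/reverse by SNP/non-SNP). The sketch below computes W as defined on the slide and, for comparison, the textbook Yates statistic. It assumes that coverage C is the sum of the four read counts, and it does not reproduce the program's own score equation, which is not shown here. All function names are illustrative.

```python
def allele_balance_w(f_snp, f_non, r_snp, r_non):
    """W = |F_snp * R_non - R_snp * F_non| - C / 2.

    Assumption: C (coverage) is the total of the four read counts.
    """
    coverage = f_snp + f_non + r_snp + r_non
    return abs(f_snp * r_non - r_snp * f_non) - coverage / 2


def yates_chi_square(f_snp, f_non, r_snp, r_non):
    """Textbook Yates-corrected chi-square statistic for a 2x2 table."""
    n = f_snp + f_non + r_snp + r_non
    denominator = ((f_snp + f_non) * (r_snp + r_non)
                   * (f_snp + r_snp) * (f_non + r_non))
    if denominator == 0:
        return 0.0  # an empty row or column carries no evidence of imbalance
    # A negative W is treated as zero, matching "score is 1" on the slide.
    w = max(allele_balance_w(f_snp, f_non, r_snp, r_non), 0.0)
    return n * w ** 2 / denominator


# Balanced: 50 SNP reads out of 100 in each direction, so W is negative.
w_balanced = allele_balance_w(50, 50, 50, 50)

# Imbalanced: 60% SNP forward vs 30% SNP reverse gives a positive W.
w_skewed = allele_balance_w(60, 40, 30, 70)
chi2_skewed = yates_chi_square(60, 40, 30, 70)
```

The C/2 term is the continuity correction: with few reads it can push W below zero, which is why small imbalances at low coverage are not penalized.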
Allele Balance Distribution • Vary imbalance: score vs number of forward SNPs (100 reads in each direction, 50% SNPs) • Vary coverage: score vs coverage (balanced reads, 2:1 SNP balance, 30% SNPs) • Vary SNP percentage: score vs percent of reads with a SNP allele (300 reads in each direction, 2:1 SNP balance)
Homopolymer Score • The homopolymer score penalizes indels in homopolymer regions when analyzing Roche pyrosequencing data because they are usually sequencing errors. • The penalty is higher for longer homopolymer regions because errors are more likely there.
Homopolymer Score • The program first determines which homopolymer length is present more often (A) and which is present less often (B) • If A or B is not ≥ 3, then the score is 1 • Otherwise the score is calculated according to a formula • Example: a deletion from 4 bases to 3 bases that occurs less than half of the time gives A = 4, B = 3, Score = 0.5
Mismatch Score • Several SNPs occurring very close together are usually the result of an alignment error. This score penalizes a SNP if there are other SNPs nearby. • The program first looks for SNPs that occur in a minimum percentage of reads in the 10 bp on either side of the SNP being scored. The number of such SNPs is used to calculate the score. • If the number of nearby SNPs is less than 3, there is no penalty.
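The counting step can be sketched as follows. This covers only the neighbor count and the no-penalty rule, not the penalty formula itself. The input layout (a dict of position to fraction of reads carrying the SNP) and the `min_fraction` default of 0.2 are assumptions made for this example; the program's actual minimum percentage is a setting that is not stated here.

```python
def count_nearby_snps(position, snp_fractions, window=10, min_fraction=0.2):
    """Count other SNPs within `window` bp on either side of `position`.

    snp_fractions maps a position to the fraction of reads carrying a SNP
    there. Only SNPs seen in at least `min_fraction` of reads are counted
    (hypothetical threshold).
    """
    return sum(
        1
        for pos, fraction in snp_fractions.items()
        if pos != position
        and abs(pos - position) <= window
        and fraction >= min_fraction
    )


def mismatch_penalty_applies(nearby_snps, threshold=3):
    """Fewer than 3 nearby SNPs means no penalty (score = 1)."""
    return nearby_snps >= threshold


snps = {100: 0.50, 95: 0.30, 108: 0.25, 111: 0.90, 103: 0.05}
nearby = count_nearby_snps(100, snps)
penalized = mismatch_penalty_applies(nearby)
```

In this example the SNP at 111 is outside the 10 bp window and the SNP at 103 is below the minimum fraction, so only two neighbors count and no penalty applies.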
Mismatch Score Distribution • After the number of nearby SNPs is determined the score is calculated according to the formula: • This results in the following distribution: