Basic terms:
• Similarity: a measurable quantity.
• For proteins, similarity is assessed using the concept of conservative substitutions.
• Identity: measured as a percentage.
• Homology: a specific term indicating a relationship by evolution.
Basic terms:
• Orthologs: homologous sequences found in two or more species that have the same function (e.g., alpha-hemoglobin).
• Paralogs: homologous sequences found in the same species that arose by gene duplication (e.g., alpha- and beta-hemoglobin).
Pairwise comparison
• Dotplot
• All-against-all comparison.
• Every position in one sequence is compared with every position in the other.
• Nucleic acids and proteins have polarity.
• Typically only one direction makes biological sense:
• 5’ to 3’, or amino terminus to carboxyl terminus.
Simple plot
• Window: size of the sequence block used for comparison. In the previous example, window = 1.
• Stringency: number of matches within the window required to score positive. In the previous example, stringency = 1 (an exact match was required).
DotPlot
• WINDOW = 4; STRINGENCY = 2
• Sequence: GATCGTACCATGGAATCGTCCAGATCA
• GATC + (4/4)
• GATC - (0/4)
• GATC - (0/4)
• GATC + (2/4)
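The window and stringency rule in this example can be sketched in a few lines of Python. This is only an illustration: the function name `window_matches` is hypothetical, and the offsets it prints are computed by the code rather than read from the slide, whose layout did not survive.

```python
def window_matches(word: str, sequence: str) -> list[int]:
    """Count identical positions between `word` and every same-length window of `sequence`."""
    w = len(word)
    return [
        sum(a == b for a, b in zip(word, sequence[i:i + w]))
        for i in range(len(sequence) - w + 1)
    ]


SEQUENCE = "GATCGTACCATGGAATCGTCCAGATCA"  # sequence from the slide
WORD = "GATC"                             # window = 4
STRINGENCY = 2                            # matches required to score positive

counts = window_matches(WORD, SEQUENCE)
for offset, n in enumerate(counts):
    mark = "+" if n >= STRINGENCY else "-"
    print(f"{offset:2d} {SEQUENCE[offset:offset + len(WORD)]} {mark} ({n}/{len(WORD)})")
```

Each printed line is one register of the comparison; a "+" would become a dot in the plot.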
Dot Plot
• Compare two sequences in every register.
• Vary the window size and stringency depending on the sequences being compared.
• For nucleotide sequences, typically start with window = 21, stringency = 14.
• For proteins, start with a smaller window: window = 3, stringency = 1 or 2.
• It is important to test different stringencies.
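A minimal sketch of a two-sequence dot plot with adjustable window and stringency, assuming plain identity matching (no substitution matrix). The names `dotplot` and `render` and the two short test sequences are made up for illustration.

```python
def dotplot(seq_x: str, seq_y: str, window: int = 1, stringency: int = 1) -> list[list[bool]]:
    """grid[i][j] is True when the windows starting at seq_y[i] and seq_x[j]
    share at least `stringency` identical positions."""
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append(matches >= stringency)
        grid.append(row)
    return grid


def render(grid: list[list[bool]]) -> str:
    """Draw the grid as text: '*' for a positive window, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Two hypothetical sequences that differ at a single position.
grid = dotplot("GATCGTACC", "GATCTTACC", window=3, stringency=2)
print(render(grid))
```

Rerunning with a different `stringency` shows how the diagonal sharpens or breaks up, which is the reason for testing several settings.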
Intergenic comparison
• The nucleotide sequence contains three domains.
• 50 - 350: strong conservation
• An indel places the comparison out of register
• 450 - 1300: slightly weaker conservation
• 1300 - 2400: strong conservation
Scoring Alignments
• Quality Score:
• Score x for a match, -y for a mismatch.
• Penalty for:
• creating a gap
• extending a gap
Scoring Alignments
• Quality Score:
• Quality = [10 × (matches)] + [-1 × (mismatches)] - [(Gap Creation Penalty × number of gaps) + (Gap Extension Penalty × total length of gaps)]
• The scoring scheme incorporates an evolutionary model:
• Matches are conserved.
• Mismatches are divergences.
• Gaps are more likely to disrupt function, hence a greater penalty than for a mismatch.
• Introduction of a gap (indel) is penalized more than extension of a gap.
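The Quality formula can be evaluated directly on an already-aligned pair of sequences. The sketch below assumes gaps are written as "-" and that a run of consecutive gap characters in one sequence counts as a single gap. The slide gives no numeric gap penalties, so the values in the example (5 and 1) are placeholders.

```python
def quality_score(aligned_a: str, aligned_b: str,
                  gap_creation: float, gap_extension: float,
                  match: float = 10, mismatch: float = -1) -> float:
    """Quality = match*matches + mismatch*mismatches
                 - (gap_creation*number_of_gaps + gap_extension*total_gap_length)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = gap_length = 0
    prev_gap_in = None  # which sequence held the gap in the previous column, if any
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        gap_in = "a" if a == "-" else "b" if b == "-" else None
        if gap_in is None:
            if a == b:
                matches += 1
            else:
                mismatches += 1
        else:
            gap_length += 1
            if gap_in != prev_gap_in:  # a new gap starts here
                gaps += 1
        prev_gap_in = gap_in
    return (match * matches + mismatch * mismatches
            - (gap_creation * gaps + gap_extension * gap_length))


# Placeholder penalties: 4 matches, 1 mismatch, one gap of length 2.
print(quality_score("GATTACA", "GA--ACT", gap_creation=5, gap_extension=1))
```

With these placeholder values the example works out to 40 - 1 - (5 + 2) = 32.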
Z Score (standardized score)
• $Z = (\text{Score}_{\text{alignment}} - \text{Average Score}_{\text{random}}) / \text{Standard Deviation}_{\text{random}}$
Quality Score: Randomization
• The program takes a sequence and randomizes it X times (user-selected).
• It determines the average quality score and standard deviation for the randomized sequences.
• Compare the randomized scores with the Quality score to help determine whether the alignment is potentially significant.
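The randomization test and the Z score can be sketched as follows. To keep the example short it uses a simple ungapped position-by-position score (`ungapped_score`, a stand-in, not the program's actual alignment score). All function names and the seed are assumptions.

```python
import random
import statistics


def ungapped_score(a: str, b: str, match: int = 10, mismatch: int = -1) -> int:
    """Stand-in score: compare position by position, with no gaps."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def randomized_scores(seq_a: str, seq_b: str, score_fn, n: int = 100, seed: int = 0) -> list:
    """Shuffle seq_b n times and score each shuffled copy against seq_a."""
    rng = random.Random(seed)
    letters = list(seq_b)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(score_fn(seq_a, "".join(letters)))
    return scores


def z_score(observed: float, random_scores: list) -> float:
    """Z = (observed - mean of random scores) / standard deviation of random scores.
    Needs at least two random scores, and they must not all be equal."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    return (observed - mean) / sd


seq_a = "GATCGTACCATGGAATCGTC"
seq_b = "GATCGTACCTTGGAATCGTC"
observed = ungapped_score(seq_a, seq_b)
background = randomized_scores(seq_a, seq_b, ungapped_score, n=200)
print(observed, round(z_score(observed, background), 2))
```

A large positive Z means the real alignment scores far above what shuffled sequences of the same composition produce.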
Randomization
• It has become clear that sequences appear to evolve in a “word”-like fashion.
• Analogy: the 26 letters of the alphabet are combined to make words, and the words are what actually communicate information.
• Randomization should therefore occur at the level of short strings of nucleotides (2-4), not single nucleotides.
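One way to randomize at the "word" level is to cut the sequence into fixed-size blocks and shuffle the blocks instead of the individual letters. This is a sketch of that idea only. The function name `shuffle_words` and the fixed, non-overlapping block boundaries are assumptions, and real programs may preserve word composition differently.

```python
import random


def shuffle_words(sequence: str, word_size: int = 3, seed=None) -> str:
    """Shuffle non-overlapping blocks of `word_size` letters, keeping each block intact.
    The last block may be shorter if the length is not a multiple of word_size."""
    rng = random.Random(seed)
    words = [sequence[i:i + word_size] for i in range(0, len(sequence), word_size)]
    rng.shuffle(words)
    return "".join(words)


original = "GATCGTACCATGGAATCGTCCAGATCA"
print(shuffle_words(original, word_size=3, seed=1))
```

The shuffled sequence keeps the same letters and the same 3-letter blocks as the original, only in a different order.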
Global Alignment
• Global: compares all possible alignments of two sequences and presents the one with the greatest number of matches and the fewest gaps.
• The alignment will “run” from one end of the longer sequence to the other end.
• Best for closely related sequences.
• Can miss short regions of strongly conserved sequence.
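The slides do not name an algorithm. The standard dynamic-programming method for global alignment is Needleman-Wunsch, sketched below under simplifying assumptions: a single linear gap penalty instead of separate creation and extension penalties, and default scores chosen only for illustration.

```python
def global_align(a: str, b: str, match: int = 10, mismatch: int = -1, gap: int = -5):
    """Needleman-Wunsch with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("ACGT", "AGT", match=1, mismatch=-1, gap=-2))
```

Because the traceback always starts at the end of both sequences and runs to the start, the alignment spans both sequences in full, which is why short conserved regions inside otherwise unrelated sequences can be missed.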
Local Alignment
• Identifies the segments of alignment with the highest possible score.
• Aligns the sequences, then extends aligned regions in both directions until the score falls to zero.
• Best for comparing sequences whose relationship is unknown.
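A common exact method for local alignment is Smith-Waterman, which the slides do not name. It is sketched here with a linear gap penalty and illustrative scores. Its key difference from the global method is that cell scores are never allowed to drop below zero, and the traceback stops when a zero is reached.

```python
def local_align(a: str, b: str, match: int = 10, mismatch: int = -1, gap: int = -5):
    """Smith-Waterman with a linear gap penalty.
    Returns (best_score, aligned_segment_a, aligned_segment_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                            # start a fresh local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(local_align("TTTACGTTT", "GGACGGG", match=2, mismatch=-1, gap=-2))
```

Only the shared segment is reported, and the unrelated flanking residues are ignored.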
[Figure: Global Alignment compared with Local Alignment]
BLAST 2
• Basic Local Alignment Search Tool
• E (expect) value: number of hits expected by random chance in a database of the same size.
• Larger numerical value = lower significance.
• Example shown: HIV sequence
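The slide defines the E value only in words. For orientation, the Karlin-Altschul expression E = K·m·n·e^(-λS) is a widely used statistical form for ungapped local alignments, though the slide does not state it. The sketch below evaluates it, with m the query length, n the database length, and S the raw score. The values of `k` and `lam` here are placeholders, not the parameters BLAST uses for any particular scoring matrix.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 k: float = 0.1, lam: float = 0.3) -> float:
    """E = K * m * n * exp(-lambda * S).
    k and lam are placeholder values chosen only for illustration."""
    return k * query_len * db_len * math.exp(-lam * score)


for s in (30, 60, 90):
    print(s, expect_value(s, query_len=300, db_len=1_000_000))
```

The output falls as the score rises and grows with database size, matching the rule that a larger E value means lower significance.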
Both global and local alignment programs will (almost) always give a match.
• It is important to determine whether the match is biologically relevant.
• Not necessarily relevant: low-complexity regions.
• Sequence repeats (glutamine runs)
• Transmembrane regions (high in hydrophobic residues)
• If working with coding regions, you are typically better off comparing protein sequences: greater information content.
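Low-complexity regions such as glutamine runs can be flagged before alignment by measuring how varied the residues are in a sliding window. The sketch below uses Shannon entropy for this. It is a simple stand-in for dedicated filters, and the window size, the threshold, and the example protein are all made-up values.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits per residue; 0 for a run of a single residue."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in Counter(window).values())


def low_complexity_starts(sequence: str, window: int = 12, threshold: float = 1.0) -> list[int]:
    """Start positions of windows whose entropy falls below `threshold` (hypothetical cutoff)."""
    return [
        i for i in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[i:i + window]) < threshold
    ]


# Hypothetical protein with a 12-residue glutamine run in the middle.
protein = "MKVLAT" + "Q" * 12 + "GHWDER"
print(low_complexity_starts(protein))
```

Matches that fall entirely inside flagged windows deserve extra scrutiny before being treated as biologically relevant.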