Sequence Analysis

Databases now hold millions of protein and nucleotide sequence entries. How can this be converted into useful information?
Sequence analysis explores a newly determined sequence: identifying genes or regulatory sequences, finding homologous sequences, comparing with databases, and determining function and evolutionary history.

Sequence Change
Even homologous sequences differ, because of mutation and selection through time (evolution).
Genes can also differ because of duplications (paralogs vs. orthologs) and pseudogenes.
These changes can “mask” underlying sequence similarity.

Sequence Alignment
Sequence alignment seeks to line up (align) homologous residues: residues that are descendants of a common ancestral residue.
Similarity or identity (a match) does not equal homology.
Compare ACGCTGA and ACTGT:
    ACGCTGA    ACGCTGA
    A--CTGT    ACTGT--

Sequence Alignment
Sequence similarity can result from random chance, convergent evolution, or a shared evolutionary origin (homology).
Homologous sequences are likely to have similar functions.

Sequence Alignment
Alignments can be given scores, e.g. -1 for each substitution, -5 for an indel, +3 for a match.
The example alignments are then scored 9, 5, 4 and 4.
The overall score can then be used to determine the “best” alignment.

Sequence Alignment - score
Align AGCGTAT and ACGGTAT:
    AGCGTAT    AGC-GTAT    AGCG-TAT
    |••||||    |-|-||||    |-||-|||
    ACGGTAT    A-CGGTAT    A-CGGTAT
Which alignment is “best” depends on the gap penalty (here with +3 per match and -1 per substitution).
Gap penalty -5: the second and third alignments both score (6x3)-(2x5) = 8.
Gap penalty -1: the second and third alignments both score (6x3)-(2x1) = 16.
Either gap penalty: the first alignment scores (5x3)-(2x1) = 13.
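
The arithmetic on this slide can be checked with a short Python sketch. It assumes the scheme quoted earlier in the deck (+3 match, -1 substitution, a per-column gap penalty); the function name `score_alignment` is illustrative and not taken from any library.

```python
def score_alignment(top, bottom, match=3, mismatch=-1, gap=-5):
    """Score two already-aligned strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # one penalty per gap column
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The three candidate alignments from the slide.
alignments = [
    ("AGCGTAT", "ACGGTAT"),
    ("AGC-GTAT", "A-CGGTAT"),
    ("AGCG-TAT", "A-CGGTAT"),
]

for gap in (-5, -1):
    scores = [score_alignment(t, b, gap=gap) for t, b in alignments]
    print("gap penalty", gap, "->", scores)
```

With a gap penalty of -5 the ungapped alignment wins (13 vs. 8); with -1 the gapped alignments win (16 vs. 13), which is the point the slide makes.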

Sequence Alignment - gaps
Align THISSEQUENCE and THATSEQUENCE:
    THISSEQUENCE
    ||••||||||||
    THATSEQUENCE
More divergent sequences are more difficult to compare, e.g. THATSEQUENCE and THISISASEQUENCE:
    THISISA-SEQUENCE
    TH----ATSEQUENCE
An alignment is a hypothesis about which residues evolved from the same ancestral residue (homology).

Sequence Alignment = hypothesis
    THISISA-SEQUENCE
    TH----ATSEQUENCE
An alignment is a hypothesis about which residues evolved from the same ancestral residue.
Comparisons need to take evolutionary processes into account: the types of mutations (transitions vs. transversions), differences in the physicochemical properties of amino acids, and the role of residues in protein structure and function.
Alignment scoring schemes can take none, some, or all of this into consideration.

Amino acid vs nucleotide
Amino acid sequences are usually easier to align than nucleotide sequences.
The 4-letter nucleotide code carries less information per position than the 20-letter amino acid code, so a match by chance is more probable in DNA than in protein.
Amino acid alignments can also take similarity into consideration (lysine/arginine vs. lysine/glutamate).
The genetic code is redundant: nucleotide sequences can change through time without altering the amino acid sequence, and it is the amino acid sequence that determines the structure and function of the protein.
In some cases only nucleotide sequences can be compared (gene identification, regulatory DNA, etc.).

Sequence Alignment - Score
Align AGCGTAT and ACGGTAT:
    AGCGTAT    AGC-GTAT    AGCG-TAT
    |••||||    |-|-||||    |-||-|||
    ACGGTAT    A-CGGTAT    A-CGGTAT
Which alignment is “best” depends on the gap penalty.
Gap penalty -5: the second and third alignments both score 8.
Gap penalty -1: the second and third alignments both score 16.
Either gap penalty: the first alignment scores 13.
The best-scoring alignment is the optimal alignment; the others are suboptimal.
The assumption is that an alignment of related sequences will score better than an alignment of random sequences.
No algorithm yet incorporates complete evolutionary theory, but many yield reasonable results.

Quantifying similarity
The simplest way of quantifying similarity between two sequences is percent identity, an objective measure of actual identity.
    THISISA-SEQUENCE
    TH----ATSEQUENCE
    11/16 = 68.75%
Even unrelated sequences will show some sequence identity (less for amino acids than for nucleotides), but this decreases with the amount of sequence compared.
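
As a minimal sketch of the calculation above, percent identity can be computed from two aligned strings. The convention assumed here (identical columns divided by total alignment length, gap columns included in the denominator) is the one the slide's 11/16 implies; other conventions exist. The name `percent_identity` is illustrative.

```python
def percent_identity(top, bottom):
    """Identical columns as a percentage of the full alignment length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    # A column counts only if both residues are the same and neither is a gap.
    identical = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * identical / len(top)


print(percent_identity("THISISA-SEQUENCE", "TH----ATSEQUENCE"))  # 68.75
```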

Similarity
Percent identity is relatively crude: genuine matches do not have to be completely identical, and homologous amino acids are often similar rather than identical.

Percent Similarity
    THISISA-SEQUENCE
    TH----ATSEQUENCE
An alternative alignment pairs similar as well as identical residues:
    THISISASEQUENCE
    ||     ||||||||
    THAT---SEQUENCE
Isoleucine and alanine (hydrophobic) are similar amino acids, as are serine and threonine (polar).
Not all similar amino acids are equally likely to occur.
Other factors, such as cysteine residues in disulfide bridges and tryptophan in hydrophobic structures, can also be factored in; summing all values gives an overall alignment score.
Scores are not necessarily simple to interpret and will change with length.

Minimum percent identity
A comparison of more than 1 million protein sequences with structural information suggests that 90% of sequence pairs with identity of 30% or greater over their entire length were structurally similar proteins.
Below 25% identity, only 10% of pairs represented structural similarity.
25-30% identity is the twilight zone; even lower sequence identity (<20%) is the midnight zone.
There are many different ways to score alignments, some more common in some applications than in others.
All of them must score both the degree of relatedness between residues (from a presumed common ancestor) and the validity of gaps.

Dot-plot
    THISISA-SEQUENCE
    TH----ATSEQUENCE
Dot-plots give a visual assessment of similarity based on identity, for either amino acid or nucleotide sequences.
One sequence, X, is written out horizontally and the second, Y, vertically.
Each residue in a row is compared with each residue in a column, and a dot is placed where the residues are identical.

Dot-plot
Dots are placed where residues are identical.
Here red dots indicate identical residues, and breaks represent points where gaps are needed.
Pink dots indicate residues that are also present elsewhere in the sequence.

Dot-plot
Dot-plots can suffer from noise caused by regions of similarity arising by chance.
“Filters” are often used to remove this: overlapping, fixed-length windows (e.g. 10 amino acids) must reach some minimum identity score before a dot is assigned.
On the left is a window of length 1 aa; on the right, a window of length 10 with a minimum identity score of 3.
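
A windowed dot-plot of the kind described can be sketched in a few lines of Python. This is an illustrative text-mode version, assuming the dot is recorded at the first position of each window; plotting tools differ on that detail, and the function names `dot_plot` and `render` are hypothetical.

```python
def dot_plot(x, y, window=1, min_score=1):
    """Return the set of (row, col) dots; rows index y, columns index x."""
    dots = set()
    for i in range(len(y) - window + 1):
        for j in range(len(x) - window + 1):
            # Count identical residues along the diagonal inside the window.
            identical = sum(1 for k in range(window) if y[i + k] == x[j + k])
            if identical >= min_score:
                dots.add((i, j))
    return dots


def render(x, y, dots):
    """Draw the plot as text: x across the top, y down the side."""
    lines = ["  " + x]
    for i, residue in enumerate(y):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


x, y = "THISISASEQUENCE", "THATSEQUENCE"
print(render(x, y, dot_plot(x, y)))                         # unfiltered
print()
print(render(x, y, dot_plot(x, y, window=3, min_score=3)))  # filtered
```

In the unfiltered plot, scattered dots appear wherever a single letter happens to match; with a window of 3 and a minimum score of 3, only the diagonal run for SEQUENCE remains.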

Dot-plot
Windows can be set for different applications: exon size in DNA, repeat motifs in amino acid sequences, or the length of secondary structure elements.
Scoring can also be more subtle than 0/1 identity scores, depending on the types of residues compared.
Here the BRCA2 sequence is compared to itself to show internal repeats within the BRCA2 protein: left, unfiltered; right, filtered with a window of 30 and a minimum score of 5.

Scoring alignments
Alignments can be given scores (e.g. to compare two possible alignments) by different means.
Substitution matrices can be used to assign individual scores to aligned sequence positions.
Many different matrices exist, and each assigns its own values to all possible pairs of residues.
Matrices can be based on theoretical considerations, but the most successful are based on empirical data gathered by comparing known homologous sequences.

Scoring matrices
No one matrix is best for all applications; the choice depends on the time (evolutionary distance) separating the sequences and the type of protein.
Most scoring schemes construct a 20x20 substitution matrix. Each cell represents the likelihood that that particular pair of amino acids will occupy the same position through time.
Here color reflects similar physicochemical properties.
Figures: BLOSUM-62 and PAM-120 matrices.

Scoring matrices
    SEQ1 :  T  H  I  S  S  E  Q  U  E  N  C  E
    SEQ2 :  T  H  A  T  S  E  Q  U  E  N  C  E
    SCORE:  5  8 -1  1  4  5  5  0  5  6  9  5
“U” represents an unknown residue.
The overall score, S, for the alignment is the sum of the position scores: S = 52.
Figure: PAM-120 matrix.
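
The sum on this slide can be reproduced with a small lookup table. In the sketch below, the dictionary holds only the pair scores that appear in the SCORE row above; it is not a complete substitution matrix, and the names `SLIDE_SCORES` and `alignment_score` are illustrative.

```python
# Pair scores copied from the slide's SCORE row (not a full 20x20 matrix).
SLIDE_SCORES = {
    ("T", "T"): 5, ("H", "H"): 8, ("I", "A"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("U", "U"): 0,
    ("N", "N"): 6, ("C", "C"): 9,
}


def alignment_score(seq1, seq2, scores):
    """Sum the substitution score of every aligned pair (ungapped)."""
    total = 0
    for a, b in zip(seq1, seq2):
        # Substitution matrices are symmetric, so accept either order.
        total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


print(alignment_score("THISSEQUENCE", "THATSEQUENCE", SLIDE_SCORES))  # 52
```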

Scoring matrices
Different matrices are based on different sets of observed amino acid substitution frequencies.
The first set was constructed by Margaret Dayhoff and co-workers in the 1960s and 1970s.
The original comparisons used very similar sequences so that the alignments would be unambiguous.
Figure: PAM-120 matrix.

PAM matrices
PAM stands for Point Accepted Mutation; PAM units are accepted point mutations per 100 residues.
The matrix shown (PAM-120) is a PAM matrix.
PAM250 means that on average 250 mutations have been fixed per 100 residues, so many residues have undergone more than one mutation; it is suited to distant relationships.

BLOSUM matrices
BLOSUM (BLOcks SUbstitution Matrix) matrices were developed in the 1990s using local multiple alignments rather than global alignments: a large set of aligned, highly conserved, short regions derived from analysis of the protein-sequence database SWISS-PROT.
Figure: BLOSUM-62 matrix.

BLOSUM matrices
The matrix was calculated from changes between clustered groups of closely related proteins, without the use of phylogenetic trees.
Different matrices use different percentage-identity cut-offs for clustering; BLOSUM-62 was derived using a threshold of 62%.
Figure: BLOSUM-62 matrix.

Choice of Matrices
Which matrix to use depends on the question being asked.
Within PAM matrices, the number represents evolutionary distance: the larger the number, the greater the distance.
Within BLOSUM matrices, the number represents percentage identity: the larger the number, the greater the similarity.
When aligning distantly related sequences, PAM250 or BLOSUM-50 may be preferable; PAM120 and BLOSUM-80 suit more closely related sequences.

Choice of Matrices
Some matrices also incorporate additional information: the STR matrix includes information about protein structure and can be used with very distantly related sequences.
Other matrices are specific to particular types of proteins: SLIM (ScoreMatrix Leading to Intra-Membrane) and PHAT (Predicted Hydrophobic and Transmembrane matrix) are designed from and for membrane proteins (not soluble proteins).
As of 2006, GenomeNet listed 94 matrices.

Inserting Gaps
Homologous sequences are often of different lengths because of indels, so alignment requires gaps.
Adding a gap decreases the alignment score by a “gap penalty”.
Indels rarely occur in structures of functional importance, are more likely at the ends, and generally span more than a single residue, so the gap extension penalty is smaller than the gap opening penalty.
The best alignment is generally the one that returns the maximum score for the smallest number of introduced gaps.
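
The distinction between opening and extending a gap can be sketched as follows. The penalty values (-5 to open, -1 to extend) and the +3/-1 match/substitution scores are illustrative assumptions carried over from the earlier toy scheme, and `affine_score` is a hypothetical name.

```python
def affine_score(top, bottom, match=3, mismatch=-1,
                 gap_open=-5, gap_extend=-1):
    """Score an existing alignment with separate open/extend gap penalties."""
    total = 0
    prev_gap = None  # which sequence carried a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            # Continuing a gap in the same sequence is cheaper than opening one.
            total += gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total


top, bottom = "THISISA-SEQUENCE", "TH----ATSEQUENCE"
print(affine_score(top, bottom))                  # open -5, extend -1
print(affine_score(top, bottom, gap_extend=-5))   # every gap column costs -5
```

Under the affine scheme the four-residue gap costs -8 rather than -20, so the alignment scores 20 instead of 8. Long indels are penalized less severely than the same number of scattered single gaps.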

Inserting Gaps
Alignment programs generally allow the user to vary the gap penalty.
If the penalty is set high, few gaps will be introduced, which is good for closely related sequences.
If the penalty is set low, more gaps are introduced, which is good for distantly related sequences.
The most appropriate gap penalty may also vary depending on the substitution matrix being used.
The gap score can also vary with the type of residue; some amino acids very rarely change (e.g. tryptophan).

Gap Penalties
Alignments of two distantly related proteins: phosphatidylinositol-3-OH kinase and a protein kinase.

Gap Penalties
In the first alignment the gap penalty is set high; in the second it is set low. In both, end gaps are not penalized.
With this small amount of identity, expert knowledge of protein structure and function can be helpful.

Local and Global Alignments
Global alignment: alignment of entire sequences, generally possible with closely related sequences.
Local alignment: alignment of parts, or domains, of a sequence; possible with more distantly related sequences, or with sequences in which different regions have different evolutionary histories (multi-domain proteins).
In some cases, local alignments can be used as a first step toward a global alignment.
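
The slides do not name an algorithm for either kind of alignment. The sketch below assumes the standard dynamic-programming formulations (Needleman-Wunsch for global, Smith-Waterman for local) with a simple linear gap penalty and the toy +3/-1/-5 scores used earlier. It returns only the optimal score, not the alignment itself, and the global variant penalizes end gaps.

```python
def align_score(x, y, local=False, match=3, mismatch=-1, gap=-5):
    """Optimal alignment score by dynamic programming (score only)."""
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are paid for along the first row and column.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)    # local: an alignment may restart anywhere
                best = max(best, cell)
            H[i][j] = cell
    # Global score sits in the last cell; local score is the best cell anywhere.
    return best if local else H[-1][-1]


print(align_score("AGCGTAT", "ACGGTAT"))                          # global
print(align_score("THISISASEQUENCE", "THATSEQUENCE", local=True)) # local
```

For the second pair, the local score comes from the shared SEQUENCE region alone, which illustrates how a local alignment can pick out a conserved domain without being dragged down by the divergent flanks.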

Local and Global Alignments
Local vs. global alignment of bovine PI3-kinase p100 and the cAMP-dependent kinase.
The two share structural homology in the catalytic domain but very little sequence similarity.
Note that the global alignment fails to identify the homologous region.

Pairwise and Multiple Alignments
Alignments can be made between two sequences (pairwise) or among many sequences (multiple alignments).
Multiple alignments can resolve ambiguities in alignments and illustrate sequence conservation over evolutionary time.
They also generally require more computing power and more sophisticated algorithms.

Databases
Alignments can be used to locate and identify a gene in a new genome, identify the possible function of a new sequence or novel gene, or find a given gene in a specific taxon.
Searches have to be sensitive enough to detect distant similarities (avoiding false negatives) and specific enough to reject unrelated sequences (avoiding false positives).
Verification of the homology of identified matches is generally required.

BLAST Search
Database searching is essentially the same as pairwise alignment.
BLAST, the Basic Local Alignment Search Tool, is software for searching databases of molecular sequences for regions of similarity to a query sequence.
BLAST searches for regions of local alignment: isolated regions in sequence pairs that have high levels of similarity.
The BLAST report ranks “hits” in order of statistical significance using an E-value.
E-values are not the same as p-values, but approximate them when small.

BLAST
BLAST, the Basic Local Alignment Search Tool, is widely used to find core similarity using a window of preset size (a “word”) and a certain minimum density of matches (DNA) or amino acid similarity score (protein).
blastp: compares an amino acid query with a protein sequence database
blastn: compares a nucleotide query with a nucleic acid sequence database
blastx: compares all six-frame translations of a nucleotide query with a protein database
tblastn: compares a protein query with a translated nucleic acid sequence database
tblastx: compares all six-frame translations of a nucleotide query with all six-frame translations of a nucleotide database
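
The “word” idea can be illustrated with a deliberately simplified sketch: index every fixed-length word of the query, then scan the subject for exact occurrences. This is not BLAST's actual implementation, which also considers similar (not only identical) words for proteins and then extends each hit into a local alignment. The word size of 3 and the name `word_hits` are illustrative choices.

```python
def word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    # Index every word of the query by its start position.
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


for q_pos, s_pos in word_hits("THISISASEQUENCE", "THATSEQUENCE"):
    print(q_pos, s_pos, "THISISASEQUENCE"[q_pos:q_pos + 3])
```

Hits that fall on the same diagonal (constant difference between the two positions) are the natural seeds to extend into a longer local alignment.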

Scoring a Search
Amino acid searches are easier than nucleotide searches, but the data are often nucleotide.
The quality of results depends on appropriate algorithms and well-maintained databases.
Alignments are given scores; the aim is to find alignments with a higher score than would be expected from a random match.
One can estimate how often random sequences would align with a score ≥ S; this expectation is the E-value.
The E-value is the number of alignments with a score of at least S that would be expected by chance alone when searching a complete dataset of n sequences; it ranges from 0 to n.
An E-value of 3 means that 3 such chance matches would be expected; an E-value of 10^-29 means that essentially none would be.
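
The slides give no formula for the E-value. One commonly quoted relation from Karlin-Altschul statistics, assumed here rather than taken from the deck, is E = m * n * 2^(-S'), where S' is the normalized (bit) score, m the query length and n the total length of the database in residues (not the number of sequences). The chance of at least one such hit is then P = 1 - exp(-E), which is why E and P nearly coincide when E is small. The function names and the numbers in the example are hypothetical.

```python
import math


def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score` bits."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e):
    """Probability of at least one chance hit, given an E-value."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for tiny e


# Hypothetical search: a 250-residue query against 100 million residues.
for bits in (30, 50, 100):
    e = e_value(bits, query_len=250, db_len=100_000_000)
    print(bits, e, p_value(e))

# E and P diverge once E is no longer small.
print(p_value(3.0))
```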

Low-Complexity
Low-complexity regions in a protein (e.g. simple repeats or biased amino acid composition) will lead to false matches between proteins or domains that are unlikely to be homologous.
These regions are generally removed from a search.

BLAST search
Figure: BLASTn of zebrafish TpiB.

BLAST search
Figure: BLASTp of zebrafish TPIB.
