Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences. If a related sequence has a known function, can you inherit its functional properties? If a related sequence has a known structure, can you model the unknown structure using the known one?
Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences
• If a related sequence has a known function, can you inherit its functional properties?
• If a related sequence has a known structure, can you model the unknown structure using the known one?
• Structural information can often provide additional clues as to the function
• What are the best methods to use?
• What thresholds should be used for safe inheritance of functional properties?
Homologues are related sequences:
• paralogs: genes a and b that arise from a duplication
• orthologs: the corresponding genes a and b in species 1 and species 2, separated by speciation
Protein Sequence and Structure Databases
• GenBank sequence database in the States has over 120 million sequences, some partial; more than a million are non-identical
• DNA Data Bank of Japan (DDBJ)
• UniProt (SWISS-PROT) database has more than a million non-identical sequences (validated gene sequences)
• Protein Data Bank (PDB in the States, PDBe in the UK) has >70,000 entries
Web-Based Public Resources Containing Functional Annotations
• Protein family and function databases: Pfam, InterPro, PROSITE, PRINTS, PANTHER, SMART, SCOP, CATH, HOMSTRAD
• Databases of biochemical pathways and biological function: KEGG, WIT, GO, FunCat, EC
• Databases of protein-ligand interactions: IntAct, MIPS, RELIBASE, BIND, DIP, iRefIndex
• Species databases: ENSEMBL, FlyBase, YPD, WORMDb, GenProtEC, EcoCyc
Evolution of Protein Sequences
• substitutions due to single base mutations
• insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops
• indels can make it harder to compare sequences: the equivalent regions have to be lined up, with gaps placed where there are indels
Human Hemoglobin: Alpha and Beta Chains (without gaps)
alpha VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
beta  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ

alpha KTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
beta  RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

alpha ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEF
beta  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

alpha TPAVHASLDKFLASVSTVLTSKYR
beta  FGKEFTPPVQAAYQKVVAGVANALAHKYH
Human Hemoglobin: Alpha and Beta Chains (with gaps)
alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
beta  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWT

alpha KTYFPHF-DLSH-----GSAQVKGHGKKVADALTNAVAHV
beta  QRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

alpha DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
beta  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

alpha LPAEFTPAVHASLDKFLASVSTVLTSKYR
beta  FGKEFTPPVQAAYQKVVAGVANALAHKYH
Percentage Sequence Identity = (number of identical residues / number of residues in smallest protein) x 100
For the globin example: without gaps ~9%, with gaps ~41%
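The identity formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two rows have already been lined up column by column and that gaps are written as "-"; the function name `percent_identity` and the toy rows are hypothetical, not part of the slides.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage identity of two pre-aligned rows.

    Identical residues are counted column by column; the denominator is the
    number of residues (gaps excluded) in the smaller of the two proteins.
    """
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    smallest = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return 100.0 * identical / smallest


# Toy rows (not the globin data): 4 identical columns, smaller protein has 5 residues.
print(percent_identity("ACDEFG", "AC-EYG"))  # 80.0
```

Passing the ungapped globin rows gives the "without gaps" figure, and passing the gapped rows gives the "with gaps" figure, under whatever alignment is supplied.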
Searching for Homologues with Related Functions
• How do you handle the evolutionary changes?
• How similar do the sequences need to be to inherit structural and functional properties?
• How do you cope with the volume of data, i.e. millions of sequences to search?
Searching Sequence Databases
• Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST
• Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
• Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, InterPro, Gene3D
• Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT
• Can you inherit functional information?
Dot Plots, Path Matrices, Score Matrices
Sequence A: V T R I V H V N S I L P S T N I L S V I L S T R
Sequence B: I V I L P E F S T
[Figure: dot plot of Sequence A against Sequence B]
• diagonal lines give equivalent residues
Sequence A: V T R I V H V N S I L P S T N I L S V I L S T R
Sequence B: I V I L P E F S T
• identical residues score 1
• highest scoring path across the matrix gives best alignment
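A dot plot with identity scoring can be sketched as below. This is only an illustration: the function names are hypothetical, the two sequences are as read from the slide, and the `min_len` filter anticipates the idea of keeping only runs (tuples) of a minimum length.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[list[int]]:
    """Rows follow seq_b, columns follow seq_a; identical residues score 1."""
    return [[1 if b == a else 0 for a in seq_a] for b in seq_b]


def diagonal_runs(matrix, min_len=3):
    """Return (row, col, length) for every diagonal run of 1s of at least min_len."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    runs = []
    for i in range(n_rows):
        for j in range(n_cols):
            if not matrix[i][j]:
                continue
            # Only start counting where the run begins, not in its middle.
            if i > 0 and j > 0 and matrix[i - 1][j - 1]:
                continue
            length = 0
            while (i + length < n_rows and j + length < n_cols
                   and matrix[i + length][j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


seq_a = "VTRIVHVNSILPSTNILSVILSTR"
seq_b = "IVILPEFST"
for row, col, length in diagonal_runs(dot_matrix(seq_a, seq_b)):
    print(seq_b[row:row + length], "at B", row, "/ A", col)
```

Each reported run is one diagonal line in the plot; joining runs on different diagonals is where gap penalties come in.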
Sequence A: V I L S L V I L P Q R S L V V I L S L V I L A L T V
Sequence B: S T V I L S L V R N V I L P Q R I L S L V I S L A L
[Figure: dot plot with diagonal runs labelled 3, 6, 6, 3, 3, 5, 6, 6 and 3]
• runs (tuples) of 3 residues
• gap penalty = 3
• SCORE = 20 - 9 = 11
Alignment from Dot Plot
VILSLV ILPQRSLVVILSLVI LALTV
STVILSLVNVILPQR ILSLVISLAL
score = 20
sequence identity = 20/26 = 77%
Dynamic Programming Methods
• Needleman & Wunsch: global alignment
• Smith & Waterman: local alignment
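The difference between global and local alignment can be seen in a small scoring sketch. It uses the modern row-by-row recurrence with a linear gap penalty, not the original formulations, and the scores (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def alignment_score(seq_a, seq_b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score; local=True clamps cells at 0 (Smith & Waterman style)."""
    floor = 0 if local else float("-inf")
    # First row: free start for local alignment, gap-penalised for global.
    prev = [0 if local else j * gap for j in range(len(seq_a) + 1)]
    best = 0
    for i, b in enumerate(seq_b, start=1):
        curr = [0 if local else i * gap]
        for j, a in enumerate(seq_a, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cell = max(floor, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    # Local: best cell anywhere. Global: bottom-right cell (end to end).
    return best if local else prev[-1]


print(alignment_score("AAILPGG", "CCILPTT", local=True))   # 3: only the shared ILP
print(alignment_score("AAILPGG", "CCILPTT", local=False))  # -1: flanks are penalised
```

The local score reports only the shared segment, while the global score also pays for the unrelated flanking residues.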
Significance of sequence similarity: length dependence
[Figure: sequence identity (%, 0 to 40) plotted against length (0 to 400), with the region of homologous pairs marked]
• protein pairs having > 150 residues are homologous if the sequence identity is > 25%
• short proteins/fragments of 20-40 residues: 30% sequence identity frequently occurs by chance
If proteins are homologous they are likely to have similar structures and functions.
Sequence identity between homologues required for inheriting structure or function:
• Modelling a structure based on the structure of a homologue: >= 30%
• Inheriting functional properties from a homologue: >= 60%
The structures of proteins in a family tend to be much more highly conserved during evolution than the sequences (and, in some families, than the function).
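The thresholds quoted on these slides can be written down as a small rule-of-thumb helper. This is a hypothetical sketch: the function name and returned keys are invented here, and pairs of 150 residues or fewer are simply reported as not decided by the rule, because the slides only give a homology rule for longer pairs.

```python
def annotation_transfer(identity_pct: float, length: int) -> dict:
    """Apply the slide thresholds to one pairwise comparison."""
    # Slide rule: pairs longer than 150 residues with identity above 25%.
    homologous = length > 150 and identity_pct > 25.0
    return {
        "homologous_by_rule": homologous,
        "model_structure": homologous and identity_pct >= 30.0,
        "inherit_function": homologous and identity_pct >= 60.0,
    }


print(annotation_transfer(identity_pct=35.0, length=300))
```

For a 300-residue pair at 35% identity the helper suggests structure modelling but not functional inheritance.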
Residue Substitution Matrices
A substitution matrix is a 20 x 20 matrix which scores each possible comparison of residues.
• Identity Matrix: simplest scoring scheme, in which amino acids are either identical (score 1) or non-identical (score 0)
• Physicochemical Properties Matrix: scores residue pairs according to similarities in their physico-chemical properties, e.g. val->leu scores well, val->arg scores low
• Evolutionary Matrices: score residue pairs according to how frequently the mutation is observed to occur in evolution, e.g. Dayhoff (PAM), BLOSUM matrices
Dayhoff Matrix (PAM or MDM)
• based on evolutionary relationships; it is derived by analysing the substitutions observed in closely related sequences (>80% identity)
• the method measures evolutionary distance by determining the number of point accepted mutations (PAM), where 1 PAM = a single point mutation every 100 residues
• for distant relatives in the twilight zone (<25% identity), generally use a PAM250 matrix
• for database searches, generally use PAM120
BLOSUM Substitution Matrices, Henikoff & Henikoff (1993)
• matrix is derived from analysing substitution patterns in more distant relatives (i.e. < 85% identity)
• for clusters of related sequences (e.g. 60% ID, 80% ID), derive multiple alignments without gaps for short regions of related sequences
• use the alignments to calculate residue substitution frequencies
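How substitution frequencies from an ungapped block turn into scores can be sketched as follows. This is a simplified illustration of the log-odds idea: it skips the clustering step entirely, scores only the residue pairs that actually occur in the block, and the function name and toy block are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_from_block(block):
    """Log-odds scores (in half-bits) from one ungapped block of aligned rows."""
    # Count every residue pair seen within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        expected = background[x] * background[y] * (1 if x == y else 2)
        scores[(x, y)] = round(2 * log2(freq / expected))
    return scores


print(log_odds_from_block(["AV", "AV", "AL"]))
```

Pairs seen more often than expected by chance get positive scores; a real matrix pools counts over many blocks and all 20 x 20 pairs.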
Which Matrix Should be Used?
• Matrices derived from observed substitution data (e.g. Dayhoff, BLOSUM) are better than the identity matrix or those based on physical properties
• various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms
• in database searching it may be better to use PAM120 or BLOSUM62
BLAST: Basic Local Alignment Search Tool, Altschul et al (1990)
• A highest scoring segment pair (HSP) is found between two sequences
• the sequences may be related if HSP score > cutoff
• matches significant ‘words’ or segments and then extends these matches using local dynamic programming
BLAST Step 1: match significant words
[Figure: query sequence of length L]
For each sequence find the ‘words’ with significant scores
BLAST Step 2: compare the word list to the database and identify exact matches
BLAST Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
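The three steps can be sketched in simplified form. This is not the BLAST implementation: step 1 here uses only the exact words of the query (no neighbourhood of high-scoring words), and step 3 is an ungapped extension with identity scores (+1/-1) and a drop-off cutoff instead of a PAM matrix. All function names and parameters are hypothetical.

```python
from collections import defaultdict


def query_words(query, w=3):
    """Step 1 (simplified): every length-w word of the query and its offsets."""
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    return words


def word_hits(words, subject, w=3):
    """Step 2: (query offset, subject offset) for each exact word match."""
    hits = []
    for s in range(len(subject) - w + 1):
        for q in words.get(subject[s:s + w], []):
            hits.append((q, s))
    return hits


def _walk(query, subject, q, s, step, x_drop):
    """Extend along the diagonal in one direction; return (best gain, its length)."""
    score = best = gained = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += 1 if query[q] == subject[s] else -1
        length += 1
        if score > best:
            best, gained = score, length
        elif best - score >= x_drop:
            break  # score has dropped too far below the best seen
        q, s = q + step, s + step
    return best, gained


def extend(query, subject, q, s, w=3, x_drop=2):
    """Step 3 (simplified): score, query start, query end (exclusive) of the segment."""
    right, r_len = _walk(query, subject, q + w, s + w, +1, x_drop)
    left, l_len = _walk(query, subject, q - 1, s - 1, -1, x_drop)
    return w + left + right, q - l_len, q + w + r_len


query, subject = "GGVILSLVAA", "TTVILSLVCC"
hits = word_hits(query_words(query), subject)
print(hits)
print(extend(query, subject, *hits[0]))  # (6, 2, 8), i.e. query[2:8] == "VILSLV"
```

All four word hits lie on the same diagonal and extend to the same segment, which is why real implementations only extend once per diagonal region.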
BLAST
• searches for 2 non-overlapping segments on the same diagonal
• the segments must be within a certain distance of each other before extension is invoked
• can also allow gaps, so that the method joins segments on different diagonals
Assessing the Significance of Sequence Match
• length: can get artificially high scores between small sequences
• composition: if sequences are rich in particular amino acid residues, can get high scores for unrelated proteins
• to assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences
• if the database is small, or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences
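Assessing the Significance of Scores Returned from a Database Scan
[Figure: frequency distribution of scores, marking the mean (m) and standard deviation (s.d.) and the probe score S]
Z score = (S - m) / s.d., where S is the score, m is the mean for unrelated sequences and s.d. is the standard deviation
Z value > 3 s.d.: related sequences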
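The Z score can be estimated for a single pair by shuffling, as suggested for pair-wise comparisons. In this sketch the position-by-position identity count stands in for a real alignment score; the function names, the number of shuffles and the fixed random seed are assumptions made for the example.

```python
import random
from statistics import mean, stdev


def identity_score(seq_a, seq_b):
    """Stand-in score: identical residues at the same position, no gaps."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x == y)


def z_score(seq_a, seq_b, score_fn=identity_score, n_shuffles=200, seed=0):
    """Z = (S - m) / s.d., with m and s.d. taken from shuffled copies of seq_b."""
    rng = random.Random(seed)
    observed = score_fn(seq_a, seq_b)
    residues = list(seq_b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps length and composition, destroys order
        background.append(score_fn(seq_a, "".join(residues)))
    spread = stdev(background)
    if spread == 0:
        raise ValueError("background scores have zero variance")
    return (observed - mean(background)) / spread


probe = "ACDEFGHIKLMNPQRSTVWY" * 2
print(z_score(probe, probe) > 3)
```

Shuffling preserves length and composition, so the background controls for the two effects listed under significance of a match.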
BLAST results
BLAST best hit
>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26 [Homo sapiens]
Length = 337
Score = 298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)

Query: 1   MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
           M A LAGLLV + V+LLSNALVLLC +SA++R +A + +NL+ G+LL ++M
Sbjct: 1   MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60

Query: 61  PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
           P TL GV+ R P+ C++ FLDTFLA+N+ LS+AALS D+W+AV FPL Y ++R
Sbjct: 61  PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120

Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
           R A L++ W +L F AAL SWLG+ +ASC+L ER RFA FT HA+
Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180

S: score for the pairwise alignment.
E value: number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment.
Good Match: E < 1 x 10^-50
Possible Match: E from 1 x 10^-50 to 1 x 10^-2
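The link between a bit score and an E value can be sketched with the usual approximation E = m x n x 2^(-bit score), where m is the query length and n the database length in residues. This is an assumption-laden illustration: BLAST itself uses effective lengths after edge corrections, and the database size in the example is a made-up figure, not the one behind the hit shown.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E value: search space (m x n) scaled by 2 ** -bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Sanity check: a 10-bit score in a search space of 1024 cells is expected once.
print(expected_hits(10, 1, 1024))  # 1.0

# Hypothetical database size, for illustration only.
print(expected_hits(298, query_length=337, database_length=100_000_000))
```

Because the search space enters as a factor, the same bit score becomes less significant as the database grows.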
Needleman & Wunsch
  A H C N I R Q C L C R P M
A 1 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
I 0 0 0 0 1 0 0 0 0 0 0 0 0
N 0 0 0 1 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 1 0 0 0 0 1 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
K 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
R 0 0 0 0 0 1 0 0 0 0 1 0 0
H 0 1 0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 1 0
Needleman & Wunsch Algorithm
• Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
• Find the highest scoring path in the matrix by:
  • starting in the top left corner
  • moving down across the matrix from cell to cell
  • choosing the highest scoring cell at each move
• The path cannot go back on itself or cross the same row or column twice
Accumulating the Matrix
• Add to the score in the cell the highest score from a cell in the row or column to the right and below
[Figure: cell i,j and the candidate cells i-1,j-1, i-n,j-1 and i-1,j-m]
Accumulated matrix (Sequence A across the columns, Sequence B down the rows)
  A H C N I R Q C L C R P M
A 8 7 6 6 5 4 4 3 3 2 1 0 0
I 7 7 6 6 6 4 4 3 3 2 1 0 0
C 6 6 7 6 5 4 4 4 3 3 1 0 0
I 6 6 6 5 6 4 4 3 3 2 1 0 0
N 5 5 5 6 5 5 4 3 3 3 1 0 0
R 4 4 4 4 4 5 4 3 3 2 2 0 0
C 3 3 4 3 3 3 3 4 3 3 1 0 0
K 3 3 3 3 3 3 3 3 3 2 1 0 0
C 2 2 3 2 2 2 2 3 2 3 1 0 0
R 2 1 1 1 1 2 1 1 1 1 2 0 0
H 1 2 1 1 1 1 1 1 1 1 1 0 0
P 0 0 0 0 0 0 0 0 0 0 0 1 0
Possible Moves in Finding a Path across the Matrix
• start in the leftmost column or topmost row
• move to the highest scoring cell in the row or column to the right and below
[Figure: cell i,j and the candidate cells i-1,j-1, i-n,j-1 and i-1,j-m]
Alignment from the highest scoring path
Sequence A: A H C N I - R Q C L C R - P M
Sequence B: A I C - I N R - C K C R H P -
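The accumulation and path-finding steps can be sketched in Python as below. This follows the formulation on these slides (identity scores, no gap penalty, accumulation from the bottom right), not the recurrence with gap penalties found in most modern implementations. Indices are counted from the top left, so "to the right and below" means larger row and column numbers; the function names and the tie-breaking (the diagonal cell is preferred) are assumptions, so the printed alignment may differ from the slide while having the same score.

```python
def successors(n_rows, n_cols, i, j):
    """Cells reachable from (i, j): next row further right, or next column further down."""
    if i + 1 >= n_rows or j + 1 >= n_cols:
        return []
    in_next_row = [(i + 1, k) for k in range(j + 1, n_cols)]
    in_next_col = [(k, j + 1) for k in range(i + 2, n_rows)]
    return in_next_row + in_next_col  # diagonal cell comes first


def accumulate(seq_a, seq_b):
    """Identity matrix (rows = seq_b, columns = seq_a), accumulated from the bottom right."""
    n_rows, n_cols = len(seq_b), len(seq_a)
    m = [[int(b == a) for a in seq_a] for b in seq_b]
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            m[i][j] += max(m[r][c] for r, c in successors(n_rows, n_cols, i, j))
    return m


def best_path(m):
    """Start at the best cell in the top row or left column, then follow the maxima."""
    n_rows, n_cols = len(m), len(m[0])
    starts = [(0, j) for j in range(n_cols)] + [(i, 0) for i in range(1, n_rows)]
    cell = max(starts, key=lambda rc: m[rc[0]][rc[1]])
    path = [cell]
    while True:
        options = successors(n_rows, n_cols, *cell)
        if not options:
            return path
        cell = max(options, key=lambda rc: m[rc[0]][rc[1]])
        path.append(cell)


def render(path, seq_a, seq_b):
    """Turn the path into two gapped rows; residues skipped by a move face gaps."""
    top, bottom = [], []
    prev_i, prev_j = -1, -1
    for i, j in path + [(len(seq_b), len(seq_a))]:  # sentinel flushes the tails
        skipped_a, skipped_b = seq_a[prev_j + 1:j], seq_b[prev_i + 1:i]
        top.append(skipped_a + "-" * len(skipped_b))
        bottom.append("-" * len(skipped_a) + skipped_b)
        if i < len(seq_b) and j < len(seq_a):
            top.append(seq_a[j])
            bottom.append(seq_b[i])
        prev_i, prev_j = i, j
    return "".join(top), "".join(bottom)


seq_a, seq_b = "AHCNIRQCLCRPM", "AICINRCKCRHP"
matrix = accumulate(seq_a, seq_b)
row_a, row_b = render(best_path(matrix), seq_a, seq_b)
print(matrix[0][0])  # 8, the top-left value of the accumulated matrix
print(row_a)
print(row_b)
```

The top-left value is the score of the best path, which equals the number of identical residue pairs in the resulting alignment.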