Hidden Markov models for detecting remote protein homologies

Kevin Karplus, Christian Barrett, Richard Hughey
Georgia Hadjicharalambous

Brief outline
• Description and evaluation of a new hidden Markov model method, SAM-T98, for finding remote homologs of protein sequences.
• Evaluation with three fold-recognition test datasets and a curated database.
• Comparison with WU-BLASTP and DOUBLE-BLAST.
• Results.

1. Biology background
• Homologs: here, proteins (or genes) that share a common evolutionary ancestor. (The term is also used for paired chromosomes carrying the same genetic loci.)
• The structure of a protein can be predicted from its homology to sequences whose structure is known.
• Similar structures → similar functions.
• Proteins can therefore be classified into families with similar functions.
• Remote-homolog detection.

2. Statistics background
• Hidden Markov Models
• Q = set of states = {match, insert, delete}
• V = output alphabet = {20 amino acids}
• π(i) = probability of being in state i at time t = 0
• A = transition probabilities = {aij}, where aij = Pr[entering state j at time t+1 | in state i at time t]
• B = output probabilities = {bj(k)}, where bj(k) = Pr[producing vk at time t | in state j at time t]
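To make the notation above concrete, here is a minimal sketch of the forward algorithm, which sums over all state paths to give the probability that a model with initial distribution `pi`, transitions `A`, and outputs `B` produces a given symbol sequence. The two-state model and its numbers are made up for illustration, and every state is assumed to emit a symbol (a real profile HMM also has silent delete states, which this toy version does not handle).

```python
def forward_probability(obs, pi, A, B):
    """Return Pr[obs | model], summed over all state paths."""
    n_states = len(pi)
    # alpha[j] = Pr[symbols seen so far, and being in state j now]
    alpha = [pi[j] * B[j][obs[0]] for j in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical toy model: 2 states, alphabet of 2 symbols (0 and 1).
pi = [0.6, 0.4]                  # pi[i]   = Pr[state i at t = 0]
A = [[0.7, 0.3], [0.4, 0.6]]     # A[i][j] = Pr[state j at t+1 | state i at t]
B = [[0.5, 0.5], [0.1, 0.9]]     # B[j][k] = Pr[symbol k | state j]

p = forward_probability([0, 1], pi, A, B)
print(p)  # about 0.2156
```

Because the rows of `A` and `B` each sum to 1, the probabilities of all sequences of a fixed length also sum to 1, which is a quick sanity check on the implementation.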

HMMs as profile models
• Homologs here are proteins related by common ancestry. (The word is also used for chromosomes carrying the same genetic loci; a diploid cell has 2 copies of each such homolog, one derived from each parent.)
• A profile of a protein family is a labeling of the positions of the amino acids in the secondary structure and a probability distribution for each position.
• The structure of a protein can be predicted from its homology to sequences whose structure is known.
• Proteins with similar structure are assumed to have similar function → classification of proteins into families according to their function.

Typical Profile HMM:
• A chain of match, insert and delete states.
• Every transition between nodes has its own probability, and the match and insert states have their own character costs.
• BEST PATH: a single path from ‘Start’ to ‘End’ in which each character of the sequence is assigned to a successive match or insert state along the path.
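The best-path idea can be sketched with a small Viterbi search over a global profile HMM. This is an illustration only, not the SAM implementation: the transition probabilities in `TRANS`, the uniform insert emissions, and the helper `make_toy_profile` are all hypothetical, and the same transition table is reused at every node for brevity.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"
NEG_INF = float("-inf")

# Hypothetical transition probabilities, shared by every node for brevity.
TRANS = {
    ("M", "M"): 0.90, ("M", "I"): 0.05, ("M", "D"): 0.05,
    ("I", "M"): 0.50, ("I", "I"): 0.40, ("I", "D"): 0.10,
    ("D", "M"): 0.50, ("D", "I"): 0.10, ("D", "D"): 0.40,
}


def make_toy_profile(consensus, p_match=0.9):
    """Match emissions that favour one consensus residue per position."""
    rest = (1.0 - p_match) / (len(AMINO) - 1)
    return [{a: (p_match if a == c else rest) for a in AMINO} for c in consensus]


def viterbi_profile(seq, match_emit, trans=TRANS):
    """Return (log-probability, state path) of the best Start-to-End path."""
    n, K = len(seq), len(match_emit)
    log_t = {key: math.log(p) for key, p in trans.items()}
    log_ins = math.log(1.0 / len(AMINO))  # uniform insert emissions (assumption)
    # score[s][i][k]: best log-prob of emitting seq[:i] and ending in state s of node k
    score = {s: [[NEG_INF] * (K + 1) for _ in range(n + 1)] for s in "MID"}
    back = {s: [[None] * (K + 1) for _ in range(n + 1)] for s in "MID"}
    score["M"][0][0] = 0.0  # Start is treated as match state 0

    def best_into(i, k, target):
        # Best predecessor state at cell (i, k) for a move into `target`.
        return max((score[s][i][k] + log_t[(s, target)], s) for s in "MID")

    for i in range(n + 1):
        for k in range(K + 1):
            if i >= 1 and k >= 1:  # match: consumes a character, advances a node
                prev, s = best_into(i - 1, k - 1, "M")
                score["M"][i][k] = prev + math.log(match_emit[k - 1][seq[i - 1]])
                back["M"][i][k] = s
            if i >= 1:             # insert: consumes a character, same node
                prev, s = best_into(i - 1, k, "I")
                score["I"][i][k] = prev + log_ins
                back["I"][i][k] = s
            if k >= 1:             # delete: silent, advances a node
                prev, s = best_into(i, k - 1, "D")
                score["D"][i][k] = prev
                back["D"][i][k] = s

    end_score, s = best_into(n, K, "M")  # End is treated like a match state
    path, i, k = [], n, K
    while (s, i, k) != ("M", 0, 0):
        path.append(f"{s}{k}")
        prev_state = back[s][i][k]
        if s == "M":
            i, k = i - 1, k - 1
        elif s == "I":
            i -= 1
        else:
            k -= 1
        s = prev_state
    return end_score, path[::-1]


profile = make_toy_profile("ACD")
print(viterbi_profile("ACD", profile)[1])
print(viterbi_profile("AD", profile)[1])
```

With this toy profile, a sequence equal to the consensus follows the match states only, while a sequence missing the middle residue routes through a delete state at node 2. A local alignment, as used in the work described here, would additionally let the path enter and leave the model at any node.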

Example

For this work:
• A local alignment procedure was used: it relates part of the sequence to one contiguous path through part of the HMM.
• An HMM is trained on sequences that are members of a protein family; the resulting HMM identifies the amino acid positions that describe the structure of the family.
• This HMM is then used to discriminate family members from non-members.

TEST SETS: Fold-recognition datasets
• FSSP: based on a protein classification tree (Holm and Sander, 1996, 1997); presents a continuously updated structural classification of 3-dimensional protein folds (sequences of 1050 leaves of the FSSP tree, 166 target sequences).
• Uses DALI to determine structural homology.
• Classification by z-score:
  z-score > 6 → homologs
  z-score < 2 → non-homologs
  2 < z-score < 6 → partly yes or no
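The z-score rule above can be written as a small function. This is a sketch of the thresholds as stated on the slide; the label strings and the treatment of scores exactly equal to 2 or 6 (placed in the intermediate class here) are assumptions, since the slide does not specify them.

```python
def classify_pair(z_score):
    """Label a sequence pair from its DALI z-score."""
    if z_score > 6:
        return "homolog"
    if z_score < 2:
        return "non-homolog"
    # 2 <= z <= 6: neither clearly homologous nor clearly unrelated.
    return "intermediate"


for z in (8.3, 4.0, 1.1):
    print(z, classify_pair(z))
```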

• Even with the best possible classifier, about 2% of the non-self pairs represent homologies to be detected.
• At the minimum-error point for an optimal classifier, about 1.4% of the pairs are homolog pairs.

SCOP (Structural Classification of Proteins)
• 2 test sets (Brenner, 1996; Park et al., 1997): identical target lists for both.
• Database of known folds.
• Homologous pair: both sequences are in the same SCOP superfamily.
• No 2 sequences had > 40% sequence similarity.
• Whole-chain test set: 0.6% correct homologies.
• Domain test set: the same.

Sequence comparison dataset
• Pearson: curated version of the PIR database (Barker et al., 1990).
• 12 216 sequences in total.
• Set of 67 target sequences.
• 0.4% of pairs were considered correct.
• PIR (Protein Information Resource) families are close homologs → the Pearson test measures close-homolog classification, NOT remote-homolog detection.

ALGORITHMS
• WU-BLAST (Washington University version of BLAST, the Basic Local Alignment Search Tool):
• Protein identification software for genes.
• E (expected number of false positives) set to 10.
• The log of the P-value is reported as the score to threshold.
• The optimum threshold never corresponded to a P-value > 0.005.
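As a rough illustration of thresholding on the log of the P-value, the sketch below converts an E-value to a P-value with the standard BLAST relation P = 1 - exp(-E) and compares its logarithm with a cutoff. The function names are hypothetical, and 0.005 is used only as an example cutoff taken from the bound quoted above, not as the threshold actually chosen in any of the tests.

```python
import math


def e_to_p(e_value):
    """P-value for a given E-value: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)  # numerically stable for very small E


def log_p_score(e_value):
    """Natural log of the P-value; lower means more significant."""
    return math.log(e_to_p(e_value))


def accept(e_value, p_threshold=0.005):
    """Keep a hit if its log P-value is at or below the log of the cutoff."""
    return log_p_score(e_value) <= math.log(p_threshold)


for e in (1e-6, 0.001, 10.0):
    print(e, e_to_p(e), accept(e))
```

For small E the two quantities are nearly equal, so a hit with E = 0.001 has P close to 0.001; at the reporting limit E = 10, P is very close to 1 and the hit would fail any strict threshold.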

DOUBLE-BLAST: Inspired by ISS, intermediate sequence search (Park et al., 1997). ISS was used to recognize remotely evolutionarily related sequence pairs derived from the SCOP database, and is considered to increase detection compared with FASTA.
• Two-step approach:
  1. A set of close homologs to the target sequence is found in NRP.
  2. Each homolog is used as a query to search the final database.
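The two-step approach can be sketched as follows. Everything here is a stand-in: `toy_search` scores by shared 3-mers instead of running BLAST, `close_cutoff` is an arbitrary value, and keeping the best (lowest) score over all queries, as well as including the target itself among the step-2 queries, are assumptions about how the results might be combined.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_search(query, database):
    """Stand-in for a BLAST run: score = -(shared 3-mers), lower is better."""
    q = kmers(query)
    return {name: -len(q & kmers(seq)) for name, seq in database.items()}


def double_search(target, intermediate_db, final_db,
                  search=toy_search, close_cutoff=-3):
    # Step 1: find close homologs of the target in the intermediate database.
    step1 = search(target, intermediate_db)
    queries = [target] + [intermediate_db[name]
                          for name, score in step1.items()
                          if score <= close_cutoff]
    # Step 2: search the final database with each query, keeping the best score.
    best = {}
    for query in queries:
        for name, score in search(query, final_db).items():
            best[name] = min(score, best.get(name, score))
    return best


target = "ACDEFG"
intermediate_db = {"h1": "ACDEFGHIK", "x": "WWWWWW"}  # plays the role of NRP
final_db = {"remote": "GHIKLM"}

print(toy_search(target, final_db))
print(double_search(target, intermediate_db, final_db))
```

In this toy data the target shares nothing with `remote` directly, but the intermediate sequence `h1` overlaps both, so the two-step search gives `remote` a better score than the direct search does.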

SAM-T98: Starting from a single target sequence, the method finds and multiply aligns a set of homologs and creates an HMM from that multiple alignment. The resulting HMM is then used for the database search.
• Uses the SAM package.
• When the database is small, the method is used to create an HMM for each sequence in the database.
• For the fold-recognition tests, HMMs were created for all sequences.
• For the Pearson test, HMMs were created only for the 67 target sequences.
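The step from a multiple alignment to an HMM can be illustrated by estimating match-state emission probabilities from alignment columns. This is a simplified sketch, not the SAM procedure: the add-one pseudocount, the rule that columns with more than half gaps are skipped as insert columns, and the function name are all assumptions chosen for brevity.

```python
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(alignment, pseudocount=1.0, max_gap_fraction=0.5):
    """Per-column residue probabilities for the match states of a profile."""
    n_cols = len(alignment[0])
    profile = []
    for c in range(n_cols):
        column = [row[c] for row in alignment]
        if column.count("-") / len(column) > max_gap_fraction:
            continue  # mostly gaps: treat as an insert column, no match state
        counts = Counter(a for a in column if a != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO})
    return profile


# Hypothetical alignment of four short homologs; "-" marks a gap.
alignment = ["ACD", "ACD", "A-D", "ACE"]
profile = match_emissions(alignment)
print(len(profile), profile[0]["A"], profile[2]["D"])
```

The pseudocount keeps every residue at a nonzero probability, so a database sequence with an unseen residue at some position is penalized rather than ruled out.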
