Protein Sequence Comparison: Tools and Methods for Better Understanding

CS177 Lecture 13Review/Summary of the Madej lectures Tom Madej 12.06.04

Overview • Basic biology. • Protein/DNA sequence comparison. • Protein structure comparison/classification. • NCBI databases overview. • Miscellaneous topics.

Lodish et al. Molecular Cell Biology, W.H. Freeman 2000

Protein/DNA sequence comparison • What is the meaning of a sequence alignment? • Scoring methods; amino acid substitution matrices, PSSMs. • Basic computational methods; e.g. BLAST. • Know how to run PSI-BLAST, interpret the results.

Homology “… whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology.” E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function, Kluwer 2003

A simple phylogenetic tree…

Human hemoglobin and more distantly related globins • Human and horse • Human and fish • Human and insect • Human and bacteria

Alignment notation: different notations for the same alignment! VISDWNMPN-------MDGLE CILVV----AANDGPMPQTRE VISDWnm---pnMDGLE CILVVaandgpmPQTRE

Computing sequence alignments • You must be able to recognize the “answer” (correct alignment) when you see it (scoring system). • You must be able to find the answer; i.e. compute it efficiently.

Scoring and computing alignments • “Position independent” amino acid substitution tables; e.g. BLOSUM62. • Global alignment algorithms such as Smith-Waterman (dynamic programming); or fast heuristics such as BLAST.

Score this alignment: VISDWnm---pnMDGLE CILVVaandgpmPQTRE Use: BLOSUM62 matrix; gap opening penalty 10; gap extension penalty 1 (-1 + 4 – 2 – 3 – 3) –10 – 1*11 + (-2 + 0 – 2 – 2 + 5) = -27

BLAST (Basic Local Alignment Search Tool) • Extremely fast, can be on the order of 50-100 times faster than Smith-Waterman. • Method of choice for database searches. • Statistical theory for significance of results (extreme value distribution). • Heuristic; does not guarantee optimal results. • Many variants, e.g. PHI-, PSI-, RPS-BLAST.

Why database searches? • Gene finding. • Assigning likely function to a gene. • Identifying regulatory elements. • Understanding genome evolution. • Assisting in sequence assembly. • Finding relations between genes.

Issues in database searches • Speed. • Relevance of the search results (selectivity). • Recovering all information of interest (sensitivity). • The results depend on the search parameters, e.g. gap penalty, scoring matrix. • Sometimes searches with more than one matrix should be performed.

E-values, P-values • E-value, Expectation value; this is the expected number of hits of at least the given score, that you would expect by random chance for the search database. • P-value, Probability value; this is the probability that a hit would attain at least the given score, by random chance for the search database. • E-values are easier to interpret than P-values. • If the E-value is small enough, e.g. no more than 0.10, then it is essentially a P-value.

PSI-BLAST • Position Specific Iterated BLAST • As a first step runs a (regular) BLAST. • Hits that cross the threshold are used to construct a position specific score matrix (PSSM). • A new search is done using the PSSM to find more remotely related sequences. • The last two steps are iterated until convergence.

PSSM (Position Specific Score Matrix) • One column per residue in the query sequence. • Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column. • There are difficulties; e.g. pseudo-counts are needed if there are not a lot of sequences, the sequences must be weighted to compensate for redundancy.

Two key advantages of PSSMs • More sensitive scoring because of improved estimates of probabilities for a.a.’s at specific positions. • Describes the important motifs that occur in the protein family and therefore enhances the selectivity.

Position Specific Substitution Rates Weakly conserved serine Active site serine

Position Specific Score Matrix (PSSM) A R N D C Q E G H I L K M F P S T W Y V 206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0 210 D -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3 212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7 219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3 Serine scored differently in these two positions Active site nucleophile

PSI-BLAST key points • The first PSSM is constructed from all hits that cross the significance threshold using “standard” BLAST. • The search is then carried out with the PSSM to draw in new significant hits. • If new hits are found then a new PSSM is constructed; these last two steps are iterated. • The computation terminates upon “convergence”, i.e. when no new sequences are found to cross the significance threshold.

Protein structure comparison/classification • Protein secondary structure elements. • Supersecondary structures (simple structure motifs). • Folds and domains. • Comparing structures (VAST). • Superfolds. • Fold classification (SCOP). • Conserved Domain Database (CDD).

α-helix (3chy) backbone atoms with sidechains

Parallel β-strands (3chy)

Anti-parallel β-strands (1hbq)

Higher level organization • A single protein may consist of multiple domains. Examples: 1liy A, 1bgc A. The domains may or may not perform different functions. • Proteins may form higher-level assemblies. Useful for complicated biochemical processes that require several steps, e.g. processing/synthesis of a molecule. Example: 1l1o chains A, B, C.

Supersecondary structures • β-hairpin • α-hairpin • βαβ-unit • β4 Greek key • βα Greek key

Supersecondary structure: simple units G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981

Supersecondary structure: Greek key motifs G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981

Protein folds • There is a continuum of similarity! • Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing. • Fold classification: To get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%).

Vector Alignment Search Tool (VAST) • Fast structure comparison based on representing SSEs by vectors. • A measure of statistical significance (VAST E-value) is computed (very differently from a BLAST E-value). • VAST structure neighbor lists useful for recognizing structural similarity.

Superfolds (Orengo, Jones, Thornton) • Distribution of fold types is highly non-uniform. • There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar. There are many examples of “isolated” fold types. • Superfolds are characterized by a wide range of sequence diversity and spanning a range of non-similar functions. • It is a research question as to the evolutionary relationships of the superfolds, i.e. do they arise by divergent or convergent evolution?

Globin 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin α-up-down 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe apolipoprotein E3 Trefoil 1i1b interleukin-1β; 1aaiB ricin; 1tie erythrina trypsin inhibitor TIM barrel 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco OB fold 1quqA replication protein A 32kDa subunit; 1mjc major cold-shock protein; 1bcpD pertussis toxin S5 subunit α/β doubly-wound 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY Immunoglobulin 2rhe Bence-Jones protein; 2cd4 CD4; 1ten tenascin UB αβ roll 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G Jelly roll 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin Plaitfold (Split αβ sandwich) 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier Superfolds and examples

SCOP (Structural Classification of Proteins) • http://scop.mrc-lmb.cam.ac.uk/scop/ • Levels of the SCOP hierarchy: • Family: clear evolutionary relationship • Superfamily: probable common evolutionary origin • Fold: major structural similarity

Bioinformatics databases • Entrez is by far the most useful, because of the links between the individual databases, e.g. literature, sequence, structure, taxonomy, etc. • Other specialty databases available on the internet can also be very useful, of course!

PubMed abstracts Taxonomy Genomes Nucleotide sequences Links Between and Within Nodes Word weight Computational 3 -D Structures 3-D Structure VAST Phylogeny Computational Protein sequences BLAST BLAST Computational Computational

Entrez queries • Be able to formulate queries using index terms (Preview/Index), and limits.

Exercises! • How many protein structures are there that include DNA and are from bacteria? • In PubMed, how many articles are there from the journal Science and have “Alzheimer” in the title or abstract, and “amyloid beta” anywhere? How many since the year 2000? • Notice that the results are not 100% accurate! • In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse?

P53 tumor suppressor protein • Li-Fraumeni syndrome; only one functional copy of p53 predisposes to cancer. • Mutations in p53 are found in most tumor types. • p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein cdk2. This prevents the cell from progressing thru the cell cycle.

G. Giglia-Mari, A. Sarasi, Hum. Mutat. (2003) 21 217-228.

Exercise! • Use Cn3D to investigate the binding of p53 to DNA. • Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).

Miscellaneous topics • BLAST a sequence against a genome; locate hits on chromosomes with map viewer. • Obtain genomic sequence with map viewer. • Spidey to predict intron/exon structure. • How sequence variations can affect protein structure/function.

“EST exercise” summary • BLAST the EST (or other DNA seq) against the genome. • From the BLAST output you can get the genomic coordinates of any nucleotide differences. • Use map viewer to locate the hit on a chromosome; assume the hit is in the region of a gene. • By following the gene link you can get an accession for mRNA. • By using the “dl” link you can get an accession for the genomic sequence. • Use “spidey” with the mRNA and genomic sequence to locate changed residues in the protein.

“EST exercise” summary (cont.) • From the gene report you can follow the protein link, and then “Blink”. • From the BLAST link page you can get to CDD and related structures. • Since you know where are the changed residues you can use the structures to study what effect the changes might have on the function of the protein.

Gene variants that can affect protein function • Mutation to a stop codon; truncates the protein product! • Insertion/deletion of multiple bases; changes the sequence of amino acid residues. • Single point change could alter folding properties of the protein. • Single point change could affect the active site of the protein. • Single point change could affect an interaction site with another molecule.

Protein Sequence Comparison: Tools and Methods for Better Understanding

Protein Sequence Comparison: Tools and Methods for Better Understanding

Presentation Transcript

PPT Presentation

Download the Powerpoint Presentation - PowerPoint Presentation

talk-ppt - PowerPoint Presentation

PowerPoint Summary of: ADR

Summary of lectures

Summary of the last lecture

Review Lectures

Summary of Presentation

Presenting Lectures with PowerPoint

Summary of the presentation

Review of the Previous Lectures

Summary of the Last Lecture

Summary of the last lecture

CS177 Lecture 4 Sequence Alignment

Summary of Presentation

Summary of the previous lecture

Summary of previous lectures

Summary of the Last Lecture

Summary of the Last lecture

Summary of the Last Lecture

DECAL lectures summary

Lecture 13 - Review