Predicting cellular localization

Predicting cellular localization Bioe C144/C244 Fall 2010

Eukaryotic protein localization

Why localize? • Subcellular localization is a key functional characteristic of proteins. • To co-operate for a common physiological function (metabolic pathway, signal transduction cascade, structural associate etc.), proteins must be localized in the same cellular compartment. http://mendel.imp.univie.ac.at/CELL_LOC/

Correct Localization is required for pathway/complex formation • A set of many co-operating proteins is responsible for a physiological function (metabolic pathway, signal transduction cascade, structural associate etc.). • Subcellular localization is an essential characteristic for this level. • For proper functioning, the protein has to be translocated to the correct intra- or extracellular compartments in a soluble form or attached to a membrane. http://mendel.imp.univie.ac.at/CELL_LOC/

Computer-Aided Approaches for the Assignment of Subcellular Localization • Automatic, computer-aided selection methods are the only practical way to pick out attractive target proteins from the haystack of new gene sequence data. • One helpful decision criterion is the probable subcellular localization of the gene products. • For example, in a search for virulence factors of pathogenic bacteria or easily accessible entry points for pharmaceutical drugs, extracellular proteins are good candidates. http://mendel.imp.univie.ac.at/CELL_LOC/

Computer-Aided Approaches for the Assignment of Subcellular Localization • For a primary screening of gene sequences, the first step is a general classification into intracellular, extracellular, membrane-related (both with transmembrane regions and with lipid anchors) and viral proteins. • For eukaryotes, it is desirable to resolve intracellular location further by organelle (mitochondrion, chloroplast, endoplasmic reticulum and Golgi apparatus, nucleus). http://mendel.imp.univie.ac.at/CELL_LOC/

Predicting subcellular localization by homology with characterized proteins • Subcellular localization can often be assigned by searching for homologous sequences. • This is an easy task for a few new proteins but very difficult for thousands of sequences contained in new genomes. • Even with the most advanced retrieval systems and relying on the well-annotated SWISS-PROT, it is impossible to get exhaustive classifications with respect to subcellular localization. http://mendel.imp.univie.ac.at/CELL_LOC/

Prediction method 2: analysis of sequence properties • First attempts to classify proteins with respect to cellular localization based on amino acid sequence properties: Nishikawa and Ooi (J. Biochem. 1982) • Amino acid composition, disulphide bonds and secondary structural class were related to function and localization • Early results were promising, but based on a small sample. http://mendel.imp.univie.ac.at/CELL_LOC/
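
The composition feature mentioned on this slide can be sketched as follows. This is a minimal illustration, not the procedure of Nishikawa and Ooi; the function name `composition` and the example sequence are made up for the sketch.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each standard amino acid in seq.

    Non-standard characters (X, gaps, etc.) are ignored.
    """
    seq = seq.upper()
    counts = Counter(aa for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical 10-residue peptide, used only to show the call.
comp = composition("MKKLLLAAGC")
print(comp["L"], comp["K"])
```

The resulting 20-element vector is the kind of input a classifier could be trained on; which classifier to use is left open here.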

Prediction by signal peptide detection • Some proteins have sequence signals that determine their translocation to organelles or outside the cell • Claros et al. Curr.Op.Struct.Biol. (1997). • These patterns are not clear cut, especially for the intracellular organelle targeting peptides; • prediction accuracy is limited • Nielsen et al. Prot.Eng. (1997) v.10, 1 • Combinations of compositional and signal sequence analyses have been used in expert systems for the prediction of cellular localization • Nakai & Kanehisa Genomics (1992); • In general: not systematic and not rigorously tested http://mendel.imp.univie.ac.at/CELL_LOC/

Extracting information from sequence • Signal peptides: short sequences in the protein used to target the protein for specific cellular compartments. • Signal patches (clusters of amino acids in close proximity in 3D structure, but distant in primary sequence) are also found • Examination of amino acids at structure surface can be particularly helpful; subtle preferences of different amino acids for different environments

Trans-membrane helix prediction

Helical membrane proteins • Key components in cell-cell signalling • Mediate transport of ions and solutes across membrane • Crucial for recognition of self • Major class of drug targets • More than 50% of prescription drugs act on GPCRs (G-protein coupled receptors) • Multi-billion dollar industry

Many predicted; few known • Solved structures available for very few membrane proteins • Predicted 10K helical membrane proteins in human genome (~25% of genome!) Chen and Rost, 2002

Helical membrane proteins challenge bioinformatics • Very little info about 3D structures • Very hard to crystallize • Hardly traceable by nuclear magnetic resonance (NMR) spectroscopy • Relatively easy to identify (rough) location of helices through low-resolution experiments • C-terminal fusion with indicator proteins • Antibody binding Chen and Rost, 2002

Concepts for predicting TM helix location and topology • Hydrophobicity scales provide simple criteria for prediction • TM helices are predominantly non-polar • TM helix length between 12-35 aa • Globular regions between membrane helices typically shorter than 60 aa • “Positive inside rule” von Heijne • Connecting loop regions on inside have more positive charge than loop regions on outside Chen and Rost, 2002

Hydrophobicity scales • Kyte and Doolittle (1982) • Hydropathy scale, moving window approach • Window of 19 residues discriminated best between membrane and globular • Other work equally successful • Drawback: methods fail to discriminate between membrane regions and highly hydrophobic globular segments Chen and Rost, 2002
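
A minimal sketch of the moving-window hydropathy approach, using the Kyte-Doolittle scale values and the 19-residue window from the slide. The cutoff of 1.6 is a commonly quoted value and should be treated as an adjustable assumption; the function names are made up for the sketch.

```python
# Kyte-Doolittle hydropathy values per residue.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along seq.

    Entry i covers residues i .. i+window-1. Raises KeyError on
    non-standard residue codes.
    """
    values = [KD[aa] for aa in seq.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one
        profile.append(total / window)
    return profile


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows whose mean exceeds threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h > threshold]


# Synthetic example: a 19-residue leucine stretch flanked by aspartates.
toy = "D" * 20 + "L" * 19 + "D" * 20
print(candidate_tm_windows(toy))
```

As the slide notes, a window that passes the cutoff may still be a hydrophobic segment of a globular protein, so hits are candidates only.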

Other clues • Amino acid preferences for membrane and non-membrane proteins • Training data for methods derived from proteins identified as containing TM helices, as well as other secondary structure types • Higher accuracy Chen and Rost, 2002

Including topology helps • TopPred (von Heijne, 1992) • Topology prediction, using hydrophobicity analysis, possible topologies ranked by positive-inside rule • SOSUI (Hirokawa et al, 1998) • Combined KD hydropathy, amphiphilicity, relative and net charges, protein length Chen and Rost, 2002
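
Ranking topologies by the positive-inside rule can be illustrated with a much-simplified sketch. It is not TopPred's algorithm: it assumes helix positions are already given as sorted, non-overlapping half-open (start, end) intervals, counts only K and R, and uses made-up function names.

```python
def loops_from_helices(seq_len, helices):
    """Half-open loop intervals flanking the given helices.

    helices must be sorted and non-overlapping; the first loop is the
    N-terminal tail and the last loop is the C-terminal tail.
    """
    loops = []
    prev_end = 0
    for start, end in helices:
        loops.append((prev_end, start))
        prev_end = end
    loops.append((prev_end, seq_len))
    return loops


def predict_topology(seq, helices):
    """Return (side of N-terminus, K+R on N-terminal side, K+R on other side).

    Loops alternate sides of the membrane, so even-indexed loops lie on
    the same side as the N-terminus. The side with more K+R is called
    'in' (ties are arbitrarily resolved as N-terminus inside).
    """
    loops = loops_from_helices(len(seq), helices)
    charge = [sum(seq[s:e].count(aa) for aa in "KR") for s, e in loops]
    same_side = sum(charge[0::2])
    other_side = sum(charge[1::2])
    n_term = "in" if same_side >= other_side else "out"
    return n_term, same_side, other_side


# Synthetic two-helix protein with positively charged tails.
toy = "MKRK" + "L" * 20 + "DE" + "L" * 20 + "KRKR"
print(predict_topology(toy, [(4, 24), (26, 46)]))
```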

Including homology helps • Alignment of homologs known to help secondary structure prediction (Rost and Sander, 1993) • Note: for 20-30% of proteins in any genome, no identifiable homologs can be found! • PHDhtm first method using homology info for membrane prediction • Uses neural networks, DP, multiple alignment • “one of the most accurate prediction methods” Chen and Rost, 2002

Including homology helps • TMAP (Persson and Argos, 1996) • Derived amino acid propensities from known TMs • 4-residue caps of membrane helices • 21 residue TM segments • Found at outside of membrane: N D G F P W Y V • Found mostly inside: A R C K • Used these propensities to improve prediction Chen and Rost, 2002

Grammatical rules • TMHMM pioneered building models of predicted membrane proteins in one consistent methodology • Sonnhammer et al 1998, Krogh et al 2001 • Similar concept implemented in HMMTOP • Tusnady and Simon, 1998 • MEMSAT similar to HMMTOP • Jones et al, 1994 Chen and Rost, 2002

Topology questions • The topology of a TM protein indicates its orientation with respect to the membrane: • which regions are outside (extracellular) and which are cytoplasmic • Predicted topologies turn out to be wrong roughly as often as they’re correct… Chen and Rost, 2002

Sequence information aiding TM recognition • Hydrophobic stretches (for lipid bilayer) • “Positive inside rule” • Von Heijne 1986, 1994 • Abundance of positively charged residues • Improved predictions through use of: • sliding windows • Multiple alignment • Neural networks Chen and Rost, 2002

Errors in TM prediction • Under-prediction (False negative) • Over-prediction (False positive) • False merge • two adjacent helices predicted to be one helix • False split • One long helix predicted to be two • Inexact placement of helices Chen and Rost, 2002
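
The error categories above can be made concrete by comparing observed and predicted helix intervals. This is only a sketch: the minimum overlap of 3 residues is an arbitrary assumption, inexact placement is not scored, and the names are invented for the example.

```python
def overlaps(a, b, min_overlap=3):
    """True if half-open intervals a and b share at least min_overlap residues."""
    return min(a[1], b[1]) - max(a[0], b[0]) >= min_overlap


def classify_errors(observed, predicted, min_overlap=3):
    """Sort helices into the error classes listed on the slide."""
    report = {"missed": [], "overpredicted": [],
              "false_split": [], "false_merge": []}
    for obs in observed:
        hits = [p for p in predicted if overlaps(obs, p, min_overlap)]
        if not hits:
            report["missed"].append(obs)        # under-prediction
        elif len(hits) > 1:
            report["false_split"].append(obs)   # one helix predicted as several
    for pred in predicted:
        hits = [o for o in observed if overlaps(o, pred, min_overlap)]
        if not hits:
            report["overpredicted"].append(pred)  # over-prediction
        elif len(hits) > 1:
            report["false_merge"].append(pred)    # several helices predicted as one
    return report


# Made-up intervals that trigger each error class once.
observed = [(10, 30), (40, 60), (80, 100), (120, 140)]
predicted = [(12, 58), (80, 88), (92, 100), (200, 220)]
print(classify_errors(observed, predicted))
```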

Prediction accuracy (1) • Performance accuracy overestimated significantly! • “developers have overrated their methods by 15-50%” Chen et al, unpublished • Why do developers overestimate their method accuracy? • Validation performed on proteins closely related to training sequences (and thus not indicative of performance on novel sequences) Chen and Rost, 2002

Prediction accuracy (2) • “Membrane helices are not entirely conserved across species” • Implies that even related proteins may have different topologies (# TM helices, orientation) and perform different cellular functions • N.B. There is no indication that the authors meant to imply that proteins that are globally alignable have differences in their TM domain locations or numbers • Measures of accuracy of prediction not comparable across methods, due to lack of standard benchmark • Benchmark dataset now available at EBI Chen and Rost, 2002

Chen et al findings • Most TM methods get the number of helices right for most membrane proteins • 86% of TMH residues predicted by best methods • 70-75% of proteins get all TM helices predicted correctly by top methods • Topology correct for only half of all proteins Chen and Rost, 2002

Prediction accuracy (4) • Some papers have claimed that simple hydrophobicity scales are as accurate as more sophisticated methods • Chen et al disagree Chen and Rost, 2002

Prediction accuracy (5) • All methods confuse membrane helices with signal peptides • Best separation provided by ALOM2 (Nakai and Kanehisa) • Optimized to sort proteins into classes of sub-cellular localization • Note: since Chen and Rost’s paper, the Phobius server was developed to integrate TM and signal peptide prediction: http://www.ebi.ac.uk/Tools/phobius/index.html Chen and Rost, 2002

Prediction accuracy (6) • Most methods wrongly predict membrane helices in globular proteins • Most methods overestimate their ability to distinguish between globular and membrane proteins Chen and Rost, 2002

Emerging and future developments • Improved prediction by averaging over many methods (i.e., consensus approaches) • Promponas and colleagues: CoPreTHi combined 7 methods, requiring 3 to agree • Nilsson et al, 2000, used 5 methods • Accuracy correlated with number of methods agreeing Chen and Rost, 2002
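
A consensus vote of the "7 methods, 3 must agree" kind can be sketched per residue. This is not a reimplementation of CoPreTHi; it assumes each method's output has already been reduced to a string with "H" for helix and "-" otherwise, and the seven strings below are invented.

```python
def consensus(predictions, min_agree=3):
    """Per-residue vote over equal-length label strings.

    A residue is called helix ('H') when at least min_agree of the
    input predictions label it 'H'.
    """
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("all predictions must cover the same sequence length")
    out = []
    for column in zip(*predictions):
        votes = sum(1 for label in column if label == "H")
        out.append("H" if votes >= min_agree else "-")
    return "".join(out)


# Seven hypothetical method outputs for an 8-residue stretch.
methods = [
    "--HHHH--",
    "-HHHH---",
    "--HHH---",
    "------HH",
    "--------",
    "---HH---",
    "HH------",
]
print(consensus(methods, min_agree=3))
```

Counting the votes per residue, rather than thresholding them, would also give the agreement level that the slide reports as correlating with accuracy.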

Emerging and future developments • Amphiphilic (aka amphipathic) alpha helix identification can improve prediction • Helical-membrane and signal peptide predictions must be combined explicitly • Best signal peptide prediction tool is SignalP (Nielsen et al 1997) • PSORT, HMMTOP and TMHMM integrate these predictions • More thorough combination is still missing (except, of course, for Phobius, released since this paper) Chen and Rost, 2002

Emerging and future developments • Databases of TM proteins being produced and curated • Membrane-specific substitution matrices improve database search for TM proteins • Current substitution matrices based on globular proteins • Henikoff and Henikoff have membrane-helix-specific substitution matrix PHAT Chen and Rost, 2002

Sequence conservation in TM domains • Residues on helix-helix interface tend to be more conserved than those facing the lipid bilayer • Conservation in TM helices greater than structurally variable regions but not as significant as enzyme active sites and other functionally critical regions (KS observation)

More data from structural studies of TM proteins • Solved membrane protein structures have also shown that helical propensities are different in the membrane. • Glycine and proline, which are thought to be helix-breakers in soluble proteins, occur in the transmembrane helices of cytochrome c oxidase • [Tsukihara et al, 1995]. • Studying known structures has revealed that aromatic residues are often in the bilayer interface, possibly anchoring the transmembrane helix in the bilayer • [Pawagi et al, 1994].

More data from structural studies • Serine and threonine can satisfy hydrogen bond donors and acceptors by hydrogen bonding to backbone carbonyls, making membrane localization favorable (Engelman et al 1986) • Analysis of solved membrane proteins shows TM length ranges from 14-36 aa (varying due to variations in lipid bilayer width) • Canonical alpha helix prediction methods derived from soluble proteins are not as effective at predicting TM-located helices

TMHMM provides a grammar to parse sequences into subregions

TMHMM author findings • TMHMM correctly predicts 97–98 % of the transmembrane helices. • TMHMM can discriminate between soluble and membrane proteins with both specificity and sensitivity better than 99% • although the accuracy drops when signal peptides are present • This high degree of accuracy allowed authors to predict reliably integral membrane proteins in a large collection of genomes. • Based on these predictions, authors estimate that 20–30 % of all genes in most genomes encode membrane proteins • which is in agreement with previous estimates. • Proteins with Nin-Cin topologies are strongly preferred in all examined organisms • except Caenorhabditis elegans, where the large number of 7TM receptors increases the counts for Nout-Cin topologies.

Aspects of model • Specialized modeling of various regions • Helix caps • Middle of helix • Regions close to membrane • Globular domains (all modeled identically) • TM amino acid stats derived from known TM domains
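
The idea of a grammar that parses a sequence into inside loop, membrane helix and outside loop can be illustrated with a toy hidden Markov model decoded by the Viterbi algorithm. This is not TMHMM: the four states, the three residue classes and every probability below are invented for illustration, and the real model's helix caps, helix core and length modeling are omitted.

```python
import math

# Toy grammar: i -> helix -> o -> helix -> i ... (two helix states keep direction).
STATES = ["i", "Mio", "o", "Moi"]
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "Mio": 0.1},
    "Mio": {"Mio": 0.9, "o": 0.1},
    "o": {"o": 0.9, "Moi": 0.1},
    "Moi": {"Moi": 0.9, "i": 0.1},
}
# Invented emission probabilities over three residue classes.
EMIT = {
    "i": {"hydrophobic": 0.25, "positive": 0.25, "other": 0.50},
    "o": {"hydrophobic": 0.25, "positive": 0.05, "other": 0.70},
    "M": {"hydrophobic": 0.80, "positive": 0.02, "other": 0.18},
}


def residue_class(aa):
    if aa in "AILMFVWC":
        return "hydrophobic"
    if aa in "KR":
        return "positive"
    return "other"


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def emit(state, aa):
    key = "M" if state.startswith("M") else state
    return log(EMIT[key][residue_class(aa)])


def viterbi(seq):
    """Most probable state path, one label per residue: 'i', 'M' or 'o'."""
    if not seq:
        return ""
    score = [{s: log(START.get(s, 0.0)) + emit(s, seq[0]) for s in STATES}]
    back = []
    for aa in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: score[-1][p] + log(TRANS[p].get(s, 0.0)))
            row[s] = score[-1][prev] + log(TRANS[prev].get(s, 0.0)) + emit(s, aa)
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):  # trace back from the last residue
        state = ptr[state]
        path.append(state)
    return "".join("M" if s.startswith("M") else s for s in reversed(path))


# Synthetic sequence: charged tail, 20 leucines, polar tail.
print(viterbi("KRKRSD" + "L" * 20 + "SDGSDG"))
```

Because the positively charged tail is more probable under the "i" state, the decoded path also assigns a topology, which is how a grammar of this kind combines helix location with the positive-inside rule.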

Training data

Signal peptide prediction

Chloroplast transit peptides are hard to detect

A plant GPCR? Arabidopsis thaliana GCR2

“only one Arabidopsis putative GPCR protein (GCR1) has been characterized in plants (17–20), and no ligand has been defined for any plant GPCR”
