Beta-Barrel Discrimination Babak Alipanahi Prof. Ming Li CS882-Fall 2006
Outline: • A tale of two barrels • Membrane proteins • A review of β-barrels • Folding Mechanism • Seven families, some examples • Literature Review • What I have done • What will I do…
Two Kinds of Closed Barrels • There are two kinds of closed barrels • α/β barrels (Globular) • β barrels (Transmembrane) • The two types are similar in that, in both (Branden 99): • Similar structures can have very different a.a. sequences • The function of the protein is determined by the loops and not by the strands or helices (α/β barrels only). (The strands and helices are needed only to form the barrel, and the β strands and α helices are usually structurally equivalent) • They differ in that • In α/β barrels, the β strands are parallel and are connected by α helices, while in β barrels they are anti-parallel and are (usually) connected by simple loops • They also differ in a very fundamental way (this is in fact the key difference between all transmembrane and globular proteins). I will come back to this later…
An example of α/β Barrel (Branden 99) • The picture on the right shows the β core of glycolate oxidase, an enzyme containing an eight-stranded α/β barrel. Note that all β strands are parallel • The eight-stranded α/β barrel is one of the largest and most regular of all domain structures • At least 200 a.a. are required to form this structure • Most of them are enzymes with completely different a.a. sequences and diverse functions
An example of α/β Barrel (cntd.) • As can be seen, the parallel β strands are connected to each other by α helices • The eight β strands enclose a tightly packed hydrophobic core formed entirely by side chains from the β strands • The active site in all α/β barrels is formed by loops at one end of the barrel
β-Barrels • β-barrel proteins are found in the outer membranes of Gram-negative bacteria, mitochondria and chloroplasts (Schulz 00) • It has been hypothesized that most integral outer membrane proteins of mitochondria and chloroplasts are β-barrels, a relic of these organelles' evolutionary history as symbiotic intracellular Gram-negative bacteria (Wimley 03) • The abundant mitochondrial voltage-dependent anion channel (VDAC) has long been thought to be a β-barrel (Wimley 03)
Membrane Proteins • The hallmark of Gram-negative bacteria is their cell envelope, which has two membranes (inner and outer, called IM and OM respectively) separated by the periplasm (Ruiz 05) Image from Nature
Membrane Proteins • The structure, function, and composition of the IM and OM are dramatically different. The IM is in direct contact with the cytoplasm and periplasm, while the OM is in contact with the extracellular environment (Ruiz 05) Image from Nature
Analysis of E. coli cell envelope: IM (Ruiz 05) • The IM, which is the major permeability barrier between the cell’s inside and outside (Tamm 04), is a bilayer composed of phospholipids (PL) and proteins: • Integral IM proteins: span the IM with α-helical transmembrane domains • Lipoproteins: anchored to the outer leaflet of the IM by lipid modifications of the N-terminus • All of the membrane-bound biochemical processes that occur in eukaryotic cells, such as oxidative phosphorylation, lipid biosynthesis and protein translocation, occur in the IM (Ruiz 05). In other words, most membrane-associated metabolic functions are carried out in the IM (Tamm 04) • Note that the surface of integral IM proteins is less hydrophobic than that of OM proteins, and that they have a less complex folding mechanism (Tamm 04)
Analysis of E. coli cell envelope: Periplasm (Ruiz 05) • About 10% of the cell volume is occupied by the periplasm, which comprises soluble proteins and the peptidoglycan layer. The periplasm is an oxidizing environment and contains enzymes that catalyse the formation of disulphide bonds • The periplasm is ATP free, so all its activities take place in the absence of an obvious energy source • Peptidoglycan functions as an extracytoplasmic cytoskeleton and prevents the cell from lysing in dilute environments
Analysis of E. coli cell envelope: OM (Ruiz 05) • The OM is unique in that, unlike most other eukaryotic and prokaryotic membranes, it is asymmetric. The upper and lower leaflets are composed mainly of lipopolysaccharide (LPS) and PL respectively • The OM functions as a selective barrier and inhibits the entry of toxic and unwanted molecules, which is crucial for bacterial survival in many (possibly hostile) environments. For example, E. coli is resistant to bile salts, which helps the bacteria live in the intestines • There are two kinds of proteins in the OM: • Lipoproteins: 90% of lipoproteins are in the OM • β-barrels: These are called OM proteins (OMPs). Some of them act as channels. Since the membranes are impermeable to hydrophilic solutes, these channels are necessary for nutrient intake and excretion of toxic waste products (we will revisit the diverse functions of OMPs later)
Barrel Construction Principles (Schulz 00) • “The number of β strands is even and both N and C terminal are at the periplasmic barrel end” • “The β -strand tilt is always around 45° and corresponds to the common β-sheet twist. Only one of the two possible tilt directions is assumed, the other one is an energetically disfavored mirror image” • “All β strands are anti-parallel and connected locally to their next neighbors along the chain, resulting in a maximum neighborhood correlation” OmpX, a defense protein which is a toxin binder Image from Schulz 00
Barrel Construction Principles (cntd.) • “The shear number of an n-stranded barrel is positive and around n+2, in agreement with the observed tilt” • “The strand connections at the periplasmic barrel end are short turns of a couple of residues named T1, T2 and so on” • “At the external barrel end, the strand connections are usually long loops named L1, L2 and so on” Images from (Waldispühl 06) with complete modifications
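The link between shear number and tilt quoted above can be checked numerically. The sketch below assumes the commonly used geometric relation tan(α) = S·a / (n·b) for a regular barrel, with a = 3.3 Å (residue spacing along a strand) and b = 4.4 Å (distance between neighbouring strands). Neither the relation nor the two constants appear in the slides, so treat them as assumptions; the function name is illustrative only.

```python
import math

# Assumed β-sheet geometry (not given in the slides).
RISE_ALONG_STRAND = 3.3  # Å between consecutive residues along one strand
STRAND_SPACING = 4.4     # Å between neighbouring strands


def tilt_from_shear(n_strands: int, shear: int) -> float:
    """Strand tilt in degrees relative to the barrel axis.

    Uses the assumed relation tan(alpha) = S * a / (n * b).
    """
    ratio = (shear * RISE_ALONG_STRAND) / (n_strands * STRAND_SPACING)
    return math.degrees(math.atan(ratio))


if __name__ == "__main__":
    # Shear number "around n + 2", as quoted from (Schulz 00).
    for n in (8, 12, 16, 22):
        print(f"n={n:2d}  S={n + 2:2d}  tilt={tilt_from_shear(n, n + 2):.1f} deg")
```

Under these assumptions, S = n + 2 gives tilts of roughly 40° for the barrel sizes listed, which is in the neighbourhood of the "around 45°" figure quoted earlier.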
Barrel Construction Principles (cntd.) • “The β -barrel surface contacting the nonpolar membrane interior consists of aliphatic side chains forming a nonpolar ribbon with a width of about” 27 Å (Tamm 04) • “The aliphatic ribbon is lined by two girdles of aromatic side chains, which have intermediate polarity and contact the two nonpolar–polar interface layers of the membrane” • “The sequence variability of all parts of the β barrel during evolution is high when compared with soluble proteins” • “The external loops show exceptionally high sequence variability and they are usually mobile”. “The loops exhibit the largest sequence variability and thus contain the most of functional characteristics of each protein…” (Tamm 04) Image from (Wimley 02) with complete modification
β-Barrels folding mechanism (Tamm 04) • Folding and membrane insertion of OmpA • The unfolded state U hydrophobically collapses into a water-soluble intermediate state IW • This intermediate binds to the membrane and forms intermediate state IM1 • IM1 proceeds to intermediate state IM2, the molten disk. Parts of the β-strands are formed in this state • Next, the four Trps on the four β-hairpins move to the center of the bilayer (intermediate state IM3) • IM3 is more globular and is called the molten globule, but it has not yet reached its native tertiary structure • Folding and membrane insertion are coupled processes • The membrane interface is involved in the folding Blue balls are tryptophan (Trp) residues in the above image. The technique used to identify these steps is time-resolved distance determination by Trp fluorescence quenching (TDFQ) Image from (Tamm 04)
Assisted folding of β-Barrels (Tamm 04) • As noted earlier, the periplasm is ATP free, so mechanisms have evolved that let OMPs insert spontaneously into the OM after being translocated to the periplasm • Two periplasmic proteins have been proposed to help the β-barrel folding process: • Skp is a soluble protein that can also bind to the phospholipid bilayer. Three or four Skps bind to a newly synthesized, unfolded OMP immediately after it is translocated through the IM; they act as a passive chaperone (remember that the periplasm is ATP free) and prevent aggregation. However, Skp does not assist the folding process itself • SurA is a periplasmic peptidyl-prolyl isomerase that has been shown to assist the folding of OMPs. Experiments show that “Sequences containing aromatic-random-aromatic motifs bind particularly to SurA”. It has a long 50 Å docking cleft for accommodating unfolded peptide chains
Features of OMPs • About 2–3% of the genes in Gram-negative bacterial genomes encode β-barrels. In the E. coli genome, 60 proteins are annotated as known or probable OMPs (Wimley 03) • The average β-strand length is 11 a.a. residues in trimeric porins and 13–14 residues in monomeric β-barrels (Tamm 04) • Given the 40–45° tilt of the β-strands from the membrane normal, the average rise per residue is 3.8 Å × sin(45°) ≈ 2.7 Å (Tamm 04) • Most OMPs lack cysteines, so they cannot form disulphide bonds
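The rise-per-residue arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only numbers quoted in the slides (3.8 Å, the 40–45° tilt, and the 27 Å nonpolar ribbon width cited from Tamm 04); the function and constant names are hypothetical.

```python
import math

RESIDUE_SPACING = 3.8  # Å, the per-residue distance used on the slide
RIBBON_WIDTH = 27.0    # Å, nonpolar ribbon width quoted from (Tamm 04)


def rise_per_residue(tilt_deg: float) -> float:
    """Rise per residue, following the slide's 3.8 * sin(tilt) expression."""
    return RESIDUE_SPACING * math.sin(math.radians(tilt_deg))


if __name__ == "__main__":
    for tilt in (40.0, 45.0):
        rise = rise_per_residue(tilt)
        print(f"tilt={tilt:.0f} deg  rise={rise:.2f} A  "
              f"residues to span ribbon={RIBBON_WIDTH / rise:.1f}")
```

At 45° the rise is about 2.7 Å, so roughly ten residues are needed to cross the 27 Å ribbon, which is consistent with the average strand lengths of 11 to 14 residues listed above.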
Features of OMPs (cntd.) • The interior-facing TM β-strands of β-barrels are rich in small and polar a.a. such as glycine (Gly), threonine (Thr), serine (Ser), asparagine (Asn) and glutamine (Gln) (Tamm 04), (Wimley 03) • 40% of lipid-exposed residues are aromatic (Wimley 03); the aromatic residues tyrosine (Tyr) and tryptophan (Trp) are also abundant in loop regions (Tamm 04) Images from (Wimley 03)
Seven families of OMPs (based on Tamm 04) • General Porins: porins typically control the diffusion of small metabolites like sugars, ions, and amino acids • Passive Transporters: these proteins are selective passive transporters of maltose, sucrose and fatty acids • Active Transporters of Siderophores and Vitamin B12: they receive their energy through interaction with IM proteins • Enzymes: proteases and phospholipases • Defensive Proteins: fight hostile molecules • Structural Proteins: membrane anchors • Toxins (non-constitutive): kill the target cell
Some examples of OMPs • Name: OmpA • β-Strands: 8 • Oligomeric State: monomer • Organism: E. coli • Residues: 171 • Function: Structural protein • Features: • The residues inside the barrel are so tightly packed that the lumen is filled with polar side chains, which interact with each other through hydrogen bonds and electrostatic interactions. Groups of water molecules can also be found in the lumen • It links the outer membrane to the periplasmic peptidoglycan; in other words, it is a kind of membrane anchor • “Extensive mutagenesis studies show that OmpA is quite robust against many mutations especially in the loop, turn and bilayer facing area.” A surprising fact is that the transmembrane domain of OmpA “can even be circularly permutated without impairing its assembly and functions” (Tamm 04)
Some examples of OMPs • Name: FepA • β-Strands: 22 • Oligomeric State: monomer • Organism: E. coli • Residues: 724 • Function: iron transporter (active transporter) • Features: FepA, a TonB-dependent active Fe-siderophore transporter, uses metabolic energy through interaction with IM proteins. The C-terminus forms the β-barrel domain, while the N-terminus forms a hatch domain that plugs the barrel and regulates iron transport (Tamm 04), (Wimley 03)
MspA: a very long porin • Name: MspA • β-Strands: 8x2 • Oligomeric State: octamer • Organism: M. smegmatis • Residues: 184 • Function: mycobacterial porin • Features: It has two sequential β-barrels of different diameters. The narrow barrel has a hydrophobic surface 37 Å long, because the mycobacterial outer membrane contains very long mycolic fatty acids rather than LPS. Note that some members of the mycobacteria cause tuberculosis (Tamm 04) Bottom image from (Tamm 04)
TolC: involved in multi-drug resistance • Name: TolC • β-Strands: 3x4 • Oligomeric State: trimer • Organism: E. coli • Residues: 428 • Function: active export channel • Features: TolC is a small-molecule transporter that is involved in the multi-drug resistance of bacteria (it facilitates drug efflux (Bigelow 04)). It derives its energy from its interactions with IM proteins. The lumen of the β-barrel is connected to the lumen of an α-helical bundle that extends through the periplasm to the IM (i.e. a direct path to the cytoplasm) (Wimley 03), (Tamm 04)
OmpLA: an enzyme • Name: OmpLA • β-Strands: 12 • Oligomeric State: dimer • Organism: E. coli • Residues: 269 • Function: enzyme • Features: Phospholipase OmpLA is only active in the dimer form. The active site is at the outer edge of the barrels, at the interface between the two barrels. Its role is possibly to hydrolyze PL that have migrated to the extracellular leaflet of the OM, where they normally should not be (Tamm 04), (Wimley 03)
α-Hemolysin: a deadly toxin • Name: α-Hemolysin • β-Strands: 7x2 • Oligomeric State: heptamer • Organism: S. aureus • Residues: 293 • Function: toxin • Features: This toxin is secreted as a monomeric protein that ultimately forms a 14-stranded β-barrel, with each monomer contributing a β-hairpin to the heptamer. After insertion into the victim cell’s membrane, the heptamer forms an ungated pore that leads to osmotic cytolysis. Note how clean the pore is (Wimley 03), (Tamm 04)
β-barrel discrimination: Literature review • The research done on β-barrels can be categorized into two major groups (both rely only on the a.a. sequence): • Secondary structure (hereinafter S.S.) prediction • Discrimination of β-barrels from globular and IM proteins • Most methods for secondary structure prediction also provide a companion algorithm for discrimination, because: • Unlike globular (water-soluble) proteins, which have a hydrophobic core and a hydrophilic surface, β-barrels have a hydrophilic core (the interior wall of the lumen) and a hydrophobic surface (lipid exposed) • Two very similar β-barrels can have very different sequences that show hardly any sign of homology • The accuracy of discriminating α-helical TM proteins from non-α-helical TM proteins is very high (99% accuracy is reported) because of their unique features (Hirokawa 98)
Some definitions • After an a.a. sequence is fed into the discrimination algorithm, the algorithm determines whether it is an OMP (positive) or not (negative). A positive answer can be true (true positive, TP) or false (false positive, FP). Likewise, a negative answer can be true (true negative, TN) or false (false negative, FN). So, we define: • TP: # of correctly classified OMPs • TN: # of correctly classified non-OMPs • FP: # of non-OMPs classified as OMP • FN: # of OMPs classified as non-OMP
Some definitions (cntd) • Sensitivity (SEN = TP / (TP + FN)): the fraction of OMPs correctly identified by the algorithm. This shows the ability to correctly predict OMPs (Park 05) • Specificity (SPC = TN / (TN + FP)): the fraction of non-OMPs correctly rejected by the algorithm. This shows the ability to reject non-OMPs (Park 05) • A dumb algorithm that declares every input to be an OMP will have a sensitivity of 100% and a specificity of 0%! • Some people really cheat! We will see…
Some definitions (cntd) • Overall accuracy (ACC = (TP + TN) / (TP + TN + FP + FN)) is very useful for judging overall performance, but it is not enough. Our dumb algorithm will have 50% accuracy! (assuming the numbers of OMPs and non-OMPs are the same)
Some definitions (cntd) • Matthews correlation coefficient (MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))) is a very powerful measure of performance. It is zero for completely random algorithms (our dumb algorithm’s MCC is taken as zero, since its denominator vanishes) and a perfect algorithm’s MCC is one (Park 05)
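The four measures defined on the last few slides can be computed directly from the TP, TN, FP and FN counts. The sketch below assumes the standard confusion-matrix forms of SEN, SPC, ACC and MCC, and returns 0 for the MCC when its denominator is zero (a common convention, adopted here as an assumption). The counts in the example are made up for illustration.

```python
import math


def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    spc = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the table is empty;
    # by convention we report 0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "SPC": spc, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # "Dumb" classifier: calls everything an OMP (100 OMPs, 100 non-OMPs).
    print("dumb   ", metrics(tp=100, tn=0, fp=100, fn=0))
    # Perfect classifier on the same balanced set.
    print("perfect", metrics(tp=100, tn=100, fp=0, fn=0))
```

The dumb classifier scores SEN = 1, SPC = 0, ACC = 0.5 and MCC = 0, which shows why no single measure other than the MCC is enough on its own.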
Prediction approaches (1) • Profile-based HMMs: the HMM is trained on sequence profiles computed from a multiple sequence alignment. Two major studies are • (Martelli 02): A very successful and highly cited study. Every residue can be in either a loop or a β-strand state. Discrimination is done by calculating the posterior probability of the sequence under the given model. S.S. prediction accuracy is 84%, discrimination accuracy (ACC) is 84% and false positive rate is 10% (SEN=90%) • (Bigelow 04): The algorithm, PROFtmb, is mainly based on (Martelli 02) with some modifications, such as having four states for each residue: up-strand, down-strand, periplasmic-loop and outer-loop. S.S. prediction accuracy is 86%, SPC=100% and SEN=45%
Prediction approaches (2) • (Zhai 02): in β-barrel finder (BBF), hydropathy and amphipathicity values are used for discrimination. A sliding-window of size seven residues is used to calculate hydropathy and amphipathicity values for all a.a. in the protein sequence. Since the resulting function is noisy, it is averaged over multiple aligned sequences. They claim that every TM β-strand corresponds in position to a peak of hydropathy and one of amphipathicity
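A sliding-window profile of this kind can be sketched as follows. This is not the BBF implementation: it assumes the Kyte-Doolittle hydropathy scale and measures β-strand amphipathicity as the magnitude of the alternating-sign sum of hydropathy values within the window (a period-2 pattern). Both choices, and all names, are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy scale (assumed; the slides do not name a scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def window_profiles(seq: str, window: int = 7):
    """Per-residue hydropathy and β-amphipathicity over a centred window."""
    half = window // 2
    hydropathy, amphipathicity = [], []
    for center in range(len(seq)):
        lo = max(0, center - half)
        hi = min(len(seq), center + half + 1)
        values = [KD.get(aa, 0.0) for aa in seq[lo:hi]]
        hydropathy.append(sum(values) / len(values))
        # Alternating signs pick out the in/out pattern of a β-strand.
        alternating = sum(v if k % 2 == 0 else -v for k, v in enumerate(values))
        amphipathicity.append(abs(alternating) / len(values))
    return hydropathy, amphipathicity


if __name__ == "__main__":
    hyd, amp = window_profiles("VSVSVSVSVSV")
    for aa, h, a in zip("VSVSVSVSVSV", hyd, amp):
        print(f"{aa}  hydropathy={h:5.2f}  amphipathicity={a:5.2f}")
```

In a fuller version, the two profiles would be averaged over aligned homologous sequences to reduce noise, as described above, before looking for coinciding peaks.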
Prediction approaches (3) • (Waldispühl 06): This method uses a pairwise inter-strand residue statistical potential derived from globular proteins to predict the super-secondary structure of OMPs. The transFOLD algorithm employs a generalized HMM (multi-tape S-attribute grammar (MTSAG)) to describe potential β-barrel structures and then computes the minimum free energy by dynamic programming • They claim that, unlike other approaches, they consider long-range interactions between residues • S.S. prediction accuracy is 79% but the rate of correctly predicted structures is 93% • For OMP discrimination, they use four parameters: sequence length, folding pseudo-energy in the water-filled and non-water-filled lumen models, and overall hydrophobicity. Discrimination is performed by an SVM. SEN=88%, SPC=63% and ACC=75%
Prediction approaches (4) • Neural Network based (Jacoboni 01): This work has been cited many times and is highly appreciated as one of the first reasonably good prediction methods • A feed-forward neural network is implemented and trained using the error back-propagation algorithm for discrimination of β-strands from extra membrane regions (i.e. a two state prediction, β-strand or non-β-strand) • Evolutionary information is given as input in form of sequence profile after multiple-sequence alignments • S.S. prediction accuracy is nearly 78%
Methods based on peptide and dipeptide composition • In these methods, the abundance of single a.a. or a.a. pairs is used for discrimination of OMPs • It has been shown that a.a. and a.a. pair composition differs appreciably between OMPs and non-OMPs • Methods that use a.a. composition as classification features perform much better than methods that use other features, such as hydrophobicity or the posterior probability in HMM-based methods • With these features at hand, several classification techniques have been applied, such as k-nearest neighbors (k-NN), SVM, simple a.a. weighting and neural networks
Methods based on peptide and dipeptide composition (cntd) • a.a. abundance in lipid-exposed and barrel-interior positions (Wimley 02): this study made the clever observation that the relative abundance of a.a. (relative to the whole genome) is very different in barrel-interior and lipid-exposed positions • If we denote a lipid-exposed a.a. by E and a barrel-interior a.a. by I, a β-strand will have this pattern: • …EIEIEIEIEIE… Images from (Wimley 03)
(Wimley 02) (cntd.) • Under the assumption “A(j+i) is I”, the a.a. at position j+i in the sequence is taken to face the barrel interior, so it is scored using the barrel-interior relative abundance table; under the opposite assumption it is scored using the lipid-exposed table • The β-strand length is assumed to be 10, which is not very realistic • No performance measure is given
Methods based on peptide and dipeptide composition (cntd) • k-NN: (Garrow 05) in TMBhunt, the features are the comp(i) values. For a new query, its k nearest neighbors are found (using the Euclidean distance) and its class is decided by majority vote. Performance is improved by including differentially weighted a.a. and evolutionary information, and by calibrating the scoring system. SEN=91%, SPC=93.8% and ACC=92.5% ((Park 05) questioned these results, reporting 89.2%) • sum-of-deviations: (Gromiha 04) in this study, the average comp(i) over all proteins of each class (OMP or non-OMP) is computed. For a new query, the comp(i) values are computed, and the absolute deviations of comp(i) from each class average are summed. The query is assigned to the class from which it has the smaller total deviation (they could have used the Euclidean distance, which is more meaningful). SPC=80%, SEN=84%
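The sum-of-deviations rule can be sketched in a few lines. The code assumes that comp(i) is the percentage of residue type i in the sequence, which the slides do not spell out, and the short training sequences are made-up toy strings rather than real OMPs or globular proteins. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def comp(seq: str) -> list:
    """Amino-acid composition in percent, one value per residue type."""
    return [100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def class_average(seqs: list) -> list:
    """Average composition over all sequences of one class."""
    comps = [comp(s) for s in seqs]
    return [sum(col) / len(comps) for col in zip(*comps)]


def total_deviation(query_comp: list, avg: list) -> float:
    """Sum of absolute deviations between a query and a class average."""
    return sum(abs(q - a) for q, a in zip(query_comp, avg))


def classify(query: str, omp_avg: list, non_avg: list) -> str:
    """Assign the query to the class with the smaller total deviation."""
    q = comp(query)
    if total_deviation(q, omp_avg) < total_deviation(q, non_avg):
        return "OMP"
    return "non-OMP"


if __name__ == "__main__":
    # Toy training sets (hypothetical sequences, for illustration only).
    omp_avg = class_average(["GYTNSGYQTAGY", "TYGNSQGYTWGA"])
    non_avg = class_average(["LEKLAELKEALL", "AEELLKKLAEEL"])
    print(classify("GYSNTGYQGA", omp_avg, non_avg))
```

Replacing `total_deviation` with a Euclidean distance, and the class averages with the k closest training proteins, would give the k-NN variant described in the first bullet.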
Methods based on peptide and dipeptide composition (cntd) • sum-of-deviations: (Gromiha 05-a) this study is virtually the same as the previous one, but the new algorithm works only with averaged dipeptide abundance values (dipep(i,j)). For a new query, the dipep(i,j) values are computed (400 values) and then weighted according to a pre-calculated table of dipeptide abundance differences between OMPs and non-OMPs (globular proteins only). Finally, the decision is made based on the sign of the sum of the weighted terms. SEN=94.7%, SPC=79.2% and ACC=84.8%. The major problem with this method is that the training data were not filtered for homologous sequences, which gives overestimated results • Neural-Network: (Gromiha 05-b) the discrimination method is exactly the same as in (Gromiha 04), but a neural network is introduced for S.S. prediction, with a prediction accuracy of 73.2%
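The dipeptide variant can be illustrated in the same spirit. The sketch assumes that dipep(i,j) is the percentage of adjacent residue pairs of type (i,j), and that the weight of each pair is simply the difference between the OMP and globular class averages; the exact weighting used in (Gromiha 05-a) is not given in the slides, so this is an assumption, and the sequences are toy examples.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipep(seq: str) -> dict:
    """Dipeptide composition in percent over all adjacent residue pairs."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = len(seq) - 1
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
    return {p: 100.0 * c / total for p, c in counts.items()}


def average_dipep(seqs: list) -> dict:
    """Average dipeptide composition of one class."""
    tables = [dipep(s) for s in seqs]
    return {p: sum(t[p] for t in tables) / len(tables) for p in DIPEPTIDES}


def classify(query: str, omp_avg: dict, glb_avg: dict) -> str:
    """Decide by the sign of the difference-weighted sum (assumed weighting)."""
    q = dipep(query)
    score = sum(q[p] * (omp_avg[p] - glb_avg[p]) for p in DIPEPTIDES)
    return "OMP" if score > 0 else "non-OMP"


if __name__ == "__main__":
    omp_avg = average_dipep(["GYGYGYTNTN"])  # toy "OMP" class
    glb_avg = average_dipep(["LEKLEKLAEL"])  # toy "globular" class
    print(classify("GYGYTN", omp_avg, glb_avg))
```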
Methods based on peptide and dipeptide composition (cntd) • SVM: (Park 05) (note: Gromiha is the second author!) the sequences used for training are filtered by an all-to-all sequence similarity check using CD-HIT (Li 01), which produces a non-redundant protein database. They used an SVM with a radial basis function (RBF) kernel for discrimination. This is actually the first organized study with well-defined terms and a clear presentation of results • They use composition values (xC means that x comp(i) values have been used for discrimination) and dipeptide values (yD means that y dipep(i,j) values have been used). x and y are found using backward and forward feature selection algorithms • I have defined some notation to make the results easier to present • OMP: outer membrane proteins • TMH: transmembrane α-helical proteins • GLB: globular proteins • NOM: non-outer membrane proteins • So, OMP-TMH classification means discrimination between OMP and TMH proteins
Results of SVM-peptide composition method • The results are better than those of any previous method but are far from the accuracy rates achieved for the TMH set (99%) • It is interesting that the discrimination rate between OMP and NOM (which is TMH+GLB) is lower than that of either OMP-TMH or OMP-GLB. Also, OMP-TMH has the highest discrimination rate
What I have done: 1-Data Set • The data set I have used is the same as in the study by (Park 05), which has been shown to be one of the most comprehensive and challenging data sets. It contains • 208 non-homologous OMPs • 206 non-homologous TMHs • 673 non-homologous GLBs that consist of • 155 all α proteins • 156 all β proteins • 184 α+β proteins • 179 α/β proteins • To find the optimal features, I first started with a.a. composition ratios (20C), then added sequence length (L), and finally found that the β-strand score (B) (as defined in (Wimley 02)) can enhance the performance
β-strand quality factor • I have assumed that mean β-strand length is 12 because it is the best choice for covering all β-barrels (including newly discovered ones) • β-factor is calculated (and is called B feature) by summing squared values of β-strand quality factor for all residues
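One way to picture the B feature is sketched below. The published relative-abundance tables are not reproduced in the slides, so the sketch replaces them with simple indicator sets taken from the qualitative statements earlier in the deck (interior positions favour Gly, Thr, Ser, Asn, Gln; lipid-exposed positions favour aromatic and aliphatic residues). The scoring scheme, the per-window rather than per-residue quality, and all names are assumptions for illustration, not the method actually used.

```python
# Placeholder preference sets (NOT the published abundance tables).
INTERIOR = set("GTSNQ")   # small/polar residues, interior-facing
EXPOSED = set("FWYLVIA")  # aromatic and aliphatic residues, lipid-facing


def window_score(window: str, first_is_interior: bool) -> float:
    """Fraction of positions that fit an alternating I/E pattern."""
    hits = 0
    for k, aa in enumerate(window):
        interior_position = (k % 2 == 0) == first_is_interior
        favoured = INTERIOR if interior_position else EXPOSED
        hits += aa in favoured
    return hits / len(window)


def strand_quality(seq: str, length: int = 12) -> list:
    """Best alternating-pattern score for every window of the given length."""
    return [
        max(window_score(seq[j:j + length], True),
            window_score(seq[j:j + length], False))
        for j in range(len(seq) - length + 1)
    ]


def b_feature(seq: str, length: int = 12) -> float:
    """B feature: sum of squared β-strand quality values."""
    return sum(q * q for q in strand_quality(seq, length))


if __name__ == "__main__":
    print(b_feature("TYSVNLGFTWQA"))  # alternating polar/nonpolar toy strand
    print(b_feature("LEKLAELKEALL"))  # helix-like toy sequence
```

Squaring the quality values before summing rewards a few strongly strand-like windows more than many weakly strand-like ones.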
What I have done: 2-Feature Selection • A widely used, scale-insensitive measure for linear classification, which can also give some information for non-linear classification, is the Fisher Discrimination Ratio (FDR), defined as (Park 05):
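For a single feature and two classes, the FDR is commonly written as (μ1 − μ2)² / (σ1² + σ2²), where μ and σ² are the per-class mean and variance of the feature. The sketch below assumes that form, which may differ in detail from the one used in (Park 05), and uses made-up feature values.

```python
from statistics import mean, pvariance


def fdr(class_a: list, class_b: list) -> float:
    """Fisher Discrimination Ratio of one feature for two classes.

    Assumed form: (mean_a - mean_b)^2 / (var_a + var_b).
    """
    denom = pvariance(class_a) + pvariance(class_b)
    if denom == 0:
        return float("inf")  # classes are constant; separation is trivial
    return (mean(class_a) - mean(class_b)) ** 2 / denom


if __name__ == "__main__":
    omp_values = [1.0, 2.0, 3.0]  # hypothetical feature values for OMPs
    non_values = [4.0, 5.0, 6.0]  # hypothetical feature values for non-OMPs
    print(fdr(omp_values, non_values))
    # Rescaling the feature leaves the ratio unchanged.
    print(fdr([10 * v for v in omp_values], [10 * v for v in non_values]))
```

Because both the numerator and the denominator scale with the square of the feature, the ratio is unaffected by the units of the feature, which is what makes it convenient for ranking features of very different magnitudes such as 20C, L and B.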
3-Algorithms used for prediction • I have used several algorithms for classification, including: • Support Vector Machine (SVM): SVM with a radial basis function (RBF) kernel • Locally Linear Neurofuzzy Model (LLNM): LLNM with the locally linear model tree (Lolimot) model construction method • Neural Network: multi-layer perceptron (MLP) feed-forward network with the error back-propagation learning algorithm • The prediction accuracy is nearly the same for all algorithms, so none has a clear advantage over the others; however, since SVM is much faster, I have chosen it • A real danger when using powerful algorithms is overfitting, which destroys the generalization capability. When the training data set is small, overfitting is a fatal risk • To guard against overfitting, n-fold cross validation is usually used, especially when the training data set is small • The data set is divided into n subsets; at each step the algorithm is trained on n-1 subsets and validated on the remaining subset. This process is repeated for all n subsets and the performance is averaged over all n experiments
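The n-fold procedure described in the last bullet can be sketched without any external library. The classifier here is a deliberately simple one-feature threshold rule standing in for the SVM, and the data are toy values; both are assumptions made only to keep the example self-contained.

```python
def fold_indices(n_samples: int, n_folds: int) -> list:
    """Split sample indices into n_folds interleaved subsets."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    return [list(range(i, n_samples, n_folds)) for i in range(n_folds)]


def cross_validate(samples, labels, n_folds, train, predict) -> float:
    """Average held-out accuracy over all n folds."""
    accuracies = []
    for held_out in fold_indices(len(samples), n_folds):
        held = set(held_out)
        train_x = [x for i, x in enumerate(samples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train(train_x, train_y)
        correct = sum(predict(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n_folds


def train_threshold(xs, ys) -> float:
    """Stand-in classifier: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold: float, x: float) -> int:
    return int(x > threshold)


if __name__ == "__main__":
    xs = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]  # toy feature values
    ys = [0, 0, 0, 0, 1, 1, 1, 1]                  # 0 = non-OMP, 1 = OMP
    print(cross_validate(xs, ys, 4, train_threshold, predict_threshold))
```

Because every sample is held out exactly once, the averaged accuracy reflects performance on data the model did not see during training, which is what exposes overfitting.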