Computational analysis of membrane proteins implicated in metal transport in Arabidopsis thaliana
Stefanie Hartmann, Max Planck Institute for Molecular Plant Physiology
Supervisors: Joachim Selbig, Ute Krämer
Alignment shown on the title slide:
CIAVVLCLVFMSVEVVGGIKANSLAILTDAAHLLSDVAAFAISLFSLWAAGWEATPRQTYGFFRIEILGALVSIQLIWLLT
ALFLLINTAYMVVEFVAGFMSNSLGLISDACHMLFDCAALAIGLYASYISRLPANHQYNYGRGRFEVLSGYVNAVFLVLVG
CFVVVLCLLFMSIEVVCGIKANSLAILADAAHLLTDVGAFAISMLSLWASSWEANPRQSYGFFRIEILGTLVSIQLIWLLT
LIAVLLCAIFIVVEVVGGIKANSLAILTDAAHLLSDVAAFAISLFSLWASGWKANPQQSYGFFRIEILGALVSIQMIWLLA
---IFLYLIVMSVQIVGGFKANSLAVMTDAAHLLSDVAGLCVSLLAIKVSSWEANPRNSFGFKRLEVLAAFLSVQLIWLVS
12 membrane proteins involved in metal transport in Arabidopsis
Metal transporters are of great importance because…
• …they provide an adequate supply of essential trace metals
• …they prevent an excess of these potentially toxic ions
In silico analyses may help design further experiments on
• basic research on metal homeostasis
• development of new ways of phytoremediation
Cation Diffusion Facilitator (CDF) proteins, also referred to as cation efflux (CE) proteins
• occur in archaea, bacteria, eukaryotes
• are involved in transporting heavy metals (Co²⁺, Cd²⁺, Zn²⁺, Ni²⁺)
• the CDF family of proteins had 13 members in 1997
• the CE Pfam family had 348 members in July 2003 and 426 in January 2004
• CDF signature sequence: S-X-(ASG)-(LIVMT)2-(SAT)-(DA)-(SGAL)-(LIVFYA)-(HDN)-X3-D-X2-(AS)
The Arabidopsis thaliana CDF protein family: match to the CDF signature sequence
Protein  Locus      Signature region   Match to CDF signature
CDF1     At2g46800  SLAILTDAAHLLSDVAA  exact match
CDF2     At3g61940  SLAILADAAHLLTDVGA  exact match
CDF3     At3g58810  SLAILTDAAHLLSDVAA  exact match
CDF4     At2g29410  SLAVMTDAAHLLSDVAG  1 mismatch
CDF5     At2g04620  SLGLISDACHMLFDCAA  1 mismatch
CDF6     At2g47830  STAIIADAAHSVSDVVL  1 mismatch
CDF7     At2g39450  SLAIIASTLDSLLDLLS  2 mismatches
CDF8     At1g16310  SMAVIASTLDSLLDLLS  2 mismatches
CDF9     At1g79520  SMAVIASTLDSLLDLLS  2 mismatches
CDF10    At3g58060  SIAIAASTLDSLLDLMA  3 mismatches
CDF11    At3g12100  RVGLVSDAFHLTFGCGL  3 mismatches
CDF12    At1g51610  SHVIMAEVVHSVADFAN  4 mismatches
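The mismatch counts against the CDF signature can be reproduced with a short position-by-position check. The following is a minimal sketch, not the tool used in the talk: the names `SIGNATURE`, `SEGMENTS`, and `count_mismatches` are assumptions introduced for illustration, and the 17-residue segments are copied from the slide above.

```python
# CDF signature, one entry per position; None stands for X (any residue).
# S-X-(ASG)-(LIVMT)2-(SAT)-(DA)-(SGAL)-(LIVFYA)-(HDN)-X3-D-X2-(AS)
SIGNATURE = [
    "S", None, "ASG", "LIVMT", "LIVMT", "SAT", "DA", "SGAL",
    "LIVFYA", "HDN", None, None, None, "D", None, None, "AS",
]

# Signature regions of the 12 Arabidopsis proteins, as listed on the slide.
SEGMENTS = {
    "CDF1": "SLAILTDAAHLLSDVAA",
    "CDF2": "SLAILADAAHLLTDVGA",
    "CDF3": "SLAILTDAAHLLSDVAA",
    "CDF4": "SLAVMTDAAHLLSDVAG",
    "CDF5": "SLGLISDACHMLFDCAA",
    "CDF6": "STAIIADAAHSVSDVVL",
    "CDF7": "SLAIIASTLDSLLDLLS",
    "CDF8": "SMAVIASTLDSLLDLLS",
    "CDF9": "SMAVIASTLDSLLDLLS",
    "CDF10": "SIAIAASTLDSLLDLMA",
    "CDF11": "RVGLVSDAFHLTFGCGL",
    "CDF12": "SHVIMAEVVHSVADFAN",
}


def count_mismatches(segment):
    """Count positions where the segment violates the signature."""
    if len(segment) != len(SIGNATURE):
        raise ValueError("segment must have the same length as the signature")
    return sum(
        1
        for residue, allowed in zip(segment, SIGNATURE)
        if allowed is not None and residue not in allowed
    )


for name, segment in SEGMENTS.items():
    print(name, count_mismatches(segment))
```

Counting mismatches rather than using a strict pattern match keeps near-matches such as CDF4 to CDF12 visible instead of discarding them.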
Research questions
• Can all 12 proteins be classified as CDF proteins? i.e., are there predicted structural and functional similarities of these 12 Arabidopsis proteins? Methods: secondary structure prediction, inclusion in membrane and transporter databases, evaluation of common motifs, etc.
• What are the relationships of the 12 Arabidopsis proteins among each other and to other published sequences? Methods: intron/exon structure, phylogenetic reconstructions
• Is it possible to predict the 3D structure of these proteins? Method: fold recognition by threading
Sequence retrieval: four ambiguous sequences
• TIGR Arabidopsis thaliana database
• TAIR: The Arabidopsis Information Resource
• MIPS Arabidopsis thaliana genome database
• differences: different assignment of introns, use of alternative start codons
Sequence analysis: three additional ambiguous sequences
• SWALL, Pfam vs. TIGR/TAIR/MIPS
• differences: insertions and deletions, different amino acid sequence
Cloning and RT-PCR revealed correct sequences for six of the seven ambiguous CDFs
Hidden Markov models used for secondary structure prediction
(figure labels: cytoplasmic side, membrane, non-cytoplasmic side)
• states (loops, transmembrane domains, etc.) are defined
• states are connected in a biologically reasonable way (transitions)
• each state has a specific probability distribution over the 20 amino acids
• each transition has a specific transition probability
• amino acid probabilities and transition probabilities are learned
• models are first trained using a training set; the trained model is then used for the prediction
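How such a model turns a sequence into a topology prediction can be sketched with a three-state toy HMM and Viterbi decoding. This is an illustration only, under loud assumptions: all probabilities below are hand-set rather than learned from a training set, residues are collapsed into three classes instead of 20 amino acids, and the example sequence is hypothetical. It is not the architecture of any of the published predictors.

```python
import math

# States: "i" = cytoplasmic loop, "M" = membrane helix, "o" = non-cytoplasmic loop.
STATES = ("i", "M", "o")
HYDROPHOBIC = set("AILMFVWC")
POSITIVE = set("KR")

# Hand-set, illustrative probabilities (assumption: a real model learns these).
START = {"i": 0.45, "M": 0.10, "o": 0.45}
TRANS = {  # loops connect to each other only through the membrane
    "i": {"i": 0.90, "M": 0.10, "o": 0.00},
    "M": {"i": 0.05, "M": 0.90, "o": 0.05},
    "o": {"i": 0.00, "M": 0.10, "o": 0.90},
}
EMIT = {  # classes: "h" hydrophobic, "+" positively charged, "x" other
    "i": {"h": 0.25, "+": 0.25, "x": 0.50},
    "M": {"h": 0.85, "+": 0.02, "x": 0.13},
    "o": {"h": 0.25, "+": 0.05, "x": 0.70},
}


def log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def residue_class(aa):
    if aa in HYDROPHOBIC:
        return "h"
    if aa in POSITIVE:
        return "+"
    return "x"


def viterbi(sequence):
    """Return the most probable state path as a string over 'i', 'M', 'o'."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    classes = [residue_class(aa) for aa in sequence]
    score = {s: log(START[s]) + log(EMIT[s][classes[0]]) for s in STATES}
    backpointers = []
    for c in classes[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            new_score[s] = score[prev] + log(TRANS[prev][s]) + log(EMIT[s][c])
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Hypothetical sequence: charged start, hydrophobic stretch, polar end.
print(viterbi("KRKRSDE" + "LLIVALLFAVLIVALLAVLL" + "SDNQTSG"))
```

Forbidding direct transitions between the two loop states is one simple way to encode the "biologically reasonable" connectivity mentioned on the slide.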
Results of secondary structure predictions (14)
• TMHMM v2 (Sonnhammer et al. 1998)
• HMMTOP v2 (Tusnady and Simon, 1998, 2001)
• Memsat2 (Jones et al. 1994, McGuffin et al. 2000)
CDF signature CE signature
Prediction of subcellular localization: methods
• N-terminal sorting signals display characteristic amino acid compositions
• sequence-based methods predicting N-terminal sorting signals are based on this observation
Legend: mTP: mitochondrial targeting peptide; cTP: chloroplast transit peptide; SP: signal peptide (ER/secretory pathway)
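The observation that N-terminal sorting signals have characteristic compositions can be made concrete by computing the amino acid composition of an N-terminal window. This is a minimal sketch: the window length of 30, the residue groupings, and the function names are assumptions chosen for illustration, and actual predictors feed such features into trained classifiers rather than reading them off directly.

```python
from collections import Counter

# Residue groups used for the summary (assumption: chosen for illustration).
POSITIVE = set("KR")
NEGATIVE = set("DE")
HYDROXYLATED = set("ST")


def nterm_composition(sequence, window=30):
    """Fraction of each amino acid within the first `window` residues."""
    region = sequence[:window].upper()
    if not region:
        raise ValueError("sequence must not be empty")
    counts = Counter(region)
    return {aa: n / len(region) for aa, n in counts.items()}


def class_fractions(sequence, window=30):
    """Summarize the N-terminal window by residue group."""
    composition = nterm_composition(sequence, window)

    def fraction(group):
        return sum(f for aa, f in composition.items() if aa in group)

    return {
        "positive": fraction(POSITIVE),
        "negative": fraction(NEGATIVE),
        "hydroxylated": fraction(HYDROXYLATED),
    }


# Hypothetical N-terminal stretch, not one of the CDF sequences.
print(class_fractions("MKKRSTDE", window=8))
```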
Prediction of subcellular localization: results
Legend: mTP: mitochondrial targeting peptide; cTP: chloroplast transit peptide; SP: signal peptide (ER/secretory pathway)
Exon structure of the CDF proteins # of exons 1 1 1 1 1 9 12 13 6 6 7 5
Gene organization of the CDF proteins CDF1 CDF2 CDF3 CDF4 CDF5 CDF11 CDF6 CDF12 CDF7 CDF8 CDF9 CDF10
“Phylogenetic Relationships within Cation Transporter Families of Arabidopsis”, Plant Physiology 2001; 126 (4): 1646–1667
• omitted: CDFs 5, 7, 8, 9
• tree labels: CDF6, CDF11, CDF4, CDF10, CDF12, CDF3, CDF2, CDF1
Phylogenetic analysis of sequences containing the CE signature
• Arabidopsis group I sequences, monocot and dicot sequences, mammalian metal transporters
• Arabidopsis group II sequences, monocot and dicot sequences, prokaryotic and eukaryotic sequences
• several two-domain proteins
• outgroup
working model: topology of Arabidopsis CDF proteins CDF signature sequence cell exterior/organelle cytoplasm N C
Information derived from the 3D structure of a protein
• assignment of function
• guide mutagenesis experiments
• ligand and functional sites
• evolutionary relationships
• residue solvent exposure
• putative interaction sites
Structure determination
Classical approaches
• X-ray crystallography
• NMR spectroscopy
Computational approaches
• comparative (“homology”) modeling
• fold recognition (“threading”)
• ab initio methods
The basis of fold recognition (“threading”)
The number of folds occurring in nature is limited: there are many sequences with no significant sequence identity but with the same or similar folds
…HEAIDHKPKLTGMKTGRVVSSMKSNFFADLP…
…HDGRSSMTRFSRYFRKTGRVSEYYKKQERLLE…
PDB statistics: http://www.rcsb.org/pdb/holdings.html
Fold recognition methods
• aim: to find an optimal sequence-structure alignment
• “threading” of an unknown target sequence into the backbone structure of template proteins of known structure
………CLVFMSVEVVGGIKANSLAILTD………
Fold recognition methods
2. evaluation of the compatibility between target sequence and proposed 3D structure
• using environment-based mean force potentials, or
• using knowledge-based mean force potentials
Output: a list of folds (sorted or unsorted), their “compatibility score”, and sometimes other information such as SCOP descriptors, alignment, rudimentary 3D model of the query protein, raw scores, solvation energy for the model, links
(figure label: 4.99 Å)
No new insights regarding the structure of CDF proteins
• Membrane proteins are significantly under-represented in structural databases, and therefore also in fold libraries
• If there is no fold similar to the native fold of the target protein, this approach cannot succeed
• Threading methods cannot be used for modeling of transmembrane proteins
Will the 3D structure of CDFs be available soon?
• for fold recognition methods to be used successfully: significantly more 3D structures of membrane proteins are needed
• fold recognition methods specifically for integral membrane proteins may eventually be developed
• crystallization of bacterial homologs and subsequent extrapolation of structural features as an alternative?
• approach for globular proteins: predicting a protein’s solubility and propensity to crystallize, based on results from high-throughput structure determination
2. “Phylothreading”
• Can threading results be used as an independent way to verify group assignment?
• Were some structural hits specific for any of the CDF groups?
• Which hits were common to which of the CDF sequences?
Which hits were common to which of the CDF sequences?
Structural hits predicted
• for most CDF sequences
• for group I sequences
• for group II sequences
• for CDF5 and CDF11
• for CDF6 and CDF12
Results were unable to provide evidence to verify group assignments based on other methods
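The question of which hits are shared between which sequences amounts to comparing sets of fold hits. The sketch below shows one way this comparison could be set up; it is not the procedure used in the talk. The sequence names and fold identifiers in `EXAMPLE_HITS` are hypothetical placeholders, not threading results, and the Jaccard index is an assumed choice of overlap measure.

```python
from itertools import combinations


def shared_hits(hits_by_sequence):
    """For every pair of sequences, return the common fold hits and the
    Jaccard index (size of intersection divided by size of union)."""
    result = {}
    for a, b in combinations(sorted(hits_by_sequence), 2):
        common = hits_by_sequence[a] & hits_by_sequence[b]
        union = hits_by_sequence[a] | hits_by_sequence[b]
        jaccard = len(common) / len(union) if union else 0.0
        result[(a, b)] = (common, jaccard)
    return result


# Hypothetical placeholder data: NOT results from the threading runs.
EXAMPLE_HITS = {
    "seq_a": {"fold_1", "fold_2", "fold_3"},
    "seq_b": {"fold_2", "fold_3"},
    "seq_c": {"fold_4"},
}

for pair, (common, jaccard) in shared_hits(EXAMPLE_HITS).items():
    print(pair, sorted(common), round(jaccard, 2))
```

The resulting pairwise overlaps could then be compared with the group assignments obtained from the phylogenetic analysis.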
“Phylothreading” Phylothreading results can neither verify nor refute group assignments based on other methods
Threading: non-transmembrane CDF fragments
• N-terminus
• histidine-rich loop between TMD 4 and 5
• C-terminus
(figure labels: cell exterior/organelle, cytoplasm, N, C)
“Phylothreading”: CDF C-terminal fragments “phylothreading” results confirm the assignment of CDF sequences to groups that were based on independent methods
Conclusions
• The 12 Arabidopsis protein sequences reveal structural and therefore probably functional conservation
• My results support the classification of these proteins as CDF metal transporters
• I propose that the CDF protein family of A. thaliana contains two groups, each containing at least four proteins that are structurally and functionally closely related
• Threading methods cannot be used for transmembrane proteins or for their non-transmembrane domains (yet)
• Threading results for multiple sequences may be used to confirm (or find?) relationships among these sequences (“phylothreading”)
• I was able to evaluate and compare a number of online tools that are available for the analysis of sequence data
Conclusions
1. Sequence retrieval revealed conflicting information for 7 of the 12 proteins
2. The 12 Arabidopsis protein sequences reveal striking structural and therefore probably functional conservation
3. My results support the classification of these proteins as CDF metal transporters
4. I propose that the CDF protein family of A. thaliana contains two groups, each containing four proteins that are structurally and functionally closely related
5. I was able to evaluate and compare a variety of online tools available for the analysis of sequence data
6. Threading methods cannot be used for transmembrane proteins or for their non-transmembrane domains (yet)
7. Threading results for multiple sequences can be used to confirm (or find?) relationships among these sequences (“phylothreading”)
Phylogenetic analysis: tree-building methods
• distance-based methods: overall distances between all pairs of sequences are calculated and then used to calculate a tree (Neighbor Joining)
• character-based methods: the individual substitutions among the sequences are used to determine the most likely ancestral relationships (Maximum Parsimony, Maximum Likelihood)
• Bayesian inference of phylogenies
...CLVFMSVEVVGGIKANSLAILTD...
...NTAYMVVEFVAGFMSNSLGLISD...
...CLLFMSIEVVCGIKANSLAILAD...
...CAIFIVVEVVGGIKANSLAILTD...
...YLIVMSVQIVGGFKANSLAVMTD...
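The first step of a distance-based method, computing distances between all pairs of sequences, can be sketched on the five aligned fragments shown on this slide. This is a minimal illustration: the labels `seq1` to `seq5` are assumptions (the slide does not name the fragments), the uncorrected p-distance is a simple stand-in for whatever distance measure was actually used, and the tree-building step itself (e.g. Neighbor Joining) is not shown.

```python
# Aligned fragments from the slide; the labels are arbitrary (assumption).
FRAGMENTS = {
    "seq1": "CLVFMSVEVVGGIKANSLAILTD",
    "seq2": "NTAYMVVEFVAGFMSNSLGLISD",
    "seq3": "CLLFMSIEVVCGIKANSLAILAD",
    "seq4": "CAIFIVVEVVGGIKANSLAILTD",
    "seq5": "YLIVMSVQIVGGFKANSLAVMTD",
}


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def p_distance_matrix(sequences):
    """Uncorrected p-distance (fraction of differing sites) for all pairs."""
    return {
        (a, b): hamming(sequences[a], sequences[b]) / len(sequences[a])
        for a in sequences
        for b in sequences
    }


matrix = p_distance_matrix(FRAGMENTS)
for a in FRAGMENTS:
    print(a, " ".join(f"{matrix[(a, b)]:.2f}" for b in FRAGMENTS))
```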
Phylogenetic analysis: statistical evaluation of trees
• bootstrap analysis: how much support exists for particular branches in a phylogeny?
- tree construction, determination of the “best” tree
- bootstrap datasets (pseudosamples) are created from the original dataset by random sampling with replacement
- tree construction using the bootstrap datasets
- comparison of the bootstrap tree with the inferred tree
- this is repeated several hundred times
• bootstrap value: percentage of times an interior branch in the bootstrap tree was the same as the one in the inferred tree
...CLVFMSVEVVGGIKANSLAILTD...
...NTAYMVVEFVAGFMSNSLGLISD...
...CLLFMSIEVVCGIKANSLAILAD...
...CAIFIVVEVVGGIKANSLAILTD...
...YLIVMSVQIVGGFKANSLAVMTD...
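The resampling step of the bootstrap can be sketched as drawing alignment columns at random with replacement. In the illustration below, a loud simplification is made: instead of building and comparing full trees, each replicate only checks whether the closest pair of sequences is the same as in the original data. The labels `seq1` to `seq5`, the function names, and the seed are assumptions introduced for this sketch.

```python
import random

# Aligned fragments from the slide; the labels are arbitrary (assumption).
ALIGNMENT = {
    "seq1": "CLVFMSVEVVGGIKANSLAILTD",
    "seq2": "NTAYMVVEFVAGFMSNSLGLISD",
    "seq3": "CLLFMSIEVVCGIKANSLAILAD",
    "seq4": "CAIFIVVEVVGGIKANSLAILTD",
    "seq5": "YLIVMSVQIVGGFKANSLAVMTD",
}


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bootstrap_alignment(alignment, rng):
    """Create one pseudosample by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def closest_pair(alignment):
    """Stand-in for tree building: the pair with the fewest differences.
    Ties are broken alphabetically by sequence name."""
    names = sorted(alignment)
    pairs = [
        (hamming(alignment[a], alignment[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    _, a, b = min(pairs)
    return (a, b)


def bootstrap_support(alignment, replicates=200, seed=0):
    """Percentage of pseudosamples that recover the original closest pair."""
    rng = random.Random(seed)
    reference = closest_pair(alignment)
    hits = sum(
        closest_pair(bootstrap_alignment(alignment, rng)) == reference
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


print(closest_pair(ALIGNMENT), bootstrap_support(ALIGNMENT))
```

In a real analysis each pseudosample would be passed to the same tree-building method as the original alignment, and support would be tallied per interior branch.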
Fold recognition methods
2. evaluation of the compatibility between target sequence and proposed 3D structure
• using environment-based mean force potentials (Bowie, Fischer, Eisenberg: 1991-1996)
- residue positions are categorized into environment classes
- the 3D protein structure is converted into a 1D sequence
- generate alignment of this 1D string to target sequence
• using knowledge-based mean force potentials (Sippl: 1990-1995)
- information is automatically learned from databases of protein structures
- pairwise interactions between structurally adjacent residues are calculated
- transformation of mean force potentials as a function of distance
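The environment-based idea, converting each template structure into a 1D string of environment classes and scoring a target sequence against it, can be sketched with a deliberately tiny model. Everything here is an assumption made for illustration: only two environment classes are used (B = buried, E = exposed), the scores are a fixed +1/-1 instead of learned potentials, the alignment is ungapped, and the fold library and target are hypothetical.

```python
HYDROPHOBIC = set("AILMFVWC")


def residue_env_score(residue, env):
    """+1 if the residue suits the environment, otherwise -1.
    Hydrophobic residues suit buried (B) positions, all others exposed (E)."""
    return 1 if (env == "B") == (residue in HYDROPHOBIC) else -1


def best_ungapped_score(target, env_string):
    """Slide the target along the environment string without gaps and
    return (best score, offset of the best placement)."""
    if len(target) > len(env_string):
        raise ValueError("target is longer than the template")
    best_score, best_offset = None, None
    for offset in range(len(env_string) - len(target) + 1):
        window = env_string[offset:offset + len(target)]
        score = sum(residue_env_score(r, e) for r, e in zip(target, window))
        if best_score is None or score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


def rank_folds(target, library):
    """Return (fold name, score, offset) tuples sorted by descending score."""
    ranking = [
        (name,) + best_ungapped_score(target, env)
        for name, env in library.items()
    ]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Hypothetical fold library: 1D environment strings, not real structures.
LIBRARY = {"fold_a": "EEBBBEEE", "fold_b": "EEEEEEEE"}
print(rank_folds("LLVSDE", LIBRARY))
```

Real 3D-1D methods use many more environment classes and a gapped dynamic-programming alignment; the sorted list returned here corresponds to the ranked list of folds with compatibility scores that such tools report.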
Fold recognition methods
• aim: to find an optimal sequence-structure alignment
• “threading” of an unknown target sequence into the backbone structure of template proteins of known structure
query sequence: ………CLVFMSVEVVGGIKANSLAILTD………
(figure label: fold library)