D2DBT9 - Genetic Analysis and Bioinformatics

Bioinformatics of Proteins in One and Three Dimensions
Dr. Jaume Bacardit (jaume.bacardit@nottingham.ac.uk)

Learning outcomes • To gain practical experience of using web-based protein databases and extracting information from them • To gain practical experience of using public web-based protein structure prediction services • To have basic knowledge of how to use protein visualisation tools • To have basic practical experience of how to perform homology modelling

Protein we are going to use today… • In most examples we are going to use the AXR4 protein from Arabidopsis thaliana
MAIITEEEEDPKTLNPPKNKPKDSDFTKSESTMKNPKPQSQNPFPFWFYFTVVVSLATII
FISLSLFSSQNDPRSWFLSLPPALRQHYSNGRTIKVQVNSNESPIEVFVAESGSIHTETV
VIVHGLGLSSFAFKEMIQSLGSKGIHSVAIDLPGNGFSDKSMVVIGGDREIGFVARVKEV
YGLIQEKGVFWAFDQMIETGDLPYEEIIKLQNSKRRSFKAIELGSEETARVLGQVIDTLG
LAPVHLVLHDSALGLASNWVSENWQSVRSVTLIDSSISPALPLWVLNVPGIREILLAFSF
GFEKLVSFRCSKEMTLSDIDAHRILLKGRNGREAVVASLNKLNHSFDIAQWGNSDGINGI
PMQVIWSSEASKEWSDEGQRVAKALPKAKFVTHSGSRWPQESKSGELADYISEFVSLLPK
SIRRVAEEPIPEEVQKVLEEAKAGDDHDHHHGHGHAHAGYSDAYGLGEEWTTT

Biological databases • UniProt • NCBI Entrez • Pfam

UniProt • UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). • The main protein database • http://www.uniprot.org/

Querying UniProt with a protein • Name: AXR4 • UniProt ID
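Once an entry has been found, its sequence can also be handled programmatically. The following is a minimal sketch, not part of the original slides: `parse_fasta` and `uniprot_fasta_url` are hypothetical helper names, the URL pattern is an assumption about the current UniProt REST service, and the FASTA header in the example is made up (only the two sequence lines come from the AXR4 sequence shown in these slides).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniprot_fasta_url(accession):
    # Assumption: URL pattern of the UniProt REST service; check the UniProt help pages.
    return "https://rest.uniprot.org/uniprotkb/" + accession + ".fasta"


# Hypothetical header; the sequence lines are the first 120 residues of AXR4.
example = """>AXR4_example hypothetical header
MAIITEEEEDPKTLNPPKNKPKDSDFTKSESTMKNPKPQSQNPFPFWFYFTVVVSLATII
FISLSLFSSQNDPRSWFLSLPPALRQHYSNGRTIKVQVNSNESPIEVFVAESGSIHTETV
"""
records = parse_fasta(example)
print(records[0][0], len(records[0][1]))
```

The parser only joins the sequence lines of each record; it does not validate the amino acid alphabet.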

Included in the AXR4 page…. • Annotation of protein • Function, location, specificity, disruption phenotype • Gene Ontology • Sequence • Transmembrane potential • Bibliographic references • Cross-references to other databases • GenBank, PIR, KEGG, TAIR (Arabidopsis-specific)

Scrolling down through the AXR4 page…. If we click here….

Blasting the AXR4 sequence…

Returns these results • Now we select the closest homologs and press Align

Aligning the homologs (ClustalW)

ClustalW also generates phylogenetic trees

UniProt is not the only source of protein information…. • The NCBI’s Entrez system returns this for AXR4

Pfam: sequence-based detection of protein families

Pfam returns three possible sequence motifs (but no significant results)

Protein Data Bank (PDB) • Put your PDB ID here • Each structure in PDB is identified by a 4-character code (for example 2p31)
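As a small illustration of the identifier format, the sketch below checks whether a string looks like a classic PDB ID and builds a download address for it. It assumes that classic IDs are a digit from 1 to 9 followed by three letters or digits, and that the RCSB file server uses the URL pattern shown; both function names are hypothetical.

```python
import re

# Assumption: classic PDB IDs are one digit (1-9) followed by three alphanumerics.
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def is_valid_pdb_id(pdb_id):
    """Return True if the string has the shape of a classic 4-character PDB ID."""
    return PDB_ID_PATTERN.fullmatch(pdb_id) is not None


def pdb_download_url(pdb_id):
    """Build a download URL for the PDB-format file (assumed RCSB URL pattern)."""
    if not is_valid_pdb_id(pdb_id):
        raise ValueError("not a 4-character PDB ID: %r" % pdb_id)
    return "https://files.rcsb.org/download/" + pdb_id.upper() + ".pdb"


print(is_valid_pdb_id("2p31"))
print(pdb_download_url("2p31"))
```

A protein name such as AXR4 fails the check, which is a quick way to catch a name typed where an ID was expected.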

Entry 2p31 • Let’s click on “Display PDB file”

PDB file for 2p31 • Sequence • Atomic coordinates of the amino acids
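The atomic coordinates are stored in fixed-column ATOM records, so they can be read by slicing each line. The sketch below assumes the standard PDB column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The helper names are hypothetical and the sample coordinates are made up for illustration; they are not taken from 2p31.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build an ATOM record with the fixed-column layout (illustration only)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, res_name, chain, res_seq, x, y, z, 1.0, 0.0)


def parse_atom_line(line):
    """Extract the main fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def ca_coordinates(pdb_text, chain):
    """Return (residue number, residue name, x, y, z) for each CA atom of a chain."""
    coords = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM  "):
            continue
        atom = parse_atom_line(line)
        if atom["chain"] == chain and atom["name"] == "CA":
            coords.append((atom["res_seq"], atom["res_name"],
                           atom["x"], atom["y"], atom["z"]))
    return coords


# Made-up records: two atoms of chain A and one of chain B.
sample_pdb = "\n".join([
    format_atom_line(1, " N", "MET", "A", 1, 0.0, 1.0, 2.0),
    format_atom_line(2, " CA", "MET", "A", 1, 1.5, -2.25, 3.0),
    format_atom_line(3, " CA", "GLY", "B", 1, 4.0, 5.0, 6.0),
])
print(ca_coordinates(sample_pdb, "A"))
```

Slicing by column rather than splitting on whitespace matters because neighbouring fields in real PDB files can touch without a separating space.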

Other Biological Databases

Prediction sites • Secondary structure prediction • Prediction of residues’ structural aspects • Tertiary structure prediction • Transmembrane prediction • Functional sites prediction • These servers perform very complex calculations • They sometimes take a day or two (or more) to reply • Generally users are notified by email when the results are ready

PSIPRED: Secondary Structure Prediction

Results of PSIPRED…

3D structure prediction • 3D-Jury is a meta-server for 3D protein structure prediction (PSP)

Results of 3D-Jury • Good source of templates

Results of 3D-Jury (scrolling to the right)

LOMETS • The quick server from the Zhang group • Zhang’s I-TASSER is the best publicly available PSP server • Unfortunately it is very overloaded (for AXR4 it took 8 days to return a model) • LOMETS performs fold recognition (FR) using several locally installed programs • Generates homology models from the alignments obtained in the FR process • Another good source of distant templates

LOMETS results

mGenTHREADER Prediction results • More templates !!

Other 3D PSP servers • FUGUE • 3D-JIGSAW • HHpred • SAM-T08 • ROBETTA (David Baker’s server. Heavily overloaded too) • Results of CASP8 (to see how these servers perform)

Infobiotic.net PSP server • Created here in Nottingham • It predicts a broad variety of residues’ structural aspects

Results from the Infobiotic.net server

Firestar: Functional sites prediction

TMHMM: Transmembrane prediction

PyMOL • One of the best protein visualisation tools • Free for educational use • You can ask for a license at http://www.pymol.org/educational.html • I have a license, so if you would like to use it on your personal computers, you can download it from http://www.cs.nott.ac.uk/~jqb/pymol-1_1edu1-bin-win32.zip • I also have the Linux and MacOS versions • Please do not distribute it

Let’s download 2p31 and open it from PyMOL

Controls are at the top right of the screen • A control (all) affects everything loaded into pymol • Also, you can control each loaded protein/selection individually. Right now there is only one protein (2p31) • Five types of controls: • Actions, Show, Hide, Label and Colour

To change to a cartoon visualisation… • 2p31 → Hide → Everything • 2p31 → Show → Cartoon • 2p31 → Colour → Spectrum → Rainbow • Now click on the middle of the screen, drag the mouse and this is what you obtain….

Visualising only chain A • As we saw in the PDB web site, this protein has two chains • To visualise only one of them, we have to create a selection • You have to type this at the pymol prompt: • PyMOL>select chainA, 2p31 and chain A • chainA is the label of the selection • Everything after the comma is the definition of the selection • We can select chains, residues and even atoms • Type “help selection” to see all possible options

Visualising only chain A • All → Hide → Everything • chainA → Show → Cartoon • chainA → Colour → Spectrum → Rainbow • chainA → Action → Zoom

Showing the protein surface • chainA → Show → Surface • Type this: set transparency=0.5
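The menu clicks on the PyMOL slides can also be written down as typed commands, which makes the view easy to reproduce. The sketch below only generates the command text in plain Python, so it runs without PyMOL installed. The function name `build_pml` is hypothetical, and mapping the Colour → Spectrum → Rainbow menu entry to the `spectrum` command is an approximation rather than a guaranteed equivalent.

```python
def build_pml(pdb_file, obj, chain):
    """Return PyMOL commands that roughly mirror the GUI steps for one chain."""
    sel = "chain" + chain  # selection label, e.g. chainA
    return [
        "load %s, %s" % (pdb_file, obj),                    # open the structure
        "hide everything, all",                             # All -> Hide -> Everything
        "select %s, %s and chain %s" % (sel, obj, chain),   # create the selection
        "show cartoon, %s" % sel,                           # Show -> Cartoon
        "spectrum count, rainbow, %s" % sel,                # approx. Colour -> Spectrum -> Rainbow
        "zoom %s" % sel,                                    # Action -> Zoom
        "show surface, %s" % sel,                           # Show -> Surface
        "set transparency, 0.5",                            # half-transparent surface
    ]


for command in build_pml("2p31.pdb", "2p31", "A"):
    print(command)
```

The printed lines can be typed one by one at the PyMOL prompt, or saved to a script file (for example a hypothetical view_chainA.pml) and run from the prompt with @view_chainA.pml.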

Simple Homology Modelling • We are going to use Modeller • Free for academic use • http://salilab.org/modeller/9v6/modeller9v6.exe • Licence key: MODELIRANJE • 1st step: Installing it. • When choosing the destination path, choose c:\temp (in B08/B09) • Modeller is a very sophisticated tool where you can control almost any aspect of the homology modelling process • Here we are only going to use the simplest options

Chain we are going to model • CASP target ID: T0388 • LOC493869A, Homo sapiens
ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGKVSLVVNVASDCQLTDRNYLGLKELHKEFGPSHFSVLAFPCNQFGESEPRPSKEVESFARKNYGVTFPIFHKIKILGSEGEPAFRFLVDSSKKEPRWNFWKYLVNPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL

1st step: BLAST against PDB

Selecting the template • The perfect match exists, because right now the structure for this target is already public • We are going to ignore it, and use chain A of protein 2p31 instead

2nd step: Creating an alignment • Modeller has a sophisticated alignment tool • It uses structural information from the template • It uses dynamic programming instead of the approximate method of BLAST • To create the alignment you need to: • Download the PDB file of the template • Put your sequence in PIR format (example) • Edit the alignment script to set the template and chain • Run Modeller: mod9v6.exe align.py

PIR file • Just replace the sequence with your own one • The last line in the sequence needs to end in * • Do not touch anything else in the file, or the alignment script will not work
>P1;target
sequence:target:::::::0.00: 0.00
ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGKVSLVVNVASDCQLTDRNYLGLKELHKE
FGPSHFSVLAFPCNQFGESEPRPSKEVESFARKNYGVTFPIFHKIKILGSEGEPAFRFLV
DSSKKEPRWNFWKYLVNPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL*
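Editing the PIR file by hand is error-prone (a forgotten * is a common slip), so the file can also be generated. This is a minimal sketch that reproduces the two header lines of the example PIR file; the function name `to_pir` and the 60-residue line width are choices made for this illustration.

```python
def to_pir(code, sequence, width=60):
    """Format a sequence as a PIR entry with the same header as the example file."""
    # Remove whitespace and any existing terminator, then add exactly one '*'.
    seq = "".join(sequence.split()).upper().rstrip("*") + "*"
    lines = [
        ">P1;" + code,
        "sequence:" + code + ":::::::0.00: 0.00",
    ]
    # Wrap the sequence into fixed-width lines.
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


# Short made-up sequence, used only to show the wrapping.
print(to_pir("target", "ACDEFGHIKLMNPQRSTVWY" * 4))
```

The code `target` must match the `align_codes='target'` argument used in the alignment script.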

Align.py • Just change the values of the template and chain lines to match your template
    from modeller import *
    from modeller.automodel import *
    env = environ()
    aln = alignment(env)
    # Change these 2 lines to match your template
    template='2p31'
    chain='A'
    tc=template+chain
    # Read only the selected chain of the template structure
    mdl = model(env, file=template, model_segment=('FIRST:'+chain,'LAST:'+chain))
    aln.append_model(mdl, align_codes=tc, atom_files=template+'.pdb')
    # Add the target sequence from the PIR file
    aln.append(file='target.ali', align_codes='target')
    # Align the target to the template using structural information
    aln.align2d()
    aln.write(file='target-'+tc+'.ali', alignment_format='PIR')
    aln.write(file='target-'+tc+'.pap', alignment_format='PAP')

Results of the alignment • The alignment is different from that produced by BLAST • Modeller has ignored the residues lacking structural information
_aln.pos          10        20        30        40        50        60
2p31A     -----Q----DFYDFKAVNIRGKLVSLEKYRGSVSLVVNVASECGFTDQHYRALQQLQRDLGPHHFNV
target    ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGKVSLVVNVASDCQLTDRNYLGLKELHKEFGPSHFSV
_consrvd       *     ** *      *  ****** * ********* *  **  *  *  *    ** ** *
_aln.p    70        80        90       100       110       120       130
2p31A     LAFPCNQFGQQEPDSNKEIESFARRTYSVSFPMFSKIAVTGTGAHPAFKYLAQTSGKEPTWNFWKYLV
target    LAFPCNQFGESEPRPSKEVESFARKNYGVTFPIFHKIKILGSEGEPAFRFLVDSSKKEPRWNFWKYLV
_consrvd  *********  **   ** *****  * * ** * **   *    ***  *   * *** ********
_aln.pos   140       150       160       170
2p31A     APDGKVVGAWDPTVSVEEVRPQITALVR----------
target    NPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL
_consrvd   * * **  * *    *  ** * ****
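The asterisks in the _consrvd row mark identical columns, and counting them gives the sequence identity between template and target. The sketch below does this for the last block of the alignment (positions 137 to 174); the function name `identity` is hypothetical, and gap columns are excluded from the count, which is one of several possible conventions.

```python
def identity(aligned_a, aligned_b, gap="-"):
    """Return (identical, aligned) column counts for two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = aligned = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted
        aligned += 1
        if a == b:
            identical += 1
    return identical, aligned


# Last block of the alignment; the template has 10 trailing gap columns.
template_block = "APDGKVVGAWDPTVSVEEVRPQITALVR" + "-" * 10
target_block = "NPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL"

same, total = identity(template_block, target_block)
print("%d of %d aligned residues identical (%.0f%%)" % (same, total, 100.0 * same / total))
```

Applying the same function to the full aligned rows gives the overall identity, which is a quick sanity check on how suitable the template is.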

Creating the model • 5 models are created • Each of them can be slightly different • Models are going to be assessed using 2 different criteria (DOPE and GA341)
    from modeller import *
    from modeller.automodel import *
    log.verbose()
    env = environ()
    template='2p31'
    chain='A'
    tc=template+chain
    class MyModel(automodel):
        # Name the output files target_1.pdb, target_2.pdb, ...
        def get_model_filename(self, sequence, id1, id2, file_ext):
            return sequence+'_'+str(id2)+file_ext
        # Hook for user-defined restraints (none are added here)
        def special_restraints(self, aln):
            rsr = self.restraints
    a = MyModel(env, alnfile='target-'+tc+'.ali', knowns=tc, sequence='target',
                assess_methods=(assess.DOPE, assess.GA341))
    a.starting_model = 1
    a.ending_model = 5
    a.make()
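With five models and two assessment scores, a common next step is to pick the model with the lowest (best) DOPE score. The sketch below assumes that, after `a.make()`, Modeller exposes a summary list `a.outputs` of dictionaries with keys such as 'name', 'failure' and 'DOPE score'; check this against the documentation of your Modeller version. The function name `rank_models` is hypothetical and the scores in the example are invented, not real results for this target.

```python
def rank_models(outputs, key="DOPE score"):
    """Return successfully built models sorted by score, best (lowest) first."""
    ok_models = [m for m in outputs if m.get("failure") is None]
    return sorted(ok_models, key=lambda m: m[key])


# Invented example data with the assumed structure of a.outputs.
example_outputs = [
    {"name": "target_1.pdb", "failure": None, "DOPE score": -15000.0},
    {"name": "target_2.pdb", "failure": None, "DOPE score": -15250.5},
    {"name": "target_3.pdb", "failure": "optimisation failed", "DOPE score": 0.0},
]

best = rank_models(example_outputs)[0]
print("Best model:", best["name"], best["DOPE score"])
```

In a real run the call would be `rank_models(a.outputs)` placed after `a.make()`.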
