Computational Structural Biology
• T.A.: Naama Amir
• E-mail: naamaamir@mail.tau.ac.il
• Schreiber 10
• Homework (probably 4 assignments)
• Office hours: very flexible (available after the tutorial)
• Most important: ask a lot of questions!
Know your protein Exercise 1: Databases presentation
Presented Databases:
• UniProt: the main sequence database
  • SwissProt
  • TrEMBL
• NCBI: many databases, including sequences and structures
• RCSB: the Protein Data Bank, all deposited structures
• PDBsum: combines structural & sequence data
UniProt
The Universal Protein Resource
• The world's most comprehensive catalog of information on proteins
• Sequence, function & more…
• Comprises mainly two databases:
  • SwissProt: 412525 protein entries last year, 538010 now. High-quality annotation, manually annotated and reviewed, non-redundant and cross-referenced to many other databases.
  • TrEMBL: 17651716 protein entries last year, 23994583 now. Computer translation of the genetic information in the EMBL Nucleotide Sequence Database. Many proteins are poorly annotated, since only automatic annotation is generated.
UniProt
• The annotation includes:
  • Function(s) of the protein;
  • Post-translational modification(s), such as phosphorylation, acetylation and GPI-anchor;
  • Domains and sites, for example calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes;
  • Secondary structure, e.g. alpha helix, beta sheet;
  • Quaternary structure, e.g. homodimer, heterotrimer, etc.;
  • Similarities to other proteins;
  • Diseases associated with any number of deficiencies in the protein;
  • Sequence annotations such as sequence conflicts, variants, etc.
UniProt
• Connected to many other databases (e.g. EC, PDBsum, PDB (to be discussed…))
• Each sequence has a unique accession of 6 alphanumeric characters, e.g. P05102 (newer entries may have 10 characters)
• Entries in SwissProt also have IDs, which usually make sense (e.g. CADH1_HUMAN for a human cadherin)
• Download the sequence in FASTA format
UniProt
• FASTA format for protein sequences:
>P05102|MTH1_HAEPH Modification methylase HhaI
MIEIKDKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEK
EGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREKKPK
VVFMENVKNFASHDNGNTLEVVKNTMNELDYSFHAKVLNALDYGIPQKRERIYMIC
RNDLNIQNFQFPKPFELNTFVKDLLLPDSEVEHLVIDRKDLVMTNQEIEQTTPKTV
LGIVGKGGQGERIYSTRGIAITLSAYGGGIFAKTGGYLVNGKTRKLHPRECARVMG
PDSYKVHPSTSQAYK
QFGNSVVINVLQYIAYNIGSSLNFKPY
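A minimal Python sketch of how a FASTA record like the one above could be read: the line starting with ">" is the header, and all following lines are joined into one sequence. The function names `parse_fasta` and `split_header` are hypothetical, and the `ACCESSION|ID description` header layout is assumed from this slide only. Files downloaded from UniProt today use a slightly different header (for example `>sp|P05102|MTH1_HAEPH ...`). The example sequence is truncated to its first two lines.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; drop stray spaces and join them.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split 'ACCESSION|ID description' (layout assumed from the slide)."""
    ids, _, description = header.partition(" ")
    accession, _, entry_id = ids.partition("|")
    return accession, entry_id, description


# First two sequence lines of the slide's example (truncated on purpose).
example = """>P05102|MTH1_HAEPH Modification methylase HhaI
MIEIKDKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEK
EGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREKKPK
"""

(header, sequence), = parse_fasta(example)
print(split_header(header))
print(len(sequence), "residues read")
```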
UniProt http://www.uniprot.org/ Type the accession (P05102) or the ID (MTH1_HAEPH)
UniProt General data: name, origin, EC (enzymatic reaction)…
UniProt Functional data, including the GO annotations Scroll down to find the sequence & download the FASTA
UniProt Known sites, predicted/known secondary structures, Natural variation or mutagenesis
UniProt The protein’s sequence in FASTA format Download Send to BLAST (will be discussed later on in the course)
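The download step can also be scripted. This is a minimal sketch, assuming the UniProt REST address pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, which does not appear on the slides and may change. `uniprot_fasta_url` and `fetch_fasta` are hypothetical helper names. Only the URL is built when the script runs. The network call happens only if `fetch_fasta` is called.

```python
import urllib.request

# Assumed URL pattern for UniProt's REST service (not taken from the slides).
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def uniprot_fasta_url(accession):
    """Build the download URL for one UniProt accession, e.g. P05102."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def fetch_fasta(accession, timeout=30):
    """Download the FASTA text for an accession (requires network access)."""
    with urllib.request.urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(uniprot_fasta_url("P05102"))
```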
UniProt References for all the information on the page: important to take a look…
UniProt
Connections to other databases:
• Other sequence databases, e.g. GenBank
• Related structures in the PDB (if available)
• Model structure in the ModBase database (automatically derived!)
• All sorts of domain/motif databases: the family related to the entry
NCBI
National Center for Biotechnology Information
• Biomedical and genomic information.
• Many sorts of databases, e.g. biomedical literature, sequences, and protein structures.
NCBI
• Public databases
  • GenBank: the NIH genetic sequence database, a collection of all publicly available DNA sequences.
  • RefSeq: the Reference Sequence collection. Comprehensive, integrated, non-redundant, well-annotated sequences: genomic DNA, transcripts, proteins. ~6,413,124 protein entries.
  • Protein: compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.
NCBI • http://www.ncbi.nlm.nih.gov/ e.g. UniProt accession The protein database
NCBI
• http://www.ncbi.nlm.nih.gov/
• GI: the unique ID given to each sequence
• UniProt accession, not ID!
• PDB entries
RCSB – Protein Data Bank (pdb)
• The main, comprehensive database of biological macromolecular structures
• Each structure receives a PDB ID: a unique 4-character identifier
• Search by author, PDB ID or any keyword.
• Download structures
RCSB – Protein Data Bank (pdb) • RCSB- Protein Databank http://www.rcsb.org PDB ID: 10mh
RCSB – Protein Data Bank (pdb) • Protein Data Bank Download structure The paper describing the structure Display structure
RCSB – Protein Data Bank (pdb)
• PDB files have a specific format:
  • HEADER – PDB code and deposition date
  • TITLE – a title describing the entry
  • REMARK
  • JRNL – reference
  • HELIX, SHEET – secondary structure
  • ATOM – the actual protein/DNA/RNA chain
  • HETATM – additional atoms such as ligands, water etc.
  • MODEL/ENDMDL
  • …
http://www.wwpdb.org/documentation/format3.1-20080211.pdf
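Because every line of a PDB file starts with its record name in the first six columns, the record types listed above can be tallied with a few lines of Python. This is a minimal sketch: `count_record_types` is a hypothetical helper name, and the sample lines are four ATOM/HETATM records from the next slide, re-aligned here to fixed columns as an assumption.

```python
from collections import Counter


def count_record_types(pdb_text):
    """Count PDB records by the name found in columns 1-6 of each line."""
    counts = Counter()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record:
            counts[record] += 1
    return counts


# Sample records from the slide (column alignment is an assumption).
sample = """\
ATOM      9  N   ILE A   2     -29.656  32.903  69.094  1.00 25.93           N
ATOM     10  CA  ILE A   2     -30.077  33.171  67.730  1.00 25.49           C
HETATM 3144  C4  SAH   328     -13.832  27.092  88.861  1.00 14.31           C
HETATM 3145  O   HOH   329     -29.525  42.890  90.934  1.00 24.84           O
"""

counts = count_record_types(sample)
for record, n in sorted(counts.items()):
    print(record, n)
```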
RCSB – Protein Data Bank (pdb)
PDB files have a specific format:
ATOM      7  SD  MET A   1     -29.059  28.614  71.539  1.00 26.90           S
ATOM      8  CE  MET A   1     -27.535  29.074  70.866  1.00 16.57           C
ATOM      9  N   ILE A   2     -29.656  32.903  69.094  1.00 25.93           N
ATOM     10  CA  ILE A   2     -30.077  33.171  67.730  1.00 25.49           C
HETATM 3139  C6  SAH   328     -11.642  26.514  89.489  1.00 17.97           C
HETATM 3140  N6  SAH   328     -10.474  26.661  90.103  1.00 14.50           N
HETATM 3141  N1  SAH   328     -11.895  25.334  88.899  1.00 23.10           N
HETATM 3142  C2  SAH   328     -13.079  25.090  88.350  1.00 16.93           C
HETATM 3143  N3  SAH   328     -14.120  25.887  88.278  1.00 16.05           N
HETATM 3144  C4  SAH   328     -13.832  27.092  88.861  1.00 14.31           C
HETATM 3145  O   HOH   329     -29.525  42.890  90.934  1.00 24.84           O
HETATM 3146  O   HOH   330     -28.213  42.867  93.588  1.00  8.11           O
HETATM 3147  O   HOH   331     -24.619  35.287  96.173  1.00 17.96           O
Columns, left to right: record type, atom number, atom name, residue or molecule, chain (if exists), residue number, coordinates (X, Y, Z), occupancy, B-factor, element.
http://www.wwpdb.org/documentation/format33/sect9.html#ATOM
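The ATOM and HETATM records are fixed-width, so each field is read by column position rather than by splitting on spaces. The sketch below slices one line into the fields named on the slide. `AtomRecord` and `parse_atom_line` are hypothetical names, and the two sample lines are slide records re-aligned to the standard wwPDB column positions, which is an assumption because the original spacing was lost.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    record: str       # ATOM or HETATM
    serial: int       # atom number
    name: str         # atom name
    res_name: str     # residue or molecule name
    chain: str        # chain ID, empty if absent
    res_seq: int      # residue number
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line):
    """Slice one ATOM/HETATM line by its fixed column positions."""
    return AtomRecord(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


# Two records from the slide; the column alignment is an assumption.
protein_atom = "ATOM     10  CA  ILE A   2     -30.077  33.171  67.730  1.00 25.49           C"
ligand_atom = "HETATM 3139  C6  SAH   328     -11.642  26.514  89.489  1.00 17.97           C"

for line in (protein_atom, ligand_atom):
    atom = parse_atom_line(line)
    print(atom.record, atom.name, atom.res_name, atom.chain or "-", atom.res_seq, atom.b_factor)
```

Slicing by position matters because fields such as the chain ID can be blank, as in the SAH ligand line, which would shift every later field if the line were split on whitespace.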
RCSB – Protein Data Bank (pdb)
• Resolution: a measure of the quality of the underlying data. High-resolution structures have low values.
• R-value: measures how well the atomic model agrees with the crystallographic data. Again, the lower the better. Typical values are about 0.20.
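The slide does not give a formula for the R-value. A common textbook definition is R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|, where Fobs are the observed structure-factor amplitudes and Fcalc those calculated from the model. The sketch below assumes that definition and omits any scale factor. `r_value` is a hypothetical helper name, and the amplitudes are made-up numbers for illustration only, not real diffraction data.

```python
def r_value(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|), without a scale factor."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    return numerator / denominator


# Made-up amplitudes, purely illustrative.
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 50.0, 45.0]

print(round(r_value(f_obs, f_calc), 3))
```

A model that reproduced the observed amplitudes exactly would give R = 0, which is why lower values indicate better agreement.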
PDBsum
• A database providing an overview of all biological macromolecular structures
• Connected to UniProt: find the sequence accession for a known PDB ID
• Detailed description of many structure properties, e.g.:
  • EC number (Enzyme Commission number)
  • Chains & ligands and their interactions
  • Secondary structure
  • FASTA sequence of the structure…
  • …
PDBsum http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/ Search by PDB ID, free text, or sequence
PDBsum Useful tabs UniProt accession Chains & ligands
PDBsum Protein tab Secondary structure (from the PDB)
PDBsum Ligand tab The ligand’s structure
Databases presentation
• Summary
  • UniProt: search by UniProt accession or SwissProt ID.
  • NCBI: search by UniProt accession, SwissProt ID, GI (for NCBI) or free text.
  • RCSB: search by PDB ID (or free text).
  • PDBsum: search by PDB ID, UniProt accession (or free text…).
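Since each database expects a different kind of identifier, it can help to recognise which kind you are holding. The sketch below guesses the identifier type from its shape alone. `candidate_types` is a hypothetical helper. The accession pattern follows the one UniProt documents for its accession numbers. The patterns for SwissProt IDs, PDB IDs and GI numbers are simplifying assumptions. A string can fit more than one pattern, so the function returns every match rather than a single answer.

```python
import re

# Patterns are shape checks only; they do not confirm that an entry exists.
PATTERNS = {
    "UniProt accession": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    # Assumed shape: NAME_SPECIES, e.g. MTH1_HAEPH.
    "SwissProt ID": re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}"),
    # Assumed shape: one digit followed by three letters or digits, e.g. 10MH.
    "PDB ID": re.compile(r"[0-9][A-Z0-9]{3}"),
    # Assumed shape: digits only.
    "GI number": re.compile(r"[0-9]+"),
}


def candidate_types(identifier):
    """Return the names of all identifier types whose pattern fits the string."""
    token = identifier.strip().upper()
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(token)]


for example in ("P05102", "MTH1_HAEPH", "10mh", "free text"):
    print(example, "->", candidate_types(example) or ["free text search"])
```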
Databases presentation
• Buzz words from this exercise
  • Protein: FASTA format, PDB file, resolution, ligand, chain, UniProt accession, SwissProt ID, GI number…
  • Databases: RCSB, PDB, PDBsum, UniProt, TrEMBL, SwissProt, NCBI
Questions? GOOD LUCK!