Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

“Resources of Biomolecular Data: Sequences, Structures and Functionality”, PhD course #27803. Nikolaj Blom, Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, nikob@cbs.dtu.dk.

Presentation Transcript


  1. ”Resources of Biomolecular Data: Sequences, Structures and Functionality” PhD course #27803 Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark nikob@cbs.dtu.dk

  2. Outline • Magnitudes and Scales • Resources: Data Sources & Tools • Primary DNA sources • Sequence Repositories • Structure Repositories • Functional Categorization • Integration of Databases • The Human Genome • Genome Browsers • Prediction Tools • Evaluation of Prediction Servers • Starting points • Link collections

  3. Resources: Sources & Tools • There are A LOT OF biomolecular databases/sources • A LOT OF overlap of information/redundancy • A LOT OF TOOLS • Personal picks/preferences • User-friendliness • Update intervals • Curation efforts / error correction • Linkage to other DBs

  4. Faster than Moore’s law...

  5. Human Genome Published HUGO: Nature, 15.feb.2001 Celera: Science, 16.feb.2001

  6. Magnitudes and Scales • Human genome 3,200,000,000 bp • Single basepair → full genome is 9 orders of magnitude • Genome = Football field: ~3 billion blades of grass • Single base A T G C (or SNP) = 1 blade of grass • Genome browsing • Zooming from whole stadium to single blade
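The scale comparison on this slide can be checked with a line of arithmetic. The sketch below assumes only the 3,200,000,000 bp figure quoted above; the function name is illustrative.

```python
import math

GENOME_BP = 3_200_000_000  # human genome size quoted on the slide

def orders_of_magnitude(small, large):
    """Return log10 of the ratio between two scales."""
    return math.log10(large / small)

# From a single base pair to the whole genome
span = orders_of_magnitude(1, GENOME_BP)
print(f"1 bp -> whole genome: {span:.1f} orders of magnitude")
```

The result lies between 9 and 10, which matches the slide's "9 orders of magnitude" and the picture of roughly 3 billion blades of grass on the field.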

  7. How we got the sequence • Sanger chain termination method

  8. Primary DNA sources • Trace files repositories • Single read: 500-1000 bp (~golf ball size/ jig saw puzzle) • Variable quality • WashU-Merck Human EST Project / Trace files • ”Base-calling” non-trivial
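To get a feel for the size of the jigsaw puzzle, one can estimate how many single reads are needed to span the genome. This is a back-of-the-envelope sketch: the read lengths come from the slide, while the `coverage` parameter (how many times each base is sequenced on average) is an assumption added for illustration and not a figure from the course.

```python
import math

GENOME_BP = 3_200_000_000  # human genome size from the "Magnitudes and Scales" slide

def reads_for_coverage(genome_bp, read_len, coverage=1):
    """Minimum number of reads of length read_len for the given average coverage."""
    return math.ceil(genome_bp * coverage / read_len)

for read_len in (500, 1000):
    n = reads_for_coverage(GENOME_BP, read_len)
    print(f"{read_len} bp reads, 1x coverage: {n:,} reads")
```

Even at 1x coverage this gives millions of reads, and any realistic project needs several-fold coverage because reads overlap unevenly and vary in quality.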

  9. Assembly is Non-trivial!
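A toy illustration of why assembly is hard: the simplest building block is finding the overlap between the end of one read and the start of another. The sketch below is not how any production assembler works; the function names, the `min_len` threshold and the example reads are all made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Merge b onto the end of a if they overlap, else return None."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

# Two made-up reads sharing the bases "CGT"
print(merge("ATGCGT", "CGTTAA"))
```

With real data the reads contain base-calling errors and the genome contains repeats, so many reads overlap equally well in several places; exact matching as shown here is not sufficient.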

  10. Sequence repositories - GenBank et al. • GenBank / EMBL / DDBJ • Highly redundant (many versions of same gene) • Cross-updated daily • Version history is recorded • Previous sequence records can be retrieved • Contigs/HTGS (100-200 kb) finishing at different stages • Draft → Finished • Includes genomic DNA, cDNA, ESTs, translated peptides
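Version history shows up in the identifiers themselves: GenBank-style sequence identifiers take the form ACCESSION.VERSION, where the version number increases when the sequence changes. A minimal sketch of handling such identifiers follows; the accession strings are invented for illustration and do not refer to real records.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier carries no version suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)

def latest(identifiers):
    """Return the identifier with the highest version among versions of one accession."""
    return max(identifiers, key=lambda i: split_accession(i)[1] or 0)

# Made-up identifiers for three versions of the same record
versions = ["XX000001.1", "XX000001.3", "XX000001.2"]
print(split_accession(versions[0]))
print(latest(versions))
```

Keeping the version suffix when recording which sequence was analysed makes it possible to retrieve exactly that record later, even after the entry has been updated.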

  11. Non-redundant and Curated databases • Non-redundant • Manual or automatic curation • DNA • RefSeq (NCBI; semi-automated) • Ensembl gene index (automated) • Protein • RefSeq (NCBI; semi-automated) • TrEMBL (EMBL; automated)

  12. Curated database: UniProt/SwissProt • SIB - Swiss Institute of Bioinformatics • Protein Knowledgebase / Sequence Database • Highly curated • Experimental evidence evaluated (e.g. modifications) • All 80,000 entries checked by Amos Bairoch himself ;-) • ExPASy - Expert Protein Analysis System • Proteomics tools: links + local servers

  13. Structure databases / Protein Data Bank (PDB) • X-ray, NMR biomolecular structures • Protein Data Bank (PDB) • >22,000 structures (April 2003) • http://www.rcsb.org/pdb/

  14. Functional Categorization • Gene Ontology (GO) • Hierarchical • Controlled vocabulary

  15. Functional Categorization • Gene Ontology (GO) http://www.geneontology.org/ • Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase • Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions • Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
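"Hierarchical" means that annotating a gene product with a specific term implies all of that term's more general ancestors. In GO a term may have more than one parent, so the structure is a graph rather than a simple tree. The sketch below uses a small hand-made hierarchy loosely inspired by the Molecular Function examples above; the parent links are illustrative assumptions and are not copied from the GO files.

```python
# Toy hierarchy: each term maps to its more general parent terms (illustrative only)
TOY_PARENTS = {
    "DNA helicase": ["helicase", "DNA binding"],
    "helicase": ["catalytic activity"],
    "DNA binding": ["binding"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
    "molecular function": [],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

print(sorted(ancestors("DNA helicase", TOY_PARENTS)))
```

A query for everything annotated with "binding" would therefore also return gene products annotated only with the more specific "DNA helicase" term, which is the practical benefit of a controlled, hierarchical vocabulary.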

  16. Integration of databases - Webs of web-sites • Links, links, links... • SRS = Sequence Retrieval System http://srs.ebi.ac.uk/ • Powerful, complex query language • BioDAS – Distributed Annotation System

  17. For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs)? (Prediction servers) • (Evaluate the value of predicted features)

  18. GeneCards http://nciarray.nci.nih.gov/cards/

  19. GeneCards-II

  20. GeneCards-III

  21. GeneCards-IV

  22. GeneCards-V

  23. Genetic/Medical Information • OMIM, Online Mendelian Inheritance in Man (NCBI) • The OMIM database is a catalog of human genes and genetic disorders • >13,000 entries (April, 2002) • Examples: cystic fibrosis, prions, amyloid precursor protein • Condensed, highly curated descriptions of genetics/disease/animal models/references

  24. OMIM-I (http://www3.ncbi.nlm.nih.gov/Omim/)

  25. OMIM-II

  26. OMIM-III

  27. For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs)? (Prediction servers) • (Evaluate the value of predicted features)

  28. Genome Browsing • Three public • Open access • Use same genome build/assembly • NCBI (U.S.) • UCSC (Santa Cruz, U.S.) • EnsEmbl (EBI, EU) • One private • Restricted, commercial • Academic, free usage: 1 Mbase/week • Proprietary assembly • Celera Genomics (U.S.)

  29. Celera Human/Mouse Genomes

  30. Genome Browsers - Portals to the Genomic World • NCBI – National Center for Biotechnology Information (U.S.) • http://www.ncbi.nlm.nih.gov/Genomes/index.html • UCSC – Univ. California – Santa Cruz (U.S.) • http://genome.ucsc.edu/ • EnsEmbl – joint project of EMBL-EBI and the Sanger Institute (E.U.) • http://www.ensembl.org/

  31. NCBI

  32. NCBI

  33. UCSC – Genome Browser

  34. UCSC – Genome Browser II

  35. EnsEmbl – Genome Browser

  36. EnsEmbl – Genome Browser

  37. EnsEmbl – Genome Browser

  38. EnsEmbl – Genome Browser

  39. EnsEmbl – Genome Browser

  40. EnsEmbl – Genome Browser

  41. For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs) or Gene Structure? (Prediction servers) • ...and evaluate the reliability of prediction methods

  42. CBS Services/Toolbox http://www.cbs.dtu.dk/services/

  43. NetPhos – a prediction server http://www.cbs.dtu.dk/services/NetPhos/

  44. NetPhos – a prediction server

  45. Evaluating Prediction Servers • Performance on independent/cross-validated data presented? • Published in peer-reviewed journal? • Cited by others? • Science Citation Index • Linked to from credible web sites? • Google Page-rank • ”link:URL” search
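When a server does report performance on independent or cross-validated data, it is usually given as counts of correct and incorrect predictions, or as measures derived from them. The sketch below computes three common measures (sensitivity, specificity and the Matthews correlation coefficient) from such counts. The choice of measures and the example numbers are assumptions for illustration and are not taken from any particular server.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    Assumes the test set contains at least one positive and one negative example.
    """
    sensitivity = tp / (tp + fn)  # fraction of true sites that were predicted
    specificity = tn / (tn + fp)  # fraction of non-sites correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc

# Hypothetical test set: 100 real sites, 900 non-sites
sens, spec, mcc = evaluate(tp=80, fp=30, tn=870, fn=20)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.2f}")
```

A single summary such as the correlation coefficient is useful here because sites are typically rare: a predictor that always answers "no site" gets high specificity yet is useless, and its correlation coefficient is 0.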

  46. Evaluating Prediction Servers

  47. 2can Bioinformatics Education • At EBI – European Bioinformatics Institute • http://www.ebi.ac.uk/2can/index.html • Tutorials, resource links, etc.

  48. Starting Points • General Bioinformatics • NCBI, National Center for Biotechnology Information, U.S. • EBI, European Bioinformatics Institute • Prediction Tools • CBS, DK • Expasy (Protein analysis), Switzerland
