“Resources of Biomolecular Data: Sequences, Structures and Functionality” PhD course #27803 • Nikolaj Blom • Center for Biological Sequence Analysis • BioCentrum-DTU • Technical University of Denmark • nikob@cbs.dtu.dk
Outline • Magnitudes and Scales • Resources: Data Sources & Tools • Primary DNA sources • Sequence Repositories • Structure Repositories • Functional Categorization • Integration of Databases • The Human Genome • Genome Browsers • Prediction Tools • Evaluation of Prediction Servers • Starting points • Link collections
Resources: Sources & Tools • There are A LOT OF biomolecular databases/sources • A LOT OF overlap of information/redundancy • A LOT OF TOOLS • Personal picks/preferences are based on: • User-friendliness • Update intervals • Curation efforts / error correction • Linkage to other DBs
Human Genome Published • Public consortium (International Human Genome Sequencing Consortium): Nature, 15 Feb 2001 • Celera: Science, 16 Feb 2001
Magnitudes and Scales • Human genome: 3,200,000,000 bp • From a single base pair to the full genome spans about 9 orders of magnitude • Genome = football field: ~3 billion leaves of grass • Single base A, T, G or C (or SNP) = 1 leaf of grass • Genome browsing • Zooming from the whole stadium to a single leaf
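The scale argument on this slide can be checked with a few lines of arithmetic. The sketch below uses only the genome size quoted above; the tenfold zoom step is an assumption chosen for illustration and does not describe how any particular genome browser zooms.

```python
import math

GENOME_BP = 3_200_000_000  # human genome size quoted on the slide

# Orders of magnitude between one base pair and the whole genome.
orders = math.log10(GENOME_BP)
print(f"log10(genome size) = {orders:.2f}")

# Zoom out from a single base, widening the window tenfold per step
# (assumed step size), until the window covers the whole genome.
window = 1
steps = 0
while window < GENOME_BP:
    window *= 10
    steps += 1
print(f"{steps} tenfold zoom steps from one base to the full genome")
```

The logarithm lands between 9 and 10, which is where the "9 orders of magnitude" figure comes from.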
How we got the sequence • Sanger chain termination method
Primary DNA sources • Trace file repositories • Single read: 500-1000 bp (~golf ball size / jigsaw puzzle piece) • Variable quality • WashU-Merck Human EST Project / Trace files • “Base-calling” is non-trivial
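To get a feel for how many jigsaw pieces are involved, the read lengths above can be combined with the genome size from the earlier slide. This is a rough sketch: the coverage depth of 8 is an assumed, illustrative value rather than a figure from the slides, and the calculation ignores overlaps, gaps and read quality.

```python
GENOME_BP = 3_200_000_000   # genome size from the slides
READ_LENGTHS = (500, 1000)  # single-read length range from the slides
ASSUMED_COVERAGE = 8        # illustrative depth, not taken from the slides


def reads_needed(genome_bp, read_bp, coverage=1):
    """Smallest number of reads whose total length is coverage x genome."""
    total_bp = genome_bp * coverage
    return (total_bp + read_bp - 1) // read_bp  # integer ceiling division


for read_bp in READ_LENGTHS:
    once = reads_needed(GENOME_BP, read_bp)
    deep = reads_needed(GENOME_BP, read_bp, ASSUMED_COVERAGE)
    print(f"{read_bp} bp reads: {once:,} for 1x, {deep:,} for {ASSUMED_COVERAGE}x")
```

Even at single coverage the count is in the millions, which is why assembly and base-calling quality matter so much.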
Sequence repositories - GenBank et al. • GenBank / EMBL / DDBJ • Highly redundant (many versions of the same gene) • Cross-updated daily • Version history is recorded • Previous sequence records can be retrieved • Contigs/HTGS (100-200 kb) at different stages of finishing • Draft → Finished • Includes genomic DNA, cDNA, ESTs, translated peptides
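Version history shows up in the identifiers themselves: records are referred to as ACCESSION.VERSION, and the version number increases when the sequence changes. The sketch below only manipulates such strings. The identifier XX000000.3 is a made-up placeholder used for its format, and the helper names are hypothetical.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version as int)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an ACCESSION.VERSION identifier: {identifier!r}")
    return accession, int(version)


def earlier_versions(identifier):
    """List the identifiers of all versions that precede this one."""
    accession, version = split_accession(identifier)
    return [f"{accession}.{v}" for v in range(1, version)]


# Placeholder identifier, illustrative only.
print(split_accession("XX000000.3"))
print(earlier_versions("XX000000.3"))
```

Quoting the full identifier, including the version, makes it unambiguous which sequence an analysis was run on.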
Non-redundant and Curated databases • Non-redundant • Manual or automatic curation • DNA • RefSeq (NCBI; semi-automated) • Ensembl gene index (automated) • Protein • RefSeq (NCBI; semi-automated) • TrEMBL (EMBL; automated)
Curated database: UniProt/SwissProt • SIB - Swiss Institute of Bioinformatics • Protein Knowledgebase / Sequence Database • Highly curated • Experimental evidence evaluated (e.g. modifications) • All 80,000 entries checked by Amos Bairoch himself ;-) • ExPASy - Expert Protein Analysis System • Proteomics tools: links + local servers
Structure databases / Protein Data Bank (PDB) • X-ray, NMR biomolecular structures • Protein Data Bank (PDB) • >22,000 structures (April 2003) • http://www.rcsb.org/pdb/
Functional Categorization • Gene Ontology (GO) • Hierarchical • Controlled vocabulary
Functional Categorization • Gene Ontology (GO) http://www.geneontology.org/ • Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase • Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions • Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
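A hierarchical controlled vocabulary means that annotating a gene product with a specific term implies all of that term's more general ancestors. The sketch below walks a toy hierarchy to collect them. The term names are borrowed from the examples above, but the parent links and intermediate terms are invented for illustration and are not copied from the actual ontology.

```python
# Toy child -> parents map. A term may have more than one parent,
# so the structure is a graph rather than a simple tree.
# All links below are illustrative assumptions, not real GO relations.
TOY_PARENTS = {
    "DNA helicase": ["helicase", "DNA binding"],
    "helicase": ["molecular function"],
    "DNA binding": ["molecular function"],
    "transcription factor": ["DNA binding"],
    "telomere": ["chromosome"],
    "chromosome": ["cellular component"],
    "nucleus": ["cellular component"],
    "mitosis": ["biological process"],
}


def ancestors(term, parents=TOY_PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("DNA helicase")))
print(sorted(ancestors("telomere")))
```

A query for the general term can then also return gene products annotated only with one of its more specific descendants.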
Integration of databases - Webs of web-sites • Links, links, links... • SRS = Sequence Retrieval System http://srs.ebi.ac.uk/ • Powerful, complex query language • BioDAS – Distributed Annotation System
For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs)? (Prediction servers) • (Evaluate the value of predicted features)
Genetic/Medical Information • OMIM, Online Mendelian Inheritance in Man (NCBI) • The OMIM database is a catalog of human genes and genetic disorders • >13,000 entries (April, 2002) • Examples: cystic fibrosis, prions, amyloid precursor protein • Condensed, highly curated descriptions of genetics/disease/animal models/references
For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs)? (Prediction servers) • (Evaluate the value of predicted features)
Genome Browsing • Three public • Open access • Use same genome build/assembly • NCBI (U.S.) • UCSC (Santa Cruz, U.S.) • Ensembl (EBI, EU) • One private • Restricted, commercial • Academic, free usage: 1 Mbase/week • Proprietary assembly • Celera Genomics (U.S.)
Genome Browsers - Portals to the Genomic World • NCBI – National Center for Biotechnology Information (U.S.) • http://www.ncbi.nlm.nih.gov/Genomes/index.html • UCSC – Univ. California – Santa Cruz (U.S.) • http://genome.ucsc.edu/ • Ensembl – EMBL-EBI and the Sanger Institute (Europe) • http://www.ensembl.org/
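Browsing a genome comes down to choosing a window of coordinates and narrowing or widening it. The sketch below zooms a window around its midpoint and formats it as a region string. The chromosome, coordinates and assembly name are placeholders. The URL builder assumes the UCSC browser's `hgTracks` page with `db` and `position` query parameters, so treat it as an illustration rather than a reference for that interface.

```python
from urllib.parse import urlencode


def zoom_in(start, end, factor):
    """Return a window around the same midpoint, 'factor' times narrower."""
    width = end - start + 1
    new_width = max(1, width // factor)
    mid = (start + end) // 2
    new_start = max(1, mid - new_width // 2)
    return new_start, new_start + new_width - 1


def region(chrom, start, end):
    """Format a window as a 'chrN:start-end' region string."""
    return f"chr{chrom}:{start}-{end}"


def ucsc_url(assembly, chrom, start, end):
    """Build a browser link (assumed URL layout, see the lead-in)."""
    query = urlencode({"db": assembly, "position": region(chrom, start, end)})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


# Placeholder window: the first megabase of a chromosome.
window = (1, 1_000_000)
while window[1] - window[0] + 1 > 1:
    print(region("7", *window))
    window = zoom_in(*window, factor=10)
print(region("7", *window))           # a single-base window
print(ucsc_url("hg38", "7", 1, 1_000_000))
```

Each pass of the loop corresponds to one step of the stadium-to-leaf zoom described earlier.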
For ’my gene’, how do I: • Get an overview of the sequence information known? (GeneCards) • Examine the ’Genome Neighbourhood’? (Genome Browsers) • Predict protein post-translational modifications (PTMs) or Gene Structure? (Prediction servers) • ...and evaluate the reliability of prediction methods
NetPhos – a prediction server http://www.cbs.dtu.dk/services/NetPhos/
Evaluating Prediction Servers • Is performance on independent/cross-validated data presented? • Published in a peer-reviewed journal? • Cited by others? • Science Citation Index • Linked to from credible web sites? • Google PageRank • “link:URL” search
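When a server does report performance on independent data, the numbers usually reduce to counts of correct and incorrect predictions. The sketch below computes sensitivity, specificity and the Matthews correlation coefficient from such counts. These measures are a common choice for binary predictors, not something the slides prescribe, and the example counts are invented for illustration.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Invented counts for a hypothetical predictor on an independent test set:
# 100 true sites and 900 non-sites.
sens, spec, mcc = evaluate(tp=80, fp=30, tn=870, fn=20)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.2f}")
```

With few true sites among many candidates, a single summary such as the correlation coefficient is harder to inflate than accuracy alone, which is one reason to look for it in a publication.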
2can Bioinformatics Education • At EBI – European Bioinformatics Institute • http://www.ebi.ac.uk/2can/index.html • Tutorials, resource links, etc.
Starting Points • General Bioinformatics • NCBI, National Center for Biotechnology Information, U.S. • EBI, European Bioinformatics Institute • Prediction Tools • CBS, DK • ExPASy (Protein analysis), Switzerland