Bioinformatics

Bioinformatics: Biological Databases. Revised 17/09/13. Database architecture: what should be stored, and how should it be stored? It refers to the manner in which the entries in a database are organized for archiving and easy retrieval (queries). Relational database.

Presentation Transcript


  1. Bioinformatics: Biological Databases (revised 17/09/13)

  2. Database architecture • What should be stored? • How should it be stored? • Refers to the manner in which the entries in a database are organized • for archiving • for easy retrieval (queries)

  3. Relational database • Data are stored in tables • Relationships between records can be many-to-one or many-to-many; in the latter case an index is required • All records in a table have identical features • A record is identified by its table and record identifier • For each new feature we need a new table (Navarro et al., 2003)
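
A minimal sketch of the relational model described on this slide, using Python's built-in `sqlite3` module. The table names, column names and example rows are hypothetical and chosen only for illustration; the many-to-many relationship is resolved through a separate link table, which is one common reading of the "index" mentioned above.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE modification (id INTEGER PRIMARY KEY, name TEXT);
-- Link table resolving the many-to-many relationship
CREATE TABLE protein_modification (
    protein_id INTEGER REFERENCES protein(id),
    modification_id INTEGER REFERENCES modification(id),
    PRIMARY KEY (protein_id, modification_id)
);
""")
conn.executemany("INSERT INTO protein VALUES (?, ?)",
                 [(1, "Protein1"), (2, "Protein2")])
conn.executemany("INSERT INTO modification VALUES (?, ?)",
                 [(1, "pTyr"), (2, "pSer")])
conn.executemany("INSERT INTO protein_modification VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])


def modifications_of(protein_name):
    """Return the modification names linked to one protein, sorted by name."""
    rows = conn.execute(
        """
        SELECT m.name
        FROM protein AS p
        JOIN protein_modification AS pm ON pm.protein_id = p.id
        JOIN modification AS m ON m.id = pm.modification_id
        WHERE p.name = ?
        ORDER BY m.name
        """,
        (protein_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Each record is addressed by its table plus its identifier, and adding a new kind of feature means adding a new table, as the slide notes.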

  4. Object-oriented database • A record is defined by its entire hierarchy, e.g. pTyr: Root/Proteins/Protein1/Modifications/pTyr • Relationships between records are of a parent/child type • Easy to update automatically (Navarro et al., 2003)
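
The hierarchical addressing on this slide can be sketched with nested Python dictionaries. This is only an illustration of the parent/child idea, not a real object database; the stored attribute (`residue`) is a hypothetical payload.

```python
# Nested dictionaries stand in for the parent/child hierarchy.
# The "residue" attribute is a made-up payload for illustration.
tree = {
    "Root": {
        "Proteins": {
            "Protein1": {
                "Modifications": {
                    "pTyr": {"residue": "tyrosine"},
                },
            },
        },
    },
}


def resolve(node, path):
    """Follow a slash-separated path from the top of the hierarchy."""
    for part in path.strip("/").split("/"):
        node = node[part]  # raises KeyError if the path does not exist
    return node


record = resolve(tree, "Root/Proteins/Protein1/Modifications/pTyr")
```

The full path is what identifies the record, which is the contrast with the table-plus-identifier addressing of a relational database.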

  5. Standardization • Requires standardized data formats • MIAME (microarray data) • HAWK (sequence data) • Requires intelligent knowledge bases

  6. Introduction • Repository databases • Redundant • Both high- and low-quality entries • Cutting-edge information • Curated databases • Manual and automatic curation • Organization of information is important • But mainly annotated entries • An attempt to be non-redundant • Comprehensive in some cases

  7. Sequence databases

  8. Sequence formats • A sequence file needs to be recognized by a computer program, so special formats have been invented • FASTA • GenBank
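
To show why a fixed format matters to a program, here is a minimal FASTA reader sketch. It assumes the usual convention that a header line starts with `>` and that sequence lines follow until the next header; the example record is made up.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Hypothetical two-record example
example = """>seq1 example record
ACGTACGT
ACGT
>seq2
TTTT
"""
fasta_records = parse_fasta(example)
```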

  9. Sequence formats: GenBank

  10. Sequence repositories at NCBI • http://www.ncbi.nih.gov/Database/index.html • GenBank uses a relational model • New sequences can be submitted through a submission page • GenBank also accepts submission of sequences with a high error rate, and provides curated databases (99% accuracy) • 200,000 users a day, 4 million queries a day

  11. NCBI

  12. NCBI Repository databases

  13. Sequence retrieval at NCBI through Entrez • Entrez, a resource prepared by NCBI, is used to retrieve DNA or protein sequences or Medline records from the databases at NCBI.
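
Besides the web interface, Entrez can be queried programmatically through NCBI's E-utilities, which are not mentioned on the slides and are brought in here only as an illustration. The sketch below merely builds an `efetch` URL and does not contact the server; the accession used in the usage note is an example value.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an addition to the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, accession, rettype="fasta"):
    """Build an efetch URL that would return one record as plain text."""
    params = {
        "db": db,            # e.g. "nucleotide" or "protein"
        "id": accession,     # accession or UID of the record
        "rettype": rettype,  # e.g. "fasta" or "gb"
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

For example, `efetch_url("nucleotide", "NM_000546")` yields a URL that could be opened in a browser or passed to `urllib.request.urlopen`.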

  14. Sequence repositories at NCBI: GenBank • Large number of redundant entries => need for a comprehensive database

  15. Limits in Entrez: restrict a search and allow complex queries
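
A complex Entrez query is typically written as field-restricted terms joined by Boolean operators. The sketch below assembles such a query string and the corresponding E-utilities `esearch` URL; the helper names are hypothetical, the E-utilities are an addition to the slides, and the gene and organism are example values.

```python
from urllib.parse import urlencode


def entrez_term(clauses, operator="AND"):
    """Join (value, field) pairs into a query such as 'x[Field] AND y[Field]'."""
    return f" {operator} ".join(f"{value}[{field}]" for value, field in clauses)


def esearch_url(db, term):
    """Build an esearch URL for the query (no request is sent here)."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return f"{base}?{urlencode({'db': db, 'term': term})}"


term = entrez_term([("Homo sapiens", "Organism"), ("BRCA1", "Gene Name")])
url = esearch_url("nucleotide", term)
```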

  16. GenBank format: DNA sequence
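
As a rough illustration of the GenBank flat-file layout, the sketch below reads a few header fields and the sequence from a tiny, entirely hypothetical record. It handles only single-line `LOCUS`, `DEFINITION` and `ACCESSION` fields plus the `ORIGIN` block, so it is far from a complete parser.

```python
def parse_genbank(text):
    """Extract a few header fields and the sequence from one GenBank record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):       # end-of-record marker
            break
        if in_origin:
            # Sequence lines carry a position number and spaced blocks.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith(("LOCUS", "DEFINITION", "ACCESSION")):
            key, _, value = line.partition(" ")
            record[key.lower()] = value.strip()
    return record


# Hypothetical record, not a real GenBank entry
example_gb = """LOCUS       EXAMPLE1       20 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE1
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""
gb_record = parse_genbank(example_gb)
```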

  17. GenBank format: protein sequence

  18. Sequence repositories at NCBI: EST (diagram: DNA, transcription, mRNA, translation, protein)

  19. EST • ESTs represent first-pass sequences with an error rate as high as 1 in 100, including incorrectly identified bases and insertions • http://www.ncbi.nlm.nih.gov/dbEST/
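
To get a feel for what an error rate of 1 in 100 means, the sketch below computes the expected number of errors in a read and the chance that a read contains no error at all. It assumes errors occur independently at every base, which is a simplification, and the read length of 500 bases is a hypothetical value.

```python
def expected_errors(read_length, error_rate=0.01):
    """Expected number of erroneous bases in one read."""
    return read_length * error_rate


def prob_error_free(read_length, error_rate=0.01):
    """Probability that a read has no error, assuming independent errors."""
    return (1.0 - error_rate) ** read_length


# Hypothetical 500-base read at the 1-in-100 rate quoted on the slide
errors_500 = expected_errors(500)
clean_500 = prob_error_free(500)
```

Under these assumptions a 500-base read is expected to carry about five errors and is very unlikely to be entirely correct, which is why single ESTs are treated as low-quality evidence.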

  20. EST • Aid in gene prediction: extrinsic gene-finding methods (Fielden et al., 2002)

  21. Comprehensive databases • Curated databases • UniGene (NCBI): automatic partitioning of GenBank into a non-redundant set of gene-oriented clusters • RefSeq (NCBI) • Ensembl/VEGA (EBI) • Integrate the information such that, for a locus in the genome, a complete description is given that is no longer redundant • Provide a comprehensive non-redundant set of sequences, including genomic DNA, transcript and protein products, for major research organisms

  22. Comprehensive DB: UniGene

  23. UniGene

  24. Comprehensive DB: UniGene • UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters • Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location • Sequences are assigned to the same cluster, and so to the same gene, based on the alignment of EST sequences with each other and with the genome sequence of the organism • No attempt has been made to produce contigs • Splicing variants of a gene are put into the same set • Moreover, EST-containing sets often contain 5' and 3' reads from the same cDNA clone, but these sequences do not always overlap

  25. UniGene • As more overlapping sequences are added, the number of clusters for an organism decreases
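
Why the cluster count drops as overlapping sequences arrive can be shown with a toy single-linkage clustering. This is not the UniGene procedure: a shared k-mer is used here as a crude stand-in for a real alignment, and the sequences are artificial.

```python
def count_clusters(seqs, k=8):
    """Count clusters when sequences sharing any k-mer are joined (union-find)."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first sequence containing it
    for idx, seq in enumerate(seqs):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx
    return len({find(i) for i in range(len(seqs))})


# Artificial "transcript" and three reads taken from it
transcript = "AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT"
read_5prime = transcript[:16]
read_3prime = transcript[16:]
bridging_read = transcript[8:24]

before = count_clusters([read_5prime, read_3prime])
after = count_clusters([read_5prime, read_3prime, bridging_read])
```

The two non-overlapping reads start out as separate clusters; once a read spanning both is added, they collapse into one. No consensus sequence is built, in line with the earlier note that no contigs are produced.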

  26. Comprehensive DB: UniGene

  27. Comprehensive DB: RefSeq • For a particular gene, many independent redundant records might exist in GenBank • All this information is integrated such that, for a particular locus in the genome, a complete description is given that is no longer redundant: the LocusLink • Redundant GenBank entries, e.g. those representing partial evidence for the transcript of a gene (incomplete cDNA sequences, ESTs), are unified into a single RefSeq that represents the complete transcript • A RefSeq sequence can be • a protein (starting with NP_) • a genomic sequence (starting with NG_) • All RefSeq sequences that belong to the same locus on the genome receive the same LocusLink • Additional links are made to other databases containing functional annotation or further information (e.g. Gene Ontology, KEGG, …)
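
The accession prefixes listed on this slide make it easy to tell record types apart in code. The sketch below covers only the two prefixes named above (RefSeq defines more), and the accessions in the usage lines are example values.

```python
# Only the prefixes named on the slide; RefSeq defines further ones.
REFSEQ_PREFIXES = {
    "NP_": "protein",
    "NG_": "genomic sequence",
}


def refseq_type(accession):
    """Classify a RefSeq accession by its prefix, or return 'unknown'."""
    for prefix, kind in REFSEQ_PREFIXES.items():
        if accession.startswith(prefix):
            return kind
    return "unknown"


kind_protein = refseq_type("NP_000537")
kind_genomic = refseq_type("NG_017013")
```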

  28. RefSeq

  29. Gene: RefSeq

  30. Comprehensive DB: Ensembl

  31. Comprehensive DB: Ensembl (pipeline diagram; labels: human proteins (Swiss-Prot), other proteins, cDNA, EST; Genewise, BLAST, Exonerate; add UTR; ab initio gene prediction (Genscan); cluster and merge (UniGene); add variants; EST genes; genes)

  32. Comprehensive DB: Ensembl • Automatic pipeline of Ensembl

  33. Ensembl • Ab initio gene scan: does not use protein/cDNA/EST evidence • As more genomes become available, gene predictions will improve • Ensembl: 70-75% of genes annotated • EST genes are used to help predict UTRs and splice variants • Problem with automatic annotation: pseudogenes • Processed pseudogene (with poly-A tail) • Unprocessed pseudogene (rearrangement, duplication)

  34. Ensembl • AUTOMATIC: weeks; uses draft sequence; no pseudogenes • MANUAL: months; needs finished sequence; pseudogenes; consults public databases/literature • Ensembl: automatic analysis flow • VEGA (Vertebrate Genome Annotation database): manual curation • RefSeq: best curated database for cDNAs (no integration with ESTs, in contrast to VEGA)

  35. Vega

  36. Other databases

  37. Expression databases • Microarray databases: • SMD (Stanford) • MIAMExpress (EBI) • GEO (NCBI) • SAGE database • EST-based expression database • Proteome database

  38. SGD

  39. SGD

  40. SGD

  41. DDD (Digital Differential Display) • http://www.ncbi.nlm.nih.gov/UniGene/ddd.cgi?ORG=Hs

  42. Pathway database: KEGG

  43. Ontologies • Controlled vocabularies • Tree structured • Describe gene products and associated processes • Species independent • Gene Ontology • EcoCyc

  44. Ontologies: GO (Gene Ontology) • Organizes biological information about protein classes and functions into a hierarchical classification using a controlled vocabulary • http://www.ensembl.org/Homo_sapiens/goview?query=GO%3A0003700
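
A hierarchical controlled vocabulary can be sketched as a mapping from each term to its parent. The four terms below form a simplified, hypothetical hierarchy for illustration only; they are not taken from the actual Gene Ontology files.

```python
# Simplified, hypothetical hierarchy: each term points to its parent term.
PARENT = {
    "molecular function": None,  # root term
    "binding": "molecular function",
    "nucleic acid binding": "binding",
    "DNA binding": "nucleic acid binding",
}


def lineage(term):
    """Return the path from the root down to the given term."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return list(reversed(path))


dna_binding_path = lineage("DNA binding")
```

Annotating a gene product with a specific term then implies every broader term on its path, which is what makes queries at different levels of detail possible.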

  45. GO

  46. GO

  47. GO

  48. GO

  49. GO
