Databases Vasileios Hatzivassiloglou University of Texas at Dallas
Databases • Massive independently developed databases • Sponsored by national institutes of biology/bioinformatics/health in the U.S., Europe, and Japan • Allow for search • Allow for entry of information by researchers, subject to curation • Cross-linked
GenBank • Developed and maintained by the U.S. National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH) • Repository of gene information • Provides DNA and literature search, comparison tools
GenBank statistics • Dynamically updated • A new version released as a flat file every two months • Current version is 167.0 (15 August 2008) • 92.7 million sequences (76.1 million one year ago, 61.1 million two years ago) • 95.0 billion base pairs (79.5 billion one year ago, 65.3 billion two years ago)
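The figures on this slide imply a yearly growth rate and an average record length. A back-of-the-envelope check, using only the numbers quoted above (the helper name `growth_pct` is made up for this illustration):

```python
# Figures quoted on the slide: GenBank release 167.0 and one year earlier.
sequences_millions = {"now": 92.7, "one_year_ago": 76.1}
base_pairs_billions = {"now": 95.0, "one_year_ago": 79.5}

def growth_pct(new, old):
    """Percentage growth from old to new."""
    return (new - old) / old * 100.0

seq_growth = growth_pct(sequences_millions["now"], sequences_millions["one_year_ago"])
bp_growth = growth_pct(base_pairs_billions["now"], base_pairs_billions["one_year_ago"])

# Average record length in base pairs: total bases divided by total sequences.
avg_len = base_pairs_billions["now"] * 1e9 / (sequences_millions["now"] * 1e6)

print(f"sequences grew {seq_growth:.1f}% over the last year")
print(f"base pairs grew {bp_growth:.1f}% over the last year")
print(f"average record length is about {avg_len:.0f} bp")
```

Both counts grew by roughly a fifth in a year, and the average record is on the order of a thousand base pairs.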
Sample GenBank record • LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999 • DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. • ACCESSION U49845 • VERSION U49845.1 GI:1293613 • KEYWORDS . • SOURCE Saccharomyces cerevisiae (baker's yeast) • ORGANISM Saccharomyces cerevisiae – Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. • REFERENCE 1 (bases 1 to 5028) AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W. TITLE Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae JOURNAL Yeast 10 (11), 1503-1509 (1994) MEDLINE 95176709 PUBMED 7871890 • REFERENCE 2 (bases 1 to 5028) AUTHORS Roemer,T., ... • FEATURES (location, CDS, 5′ UTR, 3′ UTR, promoter, alternative splicing, ...) • BASE COUNT 1510 a 1074 c 835 g 1609 t • ORIGIN • 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa ...
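The ORIGIN section stores the sequence in blocks of ten bases, with each line prefixed by the position of its first base. A minimal sketch of turning those lines back into a plain sequence, using only the three lines shown on the slide (the function name `parse_origin` is hypothetical, and the full record continues to base 5028):

```python
import re
from collections import Counter

# The three ORIGIN lines shown on the slide; the real record is longer.
origin_lines = [
    "1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
    "121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa",
]

def parse_origin(lines):
    """Drop position numbers and spaces; return one uppercase sequence string."""
    return "".join(re.sub(r"[^a-zA-Z]", "", line) for line in lines).upper()

sequence = parse_origin(origin_lines)
counts = Counter(sequence)  # per-base tally, analogous to the BASE COUNT line

print(len(sequence))
print(dict(counts))
```

Applied to the complete record, the same tally should reproduce the BASE COUNT line; here it covers only the first 180 bases.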
Genome Map Viewer • Online at http://www.ncbi.nlm.nih.gov/mapview/
SwissProt • Maintained by the Swiss Institute of Bioinformatics • Central repository of proteomic data • Part of the Swiss ExPASy (Expert Protein Analysis System) • Current release 56.1 (2 Sept 2008) • 397,500 sequences (283,400 one year ago) • 143 million amino acids (104 million one year ago)
TrEMBL • The non-curated counterpart to SwissProt • Computer-annotated protein sequences awaiting curation • Current release 39.2 (2 Sept 2008) • 6.2 million proteins (4.7 million a year ago) • 2 billion amino acids (1.5 billion a year ago) • SwissProt has entries for only 6.5% of TrEMBL (6% a year ago)
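The coverage figure can be checked against the entry counts quoted for the two databases. The sketch below assumes the percentage is simply the ratio of SwissProt sequences to TrEMBL proteins; because the quoted counts are rounded, the result only approximates the slide's 6.5% and 6%:

```python
# Entry counts as quoted on the SwissProt and TrEMBL slides (rounded figures).
swissprot_now, swissprot_last_year = 397_500, 283_400
trembl_now, trembl_last_year = 6_200_000, 4_700_000

# Assumption: coverage = curated SwissProt entries / TrEMBL entries.
coverage_now = swissprot_now / trembl_now * 100
coverage_last_year = swissprot_last_year / trembl_last_year * 100

print(f"coverage now: {coverage_now:.1f}%")
print(f"coverage a year ago: {coverage_last_year:.1f}%")
```

Curation is roughly keeping pace with the growth of the non-curated pool, but it is not closing the gap.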
SwissProt record • Entry information • Entry name: IL3_HUMAN • Primary accession number: P08700 • Secondary accession numbers: None • Entered in Swiss-Prot in Release 06, January 1988; sequence was last modified in Release 12, October 1989 (revision 2); annotations were last modified in Incremental Release, July 22, 2008 (revision 95) • Name and origin of the protein • Protein name: Interleukin-3 [Precursor] • Synonyms: IL-3; Multipotential colony-stimulating factor; Hematopoietic growth factor; P-cell stimulating factor; Mast-cell growth factor; MCGF • Gene name: IL3 • From: Homo sapiens (Human) [TaxID: 9606] • Taxonomy: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo. • References: [1] NUCLEOTIDE SEQUENCE. DOI=10.1016/0378-1119(87)90254-X; PubMed=3497843. Dorssers L., Burger H., Bot F., Delwel R., Geurts van Kessel A.H.M., Loewenberg B., Wagemaker G.; "Characterization of a human multilineage-colony-stimulating factor cDNA clone identified by a conserved noncoding sequence in mouse interleukin-3."; Gene 55:115-124(1987). • Cross-references (links to databases for sequence, gene expression, 3D structure, interactions, ...) • Features (functional and structural components of the protein) • Sequence information (152 amino acids)
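The entry name above follows the Swiss-Prot convention of a protein mnemonic and a species mnemonic joined by an underscore. A small sketch of splitting such a name (the function `split_entry_name` is a hypothetical helper, not part of any SwissProt tool):

```python
def split_entry_name(entry_name):
    """Split a Swiss-Prot entry name into (protein mnemonic, species mnemonic)."""
    protein, _, species = entry_name.partition("_")
    if not protein or not species:
        raise ValueError(f"not a PROTEIN_SPECIES entry name: {entry_name!r}")
    return protein, species

protein, species = split_entry_name("IL3_HUMAN")
print(protein, species)
```

Unlike the entry name, the accession number (P08700) carries no meaning of its own, which makes it the more stable identifier for cross-links.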
Cross-References and Features • Entry for Human Interleukin-3 • http://www.expasy.org/uniprot/P08700 • Feature viewer for the same protein • http://www.expasy.org/cgi-bin/ft_viewer.pl?P08700
The Gene Ontology (GO) • Hierarchical classification of genes, cross-linked across species • Classification of related terminology • Searchable via AmiGO
Browsing with AmiGO • AmiGO starting point for browsing and search • http://www.godatabase.org/cgi-bin/amigo/go.cgi?search_constraint=terms&action=replace_tree
PubMed • Interface to MEDLINE, NLM's searchable index of publications in the biomedical field • More than 15 million records since the 1950s • Can retrieve abstracts and citation details, but not full text • Organized via the MeSH metathesaurus
MeSH • Controlled vocabulary of 24,700 subject headings or descriptors • Each linked to synonymous entry terms (151,000 of those) • Articles in MEDLINE are indexed using subject headings; each gets 1-2 major and about 10 other MeSH terms • During search, entry terms are mapped to descriptors and related terms are added to the query (query expansion)
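Query expansion as described on this slide can be sketched in a few lines. The two dictionaries below are a toy vocabulary invented for illustration, not actual MeSH data, and `expand_query` is a hypothetical function rather than PubMed's implementation:

```python
# Toy vocabulary for illustration only; these mappings are NOT real MeSH data.
ENTRY_TERM_TO_DESCRIPTOR = {
    "heart attack": "Myocardial Infarction",
    "cancer": "Neoplasms",
}
RELATED_TERMS = {
    "Myocardial Infarction": ["Myocardial Ischemia"],
    "Neoplasms": ["Carcinoma", "Sarcoma"],
}

def expand_query(user_term):
    """Map an entry term to its descriptor, then append related terms."""
    key = user_term.strip().lower()
    descriptor = ENTRY_TERM_TO_DESCRIPTOR.get(key)
    if descriptor is None:
        return [user_term]  # unknown term: search for it as typed
    return [descriptor] + RELATED_TERMS.get(descriptor, [])

print(expand_query("Heart attack"))
```

The point of the mapping step is that a searcher who types a synonym still retrieves articles indexed under the controlled descriptor.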
MEDLINE record example • Disambiguating proteins, genes, and RNA in text: a machine learning approach. • Author(s): Hatzivassiloglou V; Duboué PA; Rzhetsky A • Author's Address: Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027, USA. vh@cs.columbia.edu • Source: Bioinformatics. [Bioinformatics] 2001; Vol. 17 Suppl 1, pp. S97-106. • Publication Type: Evaluation Studies; Journal Article • Language: English • Journal Information: Country of Publication: England; NLM ID: 9808944; ISSN: 1367-4803; Subsets: MEDLINE • MeSH Terms: Artificial Intelligence*; Genes*; Proteins*; RNA*; Algorithms; Bayes Theorem; Comparative Study; Computational Biology; Data Collection; Natural Language Processing; Research Support, Non-U.S. Gov't • Abstract: We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended ways for defining contextual features for disambiguation are examined, and a fully unsupervised manner for obtaining training examples is proposed. We train and evaluate our system over a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates up to 85%. • CAS Registry Number: 0 (Proteins); 63231-63-0 (RNA) • Entry Date(s): Date Created: 20010726; Date Completed: 20020225; Latest Revision: 20041117 • Update Code: 20041218 • PMID: 11472998 • Database: MEDLINE
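In the MeSH Terms field of this record, an asterisk marks a major topic. A sketch of separating major from other terms, assuming the terms are separated by semicolons (the separator is an assumption about the record's layout, and `split_mesh_terms` is a hypothetical helper):

```python
# MeSH Terms field from the example record, with an assumed ";" separator.
mesh_field = (
    "Artificial Intelligence*; Genes*; Proteins*; RNA*; Algorithms; "
    "Bayes Theorem; Comparative Study; Computational Biology; Data Collection; "
    "Natural Language Processing; Research Support, Non-U.S. Gov't"
)

def split_mesh_terms(field):
    """Return (major, other) term lists; a trailing '*' marks a major topic."""
    major, other = [], []
    for term in field.split(";"):
        term = term.strip()
        if not term:
            continue
        if term.endswith("*"):
            major.append(term[:-1].strip())
        else:
            other.append(term)
    return major, other

major, other = split_mesh_terms(mesh_field)
print("major:", major)
print("other:", other)
```

Splitting on semicolons rather than commas keeps a term such as "Research Support, Non-U.S. Gov't" in one piece.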