Access to Sequence Data and Literature Information Department of Computer and Information Science Spring 2012 CISC 4020 Bioinformatics
NCBI
• National Center for Biotechnology Information @ National Institutes of Health
• http://www.ncbi.nlm.nih.gov/
• NCBI hosts GenBank, a database of most known public DNA, RNA, and protein sequences (about 100 billion nucleotides from 100 million sequences).
• It contains bibliographic and biological annotation for these sequences.
Growth of GenBank
[Figure: base pairs of DNA (millions) and sequences (millions) plotted by year, 1982 to 2002]
Three major public DNA databases
• EMBL: housed at EBI (European Bioinformatics Institute)
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• DDBJ: DNA Data Bank of Japan, housed in Japan
The underlying raw DNA sequences are identical.
Taxonomy at NCBI: >200,000 species are represented in GenBank http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov
Entrez @ NCBI
• A search and retrieval system
• Entrez integrates:
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Entrez Gene
• A great starting point
• It collects key information on each gene/protein from major databases.
• It covers all major organisms.
Accession number
• A label for a molecular sequence.
• Different databases have characteristic formats.
• GenBank formats: X02775, AF025334
• In addition, NCBI assigns unique sequence IDs to the individual sequences within a record, e.g. GI:28302128
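The two GenBank examples on this slide suggest two shapes: one letter plus five digits, and two letters plus six digits. The sketch below checks an identifier against those shapes. The patterns are inferred only from the examples shown here and are an assumption, not a complete description of every accession format in use; the function name is hypothetical.

```python
import re

# Assumed patterns, inferred from the slide's examples (X02775, AF025334).
# Real databases use additional formats that are not covered here.
GENBANK_NUCLEOTIDE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")
GI_NUMBER = re.compile(r"GI:\d+")


def classify_identifier(identifier):
    """Return a rough label for a sequence identifier (hypothetical helper)."""
    text = identifier.strip().upper()
    if GENBANK_NUCLEOTIDE.fullmatch(text):
        return "GenBank accession"
    if GI_NUMBER.fullmatch(text):
        return "GI number"
    return "unknown"


if __name__ == "__main__":
    for example in ("X02775", "AF025334", "GI:28302128"):
        print(example, "->", classify_identifier(example))
```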
Example: beta globin
• There are thousands of accession numbers for beta globin, i.e. thousands of core nucleotide sequences.
• Full length vs. partial.
• Natural variants vs. spliced transcripts.
• Errors.
Example: beta globin. From the NCBI home page, type “beta globin” and hit “Search”.
RefSeq: The Reference Sequence Project
• One RefSeq entry for a given gene or gene product
• Format: NM_000000 (or more digits)
• Several RefSeq entries if there are splice variants or distinct loci.
• Example: myoglobin has 3 distinct splice variants, which encode an identical protein.
• NM_005368, NM_203377, and NM_203378
• NP_005359, NP_976311, and NP_976312 for corresponding proteins (different sources)
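As a rough illustration of the naming scheme, the sketch below reads the prefix of a RefSeq accession and reports the molecule type. Only the two prefixes that appear on this slide (NM_ for transcripts, NP_ for proteins) are included; the lookup table and function name are hypothetical, and the table would need extending for other RefSeq prefixes.

```python
# Prefixes taken from the slide's examples only; other RefSeq prefixes exist.
REFSEQ_MOLECULE_TYPES = {
    "NM_": "mRNA",
    "NP_": "protein",
}


def describe_refseq(accession):
    """Return the molecule type for a RefSeq accession, or None if unrecognized."""
    # Drop an optional version suffix such as ".2" before inspecting the prefix.
    base = accession.strip().split(".")[0]
    prefix, separator, digits = base.partition("_")
    # The slide gives the shape as a prefix, an underscore, then six or more digits.
    if not separator or not digits.isdigit() or len(digits) < 6:
        return None
    return REFSEQ_MOLECULE_TYPES.get(prefix + "_")


if __name__ == "__main__":
    for accession in ("NM_005368", "NM_203377", "NP_005359", "NP_976312"):
        print(accession, "->", describe_refseq(accession))
```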
Follow the link to “Gene”
Entrez Gene is in the header. Note the “Official Symbol” HBB for beta globin. Note the “limits” option.
Using “limits” you can restrict your search to human (or any other organism)
By applying limits, there are now far fewer entries
Entrez Gene (top of page): Note a useful summary, and links to other databases
Entrez Gene (middle of page): genomic region, Gene sequence
GenBank Entry
Gene Sequence: lower part of the entry page
Entrez Gene (middle of page, continued): phenotypes, function
Entrez Gene (bottom of page): RefSeq accession numbers for Protein
Entrez Protein: accession, organism, literature… Fig. 2.8
Entrez Protein: …features of a protein, and its sequence in the one-letter amino acid code Fig. 2.8
One-letter amino acid code
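The table from this slide did not survive in the text, so the sketch below restates the standard one-letter codes for the 20 common amino acids as a lookup table, with a small helper that expands a sequence into three-letter abbreviations. The helper name and the short test sequence are illustrative assumptions.

```python
# Standard one-letter codes for the 20 common amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Expand a one-letter protein sequence into hyphen-joined three-letter codes."""
    residues = []
    for letter in sequence.upper():
        if letter not in ONE_TO_THREE:
            raise ValueError(f"not one of the 20 standard codes: {letter!r}")
        residues.append(ONE_TO_THREE[letter])
    return "-".join(residues)


if __name__ == "__main__":
    # "MVHL" is just a short illustrative string of residues.
    print(to_three_letter("MVHL"))
```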
Entrez Protein: You can change the display (as shown)…
FASTA format: versatile and compact, with one header line (beginning with “>”) followed by a string of nucleotides or amino acids in the single-letter code
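A minimal sketch of reading FASTA text, assuming the usual convention that each record starts with a ">" header line and that the sequence may wrap over several lines. The function name and the example record are hypothetical and serve only to show the layout.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without separators
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record, wrapped over two sequence lines for illustration.
EXAMPLE = """>example_1 illustrative protein fragment
MVHLTPEEK
SAVTALWGK
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(EXAMPLE):
        print(header, len(sequence), sequence)
```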
Entrez Gene & Entrez Protein
• Begin your search with Entrez Gene.
• Identify the official gene name.
• Identify the chromosomal location of the gene.
• Identify RefSeq accession numbers associated with all the DNA and protein variants.
• Visit Entrez Protein.
• Obtain the FASTA format of the protein.
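The last step can also be done without the web pages. The slides do not cover programmatic access, so the sketch below is an assumption on top of them: it builds a request URL for NCBI's E-utilities `efetch` service, asking for a protein record in FASTA format. It only constructs the URL and does not contact the server; the function name is hypothetical, and the accession used is one of the myoglobin protein entries listed earlier.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service (not part of the original slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="protein"):
    """Build an efetch URL that requests one record as plain-text FASTA."""
    params = {
        "db": db,              # Entrez database to query
        "id": accession,       # accession number of the record
        "rettype": "fasta",    # ask for FASTA output
        "retmode": "text",     # plain text rather than XML
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # NP_005359 is a myoglobin protein accession from the RefSeq slide.
    print(fasta_url("NP_005359"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the same FASTA text that the Entrez Protein display option produces.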
PubMed @ NCBI
• National Library of Medicine's search service
• 21 million citations in MEDLINE (as of 2011).
• Links to online full-text journal articles and the integrated molecular biology databases.
• PubMed tutorial on the site or visit NLM: http://www.nlm.nih.gov/bsd/disted/pubmed.html
BLAST @ NCBI
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• Supports analysis of DNA and protein databases
• 100,000 searches per day
OMIM @ NCBI
• Online Mendelian Inheritance in Man
• Catalog of human genes and genetic disorders
• Contains links to PubMed articles and sequence information.
• Created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI
TaxBrowser @ NCBI
• A taxonomy browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses).
• It features taxonomy information:
• genetic codes
• taxonomy resources
• molecular data on extinct organisms
• recent changes to classification schemes.
• Practically useful for finding a protein or gene from a species
• Each organism or group in GenBank is assigned a unique taxonomy identifier, e.g. txid9606 for human (Homo sapiens)
How many gene entries are there for human?
Result
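The same count can be requested without the web form. The slides do not cover this, so treat the following as an assumption layered on top of them: it builds a URL for NCBI's E-utilities `esearch` service that asks only for the number of Gene records matching the human taxonomy identifier. The sketch constructs the URL and makes no network call; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service (not part of the original slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(db, term):
    """Build an esearch URL that asks for the number of matching records."""
    params = {
        "db": db,            # Entrez database, e.g. "gene" or "protein"
        "term": term,        # the same query text you would type into Entrez
        "rettype": "count",  # return only the count, not the list of IDs
    }
    # urlencode percent-escapes the brackets and colon in the query term.
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # txid9606 is the taxonomy identifier for human given on the TaxBrowser slide.
    print(count_url("gene", "txid9606[Organism:exp]"))
```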
Proteins for human
Proteins of molecular weight 2,000 to 10,000 daltons: txid9606[Organism:exp] AND 002000:010000[molwt]
Result
Tips for using Entrez databases
• Use the Boolean operators AND, OR, and NOT (must be capitalized).
• To narrow your search to a particular organism, look up the organism in Taxonomy first.
• Limit the molecular weight of a protein by adding XXXXXX:YYYYYY[molwt] to the query.
• Put * at the end of a query term to get all records that begin with that text string.
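These tips can be combined mechanically. The sketch below assembles query strings in the form shown on the molecular-weight slide: an organism term, a six-digit zero-padded weight range, and capitalized Boolean operators. All helper names are hypothetical; the output is plain text to paste into an Entrez search box.

```python
def organism_term(taxonomy_id):
    """Restrict a query to an organism by its taxonomy identifier."""
    return f"txid{taxonomy_id}[Organism:exp]"


def molwt_term(low, high):
    """Molecular weight range in daltons, zero-padded to six digits."""
    return f"{low:06d}:{high:06d}[molwt]"


def prefix_term(text):
    """Match every record that begins with the given text string."""
    return text + "*"


def all_of(*terms):
    """Join terms with the Boolean operator AND (operators must be capitalized)."""
    return " AND ".join(terms)


if __name__ == "__main__":
    # Human proteins between 2,000 and 10,000 daltons.
    print(all_of(organism_term(9606), molwt_term(2000, 10000)))
    # Human records whose text begins with "glob" (illustrative prefix).
    print(all_of(organism_term(9606), prefix_term("glob")))
```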