Access to Sequence Data and Literature Information

Department of Computer and Information Science
Spring 2012
CISC 4020 Bioinformatics

NCBI
• National Center for Biotechnology Information @ National Institutes of Health
• http://www.ncbi.nlm.nih.gov/
• NCBI hosts GenBank, a database of most known public DNA, RNA, and protein sequences (about 100 billion nucleotides from 100 million sequences).
• GenBank contains bibliographic and biological annotation for these sequences.

Growth of GenBank
[Figure: base pairs of DNA (millions) and sequences (millions) plotted by year, 1982 to 2002]

Three major public DNA databases
• EMBL: housed at EBI (European Bioinformatics Institute)
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• DDBJ (DNA Data Bank of Japan): housed in Japan
The underlying raw DNA sequences are identical.

Taxonomy at NCBI: >200,000 species are represented in GenBank
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi

National Center for Biotechnology Information (NCBI)
www.ncbi.nlm.nih.gov

Entrez @ NCBI
• A search and retrieval system
• Entrez integrates:
  • the scientific literature
  • DNA and protein sequence databases
  • 3D protein structure data
  • population study data sets
  • assemblies of complete genomes

Entrez Gene
• A great starting point
• It collects key information on each gene/protein from major databases.
• It covers all major organisms.

Accession number
• A label for a molecular sequence.
• Different databases have characteristic formats.
• GenBank format examples:
  • X02775
  • AF025334
• In addition, NCBI assigns a unique sequence identifier (GI number) to each individual sequence within a record, e.g. GI:28302128
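The two GenBank examples above follow the common "one letter plus five digits" and "two letters plus six digits" patterns. The sketch below recognizes those shapes and the GI form. The regular expressions and the function name `classify_identifier` are illustrative assumptions, not an official NCBI validator, and other valid accession formats exist that it does not cover.

```python
import re

# Assumed patterns, inferred from the examples on the slide:
#   X02775   -> 1 letter + 5 digits
#   AF025334 -> 2 letters + 6 digits
GENBANK_NUCLEOTIDE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")
GI_NUMBER = re.compile(r"^GI:\d+$")


def classify_identifier(identifier: str) -> str:
    """Return a rough label for a sequence identifier (hypothetical helper)."""
    text = identifier.strip().upper()
    if GENBANK_NUCLEOTIDE.match(text):
        return "GenBank nucleotide accession"
    if GI_NUMBER.match(text):
        return "GI number"
    return "unrecognized"


if __name__ == "__main__":
    for example in ("X02775", "AF025334", "GI:28302128"):
        print(example, "->", classify_identifier(example))
```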

Example – beta globin
• There are thousands of accession numbers for beta globin, i.e. thousands of core nucleotide sequences. They differ because of:
  • full-length vs. partial sequences
  • natural variants vs. spliced transcripts
  • errors

Example: beta globin
From the NCBI home page, type “beta globin” and click “Search”.


RefSeq: The Reference Sequence Project
• One RefSeq entry for a given gene or gene product
  • NM_000000 (or more digits)
• Several RefSeq entries if there are splice variants or distinct loci.
• Example: myoglobin has 3 distinct splice variants, which encode an identical protein.
  • NM_005368, NM_203377, and NM_203378
  • NP_005359, NP_976311, and NP_976312 for corresponding proteins (different sources)
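RefSeq accessions carry their molecule type in the prefix: NM_ marks an mRNA transcript and NP_ marks a protein, as in the myoglobin example above. A minimal sketch of reading that prefix follows. The function name `refseq_kind` is hypothetical, and only the two prefixes that appear on the slide are mapped; RefSeq defines further prefixes that this sketch simply reports as "other".

```python
# Only the prefixes shown on the slide are listed here.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
}


def refseq_kind(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix, separator, digits = accession.strip().upper().partition("_")
    digits = digits.split(".", 1)[0]  # drop a version suffix such as ".2" if present
    if not separator or not digits.isdigit():
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    return REFSEQ_PREFIXES.get(prefix, "other")


if __name__ == "__main__":
    for acc in ("NM_005368", "NM_203377", "NM_203378",
                "NP_005359", "NP_976311", "NP_976312"):
        print(acc, "->", refseq_kind(acc))
```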

Follow the link to “Gene”

Entrez Gene is in the header
Note the “Official Symbol” HBB for beta globin
Note the “limits” option

Using “limits” you can restrict your search to human (or any other organism)

By applying limits, there are now far fewer entries

Entrez Gene (top of page): Note a useful summary, and links to other databases

Entrez Gene (middle of page): genomic region, gene sequence

GenBank Entry

Gene Sequence: lower part of the entry page

Entrez Gene (middle of page, continued): phenotypes, function

Entrez Gene (bottom of page): RefSeq accession numbers for Protein

Entrez Protein: accession, organism, literature… Fig. 2.8

Entrez Protein: …features of a protein, and its sequence in the one-letter amino acid code Fig. 2.8

One-letter amino acid code
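For reference, the standard one-letter codes for the 20 common amino acids can be written as a lookup table. The sketch below maps each one-letter code to its three-letter abbreviation; the helper name `to_three_letter` is illustrative, and ambiguity codes such as B, Z, and X are deliberately left out.

```python
# One-letter code -> three-letter abbreviation for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence: str) -> str:
    """Spell out a one-letter protein sequence using three-letter codes."""
    residues = []
    for letter in sequence.strip().upper():
        if letter not in ONE_TO_THREE:
            raise ValueError(f"unknown amino acid code: {letter!r}")
        residues.append(ONE_TO_THREE[letter])
    return "-".join(residues)


if __name__ == "__main__":
    print(to_three_letter("MVHL"))  # prints Met-Val-His-Leu
```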

Entrez Protein: You can change the display (as shown)…

FASTA format: versatile and compact, with one header line (beginning with ">") followed by a string of nucleotides or amino acids in the single-letter code
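Because the format is just a ">" header line followed by sequence lines, a reader for it fits in a few lines. The following is a minimal sketch, not a full-featured parser; the function name `parse_fasta` and the sample record are made up for illustration and do not correspond to any real database entry.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs.

    Sequence lines belonging to one record are joined into a single string.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical record, invented for this example.
    sample = ">example_1 made-up peptide\nMKTAYIAK\nQRQISFVK\n"
    for name, seq in parse_fasta(sample):
        print(name, len(seq), seq)
```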

Entrez Gene & Entrez Protein
• Begin your search with Entrez Gene.
  • Identify the official gene name.
  • Identify the chromosomal location of the gene.
  • Identify the RefSeq accession numbers associated with all the DNA and protein variants.
• Visit Entrez Protein.
  • Obtain the FASTA format of the protein.
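The last step, obtaining a protein in FASTA format, can also be done without the web pages through NCBI's E-utilities service, which the slides do not cover. The sketch below only builds the request URL for the `efetch` endpoint and does not contact the network; the function name `protein_fasta_url` is hypothetical, and the accession used is the myoglobin protein listed earlier in these slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (not mentioned on the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def protein_fasta_url(accession: str) -> str:
    """Build an efetch URL that requests one protein record as FASTA text."""
    params = {
        "db": "protein",      # Entrez Protein database
        "id": accession,      # e.g. a RefSeq NP_ accession
        "rettype": "fasta",   # ask for FASTA format
        "retmode": "text",    # plain text rather than XML
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(protein_fasta_url("NP_005359"))
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return the record; heavy automated use is subject to NCBI's usage guidelines.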

PubMed @ NCBI
• National Library of Medicine's search service
• 21 million citations in MEDLINE (as of 2011)
• Links to online full-text journal articles and the integrated molecular biology databases
• PubMed tutorial on the site, or visit NLM: http://www.nlm.nih.gov/bsd/disted/pubmed.html

BLAST @ NCBI
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• Supports analysis of DNA and protein databases
• 100,000 searches per day

OMIM @ NCBI
• Online Mendelian Inheritance in Man
• Catalog of human genes and genetic disorders
• Contains links to PubMed articles and sequence information
• Created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI

TaxBrowser @ NCBI
• A taxonomy browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses).
• It features:
  • taxonomy information
  • genetic codes
  • taxonomy resources
  • molecular data on extinct organisms
  • recent changes to classification schemes
• Useful in practice for finding a protein or gene from a particular species.
• Each organism or group in GenBank is assigned a unique taxonomy identifier.
  • txid9606 for human (Homo sapiens)

How many gene entries are there for human?

Result

Proteins for human

Proteins of molecular weight 2,000 to 10,000 daltons
txid9606[Organism:exp] AND 002000:010000[molwt]
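The weight limits in this query are written as six-digit, zero-padded numbers. A small helper can produce the same query string for any organism and weight range. This is a sketch of the string formatting only; the function name `molwt_query` is an assumption, and the six-digit upper bound simply mirrors the format shown on the slide.

```python
def molwt_query(taxid: int, min_daltons: int, max_daltons: int) -> str:
    """Build an Entrez Protein query limited by organism and molecular weight."""
    if not 0 <= min_daltons <= max_daltons <= 999_999:
        raise ValueError("weights must satisfy 0 <= min <= max <= 999999")
    # Weights are zero-padded to six digits, as in 002000:010000[molwt].
    return (
        f"txid{taxid}[Organism:exp] AND "
        f"{min_daltons:06d}:{max_daltons:06d}[molwt]"
    )


if __name__ == "__main__":
    # Reproduces the query on the slide for human proteins of 2,000 to 10,000 Da.
    print(molwt_query(9606, 2000, 10000))
```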

Result

Tips for using Entrez databases
• Use the Boolean operators AND, OR, and NOT (they must be capitalized).
• To narrow your search to a particular organism, look the organism up in Taxonomy first.
• Limit the molecular weight of proteins by adding XXXXXX:YYYYYY[molwt] to the query.
• Add * as a suffix to your query to get all records that begin with a particular text string.
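These tips can be combined when a query is assembled programmatically. The sketch below joins terms with a capitalized Boolean operator and appends the * wildcard to a text stem. Both helper names (`combine`, `prefix_search`) are hypothetical, and the search stem "glob" is only an illustration.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def combine(operator: str, *terms: str) -> str:
    """Join query terms with a Boolean operator, forcing it to upper case."""
    op = operator.upper()
    if op not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    if len(terms) < 2:
        raise ValueError("need at least two terms to combine")
    return f" {op} ".join(terms)


def prefix_search(stem: str) -> str:
    """Return a term matching all records that begin with the given text."""
    return stem.rstrip("*") + "*"


if __name__ == "__main__":
    query = combine("and", prefix_search("glob"), "txid9606[Organism:exp]")
    print(query)  # prints glob* AND txid9606[Organism:exp]
```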
