Bioinformatics Overview, NCBI & GenBank

Bioinformatics Overview, NCBI & GenBank JanPlan 2012

What is Bioinformatics?
• Find three different definitions of the word “bioinformatics”
• How is “bioinformatics” different from “computational biology”?
• What areas of biological research are dependent on bioinformatics?

What is Bioinformatics Used For?
• Database searching
• Sequence analysis
• Phylogenetic reconstruction
• Molecular evolution
• Gene expression
• Genome assembly
• Genome annotation
• Metagenomics

Introduction to NCBI
• NCBI, EMBL & DDBJ
• What role do these organizations play in global society?
• How do their missions differ?
• NCBI Training and Tutorials page
• The NCBI Handbook
• NCBI How-To page
• NCBI Help Manual

GenBank
• Annotated collection of all publicly available nucleotide sequences and their protein translations.
• Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.
• Grows exponentially, doubling every 10 months.
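To get a feel for what "doubling every 10 months" implies, here is a minimal sketch. It assumes the doubling time stays constant, which is a simplification, and the function name is made up for illustration.

```python
def genbank_fold_growth(months, doubling_time_months=10.0):
    """Fold increase in database size after `months`.

    Assumes steady exponential growth with a fixed doubling time
    (10 months, the figure quoted on the slide).
    """
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    # Print the fold increase over a few illustrative time spans.
    for months in (10, 30, 60):
        print(f"{months:3d} months -> {genbank_fold_growth(months):.0f}-fold")
```

Under this assumption, five years (60 months) is six doublings, i.e. a 64-fold increase.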

GenBank
• Initially built and maintained at Los Alamos National Laboratory.
• Transferred to NCBI in the early 1990s by congressional mandate.
• Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number may be cited.
• Submitters may keep their data confidential for a specified period of time prior to publication.

Direct Submission
• A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (a contig) with annotations (metadata).
• If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated and its span mapped.
• Example
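A conceptual translation can be sketched in a few lines. The example below is only an illustration, not GenBank's own software: it assumes the standard genetic code, a CDS on the forward strand with no introns, and a 1-based inclusive span. The function name and the toy sequence are invented for this sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence, start, end):
    """Translate the CDS spanning start..end (1-based, inclusive).

    Assumes a forward-strand, uninterrupted coding region. A trailing
    stop codon, if present, is dropped from the returned protein.
    """
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of 3")
    protein = "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))
    return protein[:-1] if protein.endswith("*") else protein


if __name__ == "__main__":
    toy_sequence = "GGATGGCCAAATAAGG"  # made-up 16-base record
    print(translate_cds(toy_sequence, 3, 14))  # CDS mapped to span 3..14
```

Here the span 3..14 covers ATG GCC AAA TAA, so the conceptual translation is MAK.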

High-Throughput Genomic Sequence (HTGS)
• HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.
• Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.

High-Throughput Genomic Sequence (HTGS)
• Data are submitted in 4 phases:
    • Phase 0: Sequences are one-to-few reads of a single clone and are not usually assembled into contigs. They are low-quality sequences that are often used to check whether another center is already sequencing a particular clone.
    • Phase 1: Entries are assembled into contigs that are separated by sequence gaps, the relative order and orientation of which are not known.
    • Phase 2: Entries are also unfinished sequences that may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct order and orientation.
    • Phase 3: Sequences are of finished quality and have no gaps. For each organism, the group overseeing the sequencing effort determines the definition of finished quality.
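The four phases can be read as a small decision procedure. The sketch below is one way to encode the slide's descriptions. The boolean inputs and the function name are assumptions made for illustration and do not correspond to fields in a real HTGS record.

```python
from enum import IntEnum


class HTGSPhase(IntEnum):
    PHASE_0 = 0  # one-to-few reads of a single clone, usually unassembled
    PHASE_1 = 1  # contigs separated by gaps; order and orientation unknown
    PHASE_2 = 2  # unfinished; if gaps exist, contigs are ordered and oriented
    PHASE_3 = 3  # finished quality, no gaps


def classify_htgs(assembled, has_gaps, ordered_and_oriented, finished):
    """Map four yes/no properties of an entry onto an HTGS phase."""
    if not assembled:
        return HTGSPhase.PHASE_0
    if finished and not has_gaps:
        return HTGSPhase.PHASE_3
    if has_gaps and not ordered_and_oriented:
        return HTGSPhase.PHASE_1
    # Unfinished, and either gap-free or gapped with known order/orientation.
    return HTGSPhase.PHASE_2


if __name__ == "__main__":
    print(classify_htgs(assembled=True, has_gaps=True,
                        ordered_and_oriented=False, finished=False).name)
```

Whether a sequence counts as "finished" is left as an input here because, as the slide notes, that definition is set per organism by the group overseeing the sequencing effort.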

Whole Genome Shotgun Sequences (WGS)
• Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.

EST, STS, and GSS
• EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation.
• STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.
• GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.

HTC and FLIC
• HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. May have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.
• FLIC = Full-Length Insert cDNA: Contains the entire sequence of a cloned cDNA/mRNA. Generally longer than HTC entries, and sometimes full-length mRNAs. Usually annotated with genes and coding regions. Gene names may be systematic rather than functional.

Submission Tools
• BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.
• Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Stand-alone application available on NCBI’s FTP site.

Sequence Data Flow and Processing
• Triage: Within 48 hours of direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an accession number.
• All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.
• GenBank will not accept sequences constructed in silico.
• GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.
• GenBank will not accept sequences for which there is not a physical counterpart, such as those derived from a mix of genomic DNA and mRNA.
• Submissions are checked to determine whether they are new or updates.
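The minimal criteria listed above amount to a checklist. The following sketch expresses them as a simple validation function. The `Submission` class and its field names are hypothetical stand-ins for facts that GenBank staff assess during review; this is not an NCBI tool or API.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 50  # slide criterion: sequences must be > 50 bp


@dataclass
class Submission:
    """Hypothetical summary of what the triage step looks at."""
    sequence: str
    sequenced_by_or_for_submitter: bool = True
    constructed_in_silico: bool = False
    has_unsequenced_spacers: bool = False
    has_physical_counterpart: bool = True


def triage_problems(sub):
    """Return a list of reasons the submission fails the minimal criteria."""
    problems = []
    if len(sub.sequence) <= MIN_LENGTH_BP:
        problems.append("sequence must be longer than 50 bp")
    if not sub.sequenced_by_or_for_submitter:
        problems.append("not sequenced by, or on behalf of, the submitter")
    if sub.constructed_in_silico:
        problems.append("constructed in silico")
    if sub.has_unsequenced_spacers:
        problems.append("noncontiguous: contains unsequenced spacers")
    if not sub.has_physical_counterpart:
        problems.append("no physical counterpart (e.g. genomic DNA/mRNA mix)")
    return problems


if __name__ == "__main__":
    print(triage_problems(Submission(sequence="ACGT" * 5)))
```

An empty list means the submission passes these checks. Assigning the accession number and deciding whether a submission is new or an update remain separate steps.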

Sequence Data Flow and Processing
• Indexing:
    • Biological validity: Translation, organism lineage, BLAST searches
    • Vector contamination: Is there any vector DNA present in the sequence?
    • Publication status: If published, citation is included in annotation and linked to Entrez
    • Formatting and spelling
• Sequences are sent to submitter for final review before release into the public database.
• Sequences must become publicly available once the accession number or the sequence has been published.
• GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.
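To illustrate the idea behind the vector contamination check, here is a deliberately naive sketch that looks for exact copies of known vector fragments on either strand. Real screening relies on similarity searches against a vector database rather than exact matching, and the fragment used here is a made-up toy sequence, not a real vector.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_vector_hits(sequence, vector_fragments):
    """Report exact occurrences of vector fragments in a sequence.

    `vector_fragments` maps a name to a DNA string. Returns a list of
    (name, 1-based position of first match, strand) tuples.
    """
    seq = sequence.upper()
    hits = []
    for name, fragment in vector_fragments.items():
        probes = (("+", fragment.upper()), ("-", reverse_complement(fragment)))
        for strand, probe in probes:
            position = seq.find(probe)
            if position != -1:
                hits.append((name, position + 1, strand))
    return hits


if __name__ == "__main__":
    toy_vectors = {"toy_vector": "GAATTCGAGCTC"}  # illustrative fragment only
    print(find_vector_hits("AAAAGAATTCGAGCTCTTTT", toy_vectors))
```

Exact matching will miss vector sequence that contains even a single sequencing error, which is one reason a similarity-based search is the better choice in practice.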

RefSeq
• A curated collection of DNA, RNA, and protein sequences built by NCBI.
• Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
• May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts.
• Limited to major organisms for which sufficient data is available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).

Third Party Annotation (TPA) database
• Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.
• Two types of records:
    • Experimental: Annotation supported by wet-lab evidence
    • Inferential: Annotation inferred only
• Bridges the gap between GenBank and RefSeq by permitting authors publishing new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.

Universal Protein Resource (UniProt)
• Protein sequence database that was formed through the merger of three protein databases:
    • Swiss-Prot, from the Swiss Institute of Bioinformatics and the European Bioinformatics Institute
    • The Translated EMBL Nucleotide Sequence Data Library (TrEMBL), from the same two institutes
    • Georgetown University’s Protein Information Resource Protein Sequence Database (PIR-PSD)

Problem Set
• ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf
• Linked on today’s web page
