In the name of God, the Most Gracious, the Most Merciful

Using NCBI Resources for Gene Discovery
Lecturer: Dr. Farkhondeh Poursina, PhD
poursina@med.mui.ac.ir
1392
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
http://www.ncbi.nlm.nih.gov/

Primary biological databases: Nucleic acid & Protein
• EMBL (European Molecular Biology Laboratory)
• DDBJ (DNA Data Bank of Japan)
• GenBank (NCBI, the National Center for Biotechnology Information)

EMBL/GenBank/DDBJ
• These 3 databases contain mainly the same information (with a few differences in format)
• They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
  • Genome projects and sequencing centers
  • Individual scientists
• Non-confidential data are exchanged daily
• Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp
• Sequences from > 50,000 different species

THE ‘PERFECT’ DATABASE
• Comprehensive, but easy to search.
• Annotated, but not “too annotated”.
• A simple, easy to understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.

The National Center for Biotechnology Information, Bethesda, MD
• Created in 1988 as a part of the National Library of Medicine at NIH (National Institutes of Health)
• Establish public databases
• Research in computational biology
• Develop software tools for sequence analysis
• Disseminate biomedical information

Web Access: www.ncbi.nlm.nih.gov
[Screenshot callouts: New pages! New homepage. Common footer.]

TYPES OF MOLECULAR DATABASES (Sequence) at NCBI
• Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, Trace, SRA, SNP, GEO
• Derivative Databases
  • Derived from primary data
  • Curated/expert review (content controlled by a third party, NCBI)
  • Compilation and correction of data
  • Examples: NCBI Protein, RefSeq, RefSNP, UniGene, HomoloGene, Structure, Conserved Domain

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
[Figure: sequence fragments from labs and sequencing centers are deposited in GenBank, which is updated ONLY by submitters. RefSeq (curators), Genome Assembly, and UniGene (algorithms) are derived from those submissions and are updated continually by NCBI.]

The Problem
• Rapidly growing databases with complex and changing relationships
• Rapidly changing interfaces to match the above
Result
• Many people don’t know:
  • Where to begin
  • Where to click on a Web page
  • Why it might be useful to click there

Derivative Sequence Databases

ENTREZ: FINDING RELEVANT INFORMATION IN NCBI DATABASES

You can search DNA sequence databases
Retrieve known sequences by:
• ENTREZ
  • http://www.ncbi.nlm.nih.gov/Entrez/
  • Click "Nucleotide"
• OR
• Accession number
• Keyword search
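The same two retrieval routes (by accession number or by keyword) can also be scripted. The sketch below is an assumption layered on the slide: it uses NCBI's E-utilities web service, which the slides do not cover, and only builds the request URLs. The helper names `esearch_url`, `efetch_fasta_url`, and `fetch` are illustrative, and `fetch` needs network access, so it is defined but not called here.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service (not mentioned in the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build a keyword-search URL; the response lists matching record IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_fasta_url(accession, db="nucleotide"):
    """Build a URL that retrieves one record, by accession, as FASTA text."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS}/efetch.fcgi?{query}"


def fetch(url):
    """Download the response body as text (requires network access)."""
    from urllib.request import urlopen

    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Accession-number route, using the accession shown later in the slides.
print(efetch_fasta_url("U07418"))
# Keyword route, using Entrez field tags in square brackets.
print(esearch_url("MLH1[Gene Name] AND Homo sapiens[Organism]"))
```

Calling `fetch(efetch_fasta_url("U07418"))` would then return the record as FASTA text, assuming the service is reachable.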

Entrez is Internally Cross-linked
• DNA and protein sequences are linked to other similar sequences
• Medline citations are linked to other citations that contain similar keywords
• 3-D structures are linked to similar structures

Databases contain more than just DNA & protein sequences

Retrieve all sequences for an organism or taxon
• Starting with an organism or taxon name...
How to: Download the complete genome for an organism
• Starting at the Genomes

How to: Find transcript sequences for a gene
• Starting with...
  • A GENE NAME, PRODUCT NAME, OR SYMBOL
How to: Obtain genomic sequence for/near a gene, marker, transcript or protein
• Starting with...
  • A GENE NAME OR SYMBOL

ENTREZ TIP: START SEARCHES IN GENE
[Figure labels: Entrez Protein, Gene, Other Entrez DBs, HomoloGene, UniGene, BLink, HomoloGene: Gene Neighbors]

How to: Display genomic annotation graphically
• Starting with...
  • A NUCLEOTIDE RECORD (e.g. NC_000001)

By applying limits, there are now just two entries

Precise Results

A Traditional GenBank Record
[Figure callouts: Molecular weight, Locus field, Molecule type, Accession no., Accession.version, Modification date, Definition line, GenBank division, GI (GenInfo), Taxonomy, Submission field]

Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use
• Coding sequence
• The sequence is the data
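The relationship between ACCESSION and VERSION on this slide can be made concrete with a short sketch. It assumes the usual `accession.version` spelling shown above (U07418.1); the names `SeqId`, `split_accession`, and `same_record` are illustrative, not part of any NCBI tool.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str          # stable label, e.g. "U07418"
    version: Optional[int]  # increases when the sequence changes; None if absent


def split_accession(identifier: str) -> SeqId:
    """Split 'U07418.1' into the stable accession and its version number."""
    accession, dot, version = identifier.strip().partition(".")
    if dot and not version.isdigit():
        raise ValueError(f"malformed version in {identifier!r}")
    return SeqId(accession, int(version) if dot else None)


def same_record(a: str, b: str) -> bool:
    """True when two identifiers name the same record, whatever the version."""
    return split_accession(a).accession == split_accession(b).accession


print(split_accession("U07418.1"))
print(same_record("U07418.1", "U07418.2"))
```

Comparing only the accession part answers "is this the same record?", while comparing the version answers "is this the same sequence?".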

What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
  X02775      GenBank genomic DNA sequence
  NT_030059   Genomic contig
  rs7079946   dbSNP (single nucleotide polymorphism)
RNA
  N91759.1    An expressed sequence tag (1 of 170)
  NM_006744   RefSeq DNA sequence (from a transcript)
Protein
  NP_007635   RefSeq protein
  AAC02945    GenBank protein
  Q28369      SwissProt protein
  1KT7        Protein Data Bank structure record

Feature Table GenPept Record Genomic DNA Sequence

GenPept: GenBank CDS translations
FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
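The `/key="value"` qualifiers in the feature table above are regular enough to pull out with a small script. This is a minimal sketch, not a full GenBank parser: it assumes qualifier values contain no embedded double quotes, the name `parse_qualifiers` is illustrative, and the `/product` line in the sample has been wrapped by hand to show that line breaks inside a value are undone.

```python
import re

# Matches /key="quoted value" (may span lines) or /key=bare_value.
QUALIFIER = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_qualifiers(feature_text):
    """Return {qualifier: [values]}; a qualifier such as db_xref may repeat."""
    found = {}
    for key, quoted, bare in QUALIFIER.findall(feature_text):
        # Collapse the whitespace introduced by flat-file line wrapping.
        value = " ".join((quoted or bare).split())
        found.setdefault(key, []).append(value)
    return found


# CDS feature copied from the slide (note omitted, product wrapped by hand).
cds = '''
     CDS             22..2292
                     /gene="MLH1"
                     /codon_start=1
                     /product="DNA mismatch repair protein
                     homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
'''

info = parse_qualifiers(cds)
print(info["gene"], info["protein_id"], info["product"])
```

For real records, an established parser such as the one in Biopython is likely a safer choice than a regular expression.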

Reference Sequences (RefSeq)
• Nucleotide sequences and protein translations
• Curated by NCBI or NCBI-approved programs
• Difference between GenBank and RefSeq:
  • GenBank has raw data and duplicated records
  • Metadata in GenBank can be incomplete
  • RefSeq is annotated, curated and non-redundant
• NCBI takes the best sequences from GenBank and curates them for RefSeq records

Selected RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456   Curated mRNA
  NP_123456   Curated Protein
  NR_123456   Curated non-coding RNA
  XM_123456   Predicted mRNA
  XP_123456   Predicted Protein
  XR_123456   Predicted non-coding RNA
Gene Records
  NG_123456   Reference Genomic Sequence
Chromosome
  NC_123455   Microbial replicons, organelle genomes, human chromosomes
  AC_123455   Alternate assemblies
Assemblies
  NT_123456   Contig
  NW_123456   WGS Supercontig
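Because the two-letter prefix carries the record type, the list above translates directly into a lookup. The sketch below covers only the prefixes listed on this slide; the function name `refseq_type` is illustrative, and any accession without an underscore is treated as not being a RefSeq accession.

```python
# Prefix meanings as listed on the slide (not the complete RefSeq set).
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "NG": "reference genomic sequence",
    "NC": "chromosome or complete replicon",
    "AC": "alternate assembly",
    "NT": "contig",
    "NW": "WGS supercontig",
}


def refseq_type(accession):
    """Describe a RefSeq accession by prefix; None if the prefix is unknown."""
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        return None  # e.g. X02775, a GenBank accession with no RefSeq prefix
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_006744", "NP_007635", "NT_030059", "X02775"):
    print(acc, "->", refseq_type(acc))
```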

Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq

How to save?
• Choose FASTA from the Display drop-down menu.
• Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.
• Save the FASTA sequence by using the following protocol:
  a. In the Edit menu of your Web browser, click Select All and then click Copy.
  b. Open a default Word document and, in the Edit menu of Word, click Paste.
  c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option text only (*.txt).
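The copy-and-paste protocol above ends with a plain-text file. As a hedged alternative, the sketch below writes such a file directly from a script. The function name `save_fasta`, the 70-character line width, and the placeholder sequence are all assumptions for illustration; only the file name dUTPaseDNA.txt comes from the slide.

```python
from pathlib import Path


def save_fasta(path, description, sequence, width=70):
    """Write one sequence to a plain-text FASTA file and return its path."""
    cleaned = "".join(sequence.split())  # drop spaces and line breaks
    lines = [">" + description]
    # Break the sequence into fixed-width lines, well under 80 characters.
    lines += [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    target = Path(path)
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    # Placeholder sequence, not the real dUTPase gene.
    saved = save_fasta("dUTPaseDNA.txt", "placeholder dUTPase DNA", "ACGT" * 40)
    print(saved.read_text())
```

Saving as plain text matters because word-processor formats add hidden markup that sequence analysis programs cannot read.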

FASTA format description
• FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.
• Its sequence format is popular and commonly used.
• A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
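The format rules above (a ">" description line followed by sequence lines) are simple enough to read with a few lines of code. This is a minimal sketch with an illustrative function name, `parse_fasta`, and made-up example records; it keeps whole files in memory and does no alphabet validation.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical two-record input: one DNA fragment, one protein fragment.
example = """>example_1 hypothetical DNA record
ACGT
TTGA
>example_2 hypothetical protein record
MSFVAGVIRRLDETVV
"""

for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```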
