A comprehensive guide to searching tools, internet resources, and molecular biology databases for molecular biologists and researchers. Learn about PubMed, NCBI, genome browsers, structure visualization tools, and more.
Informatics for Molecular Biologists Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Service Falk Library, Health Sciences Library System University of Pittsburgh
Molecular Biology Information Service Falk Library of Health Sciences Health Sciences Library System University of Pittsburgh 200 Scaife Hall Desoto and Terrace Streets Pittsburgh, PA 15261
Topics • Searching tools • Internet • PubMed • NCBI-developed bioinformatics tools • Entrez Gene • Structure visualization tools • Cn3D • Genome browsers • UCSC Genome Browser • NCBI Map Viewer
Information search space • Biomedical literature databases • Molecular databases • Organism whole genome sequences
Literature database • NCBI PubMed • contains over 15 million citations dating back to the mid-1950s • Example searches (number of citations returned): “apoptosis”: 130,476; “breast cancer”: 160,055; “p53”: 42,418
Organisms whole genome sequences http://www.genomesonline.org/
Internet for Biologists • Google vs Clusty • Google: a single ranked list of search results • Clusty: search results categorized into topical clusters. Vivísimo's clustering technology creates topical categories on the fly from the search results, using terms in the title, snippet, and any other available textual description in the search results themselves.
Google vs Clusty • Search example: Pittsburgh • Google • Clusty
Clusty Clusters help you see your search results by topic, so you can zero in on exactly what you’re looking for or discover unexpected relationships between items.
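The idea of building topical categories from terms in the result titles can be illustrated with a toy sketch. This is not Vivísimo's algorithm, which the slides do not describe in detail; the stopword list, the `cluster_by_term` function, and the sample titles are all hypothetical and exist only to show the general principle of grouping hits by shared terms.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to", "on"}

def cluster_by_term(results, query):
    """Group result titles under each informative term they share.

    Query terms are skipped because every hit contains them.
    """
    skip = STOPWORDS | set(query.lower().split())
    clusters = defaultdict(list)
    for title in results:
        # Use a set so a term repeated in one title is counted once.
        for term in sorted(set(title.lower().split()) - skip):
            clusters[term].append(title)
    # Keep only terms shared by at least two results.
    return {term: titles for term, titles in clusters.items() if len(titles) >= 2}

# Made-up result titles for the search example "Pittsburgh".
results = [
    "Pittsburgh Steelers schedule",
    "Pittsburgh Steelers tickets",
    "University of Pittsburgh admissions",
    "University of Pittsburgh library",
]
clusters = cluster_by_term(results, "Pittsburgh")
for term, titles in sorted(clusters.items()):
    print(term, titles)
```

With these sample titles the sketch yields two clusters, "steelers" and "university", while terms that appear in only one title are dropped.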
Search examples for Clusty • SNP • BLAST • Lupus
Web 2.0 • Website bookmarking and tagging tools • Del.icio.us: a social bookmarking web service for storing, sharing, and discovering web bookmarks.
Web 2.0 • Connotea; http://www.connotea.org/
Medline searching tool • PubMed vs ClusterMed Search example: macular degeneration, cell cycle, p53
Molecular databases • DNA Sequence Databases and Analysis Tools • Enzymes and Pathways • Gene Mutations, Genetic Variations and Diseases • Genomics Databases and Analysis Tools • Immunological Databases and Tools • Microarray, SAGE, and other Gene Expression • Organelle Databases • Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) • Plant Databases • Protein Sequence Databases and Analysis Tools • Proteomics Resources • RNA Databases and Analysis Tools • Structure Databases and Analysis Tools
HSLS OBRC • http://www.hsls.pitt.edu/guides/genetics/obrc/
Types of databases • By level of curation: • Archival • GenBank, GenPept, ssSNP • Curated • RefSeq, SwissProt, RefSNP
Types of databases • Archival data • repository of information • redundant; might have many sequence records for the same gene, each from a different lab • submitters maintain editorial control over their records: what goes in is what comes out • no controlled vocabulary • variation in annotation of biological features Example: GenBank record
GenBank • archival database of nucleotide sequences from >130,000 organisms • records annotated with coding region (CDS) features also include amino acid translations • each record represents the work of a single lab • redundant; can have many sequence records for a single gene
RefSeq • Curated data • non-redundant; one record for each gene, or each splice variant • each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article • records contain value-added information that has been added by one or more experts
RefSeq • Database of reference sequences • Curated • Non-redundant; one record for each gene, or each splice variant, from each organism represented • A representative GenBank record is used as the source for a RefSeq record • Value-added information is added by one or more experts • Each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article • Variety of accession number prefixes (NM_, NP_, etc.) and status codes (provisional, reviewed, etc.). More about those in later slides. • The RefSeq database includes genomic DNA, mRNA, and protein sequences, so it organizes information according to the model of the central dogma of biology
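Because RefSeq accession prefixes encode the molecule type, a record can be classified from its accession alone. The sketch below is a minimal illustration: the slide names only NM_ and NP_, the other prefixes are added from NCBI's published RefSeq conventions, and the function name and sample accession numbers are hypothetical (they show the format only, not specific records).

```python
# Common RefSeq accession prefixes and the molecule type each denotes.
# Only NM_ and NP_ appear on the slide; the rest are added for context.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "NC_": "complete genomic molecule",
    "NG_": "genomic region",
    "XM_": "predicted mRNA (model)",
    "XR_": "predicted non-coding RNA (model)",
    "XP_": "predicted protein (model)",
}

def refseq_molecule_type(accession):
    """Return the molecule type implied by a RefSeq-style accession prefix."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown or not a RefSeq accession")

# Sample accessions are format examples only.
for acc in ["NM_000000", "NP_000000", "AB000000"]:
    print(acc, "->", refseq_molecule_type(acc))
```

The last sample has no underscore-style prefix, as is typical of archival GenBank accessions, so the lookup falls through to the default.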
Searching GenBank • Find messenger RNA sequence for Human epidermal growth factor (EGF) gene.
Database developers • NCBI • EBI
Entrez: neighbors and hard links (figure; source: NCBI) • Databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy • Neighbor relationships: Word weight, BLAST, VAST, Phylogeny
Entrez Gene • NCBI’s database for gene-centric information • focuses on organisms whose genomes are • completely sequenced • supported by an active research community that contributes gene-specific information • scheduled for intense sequence analysis • Total Taxa: 4,246; Total Genes: 2,843,587 • 160,000 organisms in the nucleotide sequence database (GenBank)
Entrez gene • each record represents a single gene from a given organism Gene record includes: • a unique identifier or GeneID assigned by NCBI • a preferred symbol • and any one or more of: • sequence information • map information • official nomenclature from an authority list • alternate gene symbols • summary of gene/protein function • published references that provide additional information on function • expression • homology data • and more
Gene / Protein (figure): • Exon-Intron Structure • Chromosomal Localization • mRNA Sequence • Genomic Sequence • Homologous Sequences • SNP • Expression Profile • Amino Acid Sequence • 3D Structure • Interacting Partners • Disease
Entrez gene Find: • gene symbols and aliases • sequences: genomic, mRNA, protein • intron-exon architecture • genomic context: neighboring and antisense genes • Interacting partners • associated gene ontology terms: function, cellular component and biological process
Entrez Gene record Query: BRCA1 • Search tips: • Query text box: BRCA1 • Limits: • To limit your search to a specific field, select “Gene name” from the drop-down menu • Limit by taxonomy: select “Homo sapiens” • Screenshot labels: Name and aliases; Chromosomal location
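The same field and taxonomy limits can be written directly into the query using Entrez field tags. The sketch below only builds the search URL and makes no network request; it assumes NCBI's E-utilities `esearch` endpoint and the `[Gene Name]` and `[Organism]` field tags, and the helper name `gene_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed; not mentioned on the slide).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def gene_search_url(symbol, organism):
    """Build an Entrez Gene search URL limited by gene name and organism."""
    # Field tags in square brackets mirror the "Limits" choices in the web form.
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    return ESEARCH + "?" + urlencode({"db": "gene", "term": term})

url = gene_search_url("BRCA1", "Homo sapiens")
print(url)
```

Typing the unencoded term, `BRCA1[Gene Name] AND Homo sapiens[Organism]`, into the Entrez Gene query box should have the same effect as setting the limits by hand.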
Entrez Gene: sequences and genomic context • Sequences: mRNA, genomic, protein
Transcription and alternative splicing Alternative splicing: http://www.exonhit.com/UserFiles/Image/epissage.swf?PHPSESSID=d9u8tiu2sioqa8u29bkop3l0l2
Entrez Gene: intron-exon architecture Tip: change Display from “Summary” to “Gene Table”
Gene Ontology • Controlled vocabulary tagging • Function • Biological Processes • Cellular Component
Entrez Gene: homologous sequences Tip: change the Display setting from “Summary” to “Alignment score” to “Multiple Alignment”
Single nucleotide polymorphisms • Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For example, a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA.
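The example above can be checked by comparing the two sequences position by position. This is a minimal sketch that assumes both sequences have the same length and are already aligned; the function name `find_snps` is hypothetical.

```python
def find_snps(reference, variant):
    """Return (position, reference_base, variant_base) for each mismatch.

    Positions are 1-based. Both sequences must be the same length,
    since this sketch does no alignment.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, ref_base, var_base)
        for i, (ref_base, var_base) in enumerate(zip(reference, variant))
        if ref_base != var_base
    ]

snps = find_snps("AAGGCTAA", "ATGGCTAA")
print(snps)  # [(2, 'A', 'T')]
```

The single difference is at position 2, where A is replaced by T.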