Essential Bioinformatics and Biocomputing (LSM2104: Section I) Lecture 3: More biological databases, retrieval systems and database searching
Biological databases: Function and pathways databases - KEGG
KEGG (http://www.genome.ad.jp/kegg/kegg2.html) links genetic information with cellular functions. It provides keyword searches and pre-calculated sequence comparison searches. It consists of several interconnected databases:
• PATHWAY contains information on metabolic and regulatory networks.
• GENES contains information on genes and proteins.
• LIGAND contains information on chemical compounds and reactions involved in cellular processes.
• EXPRESSION and BRITE contain micro-array gene expression data.
• SSDB helps identify protein coding genes.
It has an integrated database retrieval system: DBGET.
Biological databases: BIND
Biomolecular Interaction Network Database (http://www.bind.ca/)
1. Stores descriptions of interactions, molecular complexes and pathways.
2. Provides search tools. PreBIND locates literature sources: “Show me a list of all of the papers in PubMed that are about my protein of interest. Then classify all of these papers and tell me which ones are likely to contain interaction information. Finally, identify all of the other proteins mentioned in these papers and indicate whether these proteins might interact with my protein of interest.”
Bader GD, et al. BIND: The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001; 29(1):242-5.
Biological databases: BIND
Biomolecular Interaction Network Database (http://www.bind.ca/)
3. Its BLAST tool searches the BIND database for similarity to a query sequence.
BIND is at the forefront of proteomics efforts and is expected to grow as large-scale proteomic data become available.
BIND Statistics
Database                        Record Count
Interaction Database            11255
Biomolecular Pathway Database   8
Molecular Complex Database      851
Organisms represented           12
GI Database                     4651
DI Database                     0
Publication Database            428
Protein family/domain databases: Sequence alignment
Pfam (http://www.sanger.ac.uk/Software/Pfam/)
• Pfam is a collection of multiple protein sequence alignments and statistical models that can be used to classify protein families and domains.
• Descriptions of protein domains:
  • Given an established SWISS-PROT sequence, Pfam shows the pre-computed domain structure of the protein.
  • Given a completely new protein sequence, Pfam computes a domain structure.
Protein family/domain databases: Sequence patterns
PROSITE (http://ca.expasy.org/prosite/)
• Protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to.
• It currently contains patterns and profiles specific for more than a thousand protein families or domains.
An example of a pattern (motif): W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P
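To make the pattern notation concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression. The function name `prosite_to_regex` and the test sequence are illustrative assumptions, not part of PROSITE; only the basic syntax elements (x, [..], {..} and repeat counts) are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional repeat such as (2) or (9,11).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # {..} lists forbidden residues
        # [..] and single letters are already valid regex syntax.
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)

regex = prosite_to_regex("W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P")
# regex is now: W.{9,11}[VFY][FYW].{6,7}[GSTNE][GSTQCR][FYW].{2}P

# Hypothetical test sequence built to satisfy the pattern.
toy_sequence = "MK" + "W" + "A" * 10 + "VF" + "A" * 6 + "GSW" + "AA" + "P" + "LL"
print(regex, bool(re.search(regex, toy_sequence)))
```

A regular expression only reports whether the pattern occurs; it says nothing about whether the match is biologically meaningful.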
Protein sequence motif databases: PROSITE
• A profile is a matrix of position-specific scores derived from a multiple sequence alignment.
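As a rough illustration of what "a matrix derived from multiple alignments" means, the sketch below builds a simple per-column residue frequency table from a toy alignment and scores a segment against it. The alignment, the function names and the additive scoring are assumptions made for illustration; real PROSITE profiles use weighted scores and gap penalties rather than raw frequencies.

```python
from collections import Counter

def build_profile(alignment):
    """Return one dict per alignment column mapping residue -> frequency."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile

def score(profile, segment):
    """Sum the column frequencies of the residues in an equally long segment."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))

# Hypothetical four-sequence alignment, not taken from any database.
toy_alignment = ["GSTW", "GSCW", "NSTW", "GTTF"]
profile = build_profile(toy_alignment)

print(profile[0])              # frequencies in the first column
print(score(profile, "GSTW"))  # consensus-like segment scores high
print(score(profile, "AAAA"))  # unrelated segment scores zero
```

Unlike a pattern, which either matches or does not, a profile gives every candidate segment a graded score.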
Biological data retrieval systems: Entrez
http://www.ncbi.nlm.nih.gov/Database/index.html
1. A retrieval system for searching a number of inter-connected databases at the NCBI. It provides access to:
• PubMed: the biomedical literature (Medline)
• GenBank: nucleotide sequence database
• Protein sequence database
• Structure: three-dimensional macromolecular structures
• Genome: complete genome assemblies
• PopSet: population study data sets
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: gene expression and microarray datasets
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• SNP: single nucleotide polymorphisms
• CDD: conserved domains
2. Entrez allows users to perform various searches.
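Entrez searches can also be issued programmatically. The sketch below only builds a query URL and does not contact the server. It assumes NCBI's E-utilities `esearch` endpoint, which the lecture itself does not mention; the helper name `build_esearch_url` and the example query are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not referenced in the lecture).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database, term, max_results=20):
    """Build an ESearch URL for one Entrez database and one query term."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Example: protein records whose title mentions IL-10, restricted to human.
url = build_esearch_url("protein", "IL-10[Title] AND Homo sapiens[Organism]")
print(url)
```

Changing the `db` value (for example to `pubmed` or `nucleotide`) points the same query at a different Entrez database, which mirrors how the web interface switches between linked databases.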
Biological data retrieval systems: SRS
http://srs.ebi.ac.uk/
• SRS is a retrieval system for searching several linked databases at the EBI. Like Entrez, it provides access to various databases and supports keyword, sequence similarity and class searches.
Biological databases: Database searching
Database searching can be used to answer questions such as:
• What is the sequence of human IL-10?
• What is the gene coding for human IL-10?
• Is the function of human IL-10 known? What is it?
• Are there any variants of human IL-10?
• Who sequenced this gene?
• What are the differences between IL-10 in human and in other species?
• Which species are known to have IL-10?
• Is the structure of IL-10 known?
• What are the structural and functional domains of IL-10?
• Are there any motifs in the sequence that explain its properties?
• What is the upstream region of IL-10 that contains transcriptional regulation sites?
• …
Biological databases: Database searching
• For a well-studied molecule such as IL-10, we expect to extract many of the well-known facts.
• These searches are useful for characterizing newly identified sequences.
Notes:
• Database entries can contain multiple errors. Some are introduced when sequences are submitted to databases, some are due to naming conventions (or the lack of them), and some are due to poor links between databases.
• Users should treat data extracted from databases with care and compare the results with information from other databases, journal articles and other sources.
Biological databases: Keyword searching Search DNA and protein databases with keywords (10-July-2002)
Biological databases: Keyword searching
Notes:
• GenPept is the protein translation of GenBank. SPTR is SWISS-PROT plus the protein translation of EMBL sequences.
• Different databases contain different, but overlapping, sets of entries. The same sequence may have entries in different databases.
• Some databases have non-redundant sections. For example, the UniGene system automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters.
• For complete results it is usually necessary to search multiple databases.
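The overlap between databases can be handled by merging hits and grouping them by sequence. The following is a minimal sketch under stated assumptions: the accessions and sequences are made-up placeholders, `merge_hits` is a hypothetical helper, and exact sequence identity is used as the redundancy criterion, which is much cruder than the clustering done by systems such as UniGene.

```python
def merge_hits(hits_by_database):
    """Group (accession, sequence) hits from several databases by sequence."""
    merged = {}
    for database, hits in hits_by_database.items():
        for accession, sequence in hits:
            key = sequence.upper()  # ignore case differences between sources
            merged.setdefault(key, []).append((database, accession))
    return merged

# Placeholder hits: database names follow the lecture, everything else is invented.
toy_hits = {
    "SWISS-PROT": [("SP_0001", "MKVLA"), ("SP_0002", "GGCTR")],
    "GenPept": [("GP_0001", "mkvla"), ("GP_0002", "PLQWE")],
}
merged = merge_hits(toy_hits)

for sequence, sources in merged.items():
    status = "shared" if len(sources) > 1 else "unique to one database"
    print(sequence, sources, status)
```

Sequences that appear under only one database are exactly the entries a single-database search would have missed.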
Biological databases: Database coverage • Example: Scorpion KALIOTOXIN 2 (SwissProt:P45696)
Biological databases: Database errors
• Our scorpion study (Srinivasan et al., 2002) also revealed numerous errors and missing data in the major databases. One of the entries had an error in the sequence in the journal publication, but the correct sequence in the databases.
Biological Databases: Sequence Similarity Searching
Proteins that have similar sequences often have similar structures and similar functions. If we have only a protein sequence, we can often infer its structural and functional properties by analyzing similar sequences.
Sequence – Structure – Function relationship: Similar sequence = Similar structure = Similar function
Why does this relationship hold?
• Evolution: involves sequence variation.
• Laws of physics and chemistry: define the sequence-structure relationship.
• Function as defined by molecular interaction: structure-based.
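One simple way to quantify "similar sequence" is percent identity over an alignment. The sketch below assumes the two sequences have already been aligned (gaps written as "-") and uses made-up example strings; it is not the scoring scheme used by any particular search tool.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical aligned fragments: 4 identical residues out of 5 compared columns.
print(percent_identity("ACDE-G", "ACDFHG"))
```

High identity makes a shared structure and function more plausible, but it remains an inference to be checked against other evidence.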
Database Searching: Cautionary Notes
• Some database matches arise from chance similarity, in keyword searches and sequence similarity searches alike. Distinguishing chance matches from biologically significant matches is one of the most important issues for effective use of biological databases.
• We used the sequence similarity tool BLAST to search GenBank for short, nearly exact matches to the names of last semester's lecturers of this module, treated as sequences. The search returned two imperfect matches to “VLADIMIR” and seven perfect matches to “TINWEE”.
Database Searching: Cautionary Notes
• If we interpreted these results blindly, we would erroneously conclude that the motif VLADIMIR may be important for the structure or function of the strawberry vein banding virus, and that TINWEE has to do with calcium-dependent protein kinase in rice and possibly in Legionella pneumophila.
• We can avoid conclusions like this by looking at the similarity scores. This will be covered in more detail later in the course; for now it is important to know that the lower the expected value, the better the match. Anything close to or greater than 1 should be treated with suspicion. However, matches that are not statistically significant can sometimes still have biological significance. If we suspect that this might be the case, further analysis is necessary.
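A back-of-the-envelope calculation shows why short exact matches are unsurprising. The sketch below estimates how many exact occurrences of a word we would expect by chance in a random protein database, assuming all 20 amino acids are equally likely and a hypothetical database size of one billion residues. This is a simplification for intuition only and is not how BLAST computes its expected value.

```python
def expected_chance_matches(query_length, database_residues, alphabet_size=20):
    """Expected number of exact chance occurrences of a word of the given length."""
    positions = max(database_residues - query_length + 1, 0)
    return positions * (1.0 / alphabet_size) ** query_length

# Hypothetical database size; real databases differ and keep growing.
database_residues = 1_000_000_000

for word in ("TINWEE", "VLADIMIR"):
    expected = expected_chance_matches(len(word), database_residues)
    print(word, len(word), round(expected, 3))
```

Under these assumptions a six-letter word is expected to occur by chance more than once, while an eight-letter word is expected far less than once, so each extra residue makes a chance exact match about 20 times less likely.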
Database Searching: Cautionary Notes
• Examples of chance matches: virtually any string or keyword can show “matches” to database entries. We are interested only in the real ones.
• The same search with GenBank. Fortunately, we have statistical measures that indicate the quality of matches. However, matches with low statistical significance sometimes have real biological significance. More about this will be taught later in the course.
Biological databases: Concluding remarks
1. Biological databases are an invaluable resource in support of biological research.
2. We can learn much about a particular molecule by searching databases and using the available analysis tools.
3. A large number of databases are available for that task. Some are very general, some are more specialized, and some are very specialized. For best results we often need to access multiple databases.
Biological databases: Concluding remarks
4. The major types of databases covered in this course are general nucleotide, general protein, structure, pathway, molecular interaction, protein motif, publication, and specialized databases.
5. Common database search methods include keyword matching, sequence similarity, motif searching, and class searching.
6. The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.
Biological databases: Concluding remarks
7. Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult.
8. Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS.
9. Formulating queries is a serious issue in biological databases. Often the quality of the results depends on the quality of the queries.
Biological databases: Concluding remarks
10. Statistical measures indicate the quality of matches. Statistical and biological significance are often related. Sometimes, however, matches of real biological significance have low statistical scores.
11. Access to biological databases is so important that today virtually every molecular biology project starts and ends with querying biological databases.
Biological databases
Summary of today’s lecture
• Popular databases: KEGG, BIND, Pfam, PROSITE, PubMed
• Data retrieval systems: Entrez, SRS
• Database searching: capability, potential problems
• Statistics:
  • Protein families (> 5K)
  • Sequence patterns (> 1.5K)
  • Interactions (> 11K, or about 110 × 110, which is relatively few)
  • Relatively small amount of data for function (e.g. pathways < 200)