Bioinformatics Databases: Getting Knowledge from Information

Kristen Anton, Director of BioInformatics, Dartmouth Medical School

Presentation Transcript


  1. Bioinformatics Databases: Getting Knowledge from Information. Kristen Anton, Director of BioInformatics, Dartmouth Medical School. BioInformatics @ Dartmouth Medical School

  2. What is Bioinformatics? Bioinformatics provides the backbone computational tools, databases and domain expertise that facilitate modern biomedical, biological and genomic research.

  3. What is Bioinformatics? The expertise is multidisciplinary, and the skills fall on a continuum from ‘pure’ science to ‘pure’ computing:
     • ‘Wet-lab’ science
     • Sequence analysis
     • Modeling & structural work
     • Algorithm development
     • Clinical and translational research
     • Hardware & software infrastructure

  4. With a field this extensive and skill sets so varied, where do we begin?

  5. From Information Design, Nathan Shedroff

  6. Moving from Information to Knowledge to Understanding: Genetic testing
     • BRCA1 and BRCA2 gene mutations: what is the real risk to women carriers? Estimates range from 25% to 80%.
     • Huntington’s Disease: the mechanism is defined, but what does that mean for the individual in terms of age of onset, severity of disease, or how the disease will progress?

  7. How can Bioinformatics facilitate the extraction of information?
     • Development of tools that support laboratory experiments
     • Design, implementation and integration of biological databases
     • Development of various analytical tools, algorithms and models
     • Development of systems to collect, validate, manage and integrate clinical and research data to facilitate translational research

  8. Bioinformatics will not replace experiments, but can greatly streamline and enable the discovery process.

  9. One of the fundamental tools of bioinformatics: Database
     • A database is a body of information stored in two dimensions (rows and columns)
     • The power of the database lies in the relationships that you construct between the pieces of information (tables)
     • SQL (Structured Query Language) - interactive and embedded
     • Good design and application ensure data integrity
     • Interoperability
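To make the points about tables, relationships and integrity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, gene symbol and accession values are all hypothetical and chosen only for illustration; they do not come from the presentation or from any real database.

```python
import sqlite3

# In-memory database; every table, column and value below is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce table relationships
conn.executescript("""
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        symbol  TEXT NOT NULL UNIQUE
    );
    CREATE TABLE sequence (
        accession TEXT PRIMARY KEY,
        gene_id   INTEGER NOT NULL REFERENCES gene(gene_id),
        length_bp INTEGER NOT NULL CHECK (length_bp > 0)
    );
""")
conn.execute("INSERT INTO gene VALUES (1, 'GENE_A')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("ACC0001", 1, 1020), ("ACC0002", 1, 2400)],
)

# The relationship between the two tables is what makes this query possible.
rows = conn.execute("""
    SELECT g.symbol, s.accession, s.length_bp
    FROM gene AS g
    JOIN sequence AS s ON s.gene_id = g.gene_id
    ORDER BY s.accession
""").fetchall()

# Data integrity: a sequence that points at a non-existent gene is rejected.
try:
    conn.execute("INSERT INTO sequence VALUES ('ACC0003', 99, 500)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, orphan_rejected)
```

The foreign key and the CHECK constraint are one possible way to express "good design ensures data integrity"; a production schema would likely need more tables and constraints than this.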

  10. Industry Challenge #1: Genome annotation. The human genome is sequenced. It is estimated that about 2% of the human genome codes for proteins. The function of the remaining 98% (non-coding regions) is largely unknown but likely includes providing chromosomal structural integrity and regulating where, when, and in what quantity proteins are made.

  11. What does the genome data look like?
          1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
         61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
        121 aacggcacag ccgaggcggt gcagatccag ttcggcctca tcaactgcgg caacaagtac
        181 ctgacggccg aggcgttcgg gttcaaggtg aacgcgtccg ccagcagcct gaagaagaag
        241 cagatctgga cgctggagca gccccctgac gaggcgggca gcgcggccgt gtgcctgcgc
        301 agccacctgg gccgctacct ggcggcggac aaggacggca acgtgacctg cgagcgcgag
        361 gtgcccggtc ccgactgccg tttcctcatc gtggcgcacg acgacggtcg ctggtcgctg
        421 cagtccgagg cgcaccggcg ctacttcggc ggcaccgagg accgcctgtc ctgcttcgcg
        481 cagacggtgt cccccgccga gaagtggagc gtgcacatcg ccatgcaccc tcaggtcaac
        541 atctacagtg tcacccgtaa gcgctacgcg cacctgagcg cgcggccggc cgacgagatc
        601 gccgtggacc gcgacgtgcc ctggggcgtc gactcgctca tcaccctcgc cttccaggac
        661 cagcgctaca gcgtgcagac cgccgaccac cgcttcctgc gccacgacgg gcgcctggtg
        721 gcgcgccccg agccggccac tggctacacg ctggagttcc gctccggcaa ggtggccttc
        781 cgcgactgcg agggccgtta cctggcgccg tcggggccca gcggcacgct caaggcgggc
        841 aaggccacca aggtgggcaa ggacgagctc tttgctctgg agcagagctg cgcccaggtc
        901 gtgctgcagg cggccaacga gaggaacgtg tccacgcgcc agggtatgga cctgtctgcc
        961 aatcaggacg aggagaccga ccaggagacc ttccagctgg agatcgaccg cgacaccaaa
        ...
      Multiply this by eighteen million.
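A block like the one on this slide (a position number followed by groups of ten bases) can be turned back into a plain sequence string with a few lines of code. The sketch below is a hand-rolled illustration, not a full parser for any official file format; the function names `parse_sequence_block` and `gc_fraction` are made up for this example, and only the first two lines of the slide's sequence are used.

```python
def parse_sequence_block(text: str) -> str:
    """Join numbered sequence lines into one string of bases.

    Assumes whitespace-separated tokens where purely numeric tokens are
    position counters and everything else is sequence.
    """
    bases = []
    for token in text.split():
        if token.isdigit():
            continue  # skip the leading position number on each line
        bases.append(token.lower())
    return "".join(bases)


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are g or c (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("g") + seq.count("c")) / len(seq)


block = """
        1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
       61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
"""

seq = parse_sequence_block(block)
print(len(seq), round(gc_fraction(seq), 3))
```

For real records, an established library parser would be a safer choice than this token-splitting approach, which would misread any line containing other numeric fields.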

  12. What does the genome annotation look like today?

  14. The value of a genome is only as good as its annotation
      • Two steps: annotation & curation
      • Each genome is annotated individually
      • Manual curation is standard practice
      • New tools, e.g. NCBI Mapviewer, ESTAnnotator, NCBI Annotation Pipeline
      • Many databases available …

  15. Nucleic Acids Research article lists 1078 public databases (up from 719 in 2005): Nucleic Acids Research, 2008, Vol. 36, Database issue http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2

  16. Growth in Available Bioinformatics Databases

  17. Industry Challenge #2: Too much unintegrated data
      • Data sources incompatible
      • No (or few) standard naming conventions
      • No common interface (varying tools for browsing, querying and visualizing data)

  18. Public Data Resources
      • “Mandatory” sequence submissions
      • Cover an enormously wide range of informational topics
      • Broad (sequence) to very specific (proteins associated with tooth decay) issues
      • No standard database format: poor interoperability, difficulty with integration
      • Ongoing efforts to address the annotation problem

  19. NCBI Database Resources http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?term=uniprot

  20. Major Sequence Repositories
      • GenBank: All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
      • EMBL Nucleotide Sequence Database: All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
      • DNA Data Bank of Japan (DDBJ): All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
      • TIGR/J. Craig Venter Institute: Non-redundant, gene-oriented clusters (and many curated microbial genome databases)
      • UniGene: Non-redundant, gene-oriented clusters

  21. Entrez Gene: a unified query environment for genes defined by sequence
      • Summary/descriptive information
      • PubMed entries/bibliography
      • Interactions
      • NCBI Reference Sequences (RefSeq)
      • Related sequences
      • Pathways
      • Ontologies
      • Additional links (e.g. UniGene reference)

  22. GenBank

  23. GenBank Growth

  24. GenBank Growth
      • 1982: Database contains 606 sequences
      • Feb 2008 release notes: Database contains more than 82 million sequences (82,853,685); the number of bases approximately doubles every 18 months
      • 240,000 different species represented, with new species added at a rate of 2900/month
      • 16% of sequences are of human origin, 13% are human ESTs
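As a rough consistency check on these figures, the two sequence counts on the slide can be used to estimate an average doubling time. Note the assumptions: the slide's 18-month claim is about the number of bases, while this sketch uses the number of sequence records, and the elapsed time is taken as roughly 26 years because the slide gives no month for the 1982 figure.

```python
import math

first_count = 606            # sequences in 1982 (from the slide)
latest_count = 82_853_685    # sequences in the Feb 2008 release (from the slide)
elapsed_months = 26 * 12     # assumption: about 26 years between the two counts

# Number of times the record count doubled over the period.
doublings = math.log2(latest_count / first_count)

# Average time per doubling, if growth had been steadily exponential.
doubling_time_months = elapsed_months / doublings

print(f"{doublings:.1f} doublings, one every {doubling_time_months:.1f} months")
```

Under these assumptions the record count doubled about 17 times, or once every 18 months or so, which is in line with the growth rate the slide quotes for bases.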

  25. Potential Errors in GenBank
      • Sequence errors estimated at between 0.37 and 35 (!) errors per 1000 bases
      • Recombination
      • Contamination
      • Annotation errors - propagated misannotations
      • Transfer by similarity is problematic
      • Errors not always corrected in a timely way
      • Genes with varying unrelated functions depending on context
      • Functional annotation is often unsystematic
      • Name-function disconnect
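To get a feel for what the quoted error range means for a single record, the arithmetic can be sketched as follows. The 1020-base record length is a hypothetical value picked for illustration, and the calculation assumes errors are spread uniformly along the sequence, which the slide does not claim.

```python
def expected_errors(length_bp: int, errors_per_1000_bases: float) -> float:
    """Expected number of erroneous bases, assuming a uniform error rate."""
    return length_bp * errors_per_1000_bases / 1000


low_rate, high_rate = 0.37, 35.0   # errors per 1000 bases (from the slide)
record_length = 1020               # hypothetical record length in bases

low = expected_errors(record_length, low_rate)
high = expected_errors(record_length, high_rate)
print(f"between {low:.2f} and {high:.1f} expected errors in {record_length} bases")
```

The two ends of the range differ by roughly a factor of one hundred, which is why the slide flags the upper estimate.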

  26. Potential Errors in GenBank
      • Naming conflicts
      • One gene, many acronyms
      • Many genes, shared acronym
      • Spelling errors
      • Cultural differences (US, UK)
      • Representation of non-ASCII characters

  31. Also known as: ACTR; AIB1; RAC3; SRC3; pCIP; AIB-1; CTG26; SRC-1; CAGH16; KAT13B; TNRC14; TNRC16; TRAM-1; MGC141848
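The "one gene, many acronyms" and "many genes, shared acronym" problems can be made visible with a simple alias index. In the sketch below, the gene labels `GENE_A` and `GENE_B` are hypothetical placeholders, the assignment of aliases to them is invented for illustration (only the alias strings themselves are taken from the slide), and the normalization rule of dropping hyphens and ignoring case is a crude assumption rather than an established convention.

```python
from collections import defaultdict

# Hypothetical records: gene label -> aliases reported for it.
records = {
    "GENE_A": ["ACTR", "AIB1", "RAC3", "SRC3", "pCIP", "AIB-1"],
    "GENE_B": ["RAC3"],
}


def normalize(alias: str) -> str:
    """Crude normalization: drop hyphens and ignore case."""
    return alias.replace("-", "").upper()


def build_alias_index(gene_aliases: dict) -> dict:
    """Map each normalized alias to the set of genes that use it."""
    index = defaultdict(set)
    for gene, aliases in gene_aliases.items():
        for alias in aliases:
            index[normalize(alias)].add(gene)
    return index


index = build_alias_index(records)

# Aliases claimed by more than one gene cannot be resolved by name alone.
ambiguous = {alias: sorted(genes) for alias, genes in index.items() if len(genes) > 1}
print(ambiguous)
```

A lookup by alias is only safe when the index returns exactly one gene; anything in `ambiguous` needs a stable identifier or extra context to resolve.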

  32. Many Databases available:
      • Comparative Genomics
      • Gene Expression
      • Gene Identification & Structure
      • Genetic Maps
      • Genomic Databases
      • Intermolecular Interactions
      • Metabolic Pathways and Cellular Regulation
      • Mutation Databases
      • Pathology
      • Protein Databases
      • Protein Sequence Motifs
      • Proteome Resources
      • Retrieval Systems & Database Structure
      • RNA Sequences
      • Structure
      • Transgenics
      • Varied Biomedical Content

  33. The principal requirements on the public data services
      • Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
      • Supporting data - database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
      • Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
      • Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
      • Integration - each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.

  34. Comparative Genomics: COG
      • Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in 66 complete genomes, representing 38 major phylogenetic lineages.
      • Each cluster corresponds to an ancient conserved domain.

  35. Gene Expression

  36. Genetic Maps

  37. Genomic Databases

  38. Intermolecular Interactions

  39. Metabolic Pathways and Cellular Regulation

  40. Mutation Databases

  41. Pathology

  42. Protein Databases

  43. Protein Databases: Swiss-Prot
      • Extremely well curated protein database
      • Link to BLAST
      • Powerful cross-references
      • Est. 1986
      • Maintained by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library

  44. Proteome Resources: Proteome BKL

  45. RNA Sequences

  46. Structure

  47. Varied Biomedical Content

  48. Extinct: Gene Identification & Structure

  49. National Center for Biotechnology Information (NCBI): A network of linked resources
      • Database access: GenBank, structure, function, SNP, taxonomy...
      • Literature (PubMed)
      • Whole genomes
      • Tools
      • Contacts & research information
      • FTP
