
Issues in Managing and Disseminating Changing Information in Biology

Issues in Managing and Disseminating Changing Information in Biology. Sue Rhee Rhee@acoma.stanford.edu Carnegie Institution Department of Plant Biology Stanford, CA. Information Dissemination Media in Biology. Journals ~150 years peer-reviewed highly referenced limited size static.


Presentation Transcript


  1. Issues in Managing and Disseminating Changing Information in Biology • Sue Rhee (Rhee@acoma.stanford.edu) • Carnegie Institution, Department of Plant Biology, Stanford, CA

  2. Information Dissemination Media in Biology • Journals: ~150 years, peer-reviewed, highly referenced, limited size, static • Public Repositories: ~20 years, minimum review, minimum reference, unlimited size, static • Community Databases: ~5 years, curator-reviewed, moderately referenced, unlimited size, dynamic

  3. TAIR: The Arabidopsis Information Resource • A community database of Arabidopsis information • Researchers can search, download, and analyze data via commonly used web browsers and FTP • NSF-funded project (1999-2004) • Collaboration between Carnegie (Stanford, CA), NCGR (Santa Fe, NM) and ABRC (Columbus, OH) • http://www.arabidopsis.org

  4. Who are the users? (Charts: People, Groups, Organism of Interest) • Total: 12,300 individuals and 4,700 labs working on plant research

  5. Usage Statistics • Monthly: ~5 million files served • ~900,000 page views • ~29,000 IP addresses • ~30 Gb served

  6. What do we do? • Capture data generated by large genome projects and individual researchers • Read and extract information from the literature, establish contact with large-scale project groups • Curate and analyze the information • Error checking, making associations, synthesizing summaries, adding quality control filters through a series of standard operating procedures and analysis pipelines • Make information accessible to users in an intuitive form • In-house biologists and user feedback from surveys & workshops • Develop data query, analysis, curation, and visualization tools • Collaboration between software developers and biologists, an iterative process • Communicate with the users • Data submission, suggestions, error and other problem reports

  7. What is PubSearch? • A web application and database for literature curation • Stores complete literature information • References, abstracts, full-text articles (PDF) • Stores biological information • Genes, proteins, descriptions • Stores ontologies (GO terms) • Links literature, GO terms and biological information • Assists manual curation with fast, automatic matching (using a suffix tree indexer) • Is password-protected, and easy to set up and use
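The automatic matching step on this slide can be illustrated with a minimal sketch. It is not PubSearch's implementation: it swaps in a sorted suffix array, a simpler relative of the suffix tree named on the slide, and the function names `build_suffix_array`, `find_term`, and `match_terms`, along with the sample abstract and term list, are hypothetical.

```python
def build_suffix_array(text: str) -> list:
    """Return the start offsets of all suffixes of text, in sorted order."""
    # Naive construction: acceptable for a single abstract, too slow for
    # a whole corpus, where a real suffix tree or a linear-time builder
    # would be the better choice.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_term(text: str, suffix_array: list, term: str) -> list:
    """Return the sorted offsets at which term occurs in text."""
    if not term:
        return []
    n = len(term)
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose first n characters are >= term.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + n] < term:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes starting with term sit next to each other from here on.
    hits = []
    while lo < len(suffix_array):
        start = suffix_array[lo]
        if text[start:start + n] != term:
            break
        hits.append(start)
        lo += 1
    return sorted(hits)


def match_terms(abstract: str, terms: list) -> dict:
    """Map each term found in the abstract to its offsets (case-insensitive)."""
    text = abstract.lower()
    suffix_array = build_suffix_array(text)
    matches = {}
    for term in terms:
        offsets = find_term(text, suffix_array, term.lower())
        if offsets:
            matches[term] = offsets
    return matches


# Hypothetical abstract and gene/vocabulary terms, for illustration only.
abstract = ("ABI1 encodes a protein phosphatase; "
            "abi1 mutants are insensitive to abscisic acid.")
hits = match_terms(abstract, ["abi1", "abscisic acid", "auxin"])
```

The index is built once per abstract and then queried for every gene name or vocabulary term, which is what makes this kind of matching fast enough to propose hits for a curator to validate.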

  8. PubSearch System Architecture

  9. TAIR Installation Statistics (9/12/03) • 20,272 literature references • 14,920 research papers with abstracts • 8,642 full-text papers (58%) • 16,956 controlled vocabulary terms • 105,671 hits between terms and articles (2359 terms) • 38,010 gene names • 29,841 hits between genes and articles (4268 genes) • 14,943 hits validated • (70% valid, 29% not valid, 0.5% maybe) • 11,497 manual annotations to 5981 genes from 2113 articles • 38 relationship types for gene2term and gene2gene • 103 evidence types

  10. Pub* Tools Website: http://pubsearch.org

  11. TAIR Data Size

  12. Current Issues in Community Databases • How to maximize connection with public repositories and journals? • How to ensure information is up-to-date? • How to cross-reference all the information in independent sites? • What happens after the funding?

  13. Overlap and Interconnection Between Existing Media Journals Public Repositories Community Databases

  14. Overlap and Interconnection Between Existing Media Journals Public Repositories Community Databases

  15. Making Connections with Public Repositories • 1. Utilizing existing standards • A. LinkOut • Data capture includes GenBank accession (e.g. seed stock containing an insertion and the insert-site sequence with GenBank accession) • Data downloaded from GenBank by accession using E-utilities • Data curation/analysis generates additional associations (e.g. the insertion site is used to identify the associated gene and a polymorphism for that gene) • Sequence-associated information sent back to GenBank using the LinkOut XML format • B. MIAME standards for microarrays • Researchers submit microarray data in prefilled Excel sheets • Convert Excel into XML and load into TAIR database • Data curation/analysis generates additional associations (e.g. usage of controlled vocabularies) • Data exported into XML and sent to ArrayExpress • 2. Collaborating to make new standards • Plant microarray submission standards with ArrayExpress
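Two steps on this slide, downloading a GenBank record by accession and turning tabular submission data into XML, can be sketched as follows. This is a minimal illustration under stated assumptions, not TAIR's pipeline: `efetch_url` and `rows_to_xml` are hypothetical helpers, the `nuccore` database name and the `rettype`/`retmode` values are assumed choices for NCBI's EFetch endpoint, the accession is a placeholder, and the XML tag names are invented and follow neither the LinkOut nor the ArrayExpress schema.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base address of NCBI's E-utilities EFetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore") -> str:
    """Build an EFetch URL that requests one record in GenBank flat-file form."""
    # db, rettype and retmode are assumptions chosen for this sketch.
    params = {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def rows_to_xml(rows, root_tag: str = "submission", row_tag: str = "record") -> str:
    """Serialize spreadsheet-like rows (a list of dicts) as a flat XML document."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Column names become tag names, so they must be valid XML names.
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder accession and hypothetical spreadsheet columns.
url = efetch_url("AB000001")
xml_text = rows_to_xml([{"sample": "leaf", "array_id": "A001"}])
```

The URL is only constructed here, not fetched, so the sketch runs without network access. A real exporter would map columns onto the element names required by the receiving repository rather than reusing spreadsheet headers as tags.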

  16. Making Connections with Journals • Publication requirement to adhere to existing standards • Stock Accessions • Gene symbol Registry (currently under discussion) • Data sharing • Image data for gene expression • Supplementary data (e.g. microarray results) • Resource sharing • A. Publication through community databases?

  17. Keeping Information Up-To-Date • In-house curation • -pro: experience and standard operating procedures can ensure consistency • -con: becoming difficult to keep up as the amount and complexity of information increases • Community involvement • -pro: expertise and sheer number of the community • -con: has not worked successfully (no incentive in the current academic reward structure, not considered to be a typical role of a scientist) • Others?

  18. Impact Factor of Top Journals

  19. Impact Factor of Top Databases?

  20. Impact of TAIR

  21. Current Issues in Community Databases • How to maximize connection with public repositories and journals? • How to ensure information is up-to-date? • How to cross-reference all the information in independent sites? • What happens after the funding?

  22. The End

  23. People Involved • TAIR-Carnegie: Tanya Berardini, Marga Garcia-Hernandez, Eva Huala, Suparna Mundodi, Leonore Reiser, Julie Tacklind, Iris Xu, Danny Yoo, Peifen Zhang, Nick Moseyko, Brandon Zoekler, Jessie Zhang • TAIR-NCGR: Dan Weems, Neil Miller, Mary Montoya • ABRC: Randy Scholl, Debbie Crist, Emma Knee, Luz Rivero

  24. Information Dissemination Media in Biology • 1. Scientific Journals • Traditional medium of knowledge dissemination • Long history of publishing • Recently have moved to electronic publishing • 2. Public Repositories • Permanent operations for electronic storage and dissemination of basic data • Shorter history than journals, about 20 years • A good example is NCBI’s GenBank • 3. Community Databases • Information resources that are created, maintained, and improved by the research community • Funded by governments, not permanent • A few large databases share a similar history with public repositories • Recently there has been a radiation of community databases

  25. What is the infrastructure? (Diagram) • Web browser applications • FTP directory • DVD archive • Application program interface • Analysis cluster • Data object layer • TAIR DB • Software development, curation, testing, and staging environments
