Issues in Managing and Disseminating Changing Information in Biology
Sue Rhee, Rhee@acoma.stanford.edu
Carnegie Institution, Department of Plant Biology, Stanford, CA
Information Dissemination Media in Biology
• Journals: ~150 years; peer-reviewed; highly referenced; limited size; static
• Public Repositories: ~20 years; minimum review; minimum reference; unlimited size; static
• Community Databases: ~5 years; curator-review; moderately referenced; unlimited size; dynamic
TAIR: The Arabidopsis Information Resource
• A community database of Arabidopsis information
• Researchers can search, download, and analyze data via commonly used web browsers and FTP
• NSF-funded project (1999-2004)
• Collaboration between Carnegie (Stanford, CA), NCGR (Santa Fe, NM), and ABRC (Columbus, OH)
• http://www.arabidopsis.org
Who are the users?
[Charts: People, Groups, Organism of Interest]
• Total: 12,300 individuals and 4,700 labs working on plant research
Usage Statistics
• Monthly:
  • ~5 million files served
  • ~900,000 page views
  • ~29,000 IP addresses
  • ~30 Gb served
What do we do?
• Capture data generated by large genome projects and individual researchers
  • Read and extract information from the literature; establish contact with large-scale project groups
• Curate and analyze the information
  • Error checking, making associations, synthesizing summaries, and adding quality-control filters through a series of standard operating procedures and analysis pipelines
• Make information accessible to users in an intuitive form
  • In-house biologists and user feedback from surveys and workshops
• Develop data query, analysis, curation, and visualization tools
  • Collaboration between software developers and biologists; an iterative process
• Communicate with the users
  • Data submission, suggestions, and error and other problem reports
What is PubSearch?
• A web application and database for literature curation
• Stores complete literature information
  • References, abstracts, full-text articles (PDF)
• Stores biological information
  • Genes, proteins, descriptions
• Stores ontologies (GO terms)
• Links literature, GO terms, and biological information
• Assists manual curation with fast, automatic matching (using suffix tree indexing)
• Is password-protected and easy to set up and use
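The automatic matching step can be illustrated with a small sketch. It is not PubSearch's implementation: it swaps in a suffix array (a simpler relative of the suffix tree) built naively over one abstract, and the function names, the sample abstract, and the term list are all hypothetical. Matching here is plain case-insensitive substring search, so a real curation tool would likely add word-boundary checks.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start offsets of all suffixes of text, in sorted order."""
    # Naive construction; adequate for a single abstract, not a whole corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], term: str) -> list[int]:
    """Return the sorted start offsets at which term occurs in text."""
    if not term:
        return []
    # Binary search for the first suffix whose leading characters are >= term.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(term)] < term:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes that begin with term sit next to each other from here on.
    hits = []
    while lo < len(suffix_array) and text.startswith(term, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def match_terms(abstract: str, terms: list[str]) -> dict[str, list[int]]:
    """Map each term to the offsets where it appears (case-insensitive)."""
    text = abstract.lower()
    suffix_array = build_suffix_array(text)
    return {term: find_occurrences(text, suffix_array, term.lower()) for term in terms}


# Hypothetical abstract and vocabulary, for illustration only.
abstract = "The ABI3 gene regulates seed dormancy; abi3 mutants germinate early."
hits = match_terms(abstract, ["ABI3", "seed dormancy", "root hair"])
for term, offsets in hits.items():
    print(term, offsets)
```

The index is built once per text and then reused for every term, which is why this kind of structure suits matching many gene names and vocabulary terms against the same article.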
TAIR Installation Statistics (9/12/03)
• 20,272 literature references
• 14,920 research papers with abstracts
• 8,642 full-text papers (58%)
• 16,956 controlled vocabulary terms
• 105,671 hits between terms and articles (2,359 terms)
• 38,010 gene names
• 29,841 hits between genes and articles (4,268 genes)
• 14,943 hits validated
  • (70% valid, 29% not valid, 0.5% maybe)
• 11,497 manual annotations to 5,981 genes from 2,113 articles
• 38 relationship types for gene2term and gene2gene
• 103 evidence types
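A few ratios can be derived from these counts. The short calculation below is only a reading aid: it assumes the 58% figure is full-text papers divided by research papers with abstracts (the slide does not state the denominator), and the variable names are made up for this sketch.

```python
# Counts copied from the slide above.
research_papers_with_abstracts = 14_920
full_text_papers = 8_642
gene_article_hits = 29_841
validated_hits = 14_943
manual_annotations = 11_497
annotated_genes = 5_981


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Assumed denominator: research papers with abstracts.
full_text_share = percent(full_text_papers, research_papers_with_abstracts)
# Share of gene-article hits that had been reviewed by a curator.
validated_share = percent(validated_hits, gene_article_hits)
# Average number of manual annotations per annotated gene.
annotations_per_gene = manual_annotations / annotated_genes

print(f"Full-text share of research papers: {full_text_share:.1f}%")
print(f"Gene-article hits validated: {validated_share:.1f}%")
print(f"Manual annotations per annotated gene: {annotations_per_gene:.2f}")
```

Under that assumption the first ratio rounds to the 58% on the slide, and about half of the gene-article hits had been validated at the time.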
Current Issues in Community Databases
• How to maximize connection with public repositories and journals?
• How to ensure information is up-to-date?
• How to cross-reference all the information in independent sites?
• What happens after the funding?
Overlap and Interconnection Between Existing Media
[Diagram: Journals, Public Repositories, Community Databases]
Making Connections with Public Repositories
1. Utilizing existing standards
  A. LinkOut
    • Data capture includes the GenBank accession (e.g. a seed stock containing an insertion, and the insertion-site sequence with its GenBank accession)
    • Data are downloaded from GenBank by accession using E-utilities
    • Data curation/analysis generates additional associations (e.g. the insertion site is used to identify the associated gene and a polymorphism for that gene)
    • Sequence-associated information is sent back to GenBank using the LinkOut XML format
  B. MIAME standards for microarrays
    • Researchers submit microarray data in prefilled Excel sheets
    • Excel sheets are converted into XML and loaded into the TAIR database
    • Data curation/analysis generates additional associations (e.g. usage of controlled vocabularies)
    • Data are exported into XML and sent to ArrayExpress
2. Collaborating to make new standards
  • Plant microarray submission standards with ArrayExpress
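The "download by accession" step in the LinkOut workflow can be sketched as follows. This is a minimal illustration, not TAIR's pipeline: it builds a request URL for the NCBI E-utilities `efetch` endpoint and defines, but does not call, a fetch helper. The accession `AB123456` is a placeholder, the function names are hypothetical, and the choice of the `nuccore` database with GenBank flat-file output is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, db: str = "nuccore",
               rettype: str = "gb", retmode: str = "text") -> str:
    """Build an efetch URL that requests one sequence record by accession."""
    query = urlencode({"db": db, "id": accession,
                       "rettype": rettype, "retmode": retmode})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def fetch_record(accession: str) -> str:
    """Download the record as text (network access required; not run here)."""
    with urlopen(efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


# Placeholder accession, used only to show the shape of the request.
url = efetch_url("AB123456")
print(url)
```

Keeping URL construction separate from the network call makes the request easy to inspect and test before any records are downloaded in bulk.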
Making Connections with Journals
1. Publication requirement to adhere to existing standards
  A. Stock accessions
  B. Gene symbol registry (currently under discussion)
2. Data sharing
  A. Image data for gene expression
  B. Supplementary data (e.g. microarray results)
3. Resource sharing
  A. Publication through community databases?
Keeping Information Up-To-Date
• In-house curation
  - pro: experience and standard operating procedures can ensure consistency
  - con: becoming difficult to keep up as the amount and complexity of information increase
• Community involvement
  - pro: expertise and sheer size of the community
  - con: has not worked successfully (no incentive in the current academic reward structure; not considered a typical role of a scientist)
• Others?
People Involved
• TAIR-Carnegie: Tanya Berardini, Marga Garcia-Hernandez, Eva Huala, Suparna Mundodi, Leonore Reiser, Julie Tacklind, Iris Xu, Danny Yoo, Peifen Zhang, Nick Moseyko, Brandon Zoekler, Jessie Zhang
• TAIR-NCGR: Dan Weems, Neil Miller, Mary Montoya
• ABRC: Randy Scholl, Debbie Crist, Emma Knee, Luz Rivero
Information Dissemination Media in Biology
1. Scientific Journals
  • Traditional medium of knowledge dissemination
  • Long history of publishing
  • Recently have moved to electronic publishing
2. Public Repositories
  • Permanent operations for electronic storage and dissemination of basic data
  • Shorter history than journals, about 20 years
  • A good example is NCBI's GenBank
3. Community Databases
  • Information resources that are created, maintained, and improved by the research community
  • Funded by governments; not permanent
  • A few large databases share a similar history with public repositories
  • Recently there has been a radiation of community databases
What is the infrastructure?
[Diagram: Web browser applications; FTP directory; DVD archive; Application Program Interface; Analysis cluster; Data object layer; TAIR DB; Software Development, Curation, Testing, and Staging Environments]
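The role of a data object layer between applications and the database can be sketched as below. Nothing here reflects TAIR's actual schema or code: the `gene` table, the `Gene` and `GeneStore` names, and the sample record are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    """Plain data object handed to applications instead of raw database rows."""
    name: str
    description: str


class GeneStore:
    """Hypothetical data object layer: the only code that issues SQL."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS gene ("
            "name TEXT PRIMARY KEY, description TEXT NOT NULL)"
        )

    def save(self, gene: Gene) -> None:
        """Insert the gene, or overwrite an existing record with the same name."""
        self._conn.execute(
            "INSERT OR REPLACE INTO gene (name, description) VALUES (?, ?)",
            (gene.name, gene.description),
        )
        self._conn.commit()

    def find(self, name: str) -> Optional[Gene]:
        """Return the gene with this name, or None if it is not stored."""
        row = self._conn.execute(
            "SELECT name, description FROM gene WHERE name = ?", (name,)
        ).fetchone()
        return Gene(*row) if row else None


# An in-memory database stands in for the real one; the record is a placeholder.
store = GeneStore(sqlite3.connect(":memory:"))
store.save(Gene("GENE0001", "placeholder description"))
found = store.find("GENE0001")
missing = store.find("GENE9999")
print(found, missing)
```

With this arrangement, web pages, analysis jobs, and an API can share one set of objects, and a schema change only touches the layer that talks to the database.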