EcoliWiki and GONUTS Wiki-based Systems for Community Annotation Jim Hu Dept. of Biochemistry and Biophysics Texas A&M University
Overview • EcoliWiki and the central problem in genome annotation • Gene Ontology and the Gene Ontology Normal Usage Tracking System (GONUTS) • Live demos/Discussion
Annotation • Goals for annotation: • Coverage • Accuracy • Usefulness • for scientists (human-readable) • for machine inference generation (computer-understandable) • Annotation is a moving target!
People are limiting for annotation • Major genome databases employ large numbers of people • This model is problematic • Curators are expensive • NIH and NSF cannot afford to staff every organism at this level • Broad expertise across all areas is hard • Curators have to read papers in areas they were not trained in • Curators may not recognize the significance of papers in areas they were not trained in • Can we make it: • cheaper? • faster? • better?
The Wikipedia approach • Get your user community to work for free! • aka "Community annotation" or "Community curation"
EcoliWiki http://ecoliwiki.org or .net or .com (most of our hits come from Google)
“What is true of Escherichia coli is true of the elephant” - Jacques Monod • “Thanks to annotation creep, what’s false for E. coli is false for the elephant too” - Jim Hu • http://www.pasteur.fr/infosci/archives/mon/im_ele.html
EcoliWiki philosophy • Any registered user can edit • Any registered user can register new users • Any registered user can create new pages • It's easier to revise than to create new content • Seed content from other sites, mostly EcoCyc
But won't that invite chaos? • GenBank's managers are dead set against letting users into GenBank's files, however. They say there already are procedures to deal with errors in the database, and researchers themselves have created secondary databases that improve on what GenBank has to offer. "That we would wholesale start changing people's records goes against our idea of an archive," says David Lipman, director of the National Center for Biotechnology Information (NCBI), GenBank's home in Bethesda, Maryland. "It would be chaos."
Correct compared to what? NCBI RefSeq: Wikipedia:
This is how biology achieves fidelity A collage of books I haven’t read
Participation is the major challenge • Anyone can edit ≠ Anyone will edit • Wikipedia: a tiny fraction of the users edit anything • A tiny fraction of those do major editing • Really big denominator • Outreach to increase our user base
Participation is the major challenge • Tools to make it easier to edit
Participation is the major challenge • Biggest difference from other systems: • Partial annotations are wanted • It doesn't matter if you don't know the wiki markup • It doesn't matter if what you're adding isn't fully worked out • Someone else can fix it • And you can fix what others write
Making it machine-friendly: ontologies • Ontology: • in philosophy: a metaphysical system for studying being • in biology/bioinformatics: a structured representation of biological knowledge • NCBO = National Center for Biomedical Ontology • OBO = Open Biological Ontologies • Examples • MeSH • Sequence Ontology = SO • Phenotype and Trait Ontology = PATO • Gene Ontology = GO • see the EBI ontology browser: http://www.ebi.ac.uk/ontology-lookup/
What is an ontology? • Controlled vocabulary with • Term identifiers • GO:0000075 • Name • cell cycle checkpoint • Definitions • "A point in the eukaryotic cell cycle where progress through the cycle can be halted until conditions are suitable for the cell to proceed to the next stage." [GOC:mah, ISBN:0815316194] • Relationships • is_a GO:0000074 ! regulation of progression through cell cycle • Terms arranged in a Directed Acyclic Graph (DAG)
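The term structure above can be sketched in a few lines of Python. This is a minimal illustration, not how GO or GONUTS stores terms: the `Term` class, the `ancestors` helper, and the placeholder parent `EX:0000001` are assumptions made for the example. Only GO:0000075 and its is_a parent GO:0000074 come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Term:
    """One ontology term: identifier, name, definition, and is_a parents."""
    term_id: str
    name: str
    definition: str = ""
    is_a: List[str] = field(default_factory=list)  # IDs of parent terms


def ancestors(term_id: str, terms: Dict[str, Term]) -> Set[str]:
    """Collect every term reachable by following is_a edges upward."""
    seen: Set[str] = set()
    stack = list(terms[term_id].is_a)
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue  # a DAG can reach the same ancestor by several paths
        seen.add(parent)
        if parent in terms:
            stack.extend(terms[parent].is_a)
    return seen


TERMS = {
    "GO:0000075": Term(
        "GO:0000075",
        "cell cycle checkpoint",
        "A point in the eukaryotic cell cycle where progress through the "
        "cycle can be halted until conditions are suitable for the cell "
        "to proceed to the next stage.",
        is_a=["GO:0000074"],
    ),
    "GO:0000074": Term(
        "GO:0000074",
        "regulation of progression through cell cycle",
        is_a=["EX:0000001"],  # hypothetical parent, for illustration only
    ),
    "EX:0000001": Term("EX:0000001", "placeholder root term"),
}

print(sorted(ancestors("GO:0000075", TERMS)))
```

Because a term may list more than one parent, the traversal keeps a `seen` set so that shared ancestors are visited only once.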
Pros and Cons of Ontologies • Pros • facilitate comparison across systems • facilitate computer based reasoning systems • Good for data mining! • Cons • Large and unwieldy • Difficult to understand • Difficult to use • May never capture knowledge accurately • Ontology development lags behind the field it tries to capture • Example of a theme of genomics: imperfect tools can still be very powerful!
GO = Gene Ontology • 3 ontologies for gene products • Biological Process • Molecular Function • Cellular Component • Used to make annotations • aka Gene associations • Term + qualifiers + evidence code + reference etc. • relationship types shown in figure: is_a, part_of • figure from GO consortium presentations from GOC
Cellular Component • where a gene product acts figure from GO consortium presentations from GOC
Cellular Component figure from GO consortium presentations from GOC
Molecular Function • activities or “jobs” of a gene product • example: glucose-6-phosphate isomerase activity • figure from GO consortium presentations from GOC
Molecular Function • examples: insulin binding, insulin receptor activity • figure from GO consortium presentations from GOC
Molecular Function • A gene product may have several functions • Sets of functions make up a biological process. figure from GO consortium presentations from GOC
Biological Process • a commonly recognized series of events • example: cell division • figure from GO consortium presentations from GOC
Biological Process • example: transcription • figure from GO consortium presentations from GOC
GO annotation • Find papers • Read them • Find what genes are mentioned • What assertions are made about the product? • What GO terms are applicable? • GO term browsers • AmiGO http://amigo.geneontology.org/cgi-bin/amigo/go.cgi • GONUTS http://gowiki.tamu.edu • New term needed? • What evidence code should be used to record the assertion? • Record gene associations in the MOD database • Send gene associations to GO consortium • Downloadable files that users doing electronic analysis can parse
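As a rough idea of what parsing the downloadable gene association files looks like, here is a sketch that reads the tab-delimited GO annotation file layout (comment lines start with "!", followed by 15 columns in the original version of the format). The column keys are my own labels, and the sample row is entirely made up: `ExampleDB`, `geneX`, and `PMID:0000000` are placeholders, not real records.

```python
import io

# Short labels for the 15 tab-separated columns; the names are this
# sketch's own choice, not official field identifiers.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf(handle):
    """Yield one dict per annotation row, skipping '!' comment lines."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # zip stops at the shorter sequence, so extra columns are ignored
        yield dict(zip(GAF_COLUMNS, fields))


# Made-up example row: term + qualifier + evidence code + reference.
SAMPLE = "!gaf-version: 1.0\n" + "\t".join([
    "ExampleDB", "X0001", "geneX", "", "GO:0000075", "PMID:0000000",
    "IDA", "", "P", "hypothetical protein X", "", "protein",
    "taxon:562", "20080101", "ExampleDB",
]) + "\n"

for record in parse_gaf(io.StringIO(SAMPLE)):
    print(record["db_object_symbol"], record["go_id"], record["evidence_code"])
```

Each resulting dict is one gene association, so filtering by term, evidence code, or reference becomes an ordinary dictionary lookup.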
Human vs Electronic GO annotations • What is the basis for making a gene association? • Human • Experimental Evidence Codes • EXP: Inferred from Experiment • IDA: Inferred from Direct Assay • IPI: Inferred from Physical Interaction • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IEP: Inferred from Expression Pattern • Computational Analysis Evidence Codes • ISS: Inferred from Sequence or Structural Similarity • ISO: Inferred from Sequence Orthology • ISA: Inferred from Sequence Alignment • ISM: Inferred from Sequence Model • IGC: Inferred from Genomic Context • RCA: Inferred from Reviewed Computational Analysis • Author Statement Evidence Codes • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • Curator Statement Evidence Codes • IC: Inferred by Curator • ND: No biological Data available • Automatically-assigned Evidence Codes • IEA: Inferred from Electronic Annotation
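The grouping on this slide maps directly onto a lookup table. The sketch below uses exactly the codes listed above; the function names `evidence_group` and `is_human_assigned` are invented for the example, and treating everything except IEA as human-assigned follows the slide's Human vs Electronic split.

```python
# Evidence codes grouped as on the slide.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author_statement": {"TAS", "NAS"},
    "curator_statement": {"IC", "ND"},
    "automatic": {"IEA"},
}


def evidence_group(code: str) -> str:
    """Return the group name for an evidence code, or raise ValueError."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    raise ValueError(f"unknown evidence code: {code!r}")


def is_human_assigned(code: str) -> bool:
    """True for every code a person assigned, i.e. anything but IEA."""
    return evidence_group(code) != "automatic"


for code in ("IMP", "ISS", "TAS", "IEA"):
    print(code, evidence_group(code), is_human_assigned(code))
```

A user doing electronic analysis could apply `is_human_assigned` to drop IEA rows before counting annotations.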
GONUTS (http://gowiki.tamu.edu) • Started as a wiki-based usage guide • Each ontology term is a MediaWiki (MW) Category • MW supports DAGs as Categories! • Each term page has a notes area for user notes on usage • Term pages list examples of genes that were annotated to this term
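To show why MediaWiki categories can carry a DAG: a category page becomes a subcategory of every category it links to with `[[Category:...]]`, and a page may carry several such links. The sketch below builds the wikitext for one term page. The page-title scheme and the `term_page` helper are hypothetical and are not taken from the GONUTS code.

```python
from typing import List, Tuple


def term_page(name: str, parents: List[str], notes: str = "") -> Tuple[str, str]:
    """Return (page title, wikitext body) for a term stored as a category.

    Each [[Category:...]] link makes this page a subcategory of that
    parent, so a term with several parents needs several links.
    """
    title = f"Category:{name}"  # hypothetical naming scheme
    lines = [notes] if notes else []
    lines.extend(f"[[Category:{parent}]]" for parent in parents)
    return title, "\n".join(lines)


title, body = term_page(
    "cell cycle checkpoint",
    ["regulation of progression through cell cycle"],
    notes="== Notes ==\nUser notes on usage go here.",
)
print(title)
print(body)
```

MediaWiki itself does not check that the category graph is acyclic, so in a setup like this the acyclic property would have to come from the ontology being loaded.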
MOD gene pages • Gene pages from established Model Organism Databases provide examples of best practices
User-created gene pages • Annotation pages based on UniProt IDs
Supporting Annotation Jamborees in Cyberspace • RefGenome subgroup of GO Consortium • collaboration on annotation consistency • Electronic Jamborees via teleconference • Uses GONUTS to collect and compare
Thanks to • EcoliWiki/GONUTS Team: Nathan Liles, Brenley McIntosh, Debby Siegele, Daniel Renfro, Anand Venkatraman, Adrienne Zweifel • GO consortium • EcoliHub Team Leaders: Barry Wanner (PI, Purdue), Walid Aref (co-PI, Purdue), Tyrell Conway (co-PI, Oklahoma), Mike Gribskov (co-PI, Purdue), Peter Karp (co-PI, SRI), Daisuke Kihara (co-PI, Purdue) • Funding: NIH U24-GM077905 • URLs: http://ecolihub.org http://ecoliwiki.org http://gowiki.tamu.edu