A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes. Andrew Su, Ph.D. @andrewsu asu@scripps.edu http://sulab.org. January 16, 2014, GMOD 2014. Why am I giving this keynote?
Harnessing the crowd… http://www.flickr.com/photos/portland_mike/6140660504/
… to organize information http://www.flickr.com/photos/45697441@N00/6629580443
GMOD is widely used 199 (!) organizations listed as GMOD users
Does the current model scale? [Plot: number of sequenced genomes by year]
The Long Tail of genomic data is being lost Identified 517 operons and 103 small regulatory RNAs...
GMOD as a Service (GaaS) http://www.flickr.com/photos/aigle_dore/5626312363/
Few genes are well annotated…
  [Plot: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; the plot marks 65% and 41%]
  Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF
  Data: NCBI, February 2013
… because the literature is sparsely curated? Number of articles read by typical scientist
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
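The Long Tail is a prolific source of content
  [Plot: content produced vs. contributors (sorted), with a Short Head and a Long Tail]
  News: Newspapers (Short Head), Blogs (Long Tail)
  Video: TV/Hollywood (Short Head), YouTube (Long Tail)
  Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
  Food reviews: Food critics (Short Head), Yelp (Long Tail)
  Talent judging: Olympics (Short Head), American Idol (Long Tail)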
Wikipedia has breadth and depth [Charts: articles and words (millions), Wikipedia vs. Britannica Online] http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
We can harness the Long Tail of scientists to directly participate in the gene annotation process.
Filtering, extracting, and summarizing PubMed Documents Concepts Review article
Wiki success depends on a positive feedback loop [Diagram: Gene Wiki page utility, number of contributors, number of users]
10,000 gene “stubs” within Wikipedia
  Stub contents: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases
  Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
  Total: 4.0 million views / month
  Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
  [Plots: editor count and edit count]
  Increase of ~10,000 words / month from >1,000 edits
  Currently 1.42 million words
  Approximately equal to 230 full-length articles
  Good, NAR, 2011
A review article for every gene is powerful
  Reelin: 98 editors, 703 edits since July 2002
  Heparin: 358 editors, 654 edits since June 2003
  AMPK: 109 editors, 203 edits since March 2004
  RNAi: 394 editors, 994 edits since October 2002
  Hyperlinks to related concepts
  References to the literature
Making the Gene Wiki more computable: free text and structured annotations
Filling the gaps in gene annotation
  Gene Wiki mapping: Gene Wiki article to NCBI Entrez Gene record (example: Entrez Gene 334)
  Wikilink with a GO exact match (example: GO:0006897) gives a candidate assertion
  6319 novel GO annotations
  2147 novel DO annotations
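The wikilink-to-annotation step on this slide can be sketched roughly as follows. This is a minimal illustration and not the pipeline from the talk: the `GO_NAMES` lookup, the regular expression, the case-insensitive matching, and the sample wikitext are all assumptions introduced here, and the label "endocytosis" for GO:0006897 is filled in from the Gene Ontology, not from the slide.

```python
import re

# Hypothetical excerpt of GO term names keyed by lowercase label.
# Only the ID GO:0006897 appears on the slide; the label is supplied here.
GO_NAMES = {
    "endocytosis": "GO:0006897",
}

# Matches [[Target]], [[Target|display text]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(entrez_gene_id, wikitext, go_names=GO_NAMES):
    """Return (gene_id, go_id, link_target) for every wikilink whose
    target exactly matches a GO term name (ignoring case)."""
    seen = set()
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        go_id = go_names.get(target.lower())
        if go_id and go_id not in seen:
            seen.add(go_id)
            found.append((entrez_gene_id, go_id, target))
    return found


# Made-up wikitext for illustration, not a quotation from a real article.
sample = "The protein may play a role in [[endocytosis]] and binds [[DNA|nucleic acid]]."
print(candidate_assertions(334, sample))
```

Each returned tuple is only a candidate: the slide's figure of 6319 novel GO annotations implies a comparison against existing annotations, which this sketch does not attempt.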
Gene Wiki content improves enrichment analysis
  GO term: axon guidance (GO:0007411)
  Concept recognition in PubMed abstracts: 811 articles
  Genes linked to GO:0007411 through PubMed
  Gene list: 264 genes
  Enrichment analysis: P = 1.55E-20
Gene Wiki content improves enrichment analysis
  GO term: muscle contraction (GO:0006936)
  Concept recognition in PubMed abstracts: 251 articles
  + Gene Wiki: 87 articles
  Gene list: 87 genes
  Enrichment analysis, genes linked through PubMed: P = 1.0
  Enrichment analysis, genes linked through PubMed + Gene Wiki: P = 1.22E-09
Gene Wiki content improves enrichment analysis [Scatter plot: p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled "More significant PubMed only" and "More significant PubMed + GW"; muscle contraction highlighted]
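The slides do not say which statistical test produced these p-values. As a rough sketch of how an enrichment p-value of this kind is commonly computed, the following uses a one-sided hypergeometric test. The choice of test, the function name, and all numbers below are assumptions for illustration; none of them reproduce the slide's data.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value: the probability of seeing at least
    `overlap` annotated genes when `sample` genes are drawn without
    replacement from `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers only: 20,000 background genes, 100 linked to the term,
# and a 50-gene list. More overlap gives a smaller p-value.
print(hypergeom_enrichment_p(20000, 100, 50, 0))
print(hypergeom_enrichment_p(20000, 100, 50, 5))
```

Under this kind of test, a gene list with no overlap with the term-linked set gives P = 1.0, and the p-value drops as more genes in the list become linked to the term, which is the direction of change the muscle contraction example shows.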
The Long Tail of scientists is a valuable source of information on gene function
Can we skip text mining? http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
Wikidata Provide a database of the world’s knowledge that anyone can edit - Denny Vrandečić
Wikidata understands scale 14 million Wikidata items… …13 million total genes in Entrez Gene
Wikidata understands scale 27 million Wikidata statements… …150k total GO annotations
Wikidata for biology
  Reelin (Q414043)
    is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)
    regulates (Property:P128): Neural development (Q1345738)
    Interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
  http://www.wikidata.org/wiki/Q414043
Wikidata for biology Q414043 Q8054 Property:P31 Q187126 Q1345738 Property:P128 Q1979313 Property:P129 Q423510 http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
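The API call on this slide can be wrapped in a few lines of Python. The sketch below builds the same `wbgetentities` URL (adding `format=json`, which the slide's URL omits) and pulls item-valued statements out of a response. The helper names, the User-Agent string, and the hand-written sample response are assumptions for illustration; `fetch` is defined but not called here because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "http://wikidata.org/w/api.php"  # endpoint as shown on the slide


def entity_url(qid, language="en"):
    """Build a wbgetentities URL for a single item."""
    params = {"action": "wbgetentities", "ids": qid,
              "languages": language, "format": "json"}
    return API + "?" + urlencode(params)


def item_claims(response, qid):
    """Map property ID -> list of item IDs for item-valued statements."""
    claims = response["entities"][qid].get("claims", {})
    result = {}
    for prop, statements in claims.items():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, dict) and "numeric-id" in value:
                result.setdefault(prop, []).append("Q%d" % value["numeric-id"])
    return result


def fetch(qid):
    """Fetch the entity over the network (not exercised in this sketch)."""
    request = Request(entity_url(qid), headers={"User-Agent": "cmod-example/0.1"})
    with urlopen(request) as resp:
        return json.load(resp)


# Hand-written, heavily trimmed response in the wbgetentities JSON layout.
sample_response = {"entities": {"Q414043": {
    "labels": {"en": {"language": "en", "value": "Reelin"}},
    "claims": {"P31": [{"mainsnak": {
        "snaktype": "value", "property": "P31",
        "datavalue": {"type": "wikibase-entityid",
                      "value": {"entity-type": "item", "numeric-id": 8054}}}}]},
}}}

print(entity_url("Q414043"))
print(item_claims(sample_response, "Q414043"))
```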
Increasing biological data in Wikidata http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
Loading genomic data into Wikidata
  Sources: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq
Wikidata gene model: added ~1000 human genes so far…
Wikidata as CMOD?