
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes

Andrew Su, Ph.D. @andrewsu, asu@scripps.edu, http://sulab.org. January 16, 2014, GMOD 2014.


Presentation Transcript


  1. A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes. Andrew Su, Ph.D. @andrewsu asu@scripps.edu http://sulab.org. January 16, 2014, GMOD 2014

  2. Why am I giving this keynote?

  3. Harnessing the crowd… http://www.flickr.com/photos/portland_mike/6140660504/

  4. … to organize information http://www.flickr.com/photos/45697441@N00/6629580443

  5. My simplified history of MODs

  6. My simplified history of MODs

  7. GMOD is widely used 199 (!) organizations listed as GMOD users

  8. Does the current model scale?

  9. Does the current model scale?

  10. Does the current model scale? [Plot: number of sequenced genomes by year]

  11. Does the current model scale?

  12. The Long Tail of genomic data is being lost Identified 517 operons and 103 small regulatory RNAs...

  13. The Long Tail of genomic data is being lost Identified 517 operons and 103 small regulatory RNAs...

  14. At least you can download structured data…

  15. Centralized Model Organism Database concept CMOD

  16. GMOD as a Service (GaaS) http://www.flickr.com/photos/aigle_dore/5626312363/

  17. http://www.flickr.com/photos/shannonmary/187131727/

  18. Few genes are well annotated… [Plot: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; callouts: 65%, 41%] Data: NCBI, February 2013

  19. … because the literature is sparsely curated?

  20. … because the literature is sparsely curated? Number of articles read by typical scientist

  21. 311,696 articles (1.5% of PubMed) have been cited by GO annotations

  22. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

  23. The Long Tail is a prolific source of content [Plot: content produced vs. contributors (sorted), from Short Head to Long Tail] Short Head vs. Long Tail examples: News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol

  24. Wikipedia is reasonably accurate

  25. Wikipedia has breadth and depth [Charts: articles and words (millions), Wikipedia vs. Britannica Online] http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

  26. We can harness the Long Tail of scientists to directly participate in the gene annotation process.

  27. Filtering, extracting, and summarizing PubMed [Diagram labels: Documents; Concepts; Review article]

  28. Filtering, extracting, and summarizing PubMed [Diagram labels: Documents; Concepts]

  29. Wiki success depends on a positive feedback loop [Diagram labels: Gene Wiki page utility; Number of contributors (1, 2); Number of users (100, 200)]

  30. 10,000 gene “stubs” within Wikipedia [Diagram labels: Utility; Users; Contributors] Stub contents: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. Huss, PLoS Biol, 2008

  31. Gene Wiki has a critical mass of readers [Diagram labels: Utility; Users; Contributors] Total: 4.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011

  32. Gene Wiki has a critical mass of editors [Diagram labels: Utility; Users; Contributors] [Plot: editor count and edit count] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. Good, NAR, 2011

  33. A review article for every gene is powerful. Reelin: 98 editors, 703 edits since July 2002; Heparin: 358 editors, 654 edits since June 2003; AMPK: 109 editors, 203 edits since March 2004; RNAi: 394 editors, 994 edits since October 2002. Hyperlinks to related concepts. References to the literature.

  34. Making the Gene Wiki more computable [Diagram labels: Free text; Structured annotations]

  35. Filling the gaps in gene annotation [Diagram labels: Gene Wiki mapping to NCBI Entrez Gene: 334; Wikilink; GO exact match: GO:0006897; Candidate assertion] 6319 novel GO annotations; 2147 novel DO annotations
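The slide above only names the pieces (Gene Wiki mapping, wikilink, GO exact match, candidate assertion), so the following is a minimal sketch of how such a matching step might look, not the method used in the talk. It assumes that a wikilink counts as a match when its normalized title equals a GO term name exactly. The function names and the toy inputs are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def normalize(title: str) -> str:
    """Lower-case a page title and collapse underscores and whitespace."""
    return " ".join(title.replace("_", " ").lower().split())


def candidate_assertions(
    gene_id: int,
    wikilinks: Iterable[str],
    go_names: Dict[str, str],
) -> List[Tuple[int, str]]:
    """Return (gene_id, GO id) pairs for wikilinks that exactly match a GO term name.

    gene_id   -- Entrez Gene ID that the Gene Wiki page is mapped to
    wikilinks -- titles of pages linked from the gene's wiki article
    go_names  -- GO id -> term name (a toy lookup here, an assumption)
    """
    index = {normalize(name): go_id for go_id, name in go_names.items()}
    seen = set()
    assertions = []
    for link in wikilinks:
        go_id = index.get(normalize(link))
        if go_id is not None and go_id not in seen:
            seen.add(go_id)  # report each GO term once per gene
            assertions.append((gene_id, go_id))
    return assertions


# Toy example: the gene and GO identifiers come from the slide;
# the list of wikilinks is invented for illustration.
go_lookup = {"GO:0006897": "endocytosis"}
links = ["Endocytosis", "Protein", "endocytosis"]
print(candidate_assertions(334, links, go_lookup))
```

Candidate assertions produced this way would still need review before being treated as annotations; exact string matching trades recall for precision.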

  36. Gene Wiki content improves enrichment analysis [Diagram: GO term axon guidance (GO:0007411); concept recognition in PubMed abstracts: 811 articles; linked genes through PubMed: 264 genes; gene list; enrichment analysis for GO:0007411: P = 1.55 E-20]

  37. Gene Wiki content improves enrichment analysis [Diagram: GO term muscle contraction (GO:0006936); concept recognition in PubMed abstracts: 251 articles, 87 genes; + Gene Wiki: 87 articles; gene list; enrichment analysis for GO:0006936: P = 1.0 with linked genes through PubMed, P = 1.22 E-09 with linked genes through PubMed + Gene Wiki]

  38. Gene Wiki content improves enrichment analysis [Scatter plot: p-value (PubMed only) vs. p-value (PubMed + GW); regions: more significant with PubMed only, more significant with PubMed + GW; highlighted point: muscle contraction]
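The slides report enrichment p-values but do not say which test produced them. As an illustration only, here is a sketch of one common choice, the one-sided hypergeometric test, which asks how likely it is to see at least the observed overlap between a gene list and the genes linked to a GO term. The function name and all numbers in the usage example are hypothetical, not taken from the talk.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    population -- total number of genes considered
    annotated  -- genes in the population linked to the GO term
    sample     -- size of the gene list being tested
    overlap    -- genes in the list that are linked to the GO term
    """
    max_overlap = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 linked to the term, a list of 3 genes, 2 of them linked.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```

Under this reading, adding Gene Wiki links increases the number of genes linked to a term, which can move a term such as muscle contraction from no signal to a small p-value.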

  39. The Long Tail of scientists is a valuable source of information on gene function

  40. Can we skip text mining? http://fiehnlab.ucdavis.edu/projects/rice_metabolome/

  41. Wikidata: “Provide a database of the world’s knowledge that anyone can edit” - Denny Vrandečić

  42. Wikidata understands scale

  43. Wikidata understands scale 14 million Wikidata items… …13 million total genes in Entrez Gene

  44. Wikidata understands scale 27 million Wikidata statements… …150k total GO annotations

  45. Wikidata for biology. Reelin (Q414043): is a (Property:P31) Protein (Q8054), Glycoprotein (Q187126); regulates (Property:P128) Neural development (Q1345738); interacts with (Property:P129) VLDL receptor (Q1979313), Amyloid precursor protein (Q423510). http://www.wikidata.org/wiki/Q414043

  46. Wikidata for biology. Q414043: Property:P31 Q8054, Q187126; Property:P128 Q1345738; Property:P129 Q1979313, Q423510. http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
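The two slides above show the same statements twice, once with labels and once as bare identifiers retrieved through the `wbgetentities` API call. The sketch below shows one way to build that URL and flatten a response into (item, property, item) triples. The `format=json` parameter is an addition not present in the slide URL, the helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written fragment shaped like the API's JSON, not real output.

```python
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"  # endpoint as shown on the slide


def entity_url(qid: str, language: str = "en") -> str:
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "format": "json",  # assumption: request JSON explicitly
    }
    return API + "?" + urlencode(params)


def item_statements(response: dict, qid: str) -> list:
    """Extract (subject, property, object) triples whose object is another item."""
    claims = response["entities"][qid].get("claims", {})
    triples = []
    for prop, statements in claims.items():
        for statement in statements:
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "novalue" / "somevalue" snaks
            datavalue = snak["datavalue"]
            if datavalue.get("type") != "wikibase-entityid":
                continue  # skip strings, quantities, dates, ...
            value = datavalue["value"]
            target = value.get("id") or f"Q{value['numeric-id']}"
            triples.append((qid, prop, target))
    return triples


def _item_snak(prop: str, numeric_id: int) -> dict:
    """Helper to hand-build an illustrative statement (not real API output)."""
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item",
                                                 "numeric-id": numeric_id}}}}


SAMPLE_RESPONSE = {"entities": {"Q414043": {"claims": {
    "P31": [_item_snak("P31", 8054)],
    "P129": [_item_snak("P129", 1979313)],
}}}}

print(entity_url("Q414043"))
print(item_statements(SAMPLE_RESPONSE, "Q414043"))
```

To run this against the live service, the URL from `entity_url` could be fetched with any HTTP client and the decoded JSON passed to `item_statements`.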

  47. Increasing biological data in Wikidata http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force

  48. Loading genomic data into Wikidata Entrez Gene Ensembl UniProt UCSC PDB RefSeq

  49. Wikidata gene model. Added ~1000 human genes so far…

  50. Wikidata as CMOD? CMOD
