
Publishing for the 21st Century: Experiences from the NEUROSCIENCE INFORMATION FRAMEWORK

Maryann E. Martone, Ph.D., Neuroscience Information Framework, University of California, San Diego



Presentation Transcript


  1. Maryann E. Martone, Ph.D., Neuroscience Information Framework, University of California, San Diego. Publishing for the 21st Century: Experiences from the NEUROSCIENCE INFORMATION FRAMEWORK

  2. Themes • Computers are now partners with humans in reading the literature • Search • Summarization • Linking • Discovery • The scientific paper starts with the materials and methods • All observations, claims etc flow from experimental design and materials • If authors do not provide this information in the first place, then we can’t use it to improve all of the above • Scientists produce articles for each other, not for computers • Not everything you need to interpret the paper is in the paper • More information may be there than is in the text

  3. NIF is an initiative of the NIH Blueprint consortium of institutes • What types of resources (data, tools, materials, services) are available to the neuroscience community? • How many are there? • What domains do they cover? What domains do they not cover? • Where are they? • Web sites • Databases • Literature • Supplementary material • PDF files • Desk drawers • Who uses them? • Who creates them? • How can we find them? • How can we make them better in the future? • NIF provides a wealth of practical information on data and resource issues in neuroscience: http://neuinfo.org

  4. The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience (UCSD, Yale, Cal Tech, George Mason, Washington Univ) • A portal for finding and using neuroscience resources • A consistent framework for describing resources • Provides simultaneous search of multiple types of information, organized by category • Supported by an expansive ontology for neuroscience • Utilizes advanced technologies to search the “hidden web” • Figure labels: Literature, 22 million; Data Federation, 350 million; Resource Registry, 5,000 • http://neuinfo.org • Supported by NIH Blueprint

  5. In an ideal information system, we would be able to find… • What is known • “What studies used my monoclonal mouse antibody against actin in humans?” • “What phenotypes are associated with each mouse model of Spinal Muscular Atrophy” • “What upregulates SMN1?” • What is not known • Connect information to infer plausible hypotheses • Genotype-phenotype • Possible drug targets • Information gaps

  6. Whither biological information? • ∞ • What is potentially knowable • What is known: Literature, images, human knowledge • What is easily machine processable and accessible

  7. CA2: Ion, Brain Part or Gene? • NIF queries more than 170 independent databases, including BioGrid, Allen Brain Atlas, Brain Info

  8. Papers are the currency of science • Despite the wealth of data out there (> 2500 databases on-line), the majority of data is still published in papers • But...we write for other humans to consume and information continues to be hard to find • Even for humans, however, it is difficult to find and verify basic information about a paper critical for interpretation • What is the subject of the study • What reagents were used • What genes were studied • A lot of information is missing from papers • Not all data is available • Data is published in papers in forms that are difficult to use

  9. Mining the literature for resources • Resources: Materials, services, tools, data • Project 1: Find materials: antibodies and transgenic animals • Project 2: Mine supplemental data in papers showing gene expression changes in drug abuse • Purpose • Find new resources • Track usage of existing resources • Link resources to other useful information

  10. Linking resources: Link out broker

  11. Use case: antibodies • Pilot project to use text mining to identify antibodies used in studies: Wanted to pick a project that would be immediately understandable by research scientists • Antibodies are used routinely to identify proteins and other molecules in basic and translational studies • Antibodies are a large source of experimental variability in results • Same antibody can give you very different results • Different antibodies to the same protein can give you very different results • Neuroscientists spend a lot of time tracking down antibodies and troubleshooting experiments that use antibodies

  12. Our reagents and methods are not perfect “We note that many of the findings in the literature about neuronal NF-κB are based on data garnered with antibodies that are not selective for the NF-κB subunit proteins p65 and p50. The data urge caution in interpreting studies of neuronal NF-κB activity in the brain.” --Herkenham et al., J Neuroinflammation. 2011; 8: 141.

  13. Antibodies are complex entities • Anti-ChAT antibody • Raised against a portion of choline acetyltransferase • Raised in a particular species • Is polyclonal or monoclonal • Is affinity purified or not • Recognizes the target in some species, e.g., human • Reported in materials and methods: “Tissue sections were blocked with 5% serum and incubated overnight at 4 °C with the following primary antibodies: anti-ChAT (1:100; Millipore, Billerica, MA), anti-Bax (1:50; Santa Cruz), anti-Bcl-xl (1:50; Cell Signaling), anti-neurofilament 200 kDa (1:200; Millipore) ...”
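A minimal sketch of how reagent mentions like those quoted above might be pulled out of a materials-and-methods sentence with a regular expression. The pattern, the function name `extract_antibody_mentions`, and the assumed "anti-TARGET (dilution; vendor ...)" layout are illustrative assumptions, not the NIF pipeline; real methods text varies far more than this.

```python
import re

# Assumed layout: "anti-TARGET (1:N; Vendor[, city, state])".
# This is a hypothetical pattern for illustration only.
MENTION = re.compile(
    r"anti-\s*(?P<target>[^()]+?)\s*"      # target name after "anti-"
    r"\((?P<dilution>1:[\d,]+);\s*"        # dilution such as 1:100
    r"(?P<vendor>[^,)]+)"                  # vendor up to the next comma or ")"
)


def extract_antibody_mentions(text):
    """Return a list of {target, dilution, vendor} dicts found in text."""
    return [
        {
            "target": m.group("target").strip(),
            "dilution": m.group("dilution"),
            "vendor": m.group("vendor").strip(),
        }
        for m in MENTION.finditer(text)
    ]


methods = (
    "Tissue sections were blocked with 5% serum and incubated overnight at "
    "4 °C with the following primary antibodies: anti-ChAT (1:100; Millipore, "
    "Billerica, MA), anti-Bax (1:50; Santa Cruz), anti-Bcl-xl (1:50; Cell "
    "Signaling), anti-neurofilament 200 kDa (1:200; Millipore)"
)

for mention in extract_antibody_mentions(methods):
    print(mention)
```

Note what the sketch cannot recover: host species, clonality, and catalog number are simply absent from the sentence, which is the gap the slides go on to describe.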

  14. “Find studies that used a rabbit polyclonal antibody against GFAP that recognizes human in immunocytochemistry” • NIF Antibody Registry: database of > 900,000 antibodies (AB_310775) • Paz et al., J Neurosci, 2010

  15. Searching for resources in literature • NIF recently implemented a section-specific search • Semi-automated resource identification pipeline • Paul Sternberg, Yuling Li, Cal Tech

  16. Annotation of antibodies • Allows annotation of entities and key relationships: • Protocol • Subject of protocol • Links antibodies to a database of antibodies that contains their properties • NIF Antibody Registry • 900,000 antibodies • Unique ID • DOMEO annotation tool: Paolo Ciccarese; Tim Clark, MGH • http://annotationframework.org/ • http://antibodyregistry.org

  17. What studies used my monoclonal mouse antibody against actin in humans? Subject is Human • Midfrontal cortex tissue samples from neurologically unimpaired subjects (n = 9) and from subjects with AD (n = 11) were obtained from the Rapid Autopsy Program • Immunoblot analysis and antibodies • The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich); -tubulin mAb (1:10,000, Abcam); T46 mAb (specific to tau 404–441, 1:1000, Invitrogen); Tau-5 mAb (human tau 218–225, 1:1000, BD Biosciences) (Porzig et al., 2007); AT8 mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500, Innogenetics); PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies); 12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert); NMDA receptors 2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa Cruz Biotechnology)… • mAb = monoclonal antibody

  18. Tracking down reagents Feng et al., MATH5 controls the acquisition of multiple retinal cell fates, Mol Brain. 2010; 3: 36

  19. Space limitationsContentgets separated in space and time • But...electrons are cheap • Cut and paste is cheap • Re-examining plagiarism in the age of cut and paste • Autocomplete is cheap • Acronyms and abbreviations • Are there any unique 3 letter strings • Formats are flexible • What the computer sees and what humans see don’t have to be the same thing Practices are designed to save space, improve readability and save authors typing

  20. Try this Watson! • 95 antibodies were identified in 8 articles • 52 did not contain enough information to determine the antibody used • Some provided details in another paper • And another paper, and another... • Failed to give species, clonality, vendor, or catalog number • But, many provided the location of the vendor because the instructions to authors said to do so
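The failure modes listed on this slide (missing species, clonality, vendor, or catalog number) can be expressed as a simple completeness check. The sketch below is hypothetical: the `AntibodyReport` record, its field names, and the sample values are assumptions made for illustration, not a NIF schema. The only figures taken from the slide are the counts 52 and 95.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AntibodyReport:
    """What a paper says about one antibody (hypothetical record layout)."""
    target: str
    host_species: Optional[str] = None
    clonality: Optional[str] = None
    vendor: Optional[str] = None
    catalog_number: Optional[str] = None


def missing_fields(report: AntibodyReport) -> List[str]:
    """List the fields a reader would need but the paper did not give."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


# A report with only target and vendor, as in many methods sections.
partial = AntibodyReport(target="ChAT", vendor="Millipore")
print(missing_fields(partial))  # ['host_species', 'clonality', 'catalog_number']

# Share of antibodies on the slide that could not be identified.
unidentified, total = 52, 95
print(f"{unidentified / total:.1%} could not be identified")  # 54.7%
```

A check like this could run at submission time, which is cheaper than reconstructing the missing details after publication.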

  21. Subject of study • Often not explicit: • “patients with AD” = human • Type III SMA mice (Smn−/−, SMN2+/−) were produced as previously described (Tsai et al., 2006a). • Official strain nomenclature of animals not designed for search • SMN2Ahmb89tg/tg;SMNΔ7tg/tg:Smn1−/−; no unique identifier assigned • Many lines of transgenics are generated and described within a single paper; it is difficult to relate individual findings to the correct animal line, yet the lines are not all equivalent • “Three lines of transgenic mice, M1, M2, and M3, were produced (Fig. 1B). Transgene expression was found in all tissues studied, with widespread high expression in line M1, high expression in brain of line M3, and relatively low expression in brain of line M2 (Fig. 1C).” (Ripps et al., PNAS USA, Vol. 92, pp. 689-693, January 1995)
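To make the "patients with AD = human" step concrete, here is a minimal rule-based sketch that maps wording in the text to an explicit species identifier. The cue phrases and the function name `infer_subject` are hypothetical and far too crude for real use; the NCBI Taxonomy IDs (9606 for Homo sapiens, 10090 for Mus musculus) are the only real identifiers in the example.

```python
import re

# Hypothetical cue phrases mapped to (species name, NCBI Taxonomy ID).
SUBJECT_CUES = [
    (re.compile(r"\bpatients?\b", re.IGNORECASE), ("Homo sapiens", 9606)),
    (re.compile(r"\b(mice|mouse)\b", re.IGNORECASE), ("Mus musculus", 10090)),
]


def infer_subject(text):
    """Return (species, taxon_id) for the first matching cue, else None."""
    for pattern, species in SUBJECT_CUES:
        if pattern.search(text):
            return species
    return None


print(infer_subject("patients with AD"))
print(infer_subject("Type III SMA mice were produced as previously described"))
print(infer_subject("Tissue sections were blocked with 5% serum"))
```

The sketch resolves species only. It says nothing about which strain or transgenic line was used, which is the harder problem the slide raises.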

  22. Which mouse did you use? • “Transgenic mice expressing SOD1G93A (12) were purchased from Jackson Laboratory” • 12 = Gurney ME; et al. 1994. Motor neuron degeneration in mice that express a human Cu,Zn superoxide dismutase mutation [see comments] [published erratum appears in Science 1995 Jul 14;269(5221):149] Science 264(5166):1772-5. • Search NIF/Jackson lab for “Gurney SOD” • 7 entries for same producer • 3 track to the same reference • Gogliotti et al., Biochem Biophys Res Commun. 2010 January 1; 391(1): 517. • “Here we report our findings for the SMA mouse model that has been deposited by the Li group from Taiwan. These mice, JAX stock number TJL-005058, are homozygous for the SMN2 transgene, Tg(SMN2)2Hung, and a targeted Smn allele that lacks exon 7, Smn1tm1Hung.”

  23. Minimal metadata standards (really) for publishing in the 21st century • Developed by the Link Animal Model to Human Disease Initiative (LAMHDI) consortium: • 1) Provide gene accession numbers for all genes referenced in the methods section of a paper, per http://www.ncbi.nlm.nih.gov/gene • 2) Identify (i.e., give ID) the species for the subject of a study, and from which each gene product is derived, using the NCBI taxonomy and the strains from the model organism databases for mice, rats, worms, zebrafish and Drosophila, employing any existing unique identifiers and correct species-specific nomenclature • 3) Provide catalog numbers and vendor information for all reagents and animals described in the methods section of a paper • Journal of Comparative Neurology: Requires complete characterization of antibody as stated in instructions to authors • 90% of antibodies had a catalog #; 20% had a lot number after these policies were instituted • NIF could automatically identify 80% of these antibodies through matching with NIF Antibody Registry
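Why catalog numbers matter for automation can be shown with a small lookup sketch: once vendor and catalog number are reported, matching a mention to a registry entry becomes a normalized key lookup. Everything here is hypothetical, including the vendor name, the catalog numbers, the placeholder IDs, and the function names; it does not reproduce how the NIF Antibody Registry performs matching.

```python
import re


def normalize(value):
    """Lowercase and drop punctuation/spacing so 'EV-1234' == 'ev 1234'."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def build_index(records):
    """Index registry records by (vendor, catalog number)."""
    return {
        (normalize(vendor), normalize(catalog)): registry_id
        for vendor, catalog, registry_id in records
    }


def match(vendor, catalog, index):
    """Return the registry ID for a reported reagent, or None if unmatched."""
    return index.get((normalize(vendor), normalize(catalog)))


# Made-up registry rows: (vendor, catalog number, placeholder registry ID).
registry = [
    ("Example Vendor Inc.", "EV-1234", "AB_0000001"),
    ("Example Vendor Inc.", "EV-5678", "AB_0000002"),
]
index = build_index(registry)

reported = [("example vendor inc", "ev1234"), ("Example Vendor Inc.", "ZZ-0000")]
hits = [match(v, c, index) for v, c in reported]
print(hits)
print(f"matched {sum(h is not None for h in hits)} of {len(hits)}")
```

Without the catalog number the key cannot be formed at all, so the mention falls back to manual curation.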

  24. Project 2: Extracting data from tables and supplementary material • Challenge: Extract data on gene expression in brain from studies relevant to drug abuse • Workflow: Find articles → Extract results from tables → Standardize results → Load into NIF • Drug-related gene database: 140 tables from 54 articles • Andrea Arnaud-Stagg, Anita Bandrowski

  25. Extracting additional knowledge from supplementary material • Gene for tyrosine hydroxylase has increased expression in locus coeruleus of mouse compared to control when given chronic morphine • Translations: Upregulated, p < 0.05 = increased expression; LC = locus coeruleus; Probe ID = gene name • J Neurosci. 2005 Jun 22;25(25):6005-15.
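The "translations" on this slide amount to a standardization step applied to each table row. A minimal sketch under stated assumptions: the row layout, the probe ID `probe_001`, the probe-to-gene table, and the function name `standardize_row` are invented for illustration, and the "downregulated" mapping is an assumed mirror of the "upregulated" rule given on the slide.

```python
# Lookup tables for the translations named on the slide.
REGION_NAMES = {"LC": "locus coeruleus"}
DIRECTION_TERMS = {
    "upregulated": "increased expression",
    "downregulated": "decreased expression",  # assumed mirror case
}


def standardize_row(row, probe_to_gene, alpha=0.05):
    """Translate one raw supplementary-table row into standard terms.

    'change' is None when the row is not significant at alpha or the
    direction label is not recognized.
    """
    term = DIRECTION_TERMS.get(row["direction"].strip().lower())
    significant = row["p_value"] < alpha
    return {
        "gene": probe_to_gene.get(row["probe_id"]),  # None if unmapped
        "probe_id": row["probe_id"],
        "region": REGION_NAMES.get(row["region"], row["region"]),
        "change": term if significant else None,
    }


# Hypothetical raw row and probe mapping.
raw = {"probe_id": "probe_001", "region": "LC",
       "direction": "Upregulated", "p_value": 0.01}
mapping = {"probe_001": "tyrosine hydroxylase"}

print(standardize_row(raw, mapping))
```

Keeping the original probe ID alongside the translated gene name preserves provenance if the mapping later turns out to be wrong.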

  26. Challenges working with tables and supplemental data • Difficult data arrangements • PDF, JPG, TXT, CSV, XLS • Difficult styles: colors, symbols, data arrangements (results combined into one column, multiple comparisons in one table, legends defining values), unclearly described data (e.g., unclear significance) • Not clear what tables/values represent • Nothing in paper about the supplementary data file and table has no heading • Probe IDs are given but not gene identifiers • No link from supplemental material back to article; lose provenance • Not all results are accounted for

  27. Is SMN1 affected by drugs of abuse? SMN1 is the gene that is mutated in Spinal Muscular Atrophy, a neurological disease of children

  28. Open world vs closed world assumptions • Closed world assumption: • holds that any statement that is not known to be true is false • allows an agent to infer, from its lack of knowledge of a statement being true, anything that follows from that statement being false • typically applies when a system has complete control over information • Open world assumption: • the assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. • limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true • the open world assumption applies when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information.
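The difference between the two assumptions shows up in what a query returns for a statement that was never recorded. A minimal sketch, with hypothetical gene names and function names: under the closed world assumption an unrecorded statement is false; under the open world assumption it is unknown, represented here as `None`.

```python
from typing import Optional, Set, Tuple

Statement = Tuple[str, str]  # (gene, reported change), illustrative only


def closed_world(statement: Statement, known_true: Set[Statement]) -> bool:
    """Anything not known to be true is treated as false."""
    return statement in known_true


def open_world(
    statement: Statement,
    known_true: Set[Statement],
    known_false: Set[Statement],
) -> Optional[bool]:
    """True or False only when explicitly known; otherwise None (unknown)."""
    if statement in known_true:
        return True
    if statement in known_false:
        return False
    return None


# Hypothetical facts reported by a study.
known_true = {("gene_A", "increased")}
known_false = {("gene_B", "increased")}
query = ("gene_C", "increased")  # never mentioned in the report

print(closed_world(query, known_true))             # False
print(open_world(query, known_true, known_false))  # None
```

A closed-world answer is only safe when the data source is complete, which is why reporting every measured value matters for the interpretation discussed next.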

  29. Reporting data: Closing the open world • We measured the expression of 9000 genes as a function of chronic cocaine (S1). The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2 • What about the other 8950 genes? • Cannot assume that they were increased, decreased or remained the same (Open world) • We measured the expression of 9000 genes as a function of chronic cocaine (S1). The fold change and p value are given for each gene. The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2 (Closed world)

  30. Narrative vs Data publishing • Narrative (Author): Encourage use of minimal standards for key entities in the research paper • Subject, protocol, genes, reagents • Make it easy to find accession numbers • Standard templates for reporting supplemental data? • Unlikely although desired • Tools for linking inline references to fragments of papers rather than the entire paper • Data (Curators): Structuring data requires expertise • Positive and negative results equally important • If data are to be published in supplemental material or in the paper, they should be made machine interpretable • Ideally, the entire data set should be deposited in a public repository, e.g., the Gene Expression Omnibus (GEO)

  31. Conclusions • Humans are storytellers; it’s fundamental to the way we communicate • But these stories are directed to an audience with expertise • Scientists know each other’s work; personal networks are very important • The computer isn’t part of this • So...we need to adapt publishing practices to aid automated search and mining of content • Partnership between authors, publishers, curators and computer scientists, informaticians... • Future of research communications and e-scholarship • http://force11.org JOIN US!
