Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators DPB
Discussion Plan • What does a curator do? • What do we ALL (researchers and curators) want from the papers we read? • What problems do we encounter when reading papers? • Identifying items • Choosing annotations • How can we work together to improve these processes? • Why does this matter to YOU?
What does a curator do? • It depends on the type of curator! • Functional genomics curator / Metabolic pathway curator: • Help to maintain the TAIR and Plant Metabolic Network / AraCyc websites • Answer questions from users • Give presentations and workshops at conferences and universities • Interact with curators at other institutions to develop better curation practices and tools • Read LOTS of papers
What do we all want from papers? • It depends on the type of paper! • I focus on papers that describe: • genes/proteins (TAIR and PMN) • metabolic pathways (PMN) • We all want the important information! • Curators also want to be able to capture that information and display it for users on the TAIR and AraCyc/PMN websites.
What do we all want from papers? • What gene / protein are they talking about? • AGI locus code (TAIR / PMN) • At2g46990 • Gene symbol and FULL names (TAIR / PMN) • BSK3 = Brassinosteroid (BR)-signaling kinase 3 • GGT2 = Glutamate:Glyoxylate aminotransferase 2 • Gene model (TAIR) • At2g46990.1
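A minimal sketch of how locus codes and gene models like the ones above could be picked out of manuscript text. The pattern and the function name `find_agi_ids` are illustrative assumptions, not TAIR tooling; the pattern assumes chromosome codes 1-5, C (chloroplast) and M (mitochondrion), followed by "g" and five digits.

```python
import re

# Assumed pattern: "At" + chromosome (1-5, C, M) + "g" + five digits,
# optionally followed by ".N" to name a specific gene model.
AGI_PATTERN = re.compile(r"\bAT([1-5CM])G(\d{5})(?:\.(\d+))?\b", re.IGNORECASE)


def find_agi_ids(text):
    """Return (locus, gene_model_or_None) pairs in order of appearance."""
    hits = []
    for match in AGI_PATTERN.finditer(text):
        chromosome, number, model = match.groups()
        locus = f"At{chromosome.upper()}g{number}"
        gene_model = f"{locus}.{model}" if model else None
        hits.append((locus, gene_model))
    return hits


if __name__ == "__main__":
    sentence = "We cloned At2g46990.1 and compared it with the AT2G46990 locus."
    for locus, gene_model in find_agi_ids(sentence):
        print(locus, gene_model or "(no gene model given)")
```

A check like this only confirms that an identifier is present and well formed; it cannot tell whether the identifier is the right one for the gene symbol used in the paper.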
What do we all want from papers? • What does this gene do? • Molecular Function GO terms (TAIR) • has “protein kinase activity” - GO:0004672 • functions in “histone binding” - GO:0042393 • has “L-glutamine transmembrane transporter activity” - GO:0015186 • Phenotype description (TAIR) • “The ppc4-2 mutant has reduced PEP carboxylase activity” • Reactions catalyzed (PMN) • indole-3-acetonitrile + 2 H2O = ammonia + indole-3-acetate (IAA) • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)
What do we all want from papers? • Where is this protein found? • Cellular Component GO terms (TAIR) • located in “nucleolus” - GO:0005730 • located in “TOC complex” - GO:0010006 • Cellular Ontology (PMN) • chloroplast • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)
What do we all want from papers? • When and where is this gene / protein expressed? • Plant Structure PO terms (TAIR) • expressed in “anther” - PO:0009066 • Plant Growth Stages PO terms (TAIR) • expressed during “expanded cotyledon stage” - PO:0001078 • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)
What do we all want from papers? • What biological processes does this protein participate in? • Biological Process GO terms (TAIR) • involved in “petal development” - GO:0048441 • involved in “L-glutamate import” - GO:0051938 • involved in “brassinosteroid biosynthetic process” - GO:0016132 • Metabolic Pathways (PMN) • put enzyme in “alanine degradation” pathway • Phenotype descriptions • “The phot1-4 mutant shows reduced responses to blue light” • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)
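The GO and PO identifiers on the preceding slides share one shape: an ontology prefix, a colon, and seven digits. The sketch below collects such identifiers from free text and normalizes them; the function name `extract_term_ids` is hypothetical, and tolerating a stray space after the colon is a convenience assumed here, not a rule of either ontology.

```python
import re

# Assumed format: ontology prefix (GO or PO), a colon, and seven digits.
# Optional whitespace after the colon is accepted and then dropped.
TERM_ID = re.compile(r"\b(GO|PO):\s*(\d{7})\b")


def extract_term_ids(text):
    """Return normalized term IDs such as 'GO:0048441', in order of appearance."""
    return [f"{prefix}:{digits}" for prefix, digits in TERM_ID.findall(text)]


if __name__ == "__main__":
    note = 'involved in "petal development" - GO:0048441; expressed in "anther" - PO:0009066'
    print(extract_term_ids(note))
```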
What do we all want from papers? • What mutant(s) did they describe? (TAIR) • Mutant ID • SALK_nnnnnn • SAIL_21_A07 • Mutant name and unique symbol • rte1-2 (reversion-to-ethylene-sensitivity 1-2) • Ecotype • Zygosity (e.g. heterozygous, homozygous) • Phenotype description
What do we all want from papers? • What experiments did they do? • Assay conditions and reagents • Help curators • make GO and PO annotations (TAIR) • identify enzymatic reactions (PMN) • specific substrates, e.g. L-glutamate • necessary co-factors, e.g. Mg2+ • capture pH and temperature optimums (PMN) • We don’t capture: • PCR primers • good antibody sources • etc. • . . . but you are welcome to submit this information using “Comments”
What do we all want from papers? A lot of important information . . . Gene identity Gene function Gene expression patterns and much more! Have you ever read a paper that’s missing important information? How did that make you feel? Did it interfere with your ability to do your work?
Challenges : Identifying Objects • Case 1: Paper describes a gene or genes using a symbol • Authors never provide AGI code, sequence information, or other unique ID • Different genes can have the same symbols in TAIR • ASA: • Attenuated shade avoidance? • Anthranilate Synthase Alpha Subunit? • ARF1 • Auxin Response Factor 1? • ADP-Ribosylation Factor 1? • Not all symbols are in TAIR • Authors describe a new mutant or name a new gene family and never give IDs • Impossible for us to annotate / Impossible for you to do related experiments
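To make the symbol problem concrete, here is a toy lookup, assuming a small hand-built table of symbols and full names taken from these slides. The table, `build_index`, and `resolve` are hypothetical; a real lookup would also carry AGI locus codes, which are deliberately left out rather than guessed, and "unique" below means unique within this toy table only.

```python
from collections import defaultdict

# Symbol / full-name pairs copied from the slides (toy data, no locus codes).
SYMBOL_RECORDS = [
    ("ASA", "Attenuated Shade Avoidance"),
    ("ASA", "Anthranilate Synthase Alpha Subunit"),
    ("ARF1", "Auxin Response Factor 1"),
    ("ARF1", "ADP-Ribosylation Factor 1"),
    ("BSK3", "Brassinosteroid-Signaling Kinase 3"),
]


def build_index(records):
    """Map each upper-cased symbol to the set of full names that use it."""
    index = defaultdict(set)
    for symbol, full_name in records:
        index[symbol.upper()].add(full_name)
    return index


def resolve(symbol, index):
    """Return (status, names); status is 'unknown', 'unique' or 'ambiguous'."""
    names = sorted(index.get(symbol.upper(), ()))
    if not names:
        return ("unknown", names)
    if len(names) == 1:
        return ("unique", names)
    return ("ambiguous", names)


if __name__ == "__main__":
    index = build_index(SYMBOL_RECORDS)
    for symbol in ("ARF1", "BSK3", "TPK1"):
        print(symbol, resolve(symbol, index))
```

Both the "ambiguous" and the "unknown" outcomes are cases where only an AGI code or sequence in the paper can settle which gene is meant.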
Challenges : Identifying Objects • Case 2: Paper does not specify gene model when appropriate • a. “The T-DNA insertion is in the third exon of TPK1” Which “third exon?” Which “TPK1?” b. “We expressed TPK1 in E. coli and saw activity” Which “TPK1?” c. “A TPK1:GFP fusion protein localizes to the nucleus”
Challenges : Identifying Objects • Case 3: Not enough information is given about a mutant • “The phyb mutant had a longer hypocotyl than the wild type plant” • 30 alleles / germplasms associated with phyB in TAIR Which phyb? What ecotype?
Challenges : Identifying Objects • Case 4: Not enough information is given about enzymatic reactions • Diagram in paper shows: arogenate → tyrosine • “In vitro, AR dehydrogenase catalyzed the formation of tyrosine from arogenate” D- or L-form of amino acid? What other substrates or products are involved? What oxidizing agent is involved? • “We detected the formation of arabidiol” What is the chemical structure of “arabidiol?”
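One way to see why the arogenate to tyrosine diagram is incomplete is to count atoms on each side. The sketch below does that for the diagram and for the indole-3-acetonitrile reaction from the earlier "Reactions catalyzed" example. The molecular formulas are supplied here as assumptions (neutral forms, with indole-3-acetate written as the free acid), and all function names are hypothetical.

```python
import re
from collections import Counter

# Neutral molecular formulas, assumed here for illustration only.
FORMULAS = {
    "indole-3-acetonitrile": "C10H8N2",
    "H2O": "H2O",
    "ammonia": "NH3",
    "indole-3-acetate": "C10H9NO2",  # written as the neutral acid
    "arogenate": "C10H13NO5",
    "tyrosine": "C9H11NO3",
}


def atoms(formula):
    """Count atoms in a simple formula (no parentheses, no charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum atoms over one side, given as [(coefficient, compound_name), ...]."""
    total = Counter()
    for coefficient, name in side:
        for element, count in atoms(FORMULAS[name]).items():
            total[element] += coefficient * count
    return total


def imbalance(substrates, products):
    """Return {element: substrate atoms minus product atoms} where they differ."""
    left, right = side_total(substrates), side_total(products)
    elements = sorted(set(left) | set(right))
    return {e: left[e] - right[e] for e in elements if left[e] != right[e]}


if __name__ == "__main__":
    # Written out in full on the slide: balances.
    print(imbalance([(1, "indole-3-acetonitrile"), (2, "H2O")],
                    [(1, "ammonia"), (1, "indole-3-acetate")]))
    # As drawn in the diagram: atoms are left over on the substrate side.
    print(imbalance([(1, "arogenate")], [(1, "tyrosine")]))
```

Under these assumed formulas the diagram leaves one carbon, two hydrogens and two oxygens unaccounted for, which is the kind of gap the questions about other products and the oxidizing agent are pointing at.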
Opportunities : Identifying Objects • You are the next generation of: • Authors • Reviewers • Journal Editors • You can help each other and curators to identify all the important items in the manuscripts you write or review • AGI locus code for all genes in paper (At2g46990) • Gene model information when relevant (At2g46990.1) • Specific mutant names (abc1-7), IDs (SALK_nnnnn) and ecotype • Complete and balanced biochemical reactions • Chemical structures or chemical database IDs for compounds • But, for curators, identifying objects is only one of the challenges . . .
Challenges : Choosing annotations • Curators have to make decisions . . . • When should we make annotations? • What specific annotations should we make? • You should be concerned about how we “choose” annotations • You are data providers • We’re capturing the data from your papers • How would you like to see it presented? • You are data users • You use our annotations of individual genes • You analyze your microarray data using our GO and PO annotations • You view your transcript and metabolomic data using the OMICs viewer • How would you like to see it presented?
Challenges : Choosing annotations – YOU make the call! • When and what should we annotate using GO terms?
Challenges : Choosing annotations – YOU make the call! • Case 1: When is something “involved in” a biological process? • Molecular Function and Cellular Component annotations – pretty clear • Biological Process can be pretty ambiguous! • Glycine metabolic process • 6 mutants are uncovered that have altered levels of glycine • lgl1-1, lgl2-1, lgl3-1 make “Less GLycine” than wild-type plants • mgl1-1, mgl2-1, mgl3-1 make “More GLycine” than wild-type plants • Annotate all 6 genes: involved in “glycine metabolic process” • Use evidence code: IMP = inferred from mutant phenotype
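As a rough sketch of what "annotate all 6 genes" produces, the snippet below builds one record per mutant allele, each with the relation "involved in", the term name, and the evidence code IMP. The `Annotation` class and helper names are hypothetical and much simpler than a real annotation record, which would also carry a term ID and a reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene: str       # gene symbol, e.g. "LGL1"
    relation: str   # e.g. "involved in"
    term: str       # GO term name
    evidence: str   # evidence code, e.g. "IMP"


# Mutant alleles named on the slide.
ALLELES = ["lgl1-1", "lgl2-1", "lgl3-1", "mgl1-1", "mgl2-1", "mgl3-1"]


def gene_symbol(allele):
    """Derive a gene symbol from an allele name: 'lgl1-1' -> 'LGL1'."""
    return allele.split("-")[0].upper()


def annotate_from_mutants(alleles, term):
    """Make one IMP-supported 'involved in' annotation per mutant allele."""
    return [Annotation(gene_symbol(a), "involved in", term, "IMP") for a in alleles]


if __name__ == "__main__":
    for annotation in annotate_from_mutants(ALLELES, "glycine metabolic process"):
        print(annotation)
```

All six records look identical apart from the gene symbol, which is exactly why the evidence code deserves attention: IMP says only that a mutant phenotype was seen, not how directly the gene product acts in the process.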
Challenges : Choosing annotations – YOU make the call! • Which genes are “involved in” glycine metabolic process? • LGL1 = threonine aldolase • LGL2 = transcription factor: up-regulates enzyme • LGL3 = tyrosine kinase: turns on TF • MGL1 = F-box protein (E3 ligase subunit): degrades kinase • MGL2 = phosphatase: promotes E3 ligase activity • MGL3 = nucleoporin: allows phosphatase to enter nucleus • Where do we stop? • Should we change old annotations? (Evidence code is important – be aware of IMP!) • What belongs in a GO annotation versus a phenotype description?
Challenges : Choosing annotations – YOU make the call! • Case 2: How do we deal with over-expressers? RNAi? etc.? • What biological process is XYZ1 involved in? • 35S:XYZ1 • more petals than wild type plants • xyz1 KO mutants • normal number of petals • Is XYZ1 involved in “petal development?” • XYZ1 is “only expressed in roots” • XYZ1 is “expressed at very low levels in flowers” • XYZ1 – no expression data mentioned • What if XYZ1 is part of a large gene family? • What if XYZ1 is unique (not related to other genes)?
Challenges : Choosing annotations – YOU make the call! • Case 3: When is it “enough” to make an annotation? • JKL is expressed in “rosette leaves” • “RT-PCR analyses show expression of JKL in rosette leaves” • “JKL is expressed at low levels in rosette leaves” • “JKL expression is barely detectable in rosette leaves” • GHI has enzymatic activity with the following substrates in vitro: • IAA + isoleucine -> IAA-Ile (90%) • IAA + leucine -> IAA-Leu (50%) • IAA + histidine -> IAA-His (20%) • IAA + cysteine -> IAA-Cys (5%) • IAA + proline -> IAA-Pro (1%) • Which Molecular Functions do we annotate with GO in TAIR? • Which reactions do we add to AraCyc? • What if the reactions are characterized in vivo?
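The GHI example can be turned into a small experiment: pick a cutoff and see which reactions survive. This is only a sketch of the dilemma, not a description of how TAIR or AraCyc curators decide. The percentages are the ones listed for GHI, read here as relative activity (an assumption), and `reactions_above` is a hypothetical helper.

```python
# Percentages as listed for GHI, read here as relative activity (assumption).
RELATIVE_ACTIVITY = {
    "IAA-Ile": 90,
    "IAA-Leu": 50,
    "IAA-His": 20,
    "IAA-Cys": 5,
    "IAA-Pro": 1,
}


def reactions_above(cutoff_percent, activities=RELATIVE_ACTIVITY):
    """Return products whose listed activity meets or exceeds the cutoff."""
    return [product for product, activity in activities.items()
            if activity >= cutoff_percent]


if __name__ == "__main__":
    for cutoff in (50, 10, 1):
        print(f"cutoff {cutoff}%:", reactions_above(cutoff))
```

Any fixed cutoff is arbitrary: moving it from 50% to 10% to 1% changes the answer from two reactions to three to five, which is why the slide asks the audience to make the call.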
Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support • Which genes are “expressed in” these tissues?
Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support • “The expression of 11 genes was detected in leaves.”
Challenges : Choosing annotations – YOU make the call! Case 5: Which term is “most” appropriate? • GRI (Grim Reaper) is involved in the regulation of extracellular ROS-induced cell death • “gri plants show increased ROS-induced cell death and reduced seed content.” • “The seed content in siliques was reduced in gri and GRI overexpressors compared with Col-0 and vector control.” (Wrzaczek et al., 2009) • Are the siliques shorter? → involved in “fruit development”? • Are there empty spaces in normal siliques? → involved in “seed development”?
Opportunities : Choosing annotations – YOU make the call! • You can be the annotators of the future! • informally : e-mail us or drop by and say hello! • use TAIR or PMN submission forms • during journal publication process • Plant Physiology (now) • more journals in the future!
Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators • We all read papers • We all want to extract important and useful information from papers • We all want reliable annotations in our databases • Challenges: • Sometimes it is difficult to find the information we need in papers • Sometimes it is hard to judge how to curate data in papers • Opportunities: • Authors, reviewers, and editors can make sure that papers have adequate information • Curators can help researchers to directly submit annotations to TAIR or the PMN • Curators and researchers can communicate about the curation process • You know what we want • We know what you want! • We all work together to advance scientific research!
Thank you! TAIR, AraCyc, and the PMN Eva Huala (Director and Co-PI) Sue Rhee (PI and Co-PI) Tech Team Members: - Bob Muller (Manager) - Larry Ploetz (Sys. Administrator) - Raymond Chetty - Anjo Chi - Vanessa Kirkup - Cynthia Lee - Tom Meyer - Shanker Singh - Chris Wilks Metabolic Pathway Software: - Peter Karp and SRI group Current Curators: - Tanya Berardini (lead curator – functional annotation) - David Swarbreck (lead curator – structural annotation) - Peifen Zhang (Director and lead curator – metabolism) - A. S. Karthikeyan (curator) - Philippe Lamesch (curator) - Donghui Li (curator) - Rajkumar Sasidharan (curator) Recent Past Contributors: - Debbie Alexander (curator) - Christophe Tissier (curator) - Hartmut Foerster (curator) NSF