Informatics Journal Club and Research Talk Template
Research Paradigm • Driving Biomedical Problem • Informatics Methods (existing) • New Methods • Apply to 2nd problem area to see generality of new method • Evaluate: 1) ability to solve the biomedical problem & 2) incremental improvement from the new method
Guide to Talks • Pick journal club paper or research topic that discusses new or improved informatics methods • Make the focus of your talk be a description of the methods. • Describe methods in the context of previous work and in light of evaluations of the methods applied to the biomedical problem
Journal Club: {Title of Paper} by {Authors}{Bibliographic reference} {Your Name} {Date}
BMI Journal Club: Finding function: evaluation methods for functional genomic data. Myers, Barrett, Hibbs, Huttenhower & Troyanskaya. BMC Genomics 2006 7:187. Russ B. Altman, 10/4/06
Why this paper? • {Brief bullet points about why this paper is a good BMI journal club paper, and why you selected it}
Outline (part 1) • General description of medical/biological problem • Informatics issues that come up in solving those problems • Additional biological/informatics background • Aims of paper
Outline (part 2) • Methods Employed • Results • Comparison/Evaluation of Methods • Authors' Conclusions • Assessment of paper: informatics • Assessment of paper: biomedicine • Concerns • Summary / Your conclusions
Why this paper? • Needed a good methodological paper • Proliferation of work here and elsewhere on predicting gene function from high throughput genomics • This paper addresses an important problem in evaluation, and uses general informatics principles • Olga is a recent BMI graduate :)
Potentially confounding biomedicine! ;) • {What is the application area of biology or medicine in which this work is presented?} • {Discussion of the biological or medical problem that drove/required/suggested researchers to recognize potential for informatics innovation} • {What is the significance of this biomedical problem?} • {REMEMBER TO SEPARATE THE INFORMATICS FROM THE BIOMEDICAL APPLICATION. THAT MAY LEAVE NOTHING…}
(Potentially confounding) biomedical background… • With the human genome sequenced, we need to understand the interactions and functions of genes (for understanding, drug design) • High-throughput experimental data sets are used and integrated for this purpose: two-hybrid, mRNA expression, affinity precipitation • Diverse algorithms are also created for integrating these data: • Naïve Bayes (Troyanskaya & others) • Probabilistic Relational Models (Koller) • Comparative techniques (Segal & Stuart)
More biology context… • It is critical to assemble networks of interacting and functionally related genes in order to generate hypotheses about cellular biology, identify drug targets, and assess pathway engineering opportunities. • Yeast is the best-studied organism because of the wealth of data sets • Authors suspect that use of existing “silver standards” may skew conclusions about high vs. low information content methods/data sources. • Scientists are frustrated if many predictions are “high confidence” and then fail in the lab.
Informatics Problem • {Describe what is the general biomedical informatics question/problem addressed in the paper} • {Brief review of what others have done to solve this problem, and how performance has been. THIS MAY REQUIRE READING OTHER PAPERS!} • {Why is there another paper on this topic?}
Informatics Problem • Whenever a method is created that makes “predictions” or “diagnoses” it must be evaluated against a gold standard of truth. • When making multiple predictions, there can be biases in the gold standard based on its coverage of the predicted space • The resulting reports of performance can vary widely and unpredictably based on which parts of the gold standard are used. • This is a relatively new problem in the context of large scale predictive technologies
Informatics Problem • What is the best way to evaluate a system making thousands or millions of predictions? • How can we “level the playing field” so that different methods and data sources can be assessed with respect to information content fairly?
Biomedical Context (alternative slide location) • [You may want to address the informatics question first and then raise the medical/biological context, but it often flows better if you start with the biomedical context and use that to motivate the informatics question.]
Background • {Review of informatics and biomedicine people need to know in order to understand the key contributions of the paper}
Background • Gene Ontology • Taxonomy of gene function, 30K+ terms • Terms assigned to genes manually = genes related if they get the same term • KEGG • Database of biological pathways • Mostly metabolic, manually curated • Genes in same pathway = related • Each of these provides a biased coverage of gene function space!
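The "same term = related" rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the term names, gene names, and the helper `positive_pairs` are hypothetical, and real GO or KEGG annotations would be loaded from their own files rather than typed in.

```python
from itertools import combinations

def positive_pairs(annotations):
    """Return every gene pair that shares at least one term (or pathway)."""
    pairs = set()
    for genes in annotations.values():
        # sorted() gives each pair a canonical order, so (g1, g2) == (g2, g1)
        for a, b in combinations(sorted(genes), 2):
            pairs.add((a, b))
    return pairs

# Hypothetical toy annotations: term -> set of genes carrying that term
toy_annotations = {
    "term_A": {"g1", "g2", "g3"},
    "term_B": {"g3", "g4"},
}

related = positive_pairs(toy_annotations)
```

Genes that carry no term at all never appear in any pair, which is one concrete way the coverage of the standard ends up biased.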
Background • GO is organized from most general (top) to most specific (bottom) • For validation, people often choose a “level” of GO at which they define GO annotations to be “meaningful.” • E.g. All GO codes at level 5 or below = sufficiently precise predictions.
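A level cutoff of this kind might look like the sketch below. The toy hierarchy, the `level` definition (shortest path to the root), and the function names are all assumptions made for illustration. GO is a graph in which a term can have several parents, so "level" depends on which path is counted.

```python
from functools import lru_cache

# Hypothetical toy hierarchy: term -> list of parent terms (the root has none)
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["b", "root"],  # two parents: one deep, one directly under the root
}

@lru_cache(maxsize=None)
def level(term):
    """Level = length of the shortest path from the term up to the root."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + min(level(p) for p in parents)

def terms_at_or_below(min_level):
    """Terms considered 'sufficiently precise' under a simple level cutoff."""
    return {t for t in PARENTS if level(t) >= min_level}

precise_terms = terms_at_or_below(2)
```

In this toy graph, term "c" is a child of "b" yet sits at level 1 through its second parent, so a cutoff at level 2 keeps "b" and drops its more specific child.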
Aims of Paper • {As in BMI 212, a listing of the specific aims of the paper. No more than 3 usually (often less).} • {NOTE: the paper should be presented initially in the most positive light, as the authors would have presented it. The time for critique is after the “author perspective” presentation.}
Aims of Paper • Define the problem of biased gold standards in high-throughput evals. • Create a method for comparing prediction methods fairly • Build a manual gold standard and associated web tool • Allow evaluations to report not only overall performance, but area-specific performance.
Methods Employed • {This is the key part of the presentation for BMI crowd. This should be a presentation of the methods described in the paper at sufficient technical level so people can discuss and evaluate it. Avoid detailed math/equations unless absolutely critical to the discussion.}
Methods Employed • 6 post-doctoral biologists • Examine every GO code and vote on “informative” or “not informative” if applied to a gene • 3 “informative” votes = useful category • <1 “informative” and >1000 annotations = not useful category • “Not usefuls” are key denominator for computations of precision/specificity
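The voting rule on this slide can be written down as a small function. This is a sketch of the thresholds exactly as the slide states them, not the authors' code: the function name, the "unclassified" label for everything in between, and reading "3 votes" as "at least 3 of 6" are assumptions.

```python
def classify_term(informative_votes, n_annotations):
    """Classify a GO term from the expert votes, following the slide's thresholds."""
    # Assumption: "3 informative votes" means at least 3 of the 6 experts
    if informative_votes >= 3:
        return "useful"
    # Slide: "<1 informative and >1000 annotations", taken literally here
    if informative_votes < 1 and n_annotations > 1000:
        return "not useful"
    # Assumption: terms matching neither rule are left out of the standard
    return "unclassified"

# Hypothetical examples: (votes, number of annotated genes)
examples = [(5, 40), (0, 2500), (0, 200), (2, 3000)]
labels = [classify_term(v, n) for v, n in examples]
```

Under these rules a broad term with no "informative" votes lands in the negative set only if it is also heavily annotated, which is what lets it serve as a denominator.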
Results • {Recapitulate major results. Usually by presenting main figures from the paper.}
Methods • With “gold standard” GO codes that they trust, can now analyze methods/data sources and give specific performance report on different areas (of biology). • Can also systematically remove GO topics in order to see if there are dominant effects (e.g. remove ribosomes)
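A toy version of the area-specific report and the "remove ribosomes" check is sketched below. The prediction list and its numbers are made up for illustration, and the function names are hypothetical. The only point is to show how one dominant area can inflate an overall score.

```python
from collections import defaultdict

# Hypothetical predictions: (biological area, whether the gold standard confirms it)
predictions = [
    ("ribosome", True), ("ribosome", True), ("ribosome", True), ("ribosome", True),
    ("transport", True), ("transport", False),
    ("signaling", False), ("signaling", False),
]

def precision(preds):
    """Fraction of predictions confirmed by the gold standard (None if empty)."""
    if not preds:
        return None
    return sum(ok for _, ok in preds) / len(preds)

def precision_by_area(preds):
    """Break the same score down per biological area."""
    groups = defaultdict(list)
    for area, ok in preds:
        groups[area].append((area, ok))
    return {area: precision(group) for area, group in groups.items()}

def precision_without(preds, area):
    """Recompute the score after systematically removing one area."""
    return precision([p for p in preds if p[0] != area])

overall = precision(predictions)
by_area = precision_by_area(predictions)
no_ribosome = precision_without(predictions, "ribosome")
```

With these made-up numbers the overall precision is 0.625, but it drops to 0.25 once the ribosome predictions are removed, and the per-area table shows where the method actually works.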
Authors' Conclusions • {A presentation of how the authors summarize their results and significance. Usually not more than 3 major points. Often one.}
Authors' Conclusions • Curated GO codes now provide a more trustworthy gold standard • Allows tools to be built that give • Overall performance • Subarea-specific breakdown of performance • Direct comparison of different methods/data sources • Sets the bar on evaluation, and starts a discussion about community-wide standards.
Assessment of Paper: Informatics • {What are the major methodological (engineering) innovations in the paper, in your opinion?} • {Are the methods presented soundly, completely, and evaluated appropriately?} • {How general are the methods presented for use in other areas either directly or with some effort by others?}
Assessment of Paper: Informatics • Beautiful description and justification of the work. Clearly a general problem. • Well informed by research in the field, and evaluation of problems that arise in eval. • Solution applicable in many domains • Close (KEGG, NLP, others) • Farther (Any large volume prediction activity) • Some bias in expert-based gold standards • Very good availability of specific tool to allow use (cf. Maureen)
Assessment of paper: Biomedicine • {Has the paper helped make a new contribution of biomedical knowledge?} • {What is the domain significance of this paper?} • {Was it published in the right journal to find the audience who should care about it the most?}
Assessment of paper: Biomedicine • Should greatly reduce the noise in papers about high-throughput predictions • Should create a new bar for performance • Systems biology and interaction informatics workers need to pay attention. • Microarray information content may be lower than thought previously on average • Genomics audience is a good one, since they need to be aware of these relatively sophisticated informatics issues.
Detailed Concerns • {Particularly if you don’t like the paper, what are your technical informatics concerns about the method, implementation or evaluation?}
Detailed Concerns • A little confused about negative gold standard and how it is meant to be used. (Email in to Olga…) • There are still biases in the gold standard (e.g. GO) by omission that can’t be addressed without more work • What is #2 bad GO area after “ribosome”?… that example is used a lot in the paper.
Summary Conclusions • {Do you accept all of the authors' conclusions previously presented?} • {Modified conclusions that you would accept}
Summary Conclusions • Very important paper for evaluation of these methods • Now mandatory for future papers to address these issues. • Authors' aims achieved • Showed the problem • General solution proposed • Specific solution built and disseminated
References • {This paper, and other related papers that a BMI student studying for quals or otherwise interested could review.}
References • Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG. Finding function: evaluation methods for functional genomic data. BMC Genomics. 2006 Jul 25;7:187. PMID: 16869964 • Lin N, Wu B, Jansen R, Gerstein M, Zhao H. Information assessment on predicting protein-protein interactions. BMC Bioinformatics. 2004 Oct 18;5:154. PMID: 15491499 • Lee SG, Hur JU, Kim YS. A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics. 2004 Feb 12;20(3):381-8. Epub 2004 Jan 22. PMID: 14960465 • Jansen R, Gerstein M. Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Curr Opin Microbiol. 2004 Oct;7(5):535-45. PMID: 15451510 • Ben-Hur A, Noble WS. Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics. 2006 Mar 20;7 Suppl 1:S2. PMID: 16723005
Acknowledgments • {Thanks to those who contributed to preparation of presentation.} • {Don’t hesitate to contact authors of paper for clarifications. They are usually flattered that you are looking at their paper.}
Acknowledgments • Maureen Hillenmeyer first brought this paper to my attention. • Olga provided a few clarifications that I needed after reading the paper. • BMI-exec encouraged me to do this as an example for how we would like students to select and present BMI JC papers this year.
Thanks. {insert your email address}
Thanks. russ.altman@stanford.edu