Retrospective study of a gene by mining texts: The Hepcidin use-case
Fouzia Moussouni-Marzolf
Introduction
Life Science is becoming the most VOLUMINOUS science. 3 major reasons:
• Modern digital revolution: the INTERNET
• Increasing incentive to publish: competition pressure, evaluation concerns at several levels
• Sharing of knowledge at a global scale
Introduction
Rapid expansion of the biomedical literature: the number of available papers is exploding.
• Hepcidin, since December 2000: a BOOM of publications since 2000 (publication trend from MLTrends)
• The iron regulation system is still difficult to comprehend
• Comprehension of associated diseases by medical experts
• Increased demand for effective text mining tools to quickly find relevant information
Introduction
These tools extract a deluge of information: very dense data.
[Figure: text mining with Ali-Baba and a global query « Hepcidin » [1]. Panels: Hepcidin, January 2011; Hepcidin, February 2011. Many common events, few new ones.]
• For a non-expert: the information is dense and unreadable.
• For an expert: a considerable amount of well-known data (background); the pertinent information is hidden.
• Biologists are rapidly discouraged from using these tools.
[1] Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J. & Leser, U. AliBaba: PubMed as a graph. Bioinformatics 22, 2444-2445 (2006).
Introduction
Which solutions for managing this increasing flood of extracted information?
Unfolding time during the text mining process:
• Reduce the density of information at each period of time
• Perceive a chronology in the sequence of events linked to a gene, which enhances comprehension
• Locate trivial information that is repeatedly published and extracted [2]
• Select the most relevant events over time = reduced density of information
[2] Jensen, L.J., Saric, J. & Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet. 7, 119-129 (2006).
Methods
Focus on 2 frames of study
1. Exploit the text mining engine Ali-Baba (HU-Berlin)
• Information extraction tool working on the Medline abstracts that result from a PubMed query, e.g. Hepcidin 2005 [dp]
• Ali-Baba is not a simple pattern matching tool that counts keyword occurrences. It uses dictionaries to recognize actual biological entities located in the abstracts.
• Different sorts of bio-entities are extracted: protein, disease, cell type, tissue, drug, species
Methods
Ali-Baba extracts relationships between recognized bio-entities, namely bio-events.
Example sentence: "… STAT3 inhibitors, including curcumin, AG490 and a peptide (PpYLKTK), reduced hepcidin1 …"
Biological events extracted (source entity → relationship → target entity):
• Curcumin → reduce → hepcidin1
• AG490 → reduce → hepcidin1
• Peptide (PpYLKTK) → reduce → hepcidin1
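The three events in the example can be written down as (source, relationship, target) triples. The sketch below is only an illustration of that data shape: the `BioEvent` type and the `events_from_sources` helper are hypothetical names and do not reflect Ali-Baba's internal or export format.

```python
from typing import List, NamedTuple


class BioEvent(NamedTuple):
    """One bio-event: a relationship from a source entity to a target entity.

    Hypothetical container for illustration, not Ali-Baba's own data model.
    """
    source: str
    relationship: str
    target: str


def events_from_sources(sources: List[str], relationship: str, target: str) -> List[BioEvent]:
    """Build one event per source entity, all pointing at the same target."""
    return [BioEvent(source, relationship, target) for source in sources]


# The example sentence yields three events that all target hepcidin1.
events = events_from_sources(
    ["curcumin", "AG490", "peptide (PpYLKTK)"], "reduce", "hepcidin1"
)
for event in events:
    print(f"{event.source} -> {event.relationship} -> {event.target}")
```

Keeping events as hashable tuples makes it easy to store them in sets, which is convenient when comparing the graphs of different periods.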
Methods
Abstracts of « Hepcidin 2005 [dp] » → extraction of bio-events (natural language processing (NLP), co-occurrence) → graph of events
Methods
• Focus on the Hepcidin gene
• Corpus of linked biological events published from the gene's discovery until today (timeline: December 2000 to June 2012)
• Retrospective study of Hepcidin over time, with period = 1 month
• Filter trivial bio-events
• Select relevant bio-entities
Methods
What is a time-relevant biological entity?
Definition: a biological entity e recognized by an IE-based text mining system is time relevant for period t if, at time t, it achieves a maximum of relationships with other biological entities recognized by the same IE-based system.
In the graph G(Nodes, Edges) of extracted bio-events, a t-relevant biological entity e is highly targeted by other bio-entities at time t.
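One way to read this definition is as an in-degree maximum on the event graph of a single period. The sketch below assumes exactly that reading: it counts the distinct source entities pointing at each target and returns every entity that reaches the maximum. The function name, the tie handling and the choice to ignore self-relationships are assumptions, not details given in the slides.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Event = Tuple[str, str, str]  # (source entity, relationship, target entity)


def t_relevant_entities(events: Iterable[Event]) -> Set[str]:
    """Return the entities targeted by the most distinct sources in one period.

    Assumption: "maximum of relationships" is read as maximum in-degree,
    and ties are all kept.
    """
    sources_by_target = defaultdict(set)
    for source, _relationship, target in events:
        if source != target:  # assumption: self-relationships do not count
            sources_by_target[target].add(source)
    if not sources_by_target:
        return set()
    best = max(len(sources) for sources in sources_by_target.values())
    return {entity for entity, sources in sources_by_target.items() if len(sources) == best}


period_events = [
    ("curcumin", "reduce", "hepcidin1"),
    ("AG490", "reduce", "hepcidin1"),
    ("peptide (PpYLKTK)", "reduce", "hepcidin1"),
    ("hepcidin1", "regulate", "ferroportin"),  # made-up event for illustration
]
print(t_relevant_entities(period_events))
```

Restricting the input to events whose target is of one entity type (protein, disease, and so on) would give a per-type relevance.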
Methods
T-relevance can be computed for different sorts of biological entities.
• Source entity: protein, disease, cell type, tissue, drug, species
• Relationships
• Target entity: protein, disease, cell type, tissue, drug, species
Each kind of relevance provides different valuable information.
Methods
What is a trivial biological event at time t?
A trivial event Te = an event already published before t.
• G0 = graph of events at time t0
• G1 = graph of events at time t1 = t0 + p
• G2 = graph of events at time t2 = t0 + 2p
• ...
On the timeline t0 + p, t0 + 2p, t0 + 3p, ...:
• Trivial at t1: Te ∈ G1 and Te ∈ G0
• Trivial at t2: Te ∈ G2 and (Te ∈ G1 or Te ∈ G0)
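The triviality rule amounts to a running set difference: an event in Gk is kept only if it appears in none of G0 ... Gk-1. A minimal sketch, assuming each period graph is reduced to a set of hashable events (the function name and the toy events are hypothetical):

```python
from typing import Hashable, Iterable, List, Set


def clear_trivial_events(graphs: Iterable[Iterable[Hashable]]) -> List[Set[Hashable]]:
    """For each period, keep only the events not published in any earlier period.

    `graphs` must be ordered in time: G0, G1, G2, ...
    """
    seen: Set[Hashable] = set()      # every event published before the current period
    cleared: List[Set[Hashable]] = []
    for events in graphs:
        current = set(events)
        cleared.append(current - seen)  # non-trivial events of this period
        seen |= current
    return cleared


# Toy events named a, b, c, d for illustration.
g0 = {"a", "b"}
g1 = {"b", "c"}        # b is trivial at t1 (already in G0)
g2 = {"a", "c", "d"}   # a and c are trivial at t2 (in G0 or G1)
print(clear_trivial_events([g0, g1, g2]))
```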
Methods
Data processing pipeline
For each period t in [t0, tn]:
• Query(t) = « Gene t [dp] », sent to the Ali-Baba web service
• GraphML export of the events extracted and drawn for period t
• Insertion into the GraphML database
Then:
• Data stamping, data transformation and data integration: integrated time-based events of the decade
• Clearing of trivial data
• Selection of t-relevant bio-entities
• Final retrospective data analysis
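The per-period queries can be generated mechanically. The sketch below only builds the query strings; it does not call the Ali-Baba web service, whose interface is not described in the slides. It assumes that a monthly period is written as YYYY/MM before the [dp] tag, and `monthly_queries` is a hypothetical helper name.

```python
from typing import List, Tuple


def monthly_queries(gene: str, start: Tuple[int, int], end: Tuple[int, int]) -> List[str]:
    """Build one « Gene t [dp] » query per month, from start to end inclusive.

    start and end are (year, month) pairs.
    Assumption: the period t is formatted as YYYY/MM.
    """
    year, month = start
    queries = []
    while (year, month) <= end:
        queries.append(f"{gene} {year}/{month:02d} [dp]")
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return queries


queries = monthly_queries("Hepcidin", (2000, 12), (2011, 12))
print(len(queries), queries[0], queries[-1])
```

Each query string would then be submitted for extraction, and the resulting graph stamped with its period before being stored.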
Results
Hepcidin gene use case
• From t0 = 12/2000 to tn = 12/2011
• Database of more than 50,000 published biological events
• Considerable amount of trivial events: background?
• Cumulative quantification of trivial events over time: 52% of the events published over the whole Hepcidin decade are trivial
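A cumulative share of trivial events can be computed in a single pass over the period graphs. The sketch below is one possible way to do it, not necessarily how the 52% figure was obtained: it assumes that each period contributes each distinct event once and that an event is trivial when it was already seen in an earlier period. The function name and the toy data are hypothetical.

```python
from typing import Hashable, Iterable, List, Set


def cumulative_trivial_share(graphs: Iterable[Iterable[Hashable]]) -> List[float]:
    """Return, after each period, the fraction of published events that were trivial.

    `graphs` must be ordered in time; each one holds the events of one period.
    """
    seen: Set[Hashable] = set()
    total = 0     # events published so far, counted once per period
    trivial = 0   # of which already published in an earlier period
    shares: List[float] = []
    for events in graphs:
        current = set(events)
        total += len(current)
        trivial += len(current & seen)
        seen |= current
        shares.append(trivial / total if total else 0.0)
    return shares


# Toy data: 7 published events overall, 3 of them repeats of earlier ones.
shares = cumulative_trivial_share([{"a", "b"}, {"b", "c"}, {"a", "c", "d"}])
print(shares)
```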
Results
Relevant bio-entities over time: Hepcidin gene use case
Relevant proteins over time
• Before clearing trivial events: permanent visibility of Hepcidin as relevant
• After clearing: new information emerges as highly targeted; several proteins regulate Hepcidin transcription
Results
Relevant bio-entities over time: Hepcidin gene use case
Relevant diseases over time
• Before clearing trivial events: permanent visibility of hemochromatosis and iron overload
• After clearing: new diseases linked to Hepcidin and iron, such as neurological diseases, emerge as highly targeted
Results More annotations of the “relevant entities”
Conclusion
A new, straightforward approach for retrospective studies of genes has been proposed. Time has been coupled to the information extraction process to improve comprehension of the considerable number of biological events linked to the Hepcidin gene since its discovery in December 2000.
This work is still ongoing. Current developments:
• Toward a generalization to queries on any biological entity
• Exclude review papers and the "background" and "methods" sections from mining, to minimize trivial events and entities
• Threshold of relevance, threshold of triviality
Acknowledgments
• Contributors
• Bertrand De-Cadeville
• Master2 MSB
• Olivier Loréal, resp. Iron Team
• INSERM UMR 991
• Ulf Leser, resp. Bioinformatics Team
• HU-Berlin
• Astrid Rheinlander
• Ali-Baba Team at Berlin