Ariadne Genomics technology: Extraction from the literature and network analysis Dr. Anton Yuryev Ariadne Genomics Inc.
Pathway Studio product line
• Products: Pathway Studio desktop, Pathway Studio workgroup, Pathway Studio enterprise
• Main functionality: data mining and pathway building; analysis of high-throughput data; text-mining, fact extraction and database building
Ariadne Corporate Offering: software solution for knowledge management and pathway analysis of high-throughput data
• Text-mining: MedScan (1000 abstracts/min)
• Diagram components: knowledge databases; pathway building; pathway collection; proprietary data; ResNet Biological Association Networks; public interaction data; analysis of high-throughput data
Ariadne Database Construction
• Automatic fact extraction by MedScan from organism-specific subsets of PubMed and full-text journals
• Import of Ariadne proprietary curated data: curated physical interactions; 712 signaling line pathways
• Import of publicly available curated interaction data: Entrez Gene, BIND, HPRD, KEGG, Gene Ontology
• Import of publicly available high-throughput interaction data (Y2H, mass spectrometry, etc.)
• Import of user proprietary data: proprietary or publicly available experimental data in PSI, BioPAX or tab-delimited formats; data mined by the MedScan tool from literature sources not included with the database
• User manual curation
> 130 KEGG metabolic pathways >70 STKE pathways (AAAS) >10,000 ERGO pathways for 587 organisms (Integrated genomics) >100,000 protein interactions from Hynet (Prolexys) >600 disease pathways PathArt (Jubilant) Additional Commercial datasets
Pathway Studio Enterprise distinctions
• Web client for instant pathway publishing
• Connection between multiple geographical sites
• 3-tier architecture with a Java API to connect third-party applications and algorithms
• MedScan Enterprise license: open MedScan dictionaries and pattern rule files for customization; distribution of MedScan data across the entire company
• GSEA, NEA and network clustering algorithms for analysis of high-throughput data
Pathway Studio Enterprise Architecture
• Server side: application server; database
• Access: read-only users via web browser; data editors via web browser; bioinformaticians via Pathway Studio
• Integration: third-party tools, in-house applications, API; SQL interface, bulk data management
“Everyone is an Expert” decentralized deployment schema: hundreds or thousands of users, some with read-only roles and some with editor or publisher roles, access one central database via Pathway Studio and/or a web browser to analyze experiments, browse the pathway collection, mine the literature, and share data and analysis results.
“Bioinformatics service group” centralized deployment schema: a bioinformatics group serves scientists across the entire company by analyzing their experimental data and mining the literature. Analysis results are published for end users via a web browser interface.
• End users: view-only access to pathways and analysis networks annotated with experimental data, via web browser and links to PathwayExpert Web Services; experimental data; search requests
• Bioinformatics group: analysis of experimental data; text-mining and pathway building
“Disease area” decentralized clusters deployment schema: each disease area group has bioinformaticians, biologists and chemists working as a team focused on one disease. Groups shown: cardiovascular, cancer, digestive disorders, CNS.
Plan of the talk
• Text-mining, fact extraction and database building: stay current with the literature; build focused literature networks; build focus databases
• Data mining and pathway building: understand molecular mechanisms of disease and processes; maintain pathway collection; build focus databases
• Analysis of high-throughput data: functional ontology analysis; network analysis
How does MedScan extract facts from text?
• Sentence in PubMed: “Axin binds beta-catenin and inhibits GSK-3beta.”
• Identify proteins in the dictionary: Axin, beta-catenin, GSK-3beta
• Identify interaction types: binds, inhibits
• Extracted facts: Axin - beta-catenin, relation: Binding; Axin -> GSK-3beta, relation: Regulation, effect: Negative
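The steps on this slide can be imitated with a toy dictionary lookup. This is only a sketch under strong assumptions: the three-protein dictionary, the verb table and the rule "the first protein is the subject, the next protein after each verb is its object" are all hypothetical, and MedScan's real entity recognizer and grammar rules are not reproduced here.

```python
import re

# Hypothetical mini-dictionary; real MedScan dictionaries are far larger.
PROTEINS = ["Axin", "beta-catenin", "GSK-3beta"]

# Hypothetical verb table: verb -> (relation type, effect).
VERBS = {
    "binds": ("Binding", None),
    "inhibits": ("Regulation", "Negative"),
    "activates": ("Regulation", "Positive"),
}


def extract_facts(sentence):
    """Return (subject, object, relation, effect) tuples for one sentence.

    Assumes one subject (the first protein mentioned) and that the object
    of each verb is the first protein mentioned after that verb.
    """
    # Step 1: locate dictionary proteins by character position.
    entities = sorted(
        (m.start(), name)
        for name in PROTEINS
        for m in re.finditer(re.escape(name), sentence)
    )
    # Step 2: locate known interaction verbs by character position.
    verbs = sorted(
        (m.start(), verb)
        for verb in VERBS
        for m in re.finditer(r"\b" + re.escape(verb) + r"\b", sentence)
    )
    if not entities:
        return []
    subject = entities[0][1]
    # Step 3: pair each verb with the next protein that follows it.
    facts = []
    for verb_pos, verb in verbs:
        following = [name for start, name in entities if start > verb_pos]
        if following:
            relation, effect = VERBS[verb]
            facts.append((subject, following[0], relation, effect))
    return facts


facts = extract_facts("Axin binds beta-catenin and inhibits GSK-3beta.")
```

For the slide's sentence this yields one Binding fact (Axin, beta-catenin) and one negative Regulation fact (Axin, GSK-3beta).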
Filtering by number of references controls network confidence in Pathway Studio. Example relation: Binding (references: 77), owner: public, entities: E2F1-RB1. Supporting sentence: “This stabilization of the pRB-E2F-1 complex by AAV expression in adenoviral-infected cells should lead to a decrease in E2F-1-mediated expression of cell cycle-specific genes.”
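A minimal sketch of reference-count filtering, assuming relations are held as plain dictionaries with a "references" field. The field names, the threshold and the second relation are illustrative assumptions, not Pathway Studio's data format.

```python
def filter_by_references(relations, min_references):
    """Keep only relations supported by at least min_references references."""
    return [r for r in relations if r["references"] >= min_references]


relations = [
    # Taken from the slide: Binding between E2F1 and RB1 with 77 references.
    {"entities": ("E2F1", "RB1"), "type": "Binding", "references": 77},
    # Made-up placeholder relation with a single supporting reference.
    {"entities": ("P1", "P2"), "type": "Binding", "references": 1},
]

# Raising the threshold trades coverage for confidence.
confident = filter_by_references(relations, min_references=3)
```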
MedScan Architecture
• Modules: entity recognizer; semantic processor; pattern matcher
• Processing steps: entity detection; relationship extraction
• Customizable by user: dictionaries, rules, patterns
• Cartridges: Toxicology, Drosophila, Mammals, C. elegans, Yeast, Plants
• RNEF XML
• Future: new modules (ConceptScan); new cartridges (Immunology, Clinical)
Describing MedScan
• Manually curated: dictionaries and grammar rules
• Fast: 14 million PubMed abstracts in 2 days on a modern PC
• Comprehensive: fact recovery rate > 90% (90% = 70% sentence recovery rate + 20% literature redundancy)
• Removes redundancy: 7,647,282 non-distinct relations => 1,000,000 distinct relations
• Accurate: false positive rate of 10%
• Customizable: dictionaries and patterns
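Redundancy removal can be pictured as collapsing repeated extracted facts into distinct relations while counting how often each was seen. The sketch below assumes facts are (source, target, type) tuples and that Binding is treated as symmetric; both are assumptions for illustration, not a description of MedScan internals.

```python
from collections import Counter


def collapse_relations(extracted):
    """Map each distinct relation to the number of times it was extracted."""
    counts = Counter()
    for source, target, kind in extracted:
        if kind == "Binding":
            # Binding has no direction, so A-B and B-A are the same relation.
            source, target = sorted((source, target))
        counts[(source, target, kind)] += 1
    return counts


extracted = [
    ("Axin", "beta-catenin", "Binding"),
    ("beta-catenin", "Axin", "Binding"),   # same fact, partners swapped
    ("Axin", "GSK-3beta", "Regulation"),
]
distinct = collapse_relations(extracted)
```

The per-relation count is the kind of number a reference-count filter could then use as a confidence measure.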
MedScan Applications
• Indexing the scientific literature (PubMed, Google, open access): entity-based index; semantic index
• Extracting interactions to create databases for systems biology
• Automatic reader’s digest: document summary
Pathway Building in Pathway Studio
• Manual
• Automatic, using graph navigation tools
• Using text-mining with MedScan
Viewing and editing pathways in Pathway Studio
• Viewing entities in the List Pane: entity and relation tables; show all references; pathway reference summary; export protein list
• Display styles: by type, by effect, by reference count
• UI options: magnifier; fit text to entities; simple and full graph view; fit to window; rotate; move; zoom by rectangle; advanced graph scaling; resizing nodes in the pathway pane
Pathway building by text-mining. Example: non-melanoma skin cancer (>1,000,000 cases and <2,000 deaths in the USA)
MedScan Reader: PubMed search. Keep searching and adding relations; at the end, send the extracted relations to Pathway Studio.
MedScan Reader: import of the top 100 hits from a Google Scholar search. It downloads the articles found and processes them with MedScan.
MedScan Reader: import of the top 30 hits from a Google search. It downloads the web pages found and processes them with MedScan.
Full-text article found on HighWire Press with a “non-melanoma skin cancer” text search
MedScan customization by focused literature source: “Nonmelanoma skin cancer” literature network, the result of targeted text-mining by MedScan Reader
• Every entity in this network was mentioned in the context of non-melanoma skin cancer
• Find hubs
• Compare with patient data
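"Find hubs" can be read as ranking entities by the number of distinct neighbors they have in the literature network. The sketch below uses that degree-based reading, which is an assumption, and the node names P1 to P4 are placeholders rather than proteins from the actual network.

```python
from collections import defaultdict


def find_hubs(edges, top_n=3):
    """Return the top_n (node, degree) pairs, highest degree first."""
    neighbors = defaultdict(set)
    for a, b in edges:
        # Treat every relation as an undirected link for degree counting.
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Sort by descending degree, breaking ties alphabetically.
    ranked = sorted(neighbors, key=lambda n: (-len(neighbors[n]), n))
    return [(n, len(neighbors[n])) for n in ranked[:top_n]]


edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
hubs = find_hubs(edges, top_n=2)
```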
MedScan customization by focused literature source: protein network for non-melanoma skin cancer. Compare this pathway with your experimental patient data.
Automatic pathway building using graph navigation: the Build pathway tool
Mining regulatory relations in database
• Basic principle: regulatory interactions are mediated by the physical interaction network
• Regulomes
• Biological process pathways
• Disease networks
Regulome pathways: Connecting IL10 targets with physical interaction relations
Building pathways by data mining: converting a regulatory network to a protein physical interaction network for cell processes, diseases and regulomes
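One way to picture the conversion is to replace each regulatory link (regulator to target) with the shortest chain of physical interactions that connects the two proteins. The breadth-first search below is a sketch of that idea only; it is not the Build pathway tool's algorithm, and the node names are placeholders.

```python
from collections import deque


def physical_path(physical_edges, regulator, target):
    """Shortest chain of physical interactions from regulator to target.

    Returns the list of proteins on the path, or None if no path exists.
    """
    graph = {}
    for a, b in physical_edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    previous = {regulator: None}   # also serves as the visited set
    queue = deque([regulator])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


physical_edges = [("R", "A"), ("A", "B"), ("B", "T"), ("R", "C")]
path = physical_path(physical_edges, "R", "T")
```

Taking the union of such paths over all regulatory links of a disease, cell process or regulome would give a candidate physical interaction network.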
Disease networks: 2300 diseases and 230 cancers in the ResNet 5.0 database. Converting a regulatory network to a protein physical interaction network for diseases. Example: endothelial cells cancer
Applied information retrieval and multidisciplinary research: new mechanistic hypotheses in Complex Regional Pain Syndrome. J Biomed Discov Collab. 2007; 2: 2. Kristina M Hettne, Marissa de Mos, Anke GJ de Bruijn, Marc Weeber, Scott Boyer, Erik M van Mulligen, Montserrat Cases, Jordi Mestres, and Johan van der Lei. Figure: resulting network of CRPS concepts
High-throughput data analysis in Pathway Studio • Identification of responsive genes • Functional ontology analysis • Network analysis
Supports analysis of all types of experimental data
• Data types: gene expression, metabolomics, proteomics, SNP and CNV analysis, methylation arrays, phosphorylation arrays
• Support for all microarray platforms: Affymetrix, Agilent, Illumina, NimbleGen, SuperArray, custom-design chips
Analysis of gene expression microarray data, STEP 1: Identification of responsive genes
• Expression data import (tab, xls, cel)
• Selection of responsive genes: find differentially expressed genes (significance analysis via t-test); gene clustering via correlation networks
• Or find responsive genes in third-party software for statistical analysis of microarray data and import them as a protein list (Tools->Import protein list)
Calculation of differentially expressed genes in Pathway Studio (significance analysis using paired and unpaired t-tests)
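For reference, the two test statistics can be sketched with the standard library. The slide does not say which unpaired variant Pathway Studio uses; Welch's unequal-variance form is assumed here. Only the t statistic is computed, since converting it to a p-value needs the Student t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev, variance


def unpaired_t(a, b):
    """Welch's t statistic for two independent groups of measurements."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def paired_t(a, b):
    """Paired t statistic; a[i] and b[i] come from the same sample."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Made-up expression values for one gene in two conditions.
control = [1.0, 2.0, 3.0]
treated = [2.0, 4.0, 6.0]
t_unpaired = unpaired_t(control, treated)
t_paired = paired_t(control, treated)
```

A gene would be called responsive when the p-value derived from its statistic falls below a chosen significance threshold.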
Analysis of gene expression microarray data, STEP 2: Pathway analysis of responsive genes
• Network analysis: identification of differentially expressed (DE) protein complexes and physical networks; identification of major regulators and targets in the expression network, via network querying (Build pathway tool) or via network enrichment analysis (in PS Enterprise only)
• Functional analysis: comparison of responsive genes with ontologies and the pathway collection, via Fisher exact test or via Gene Set Enrichment Analysis (GSEA, in PS Enterprise only); Gene Ontology analysis (via Fisher’s test or GSEA); comparative Gene Ontology analysis, via network querying (Build pathway tool)
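The Fisher exact test used for functional analysis can be sketched as a one-sided hypergeometric tail: the probability of seeing at least the observed overlap between the responsive genes and a gene group by chance. The function and parameter names are illustrative, and the one-sided (over-representation) form is an assumption about how the test is applied.

```python
from math import comb


def fisher_enrichment_p(overlap, group_size, selected, universe):
    """One-sided Fisher exact p-value for over-representation.

    overlap:    responsive genes that belong to the group
    group_size: genes in the ontology group or pathway
    selected:   responsive genes in total
    universe:   all genes measured
    """
    max_overlap = min(group_size, selected)
    total = comb(universe, selected)
    # Sum the hypergeometric probabilities of overlaps at least this large.
    tail = sum(
        comb(group_size, k) * comb(universe - group_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 20 genes measured, 5 responsive, group of 5, overlap of 3.
p_value = fisher_enrichment_p(overlap=3, group_size=5, selected=5, universe=20)
```

With these toy numbers the tail is 1126 out of 15504 possible selections, a p-value of about 0.073.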
Functional analysis: comparative GO group analysis, comparing cell responses in GO group space
Building protein network from interesting GO groups and identification of its major expression regulator
High-throughput data analysis in Pathway Studio • Functional ontology analysis • Network analysis
Data model in ResNet database: formalized representation of biological regulatory and interaction networks
• Relation types: Expression, PromoterBinding, DirectRegulation, ProtModification, Binding, MolSynthesis, MolTransport, Regulation, and more
• Uses: interpretation of gene expression data; interpretation of proteomics data; interpretation of metabolomics data, biomarker prediction and validation
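A typed relation record is one way to picture such a data model. The relation type names below are taken from the slide; the class layout, field names and helper function are hypothetical and do not reflect the actual ResNet schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationType(Enum):
    # Names as listed on the slide.
    EXPRESSION = "Expression"
    PROMOTER_BINDING = "PromoterBinding"
    DIRECT_REGULATION = "DirectRegulation"
    PROT_MODIFICATION = "ProtModification"
    BINDING = "Binding"
    MOL_SYNTHESIS = "MolSynthesis"
    MOL_TRANSPORT = "MolTransport"
    REGULATION = "Regulation"


@dataclass(frozen=True)
class Relation:
    # Hypothetical fields chosen for illustration.
    source: str
    target: str
    type: RelationType
    effect: Optional[str] = None   # e.g. "Positive" or "Negative"
    references: int = 0


def relations_of_type(relations, wanted):
    """Select the relations whose type is in the wanted set."""
    return [r for r in relations if r.type in wanted]


network = [
    Relation("Axin", "beta-catenin", RelationType.BINDING),
    Relation("Axin", "GSK-3beta", RelationType.REGULATION, effect="Negative"),
]
binding_only = relations_of_type(network, {RelationType.BINDING})
```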
Network analysis: identification of major regulators and targets among DE genes via Build pathway
Network analysis: identification of major regulators by network enrichment analysis, which finds the regulators with the most differentially expressed targets (figure labels: Better, Worse)
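The ranking idea can be sketched by counting, for each regulator, how many of its known targets are differentially expressed. This is a simplification: the statistic actually used by network enrichment analysis in Pathway Studio is not given on the slide, and the regulator and gene names below are placeholders.

```python
def rank_regulators(regulator_targets, de_genes):
    """Rank regulators by their number of differentially expressed targets.

    Returns (regulator, de_target_count, total_target_count) tuples,
    best-supported regulator first.
    """
    de = set(de_genes)
    rows = []
    for regulator, targets in regulator_targets.items():
        hits = de & set(targets)
        rows.append((regulator, len(hits), len(set(targets))))
    # Most DE targets first; ties broken alphabetically by regulator name.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


regulator_targets = {
    "R1": ["g1", "g2", "g3"],
    "R2": ["g3", "g4"],
    "R3": ["g5"],
}
ranking = rank_regulators(regulator_targets, de_genes=["g1", "g2", "g4"])
```

A fuller version would weigh each count against the regulator's total number of targets, for example with a Fisher exact test, so that regulators with very many targets are not favored automatically.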