Explore biosemantics group's ontology assembly, concept tagging, and homonym disambiguation techniques for efficient concept profiling and text mining. Learn about nucleolus biosemantics and concept mapping for knowledge discovery.
Biosemantics group Martijn Schuemie
Overview • The biosemantics group • Ontology assembly • Concept tagging • Homonym disambiguation • Concept profile creation • Nucleolus
Biosemantics group • ErasmusMC University Medical Center Rotterdam • Department of Medical Informatics • Biosemantics group • Jan Kors • Barend Mons • Erik van Mulligen • Martijn Schuemie • Rob Jelier • Kristina Hettne • Antoinne van Veldhoven
Biosemantics group • Molecular biology • High-throughput experiment data (genomics and proteomics) • Gene and protein databases, MEDLINE, Gene Ontology • Biosemantics • Concept-based text-mining • Interpretation of experiment data • Knowledge discovery
Ontology assembly • Combination of Entrez Gene, Swiss-Prot and HUGO: P=37%, R=76% • Add spelling variations: ABC1 -> ABC-1, DEF3 -> DEF-III • Remove highly ambiguous terms: CO2, membrane-bound, obesity, open reading frame • Result: P=50%, R=75%
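The spelling-variation step can be illustrated with a minimal sketch. It is not the group's actual rule set: the function name `spelling_variants`, the regular expression, and the Roman-numeral table are assumptions chosen only to reproduce the two examples on the slide (ABC1 -> ABC-1, DEF3 -> DEF-III).

```python
import re

# Hypothetical lookup table; only small numbers get a Roman-numeral variant.
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V",
         6: "VI", 7: "VII", 8: "VIII", 9: "IX", 10: "X"}


def spelling_variants(symbol):
    """Return the symbol together with hyphenated and Roman-numeral variants."""
    variants = {symbol}
    # Only symbols of the form <letters><digits> are expanded in this sketch.
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", symbol)
    if match:
        stem, digits = match.group(1), match.group(2)
        variants.add(f"{stem}-{digits}")
        roman = ROMAN.get(int(digits))
        if roman:
            variants.add(f"{stem}-{roman}")
    return variants


print(sorted(spelling_variants("DEF3")))
```

Adding variants raises recall at the risk of new false matches, which is why the slide pairs this step with the removal of highly ambiguous terms.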
Concept tagging • MEDLINE text: Malaria fever is a disease. It is spread by mosquitos. • Sentence splitting: [Malaria fever is a disease.] [It is spread by mosquitos.] • Tokenization: [Malaria] [fever] [is] [a] [disease] • Word normalisation: [malaria] [fever] [be] [a] [disease] • Concept mapping: [malaria fever] C24530, [disease] C12634 • Homonym disambiguation: PSA -> Prostate Specific Antigen or Poultry Science Association? • Result: concept profile of text
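The tagging steps on this slide (sentence splitting, tokenization, word normalisation, concept mapping) can be sketched as a toy pipeline. The two concept identifiers come from the slide; everything else is an assumption for illustration: the one-entry lemma table `LEMMAS`, the regex-based splitter and tokenizer, and the longest-match lookup strategy. A real system would use a full thesaurus and a proper normaliser.

```python
import re

LEMMAS = {"is": "be"}  # toy normalisation table (assumption)
THESAURUS = {          # term tokens -> concept identifier (IDs from the slide)
    ("malaria", "fever"): "C24530",
    ("disease",): "C12634",
}
MAX_TERM_LEN = max(len(term) for term in THESAURUS)


def split_sentences(text):
    """Split after sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Keep alphanumeric runs and drop punctuation."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def normalise(tokens):
    """Lower-case each token and replace it by its lemma when one is known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def map_concepts(tokens):
    """Greedy longest-match lookup of token sequences in the thesaurus."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            concept_id = THESAURUS.get(tuple(tokens[i:i + n]))
            if concept_id:
                found.append((" ".join(tokens[i:i + n]), concept_id))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return found


def tag(text):
    """Run the full pipeline and return the concepts found per sentence."""
    return [map_concepts(normalise(tokenize(s))) for s in split_sentences(text)]


print(tag("Malaria fever is a disease. It is spread by mosquitos."))
```

Homonym disambiguation would follow concept mapping; it is left out of this sketch.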
Homonym disambiguation • Some simple rules: • Is it likely that a term has multiple meanings? • - 3-letter-acronym (e.g. PSA): highly likely • - long forms (e.g. Prostate Specific Antigen): highly unlikely • - terms that refer to several concepts by definition • Is a synonym found? (e.g. “KLK3 (PSA)”) • Is a keyword found? (e.g. “PSA is secreted by the prostate”) • These simple rules change performance from P=50%, R=75% to P=71%, R=71%.
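The simple rules above can be sketched as a filter that decides whether to keep a term-to-concept mapping. This is an illustrative assumption, not the group's implementation: the function names, the three-character threshold, and the plain substring matching are all hypothetical.

```python
def is_likely_ambiguous(term, n_concepts=1):
    """Short all-caps acronyms, or terms linked to several concepts, are suspect."""
    return n_concepts > 1 or (len(term) <= 3 and term.isupper())


def accept_mapping(term, text, synonyms=(), keywords=(), n_concepts=1):
    """Keep the mapping if the term is unambiguous or the context supports it."""
    if not is_likely_ambiguous(term, n_concepts):
        return True
    lowered = text.lower()
    # Supporting evidence: a synonym of the concept or a keyword in the same text.
    has_synonym = any(s.lower() in lowered for s in synonyms)
    has_keyword = any(k.lower() in lowered for k in keywords)
    return has_synonym or has_keyword


print(accept_mapping("PSA", "PSA is secreted by the prostate", keywords=["prostate"]))
```

Rejecting unsupported ambiguous terms trades recall for precision, which matches the direction of the change reported on the slide (P up from 50% to 71%, R down from 75% to 71%).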
Homonym disambiguation • Unknown meaning: concept profile of text containing PSA • Similarity to concept profile of Prostate Specific Antigen? • Similarity to concept profile of Phosphoserine Aminotransferase? • Previous tests showed an overall accuracy of 93%
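Profile-based disambiguation can be sketched as picking the candidate meaning whose concept profile is most similar to the profile of the text. The slide does not name the similarity measure; cosine similarity is assumed here, and the profiles and weights in the example are invented for illustration.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles (concept -> weight)."""
    dot = sum(w * q.get(c, 0.0) for c, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0


def disambiguate(text_profile, candidate_profiles):
    """Return the candidate meaning with the most similar concept profile."""
    return max(candidate_profiles,
               key=lambda name: cosine(text_profile, candidate_profiles[name]))


# Invented example weights (assumption).
candidates = {
    "Prostate Specific Antigen": {"prostate": 0.6, "neoplasm": 0.4},
    "Phosphoserine Aminotransferase": {"serine": 0.7, "biosynthesis": 0.3},
}
text_profile = {"prostate": 0.5, "serum": 0.2}
print(disambiguate(text_profile, candidates))
```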
Concept profile creation • Texts are linked to a concept: • From databases • By concept mapping • Each text -> concept profile of text • Concept profiles of texts -> concept profile of concept
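Combining the profiles of a concept's texts into one concept profile can be sketched as follows. The slide does not say how the text profiles are aggregated; a plain average of the weights is assumed here, and the function name `merge_profiles` and the example weights are hypothetical.

```python
from collections import defaultdict


def merge_profiles(text_profiles):
    """Average the weights of several text profiles (concept -> weight)."""
    totals = defaultdict(float)
    for profile in text_profiles:
        for concept, weight in profile.items():
            totals[concept] += weight
    n = len(text_profiles)
    # A concept missing from a text counts as weight 0 for that text.
    return {c: w / n for c, w in totals.items()} if n else {}


print(merge_profiles([{"BRCA1": 0.4, "female": 0.2}, {"BRCA1": 0.2}]))
```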
Concept profile creation • Weighting schemes compared: • Uncertainty cf. X IDF • Log likelihood • Binary
Concept profile creation • Profile of gene ESR1 (estrogen receptor 1): • breast neoplasm 0.5 • BRCA1 0.34 • PGR 0.30 • Estrogen 0.28 • BRCA2 0.25 • TP53 0.15 • gene suppressor tumor 0.12 • genetics polymorphism 0.12 • genetic predisposition to disease 0.10 • female 0.05
Nucleolus • main function: ribosome biogenesis • over 700 proteins identified and classified into 8 main categories
Nucleolus – Concept profiles • From databases: protein -> MEDLINE articles • Each MEDLINE article -> concept profile of text • Concept profiles of texts -> concept profile of protein
Nucleolus – Concept profiles • BLAST (Basic Local Alignment Search Tool) • Query: nucleolar protein • Results: homologs in • human • mouse • fruitfly • yeast
Nucleolus – fun with protein profiles • 2D visualization of high-dimensional space • Automatic functional annotation of proteins • Finding similar proteins
Nucleolus – visualisation • Multi-Dimensional Scaling plot of protein concept profiles (figure not preserved); labelled points: Exosome comp. 10, P98179, O43390, SRP, PARN, Q8N220
Nucleolus – Assigning GO terms • From GO: GO term -> MEDLINE articles • Each MEDLINE article -> concept profile of text • Concept profiles of texts -> concept profile of GO term
Nucleolus – Assigning GO terms • AuC: Area under Curve (figure not preserved)
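The area under the curve can be computed directly from a ranking. The sketch below assumes the curve is a ROC curve and that proteins are scored by the similarity of their concept profile to a GO term's profile. It uses the standard pairwise formulation: the fraction of (annotated, not annotated) protein pairs in which the annotated protein scores higher, with ties counted as half. The example scores are invented.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented similarity scores; label 1 marks a protein annotated with the GO term.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

An AuC of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every annotated protein above every other protein.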
Nucleolus – Assigning GO terms • ‘Mistakes’ in automatic annotation • Manual assignment to one category only • e.g. SFRS protein kinase 1 plays a role in splicing, but is also a kinase • Assumptions do not always hold • Sequence homology ≠ function homology • Concept co-occurrence ≠ functional relationship • Homonyms
Nucleolus – Finding new proteins • Concept profile of nucleolar protein compared with concept profiles of human proteins
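Finding candidate proteins can be sketched as ranking all human protein profiles by their similarity to a nucleolar reference profile and keeping the top hits. The slide does not specify the similarity measure or the cut-off; cosine similarity and a fixed `k` are assumed, and the protein names and weights are invented placeholders.

```python
import heapq
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles (concept -> weight)."""
    dot = sum(w * q.get(c, 0.0) for c, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0


def most_similar(reference, candidates, k=3):
    """Return the names of the k candidates most similar to the reference profile."""
    scored = ((cosine(reference, profile), name)
              for name, profile in candidates.items())
    return [name for _, name in heapq.nlargest(k, scored)]


# Invented profiles (assumption).
nucleolar = {"ribosome biogenesis": 0.6, "rRNA processing": 0.4}
human_proteins = {
    "protein A": {"ribosome biogenesis": 0.5, "rRNA processing": 0.3},
    "protein B": {"ribosome biogenesis": 0.2, "apoptosis": 0.8},
    "protein C": {"apoptosis": 1.0},
}
print(most_similar(nucleolar, human_proteins, k=2))
```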
Nucleolus – Finding new proteins • Candidate proteins: • 60S ribosomal protein L3-like • Probable ATP-dependent RNA helicase DDX4 • ATP-dependent RNA helicase DDX3Y • Guanine nucleotide binding protein-like 3 • Importin-11 (importin beta family) • Putative Brix domain containing protein 1P • Probable ATP-dependent RNA helicase DDX20 (Gemin 3) • 60S acidic ribosomal protein P0 • Helicase SKI2W • ATP-dependent RNA helicase DDX39 • 40S ribosomal protein S20 • Probable ATP-dependent RNA helicase DDX6 • Probable ATP-dependent RNA helicase DDX23 • Double-stranded RNA-binding protein Staufen homolog 1 • ATP-dependent RNA helicase DDX25 • Probable nucleolar complex protein 14 • Eukaryotic initiation factor 4A-II • ATP-dependent RNA helicase DDX19B • 40S ribosomal protein S3
Annotations shown beside the list (pairing with individual proteins not preserved): Ribosomal protein • DEAD-box • DEAD-box • Found in nucleolus • Associated with nucleolar p. • DEAD-box • DEAD-box • DEAD-box • Found in nucleolus • DEAD-box • Ribosomal protein • DEAD-box • DEAD-box • Indirect evidence • DEAD-box • Nucleolar • DEAD-box • DEAD-box • Ribosomal protein