Biosemantics group Martijn Schuemie
Overview • The biosemantics group • Ontology assembly • Concept tagging • Homonym disambiguation • Concept profile creation • Nucleolus
Biosemantics group • ErasmusMC University Medical Center Rotterdam • Department of Medical Informatics • Biosemantics group • Jan Kors • Barend Mons • Erik van Mulligen • Martijn Schuemie • Rob Jelier • Kristina Hettne • Antoinne van Veldhoven
Biosemantics group Biosemantics • Molecular Biology • High-throughput experiment data (genomics and proteomics) • Gene and protein databases, MEDLINE, Gene Ontology Biosemantics • Concept-based text-mining • Interpretation of experiment data • Knowledge discovery
Ontology assembly • Combination of Entrez Gene, Swiss-Prot and HUGO: P=37%, R=76% • Add spelling variations: ABC1 -> ABC-1, DEF3 -> DEF-III • Remove highly ambiguous terms: CO2, membrane-bound obesity, open reading frame • Result: P=50%, R=75%
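The spelling-variation step can be illustrated with a minimal sketch. The slide gives only the two examples (ABC1 -> ABC-1, DEF3 -> DEF-III); the function name `spelling_variations`, the regular expression and the Roman-numeral table below are assumptions made for illustration, not the group's actual rules.

```python
import re

# Arabic-to-Roman lookup for small trailing numbers (illustrative range only).
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V",
         6: "VI", 7: "VII", 8: "VIII", 9: "IX", 10: "X"}


def spelling_variations(term):
    """Return the term plus hyphenated and Roman-numeral variants.

    Hypothetical helper: only handles symbols of the form <letters><digits>.
    """
    variants = {term}
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", term)
    if match:
        stem, number = match.group(1), int(match.group(2))
        variants.add(f"{stem}-{number}")             # ABC1 -> ABC-1
        if number in ROMAN:
            variants.add(f"{stem}-{ROMAN[number]}")  # DEF3 -> DEF-III
    return sorted(variants)


print(spelling_variations("ABC1"))
print(spelling_variations("DEF3"))
```

Terms that do not match the pattern are returned unchanged, so the function can be applied to every entry of a combined thesaurus.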
Concept tagging • MEDLINE text: Malaria fever is a disease. It is spread by mosquitos. • Sentence splitting: [Malaria fever is a disease.] [It is spread by mosquitos.] • Tokenization: [Malaria] [fever] [is] [a] [disease] • Word normalisation: [malaria] [fever] [be] [a] [disease] • Concept mapping: [malaria fever] C24530, [disease] C12634 • Homonym disambiguation: PSA -> Prostate Specific Antigen or Poultry Science Association? • Result: concept profile of text
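The tagging steps (sentence splitting, tokenization, word normalisation, concept mapping) can be sketched as a small pipeline. This is a toy illustration: the two concept identifiers come from the slide, but the thesaurus, the lemma table, the greedy longest-match strategy and all function names are assumptions; homonym disambiguation is left out.

```python
import re

# Toy thesaurus: normalised term (tuple of tokens) -> concept identifier.
# The two identifiers are the ones shown on the slide.
THESAURUS = {
    ("malaria", "fever"): "C24530",
    ("disease",): "C12634",
}

# Toy normalisation table standing in for a real lemmatiser (assumption).
LEMMAS = {"is": "be", "are": "be", "was": "be"}


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Keep alphanumeric runs, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def normalise(tokens):
    """Lower-case each token and map it to its lemma when one is known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def map_concepts(tokens, max_len=3):
    """Greedy longest-match lookup of token n-grams in the thesaurus."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in THESAURUS:
                found.append((" ".join(key), THESAURUS[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return found


def tag(text):
    """Return the concepts found in each sentence of the text."""
    return [map_concepts(normalise(tokenize(s))) for s in split_sentences(text)]


print(tag("Malaria fever is a disease. It is spread by mosquitos."))
# -> [[('malaria fever', 'C24530'), ('disease', 'C12634')], []]
```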
Homonym disambiguation • Some simple rules: • Is it likely that a term has multiple meanings? (3-letter acronyms, e.g. PSA: highly likely; long forms, e.g. Prostate Specific Antigen: highly unlikely; terms that refer to several concepts by definition) • Is a synonym found? (e.g. “KLK3 (PSA)”) • Is a keyword found? (e.g. “PSA is secreted by the prostate”) • These simple rules change performance from P=50%, R=75% to P=71%, R=71%.
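The rules above could be combined roughly as follows. This is a sketch under stated assumptions: the function names, the substring matching, and the decision to reject an ambiguous term when neither a synonym nor a keyword is present are illustrative choices, not the group's documented implementation.

```python
import re


def is_ambiguous(term, homonyms):
    """Rule 1: is the term likely to have multiple meanings?"""
    if term in homonyms:                      # several concepts by definition
        return True
    if re.fullmatch(r"[A-Z]{3}", term):       # 3-letter acronym such as PSA
        return True
    return False                              # long forms assumed unambiguous


def accept_mapping(term, text, homonyms, synonyms, keywords):
    """Keep a term-to-concept mapping only if the rules support it."""
    if not is_ambiguous(term, homonyms):
        return True
    lowered = text.lower()
    if any(s.lower() in lowered for s in synonyms):   # Rule 2: synonym found
        return True
    if any(k.lower() in lowered for k in keywords):   # Rule 3: keyword found
        return True
    return False


# Hypothetical evidence lists for the concept "Prostate Specific Antigen".
synonyms = ["KLK3", "Prostate Specific Antigen"]
keywords = ["prostate"]

print(accept_mapping("PSA", "KLK3 (PSA) levels were measured.", set(), synonyms, keywords))
print(accept_mapping("PSA", "The PSA annual meeting was held in July.", set(), synonyms, keywords))
```

Rejecting unsupported ambiguous terms trades recall for precision, which matches the direction of the reported change (P up, R down).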
Homonym disambiguation • Unknown meaning: concept profile of text containing PSA • Similarity? Compare with the concept profile of Prostate Specific Antigen and the concept profile of Phosphoserine Aminotransferase • Previous tests showed an overall accuracy of 93%
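Profile-based disambiguation can be sketched as picking the candidate meaning whose concept profile is most similar to the profile of the text. The slide does not say which similarity measure is used; cosine similarity is swapped in here as an assumption, and the profile weights are made-up toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse profiles (concept -> weight)."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(text_profile, candidate_profiles):
    """Return the name of the candidate meaning most similar to the text."""
    return max(candidate_profiles,
               key=lambda name: cosine(text_profile, candidate_profiles[name]))


# Toy profiles with invented weights, for illustration only.
candidates = {
    "Prostate Specific Antigen": {"prostate": 0.9, "neoplasm": 0.4, "KLK3": 0.3},
    "Phosphoserine Aminotransferase": {"serine": 0.8, "biosynthesis": 0.5},
}
text = {"prostate": 0.8, "neoplasm": 0.5}

print(disambiguate(text, candidates))
```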
Concept profile creation • Concept -> texts: from databases • Texts -> concept profile of each text: by concept mapping • Concept profiles of texts -> concept profile of concept
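One way to picture the aggregation step is to average the profiles of all texts linked to a concept. The slide does not specify how text profiles are combined or weighted, so the relative-frequency weights and the plain mean below are assumptions, as are the function names.

```python
from collections import Counter


def text_profile(concept_ids):
    """Profile of one text: relative frequency of each tagged concept."""
    counts = Counter(concept_ids)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def concept_profile(text_profiles):
    """Profile of a concept: mean of the profiles of its associated texts."""
    if not text_profiles:
        return {}
    summed = Counter()
    for profile in text_profiles:
        for concept, weight in profile.items():
            summed[concept] += weight
    return {c: w / len(text_profiles) for c, w in summed.items()}


# Two toy texts linked to the same concept, already tagged with concept IDs.
texts = [["C1", "C2"], ["C1"]]
print(concept_profile([text_profile(t) for t in texts]))
# -> {'C1': 0.75, 'C2': 0.25}
```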
Concept profile creation Uncertainty cf. X IDF Log likelihood Binary
Concept profile creation Profile of gene ESR1: estrogen receptor 1 breast neoplasm 0.5 BRCA1 0.34 PGR 0.30 Estrogen 0.28 BRCA2 0.25 TP53 0.15 gene suppressor tumor 0.12 genetics polymorphism 0.12 genetic predisposition to disease 0.10 female 0.05
Nucleolus • main function: ribosome biogenesis • over 700 proteins identified and classified into 8 main categories
Nucleolus – Concept profiles • Protein -> MEDLINE articles: from databases • MEDLINE articles -> concept profile of each text • Concept profiles of texts -> concept profile of protein
Nucleolus – Concept profiles • BLAST (Basic Local Alignment Search Tool) • Query: nucleolar protein • Results: homologs in human, mouse, fruit fly, yeast
Nucleolus – fun with protein profiles • 2D visualization of high-dimensional space • Automatic functional annotation of proteins • Finding similar proteins
Nucleolus - visualisation • Multi-Dimensional Scaling plot (labelled points: Exosome comp. 10, P98179, O43390, SRP, PARN, Q8N220)
Nucleolus – Assigning GO terms • GO term -> MEDLINE articles: from GO • MEDLINE articles -> concept profile of each text • Concept profiles of texts -> concept profile of GO term
Nucleolus – Assigning GO terms AuC : Area under Curve
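As a reminder of what the measure captures, the area under the curve can be computed directly from ranked scores. The sketch below assumes the curve is a ROC curve and that each protein gets a similarity score for a GO term plus a true or false label from manual annotation; the function name and the toy numbers are illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one (ties count as half).
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores: two proteins annotated with the GO term, two without.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # -> 0.75
```

A value of 1.0 means every annotated protein is ranked above every unannotated one; 0.5 corresponds to a random ranking.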
Nucleolus – Assigning GO terms • ‘Mistakes’ in automatic annotation • Manual assignment to one category only, e.g. SFRS protein kinase 1 plays a role in splicing, but is also a kinase • Assumptions do not always hold: sequence homology ≠ function homology; concept co-occurrence ≠ functional relationship • Homonyms
Nucleolus – Finding new proteins • Concept profile of nucleolar protein compared with the concept profile of each human protein
Nucleolus – Finding new proteins • 60S ribosomal protein L3-like • Probable ATP-dependent RNA helicase DDX4 • ATP-dependent RNA helicase DDX3Y • Guanine nucleotide binding protein-like 3 • Importin-11 (importin beta family) • Putative Brix domain containing protein 1P • Probable ATP-dependent RNA helicase DDX20 (Gemin 3) • 60S acidic ribosomal protein P0 • Helicase SKI2W • ATP-dependent RNA helicase DDX39 • 40S ribosomal protein S20 • Probable ATP-dependent RNA helicase DDX6 • Probable ATP-dependent RNA helicase DDX23 • Double-stranded RNA-binding protein Staufen homolog 1 • ATP-dependent RNA helicase DDX25 • Probable nucleolar complex protein 14 • Eukaryotic initiation factor 4A-II • ATP-dependent RNA helicase DDX19B • 40S ribosomal protein S3
Evidence labels: Ribosomal protein, DEAD-box, DEAD-box, Found in nucleolus, Associated with nucleolar p., DEAD-box, DEAD-box, DEAD-box, Found in nucleolus, DEAD-box, Ribosomal protein, DEAD-box, DEAD-box, Indirect evidence, DEAD-box, Nucleolar, DEAD-box, DEAD-box, Ribosomal protein