National Centre for Text Mining
• Mission
  • To provide TM tools for users, in particular, scientists and researchers
  • To coordinate activities in the TM community
• Core Partners
  • University of Manchester: NLP and DM
  • Salford University: Terminology
  • Liverpool University: IR and Digital Archive
• External Partners
  • San Diego SC, UC Berkeley, University of Geneva, University of Tokyo
Biomedical domain
Strategy and Roadmap for TM in Biomedicine
• Vast number of Google/Yahoo users, satisfied
• Huge Demand for specialized tools for TM in Bio-Medical Domains
• Small number of users, unsatisfied
• The current TM tools, though successful in some business applications, do not meet requirements of users in bio-medical domains.
• More publicity and marketing
• More demand-oriented approach
• What are the requirements for TM for users in bio-medical domains?
• What technologies should be integrated in future TM for science?
• Is the nature of TM in scientific fields different from that of business applications?
Effective management of text and knowledge is the key
(Diagram labels)
• Natural Language Processing
• Ontology-based KMS
• Intelligent Text Management System
• Science: Knowledge
• Raw Data
• Unstructured Information (Text)
• Semi-structured Information (XML+Text)
• Structured Information (Databases)
• Retrieval: Intelligent Information Retrieval and Question Answering
• Integration: Integration of Text with Data and Knowledge
• Discovery: Text Mining and Knowledge Discovery
Intelligent TM systems
From Text to Knowledge
(Diagram labels)
• Knowledge Domain: Ontology; relationships among concepts (Metabolic Pathways, Signal Pathways, Association between Diseases and Genes, ……); motivated independently of language
• Non-Trivial Mappings: Terminology, NLP, Paraphrasing
• Language Domain
Examples of Technical Seeds
• Term Variants
  • Terms (names of proteins, genes, diseases, symptoms, etc.) denote basic conceptual units in the knowledge domain.
• Syntactic Variants
  • Relationships and complex conceptual units are mapped to sentences.
• Term Acquisition from Text
  • New terms (basic conceptual units) are constantly introduced. Resource building for specialized domains is crucial.
(Diagram: variants of one term, grouped by type)
Variant types: Hypernym, Expanded form, Acronym, Synonym, Spelling variation
• NF-kappa B
• NF kappa B
• NFKB factor
• NF-KB
• NF kB
• nuclear factor-kappa B
• nuclear-factor kappa B
• nuclear factor kappa B
• nuclear factor κB
• Nuclear Factor kappa B
• ………..
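The spelling-variation part of the slide above can be illustrated with a small normalization routine. This is only a minimal sketch, not the method used by the authors: the function name `normalize_term` and the `GREEK` table are hypothetical, and the rules (case folding, hyphen and spacing removal, spelling out Greek letters) are assumptions chosen for illustration.

```python
import re
import unicodedata

# Hypothetical Greek-letter table; a real system would need a fuller list.
GREEK = {"κ": "kappa", "α": "alpha", "β": "beta"}

def normalize_term(term: str) -> str:
    """Collapse case, hyphens, spacing and Greek letters into one lookup key."""
    text = unicodedata.normalize("NFKC", term).lower()
    for symbol, name in GREEK.items():
        text = text.replace(symbol, " " + name + " ")
    # Treat hyphens and any other punctuation as spaces, then squeeze spaces.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())

variants = [
    "nuclear factor-kappa B",
    "nuclear-factor kappa B",
    "nuclear factor kappa B",
    "nuclear factor κB",
    "Nuclear Factor kappa B",
]
# All five spelling variants collapse to a single key.
keys = {normalize_term(v) for v in variants}
print(keys)
```

Surface normalization of this kind only covers spelling variation. Acronyms, synonyms and hypernyms (for example "NF kB" versus "nuclear factor kappa B") do not collapse to the same key and would need a term dictionary or a learned variant generator.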
Automatically Generated Term Variants (1)
1.000  NF kappa B                                128
0.500  Transcription Factor NF kappa B             0
0.429  NF-kappa B                                912
0.286  NF kB                                       0
0.286  Immunoglobulin Enhancer-Binding Protein     0
0.286  Immunoglobulin Enhancer Binding Protein     0
0.286  Transcription Factor NF-kB                  0
0.286  Transcription Factor NF kB                  0
0.286  Factor NF-kB, Transcription                 0
0.286  nuclear factor kappa beta                   2
0.286  NF kappaB                                   1
0.273  NF kappa B chain                            0
0.273  NF kappa B subunit                          0
0.214  Transcription Factor NF-kappa B             0
0.214  NF-kB, Transcription Factor                 0
0.214  NF-kB                                      67
0.200  Neurofibromatosis Type kappa B              0
Automatically Generated Term Variants (2)
1.000  tumor necrosis factor A             0
0.316  TNF A                               1
0.200  tumor necrosis factor            1653
0.158  TNF alpha                         358
0.133  TNFA                               32
0.133  TNF                              2631
0.133  Tumour necrosis factor alpha       14
0.133  Tumor Necrosis Factor alpha         2
0.133  Tumor Necrosis Factor-Alpha         0
0.133  TUMOR NECROSIS FACTOR.ALPHA         0
0.133  Tumor necrosis factor alpha        52
0.133  Tumor Necrosis Factor-alpha         8
0.133  TNF-Alpha                           0
0.133  TNF-alpha                        6899
Examples of Technical Seeds
• Term Variants
  • Terms (names of proteins, genes, diseases, symptoms, etc.) denote basic conceptual units in the knowledge domain.
• Syntactic Variants
  • Relationships and complex conceptual units in the knowledge domain are mapped to sentences in the language domain.
• Term Acquisition from Text
  • New terms (basic conceptual units) are constantly introduced. Resource building for specialized domains is crucial.
Syntactic Variants
[A] protein activates [B] (Pathway extraction)
• Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
• Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
• Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Non-trivial Mapping between the Language Domain and the Knowledge Domain (the Knowledge Domain is motivated independently of language)
• Spelling Variants
• Synonyms
• Acronyms
• Same relations with different structures
Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
(Figure: parse tree with nodes s, vp, np, pp, dt and predicate-argument labels arg1, arg2, mod)
The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP
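The point of a predicate-argument structure is that different surface forms of a sentence reduce to the same relation. The sketch below shows that idea on the slide's example sentence. It does not call Enju and does not reproduce its output format: the `PredArg` record and the `from_surface` helper are hypothetical, and the grammatical roles are supplied by hand rather than produced by a parser.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PredArg:
    """A hypothetical, simplified predicate-argument record."""
    predicate: str  # lemma of the verb
    arg1: str       # semantic subject (the agent)
    arg2: str       # semantic object (the thing acted on)

def from_surface(verb_lemma: str, subject: str, other: str, passive: bool) -> PredArg:
    """Map surface grammatical roles to arg1/arg2, undoing passivization.

    `other` is the direct object of an active clause, or the noun phrase
    of the by-phrase in a passive clause.
    """
    if passive:
        # In a passive clause the surface subject is the semantic object,
        # and the by-phrase carries the semantic subject.
        return PredArg(verb_lemma, arg1=other, arg2=subject)
    return PredArg(verb_lemma, arg1=subject, arg2=other)

# "The protein is activated by it"
passive_form = from_surface("activate", subject="the protein", other="it", passive=True)
# "It activates the protein"
active_form = from_surface("activate", subject="it", other="the protein", passive=False)

# Both sentences yield the same relation.
print(passive_form == active_form)
```

Once sentences are reduced to records like these, a pattern such as "[A] activates [B]" can be matched against the predicate and its arguments instead of against every possible word order.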
Text Archive with Feature Objects
Managing texts, data representation and their semantics
(Figure: architecture diagram; recoverable labels follow)
• Layers: Semantics, Data representation, Text
• Data Base Module: Copy and Unification
• DB of Feature Objects: Text ID, Start Position of the region, End Position of the region, Annotator, Content
• Specialization by unification
• Text DB
• Fine grained units of information
• Context dependency
• Persistent nature of knowledge and information
• Ubiquitin E is bound with Text
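The fields named on this slide (text ID, start and end position of a region, annotator, content) suggest stand-off annotation, where feature objects point into a text store and are refined by unification. The following is a minimal sketch under that assumption and is not the system's actual design: the `FeatureObject` class, the `unify` function, the in-memory `text_db` and all sample values are hypothetical, and the unification handles only plain nested dictionaries without shared structure or variables.

```python
from dataclasses import dataclass
from typing import Optional

def unify(a: dict, b: dict) -> Optional[dict]:
    """Unify two nested feature dicts; return None if any value clashes."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            merged = unify(a_val, b_val)
            if merged is None:
                return None
            result[key] = merged
        elif a_val != b_val:
            return None
    return result

@dataclass
class FeatureObject:
    """Hypothetical stand-off annotation over a region of a stored text."""
    text_id: str
    start: int      # start position of the region (inclusive)
    end: int        # end position of the region (exclusive)
    annotator: str
    content: dict

# Hypothetical text store with one document.
text_db = {"doc1": "The protein is activated by it"}

obj = FeatureObject("doc1", 4, 11, "tagger_a", {"type": "entity"})
region = text_db[obj.text_id][obj.start:obj.end]

# Specialization: unifying with a more specific description adds information.
specialized = unify(obj.content, {"type": "entity", "class": "protein"})
# A conflicting description fails to unify.
clash = unify(obj.content, {"type": "relation"})
print(region, specialized, clash)
```

Keeping annotations separate from the text lets several annotators describe the same region, and unification gives a simple check for whether their descriptions are compatible.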
Demo (The website demo is not available now.)