Text Mining: the technology to convert text into knowledge. Stan Matwin, School of Information Technology and Engineering, University of Ottawa, Canada. stan@site.uottawa.ca
Plan • What? • Why? • How? • Who?
What? • Text Mining (TM) = Data Mining from textual data • Finding nuggets in otherwise uninteresting mountains of ore • DM = finding interesting knowledge (relationships, facts) in large amounts of data
What? cont’d • Working with large corpora • …and little knowledge • Discovering new knowledge • …e.g. in Grimm’s fairy tales • …vs. uncovering existing knowledge • …e.g. find MySQL developers with 1 yr of experience in a file of 5000 CVs • Has to treat the data as natural language (NL)
What? cont’d • The uncovering aspect of TM • TM = Information Extraction from text • Text → database mapping • TM and XML
Examples • Extracting information from CVs: skills, systems, technologies, etc. • Personal news-filtering agent • Research in functional genomics on protein interaction
Why? • Moore’s law, and… • Storage law
How? A combination of • Machine learning • Linguistic analysis • Stemming • Tagging • Parsing • Semantic analysis (the first two linguistic steps are sketched below)
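To make the linguistic steps concrete, here is a minimal sketch of stemming and part-of-speech tagging in Python with NLTK; the library is an illustrative choice, not one named in the talk, and the sample sentence is borrowed from the Caderige example later in the deck.

```python
# A minimal sketch of stemming and POS tagging with NLTK (an illustrative
# library choice, not one named in the talk). Resource names can differ
# slightly across NLTK versions.
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS-tagger model

sentence = "The araR gene codes for a negative regulator of the ara operon."

tokens = nltk.word_tokenize(sentence)               # split into word tokens
tagged = nltk.pos_tag(tokens)                       # part-of-speech tagging
stems = [PorterStemmer().stem(t) for t in tokens]   # crude suffix stripping

print(tagged)   # e.g. [('The', 'DT'), ('araR', 'NN'), ('gene', 'NN'), ...]
print(stems)    # e.g. ['the', 'arar', 'gene', 'code', ...]
```

Full parsing and semantic analysis need heavier machinery, as the Caderige case study below illustrates.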
Some TM-related tasks • Text segmentation • Topic identification and tracking • Text summarization • Language identification • Author identification
Two case studies • CADERIGE • Spam detection (with AmikaNow)
Caderige: « Catégorisation Automatique de Documents pour l'Extraction de Réseaux d'Interactions Géniques » (Automatic Categorization of Documents for the Extraction of Gene-Interaction Networks) • Knowledge extraction from natural-language texts
Caderige • Objective: to extract information of interest to geneticists from on-line abstract and/or paper databases (e.g. Medline) • Ensure acceptable recall and precision
The araR gene is monocistronic, and the promoter region contains -10 and -35 regions (as determined by primer extension analysis) similar to those recognized by RNA polymerase containing the major vegetative cell sigma factor sigmaA. An insertion-deletion mutation in the araR gene leads to constitutive expression of the L-arabinose metabolic operon. We demonstrate that the araR gene codes for a negative regulator of the ara operon and that the expression of araR is repressed by its own product. A fragment like this one (in italics) can be selected by means of keywords.
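This keyword-based selection can be pictured as a simple filter. A toy sketch follows; the keyword list is made up, and Caderige actually learns its fragment selectors (step 1 below).

```python
# Step-1 "focusing" pictured as a keyword filter: a toy sketch. The keyword
# list is an assumption; the real system learns its fragment selectors.
KEYWORDS = ("repress", "regulat", "express", "operon")

def select_fragment(sentence: str) -> bool:
    """Keep a sentence if it mentions any interaction-related keyword."""
    s = sentence.lower()
    return any(k in s for k in KEYWORDS)

print(select_fragment("the araR gene codes for a negative regulator "
                      "of the ara operon"))                            # True
print(select_fragment("Cells were grown in standard minimal medium"))  # False
```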
Query: "What are the proteins involved in the regulation of tagA?" Candidate fragment: It has been proposed that Pho-P plays a key role in the activation of tuA and in the repression of tagA and tagD. This question cannot be answered with keywords alone; the semantic knowledge that repression is a type of regulation is required.
Query: "What are the proteins involved in the regulation of purR?" This fragment does not answer it: After determination of the nucleotide sequence and deduction of the purR reading frame, the PurR product was found to be highly similar to the purR-encoded repressor from Bacillus subtilis. In fact, parsing is needed to see that PurR and the purR-encoded repressor are the arguments of the verb "to be similar".
RNA isolated from a sigma B deletion mutant revealed that the transcription of gspA is sigmaB dependent. Conceptual interpretation is needed to see that this fragment answers "What are the proteins involved in the regulation of gspA?": "gspA is sigmaB dependent" must be interpreted as "protein sigmaB regulates gspA" (see the sketch below).
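Such a conceptual-interpretation rule can be pictured as a rewrite pattern. The sketch below is a toy: the pattern and the output form are illustrative, not Caderige's actual conceptual grammar.

```python
# A conceptual-interpretation rule pictured as a rewrite pattern: a toy
# sketch, not Caderige's actual conceptual grammar.
import re

DEPENDENT = re.compile(r"(\w+) is (\w+) dependent")

def interpret(sentence: str):
    """Rewrite '<gene> is <protein> dependent' as a regulation fact."""
    m = DEPENDENT.search(sentence)
    if m:
        gene, protein = m.groups()
        return f"protein {protein} regulates {gene}"
    return None

print(interpret("the transcription of gspA is sigmaB dependent"))
# -> protein sigmaB regulates gspA
```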
CADERIGE Architecture [architecture diagram, roughly: MedLine abstracts → fragment selection by query, via index and acquired fragment selectors → linguistic labeling and normalization of the text, using linguistic resources (extraction grammars, thesaurus) → extraction by matching conceptual grammars and forms → text mining]
3 steps 1. Focusing: learned filters 2. Linguistic analysis: lexical, syntactic and semantic; syntax-to-semantics mapping 3. Extraction
Caderige: example
Current stage • Step 1 done • XML representation for step 3 designed • Tools for step 2 chosen
Email filters • Spam elimination • Automatic filing • Compliance enforcement • …
Email… • The trick: cast it as a text classification problem • Build a training set • Train your favourite classifier • Deploy it (the recipe is sketched below)
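A minimal sketch of this recipe in Python with scikit-learn; the talk does not name a toolkit, and the messages and labels below are made-up placeholders.

```python
# A minimal sketch of the cast-as-classification recipe with scikit-learn.
# The toolkit choice, messages and labels are all assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Build a (tiny, hypothetical) labelled training set.
train_msgs = ["win a free prize now", "meeting moved to 3pm",
              "cheap pills online",   "draft of the paper attached"]
train_labels = ["spam", "ham", "spam", "ham"]

# 2. Train your favourite classifier on bag-of-words features.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_msgs, train_labels)

# 3. Deploy: classify incoming mail.
print(clf.predict(["claim your free prize"]))   # -> ['spam'], one hopes
```

With Naïve Bayes as the final step, clf.predict_proba also yields the "degree of belief" mentioned under tools later in the deck.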
State of the art • Current accuracy: 80%
Difficulties • A multi-class problem where • classes overlap • and are hierarchical • The recall vs. precision trade-off (illustrated below)
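The recall/precision trade-off for spam, computed from confusion-matrix counts; the numbers below are hypothetical, not results from the talk.

```python
# Recall vs. precision from confusion-matrix counts; made-up numbers.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)   # of messages flagged as spam, how many really are

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)   # of all actual spam, how much was caught

tp, fp, fn = 80, 5, 20      # hypothetical counts
print(f"precision={precision(tp, fp):.2f}  recall={recall(tp, fn):.2f}")
# precision=0.94  recall=0.80
```

Tuning a filter to flag more spam raises recall but usually lowers precision, and vice versa; for email, false positives (legitimate mail lost) are the costlier error.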
TM: who – academically? • David Lewis • Yiming Yang – CMU • Ray Mooney – UT Austin • Nick Cercone – Waterloo • Guy Lapalme – U. de Montréal • TAMALE – University of Ottawa
Who – industrially? • Google • ClearForest • AmikaNow
Conclusion • Text mining – a necessity (so “!” instead of “?”) • Still in its infancy • Methods must exploit linguistic knowledge
Classification • Prevalent practice: examples are represented as vectors of attribute values • Theoretical wisdom, confirmed empirically: the more examples, the better the predictive accuracy
ML/DM at U of O • Learning from imbalanced classes: applications in remote sensing • A relational, rather than propositional, representation: learning the maintainability concept • Learning in the presence of background knowledge: Bayesian belief networks and how to get them; applications to distributed databases
Why text classification? • Automatic file saving • Internet filters • Recommenders • Information extraction • …
Text classification: standard approach • Remove stop words and markup • The remaining words are all attributes • A document becomes a vector of <word, frequency> pairs: the “bag of words” representation • Train a boolean classifier for each class • Evaluate the results on an unseen sample (a sketch of the representation step follows)
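A plain-Python sketch of the representation step, assuming an abridged stop-word list; real systems use much longer lists and also strip markup.

```python
# The bag-of-words representation step in plain Python. The stop-word
# list is abridged and assumed for illustration.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def bag_of_words(document: str) -> Counter:
    """Map a document to its <word, frequency> vector."""
    words = [w for w in document.lower().split() if w not in STOP_WORDS]
    return Counter(words)

doc = "the araR gene codes for a negative regulator of the ara operon"
print(bag_of_words(doc))
# Counter({'arar': 1, 'gene': 1, 'codes': 1, 'negative': 1, ...})
```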
Text classification: tools • RIPPER: a rule-based learner; works well with large sets of binary features • Naïve Bayes: efficient (no search); simple to program; gives a “degree of belief”
“Prior art” • Yang: best results using k-NN, 82.3% microaveraged accuracy • Joachims: results using Support Vector Machines + unlabelled data • SVMs are insensitive to high dimensionality and to sparseness of examples
SVM in text classification • Training with 17 examples in each of the 10 most frequent categories gives test performance of 60% on 3000+ test cases • Transductive SVM: the unlabelled test set is available during training, and the classifier maximizes the separation margin on the test set
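For comparison, a sketch of an ordinary (inductive) linear SVM on TF-IDF features with scikit-learn. Joachims' transductive SVM, which additionally sees the unlabelled test documents during training, is implemented in his SVMlight package rather than scikit-learn; the toy corpus and labels here are invented.

```python
# An ordinary (inductive) linear SVM on TF-IDF features. The corpus and
# labels are toy assumptions; transductive SVM would also use the test set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["grain exports rise", "crude oil prices fall",
        "wheat harvest up",   "opec cuts oil output"]
labels = ["grain", "oil", "grain", "oil"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                 # high-dimensional and sparse
clf = LinearSVC().fit(X, labels)            # SVMs cope well with both
print(clf.predict(vec.transform(["oil prices climb"])))   # -> ['oil']
```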
Combining classifiers • Comparable to the best known results (Yang)
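The slides do not say how the classifiers were combined; majority voting is one common scheme, sketched below with scikit-learn. The component learners and toy data are assumptions.

```python
# One common combination scheme (hard majority voting): an assumption,
# since the slide does not specify the method. Learners and data are toys.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

combo = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier([("nb", MultinomialNB()),             # degree-of-belief learner
                      ("svm", LinearSVC()),                # margin-based learner
                      ("knn", KNeighborsClassifier(3))]))  # k-NN, as in Yang

docs = ["grain exports rise", "crude oil prices fall",
        "wheat harvest up",   "opec cuts oil output"]
combo.fit(docs, ["grain", "oil", "grain", "oil"])
print(combo.predict(["oil prices climb"]))   # hard vote of the three learners
```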