NLP for Text Mining Towards systems capable of extracting semantic information from texts Presentation by: Tiago Vaz Maia
Introduction • Keyword-based models of text are very limited (e.g., Google-style keyword search). • There is great advantage to a system that ‘understands’ texts, at some level. • Need for semantic understanding.
CRYSTAL (UMass) • CRYSTAL: Inducing a Conceptual Dictionary S. Soderland, D. Fisher, J. Aseltine, W. Lehnert In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1995.
Information Extraction Systems • Generally domain-specific (e.g. medical records, financial news). • Work by having “a dictionary of linguistic patterns that can be used to identify references to relevant information in a text”. Op. Cit.
A ‘Concept-Node’ Definition
CN-type: Sign or Symptom
Subtype: Absent
Extract from: Direct Object
Active voice verb
Subject constraints: words include “PATIENT”; head class: <Patient or Disabled Group>
Verb constraints: words include “DENIES”
Direct Object constraints: head class: <Sign or Symptom>
Op. Cit.
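To make the structure concrete, here is one way such a concept-node could be represented in code. This is purely an illustrative sketch: the field names mirror the slide, and nothing here reflects CRYSTAL's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptNode:
    # Hypothetical transcription of a concept-node as a Python record;
    # field names follow the slide, not CRYSTAL's real internals.
    cn_type: str          # e.g. "Sign or Symptom"
    subtype: str          # e.g. "Absent"
    extract_from: str     # constituent to extract, e.g. "Direct Object"
    voice: str            # "active" or "passive"
    constraints: dict = field(default_factory=dict)

# The definition from the slide above, transcribed literally:
denies_cn = ConceptNode(
    cn_type="Sign or Symptom",
    subtype="Absent",
    extract_from="Direct Object",
    voice="active",
    constraints={
        "subject": {"words": ["PATIENT"],
                    "head_class": "Patient or Disabled Group"},
        "verb": {"words": ["DENIES"]},
        "direct_object": {"head_class": "Sign or Symptom"},
    },
)
```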
Rationale for CRYSTAL • Building domain-specific dictionaries is time-consuming (knowledge-engineering bottleneck). • CRYSTAL builds such dictionaries automatically, from annotated texts (supervised learning).
Annotation of Texts • E.g. “Unremarkable with the exception of mild shortness of breath and chronically swollen ankles”. • Domain expert marks “shortness of breath” and “swollen ankles” with CN type “sign or symptom” and subtype “present”. (Example from Op. Cit.)
CRYSTAL’s Output • A dictionary of information extraction rules (i.e. concept-nodes) specific for the domain. • These rules should be general enough to apply to other texts in the same domain.
Algorithms • Next time… • Five minutes go by very fast!
Conclusions • Domain-specific information extraction systems capable of semantic understanding are within reach of current technology. • CRYSTAL makes such systems scalable and easily portable by automating the process of dictionary construction.
CRYSTAL’s Dictionary Induction Algorithms Tiago V. Maia
Review: Information Extraction • Information Extraction Systems extract useful information from free text. • This information can, for example, be stored in a database, where it can be data-mined, etc. • E.g., from hospital discharge reports we may want to extract:
Review: Information Extraction • Name of patient. • Diagnosis. • Symptoms. • Prescribed treatments. • Etc.
Review: Dictionary of Concept-Nodes • In the UMass system extraction rules are stored in concept-nodes. • The set of all concept-nodes is called a dictionary.
A ‘Concept-Node’ Definition
CN-type: Sign or Symptom
Subtype: Absent
Extract from: Direct Object
Active voice verb
Subject constraints: words include “PATIENT”; head class: <Patient or Disabled Group>
Verb constraints: words include “DENIES”
Direct Object constraints: head class: <Sign or Symptom>
Op. Cit.
Review: CRYSTAL • CRYSTAL automatically induces a domain-specific dictionary, from an annotated training corpus.
Learning Algorithm • Two steps: • 1. Create one concept-node per positive training instance. • 2. Gradually merge different concept-nodes to achieve a more general and compact representation.
Constructing Initial CN’s • E.g. of step 1: • Sentence: “Unremarkable with the exception of mild shortness of breath and chronically swollen ankles”. • Annotation: “shortness of breath” and “swollen ankles” are marked with CN type “sign or symptom”, subtype “present”.
Initial CN Definition
CN-type: Sign or Symptom
Subtype: Present
Extract from: Prep. Phrase “WITH”
Verb = <NULL>
Subject constraints: words include “UNREMARKABLE”
Prep. Phrase constraints: words include “THE EXCEPTION OF MILD SHORTNESS OF BREATH AND CHRONICALLY SWOLLEN ANKLES”; head class: <Sign or Symptom>
Op. Cit.
Need for Induction • Initial concept-nodes are too specific to be useful for any texts other than the training corpus. • One needs an inductive step, capable of constructing more general definitions (step 2).
Inducing General CN’s • Main idea: Merge sufficiently similar CN’s, until doing more merges starts generating too many errors. • How do we merge similar CN’s? • The goal is to obtain a general CN that ‘covers’ both CN’s and provides a good generalization for unseen cases.
Merging CN’s • The unification of two CN’s is found by relaxing their constraints just enough to cover both nodes. • Word constraints: take the intersection of the word constraints from each CN. E.g., the verb constraints “vehemently denies” and “denies categorically” unify to “denies” (see the sketch below).
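A minimal sketch of this word-constraint relaxation (my illustration, not CRYSTAL's code): the merged constraint keeps only the words the two CNs share.

```python
def merge_word_constraints(words_a, words_b):
    """Relax two word constraints so the result covers both CNs:
    keep only the words common to both (order taken from the first)."""
    common = set(words_a) & set(words_b)
    return [w for w in words_a if w in common]

# The slide's example: the two verb constraints unify to just "denies".
print(merge_word_constraints(["vehemently", "denies"],
                             ["denies", "categorically"]))   # ['denies']
```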
Merging CN’s • Semantic class constraints: found by moving up the semantic hierarchy. E.g., the prep. phrase head classes <Sign or Symptom> and <Lab or Test Result> generalize to <Finding>, if in the semantic hierarchy we have:
Semantic Hierarchy

Finding
    ├─ Sign or Symptom
    └─ Lab or Test Result
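One plausible way to compute this generalization in code is to walk up a parent-pointer hierarchy to the lowest common ancestor. The PARENT table below is a hypothetical fragment matching the slide, not the system's actual ontology.

```python
# Hypothetical fragment of the semantic hierarchy as a parent-pointer table.
PARENT = {"Sign or Symptom": "Finding",
          "Lab or Test Result": "Finding",
          "Finding": None}   # root of this fragment

def ancestors(cls):
    """Chain from a class up to the root, including the class itself."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT.get(cls)
    return chain

def generalize(cls_a, cls_b):
    """Most specific class covering both inputs (lowest common ancestor)."""
    seen = set(ancestors(cls_a))
    return next((c for c in ancestors(cls_b) if c in seen), None)

print(generalize("Sign or Symptom", "Lab or Test Result"))   # Finding
```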
Evaluating Merges • Every merged CN is tested against the training corpus. If its error rate is above a certain threshold, it is discarded. • The system continues merging CN’s until no more can be merged without resulting in a CN whose error rate exceeds the pre-specified tolerance.
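Putting the pieces together, the induction loop might look like the following simplified sketch, in which a CN is reduced to a frozenset of constraint words, merging is set intersection, and error_rate is supplied by the caller. The paper's actual control flow and similarity metric are more involved.

```python
from itertools import combinations

def induce_dictionary(initial_cns, error_rate, tolerance=0.2):
    """Greedy sketch of CRYSTAL-style induction (a simplification, not the
    paper's algorithm). A merge is kept only if the unified CN's error
    rate on the training corpus stays within the tolerance."""
    cns = set(initial_cns)
    merged = True
    while merged:
        merged = False
        # Try the most similar pair first (largest constraint overlap).
        for a, b in sorted(combinations(cns, 2),
                           key=lambda p: -len(p[0] & p[1])):
            unified = a & b                       # relaxed, more general CN
            if unified and error_rate(unified) <= tolerance:
                cns -= {a, b}
                cns.add(unified)
                merged = True
                break                             # re-rank pairs after a merge
    return cns

# Toy usage: pretend every generalization is error-free.
cns = [frozenset({"vehemently", "denies"}),
       frozenset({"denies", "categorically"})]
print(induce_dictionary(cns, error_rate=lambda cn: 0.0))
# -> {frozenset({'denies'})}
```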
Results • MUC-3: dictionary built using 1500 hours of work by two advanced graduate students and one post-doc. • MUC-4: using AutoSlog, a precursor of CRYSTAL, the dictionary was built in 8 hours of work by a first-year graduate student! • Both dictionaries provided roughly the same functionality.
Conclusions • Automated induction of domain-specific information extraction dictionaries is a very good alternative to hand-coding. • The knowledge engineering effort is drastically reduced, allowing for widespread real-world applications.
Combining Information Extraction and Data Mining Tiago V. Maia
“Using Information Extraction to Aid the Discovery of Prediction Rules from Text” U.Y. Nahm and R.J. Mooney In: KDD-2000 Workshop on Text Mining
An Approach to Text Mining
Text → IE → DB → KDD → Rules
The Application • Step 1. Starting from free-text job postings in a newsgroup, build a database of jobs. • Step 2. Mine that database to find interesting rules.
Sample Job Posting • Sample job posting: • “Leading Internet Provider using cutting edge web technology in Austin is accepting applications for a Senior Software Developer. The candidate must have 5 years of software development, which includes coding in C/C++ and experience with databases (Oracle, Sybase, Informix, etc.)…”
Sample Job Record • Title: Senior Software Developer • Salary: $70-85K • City: Austin • Language: Perl, C, Javascript, Java, C++ • Platform: Windows • Application: Oracle, Informix, Sybase • Area: RDBMS, Internet, Intranet, E-commerce • Required years of experience: 5 • Required degree: BS
Sample Extracted Rule • “If a computer-related job requires knowledge of Java and graphics then it also requires knowledge of Photoshop”
Information Extraction • Uses RAPIER, a system similar to CRYSTAL that also constructs the extraction rules automatically from an annotated training corpus.
Rule Induction • The induced rules predict the value in a database field, given the values in the rest of the record. • Each slot-value pair is treated as a distinct binary feature, e.g. “Oracle ∈ Application”. • An example of an induced rule: HTML ∈ Language ∧ Windows NT ∈ Platform ∧ Active Server Pages ∈ Application ⇒ Database ∈ Area
Algorithms for Rule Induction • Uses C4.5. • Decision trees are learned using the binary representation for slot-value pairs, and pruned to yield the rules.
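As a rough stand-in for the paper's setup (which uses C4.5), the sketch below encodes each slot-value pair as a binary feature and fits scikit-learn's CART-based DecisionTreeClassifier, not C4.5 itself, to predict one slot from the others. The records are invented for illustration.

```python
# Sketch of slot-value rule induction with a decision tree. The paper
# uses C4.5; scikit-learn's DecisionTreeClassifier (CART) stands in here.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy job records (invented): each slot-value pair becomes a binary feature.
records = [
    {"Language": {"HTML"}, "Platform": {"Windows NT"}, "Area": {"Database"}},
    {"Language": {"HTML"}, "Platform": {"Windows NT"}, "Area": {"Database"}},
    {"Language": {"C++"},  "Platform": {"UNIX"},       "Area": {"Internet"}},
]

features = sorted({(slot, v) for r in records for slot, vs in r.items()
                   if slot != "Area" for v in vs})
X = [[int(v in r.get(slot, set())) for slot, v in features] for r in records]
y = [int("Database" in r["Area"]) for r in records]  # predict: Database ∈ Area?

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=[f"{s}={v}" for s, v in features]))
```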
Conclusions • Text mining can be achieved by information extraction followed by the application of standard KDD techniques to the resulting structured database. • Both IE and KDD are well understood, and their combination should yield practical real-world systems.
Learning Probabilistic Relational Models Getoor, L., Friedman, N., Koller, D., and Pfeffer, A. Invited contribution to the book Relational Data Mining, Dzeroski, S. and Lavrac, N. (Eds.), Springer-Verlag, 2001.
Applicability
Text → IE → DB → KDD → JPD
Probabilistic Relational Models • Probabilistic relational models are a marriage of: 1. Probabilistic graphical models, and 2. Relational models.
Why a Probabilistic Model? • Probabilistic graphical models (e.g. Bayesian networks) have proven very successful for representing statistical patterns in data. • Algorithms have been developed for learning such models from data. • Because what is learnt is a joint probability distribution, these models are not restricted to answering questions about specific attributes.
Why a Relational Model? • Typically, Bayesian networks (or graphical models in general) use a representation that consists of several variables (e.g. height, weight, etc.) but has no structure. • The most common way of structuring data is in a relational form (e.g. relational databases). • Data structured in this way consists of individuals, their attributes, and relations between individuals (e.g. database of students, classes, etc.).
Probabilistic Relational Models • Probabilistic relational models are a natural way to represent and learn statistical regularities in structured information. • Moreover, because they are close to relational databases, they are ideal for data mining.
Introduction to Bayes Nets • The problem with representing a joint probability distribution directly is that its size grows exponentially with the number of variables. • E.g., assume that there are four random variables: • Student Intelligence, Course Difficulty, Student Understands Material, Student Grade. • Assume Student Intelligence, Course Difficulty, and Student Understands Material each have three possible values: {low, medium, high}.
Exponential Complexity of JPDs • Further assume that the Student Grade has six possible values: {A, B, C, D, E, F}. • Then, to have a joint probability distribution, we need to specify (or learn) 3x3x3x6 = 162 values. • Imagine if we had hundreds of variables...
Independence Assumptions in Bayes Nets • Bayes nets help because they exploit the fact that each variable typically depends directly only on a small number of other variables. • In our example, we have:
Example of a Bayes Net

Intelligence        Difficulty
        \            /
    Understands Material
             |
           Grade
Conditional Independence • That is, for example, the difficulty of the course is independent of the intelligence of the student. • Also, importantly, the grade is independent of the intelligence of the student and the difficulty of the course, given the student’s understanding of the material: this is conditional independence.
Conditional Independence • Formally, we have: • Every node X is independent of every other node that does not descend from X, given the values of X’s parents.
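To make the payoff concrete, the short sketch below counts free parameters for the slides' example. A distribution over k values has k − 1 free parameters, so the 162-entry joint table has 161 free values, while the Bayes net factorization needs only 37.

```python
from math import prod

# Free-parameter count: full joint table vs. Bayes net factorization.
card = {"Intelligence": 3, "Difficulty": 3, "Understands": 3, "Grade": 6}
parents = {"Intelligence": [], "Difficulty": [],
           "Understands": ["Intelligence", "Difficulty"],
           "Grade": ["Understands"]}

full_jpd = prod(card.values()) - 1              # 161 free parameters
bayes_net = sum(prod(card[p] for p in parents[v]) * (card[v] - 1)
                for v in card)                  # 2 + 2 + 18 + 15 = 37
print(full_jpd, bayes_net)                      # 161 37
```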