Empirical Learning Methods in Natural Language Processing. Ido Dagan, Bar Ilan University, Israel
Introduction • Motivations for learning in NLP • NLP requires huge amounts of diverse types of knowledge – learning makes knowledge acquisition more feasible, automatically or semi-automatically • Much of language behavior is preferential in nature, so we need to acquire both quantitative and qualitative knowledge
Introduction (cont.) • Empirical modeling has so far obtained mainly a “first-degree” approximation of linguistic behavior • Often, more complex models improve results only modestly • Often, several simple models obtain comparable results • Ongoing goal – deeper modeling of language behavior within empirical models
Linguistic Background (?) • Morphology • Syntax – tagging, parsing • Semantics • Interpretation – usually out of scope • “Shallow” semantics: ambiguity, semantic classes and similarity, semantic variability
Information Units of Interest - Examples • Explicit units: • Documents • Lexical units: words, terms (surface/base form) • Implicit (hidden) units: • Word senses, name types • Document categories • Lexical syntactic units: part of speech tags • Syntactic relationships between words – parsing • Semantic relationships
Data and Representations • Frequencies of units • Co-occurrence frequencies • Between all relevant types of units (term-doc, term-term, term-category, sense-term, etc.) • Different representations and modeling • Sequences • Feature sets/vectors (sparse)
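As a concrete illustration of co-occurrence frequencies stored as sparse feature vectors, here is a minimal Python sketch; the toy corpus, window size, and variable names are illustrative assumptions, not material from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; in practice these would be tokenized documents.
corpus = [
    "the judge read the sentence in court".split(),
    "a sentence is built from words and phrases".split(),
]

window = 2  # co-occurrence window size (an arbitrary choice for this sketch)
cooc = defaultdict(Counter)  # sparse term-term co-occurrence counts

for tokens in corpus:
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[w][tokens[j]] += 1

print(cooc["sentence"])  # sparse co-occurrence "feature vector" for "sentence"
```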
Tasks and Applications • Supervised/classification: identify the hidden units (concepts) associated with explicit units • Syntactic analysis, word sense disambiguation, name classification, relations, categorization, … • Unsupervised: identify relationships and properties of explicit units (terms, docs) • Association, topicality, similarity, clustering • Combinations
Using Unsupervised Methods within Supervised Tasks • Extraction and scoring of features • Clustering explicit units to discover hidden concepts and to reduce labeling effort • Generalization of learned weights or triggering-rules from known features to similar ones (similarity or class based) • Similarity/distance to training as the basis for classification method (nearest neighbor)
Characteristics of Learning in NLP • Very high dimensionality • Sparseness of data and relevant features • Addressing the basic problems of language: • Ambiguity – of concepts and features • One way to say many things • Variability • Many ways to say the same thing
Supervised Classification • Hidden concept is defined by a set of labeled training examples (category, sense) • Classification is based on entailment of the hidden concept by related elements/features • Example: two senses of “sentence”: • word, paragraph, description → Sense 1 • judge, court, lawyer → Sense 2 • Single or multiple concepts per example • Word sense vs. document categories
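A minimal sketch of this idea, assuming the classifier simply counts how many indicative context words of each sense appear near the target; the feature lists mirror the slide's example, but the context sentence and scoring rule are illustrative only.

```python
# Score each sense of "sentence" by how many indicative context words
# (features that "entail" the sense) appear in the example's context.
sense_features = {
    "Sense1 (linguistic)": {"word", "paragraph", "description"},
    "Sense2 (judicial)": {"judge", "court", "lawyer"},
}

context = "the judge read the sentence aloud in court".split()

scores = {
    sense: sum(1 for w in context if w in features)
    for sense, features in sense_features.items()
}
print(max(scores, key=scores.get))  # expected: Sense2 (judicial)
```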
Supervised Tasks and Features • Typical Classification Tasks: • Lexical: Word sense disambiguation, target word selection in translation, name-type classification, accent restoration, text categorization (notice task similarity) • Syntactic: POS tagging, PP-attachment, parsing • Complex: anaphora resolution, information extraction • Features (“feature engineering”): • Adjacent context: words, POS • In various relationships – distance, syntactic • possibly generalized to classes • Other: morphological, orthographic, syntactic
Learning to Classify • Two possibilities for acquiring the “entailment” relationships: • Manually: by an expert • time consuming, difficult – “expert system” approach • Automatically: concept is defined by a set of training examples • training quantity/quality • Training: learn entailment of concept by features of training examples (a model) • Classification: apply model to new examples
Supervised Learning Scheme • Training: “Labeled” Examples → Training Algorithm → Classification Model • Classification: New Examples + Classification Model → Classification Algorithm → Classifications
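A minimal end-to-end sketch of this scheme using scikit-learn (an assumption; the slides do not prescribe any particular toolkit): labeled examples train a model, which is then applied to new examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# "Labeled" examples (toy data, illustrative only)
train_texts = ["judge court lawyer", "word paragraph description"]
train_labels = ["judicial", "linguistic"]

# Training algorithm produces a classification model
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Classification algorithm applies the model to new examples
new_texts = ["the court issued a sentence"]
print(model.predict(new_texts))  # expected: ['judicial']
```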
Avoiding/Reducing Manual Labeling • Basic supervised setting – examples are annotated manually with labels (sense, text category, part of speech) • Settings in which labeled data can be obtained without manual annotation: • Anaphora, target word selection: “The system displays the file on the monitor and prints it.” • Bootstrapping approaches: sometimes referred to as unsupervised learning, though they actually address a supervised task of identifying an externally imposed class (“unsupervised” training)
Learning Approaches • Model-based: entailment relations and their strengths are defined by a training algorithm • Statistical/Probabilistic: the model is composed of probabilities (scores) computed from training statistics • Iterative feedback/search (neural network): start from some model, classify the training examples, and correct the model according to its errors • Memory-based: no training algorithm or explicit model – classify by matching new examples to the raw training data (compare with unsupervised tasks)
Evaluation • Evaluation is mostly based on (subjective) human judgment of relevancy/correctness • In some cases the task is objective (e.g. OCR), or a mathematical criterion applies (e.g. likelihood) • Basic measure for classification – accuracy • In many tasks (extraction, multiple classes per instance, …) most instances are “negative”; therefore recall/precision measures are used, following information retrieval (IR) tradition • Cross validation – different training/test splits
Evaluation: Recall/Precision • Recall: #correct extracted/total correct • Precision: #correct extracted/total extracted • Recall/precision curve - by varying the number of extracted items, assuming the items are sorted by decreasing score
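A short sketch of computing recall, precision, and the points of a recall/precision curve by sweeping the number of extracted items in decreasing-score order; the extraction list and the gold-standard count are toy values, not data from the slides.

```python
# Each tuple: (item, score, is_correct); in practice these come from a system's output.
extracted = [("a", 0.9, True), ("b", 0.8, False), ("c", 0.7, True), ("d", 0.4, False)]
total_correct = 3  # total number of correct items in the gold standard

extracted.sort(key=lambda x: x[1], reverse=True)  # decreasing score
correct_so_far = 0
for k, (item, score, is_correct) in enumerate(extracted, start=1):
    correct_so_far += int(is_correct)
    recall = correct_so_far / total_correct      # correct extracted / total correct
    precision = correct_so_far / k               # correct extracted / total extracted
    print(f"top-{k}: recall={recall:.2f} precision={precision:.2f}")
```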
Micro/Macro averaging • Often results are evaluated over multiple tasks • Many categories, many ambiguous words • Macro-averaging: compute results separately for each category and average them • Micro-averaging (common): treat all classification instances, from all categories, as one pool and compute results over it • Gives more weight to common categories
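A small sketch contrasting macro- and micro-averaging over per-category counts; the category names and counts are made up to show how micro-averaging is pulled toward the common category.

```python
# Per-category (correct, total) counts; illustrative values only.
per_category = {"common": (90, 100), "rare": (1, 10)}

# Macro: average the per-category accuracies
macro = sum(c / t for c, t in per_category.values()) / len(per_category)

# Micro: pool all instances together, then compute one overall accuracy
micro = sum(c for c, _ in per_category.values()) / sum(t for _, t in per_category.values())

print(f"macro={macro:.2f}  micro={micro:.2f}")  # micro reflects the common category far more
```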
Course Organization • Material organized mostly by types of learning approaches, while demonstrating applications as we go along • Emphasis on demonstrating how computational linguistics tasks can be modeled (with simplifications) as statistical/learning problems • Some sections covering the lecturer’s personal work perspective
Course Outline • Sequential modeling • POS tagging • Parsing • Supervised (instance-based) classification • Simple statistical models • Naïve Bayes classification • Perceptron/Winnow (one layer NN) • Improving supervised classification • Unsupervised learning - clustering
Course Outline (1) • Supervised classification • Basic/earlier models: PP-attachment, decision list, target word selection • Confidence interval • Naive Bayes classification • Simple smoothing -- add-constant • Winnow • Boosting
Course Outline (2) • Part-of-speech tagging • Hidden Markov Models and the Viterbi algorithm • Smoothing -- Good-Turing, back-off • Unsupervised parameter estimation with Expectation Maximization (EM) algorithm • Transformation-based learning • Shallow parsing • Transformation based • Memory based • Statistical parsing and PCFG (2 hours) • Full parsing - Probabilistic Context Free Grammar (PCFG)
Course Outline (3) • Reducing training data • Selective sampling for training • Bootstrapping • Unsupervised learning • Word association • Information theory measures • Distributional word similarity, similarity-based smoothing • Clustering
Misc. • Major literature sources: • Foundations of Statistical Natural Language Processing, by Manning & Schütze, MIT Press • Articles • Additional slide credits: • Prof. Shlomo Argamon, Chicago • Some slides from the book website