ML: Classical methods from AI • Decision-Tree induction • Exemplar-based Learning • Rule Induction • TBEDL (Transformation-Based Error-Driven Learning)
EBL Exemplar-Based Learning ACL’99 Tutorial on: Symbolic Machine Learning methods for NLP (Mooney & Cardie 99) • Preliminaries • The k-Nearest Neighbour algorithm • Distance Metrics • Feature Relevance and Weighting • Example Storage
EBL Preliminaries • Unlike most learning algorithms, exemplar-based approaches (also called case-, instance- or memory-based learning) do not construct an abstract hypothesis but instead base the classification of test instances on their similarity to specific training cases (e.g. Aha et al. 1991) • Training is typically very simple: just store the training instances • Generalization is postponed until a new instance must be classified. Therefore, exemplar-based methods are sometimes called "lazy" learners (Aha, 1997)
EBL Preliminaries • Given a new instance, its relation to the stored examples is examined in order to assign a target function value for the new instance • In the worst case, testing requires comparing a test instance to every training instance • Consequently, exemplar-based methods may require more expensive indexing of cases during training to allow for efficient retrieval: e.g. k-d trees
EBL Preliminaries • Other solutions (simplifying assumptions): • Store only relevant examples: Instance editing, IB2, IB3, etc. (Aha et al., 1991) • IGTREEs (Daelemans et al. 1995) • etc. • Rather than estimating the target function for the entire instance space, case-based methods estimate the target function locally and differently for each new instance to be classified.
EBL k-Nearest Neighbours (Cover & Hart, 1967; Duda & Hart, 1973) • Calculate the distance between the test instance and every training instance • Pick the k closest training examples and assign to the test example the most common category among these "nearest neighbours" (illustration: the 1-nearest neighbour of the test point "?" says +, while its 5 nearest neighbours say -) • Voting over multiple neighbours helps increase resistance to noise. For binary classification tasks, odd values of k are normally used to avoid ties.
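A minimal sketch of the procedure above in Python, assuming training data given as parallel lists and a generic dist(x, y) function (a hypothetical helper, e.g. one of the metrics discussed next):

    from collections import Counter

    def knn_classify(test_x, train_X, train_y, k, dist):
        # Compute the distance from the test instance to every training instance
        order = sorted(range(len(train_X)), key=lambda i: dist(test_x, train_X[i]))
        # Keep the k closest training examples and vote for the most common class
        top_k = [train_y[i] for i in order[:k]]
        return Counter(top_k).most_common(1)[0][0]

With k = 1 this reproduces the 1-nearest-neighbour rule; an odd k avoids ties in binary tasks.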
EBL Implicit Classification Function • Although it is not necessary to calculate it explicitly, the learned classification rule is based on the regions of feature space closest to each training example • For 1-nearest neighbour, the Voronoi diagram gives the convex polyhedra that segment the space into the regions of points closest to each training example
EBL Distance Metrics • Exemplar-based methods assume a function for calculating the (dis)similarity of two instances • Simplest approach: • For continuous feature vectors, just use Euclidean distance • To compensate for differences in units, scale all continuous features so that their values lie between 0 and 1
EBL Distance Metrics • Simplest approach (cont.): • For symbolic feature vectors, use Hamming distance (the overlap measure): the number of features whose values differ (a sketch of both simple metrics follows)
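A sketch of both simple metrics, assuming the min-max scaling statistics (mins, maxs) are computed on the training set:

    def scaled_euclidean(x, y, mins, maxs):
        # Euclidean distance after scaling every continuous feature to [0, 1]
        total = 0.0
        for xi, yi, lo, hi in zip(x, y, mins, maxs):
            rng = (hi - lo) or 1.0          # guard against constant features
            total += ((xi - lo) / rng - (yi - lo) / rng) ** 2
        return total ** 0.5

    def overlap(x, y):
        # Hamming / overlap distance: count of symbolic features that differ
        return sum(1 for xi, yi in zip(x, y) if xi != yi)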
EBL Distance Metrics • More sophisticated metrics • Distance metrics can account for relationships that exist among feature values. E.g., when comparing a part-of-speech feature, a proper noun and a common noun are probably more similar than a proper noun and a verb (Cardie & Mooney 99) • Graded similarity between discrete features: e.g., MVDM, the Modified Value Difference Metric (Cost & Salzberg, 1993) • Weighting attributes according to their importance (Cardie, 1993; Daelemans et al., 1996) • Many empirical comparative studies between distance metrics for exemplar-based classification: (Aha & Goldstone, 1992; Friedman, 1994; Hastie & Tibshirani, 1995; Wettschereck et al., 1997; Blanzieri & Ricci, 1999)
EBL Distance Metrics: MVDM • Modified Value Difference Metric (Cost & Salzberg, 1993) • Graded similarity between discrete features • Better results in some NLP applications (e.g. WSD) (Escudero et al., 00b)
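The metric treats two values of a symbolic feature as close when they co-occur with similar class distributions, roughly delta(v1, v2) = sum over classes c of |P(c|v1) - P(c|v2)|. A rough sketch of how such a table could be estimated from training counts (function and variable names here are illustrative, not from the cited paper):

    from collections import defaultdict

    def mvdm_table(feature_values, classes):
        # Estimate P(class | value) for one symbolic feature from the training data
        counts = defaultdict(lambda: defaultdict(int))
        for v, c in zip(feature_values, classes):
            counts[v][c] += 1
        return {v: {c: n / sum(cc.values()) for c, n in cc.items()}
                for v, cc in counts.items()}

    def mvdm(v1, v2, table, class_labels):
        # Two feature values are close if their class-conditional distributions are similar
        p1, p2 = table.get(v1, {}), table.get(v2, {})
        return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in class_labels)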
EBL Distance Metrics • More information on similarity metrics • Foundations of Statistical NLP (Manning & Schütze, 1999) • Survey on similarity metrics (HERMES project) (Rodríguez, 2001)
EBL k-NN: Variations • Linear search is not a very efficient classification procedure • The k-d tree data structure can be used to index training examples and find nearest neighbours in logarithmic time on average (Bentley 75; Friedman et al. 77) • Nodes branch on threshold tests on individual features and leaves terminate at nearest neighbours • There is no loss of information (nor of generalization) with respect to the straightforward representation
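For continuous feature vectors this kind of indexing is readily available, e.g. via scipy's k-d tree; a toy sketch with random data:

    import numpy as np
    from scipy.spatial import cKDTree

    train_X = np.random.rand(10000, 8)            # toy continuous feature vectors
    train_y = np.random.randint(0, 3, 10000)      # toy class labels

    tree = cKDTree(train_X)                       # built once, at "training" time
    dists, idx = tree.query(np.random.rand(8), k=5)   # 5 nearest neighbours of a test point
    prediction = np.bincount(train_y[idx]).argmax()   # majority vote among them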
EBL k-NN: Variations (2) • k-NN can be used to approximate the value of a continuous function (regression) by taking the average function value of the k nearest neighbours (Bishop 95; Atkeson et al. 97) • All training examples can be used to contribute to the classification by giving every example a vote that is weighted by the inverse square of its distance from the test example (Shepard 68). See the implementation details and parameters of TiMBL 4.1 (Daelemans et al. 01)
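A sketch of this distance-weighted voting scheme (the eps constant is only there to avoid division by zero on exact matches; dist is again a generic helper):

    from collections import defaultdict

    def weighted_vote(test_x, train_X, train_y, dist, eps=1e-9):
        # Every training example votes with weight 1 / d^2 (Shepard 68)
        votes = defaultdict(float)
        for x, y in zip(train_X, train_y):
            votes[y] += 1.0 / (dist(test_x, x) ** 2 + eps)
        return max(votes, key=votes.get)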
EBL Feature Relevance • The standard distance metrics weight each feature equally, which can cause problems if only a few of the features are relevant to the classification task, since the method can be misled by similarity along many irrelevant dimensions • The problem of finding the "best" feature set (or a weighting of the features) is a general problem for ML induction algorithms • Parenthesis: • Wrapper vs. filter feature selection methods • Global vs. local feature selection methods
EBL Feature Relevance (2) • Wrapper methods for feature selection generate a set of candidate features, run the induction algorithm with these features, use the accuracy of the resulting concept description to evaluate the feature set, and iterate (John et al. 94) • Forward selection / Backward elimination • Filter methods for feature selection consider attributes independently of the induction algorithm that will use them. They use an inductive bias that can be entirely different from the bias employed in the induction learning algorithm • Information gain / RELIEFF / Decision Trees / etc.
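A greedy forward-selection wrapper could look roughly like this, where evaluate(feature_subset) is a placeholder callback that trains the induction algorithm on the candidate subset and returns its cross-validated accuracy (a majority-class baseline for the empty set):

    def forward_selection(all_features, evaluate):
        # Greedily add the single feature that most improves accuracy; stop when none helps
        selected, best_acc = [], evaluate([])
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                break
            scored = [(evaluate(selected + [f]), f) for f in candidates]
            acc, best_f = max(scored, key=lambda t: t[0])
            if acc <= best_acc:
                break
            selected.append(best_f)
            best_acc = acc
        return selected

Backward elimination works the same way in reverse, starting from the full feature set and dropping features.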
EBL Feature Relevance (3) • Global feature weighting methods compute a single weight vector for the classification task (e.g. Information Gain in the IGTREE algorithm) • Local feature weighting methods allow feature weights to vary for each training instance, for each test instance, or both (Wettschereck et al. 97) (e.g. Value-difference metric)
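A global, filter-style weight per feature can be computed as its information gain over the training set; a small sketch:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(feature_values, labels):
        # IG(f) = H(C) - sum over values v of P(f = v) * H(C | f = v)
        n = len(labels)
        h_cond = 0.0
        for v in set(feature_values):
            subset = [c for fv, c in zip(feature_values, labels) if fv == v]
            h_cond += (len(subset) / n) * entropy(subset)
        return entropy(labels) - h_cond

Such weights can then multiply the per-feature terms of the overlap metric, as is done in IB1-IG.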
EBL Linguistic Biases and Feature Weighting • Assign weights based on linguistic or cognitive preferences (Cardie 96). Use of a priori domain knowledge for the Information Extraction task • recency bias: assign higher weights to features that represent temporally recent information • focus of attention bias: assign higher weights to features that correspond to words or constituents in focus (e.g. the subject of a sentence) • restricted memory bias: keep the N features with the highest weights
EBL Linguistic Biases and Feature Weighting (2) • Use cross validation to determine which biases apply to the task and which actual weights to assign • Hybrid filter-wrapper approach • Wrapper approach for bias selection • Selected biases direct feature weighting using filter approach
EBL Example Storage • Some algorithms only store a subset of the most informative training examples in order to focus the system and to make it more efficient: instance editing • Edit superfluous regular instances, e.g. the "exemplar growing" IB2 algorithm (Aha et al. 91):
    initialize the set of stored examples to the empty set
    for each example in the training set do
        if the example is correctly classified by its nearest neighbour
           among the currently stored examples
        then do not store the example
        else add it to the set of stored examples
    end for
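A Python rendering of the IB2 editing loop above, assuming training data as (features, class) pairs and a generic dist(x, y) helper:

    def ib2(training_set, dist):
        stored = []
        for x, y in training_set:
            if stored:
                # nearest neighbour among the examples stored so far
                _, nn_y = min(stored, key=lambda s: dist(x, s[0]))
                if nn_y == y:
                    continue          # correctly classified: do not store it
            stored.append((x, y))     # misclassified, or store still empty: keep it
        return stored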
EBL Example Storage (2) • In some experimental tests this approach (IB2) works as well as, if not better than, storing all examples • Edit unproductive exceptions, e.g. the IB3 algorithm (Aha et al. 93): delete all instances that are bad class predictors for their neighbourhood in the training set • There is evidence that keeping all training instances is best in exemplar-based approaches to NLP (Daelemans et al. 99)
EBL EBL: Summary • EBL methods base decisions on similarity to specific past instances rather than constructing abstractions • Exemplar-based methods abandon the goal of maintaining concept "simplicity" • Consequently, they trade decreased learning time for increased classification time • They store all examples and, thus, all exceptions too. This seems to be very important for NLP tasks: "Forgetting Exceptions is Harmful in NL Learning" (Daelemans et al. 99)
EBL EBL: Summary • Important issues are: • Defining appropriate distance metrics • Representing example attributes appropriately • Efficient indexing of training cases • Handling irrelevant features
EBL Exemplar-Based Learning and NLP • Stress acquisition (Daelemans et al. 94) • Grapheme-to-phoneme conversion (van den Bosch & Daelemans 93; Daelemans et al. 99) • Morphology (van den Bosch et al. 96) • POS Tagging (Daelemans et al. 96; van Halteren et al. 98) • Domain-specific lexical tagging (Cardie 93) • Word sense disambiguation (Ng & Lee 96; Mooney 96; Fujii et al. 98; Escudero et al. 00)
EBL Exemplar-Based Learning and NLP • PP-attachment disambiguation (Zavrel et al. 97; Daelemans et al. 99) • Partial parsing and NP extraction (Veenstra 98; Argamon et al. 98; Cardie & Pierce 98; Daelemans et al. 99) • Context-sensitive parsing (Simmons & Yu 92) • Text Categorization (Riloff & Lehnert 94; Yang & Chute 94; Yang 99) • etc.
EBL EBL and NLP: Software • TiMBL: Tilburg Memory Based Learner (Daelemans et al. 1998-2002) • From the ILK (Induction of Linguistic Knowledge) group • Version 4.1 (new): http://ilk.kub.nl/software.html
EBL IGTREE (Daelemans et al. 97) • The IGTREE approach uses trees for compression and classification in lazy learning algorithms • IGTREEs are "equivalent" to the decision trees that would be inductively acquired with a simple, pre-calculated function for attribute selection • Compression of the base of examples = TDIDT (top-down induction of decision trees) • The attribute ordering is computed only at the root node, using Quinlan's Gain Ratio, and is kept constant during TDIDT • A final pruning step is performed in order to compress the information even further • Example 1: IGTREEs applied to POS tagging (…) (Daelemans et al. 96)
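The classification step of an IGTREE can be sketched as a walk down the tree in the fixed attribute order, falling back on the stored default class when a feature value has not been seen; the node layout used here is an illustrative assumption, not TiMBL's actual data structure:

    def igtree_classify(root, instance, feature_order):
        # Each node stores a default class and children indexed by the next feature's value
        node = root
        for f in feature_order:               # order fixed once, e.g. by gain ratio
            child = node["children"].get(instance[f])
            if child is None:
                break                         # unseen value: answer with the current default
            node = child
        return node["default"]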
EBL Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99) • Benchmark problems: grapheme-to-phoneme conversion with stress assignment (GS), part-of-speech tagging (POS), prepositional phrase attachment disambiguation (PP), and base noun phrase chunking (NP) • IB1-IG: editing superfluous regular instances and unproductive exceptions. Measures: (1) typicality; (2) class-prediction strength • Editing exceptions is harmful for IB1-IG in all benchmark problems
EBL Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99) • Methods applied, ordered by increasing abstraction: IB1-IG < IGTREE < C5.0 • Forgetting by decision-tree learning (abstracting) can be harmful in language learning • IB1-IG "outperforms" IGTREE and C5.0 on the benchmark problems
EBL Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99) • The decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness) • Why is forgetting exceptions harmful?
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Semantic ambiguity (lexical level): WSD • "He was shot in the hand as he chased the robbers in the back street": here "hand" may carry the body-part or the clock-part sense (The Wall Street Journal Corpus)
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Corpus: • (Ng 96, 97) Available through the LDC • 192,800 examples (full sentences) of the 191 most ambiguous English words (121 nouns and 70 verbs) • Avg. number of senses: 7.2 per noun, 12.6 per verb, 9.2 overall • Features: • Set A: • Set B:
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) Comparative Study • Motivation: inconsistent previous results • Metrics: • Hamming distance (h) • Modified Value Difference Metric, MVDM (cs) (Cost & Salzberg 93) • Variants: • Example weighting (e) • Attribute weighting (a), using the RLM distance to calculate attribute weights (López de Mántaras 91)
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Results: Set A on a reduced set of 15 words
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Naive Bayes
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Results: Set B on a reduced set of 15 words • There must be an experimental error. Or not?
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Results: Set B on a reduced set of 15 words • The binary representation of topical attributes is not appropriate for either the NB or the EB algorithm
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Proposal for EB: • Represent all topical binary attributes as a single multivalued attribute containing the set of all content words that appear in the example (sentence) • Set the similarity measure between two values of this attribute to the matching coefficient • P(ositive)EB
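A minimal sketch of this matching coefficient, which simply counts the content words that two examples share (the example words below are only illustrative):

    def matching_coefficient(words_a, words_b):
        # Similarity between two bags of content words: number of words they share
        return len(set(words_a) & set(words_b))

    matching_coefficient(["shot", "hand", "chased", "robbers", "street"],
                         ["hand", "clock", "minute"])   # -> 1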
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Results of the positive versions: Set B on a reduced set of 15 words
EBL Example 3: Word Sense Disambiguation (Escudero et al. 00) • Results of the positive versions: on the whole corpus (191 words)
EBL EBL: Some Final Comments • Advantages • Well known and well studied approach • Extremely simple learning algorithm • Robust in the face of redundant features • It stores all examples and, thus, all exceptions too. This seems to be very important for NLP tasks: "Forgetting Exceptions is Harmful in NL Learning" (Daelemans et al. 99) • Drawbacks • Sensitive to irrelevant features • Computational requirements: storing the base of exemplars, calculation of complex metrics, classification cost (needs efficient indexing and retrieval algorithms), etc.