Ling 570: Day 8 Classification, Mallet
Roadmap • Open questions? • Quick review of classification • Feature templates
Classification Problem Steps • Input processing: • Split data into training/dev/test • Convert data into a feature representation (aka Attribute Value Matrix) • Training • Testing • Evaluation
Feature templates • Problem: predict the POS tag distribution of an unknown word • Input: “unfrobulate” • Input: “turduckenly” • Features might include: • Last three characters are “ate” • Last two characters are “ly” • Feature templates generate features given an input • Template : Last three characters == XXX. • Plug in XXX to get a binary valued feature. • Templates generate many features
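As a minimal sketch of the idea, the template "last k characters == XXX" can be written as a function that fills in XXX from the input word. The function name `suffix_features` and the feature-name format (e.g. `suffix3=ate`) are illustrative assumptions, not part of the course materials or of Mallet.

```python
def suffix_features(word, lengths=(2, 3)):
    """Instantiate the template 'last k characters == XXX' for each k.

    Returns a dict of binary features. Names such as 'suffix3=ate' are a
    hypothetical naming scheme chosen for this sketch.
    """
    features = {}
    for k in lengths:
        # Skip suffix lengths longer than the word itself.
        if len(word) >= k:
            features[f"suffix{k}={word[-k:]}"] = 1
    return features


for w in ("unfrobulate", "turduckenly"):
    print(w, suffix_features(w))
```

One template yields a different binary feature for every distinct suffix seen in the data, which is why a handful of templates can generate many features.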
Classifiers • Wide variety • Differ on several dimensions • Supervision • Learning Function • Input Features
Supervision in Classifiers • Supervised: • True label/class of each training instance is provided to the learner at training time • Naïve Bayes, MaxEnt, Decision Trees, Neural nets, etc • Unsupervised: • No true labels are provided for examples during training • Clustering: k-means; Min-cut algorithms • Semi-supervised: (bootstrapping) • True labels are provided for only a subset of examples • Co-training, semi-supervised SVM/CRF, etc
Inductive Bias • What form of function is learned? • Function that separates members of different classes • Linear separator • Higher order functions • Voronoi diagrams, etc. • Graphically, the decision boundary • [Figure: + and - examples separated by a decision boundary]
Machine Learning Functions • Problem: Can the representation effectively model the class to be learned? • Motivates selection of learning algorithm • For this function, a linear discriminant is GREAT! • Rectangular boundaries (e.g. ID trees) are TERRIBLE! • Pick the right representation! • [Figure: scatter of - and + examples]
Machine Learning Features • Inputs: • E.g. words, acoustic measurements, parts-of-speech, syntactic structures, semantic classes, ... • Vectors of features: • E.g. word: letters • ‘cat’: L1 = c; L2 = a; L3 = t • Parts of syntax trees?
Machine Learning Features • Questions: • Which features and values should be used? • How should they relate to each other? • Issue 1: What values should they take? • Binary features: don’t do anything! • Real valued features *may* need to be normalized • Can force the values to have 0 mean and unit variance • Compute the mean and variance on the training set for each real valued feature • Replace each original value x with (x - mean) / standard deviation • Can also bin them or binarize them; often this works better • Issue 2: Which ones are important? • Feature selection is sometimes important • Current approach
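The normalization step can be sketched as follows, assuming a single real valued feature held in a plain list. The helper names `fit_zscore` and `apply_zscore` and the sample values are hypothetical. The point is that the mean and standard deviation come from the training set only and are then reused unchanged on test data.

```python
import statistics


def fit_zscore(train_values):
    """Compute the mean and (population) standard deviation on training data."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std


def apply_zscore(value, mean, std):
    """Replace a raw value with (value - mean) / std.

    A constant feature has std == 0; mapping it to 0.0 is a choice made
    for this sketch, not a standard rule.
    """
    return 0.0 if std == 0 else (value - mean) / std


train = [2.0, 4.0, 6.0, 8.0]          # made-up training values
mean, std = fit_zscore(train)
scaled_train = [apply_zscore(v, mean, std) for v in train]
# Test values reuse the training statistics; they are not re-estimated.
scaled_test = [apply_zscore(v, mean, std) for v in [5.0, 10.0]]
print(scaled_train, scaled_test)
```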
Machine Learning Toolkits • Many learners, many tools/implementations • Some broad tool sets • weka • Java, lots of classifiers, pedagogically oriented • mallet • Java, classifiers, sequence learners • More heavy duty
Mallet • Machine learning toolkit • Developed at UMass Amherst by Andrew McCallum • Java implementation, open source • Large collection of machine learning algorithms • Targeted to language processing • Naïve Bayes, MaxEnt, Decision Trees, Winnow, Boosting • Also, clustering, topic models, sequence learners • Widely used, but • Research software: some bugs/gaps; odd documentation
Installation • Installed on patas • /NLP_TOOLS/tool_sets/mallet/latest/ • Directories: • bin/: script files • src/: Java source code • class/: Java classes • lib/: jar files • sample-data/: Wikipedia docs for language ID, etc.
Environment • Should be set up on patas • $PATH should include • /NLP_TOOLS/tool_sets/mallet/latest/bin • $CLASSPATH should include both • /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar • /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar • Check: • which text2vectors • Should report a location under /NLP_TOOLS/tool_sets/mallet/latest/bin
Mallet Commands • Mallet command types: • Data preparation • Data/model inspection • Training • Classification • Command line scripts • Shell scripts • Set up java environment • Invoke java programs • --help lists command line parameters for scripts
Mallet Data • Mallet data instances: • Instance_id label f1 v1 f2 v2 ….. • Stored in internal binary format: “vectors” • Binary format used by learners, decoders • Need to convert text files to binary format
Data Preparation • Built-in data importers • One class per directory, one instance per file • bin/mallet import-dir --input IF --output OF • Label is directory name • (Also text2vectors) • One instance per line • bin/mallet import-file --input IF --output OF • Line: instance label text ….. • (Also csv2vectors) • Create binary representation of text feature counts
Data Preparation • bin/mallet import-svmlight --input IF --output OF • Allows import of user constructed feature value pairs • Format: • label f1:v1 f2:v2 ….. fn:vn • Features can be strings or indexes • (Also bin/svmlight2vectors) • If building test data separately from original: • bin/mallet import-svmlight --input IF --output OF --use-pipe-from previously_built.vectors • Ensures consistent feature representation • Note: can’t mix svmlight models with others
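A small sketch of producing lines in the `label f1:v1 f2:v2 … fn:vn` layout from user constructed features. The function name `to_svmlight_line`, the labels, and the feature names are made up for illustration. The sketch also assumes feature names contain no whitespace or ":" characters, since either would break the format.

```python
def to_svmlight_line(label, features):
    """Format one instance as 'label f1:v1 f2:v2 ... fn:vn'.

    Features are sorted by name only to make the output deterministic.
    """
    pairs = " ".join(f"{name}:{value}" for name, value in sorted(features.items()))
    return f"{label} {pairs}".rstrip()


# Hypothetical instances: (label, {feature name: value}).
instances = [
    ("VB", {"suffix3=ate": 1, "prefix2=un": 1}),
    ("RB", {"suffix2=ly": 1}),
]
for label, feats in instances:
    print(to_svmlight_line(label, feats))
```

Writing these lines to a text file gives something that can be passed as the input file (IF) of the import command above.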
Accessing Binary Formats • vectors2info --input IF • --print-labels TRUE • Prints list of category labels in data set • --print-matrix sic • Prints all features and values by string and number • Returns original text feature-value list • Possibly out of order • vectors2vectors --input IF --training-file TNF --testing-file TTF --training-portion pct • Creates random training/test splits in some ratio
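Conceptually, a random split by proportion can be sketched as below. This is not Mallet's implementation, only an illustration of the idea. The function name `random_split`, the fixed seed, and the instance names are assumptions.

```python
import random


def random_split(instances, training_portion, seed=0):
    """Shuffle a copy of the data and cut it at the requested proportion."""
    shuffled = list(instances)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(training_portion * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


data = [f"inst{i}" for i in range(10)]   # placeholder instance ids
train, test = random_split(data, 0.9)
print(len(train), len(test))
```

Every instance lands in exactly one of the two parts, so the test set stays unseen during training.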
Building & Accessing Models • bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF • Builds classifier model • Can also store model, produce scores, confusion matrix, etc. • --trainer: MaxEnt, DecisionTree, NaiveBayes, etc. • --report: train:accuracy, test:f1:en • Can also use pre-split training & testing files • e.g. output of vectors2vectors • --training-file, --testing-file
Building & Accessing Models • Sample output:
Confusion Matrix, row=true, column=predicted accuracy=1.0
  label   0   1  |total
  0  de   1   .  |1
  1  en   .   1  |1
Summary. train accuracy mean = 1.0 stddev = 0 stderr = 0
Summary. test accuracy mean = 1.0 stddev = 0 stderr = 0
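To make the report easier to read, here is a rough sketch of how a confusion matrix (rows are true labels, columns are predicted labels) and accuracy are computed. It mirrors the two-instance de/en example above. The function names and the printed layout are assumptions and do not reproduce Mallet's own code or output format.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


gold = ["de", "en"]   # true labels, as in the slide's example
pred = ["de", "en"]   # predicted labels
counts = confusion_matrix(gold, pred)
labels = sorted(set(gold) | set(pred))
for t in labels:
    # One row per true label, one column per predicted label.
    print(t, [counts[(t, p)] for p in labels])
print("accuracy =", accuracy(gold, pred))
```

Off-diagonal cells count errors, so a matrix with non-zero entries only on the diagonal corresponds to accuracy 1.0.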