Mallet

MALLET: MAchine Learning for LanguagE Toolkit

Outline • About MALLET • Representing Data • Command Line Processing • Simple Evaluation • Conclusion

About MALLET
• "MALLET: A Machine Learning for Language Toolkit."
  • Written by Andrew McCallum
  • http://mallet.cs.umass.edu, 2002
• Implemented in Java, currently version 2.0.6
• Motivation:
  • Text classification and information extraction
  • Commercial machine learning
  • Analysis and indexing of academic publications

About MALLET
• Main idea
  • Text focus: data is treated as discrete rather than continuous, even when values could be continuous
• How to use it
  • Command line scripts:
    bin/mallet [command] --[option] [value] …
  • Text User Interface ("tui") classes
  • Direct Java API: http://mallet.cs.umass.edu/api

Representations
• Transform text documents into vectors x1, x2, …
• The elements of a vector are called feature values
  • Example: "The feature at row 345 is the number of times 'dog' appears in the document"
• Retain the meaning of vector indices
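A minimal sketch of the idea on this slide, in plain Python rather than the MALLET API. The function names `build_alphabet` and `to_vector` and the two sample documents are made up for illustration; tokenization is assumed to be lowercase whitespace splitting. The point is that the token-to-index mapping is kept, so every row of a vector keeps its meaning.

```python
from collections import Counter

def build_alphabet(documents):
    """Map each distinct token to a fixed row index, in order of first appearance."""
    alphabet = {}
    for text in documents:
        for token in text.lower().split():
            alphabet.setdefault(token, len(alphabet))
    return alphabet

def to_vector(text, alphabet):
    """Return a count vector: row i holds the count of the token whose index is i."""
    counts = Counter(text.lower().split())
    vector = [0] * len(alphabet)
    for token, count in counts.items():
        if token in alphabet:  # tokens outside the alphabet are ignored
            vector[alphabet[token]] = count
    return vector

# Hypothetical two-document corpus.
docs = ["the dog chased the cat", "the cat slept"]
alphabet = build_alphabet(docs)
vectors = [to_vector(d, alphabet) for d in docs]

# The row for "dog" can always be looked up again through the alphabet.
dog_row = alphabet["dog"]
print(dog_row, vectors[0][dog_row])
```

Because the alphabet is shared by all documents, the same row index refers to the same word in every vector.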

Documents to Vectors

Instances

Outline • About MALLET • Representing Data • Command Line Processing • Developing with MALLET • Conclusion

Command Line • Importing Data • Classification • Sequence Tagging • Topic Modeling

Importing Data
• One instance per file
  • Files in the folders: sample-data/web/en or sample-data/web/de
  • Command line:
    bin/mallet import-dir --input sample-data/web/* --output web.mallet
• One file, one instance per line
  • File format: [URL] [language] [text of the page...]
  • Command line:
    bin/mallet import-file --input /data/web/data.txt --output web.mallet
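To make the one-instance-per-line format concrete, here is a small plain-Python sketch that splits such a line into its three fields. It is not MALLET's importer: the function name `parse_instance_line` and the sample lines are hypothetical, and the fields are assumed to be separated by whitespace, with everything after the second field treated as the page text.

```python
def parse_instance_line(line):
    """Split '[URL] [language] [text of the page...]' into (name, label, data)."""
    name, label, data = line.strip().split(None, 2)
    return name, label, data

# Hypothetical sample lines in the format shown on the slide.
sample_lines = [
    "http://example.org/page1 en This is the text of an English page",
    "http://example.org/seite1 de Dies ist der Text einer deutschen Seite",
]

instances = [parse_instance_line(line) for line in sample_lines]
for name, label, data in instances:
    print(label, name, len(data.split()), "tokens")
```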

Classification
• Training a classifier
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
• Choosing an algorithm
  • MaxEnt, NaiveBayes, C45, DecisionTree and many others
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
• Evaluation
  • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances
    bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
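The random 90/10 split described under Evaluation can be sketched in plain Python as follows. This only illustrates the idea and is not how MALLET implements --training-portion; the function name `split_instances` and the `seed` parameter are assumptions made for the example.

```python
import random

def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle the instances and cut them into a training part and a testing part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]

# 100 dummy instances, identified by number.
train, test = split_instances(range(100), training_portion=0.9)
print(len(train), "training instances,", len(test), "testing instances")
```

Every instance lands in exactly one of the two parts, so the classifier is never evaluated on an instance it was trained on.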

Sequence Tagging
• Sequence algorithms
  • Hidden Markov models (HMMs)
  • Linear-chain conditional random fields (CRFs)
• SimpleTagger
  • A command line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger
• Input file: one token per line, in the format [feature1 feature2 ... featuren label]
    Bill CAPITALIZED noun
    slept non-noun
    here LOWERCASE STOPWORD non-noun
• Train a CRF
  • Input: a file "sample"
  • Output: a trained CRF in the file "nouncrf"
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
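A training file in the input format above can be produced with a few lines of plain Python. This is a sketch only: the helper names `format_training_line` and `format_sequence` are hypothetical, and the rows reproduce the three-token example from the slide (features first, label last).

```python
def format_training_line(features, label):
    """Join the features and the label into one 'feature1 ... featuren label' line."""
    return " ".join(list(features) + [label])

def format_sequence(rows):
    """Format one sequence: one token per line, each line ending with its label."""
    return "\n".join(format_training_line(f, l) for f, l in rows) + "\n"

# The example from the slide as (features, label) pairs.
rows = [
    (["Bill", "CAPITALIZED"], "noun"),
    (["slept"], "non-noun"),
    (["here", "LOWERCASE", "STOPWORD"], "non-noun"),
]

text = format_sequence(rows)
print(text, end="")
```

Writing `text` to a file named "sample" gives the kind of input the training command on this slide expects.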

SimpleTagger
• A file "stest" that needs to be labeled
    CAPITAL Al
    slept
    here
• Label the input
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
• Output
    Number of predicates: 5
    noun CAPITAL Al
    non-noun slept
    non-noun here

Topic Modeling
• Building topic models
    bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
• --input [FILE]
  The MALLET-format input file (topic-input.mallet in the command above).
• --num-topics [NUMBER]
  The number of topics to use. The best number depends on what you are looking for in the model.
• --num-iterations [NUMBER]
  The number of sampling iterations. It is a trade-off between the time taken to complete sampling and the quality of the topic model.
• --output-state [FILENAME]
  Outputs a compressed text file containing the words in the corpus with their topic assignments.

Demo

Methodology
• Focus on the sequence tagging module in MALLET
  • CRF-based implementation
  • Some scripts written for importing data and evaluating results
• Small corpora collected from the web
  • Divided into two parts: 80% for training, 20% for testing
• Evaluate both POS tagging and named entity recognition (NER)
  • Training performance
  • Accuracy (POS tagging); precision, recall and FB1 (NER)
• All scripts, corpora and results can be found at
  • http://mallet-eval.googlecode.com
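The metrics named above can be sketched in plain Python. This is not the evaluation script used for the talk: the function names, the tag sequences and the entity spans are invented for illustration, and NER is assumed to be scored at the entity level, where an entity counts as correct only if its span and type both match. FB1 is the F-measure with beta = 1, the harmonic mean of precision and recall.

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag (POS tagging)."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level precision, recall and F1 over (start, end, type) tuples (NER)."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented POS example: 3 of 4 tags are correct.
pos_accuracy = accuracy(["DT", "NN", "VBD", "NN"], ["DT", "NN", "VBD", "JJ"])

# Invented NER example: 2 of 3 predicted entities match 2 of 4 gold entities.
gold = [(0, 1, "PER"), (3, 4, "LOC"), (6, 7, "ORG"), (9, 10, "PER")]
pred = [(0, 1, "PER"), (3, 4, "LOC"), (6, 8, "ORG")]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(pos_accuracy, precision, recall, f1)
```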

A Survey of Named Entity Corpora
• Well-known named entity corpora
  • Language-Independent Named Entity Recognition at CoNLL-2003
    • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    • Free and public, but requires the RCV1 raw texts as input
  • Message Understanding Conference (MUC) 6 / 7
    • Not free
  • Automatic Content Extraction (ACE) Training Corpus
    • Not free
• Other special-purpose corpora
  • Enron Email Dataset
    • Email messages in this corpus are tagged with person names, dates and times
  • A variety of biomedical corpora
    • Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora
• Two small corpora collected from the web
• Penn Treebank Sample
  • English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995
  • Raw, tagged, parsed and combined data from the Wall Street Journal
  • 148,120 tokens, 36 standard Treebank POS tags
  • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
• HIT CIR LTP Corpora Sample
  • Integrated Chinese NER corpora
  • 10% of the whole corpora (open to the public)
  • 23,751 tokens, 7 kinds of named entities
  • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Environment
• Hardware
  • CPU: Q8300 Quad Core 2.50 GHz
  • Memory: 3 GB
• Software
  • Fedora 13 x86_64
  • Java 1.6.0_18
  • MALLET 2.0.6

Evaluation Tasks Stages

DEMO

Q&A
