MALLET: MAchine Learning for LanguagE Toolkit
Outline • About MALLET • Representing Data • Command Line Processing • Simple Evaluation • Conclusion
About MALLET • "MALLET: A Machine Learning for Language Toolkit." • Written by Andrew McCallum • http://mallet.cs.umass.edu, 2002 • Implemented in Java; current version: 2.0.6 • Motivation: • Text classification and information extraction • Commercial machine learning • Analysis and indexing of academic publications
About MALLET • Main idea • Text focus: data is treated as discrete rather than continuous, even when values could be continuous • How to use it • Command line scripts: bin/mallet [command] --[option] [value] … • Text User Interface ("tui") classes • Direct Java API: http://mallet.cs.umass.edu/api
Representations • Transform text documents into vectors x1, x2, … • The elements of a vector are called feature values • Example: "the feature at row 345 is the number of times 'dog' appears in the document" • Retain the meaning of each vector index, so that an index can always be mapped back to its feature
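The idea on this slide can be sketched in a few lines of Python. This is only an illustration, not MALLET's API: the `Alphabet` class and `to_feature_vector` function are hypothetical names, and whitespace tokenization is an assumption made for brevity. The alphabet keeps the mapping between feature strings and vector indices, which is what "retaining the meaning of vector indices" amounts to.

```python
from collections import Counter


class Alphabet:
    """Hypothetical two-way map between feature strings and integer indices."""

    def __init__(self):
        self.index_of = {}   # feature string -> index
        self.entries = []    # index -> feature string

    def lookup(self, entry, add=True):
        """Return the index of entry; add it if unseen (or return -1 when add=False)."""
        if entry not in self.index_of:
            if not add:
                return -1
            self.index_of[entry] = len(self.entries)
            self.entries.append(entry)
        return self.index_of[entry]


def to_feature_vector(text, alphabet, grow=True):
    """Turn a document into a sparse {index: count} vector of word counts."""
    counts = Counter()
    for token in text.lower().split():  # assumption: whitespace tokenization
        idx = alphabet.lookup(token, add=grow)
        if idx >= 0:
            counts[idx] += 1
    return dict(counts)


alphabet = Alphabet()
x1 = to_feature_vector("the dog chased the other dog", alphabet)
x2 = to_feature_vector("the cat slept", alphabet)

# The index of "dog" is stable across documents, so its meaning is retained.
dog_index = alphabet.lookup("dog", add=False)
print(dog_index, x1[dog_index], alphabet.entries[dog_index])
```

Sharing one alphabet across all documents is what makes x1 and x2 comparable: the same row always refers to the same word.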
Command Line • Importing Data • Classification • Sequence Tagging • Topic Modeling
Importing Data • One instance per file • files in the folders: sample-data/web/en or sample-data/web/de • command line: bin/mallet import-dir --input sample-data/web/* --output web.mallet • One file, one instance per line • file format: [URL] [language] [text of the page...] • command line: bin/mallet import-file --input /data/web/data.txt --output web.mallet
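To make the one-instance-per-line format concrete, here is a small Python sketch that splits such a line into its three parts. It is not how MALLET parses the file internally; the function name `parse_instance_line`, the field names, and the example URL are assumptions for illustration, and the sketch assumes the first two fields contain no whitespace.

```python
def parse_instance_line(line):
    """Split '[URL] [language] [text of the page...]' into its three fields.

    Assumption: the name (URL) and the label (language) contain no spaces,
    and everything after them is the text of the instance.
    """
    parts = line.strip().split(None, 2)  # split on whitespace, at most 3 fields
    if len(parts) < 3:
        raise ValueError("expected: name label text, got: %r" % line)
    name, label, data = parts
    return {"name": name, "label": label, "data": data}


# Hypothetical example line in the format described above.
instance = parse_instance_line("http://example.org/a en Hello world page\n")
print(instance["label"], "|", instance["data"])
```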
Classification • Training a classifier: bin/mallet train-classifier --input training.mallet --output-classifier my.classifier • Choosing an algorithm • MaxEnt, NaiveBayes, C45, DecisionTree and many others: bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt • Evaluation • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances: bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
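What a 90/10 random split and the resulting accuracy figure mean can be sketched as follows. This is a plain Python illustration of the evaluation idea, not MALLET's implementation; `split_instances`, `accuracy`, and the fixed seed are assumptions introduced here.

```python
import random


def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle the instances and cut them into (training, testing) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]


def accuracy(gold_labels, predicted_labels):
    """Fraction of test instances whose predicted label matches the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


train, test = split_instances(range(100), training_portion=0.9)
print(len(train), len(test))
print(accuracy(["en", "de", "en"], ["en", "en", "en"]))
```

The classifier is trained only on the first list and scored only on the second, so the accuracy estimates performance on unseen documents.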
Sequence Tagging • Sequence algorithms • hidden Markov models (HMMs) • linear chain conditional random fields (CRFs). • SimpleTagger • a command line interface to the MALLET Conditional Random Field (CRF) class
SimpleTagger • Input file, one token per line: [feature1 feature2 ... featuren label] • Example lines: Bill CAPITALIZED noun | slept non-noun | here LOWERCASE STOPWORD non-noun • Train a CRF • An input file "sample" • A trained CRF in the file "nouncrf": java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
SimpleTagger • A file "stest" to be labeled, one token per line: CAPITAL Al | slept | here • Label the input: java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest • Output: Number of predicates: 5 | noun CAPITAL Al | non-noun slept | non-noun here
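The SimpleTagger file layout (features first, label last on training lines, no label on lines to be tagged) can be read with a short Python sketch. The function names are hypothetical, and the treatment of a blank line as a sequence boundary is an assumption of this sketch rather than something stated on the slides.

```python
def parse_tagger_line(line, labeled=True):
    """Split one token line into (features, label); label is None if unlabeled."""
    fields = line.split()
    if not fields:
        return None  # blank line
    if labeled:
        return fields[:-1], fields[-1]  # last column is the label
    return fields, None


def read_sequences(text, labeled=True):
    """Group token lines into sequences (assumption: blank lines separate them)."""
    sequences, current = [], []
    for line in text.splitlines():
        parsed = parse_tagger_line(line, labeled)
        if parsed is None:
            if current:
                sequences.append(current)
                current = []
        else:
            current.append(parsed)
    if current:
        sequences.append(current)
    return sequences


sample = "Bill CAPITALIZED noun\nslept non-noun\nhere LOWERCASE STOPWORD non-noun\n"
stest = "CAPITAL Al\nslept\nhere\n"

training = read_sequences(sample, labeled=True)
unlabeled = read_sequences(stest, labeled=False)
print(training[0][0], unlabeled[0][0])
```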
Topic Modeling • Building topic models: bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz • --input [FILE] • --num-topics [NUMBER]: the number of topics to use; the best number depends on what you are looking for in the model • --num-iterations [NUMBER]: the number of sampling iterations; a trade-off between the time taken to complete sampling and the quality of the topic model • --output-state [FILENAME]: writes a compressed text file containing the words in the corpus with their topic assignments
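As a rough illustration of what can be done with the file written by --output-state, the Python sketch below tallies the most frequent words per topic. It rests on assumptions not stated on the slide: that the state file is gzip-compressed text, that lines starting with "#" are headers, and that each remaining line is whitespace-separated with the word in the second-to-last column and the topic number in the last. Check these against your own file before relying on it; the function names are hypothetical.

```python
import gzip
from collections import Counter, defaultdict


def top_words_by_topic(lines, n=3):
    """Count word/topic assignments and return the n most frequent words per topic.

    Assumed line layout: ... <word> <topic>, with '#' marking header lines.
    """
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()
        word, topic = fields[-2], int(fields[-1])
        counts[topic][word] += 1
    return {t: [w for w, _ in c.most_common(n)] for t, c in counts.items()}


def read_state(path, n=3):
    """Open a gzip-compressed state file and summarize it."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return top_words_by_topic(handle, n)


# Tiny made-up example in the assumed layout.
example_lines = [
    "#doc source pos typeindex type topic",
    "0 NA 0 0 dog 1",
    "0 NA 1 1 cat 1",
    "0 NA 2 0 dog 1",
    "1 NA 0 2 stock 0",
]
summary = top_words_by_topic(example_lines)
print(summary)
```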
Methodology • Focus on the sequence tagging module in MALLET • CRF-based implementation • Some scripts were written for importing data and evaluating results • Small corpora collected from the web • Each divided into two parts: 80% for training, 20% for testing • Evaluate both POS tagging and Named Entity Recognition (NER) • Training performance • Accuracy (POS tagging) and precision, recall and FB1, i.e. the F1 score (NER) • All scripts, corpora and results can be found at http://mallet-eval.googlecode.com
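The NER measures named above can be sketched in Python as follows. This is not the evaluation script used for the slides; it is a minimal sketch that assumes entities are compared as exact (start, end, type) matches, and the function name is hypothetical.

```python
def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level precision, recall and F1 over exact (start, end, type) matches."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one of two predicted entities matches one of two gold entities.
gold = [(0, 3, "Ni"), (5, 6, "Nh")]
predicted = [(0, 3, "Ni"), (7, 8, "Ns")]
p, r, f = precision_recall_f1(gold, predicted)
print(p, r, f)
```

Accuracy for POS tagging is simpler: the number of tokens with the correct tag divided by the total number of tokens.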
A Survey of Named Entity Corpora • Well-known named entity corpora • Language-Independent Named Entity Recognition at CoNLL-2003 • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1) • free and public, but needs the RCV1 raw texts as input • Message Understanding Conference (MUC) 6/7 • not free • Automatic Content Extraction (ACE) Training Corpus • not free • Other special-purpose corpora • Enron Email Dataset • email messages in this corpus are tagged with person names, dates and times • A variety of biomedical corpora • some corpora in this collection are tagged with entities in the biomedical domain, such as gene names
Small Corpora • Two small corpora collected from the web • Penn Treebank Sample • English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995 • raw, tagged, parsed and combined data from the Wall Street Journal • 148120 tokens, 36 standard Treebank POS tags • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/ • HIT CIR LTP Corpora Sample • Chinese NER corpus, integrated • 10% of the whole corpus (open to the public) • 23751 tokens, 7 kinds of named entities • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm
Environment • Hardware • CPU: Q8300 Quad Core 2.50 GHz • Memory: 3GB • Software • Fedora 13 x86_64 • Java 1.6.0_18 • MALLET 2.0.6
Data Format and Labels • Data format • One token per row, one feature per column: Bill noun | slept non-noun | here non-noun • Labels • Standard Treebank POS tags • CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural …… (36 tags in all) • HIT named entity tags • Position: O not an NE | S- a single-token NE | B- beginning of an NE | I- middle of an NE | E- end of an NE • Type: Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun • Example: 美国 (United States) B-Ni 洛杉矶 (Los Angeles) I-Ni 警察局 (Police Department) E-Ni
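The position prefixes (O, S-, B-, I-, E-) can be turned back into entity spans with a short Python sketch. This is an illustration of the tagging scheme, not code from the evaluation scripts; `decode_entities` is a hypothetical name, and silently dropping malformed sequences (for example an I- without a preceding B-) is a design assumption of this sketch.

```python
def decode_entities(tags):
    """Convert a tag sequence such as ['B-Ni', 'I-Ni', 'E-Ni'] into spans.

    Returns a list of (start, end, type) with end exclusive.
    Malformed sequences are dropped rather than repaired (an assumption).
    """
    entities = []
    start, kind = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                      # single-token entity
            entities.append((i, i + 1, label))
            start = None
        elif prefix == "B":                    # entity begins here
            start, kind = i, label
        elif prefix == "I":                    # must continue an open entity
            if start is None or label != kind:
                start = None
        elif prefix == "E":                    # closes an open entity
            if start is not None and label == kind:
                entities.append((start, i + 1, label))
            start = None
        else:                                  # "O" or anything unexpected
            start = None
    return entities


# The example from the slide: three tokens forming one organization name.
print(decode_entities(["B-Ni", "I-Ni", "E-Ni"]))
print(decode_entities(["O", "S-Nh", "O", "B-Ns", "E-Ns"]))
```

Spans produced this way for the gold and the predicted tag sequences can then be compared entity by entity when computing precision and recall.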
Evaluation Tasks Stages