Comprehensive Guide: Machine Learning in R for Statistical Offices

Machine Learning in R and its use in the statistical offices
stat.unido.org
v.todorov@unido.org

Outline
• Machine learning and R
• R packages
• Machine learning in official statistics
• Top 10 algorithms
• References

What I talk about when I talk about Machine Learning

R and R packages
• What makes R so useful?
• Users can extend and improve the software or write variations for specific tasks.
• The R package mechanism allows packages written for R to add advanced algorithms, graphs, machine learning and data mining techniques.
• Each R package provides structured, standard documentation, including code application examples.

R and R packages
## Naive Bayes example
> install.packages('e1071', dependencies = TRUE)
> library(class)
> library(e1071)
> data(iris)
> pairs(iris[1:4],
+       main = "Iris Data (red=setosa,green=versicolor,blue=virginica)",
+       pch = 21,
+       bg = c("red", "green3", "blue")[unclass(iris$Species)])

R and R packages
> classifier <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
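The confusion table on this slide is computed on the same 150 rows that were used to fit the classifier, so it is likely to look somewhat better than performance on new data. A minimal sketch of a hold-out evaluation follows; the 70/30 split and the seed are assumptions chosen for illustration, not part of the original slides.

```r
library(e1071)  # provides naiveBayes()

data(iris)
set.seed(42)  # arbitrary seed, used only for reproducibility

# Hold out 30% of the rows for testing (the split ratio is an assumption)
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit the classifier on the training rows only
classifier <- naiveBayes(train[, 1:4], train[, 5])

# Confusion matrix and accuracy on the held-out rows
pred <- predict(classifier, test[, 1:4])
conf_mat <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```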

Machine Learning for Official Statistics
• Automatic Coding
• Editing and Imputation
• Record Linkage
• Other Methods

Automatic coding
• Automatic coding via Bayesian classifier: caret, klaR
• Automatic occupation coding via CASCOT: algorithm not described
• Automatic coding via open-source indexing utility: ?
• Automatic coding of census variables via SVM: e1071 (interface to libsvm)

Editing and Imputation
• Categorical data imputation via neural networks and Bayesian networks: neuralnet, gRain, bnlearn, deal
• Identification of error-containing records via classification trees: rpart, tree, caret
• Imputation donor pool screening via cluster analysis: class, klaR, cluster, kmeans(), hclust()
• Imputation via Classification and Regression Trees (CART): rpart, caret, RWeka
• Determination of imputation matching variables via Random Forests: randomForest
• Creation of homogeneous imputation classes via CART: rpart
• Derivation of edit rules via association analysis: arules
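As an illustration of imputation via CART, the sketch below blanks out a categorical variable for some records, grows a classification tree with rpart on the complete records, and fills the gaps with the predicted class. The iris data, the number of blanked records and the seed are assumptions made for the example; production systems may instead draw a donor value from the matching terminal node.

```r
library(rpart)  # recursive partitioning (CART-style trees)

data(iris)
set.seed(1)  # arbitrary seed, used only for reproducibility

# Simulate item non-response: blank out Species for 20 random records
dat <- iris
missing_idx <- sample(nrow(dat), 20)
dat$Species[missing_idx] <- NA

# Grow a classification tree on the records with an observed Species
observed <- dat[!is.na(dat$Species), ]
tree <- rpart(Species ~ ., data = observed, method = "class")

# Impute each missing value with the class predicted by the tree
imputed <- predict(tree, newdata = dat[missing_idx, ], type = "class")
dat$Species[missing_idx] <- imputed

# Share of imputed values that agree with the true (held back) values
hit_rate <- mean(dat$Species[missing_idx] == iris$Species[missing_idx])
print(hit_rate)
```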

Record Linkage
• Weighting vector classification: the last major step in record linkage or record de-duplication can be understood as a classification problem
• In R: rpart, bagging() in package ipred, ada, svm() in package e1071 and nnet() in package nnet
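To make the classification view of record linkage concrete, the sketch below simulates comparison vectors for candidate record pairs and trains a classification tree to separate matches from non-matches. Everything about the data is hypothetical: the three agreement indicators, the 20% match rate and the 0.9/0.1 agreement probabilities are assumptions chosen only to produce a runnable example.

```r
library(rpart)

set.seed(123)  # arbitrary seed, used only for reproducibility

n_pairs  <- 1000
is_match <- rbinom(n_pairs, 1, 0.2)  # assumed: 20% of candidate pairs are true matches

# Assumed error model: a field agrees with probability 0.9 for true matches
# and with probability 0.1 for non-matches
agree <- function(m) rbinom(length(m), 1, ifelse(m == 1, 0.9, 0.1))

# One row per candidate pair: binary agreement indicators plus the true status
pairs_df <- data.frame(
  name_agree  = agree(is_match),
  birth_agree = agree(is_match),
  city_agree  = agree(is_match),
  status      = factor(is_match, levels = c(0, 1),
                       labels = c("non-match", "match"))
)

# Train on 700 labelled pairs, evaluate on the remaining 300
train_idx <- sample(n_pairs, 700)
fit  <- rpart(status ~ ., data = pairs_df[train_idx, ], method = "class")
pred <- predict(fit, pairs_df[-train_idx, ], type = "class")

conf_mat <- table(predicted = pred, actual = pairs_df$status[-train_idx])
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

In practice the labelled pairs would come from a clerically reviewed sample, and the tree could be swapped for any of the classifiers listed on the slide.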

Other Methods
• Questionnaire consolidation via cluster analysis: class, klaR, cluster...
• Forming non-response weighting groups via classification trees: rpart, tree, caret
• Non-respondent prediction via classification trees: rpart, tree, caret
• Analysis of reporting errors via classification trees: rpart, tree, caret
• Substitutes for surveys via internet scraping: scrapeR, rvest
• Tax evader detection via k-nearest neighbours: class, kknn
• Crop yield estimation via image processing on satellite imaging data: is this ML?
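A minimal sketch of k-nearest-neighbour classification with knn() from package class is given below. No tax data are available here, so the iris data stand in for a labelled training set; the choice of k = 5, the size of the test set and the seed are assumptions.

```r
library(class)  # provides knn()

data(iris)
set.seed(7)  # arbitrary seed, used only for reproducibility

# Keep 50 rows aside as the cases to be classified
test_idx <- sample(nrow(iris), 50)

# kNN is distance based, so standardise the features using
# the training means and standard deviations only
train_x <- scale(iris[-test_idx, 1:4])
test_x  <- scale(iris[test_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Each test case gets the majority class among its 5 nearest training cases
pred <- knn(train = train_x, test = test_x,
            cl = iris$Species[-test_idx], k = 5)

accuracy <- mean(pred == iris$Species[test_idx])
print(table(predicted = pred, actual = iris$Species[test_idx]))
print(accuracy)
```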

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
• Fernandez-Delgado, Cernadas, Barro and Amorim (2014)
• Evaluates 179 classifiers arising from 17 families on 121 data sets
• By far the best are random forests and SVM with Gaussian kernel
• Most of the best classifiers are implemented in R and tuned using caret
• caret seems the best alternative for selecting a classifier implementation
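As a rough illustration of what tuning with caret looks like, the sketch below runs 5-fold cross-validation over the mtry parameter of a random forest. It assumes that the caret and randomForest packages are installed; the data set, the grid and the number of folds are choices made for this example and are not taken from the study.

```r
library(caret)  # train() and trainControl(); method "rf" needs randomForest installed

data(iris)
set.seed(2014)  # arbitrary seed, used only for reproducibility

# 5-fold cross-validation (the number of folds is an assumption)
ctrl <- trainControl(method = "cv", number = 5)

# Try every possible value of mtry, the number of variables sampled per split
fit <- train(Species ~ ., data = iris,
             method    = "rf",
             trControl = ctrl,
             tuneGrid  = data.frame(mtry = 1:4))

# Cross-validated accuracy for each candidate and the selected value
print(fit$results[, c("mtry", "Accuracy")])
print(fit$bestTune)
```

Changing the method argument (for example to "svmRadial", which relies on kernlab) is usually enough to tune a different classifier with the same resampling setup.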

Top 10 ML/DM Algorithms
Xindong Wu and Vipin Kumar (2009)
• C4.5: generates classifiers expressed as decision trees or in ruleset form
• K-Means: simple iterative method to partition a given dataset into a user-specified number of clusters, k
• SVM: support vector machines
• Apriori: derives association rules
• EM: Expectation-Maximization algorithm
• PageRank: produces a static ranking of Web pages
• AdaBoost: ensemble learning
• kNN: k-nearest neighbor classification
• Naive Bayes: simple classifier, applying Bayes' theorem with independence assumptions between the features
• CART: Classification and Regression Trees
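Several of these algorithms are available in base R. As one example, the sketch below runs k-means with kmeans() from the stats package on the iris measurements; the choice of k = 3, the standardisation step and the number of random starts are assumptions made for illustration.

```r
data(iris)
set.seed(10)  # arbitrary seed; k-means depends on its random starting centres

# Standardise the four measurements so that no variable dominates the distance
x <- scale(iris[, 1:4])

# Partition into k = 3 clusters; nstart = 25 keeps the best of 25 random starts
km <- kmeans(x, centers = 3, nstart = 25)

# Compare the clusters with the known species (cluster labels are arbitrary)
print(table(cluster = km$cluster, species = iris$Species))
print(km$size)
```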

The best R packages for ML
• e1071: Naive Bayes, SVM, latent class analysis
• rpart: classification and regression trees
• randomForest: random forests
• gbm: generalized boosted regression models
• kernlab: SVM
• caret: Classification and Regression Training
• neuralnet: neural networks
CRAN Task View: Machine Learning & Statistical Learning

Machine learning books
