Explore machine learning applications in official statistics using top algorithms in R. Learn about automatic coding, editing, imputation, record linkage, and more. Discover the best R packages for ML tasks.
Machine Learning in R and its use in the statistical offices stat.unido.org v.todorov@unido.org
Outline • Machine learning and R • R packages • Machine learning in official statistics • Top 10 algorithms • References
R and R packages • What makes R so useful? • Users can extend and improve the software or write variations for specific tasks. • The R package mechanism allows packages written for R to add advanced algorithms, graphs, machine learning and data mining techniques • Each R package provides structured, standard documentation including worked code examples
R and R packages
## Naive Bayes example
> install.packages('e1071', dependencies = TRUE)
> library(class)
> library(e1071)
> data(iris)
> pairs(iris[1:4], main = "Iris Data (red=setosa,green=versicolor,blue=virginica)", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
R and R packages
> classifier <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
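The confusion table above scores the classifier on the same rows it was trained on, which tends to flatter the result. A minimal sketch of a hold-out check follows; the 70/30 split and the seed are arbitrary assumptions, not part of the original slides.

```r
library(e1071)

data(iris)
set.seed(42)  # arbitrary seed, only for reproducibility

# Hold out 30% of the rows as a test set (assumed split ratio)
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit the naive Bayes classifier on the training rows only
classifier <- naiveBayes(Species ~ ., data = train)

# Confusion matrix and accuracy on the unseen rows
pred <- predict(classifier, test[, -5])
conf <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```

The exact counts depend on the random split, so no expected output is shown here.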
Machine Learning for Official Statistics • Automatic Coding • Editing and Imputation • Record Linkage • Other Methods
Automatic coding • Automatic coding via Bayesian classifier: caret, klaR • Automatic occupation coding via CASCOT: algorithm not described • Automatic coding via open-source indexing utility: ? • Automatic coding of census variables via SVM: e1071 (interface to libsvm)
Editing and Imputation • Categorical data imputation via neural networks and Bayesian networks: neuralnet, gRain, bnlearn, deal • Identification of error-containing records via classification trees: rpart, tree, caret • Imputation donor pool screening via cluster analysis: class, klaR, cluster, kmeans(), hclust() • Imputation via Classification and Regression Trees (CART): rpart, caret, RWeka • Determination of imputation matching variables via Random Forests: randomForest • Creation of homogeneous imputation classes via CART: rpart • Derivation of edit rules via association analysis: arules
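As an illustration of imputation via CART, the sketch below blanks out a categorical variable for some records, fits an rpart tree on the complete records, and fills the gaps with the predicted class. The use of iris, the 20 simulated missing values, and the variable names are assumptions made for the example; real survey data would need a proper missingness model.

```r
library(rpart)

data(iris)
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulate item non-response: blank out Species for 20 random records
dat <- iris
missing_idx <- sample(nrow(dat), 20)
dat$Species[missing_idx] <- NA

# Fit a classification tree on the complete records
complete <- dat[!is.na(dat$Species), ]
tree <- rpart(Species ~ ., data = complete, method = "class")

# Impute the missing categories with the tree's predicted class
imputed <- dat
imputed$Species[missing_idx] <- predict(tree, dat[missing_idx, ], type = "class")

# Share of imputed values that match the true (withheld) values
hit_rate <- mean(imputed$Species[missing_idx] == iris$Species[missing_idx])
print(hit_rate)
```

Because the true values were withheld on purpose, the hit rate gives a rough sense of imputation quality, something that is not available for genuinely missing data.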
Record Linkage • Weighting vector classification: the last major step in record linkage or record de-duplication • Can be understood as a classification problem: each compared record pair is labelled match or non-match • In R: rpart, bagging() in package ipred, ada, svm() in package e1071 and nnet() in package nnet
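To make the classification view of record linkage concrete, the sketch below trains an rpart tree on comparison vectors for candidate record pairs. All data are simulated, and the field names (name_sim, address_sim, birth_agree) and their distributions are hypothetical; in practice the vectors would come from comparing real records field by field.

```r
library(rpart)

set.seed(123)  # arbitrary seed, only for reproducibility
n <- 500

# Simulated (not real) comparison vectors: one row per candidate record pair.
# Similarity scores lie in [0, 1]; true matches tend to score higher.
is_match <- rbinom(n, 1, 0.3)
pair_data <- data.frame(
  name_sim    = ifelse(is_match == 1, rbeta(n, 8, 2), rbeta(n, 2, 5)),
  address_sim = ifelse(is_match == 1, rbeta(n, 6, 2), rbeta(n, 2, 4)),
  birth_agree = rbinom(n, 1, ifelse(is_match == 1, 0.95, 0.10)),
  status      = factor(ifelse(is_match == 1, "match", "non-match"))
)

# Train on the first 350 pairs, evaluate on the remaining 150
train <- pair_data[1:350, ]
test  <- pair_data[351:n, ]

fit <- rpart(status ~ name_sim + address_sim + birth_agree,
             data = train, method = "class")

# Classify the held-out pairs and tabulate the result
pred <- predict(fit, test, type = "class")
conf <- table(predicted = pred, actual = test$status)
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```

Any of the other classifiers listed on the slide could be swapped in for rpart() with the same train/predict pattern.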
Other Methods • Questionnaire consolidation via cluster analysis: class, klaR, cluster... • Forming non-response weighting groups via classification trees: rpart, tree, caret • Non-respondent prediction via classification trees: rpart, tree, caret • Analysis of reporting errors via classification trees: rpart, tree, caret • Substitutes for surveys via internet scraping: scrapeR, rvest • Tax evader detection via k-nearest neighbours: class, kknn • Crop yield estimation via image processing on satellite imaging data: is this ML?
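The k-nearest-neighbour approach named above can be sketched with knn() from package class. The iris data stand in for real records here, and k = 5 and the 50-row test set are arbitrary assumptions.

```r
library(class)

data(iris)
set.seed(7)  # arbitrary seed, only for reproducibility

test_idx <- sample(nrow(iris), 50)
y <- iris$Species

# kNN relies on distances, so scale the features using training statistics only
x_train <- scale(iris[-test_idx, 1:4])
x_test  <- scale(iris[test_idx, 1:4],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# Classify each held-out row by majority vote among its 5 nearest training rows
pred <- knn(train = x_train, test = x_test, cl = y[-test_idx], k = 5)

accuracy <- mean(pred == y[test_idx])
print(accuracy)
```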
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? • Fernández-Delgado, Cernadas, Barro and Amorim (2014) • Evaluate 179 classifiers arising from 17 families on 121 data sets • By far the best are random forests and SVM with Gaussian kernel • Most of the best classifiers are implemented in R and tuned using caret, which seems the best alternative to select a classifier implementation
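A minimal sketch of tuning a random forest with caret, in the spirit of the study above but not a reproduction of its setup: 5-fold cross-validation over the mtry parameter on iris. It assumes the caret and randomForest packages are installed; the grid and the seed are arbitrary.

```r
library(caret)

data(iris)
set.seed(2014)  # arbitrary seed, only for reproducibility

# 5-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 5)

# Candidate values for mtry (number of variables tried at each split)
grid <- expand.grid(mtry = 1:4)

# method = "rf" uses the randomForest package behind the scenes
rf_fit <- train(Species ~ ., data = iris, method = "rf",
                trControl = ctrl, tuneGrid = grid)

# Selected parameter and cross-validated accuracy per candidate
print(rf_fit$bestTune)
print(rf_fit$results[, c("mtry", "Accuracy")])
```

Changing method (for example to "svmRadial", which needs kernlab) and the tuning grid reuses the same train() call for another classifier family.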
Top 10 ML/DM Algorithms Xindong Wu and Vipin Kumar (2009) • C4.5 – generates classifiers expressed as decision trees or in ruleset form • K-Means – simple iterative method to partition a given dataset into a user-specified number of clusters, k • SVM – support vector machines • Apriori – derives association rules • EM – Expectation-Maximization algorithm • PageRank – produces a static ranking of Web pages • AdaBoost – ensemble learning • kNN – k-nearest neighbor classification • Naive Bayes – simple classifier applying Bayes' theorem with independence assumptions between the features • CART – Classification and Regression Trees
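Of the algorithms listed, K-Means is available in base R as kmeans(). The sketch below partitions the four iris measurements into k = 3 clusters; the choice of k, the scaling step, and nstart = 25 are assumptions for illustration.

```r
data(iris)
set.seed(10)  # arbitrary seed, only for reproducibility

# Partition the scaled measurements into k = 3 clusters (species labels are not used)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Cross-tabulate clusters against the known species, for illustration only
print(table(cluster = km$cluster, species = iris$Species))

# Total within-cluster sum of squares: the quantity k-means tries to minimise
print(km$tot.withinss)
```

The cluster numbers are arbitrary labels, so the cross-tabulation should be read by rows and columns rather than along the diagonal.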
The best R packages for ML • e1071: Naive Bayes, SVM, latent class analysis • rpart: classification and regression trees • randomForest: random forests • gbm: generalized boosted regression models • kernlab: SVM • caret: Classification and Regression Training • neuralnet: neural networks • See also the CRAN Task View: Machine Learning & Statistical Learning