
Comprehensive Guide: Machine Learning in R for Statistical Offices

Explore machine learning applications in official statistics using top algorithms in R. Learn about automatic coding, editing, imputation, record linkage, and more. Discover the best R packages for ML tasks.



Presentation Transcript


  1. Machine Learning in R and its use in the statistical offices (stat.unido.org, v.todorov@unido.org)

  2. Outline
     • Machine learning and R
     • R packages
     • Machine learning in official statistics
     • Top 10 algorithms
     • References

  3. What I talk about when I talk about Machine Learning

  4. R and R packages
     • What makes R so useful?
     • Users can extend and improve the software or write variations for specific tasks.
     • The R package mechanism lets packages written for R add advanced algorithms, graphs, and machine learning and data mining techniques.
     • Each R package provides structured, standard documentation, including code application examples.

  5. R and R packages
     ## Naive Bayes example
     > install.packages('e1071', dependencies = TRUE)
     > library(class)
     > library(e1071)
     > data(iris)
     > pairs(iris[1:4],
     +       main = "Iris Data (red=setosa,green=versicolor,blue=virginica)",
     +       pch = 21,
     +       bg = c("red", "green3", "blue")[unclass(iris$Species)])

  6. R and R packages
     > classifier <- naiveBayes(iris[,1:4], iris[,5])
     > table(predict(classifier, iris[,-5]), iris[,5])

                  setosa versicolor virginica
       setosa         50          0         0
       versicolor      0         47         3
       virginica       0          3        47
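The confusion matrix on this slide is computed on the same 150 records the classifier was trained on, so it tends to flatter the model. A minimal sketch of a held-out evaluation follows; the 70/30 split and the seed are assumptions chosen for illustration, not part of the original slides.

```r
# Assumes the e1071 package is already installed (see the install.packages() call on slide 5).
library(e1071)

data(iris)
set.seed(42)  # arbitrary seed, only for reproducibility

# Hold out 30% of the records for testing (the ratio is an assumption).
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit the Naive Bayes classifier on the training records only.
classifier <- naiveBayes(train[, 1:4], train[, 5])

# Confusion matrix and accuracy on records the model has not seen.
pred <- predict(classifier, test[, 1:4])
confusion <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

The exact counts depend on the random split, so they will differ from the table on the slide.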

  7. Machine Learning for Official Statistics
     • Automatic Coding
     • Editing and Imputation
     • Record Linkage
     • Other Methods

  8. Automatic coding
     • Automatic coding via Bayesian classifier: caret, klaR
     • Automatic occupation coding via CASCOT: algorithm not described
     • Automatic coding via open-source indexing utility: ?
     • Automatic coding of census variables via SVM: e1071 (interface to libsvm)

  9. Editing and Imputation
     • Categorical data imputation via neural networks and Bayesian networks: neuralnet, gRain, bnlearn, deal
     • Identification of error-containing records via classification trees: rpart, tree, caret
     • Imputation donor pool screening via cluster analysis: class, klaR, cluster, kmeans(), hclust()
     • Imputation via Classification and Regression Trees (CART): rpart, caret, RWeka
     • Determination of imputation matching variables via Random Forests: randomForest
     • Creation of homogeneous imputation classes via CART: rpart
     • Derivation of edit rules via association analysis: arules
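As an illustration of the "imputation via CART" item, here is a minimal sketch using rpart. The iris data stand in for a survey file, and the missing values are simulated; both are assumptions made for the example, not something described in the slides.

```r
library(rpart)  # recommended package, shipped with standard R distributions

data(iris)
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulate item non-response: blank out Petal.Length in 20 random records.
survey <- iris
missing_rows <- sample(nrow(survey), 20)
survey$Petal.Length[missing_rows] <- NA

# Grow a regression tree on the records where the variable was observed.
observed <- survey[!is.na(survey$Petal.Length), ]
tree <- rpart(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species,
              data = observed, method = "anova")

# Impute each missing value with the mean of the leaf the record falls into.
survey$Petal.Length[missing_rows] <- predict(tree, newdata = survey[missing_rows, ])

# Because the true values are known here, the imputation error can be checked.
mae <- mean(abs(survey$Petal.Length[missing_rows] - iris$Petal.Length[missing_rows]))
print(mae)
```

The leaves of the fitted tree also define groups of similar records, which is the idea behind the "homogeneous imputation classes via CART" item.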

  10. Record Linkage
      • Weighting vector classification:
        • The last major step in record linkage or record de-duplication
        • Can be understood as a classification problem
      • In R: rpart; bagging() in package ipred; ada; svm() in package e1071; nnet() in package nnet
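To make the classification view concrete, the sketch below trains an rpart tree to label candidate record pairs as matches or non-matches. The comparison vectors are simulated, and the field names (name_sim, address_sim, birth_agree) are hypothetical; a real application would compute them from the two files being linked.

```r
library(rpart)  # recommended package, shipped with standard R distributions

set.seed(123)  # arbitrary seed, for reproducibility only
n_match <- 200
n_nonmatch <- 800

# Simulated comparison vectors: one agreement score per compared field.
# True matches tend to score high, non-matches tend to score low.
cand_pairs <- data.frame(
  name_sim    = c(rbeta(n_match, 8, 2), rbeta(n_nonmatch, 2, 8)),
  address_sim = c(rbeta(n_match, 6, 3), rbeta(n_nonmatch, 2, 6)),
  birth_agree = c(rbinom(n_match, 1, 0.95), rbinom(n_nonmatch, 1, 0.05)),
  is_match    = factor(c(rep("match", n_match), rep("non-match", n_nonmatch)))
)

# Train on 700 labelled pairs, evaluate on the remaining 300.
train_idx <- sample(nrow(cand_pairs), 700)
fit <- rpart(is_match ~ ., data = cand_pairs[train_idx, ], method = "class")

pred <- predict(fit, newdata = cand_pairs[-train_idx, ], type = "class")
actual <- cand_pairs$is_match[-train_idx]
accuracy <- mean(pred == actual)

print(table(predicted = pred, actual = actual))
print(accuracy)
```

The simulated classes are well separated, so the accuracy here says little about real linkage problems, where labelled training pairs are usually scarce.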

  11. Other Methods
      • Questionnaire consolidation via cluster analysis: class, klaR, cluster, ...
      • Forming non-response weighting groups via classification trees: rpart, tree, caret
      • Non-respondent prediction via classification trees: rpart, tree, caret
      • Analysis of reporting errors via classification trees: rpart, tree, caret
      • Substitutes for surveys via internet scraping: scrapeR, rvest
      • Tax evader detection via k-nearest neighbours: class, kknn
      • Crop yield estimation via image processing on satellite imaging data: is this ML?
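The k-nearest-neighbours item can be sketched with knn() from the class package. The slides do not describe the tax data, so iris is used as a stand-in; the choice of k = 5 and the split size are assumptions for illustration.

```r
library(class)  # recommended package, shipped with standard R distributions

data(iris)
set.seed(7)  # arbitrary seed, for reproducibility only

# Hold out 50 records for testing.
test_idx <- sample(nrow(iris), 50)

# kNN is distance based, so standardise the features using training statistics only.
train_x <- scale(iris[-test_idx, 1:4])
test_x  <- scale(iris[test_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Classify each test record by majority vote among its 5 nearest training records.
pred <- knn(train = train_x, test = test_x, cl = iris$Species[-test_idx], k = 5)

accuracy <- mean(pred == iris$Species[test_idx])
print(table(predicted = pred, actual = iris$Species[test_idx]))
print(accuracy)
```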

  12. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
      • Fernandez-Delgado, Cernadas, Barro (2014)
      • Evaluate 179 classifiers arising from 17 families on 121 data sets
      • By far the best are random forests and SVM with Gaussian kernel
      • Most of the best classifiers are implemented in R and tuned using caret, which seems the best alternative for selecting a classifier implementation

  13. Top 10 ML/DM Algorithms, Xindong Wu and Vipin Kumar (2009)
      • C4.5: generates classifiers expressed as decision trees or in ruleset form
      • K-Means: simple iterative method to partition a given dataset into a user-specified number of clusters, k
      • SVM: support vector machines
      • Apriori: derives association rules
      • EM: Expectation-Maximization algorithm
      • PageRank: produces a static ranking of Web pages
      • AdaBoost: ensemble learning
      • kNN: k-nearest neighbor classification
      • Naive Bayes: simple classifier applying Bayes' theorem with independence assumptions between the features
      • CART: Classification and Regression Trees
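As a small illustration of one entry in this list, k-means is available in base R as kmeans(). The sketch below partitions the iris measurements into k = 3 clusters; the data set, the value of k, and the number of random restarts are assumptions chosen for the example.

```r
data(iris)
set.seed(2024)  # arbitrary seed, for reproducibility only

# Standardise the four measurements so no single variable dominates the distances.
features <- scale(iris[, 1:4])

# Partition into a user-specified number of clusters (k = 3),
# keeping the best of 25 random starts.
km <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate the clusters against the known species (not used in fitting).
print(table(cluster = km$cluster, species = iris$Species))
print(km$tot.withinss)
```

The cluster labels 1 to 3 are arbitrary, so the cross-tabulation shows which species dominate each cluster without implying any fixed correspondence.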

  14. The best R packages for ML
      • e1071: Naive Bayes, SVM, latent class analysis
      • rpart: classification and regression trees
      • randomForest: random forests
      • gbm: generalized boosted regression models
      • kernlab: SVM
      • caret: Classification and Regression Training
      • neuralnet: neural networks
      See also the CRAN Task View: Machine Learning & Statistical Learning

  15. Machine learning books
