About me • Educational background – Applied Econometrics • 4 years statistical modelling experience • R experience – 2 years • Currently Senior Analyst at Deloitte • Hobby – rock climbing, data mining competitions • Why? - Early retirement • Current interest – Text analytics
Topic: The benefits of R from a data mining competitor’s point of view and from the point of view of an employee at Deloitte • Work: professional and pragmatic • Home: the playful scientist
Agenda • Quick introduction to R • What I use R for • R at work • Introduction to Deloitte • Frequently used tools • Some of the work we do using R • Examples • Challenges: Data Storage • Challenges: Standardisation • How Deloitte is addressing this issue • R at home: • Some of the work I do using R, at home • Flexibility and convenience • Examples • Prototyping and experimenting • Examples • Questions • Essential R packages for everyday use
Quick introduction to R • “A statistical software created by statisticians, for statisticians” • Personally, I use R for data analysis and statistical modelling • Unique features worth noting: • Open source – free, easy to find help in the active community • Understands mathematical computations and matrix operations naturally • Thousands of packages, implementations of almost any algorithm
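As a small illustration of the "matrix operations naturally" point, the sketch below uses base R only. The matrix, vector and simulated regression data are made up for the example and do not come from the slides.

```r
# Illustrative values only; not taken from the slides.
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # 2 x 2 matrix, filled column by column
b <- c(1, 2)

A %*% A             # matrix product
t(A)                # transpose
x <- solve(A, b)    # solves A x = b without forming the inverse explicitly

# Ordinary least squares via the normal equations, on simulated data
set.seed(1)
X <- cbind(1, rnorm(50))                 # design matrix with an intercept column
y <- 3 + 2 * X[, 2] + rnorm(50)
beta <- solve(t(X) %*% X, t(X) %*% y)    # solves (X'X) beta = X'y
```

The coefficients in `beta` should agree with `coef(lm(y ~ X[, 2]))`, which is the more usual way to fit the same model.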
Introduction to R • Thousands of packages, implementations of almost any algorithm • Packages: ggplot2, EBImage, randomForest, etc. (N = 500+)
Introduction to Deloitte • We help clients capture, manage and analyse data to solve important business problems and make informed decisions • A holistic process of data mining
Introduction to Deloitte: Typical activity involved in a project at Deloitte • But not everything is R • 20% - 40% of time is spent on modelling • [Figure: level of activity over the project timeline, with labels: initiating processes, planning processes, data loading, data preparation, modelling, closing processes]
Frequently used tools • Geospatial analytics - Tactician • Segmentation - Self-organising maps • SQL Server • Modelling • Visualisation
Some of the work we do using R • In Deloitte • Statistical Analysis and Predictive modelling • Time series analysis • Social Network Analysis • Data visualisation • Text analytics (NEW!)
Examples: Time Series • [Figure: actual vs. fitted (estimated) series; y-axis: retail activity (?); x-axis: time (days)] • R package: forecast
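A plot of actual against fitted values like the one on this slide can be sketched as follows. This is not the presenter's code: to stay self-contained it uses `HoltWinters()` from base R rather than the forecast package, and the daily series is simulated, with an assumed weekly pattern standing in for the retail data.

```r
# Synthetic daily series with a weekly pattern; a stand-in for the retail data.
set.seed(42)
n <- 140
day <- seq_len(n)
y <- ts(100 + 0.2 * day + 10 * sin(2 * pi * day / 7) + rnorm(n, sd = 2),
        frequency = 7)

fit <- HoltWinters(y)               # exponential smoothing with trend and seasonality
fc <- predict(fit, n.ahead = 14)    # forecast the next 14 days

# Observed series with fitted values and the forecast overlaid
plot(fit, fc)
```

With the forecast package installed, a similar workflow would typically go through `forecast::auto.arima()` or `forecast::ets()` followed by `forecast::forecast()`.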
Challenges: Data Storage • We have a dedicated tool to store and clean data – SQL • R holds data in memory, so it struggles with large data sets • Error: cannot allocate vector of size 2097151 Kb
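One common way around the memory limit, when only a summary is needed, is to process the data in chunks instead of loading it all at once. The sketch below is an illustration and not the approach described in the slides (which keep the data in SQL). The file, its column names and the chunk size are all hypothetical.

```r
# Hypothetical example file, written here so the sketch runs on its own.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, amount = rep(c(1.5, 2.5), 500)),
          path, row.names = FALSE, quote = FALSE)

con <- file(path, open = "rt")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]   # consume the header line

chunk_size <- 250
total <- 0
rows <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = col_names, nrows = chunk_size),
    error = function(e) NULL    # read.csv signals an error once no lines remain
  )
  if (is.null(chunk)) break
  total <- total + sum(chunk$amount)    # keep only the running summary in memory
  rows <- rows + nrow(chunk)
  if (nrow(chunk) < chunk_size) break   # a short chunk means the file is exhausted
}
close(con)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the size of the file. Pushing the aggregation into the SQL query itself achieves the same effect when the data already lives in a database.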
Challenges: Standardisation • “You’re not the only one using it”: one of the reasons why other commercial tools are preferred over R • Transferable skills across the team • Reliability of packages • Standardised functions and procedures
How Deloitte is addressing this issue • Creating a standardised process • R package: RODBC
How Deloitte is addressing this issue • Creating standardised functions:
# Density plot for the subject variable; col is the column number
library(ggplot2)
DensityPlot <- function(dataset, col) {
  ds <- data.frame(dataset)
  ds$c <- ds[, col]   # copy the chosen column to a fixed name for aes()
  ggplot(data = ds, aes(x = c)) + geom_density(kernel = "biweight")
}
DensityPlot(dataset, 1)   # second argument is the column number
• Retrieving data from the database (RODBC):
library(RODBC)
conn <- odbcDriverConnect("driver=SQL Server; database=DataBaseName; server=servername;")
query <- "SELECT * FROM TableName"   # plain ASCII quotes; curly quotes are a syntax error in R
df <- sqlQuery(conn, query)
odbcClose(conn)   # release the connection when done
R package: RODBC
Some of the work I do using R, at home At home (data mining competitions) • Statistical analysis and Predictive modelling • Time series analysis • Social Network Analysis • Data visualisation • Text analytics • Image analysis • (I mainly use R) • In Deloitte • Statistical Analysis and Predictive modelling • Time series analysis • Social Network Analysis • Data visualisation • Text analytics (NEW!) • (we don’t just use R)
Flexibility and convenience • R is one of the easier programming languages to pick up • Dive into the analysis quickly
Examples • Image analysis R package: EBImage
Prototyping and experimenting • Access to the latest, most innovative techniques • Great for prototyping new algorithms
Examples: Text analytics • R package: twitteR
Examples: Word cloud of Twitter feeds • R package: wordcloud
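A word cloud is driven by word frequencies, so the core step can be sketched in base R. The texts and the tiny stop-word list below are made up for illustration; the slides used tweets fetched with twitteR, and the presenter's actual preprocessing is not shown in the source.

```r
# Made-up example texts standing in for tweets.
tweets <- c("R is great for text analytics",
            "Text analytics with R is fun",
            "Data mining competitions are fun")

# Lower-case, split on anything that is not a letter, drop empty tokens
words <- unlist(strsplit(tolower(tweets), "[^a-z]+"))
words <- words[nchar(words) > 0]

stop_words <- c("is", "for", "with", "are")   # tiny illustrative stop list
words <- words[!words %in% stop_words]

freq <- sort(table(words), decreasing = TRUE)  # word counts, most frequent first

# With the wordcloud package installed, these counts can be plotted as:
# wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
```

For real tweets, the tm package listed later in the deck offers fuller preprocessing (stop words, stemming, term-document matrices) than this hand-rolled version.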
Examples: Text analytics • What are the common themes being tweeted by Time magazine?
[Figure: top words associated with each tweet classification (A, B, C, D)] • R package: ggplot2
Essential R packages for everyday use • Essential • ggplot2 • reshape • RODBC • randomForest • rpart • Nice to have • caret • forecast • tm