Introduction to Data Mining with Weka

Data Science and Business Analytics Denver Meetup
Nancy Abramson, Principal Data Scientist

Agenda
• Introduction
• What does Open Source mean?
• Data Science and Data Mining
• Open Source Data Mining Tools
• Weka
  • Overview
  • Profiling Demonstration
  • Analysis Demonstration
• Summary

Introduction – Who am I?
• Datasource Consulting employee for the past 3 years, developing, using, and evaluating open source and enterprise Business Intelligence tools
• New hire at spotXchange as Principal Data Scientist
• Bachelor of Science degree in Computer Science & Mathematics
• Master's degree in Applied Statistics
• Experience with databases, ETL, and analytics
• Using “Open Source” or “free software” for more than 25 years
• Market analysis in aerospace, financial, telephony, and retail

What is Open Source?
• A software development project in which code is developed by peer production and collaboration, with the end product, source code, and documentation available at no cost to the public.
• Free access to source code
• Free redistribution
• Strong development community
• Examples:
  • Linux
  • Hadoop
  • Apache/Tomcat
  • MySQL
  • Weka

Data Science and Data Mining
• Data Science process as defined by Dr. DJ Patil, former head of Data Analytics at LinkedIn
  • Clean up and prepare the data
  • Create measurable levers to increase the value of the business
  • Monitor the state of metrics for changes
  • Experiment with the results of the models
• Traditional Data Mining is used for…
  • Profiling data to check for quality, e.g. max, min, data types, and patterns between variables
  • Finding relationships between variables (dependent or independent), e.g. clusters, regressions
  • Checking the variance of a measure over time
  • Determining whether, and at what level, an experiment produced significant results

Profiling and Heavy Lifting
• Fun Stuff
• See what you never thought possible
  • Name: Mr. Ed
  • Genus: Equus
  • Address: Apt 302, Manhattan, NY 10033

Data Mining Tools
• Reference: http://www.phiresearchlab.org/downloads/OpenSourceDataMining.pdf

Weka Introduction
• Waikato Environment for Knowledge Analysis (WEKA)
• Developed by the University of Waikato, New Zealand
• Java-based, distributed under the GNU General Public License
• Explorer
  • Preprocessing, attribute selection, learning, visualization
• Experimenter
  • Testing and evaluating machine learning algorithms
• Knowledge Flow
  • Data-flow interface to WEKA
• SimpleCLI

Load → Filter → Analyze

Weka Pre-process Demo
• Load and view CSV data
• Compare pairs of attributes
• Examine min/max data values
• Compare nominal and numeric values
• Save in ARFF format
• Demo data derived from the Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html
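The profiling steps in this demo (view the data, examine min/max, separate nominal from numeric attributes) can be approximated outside Weka. The following is a minimal Python sketch, not what the Explorer does internally. The function name `profile_csv` and the three-column sample are illustrative assumptions; the sample values are taken from the first rows of the census extract shown on the ARFF slide.

```python
import csv
import io


def profile_csv(text):
    """Summarize each column as numeric (min/max) or nominal (value counts)."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    columns = {}
    for row in reader:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)

    summary = {}
    for name, values in columns.items():
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            # At least one value is not a number: treat the column as nominal.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            summary[name] = {"type": "nominal", "counts": counts}
        else:
            summary[name] = {"type": "numeric",
                             "min": min(numbers), "max": max(numbers)}
    return summary


# Hypothetical three-column extract; a real file has many more attributes.
SAMPLE = """age, workclass, wage
39, State-gov, <=50K
50, Self-emp-not-inc, <=50K
38, Private, <=50K
"""

summary = profile_csv(SAMPLE)
for column, stats in summary.items():
    print(column, stats)
```

Note that a single missing-value marker such as `?` in a numeric column would make this sketch classify the whole column as nominal, which is itself a useful data-quality signal during profiling.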

Attribute-Relation File Format (ARFF)
@relation workers
@attribute age numeric
@attribute workclass {' State-gov',' Self-emp-not-inc',' Private',' Federal-gov',' Local-gov',' ?',' Self-emp-inc',' Without-pay',' Never-worked'}
@attribute ' fnlwgt' numeric
:
@attribute ' wage' {' <=50K',' >50K'}
@data
39,' State-gov',77516,' Bachelors',13,' Never-married',' Adm-clerical',' Not-in-family',' White',' Male',2174,0,40,' United-States',' <=50K'
50,' Self-emp-not-inc',83311,' Bachelors',13,' Married-civ-spouse',' Exec-managerial',' Husband',' White',' Male',0,0,13,' United-States',' <=50K'
38,' Private',215646,' HS-grad',9,' Divorced',' Handlers-cleaners',' Not-in-family',' White',' Male',0,0,40,' United-States',' <=50K'
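To make the layout of an ARFF file concrete, here is a small Python sketch that reads the header and data sections of a shortened version of the file above. It is an illustration only, not Weka's own loader: the function `parse_arff` is a hypothetical helper, it handles only the `@relation`, `@attribute`, and `@data` constructs shown on the slide, and the sample keeps just three of the attributes.

```python
import csv
import re

# An attribute name is either single-quoted or a run of non-blank characters.
ATTRIBUTE_RE = re.compile(r"@attribute\s+('[^']*'|\S+)\s+(.+)", re.IGNORECASE)


def split_quoted(text):
    """Split a comma-separated list whose items may be single-quoted."""
    return next(csv.reader([text], quotechar="'", skipinitialspace=True))


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple ARFF document."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            rows.append(split_quoted(line))
            continue
        keyword = line.lower()
        if keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            match = ATTRIBUTE_RE.match(line)
            name = match.group(1).strip("'")
            kind = match.group(2).strip()
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                attributes.append((name, split_quoted(kind[1:-1])))
            else:
                attributes.append((name, kind.lower()))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Shortened sample (assumption: three attributes instead of the full set).
SAMPLE_ARFF = """@relation workers
@attribute age numeric
@attribute workclass {' State-gov',' Private'}
@attribute ' wage' {' <=50K',' >50K'}
@data
39,' State-gov',' <=50K'
38,' Private',' <=50K'
"""

relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, attributes, rows)
```

The leading blanks inside the quoted values are preserved on purpose: they are part of the values in the file, most likely because the original CSV used ", " as its separator.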

Weka Classify Features
• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 15 attribute/subset evaluators + 10 search algorithms for feature selection
• 3 algorithms for finding association rules

Linear Regression
• Predicted attribute is continuous
• The correlation coefficient (r) indicates how well the model fits the data
  • Measures the strength and the direction of a linear relationship
  • -1 ≤ r ≤ +1
  • A correlation greater than 0.8 in absolute value is generally described as strong, depending on the type of data
• Uses
  • Forecasting
  • Exploring factor effects
• Demo: cpu.arff
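The correlation coefficient mentioned on this slide can be computed by hand. The sketch below implements the standard Pearson formula in plain Python; the function name `pearson_r` and the toy numbers are assumptions for illustration and are not taken from cpu.arff. When evaluating a regression model, the two sequences would typically be the predicted and the actual values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Toy data (illustrative only).
perfect_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
perfect_negative = pearson_r([1, 2, 3], [3, 2, 1])
print(perfect_positive, perfect_negative)
```

A perfectly increasing linear relationship gives r close to +1 and a perfectly decreasing one gives r close to -1, which is why the range includes both endpoints.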

Classification
• Predicted attribute is categorical
• Implemented methods
  • Naïve Bayes
  • Decision trees and rules
  • Neural networks
  • Support vector machines
• Demo: J48 decision tree with weather.arff
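As a rough illustration of how a decision tree chooses its first split, the Python sketch below computes information gain for two nominal attributes. Several assumptions apply: the rows are the nominal form of the well-known toy weather data, reproduced here for illustration and worth checking against the file used in the demo; only the outlook, windy, and play columns are included; and plain information gain is used, whereas J48 (Weka's C4.5 implementation) relies on the related gain ratio and also handles numeric attributes and pruning.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, attribute_index, class_index=-1):
    """Reduction in class entropy obtained by splitting on one attribute."""
    labels = [row[class_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute_index], []).append(row[class_index])
    remainder = sum(len(group) / len(rows) * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


# Columns: outlook, windy, play (illustrative subset of the toy weather data).
WEATHER = [
    ("sunny", "FALSE", "no"), ("sunny", "TRUE", "no"),
    ("overcast", "FALSE", "yes"), ("rainy", "FALSE", "yes"),
    ("rainy", "FALSE", "yes"), ("rainy", "TRUE", "no"),
    ("overcast", "TRUE", "yes"), ("sunny", "FALSE", "no"),
    ("sunny", "FALSE", "yes"), ("rainy", "FALSE", "yes"),
    ("sunny", "TRUE", "yes"), ("overcast", "TRUE", "yes"),
    ("overcast", "FALSE", "yes"), ("rainy", "TRUE", "no"),
]

gain_outlook = information_gain(WEATHER, 0)
gain_windy = information_gain(WEATHER, 1)
print(f"outlook: {gain_outlook:.3f}  windy: {gain_windy:.3f}")
```

With these rows, outlook yields the larger gain, so a tree built on this criterion would test outlook before windy.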

That’s All
Nancy Abramson
nabramson@ieee.org
720-468-1796
?
