Introduction to Data Mining with Weka. Data Science and Business Analytics Denver Meetup Nancy Abramson Principal Data Scientist. Agenda. Introduction What does Open Source mean? Data Science and Data Mining Open Source Data Mining Tools Weka Overview Profiling Demonstration
Introduction to Data Mining with Weka Data Science and Business Analytics Denver Meetup Nancy Abramson Principal Data Scientist
Agenda • Introduction • What does Open Source mean? • Data Science and Data Mining • Open Source Data Mining Tools • Weka • Overview • Profiling Demonstration • Analysis Demonstration • Summary
Introduction – Who am I? • Datasource Consulting employee for the past 3 years, developing, using, and evaluating open source and enterprise Business Intelligence tools • New hire at spotXchange as Principal Data Scientist • Bachelor of Science degree in Computer Science & Mathematics • Master's in Applied Statistics • Experience with databases, ETL, and analytics • Using “Open Source” or “free software” for more than 25 years • Market analysis in aerospace, financial, telephony, and retail
What is Open Source? • A software development project in which code is developed by peer production and collaboration, with the end-product, source-code and documentation available at no cost to the public. • Free Access to Source Code • Free Redistribution • Strong development community • Examples: • Linux • Hadoop • Apache/Tomcat • MySQL • Weka
Data Science and Data Mining • Data Science process defined by Dr. DJ Patil, former head of Data Analytics at LinkedIn • Clean-up and preparation of data • Create measurable levers to increase the value of the business • Monitor the state of metrics for changes • Experiment with the results of the models • Traditional Data Mining is used for… • Profiling data to check for quality, e.g. max, min, data types, and patterns between variables • Finding relationships between variables or independent variables, e.g. clusters, regressions • Checking variance of a measure over time • Determining whether an experiment produced significant results
Profiling and Heavy Lifting • Fun Stuff • See what you never thought possible • Name: Mr. Ed • Genus: Equus • Address: Apt 302, Manhattan, NY 10033
Data Mining Tools • Reference: http://www.phiresearchlab.org/downloads/OpenSourceDataMining.pdf
Weka Introduction • Waikato Environment for Knowledge Analysis (WEKA) • Developed by the University of Waikato, New Zealand • Java-based, distributed under the GNU General Public License • Explorer • Preprocessing, attribute selection, learning, visualization • Experimenter • Testing and evaluating machine learning algorithms • Knowledge Flow • Data-flow interface to WEKA • SimpleCLI
Workflow: load → filter → analyze
Weka Pre-process Demo • Load and view CSV data • Compare pairs of attributes • Examine min/max data values • Compare nominal and numeric values • Save in ARFF format • Derived from Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html
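The min/max and nominal-versus-numeric checks in this demo can be approximated outside Weka. The following is a minimal sketch in plain Python, not Weka's own code. It uses the first three sample rows from the deck's ARFF slide, cut down to four columns with the leading spaces stripped. The `ROWS` and `profile` names are hypothetical.

```python
from collections import Counter

# First three sample rows from the deck, reduced to four columns
# (leading spaces in the nominal values removed for readability).
ROWS = [
    {"age": 39, "workclass": "State-gov", "fnlwgt": 77516, "wage": "<=50K"},
    {"age": 50, "workclass": "Self-emp-not-inc", "fnlwgt": 83311, "wage": "<=50K"},
    {"age": 38, "workclass": "Private", "fnlwgt": 215646, "wage": "<=50K"},
]


def profile(rows):
    """Summarize each column: min/max/mean if numeric, value counts otherwise."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            summary[column] = {
                "type": "numeric",
                "min": min(values),
                "max": max(values),
                "mean": sum(values) / len(values),
            }
        else:
            summary[column] = {"type": "nominal", "counts": dict(Counter(values))}
    return summary


for name, stats in profile(ROWS).items():
    print(name, stats)
```

A column is treated as numeric only when every value is a number, which is a simplification: real data with missing values (the `?` entries in the census file) would need extra handling.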
Attribute-Relation File Format
@relation workers
@attribute age numeric
@attribute workclass {' State-gov',' Self-emp-not-inc',' Private',' Federal-gov',' Local-gov',' ?',' Self-emp-inc',' Without-pay',' Never-worked'}
@attribute ' fnlwgt' numeric
:
@attribute ' wage' {' <=50K',' >50K'}
@data
39,' State-gov',77516,' Bachelors',13,' Never-married',' Adm-clerical',' Not-in-family',' White',' Male',2174,0,40,' United-States',' <=50K'
50,' Self-emp-not-inc',83311,' Bachelors',13,' Married-civ-spouse',' Exec-managerial',' Husband',' White',' Male',0,0,13,' United-States',' <=50K'
38,' Private',215646,' HS-grad',9,' Divorced',' Handlers-cleaners',' Not-in-family',' White',' Male',0,0,40,' United-States',' <=50K'
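To make the file layout concrete, here is a rough sketch of a reader for small, dense ARFF files like the one above. It is an illustration in plain Python, not Weka's parser, and it ignores sparse data, dates, and escaped quotes. `SAMPLE` is a shortened, hypothetical three-attribute version of the slide's file, and `parse_arff` and `split_name` are made-up helper names.

```python
import csv

# Shortened, hypothetical version of the slide's file (three attributes only).
SAMPLE = """\
@relation workers
@attribute age numeric
@attribute workclass {' State-gov',' Private'}
@attribute ' wage' {' <=50K',' >50K'}
@data
39,' State-gov',' <=50K'
38,' Private',' <=50K'
"""


def split_name(rest):
    """Split the text after '@attribute' into (name, type specification)."""
    if rest.startswith("'"):
        end = rest.index("'", 1)  # closing quote of a quoted attribute name
        return rest[1:end], rest[end + 1:].strip()
    name, spec = rest.split(None, 1)
    return name, spec.strip()


def parse_arff(text):
    """Return (relation, attributes, rows) for a small, dense ARFF document."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            # Data rows are comma-separated; single quotes protect spaces.
            rows.append(next(csv.reader([line], quotechar="'")))
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            name, spec = split_name(line.split(None, 1)[1])
            if spec.startswith("{"):
                # Nominal attribute: the braces list the allowed values.
                spec = next(csv.reader([spec[1:-1]], quotechar="'"))
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows, sep="\n")
```

Note that the leading spaces inside the quotes are part of the values, so `' wage'` parses to the name `" wage"`. This matches how the sample file was written.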
Weka Classify Features • 49 data preprocessing tools • 76 classification/regression algorithms • 8 clustering algorithms • 15 attribute/subset evaluators + 10 search algorithms for feature selection • 3 algorithms for finding association rules • Derived from Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html
Linear Regression • Predicted attribute is continuous • Correlation coefficient indicates how well the model fits the data • Measures the strength and the direction of a linear relationship • -1 ≤ r ≤ +1 • A correlation greater than 0.8 is generally described as strong, depending on the type of data • Uses • Forecasting • Exploring factor effects • Demo: cpu.arff
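For reference, the correlation coefficient described on this slide can be computed by hand. This is a minimal sketch of the standard Pearson formula in plain Python. The function name `pearson_r` and the sample numbers are invented for illustration, and the figure Weka reports for a regression model (predicted versus actual values) may be computed differently in detail.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: a perfectly linear increasing and decreasing relationship.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The two calls show the extremes of the range: a perfect positive relationship gives r = +1 and a perfect negative one gives r = -1.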
Classification • Predicted attribute is categorical • Implemented methods • Naïve Bayes • decision trees and rules • neural networks • support vector machines • Demo: J48 decision tree with weather.arff
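As a rough illustration of how a decision tree learner chooses its first split, the sketch below computes entropy and information gain in plain Python. It is not the J48 code: J48 is Weka's implementation of C4.5, which ranks splits by gain ratio and also prunes the tree, whereas this shows only the simpler information-gain step. The class counts are the ones commonly quoted for the classic 14-row weather data and should be treated as illustrative. All names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(class_counts, split_counts):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(class_counts)
    remainder = sum(
        sum(branch) / total * entropy(branch) for branch in split_counts
    )
    return entropy(class_counts) - remainder


# Illustrative (play=yes, play=no) counts for each value of 'outlook'.
OUTLOOK_SPLIT = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

gain = information_gain((9, 5), OUTLOOK_SPLIT.values())
print(f"gain(outlook) = {gain:.3f}")
```

An attribute whose branches are mostly pure, such as the all-yes "overcast" branch here, leaves little entropy behind and therefore scores a high gain.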
That’s All Nancy Abramson nabramson@ieee.org 720-468-1796 ?