
Knowledge Discovery in Databases


Presentation Transcript


  1. Knowledge Discovery in Databases MIS 637 Professor Mahmoud Daneshmand Fall 2012 Final Project: Red Wine Recipe Data Mining By Jorge Madrazo

  2. Profound Questions • What basic properties make up the formula for a good wine? • Wine making is believed to be an art, but is there a formula for a quality wine? • The providers of the data set published a paper, “Modeling wine preferences by data mining.” How do my results compare with the paper’s?

  3. Procedure • Follow a data mining process • Use SAS and SAS Enterprise Miner to execute the process • The SAS Enterprise Miner tool is modeled on the SAS Institute-defined data mining process SEMMA – Sample, Explore, Modify, Model, Assess • SEMMA is similar to the CRISP-DM process

  4. Sample • 1,599 records • Set up a data partition • Training 40% • Validation 30% • Test 30%
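A minimal base-SAS sketch of such a 40/30/30 split follows (Enterprise Miner's Data Partition node does this internally; the wino libref comes from the project code, and the seed and output data set name are arbitrary assumptions):

   data wino.red_part;
      set wino.red;
      if _n_ = 1 then call streaminit(2012);        /* arbitrary seed */
      u = rand('uniform');
      if u < 0.4 then partition = 'TRAIN   ';       /* ~40% training   */
      else if u < 0.7 then partition = 'VALIDATE';  /* ~30% validation */
      else partition = 'TEST    ';                  /* ~30% test       */
      drop u;
   run;

   proc freq data=wino.red_part;                    /* check the split sizes */
      tables partition;
   run;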

  5. Explore: Data Background • Data source • UCI Machine Learning Repository. • Wine Quality Data Set. • There are red and white wine data sets; I focused on the red wine set only. • There are 11 input variables and one target variable. • fixed acidity • volatile acidity • citric acid • residual sugar • chlorides • free sulfur dioxide • total sulfur dioxide • density • pH • sulphates • alcohol • Output variable (based on sensory data): quality (score between 0 and 10)
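As a sketch, the red-wine file from the UCI repository (winequality-red.csv, which is semicolon-delimited) can be read into the project library roughly as follows; the file path and the wino library location are assumptions:

   options validvarname=v7;          /* so "fixed acidity" becomes fixed_acidity, etc. */
   libname wino 'C:\data\wine';      /* assumed location of the project library */

   proc import datafile='C:\data\wine\winequality-red.csv'
               out=wino.red dbms=dlm replace;
      delimiter=';';                 /* the UCI file is ';'-separated */
      getnames=yes;
   run;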

  6. Explore: Target=Quality • Quality • Tasters gave a quality assessment of each wine on a scale of 0-10; the actual range in the data is 3-8. • An ordinal target
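For example, the spread of the ordinal target can be checked with a one-way frequency table:

   proc freq data=wino.red;          /* distribution of quality ratings (observed range 3-8) */
      tables quality / nocum;
   run;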

  7. Explore: Inputs • Correlation Analysis • Some correlation, but not enough to discard inputs • SAS Code:

   ods graphics on;
   ods select MatrixPlot;
   proc corr data=wino.red plots(maxpoints=100000)=matrix(histogram nvar=all);
      var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;
   run;

  8. Explore: Correlation Graphs

  10. Explore: Chi-Square Statistics of Inputs

  10. Explore: Worth of Inputs

  11. Explore: Worth Graph • The worth of the inputs tracks closely with the chi-square statistic

  12. Modify • At this stage, no modifications are done

  13. Model: Selection • Because I want to list the important elements in what is considered a quality wine, I choose a Decision Tree • Configuration • The Splitting Rule is Entropy • Maximum Branch is set to 5 • Therefore a C4.5 type of algorithm is being implemented
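The tree here is built with the Enterprise Miner Decision Tree node; as a rough code analogue only (not the node's exact implementation), PROC HPSPLIT in SAS/STAT can grow a tree with the entropy criterion and up to five branches per split:

   proc hpsplit data=wino.red maxbranch=5;   /* at most 5 branches per split */
      class quality;                         /* treat quality as a categorical target */
      model quality = fixed_acidity volatile_acidity citric_acid residual_sugar
                      chlorides free_sulfur_dioxide total_sulfur_dioxide
                      density ph sulphates alcohol;
      grow entropy;                          /* entropy splitting rule */
      prune costcomplexity;
   run;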

  14. Assess: Initial Results • A bushy tree results; it is too intricate for a simple recommendation. • Over 20 leaf nodes.

  15. Modify: Target • Change the target so that it becomes binary. • A new variable in the model called isGood; any rating over 6 is categorized as good (isGood = 1). • SAS Code:

   data wino.xx;
      set wino.red;
      if (quality > 6) then isgood = 1;
      else isgood = 0;
   run;

   proc print data=wino.xx;
      title 'xx';
   run;

  16. Explore: Target = isGood

  17. Model Strategy for isGood • Model with a decision tree in the hope of more descriptive results. • Also model with a neural network to aid assessment and allow comparison

  18. Model: Decision Tree • ProbF splitting criterion at significance level 0.2 • Maximum branch size = 5

  19. Assess: Decision Tree Results • Much simpler Tree

  20. Assess: Decision Tree Results 2 • Leaf Statistics

  21. Assess: Variable Importance

  22. Model: Neural Network • Positive – better at predicting • Negative – hard to interpret the model • Configured with 3 Hidden Nodes

  23. Modify: Input Variables to NN • Because of the complexity of the NN, it is recommended to prune variables prior to running the network.

  24. Modify: R2 Filter
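The filter itself is the Enterprise Miner Variable Selection node with the R-square criterion; a rough base-SAS stand-in (an approximation, not the node's exact method) is an R-square subset search with PROC REG against the 0/1 target:

   proc reg data=wino.xx plots=none;
      /* BEST=1 reports the single best subset of each size by R-square */
      model isgood = fixed_acidity volatile_acidity citric_acid residual_sugar
                     chlorides free_sulfur_dioxide total_sulfur_dioxide
                     density ph sulphates alcohol
            / selection=rsquare best=1;
   run;
   quit;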

  25. Model: NN • Specify 3 Hidden Units in the Hidden Layer
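A hedged sketch of an equivalent fit outside Enterprise Miner uses PROC HPNEURAL; the input list below is inferred from the parameter estimates on the next slide, and the scored output data set name is an assumption:

   proc hpneural data=wino.xx;
      input alcohol chlorides density fixed_acidity
            free_sulfur_dioxide sulphates volatile_acidity;   /* pruned inputs */
      target isgood / level=nom;                              /* binary target */
      hidden 3;                                               /* 3 units in the hidden layer */
      train;
      score out=wino.nn_scored;                               /* assumed output data set */
   run;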

  26. Assess: NN Results • Hard to interpret the results to formulate a recipe

   The NEURAL Procedure, Optimization Results: Parameter Estimates

    N  Parameter                  Estimate    Gradient Objective Function
    1  alcohol_H11                3.679818    -0.001411
    2  chlorides_H11              0.520190    -0.000479
    3  density_H11               -2.171623     0.000883
    4  fixed_acidity_H11         -0.055929     0.000179
    5  free_sulfur_dioxide_H11    0.403412     0.000139
    6  sulphates_H11             -4.954290    -0.000224
    7  volatile_acidity_H11       2.686209     0.000205
    8  alcohol_H12               -0.313005     0.001209
    9  chlorides_H12              0.200973     0.000759

  27. Assess: Comparative Results • Receiver Operating Characteristic (ROC) chart for NN vs. Decision Tree
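The chart comes from the Model Comparison node; a base-SAS sketch of the same comparison, assuming a data set scored_both that holds each model's predicted probability in hypothetical columns p_tree and p_nn, could be:

   proc logistic data=scored_both;
      model isgood(event='1') = / nofit;     /* fit no new model; just compute ROC curves */
      roc 'Decision Tree'  pred=p_tree;
      roc 'Neural Network' pred=p_nn;
      roccontrast;                           /* test for a difference between the curves */
   run;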

  28. Assess: Comparative Results • Cumulative Lift for NN vs Decision Tree

  29. Assess: Comparison with Reference Paper • The authors used the rminer package in R • Support Vector Machine (SVM) and Neural Network models were used • They applied techniques to extract the relative importance of variables • They attempted to predict every quality level • They noted the importance of alcohol and sulphates: “An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.”

  30. Assess: Paper Variable Importance

  31. Overall Project in SAS EM

  32. References • UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Wine • P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. • P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties: http://www3.dsi.uminho.pt/pcortez/wine5.pdf
