This Edureka Decision Tree tutorial will help you understand the basics of decision trees. It is suited to both beginners and professionals who want to learn or brush up on their Data Science concepts and learn decision tree analysis with examples.
Below are the topics covered in this tutorial:
1) Machine Learning Introduction
2) Classification
3) Types of classifiers
4) Decision tree
5) How does Decision tree work?
6) Demo in R

For a complete structured training, check out the details here: https://goo.gl/AfxwBc

Decision Tree www.edureka.co/data-science Edureka’s Data Science Certification Training
What Will You Learn Today?
1) Machine Learning
2) Classification
3) Types Of Classifiers
4) What Is Decision Tree?
5) How Decision Tree Works?
6) Demo In R: Diabetes Prevention Use Case

Machine Learning
Introduction To Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
[Diagram: learning cycle with the labels Training Data, Learn Algorithm, Build Model, Perform, Feedback]

Machine Learning - Example
Amazon has a huge amount of consumer purchasing data. The data consists of consumer demographics (age, sex, location), purchasing history, and past browsing history. Based on this data, Amazon segments its customers, identifies patterns, and recommends the right product to the right customer at the right time.

Classification
Introduction To Classification
Is this A or B?
Classification is the problem of identifying which of a set of categories a new observation belongs to. It is a supervised learning task: the classifier already has a set of labeled examples, and from these examples it learns to assign labels to new, unseen examples.
Example: Assigning a given email to the "spam" or "non-spam" category.

Classification - Example
How to train my model to identify spam mails from genuine mails?
Feed the classifier with a training data set and predefined labels. It will learn to categorize particular data under a specific label.
Inputs shown for the spam example:
• Source IP Address
• Phrases in the text
• Subject Line
• HTML Tags

Classification Use Cases
• Banking: Identification of loan risk applicants by their probability of defaulting on payments.
• Medicine: Identification of at-risk patients and disease trends.
• Remote sensing: Identification of areas of similar land use in a GIS database.
• Marketing: Identifying customer churn.

Types Of Classifiers
Naïve Bayes
• It is a classification technique based on Bayes' Theorem with an assumption of independence among attributes.
Decision Tree
• Decision tree builds classification models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets.
Random Forest
• Random Forest is an ensemble classifier made using many decision tree models.
• Ensemble models combine the results from different models.

What is Decision Tree?
What Is Decision Tree?
A decision tree uses a tree structure to specify sequences of decisions and consequences. It employs a structure of nodes and branches. The depth of a node is the minimum number of steps required to reach the node from the root. Eventually, a final point (a leaf node) is reached and a prediction is made.
[Diagram: example tree. Root node: Gender, with branches Male and Female. Internal nodes at depth 1: Income (branches <=45000 and >45000) and Age (branches <=40 and >40). Leaf nodes: Yes or No.]

Use Case - Credit Risk Detection
To minimize loss, the bank needs a decision rule to predict whose loan to approve. An applicant’s demographic (income, debts, credit history) and socio-economic profiles are considered. Data science can help banks recognize behavior patterns and provide a complete view of individual customers.

How Decision Tree Works?
How Decision Tree Works?
Let’s take an example. We have taken a dataset consisting of:
• Weather information of the last 14 days
• Whether the match was played or not on that particular day
Now, using the decision tree, we need to predict whether the game will happen if the weather condition is:
Outlook = Rain, Humidity = High, Wind = Weak, Play = ?

How Decision Tree Works?
From our data, we will choose one variable, “Outlook”, and see how it affects the variable “Play”.
Day  Outlook   Humidity  Wind    Play
D1   Sunny     High      Weak    No
D2   Sunny     High      Strong  No
D3   Overcast  High      Weak    Yes
D4   Rain      High      Weak    Yes
D5   Rain      Normal    Weak    Yes
D6   Rain      Normal    Strong  No
D7   Overcast  Normal    Strong  Yes
D8   Sunny     High      Weak    No
D9   Sunny     Normal    Weak    Yes
D10  Rain      Normal    Weak    Yes
D11  Sunny     Normal    Strong  Yes
D12  Overcast  High      Strong  Yes
D13  Overcast  Normal    Weak    Yes
D14  Rain      High      Strong  No
Play: 9 Yes, 5 No
There are 3 types of Outlook here: Sunny, Rain, Overcast

How Decision Tree Works?
We can further divide our data based on Outlook. Overall: 9 Yes / 5 No.
Outlook = Sunny (2 Yes / 3 No, split further)
Day  Outlook  Humidity  Wind
D1   Sunny    High      Weak
D2   Sunny    High      Strong
D8   Sunny    High      Weak
D9   Sunny    Normal    Weak
D11  Sunny    Normal    Strong
Outlook = Overcast (4 Yes / 0 No, pure subset, will play)
Day  Outlook   Humidity  Wind
D3   Overcast  High      Weak
D7   Overcast  Normal    Strong
D12  Overcast  High      Strong
D13  Overcast  Normal    Weak
Outlook = Rain (3 Yes / 2 No, split further)
Day  Outlook  Humidity  Wind
D4   Rain     High      Weak
D5   Rain     Normal    Weak
D6   Rain     Normal    Strong
D10  Rain     Normal    Weak
D14  Rain     High      Strong
We will split the data until we get pure subsets at every branch.

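The split by Outlook can be reproduced with a few lines of base R. This is a minimal sketch: the data frame name `weather` and the helper vector `is_pure` are assumptions made for illustration, while the values are copied from the 14-day table.

```r
# Weather data as listed on the slides (14 days).
weather <- data.frame(
  Day      = paste0("D", 1:14),
  Outlook  = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Wind     = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  Play     = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Cross-tabulate Outlook against Play: one row per Outlook value.
counts <- table(weather$Outlook, weather$Play)
print(counts)

# A subset is pure when all of its rows share a single Play value.
is_pure <- apply(counts, 1, function(row) sum(row > 0) == 1)
print(is_pure)
```

Only the Overcast row comes out as pure, which matches the slide: Sunny and Rain still need a further split.
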
How Decision Tree Works?
We will use the Humidity column to split the subset “Sunny” further.
Outlook = Overcast: pure subset, will play (D3, D7, D12, D13)
Outlook = Rain: 3 Yes / 2 No, split further (D4, D5, D6, D10, D14)
Outlook = Sunny, Humidity = High: pure subset, will not play
Day  Humidity  Wind
D1   High      Weak
D2   High      Strong
D8   High      Weak
Outlook = Sunny, Humidity = Normal: pure subset, will play
Day  Humidity  Wind
D9   Normal    Weak
D11  Normal    Strong

How Decision Tree Works?
We will use the Wind column to split the subset “Rain” further.
Outlook = Rain, Wind = Weak: pure subset, will play
Day  Humidity  Wind
D4   High      Weak
D5   Normal    Weak
D10  Normal    Weak
Outlook = Rain, Wind = Strong: pure subset, will not play
Day  Humidity  Wind
D6   Normal    Strong
D14  High      Strong
The Overcast subset and the two Sunny subsets (Humidity = High, Humidity = Normal) are already pure.

How Decision Tree Works?
Every branch now ends in a pure subset, which gives the final decision tree:
Outlook
• Overcast -> Will play
• Sunny -> Humidity
  • High -> Will not play
  • Normal -> Will play
• Rain -> Wind
  • Weak -> Will play
  • Strong -> Will not play

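One way to read the finished tree is as a set of nested rules. The sketch below hand-codes those rules in base R; the function name `predict_play` is hypothetical, and the "Yes"/"No" return values stand for "will play" and "will not play". It is meant only to show how a prediction walks from the root to a leaf, here for the condition asked about at the start of this section.

```r
# The final tree from the slides, written as nested if/else rules.
predict_play <- function(outlook, humidity, wind) {
  if (outlook == "Overcast") {
    "Yes"                                   # pure subset: always played
  } else if (outlook == "Sunny") {
    if (humidity == "High") "No" else "Yes" # Sunny is split on Humidity
  } else {
    if (wind == "Weak") "Yes" else "No"     # Rain is split on Wind
  }
}

# The condition posed earlier: Outlook = Rain, Humidity = High, Wind = Weak.
answer <- predict_play("Rain", "High", "Weak")
print(answer)
```

For this input the walk goes Outlook = Rain, then Wind = Weak, so the rule set returns "Yes". Humidity is never consulted on the Rain branch.
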
How will I know which attribute to take?
I’ll show you how.
Problem – Client Subscription
Consider the case of a bank that wants to market its products to the appropriate customers. Given the demographics of clients and their reactions to previous campaign phone calls, the bank's goal is to predict which clients would subscribe.
The attributes are:
• Job
• Marital status
• Education
• Housing
• Loan
• Contact
• Poutcome

How To Choose An Attribute?
A common way to identify the most informative attribute is to use entropy-based methods. Entropy (H) can be calculated as:
\[ H = -\sum_{x} p(x) \log_2 p(x) \]
where
x = Datapoint
p(x) = Probability of x
H = Entropy of x

How To Choose An Attribute?
Let’s say 1,789 of the total population of 2,000 clients have not subscribed. Now, let’s do some mathematics on it:
P(subscribed = yes) = 1 − 1789/2000 = 0.1055 (10.55%)
P(subscribed = no) = 1789/2000 = 0.8945 (89.45%)
Therefore, the root is only 10.55% pure on the subscribed = yes class. Conversely, it is 89.45% pure on the subscribed = no class.
\[ H_{subscribed} = -0.1055 \cdot \log_2 0.1055 - 0.8945 \cdot \log_2 0.8945 \approx 0.4862 \]

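The arithmetic above can be checked in base R. This is a small sketch, not part of the original demo: the function name `entropy` and the variable names are assumptions, and only the counts 1,789 and 2,000 come from the slide.

```r
# Shannon entropy (base 2) of a vector of class probabilities.
entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

p_no  <- 1789 / 2000       # clients who did not subscribe
p_yes <- 1 - p_no          # clients who subscribed
h_subscribed <- entropy(c(p_yes, p_no))
print(round(h_subscribed, 4))
```

Rounded to four decimals this reproduces the 0.4862 on the slide. A 50/50 split would give the maximum of 1 bit, and a pure node gives 0.
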
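How To Choose An Attribute?
Conditional entropy is the entropy of the outcome within each value of an attribute, weighted by how often that value occurs:
\[ H_{Y|X} = \sum_{x} P(x) \, H(Y \mid X = x) = -\sum_{x} P(x) \sum_{y} P(y \mid x) \log_2 P(y \mid x) \]
Calculating conditional entropy for subscribed|contact gives us the following result:
\[ H_{subscribed|contact} = 0.4661 \]
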
How To Choose An Attribute?
The information gain of an attribute A is defined as the difference between the base entropy (H_S) and the conditional entropy of the attribute (H_{S|A}):
\[ InfoGain_A = H_S - H_{S|A} \]
Attribute poutcome has the most information gain and is the most informative variable. Therefore, poutcome is chosen for the first split of the decision tree.

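The bank data behind these figures is not reproduced here, so the sketch below applies the same definition to the 14-day weather data from the earlier example. The function names `entropy`, `conditional_entropy` and `info_gain` are assumptions made for illustration. The one bank-side number that can be derived from the slides is the gain for contact, 0.4862 − 0.4661 = 0.0201.

```r
# Weather data from the earlier example (Day column omitted).
weather <- data.frame(
  Outlook  = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Wind     = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  Play     = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Base entropy H_S of a vector of class labels.
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# H_{S|A}: entropy of each subset, weighted by the subset's share of the rows.
conditional_entropy <- function(labels, attribute) {
  subsets <- split(labels, attribute)
  weights <- sapply(subsets, length) / length(labels)
  sum(weights * sapply(subsets, entropy))
}

# InfoGain_A = H_S - H_{S|A}
info_gain <- function(labels, attribute) {
  entropy(labels) - conditional_entropy(labels, attribute)
}

gains <- sapply(weather[c("Outlook", "Humidity", "Wind")],
                function(a) info_gain(weather$Play, a))
print(round(gains, 4))
best <- names(which.max(gains))
print(best)
```

On this data Outlook has the largest gain (about 0.247, against about 0.152 for Humidity and 0.048 for Wind), which fits its use as the first split in the weather example.
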
How To Choose An Attribute?
Finally, we get the following decision tree:
[Diagram: decision tree. Root node: Poutcome, with branches "Failure, Other, Unknown" and "Success". Internal nodes: Education (branches "Secondary, tertiary" and "Primary, Unknown") and Job (branches "Admin, blue-collar, management, technician" and "Self-employed, student, unemployed"). Leaf nodes: Yes or No.]

Decision Tree - Pros And Cons
Demo
What if we could predict the occurrence of diabetes and take appropriate measures beforehand to prevent it?
Sure! Let me take you through the steps to predict the vulnerable patients.

Demo
Steps: Data Acquisition, Divide dataset, Implement model, Visualize, Model Validation
The doctor gets the following data from the medical history of the patient.

Demo
We will divide our entire dataset into two subsets:
• Training dataset -> to train the model
• Testing dataset -> to validate and make predictions

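A random split of this kind can be sketched in base R as follows. The data frame here is synthetic stand-in data (the real patient records are not shown), the 70/30 ratio and the seed are arbitrary assumptions, and the names `diabet` and `diabet_train` are hypothetical. Only `diabet_test`, `glucose_conc` and `BMI` appear elsewhere in the demo.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the patient data, for illustration only.
diabet <- data.frame(
  glucose_conc = round(runif(100, 70, 200)),
  BMI          = round(runif(100, 18, 45), 1)
)

# Randomly pick 70% of the row indices for training; the rest is for testing.
train_idx    <- sample(nrow(diabet), size = floor(0.7 * nrow(diabet)))
diabet_train <- diabet[train_idx, ]
diabet_test  <- diabet[-train_idx, ]

print(c(train = nrow(diabet_train), test = nrow(diabet_test)))
```

Sampling row indices rather than slicing the first rows avoids a biased split when the file happens to be sorted.
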
Demo
Here, we implement the decision tree in R using the following commands.

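The commands from this slide did not survive in the text. As a hedged sketch, one common way to fit a classification tree in R is the `rpart` package, whose fitted objects work with the `plot()`, `text()` and `predict(..., type = "class")` calls used later in the demo. Whether the original demo used `rpart` is an assumption. The training data below is synthetic, and `diabet_train` is a hypothetical name.

```r
library(rpart)  # assumption: the original commands are not shown

set.seed(1)
n <- 200

# Synthetic training data, for illustration only.
diabet_train <- data.frame(
  glucose_conc = round(runif(n, 70, 200)),
  BMI          = round(runif(n, 18, 45), 1)
)
# Made-up label rule: high glucose means "YES".
diabet_train$is_diabetic <- factor(
  ifelse(diabet_train$glucose_conc > 150, "YES", "NO")
)

# Fit a classification tree predicting is_diabetic from all other columns.
diabet_model <- rpart(is_diabetic ~ ., data = diabet_train, method = "class")
print(diabet_model)
```

The formula `is_diabetic ~ .` uses every remaining column as a candidate split, and `method = "class"` asks for a classification tree rather than a regression tree.
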
Demo
We get the output as follows, but it is not easy to understand, so let’s visualize it.

Demo
For plotting, we can use the following commands:
> plot(diabet_model, margin = 0.1)
> text(diabet_model, use.n = TRUE, pretty = TRUE, cex = 0.6)
[Plot: decision tree with the splits glucose_conc < 154.5, BMI < 26.35, Age >= 53.5, glucose_conc < 100.5, Age < 30.5, Diabetes_pedigree_fn < 0.315, glucose_conc < 131, blood_pressure >= 72 and Age >= 53.5. Leaves are labeled NO or YES with class counts: NO 6/4, YES 9/65, NO 107/3, NO 93/13, NO 68/18, YES 5/11, NO 12/3, NO 35/18, NO 5/2, YES 13/39.]

Demo
Now, we can use our model to predict the output of our testing dataset. We can use the following code for predicting the output:
> pred_diabet <- predict(diabet_model, newdata = diabet_test, type = "class")
> pred_diabet

Demo
We get the following output for our testing dataset, where:
• “YES” means the patient is predicted to be vulnerable to diabetes.
• “NO” means the patient is predicted not to be vulnerable to diabetes.

Demo
We can create a confusion matrix for the model using the caret library to see how good our model is:
> library(caret)
> confusionMatrix(table(pred_diabet, diabet_test$is_diabetic))

Demo
The accuracy (or the overall success rate) is a metric defining the rate at which a model has classified the records correctly. A good model should have a high accuracy score.
Accuracy = 71.13%

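To make the accuracy definition concrete, the sketch below computes it by hand from a confusion matrix in base R. The ten labels are made up for illustration and are unrelated to the 71.13% result, and the names `actual`, `predicted` and `conf_mat` are hypothetical.

```r
# Hypothetical true labels and predictions for ten test patients.
actual    <- factor(c("NO", "NO", "YES", "NO", "YES", "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))
predicted <- factor(c("NO", "YES", "YES", "NO", "NO", "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))

# Rows are predictions, columns are true labels.
conf_mat <- table(predicted, actual)
print(conf_mat)

# Accuracy = correctly classified records (the diagonal) / all records.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Here eight of the ten records sit on the diagonal, so the accuracy is 0.8. The two off-diagonal cells are one false "YES" and one false "NO".
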
Course Details
Get Edureka Certified in Data Science Today! Go to www.edureka.co/data-science
What our learners have to say about us!
Gnana Sekhar says - “Edureka Data science course provided me a very good mixture of theoretical and practical training. LMS pre recorded sessions and assignments were very good as there is a lot of information in them that will help me in my job. Edureka is my teaching GURU now...Thanks EDUREKA.”
Shravan Reddy says - “I would like to recommend any one who wants to be a Data Scientist just one place: Edureka. Explanations are clean, clear, easy to understand. Their support team works very well.. I took the Data Science course and I'm going to take Machine Learning with Mahout and then Big Data and Hadoop”.
Balu Samaga says - “It was a great experience to undergo and get certified in the Data Science course from Edureka. Quality of the training materials, assignments, project, support and other infrastructures are a top notch.”