700 likes | 1.42k Views
This Logistic Regression Tutorial shall give you a clear understanding as to how a Logistic Regression machine learning algorithm works in R. Towards the end, in our demo we will be predicting which patients have diabetes using Logistic Regression! In this Logistic Regression Tutorial you will understand: <br><br>1) The 5 Questions asked in Data Science <br>2) What is Regression? <br>3) Logistic Regression - What and Why? <br>4) How does Logistic Regression Work? <br>5) Demo in R: Diabetes Use Case <br>6) Logistic Regression: Use Cases
E N D
Logistic Regression www.edureka.co/data-science Edureka’s Data Science Certification Training
What Will You Learn Today? 3 1 2 The 5 Questions asked in Data Science Logistic Regression – What and Why? What is Regression? 5 6 4 Logistic Regression – Use Cases How does Logistic Regression work? Demo In R: Diabetes Use Case www.edureka.co/data-science Edureka’s Data Science Certification Training
The 5 Questions Asked In Data Science www.edureka.co/data-science Edureka’s Data Science Certification Training
The 5 Questions Asked In Data Science In data science, basically we have 5 kind of problems. Classification Algorithm Is this A or B? Q1. Anomaly Detection Algorithm Q2. Is this weird? How much or how many? Regression Algorithms Q3. How is this organized? Clustering Algorithms Q4. Reinforcement Learning What should I do next? Q5. www.edureka.co/data-science Edureka’s Data Science Certification Training
The 5 Questions Asked In Data Science In data science, basically we have 5 kind of problems. Classification Algorithm Is this A or B? Is this A or B? Q1. Anomaly Detection Algorithm Q2. Is this weird? How much or how many? Regression Algorithms Q3. How is this organized? Clustering Algorithms Q4. Reinforcement Learning What should I do next? Q5. www.edureka.co/data-science Edureka’s Data Science Certification Training
What Is Regression? www.edureka.co/data-science Edureka’s Data Science Certification Training
What Is Regression? Y-axis ➢ Regression analysis is a predictive modelling technique. ➢ It estimates the relationship between a dependent (target) and an independent variable (predictor). X-axis Input value = 7.00 Predicted outcome = 123.9 www.edureka.co/data-science Edureka’s Data Science Certification Training
Types Of Regression www.edureka.co/data-science Edureka’s Data Science Certification Training
Types Of Regression Linear Regression Polynomial Regression Logistic Regression • When there is a linear relationship between independent and dependent variables. • When the power of independent variable is more than 1. • When the dependent variable is categorical (0/ 1, True/ False, Yes/ No, A/B/C) in nature. Y Y X X www.edureka.co/data-science Edureka’s Data Science Certification Training
Types Of Regression Linear Regression Polynomial Regression Logistic Regression • When there is a linear relationship between independent and dependent variables. • When the power of independent variable is more than 1. • When the dependent variable is categorical (0/ 1, True/ False, Yes/ No, A/B/C) in nature. Y Y X X www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Logistic Regression? www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Logistic Regression? Whenever the outcome of the dependent variable (Y) is discrete, like 0 or 1, Yes or No, A, B or C, we use logistic regression. www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Not Linear Regression? Whenever the outcome of the dependent variable (Y) is discrete, like 0 or 1, Yes or No, A, B or C, we use logistic regression. Why can’t we use Linear Regression? www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Not Linear Regression? 1 Y-axis 0 X-axis Now since our value of Y will be between 0 and 1, the linear line has to be clipped at 0 and 1. www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Not Linear Regression? 1 Y-axis 0 X-axis With this, our resulting curve cannot be formulated into a single formula. We needed a new way to solve this kind of problem. Hence, we came up with Logistic Regression! www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Curve LOGISTIC REGRESSION 1.2 1 0.8 0.6 The S Curve 0.4 0.2 0 0 1 2 3 4 5 6 7 8 9 10 www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Logistic Regression? Equation for a straight line Y = C + B1X1 + B2X2 + …. Range of Y is from – (infinity) to infinity www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Logistic Regression? Equation for a straight line Y = C + B1X1 + B2X2 + …. Range of Y is from – (infinity) to infinity Let’s try to reduce the Logistic Regression Equation from this equation Y = C + B1X1 + B2X2 + …. In Logistic Regression Y can only be between 0 and 1. www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Logistic Regression? Equation for a straight line Y = C + B1X1 + B2X2 + …. Range of Y is from – (infinity) to infinity Let’s try to reduce the Logistic Regression Equation from this equation Y = C + B1X1 + B2X2 + …. In Logistic Equation Y can only be between 0 and 1. Now, to get the range of Y between 0 and infinity, let’s transform Y Y=0 | 0 Y Now, we have the range between 0 and infinity 1 − Y Y=1 | infinity www.edureka.co/data-science Edureka’s Data Science Certification Training
Why Logistic Regression? Equation for a straight line Y = C + B1X1 + B2X2 + …. Range of Y is from – (infinity) to infinity Let’s try to reduce the Logistic Regression Equation from this equation Y = C + B1X1 + B2X2 + …. In Logistic Equation Y can only be between 0 and 1. Now, to get the range of Y between 0 and infinity, let’s transform Y Y=0 | 0 Y Now, we have the range between 0 and infinity 1 − Y Y=1 | infinity Let us transform it further, to get the range between –( infinity ) and infinity Y ? = C + B1X1 + B2X2 + …. log log 1 − Y ? − ? www.edureka.co/data-science Edureka’s Data Science Certification Training
What Is Logistic Regression? www.edureka.co/data-science Edureka’s Data Science Certification Training
What Is Logistic Regression? Logistic regression, or logit regression, or logit model is a regression model where the dependent variable (DV) is categorical. Categorical Dependent Y = f(X) Variables that can have only fixed values such as A, B or C, Yes or No i.e Y is dependent on X. www.edureka.co/data-science Edureka’s Data Science Certification Training
What Is Logistic Regression? Therefore, whenever the outcome of the dependent variable (Y) is categorical, like 0 or 1, Yes or No, A, B or C, we use logistic regression. 1.0 0.0 www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? www.edureka.co/data-science Edureka’s Data Science Certification Training
How does Logistic Regression Work? Let us take an example to understand this: Selected 147, 120, 121, 128, 110, 119, 133 MODEL Not Selected 107, 89, 92, 106, 104, 114 www.edureka.co/data-science Edureka’s Data Science Certification Training
How does Logistic Regression Work? Let us take an example to understand this: Selected 147, 120, 121, 128, 110, 119, 133 MODEL Not Selected 107, 89, 92, 106, 104, 114 www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? Let’s take a sample dataset in R, which is called mtcars. Our aim is to predict whether a car will have a V-engine or a Straight engine based on our inputs. Key Mpg - Miles/US Gallon Cyl – Number of cylinders Disp – Number of cylinders Hp – Gross horsepower Drat – Rear axle ratio Wt – Weight (lb/1000) Qsec – 1/4 mile time Vs – V Engine Am – Transmission Type Gear – Number of forward gears Carb - Number of carburetors www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? For now, let’s take disp and wt as our primary independent variables. Why? We’ll be discussing it in our next section. Key Mpg - Miles/US Gallon Cyl – Number of cylinders Disp – Number of cylinders Hp – Gross horsepower Drat – Rear axle ratio Wt – Weight (lb/1000) Qsec – 1/4 mile time Vs – V Engine Am – Transmission Type Gear – Number of forward gears Carb - Number of carburetors www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? Since our aim is to know which engine will fit, the engine will either be V – type or not, i.e either 1 or 0. Therefore, our dependent variable is Y. Key Mpg - Miles/US Gallon Cyl – Number of cylinders Disp – Number of cylinders Hp – Gross horsepower Drat – Rear axle ratio Wt – Weight (lb/1000) Qsec – 1/4 mile time Vs – V Engine Am – Transmission Type Gear – Number of forward gears Carb - Number of carburetors www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? Before creating the model, we divide our dataset into training and testing. Training Dataset 80 % 20% Testing Dataset www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? Training to create our model and testing to validate it. Create model from this Training Dataset 80 % www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? ?° Once the model is created we get the following outputs, which are calculated using MLE*. ?1 ?2 *Maximum Likelihood Estimation is a method of estimating the parameters. www.edureka.co/data-science Edureka’s Data Science Certification Training
Estimated Regression Equation Estimated Regression Equation: ?2? ??° +?1? 1+ 2 Y Logit (Y) = log = 1 − Y ?2? 1 + ??° +?1? 1+ 2 Here, β°= Constant Coefficient ?1= Coefficient of x1 ?2= Coefficient of x2 ?1= Independent variable ?2= Independent variable e = Euler’s Number P(Y) = Probability that Y equals 1 www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? Let’s take a value from the test dataset β°= 1.83010 Substituting Values β1= 1.09428 β2= - 0.02529 0.9849 1.9849 = 0.4962 Logit (Y) = e = 2.7183 X1 = 120.3 We will assume the threshold to be 0.5 Probability of ‘vs’ being ‘1’ = 0.4962 X2 = 2.140 Hence our car will not have a VS engine and hence have a straight engine. www.edureka.co/data-science Edureka’s Data Science Certification Training
How Does Logistic Regression Work? Let’s take a value from the test dataset β°= 1.83010 Substituting Values β1= 1.09428 β2= - 0.02529 0.9849 1.9849 = 0.4962 Logit (Y) = e = 2.7183 X1 = 120.3 We will assume the threshold to be 0.5 Probability of ‘vs’ being ‘1’ = 0.4962 X2 = 2.140 Hence our car will not have a VS engine and hence have a straight engine. www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R Our aim is to predict whether a patient is diabetic or not based on the following values. Key Npreg – number of pregnancies Glu – plasma glucose concentration Bp – diastolic blood pressure Skin – triceps skin fold thickness Bmi – body mass index Ped – diabetes pedigree function Age – age in years Type – 1 for yes and 0 for No for diabetic www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R First, we will read the data from our CSV file, by entering this command: www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R First, we will read the data from our CSV file, by entering this command: Then, we will split our dataset into training and testing, with the ratio 8:2 www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R First, we will read the data from our CSV file, by entering this command: Then, we will split our dataset into training and testing, with the ratio 8:2 After that we’ll create our model using the training dataset www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R The summary of the model will give this. www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R The summary of the model will give this. *** - 99.9% confident ** - 99% confident * - 95% confident . - 90% confident www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R • This is the summary model that we get after improving our model. • So the insignificant fields is skin Null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean) Residual deviance shows how well the response variable is predicted with inclusion of independent variables. www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R We will the predict the values for the test dataset and then categorize them according to threshold which is 0.5 www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R We will the predict the values for the test dataset and then categorize them according to threshold which is 0.5 Create Confusion Matrix for the training dataset www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R We will the predict the values for the test dataset and then categorize them according to threshold which is 0.5 Create Confusion Matrix for the training dataset And then finding the accuracy www.edureka.co/data-science Edureka’s Data Science Certification Training
How To Find The Threshold? www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R Let us take a sample for our significant fields and see whether our patient is diabetic or not based on the model that we created. Store the predicted values for training dataset in ‘res’ variable www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R Let us take a sample for our significant fields and see whether our patient is diabetic or not based on the model that we created. Store the predicted values for training dataset in ‘res’ variable Import the library for the ROCR package www.edureka.co/data-science Edureka’s Data Science Certification Training
Logistic Regression Demo In R Let us take a sample for our significant fields and see whether our patient is diabetic or not based on the model that we created. Store the predicted values for training dataset in ‘res’ variable Import the library for the ROCR package Define the ‘ROCRPred’ and and ‘ROCRPerf’ variables www.edureka.co/data-science Edureka’s Data Science Certification Training