1 / 51

Decision Tree and Naïve Bayes Classifier for Weather Prediction

This project involves building and comparing a Decision Tree and Naïve Bayes classifier to predict weather conditions based on a provided dataset. The Decision Tree classifier achieved 60% accuracy, while the Naïve Bayes classifier yielded similar results. The use of these classifiers demonstrates their application in predictive modeling for weather forecasting.

Download Presentation

Decision Tree and Naïve Bayes Classifier for Weather Prediction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DECISION TREE

  2. DECISION TREE

  3. DATASET no 6 85 85 0 no 6 80 90 1 yes 4 83 86 0 yes 2 70 96 0 yes 2 68 80 0 no 2 65 70 1 yes 4 64 65 1 no 6 72 95 0 yes 6 69 70 0 yes 2 75 80 0 yes 6 75 70 1 yes 4 72 90 1 yes 4 81 75 0 no 2 71 91 1

  4. import numpy as np • import pandas as pd • from sklearn.model_selection import train_test_split • from sklearn.tree import DecisionTreeClassifier • from sklearn.metrics import accuracy_score

  5. balance_data = pd.read_csv('C:\Python\weather.csv',header= None) print(balance_data) 0 1 2 3 4 0 no 6 85 85 0 1 no 6 80 90 1 2 yes 4 83 86 0 3 yes 2 70 96 0 4 yes 2 68 80 0 5 no 2 65 70 1 6 yes 4 64 65 1 7 no 6 72 95 0 8 yes 6 69 70 0 9 yes 2 75 80 0 10 yes 6 75 70 1 11 yes 4 72 90 1 12 yes 4 81 75 0 13 no 2 71 91 1

  6. print("Dataset Lenght:: ", len(balance_data)) print("Dataset Shape:: ", balance_data.shape) Dataset Lenght:: 14 Dataset Shape:: (14, 5)

  7. X = balance_data.values[:,1:5] Y = balance_data.values[:,0] print (X, Y) [[6 85 85 0] [6 80 90 1] [4 83 86 0] [2 70 96 0] [2 68 80 0] [2 65 70 1] [4 64 65 1] [6 72 95 0] [6 69 70 0] [2 75 80 0] [6 75 70 1] [4 72 90 1] [4 81 75 0] [2 71 91 1]] ['no' 'no' 'yes' 'yes' 'yes' 'no' 'yes' 'no' 'yes' 'yes' 'yes' 'yes' 'yes' 'no']

  8. X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100) clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100, max_depth=3, min_samples_leaf=5) clf_entropy.fit(X_train, y_train)

  9. y_pred = clf_entropy.predict(X_test) print(y_pred) ['yes' 'yes' 'yes' 'yes' 'yes']

  10. print("Accuracy is ", accuracy_score(y_test,y_pred)*100) Accuracy is 60.0

  11. print(clf_entropy.predict([[6, 75, 70, 1]])) ['yes']

  12. Naïve Bayes classifier

  13. Bayes’ Theorem

  14. import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.metrics import accuracy_score from sklearn import metrics

  15. balance_data = pd.read_csv('C:\Python\weather.csv',header= None) print(balance_data) 0 1 2 3 4 0 no 6 85 85 0 1 no 6 80 90 1 2 yes 4 83 86 0 3 yes 2 70 96 0 4 yes 2 68 80 0 5 no 2 65 70 1 6 yes 4 64 65 1 7 no 6 72 95 0 8 yes 6 69 70 0 9 yes 2 75 80 0 10 yes 6 75 70 1 11 yes 4 72 90 1 12 yes 4 81 75 0 13 no 2 71 91 1

  16. balance_data = pd.read_csv('C:\Python\weather.csv',header= None) print(balance_data) 0 1 2 3 4 0 no 6 85 85 0 1 no 6 80 90 1 2 yes 4 83 86 0 3 yes 2 70 96 0 4 yes 2 68 80 0 5 no 2 65 70 1 6 yes 4 64 65 1 7 no 6 72 95 0 8 yes 6 69 70 0 9 yes 2 75 80 0 10 yes 6 75 70 1 11 yes 4 72 90 1 12 yes 4 81 75 0 13 no 2 71 91 1

  17. print("Dataset Lenght:: ", len(balance_data)) print("Dataset Shape:: ", balance_data.shape) Dataset Lenght:: 14 Dataset Shape:: (14, 5)

  18. X = balance_data.values[:,1:5] Y = balance_data.values[:,0] print (X, Y) [[6 85 85 0] [6 80 90 1] [4 83 86 0] [2 70 96 0] [2 68 80 0] [2 65 70 1] [4 64 65 1] [6 72 95 0] [6 69 70 0] [2 75 80 0] [6 75 70 1] [4 72 90 1] [4 81 75 0] [2 71 91 1]] ['no' 'no' 'yes' 'yes' 'yes' 'no' 'yes' 'no' 'yes' 'yes' 'yes' 'yes' 'yes' 'no']

  19. X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100) model = GaussianNB() model.fit(X_train, y_train)

  20. y_pred = model.predict(X_test) print(y_pred) • ['no' 'yes' 'yes' 'no' 'yes']

  21. print(model.predict([[2, 71, 91, 1]])) ['no']

  22. print("Accuracy is ", accuracy_score(y_test,y_pred)*100) Accuracy is 60.0

  23. print(metrics.classification_report(y_test,y_pred)) precision recall f1-score support no 0.50 0.50 0.50 2 yes 0.67 0.67 0.67 3 micro avg0.60 0.60 0.60 5 macro avg 0.58 0.58 0.58 5 weighted avg0.60 0.60 0.60 5

  24. print(metrics.confusion_matrix(y_test,y_pred)) [[1 1] [1 2]]

  25. SVM (SUPPORT VECTOR MACHINE)

  26. import numpy as np import matplotlib.pyplot as plt from sklearn import svm x = [1, 5, 1.5, 8, 1, 9] y = [2, 8, 1.8, 8, 0.6, 11] plt.scatter(x,y)

  27. X = np.array([[1,2], [5,8], [1.5,1.8], [8,8], [1,0.6], [9,11]]) y = [0,1,0,1,0,1] clf = svm.SVC(kernel='linear') clf.fit(X,y)

  28. print(clf.predict([[0.58,0.76]])) print(clf.predict([[10.58,10.76]])) [0] [1]

  29. Introduction to Regression Regression is a parametric technique used to predict continuous (dependent) variable given a set of independent variables. It is parametric in nature because it makes certain assumptions based on the data set. If the data set follows those assumptions, regression gives incredible results. Otherwise, it struggles to provide convincing accuracy. What is Regression? In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables.  

  30. Regression model • Relation between variables where changes in some variables may “explain” or possibly “cause” changes in other variables. • Explanatory variables are termed as independentvariables and the variables to be explained are termed as dependentvariables. • Regression model estimates the nature of the relationship between the independent and dependent variables.

  31. Examples • Dependent variable is retail price of gasoline in Regina – independent variable is the price of crude oil. • Dependent variable is employment income – independent variables might be hours of work, education, occupation, gender, age, region, years of experience, unionization status, etc. • Price of a product and quantity produced or sold: • Quantity sold affected by price. Dependent variable is quantity of product sold – independent variable is price. • Price affected by quantity offered for sale. Dependent variable is price – independent variable is quantity sold.

  32. Uses of regression • Amount of change in a dependent variable that results from changes in the independent variable(s) – can be used to estimate elasticities, returns on investment in human capital, etc. • Prediction and forecasting of sales, economic growth, etc. • Support or negate theoretical model. • Modify and improve theoretical models and explanations of phenomena.

  33. simple linear regression • x is the independent variable • y is the dependent variable • The regression model is • The model has two variables, the independent or explanatory variable, x, and the dependent variable y, the variable whose variation is to be explained. • The relationship between x and y is a linear or straight line relationship. • Two parameters to estimate – the slope of the line β1 and the y-intercept β0 (where the line crosses the vertical axis). • ε is random, or error component.

  34. Linear Regression

  35. Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression, models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and forecasting. Different regression models differ based on – the kind of relationship between dependent and independent variables, they are considering and the number of independent variables being used.

  36. Linear regression is used to estimate real world values like cost of houses, number of calls, total sales etc. based on continuous variable. Here, we establish relationship between dependent and independent variables by fitting a best line. This line of best fit is known as regression line and is represented by the linear equation Y= a *X + b. Y – Dependent Variable a – Slope X – Independent variable b – Intercept These coefficients a and b are derived based on minimizing the sum of squared difference of distance between data points and regression line.

  37. import matplotlib.pyplot as plt import numpy as np from sklearn import linear_model height=[[4.0],[4.5],[5.0],[5.2],[5.4],[5.8],[6.1],[6.2],[6.4],[6.8]] weight=[42, 44, 49, 53, 55, 58, 60, 64, 66, 69]

  38. plt.scatter(height, weight, color='black') plt.xlabel("height") plt.ylabel("weight") plt.show()

  39. lin_reg=linear_model.LinearRegression() lin_reg.fit(height,weight) m=lin_reg.coef_ b=lin_reg.intercept_ print("slope=", m, "intercept=", b) slope= [10.25056948] intercept= -0.7881548974942945

  40. pred_values=[lin_reg.coef_ * i + lin_reg.intercept_ for i in height] plt.xlabel("height") plt.ylabel("weight") plt.plot(height, pred_values, 'b')

  41. w1=lin_reg.predict(np.array([[10.2]])) print("the weight is",w1) the weight is [103.76765376]

  42. Logistic Regression

  43. Logistic regression • Logistic regression is another technique borrowed by machine learning from statistics. It is the preferred method for binary classification problems, that is, problems with two class values. • It is a classification algorithm and not a regression algorithm as the name says. It is used to estimate discrete values or values like 0/1, Y/N, T/F based on the given set of independent variable . It predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also called logit regression. Since, it predicts the probability, its output values lie between 0 and 1. • Note that taking a log is one of the best mathematical way to replicate a step function.

  44. The following points may be note-worthy when working on logistic regression It is similar to regression in that the objective is to find the values for the coefficients that weigh each input variable. Unlike in linear regression, the prediction for the output is found using a non-linear function called the logistic function. The logistic function appears like a big ‘S’ and will change any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to assign values to 0 and 1 and predict a class value. The way the logistic regression model is learned, the predictions made by it can also be used as the probability of a given data instance belonging to class 0 or class 1. This can be useful for problems where you need to give more reasoning for a prediction. Like linear regression, logistic regression works better when unrelated attributes of o/p variable is removed and similar attributes are removed.

  45. PYTHON CODE FOR LIGISTIC REGRESSION import pandas as pd import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression

  46. age insurance 0 22 0 1 25 0 2 47 1 3 52 0 4 46 1 5 56 1 6 55 0 7 60 1 8 62 1 9 61 1 10 18 0 11 28 0 12 27 0 13 29 0 14 49 1 15 55 1 16 25 1 17 58 1 18 19 0 19 18 0 20 21 0 21 26 0 22 40 1 23 45 1 24 50 1 25 54 1 26 23 0 df = pd.read_csv('C:\Python\insurance_data.csv') print(df)

  47. plt.scatter(df.age, df.insurance, marker='+', color='red')

  48. X_train, X_test, y_train, y_test = train_test_split( df[['age']], df.insurance, test_size=0.1) model=LogisticRegression() model.fit(X_train, y_train) m1=model.predict(X_test) print(m1) [1 1 0]

More Related