Explore statistical inference methods, parameters, and classifiers in data mining. Learn regression, ANOVA, and more. Understand Bayesian inference and predictive regression techniques.
Statistical Methods Chichang Jou Tamkang University
Chapter Objectives • Explain methods of statistical inference in data mining • Identify different statistical parameters for assessing differences in data sets • Describe the Naïve Bayesian Classifier and the logistic regression method • Introduce log-linear models using correspondence analysis of contingency tables • Discuss ANOVA analysis and linear discriminant analysis of multidimensional samples
Background • Statistics is the science of collecting and organizing data and drawing conclusions from data sets • Descriptive Statistics: • Organization and description of the general characteristics of data sets • Statistical Inference: • Draw conclusions from data • Main focus of this chapter
5.1 Statistical Inference • We are interested in arriving at conclusions concerning a population when it is impossible or impractical to observe the entire set of observations that make up the population • Sample in Statistics • Describes a finite data set of n-dimensional vectors • Will be called data set • Biased • Any sampling procedure that produces inferences that consistently overestimate or underestimate some characteristics of the population
5.1 Statistical Inference • Statistical Inference is the main form of reasoning relevant to data analysis • Statistical Inference methods are categorized as • Estimation: • Goal: make the expected prediction error close to 0 • Regression vs. classification • Tests of hypothesis • Null hypothesis H0: any hypothesis we wish to test • The rejection of H0 leads to the acceptance of an alternative hypothesis
5.2 Assessing Difference in data sets • Mean • Median: better for skewed data • Mode: the value that occurs most frequently • For unimodal frequency curves that are moderately asymmetrical, the following empirical relation is useful: mean − mode ≈ 3 × (mean − median) • Standard deviation σ (variance: σ²)
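The parameters listed on this slide can be computed with Python's standard `statistics` module. The sketch below uses a small hypothetical sample (not taken from the chapter); the empirical mean/median/mode relation is only a rule of thumb, so it need not hold closely for so few values.

```python
import statistics

# Hypothetical right-skewed sample, chosen only for illustration.
data = [2, 3, 3, 4, 5, 7, 11]

mean = statistics.mean(data)          # arithmetic mean
median = statistics.median(data)      # middle value, less sensitive to skew
mode = statistics.mode(data)          # most frequent value
variance = statistics.pvariance(data) # population variance (sigma squared)
std = statistics.pstdev(data)         # population standard deviation (sigma)

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", variance, "std:", std)
print("mean - mode:", mean - mode, "3 * (mean - median):", 3 * (mean - median))
```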
5.3 Bayesian Inference • Prior distribution: given probability distribution for the analyzed data set • Let X be a data sample whose class label is unknown. Hypothesis H: X belongs to a specific class C. $P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$ • See p. 97 for an example of the Naïve Bayesian Classifier: $P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$, where, under the naïve assumption of class-conditional independence of the attributes $x_1, \dots, x_n$ of X, $P(X \mid C_i) = \prod_{t=1}^{n} P(x_t \mid C_i)$ • The Bayesian classifier has the minimum error rate in theory. In practice, this is not always true because of inaccuracies in the assumptions of attributes and class-conditional independence.
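A minimal sketch of a naïve Bayesian classifier for categorical attributes may make the formula concrete. The function names, the toy data set, and the use of raw relative frequencies without smoothing are assumptions made for this illustration; they do not reproduce the example on p. 97.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(t, c)][v] = number of class-c samples whose attribute t equals v
    value_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for t, v in enumerate(x):
            value_counts[(t, c)][v] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    """Return the most probable class and the unnormalized posterior scores."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                          # prior P(C)
        for t, v in enumerate(x):
            score *= value_counts[(t, c)][v] / n_c   # P(x_t | C), no smoothing
        scores[c] = score  # proportional to P(C | X); P(X) is equal for all classes
    return max(scores, key=scores.get), scores

# Hypothetical training data: (outlook, temperature) -> play
samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
           ("rainy", "hot"), ("sunny", "hot")]
labels = ["no", "no", "yes", "yes", "no"]

class_counts, value_counts = train_naive_bayes(samples, labels)
predicted, scores = classify(("sunny", "mild"), class_counts, value_counts)
print(predicted, scores)
```

Because P(X) is the same for every class, the sketch compares only the numerators P(X | Ci) · P(Ci).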
5.4 Predictive Regression • Common reasons for performing regression analysis • The output is expensive to measure • The values of the inputs are known before the output is known, and a working prediction of the output is required • Controlling the input values to predict the behavior of corresponding outputs • To identify the causal link between some of the inputs and the output
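Linear regression • Y = α + β1X1 + β2X2 + … + βnXn • Applied to each sample j: yj = α + β1x1j + β2x2j + … + βnxnj + εj • Example with one input variable (pp. 99-100): Y = α + βX • The sum of squares of errors: $SSE = \sum_j (y_j - \alpha - \beta x_j)^2$ • Differentiate SSE w.r.t. α and β, and set the derivatives to 0 • Resulting equations for α and β: $\beta = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sum_j (x_j - \bar{x})^2}$, $\alpha = \bar{y} - \beta \bar{x}$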
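For the one-variable case, setting the derivatives of SSE to zero gives the standard closed-form least-squares estimates. The sketch below implements those estimates; the function names and the sample values are hypothetical and not taken from pp. 99-100.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for the model Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def sse(xs, ys, alpha, beta):
    """Sum of squared errors of the fitted line."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical noisy observations scattered around a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta = fit_line(xs, ys)
print("alpha:", alpha, "beta:", beta, "SSE:", sse(xs, ys, alpha, beta))
```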
General Linear Model • For real-world data mining, the number of samples may be several millions. Because the computational cost of linear regression grows rapidly at that scale, it is necessary to find modifications or approximations of the regression, or to use totally different regression methods. • Example: Polynomial regression can be modeled by adding polynomial terms to the basic linear model. (p. 102) • The major effort of a user is in identifying the relevant independent variables and in selecting the regression model. • Sequential search approach • Combinatorial approach
Quality of linear regression • Correlation coefficient r
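The slide does not spell out the formula; a common choice, assumed here, is the Pearson correlation coefficient between the input and output values. The function name and test values below are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# r close to +1 or -1 indicates a strong linear relationship; r near 0 a weak one.
print(correlation([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]))
```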
5.5 Analysis of Variance (ANOVA) • ANOVA is a method of identifying which of the β's in a linear regression model are non-zero. • Residuals: • Ri = yi − f(xi) • The variance σ² is estimated by S², the sum of squared residuals divided by the residual degrees of freedom (number of samples minus number of fitted parameters) • S² allows us to compare different linear models • Only if the fitted model omits inputs that it ought to include will S² tend to be significantly larger than σ²
ANOVA algorithm • First start with all inputs and compute S² • Omit inputs from the model one by one (this means forcing the corresponding βi to 0) • If we omit a useful input, the new estimate S² will significantly increase • If we omit a redundant input, the new estimate S² will not change much • F-ratio (example on p. 105) • Multivariate analysis: The output is a vector. Allow correlation between outputs. (MANOVA)
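The omit-one-input procedure can be sketched with NumPy. Everything specific here is an assumption for illustration: the synthetic data (three inputs, of which the third does not influence the output), the helper name `residual_variance`, and the choice of dividing the residual sum of squares by the number of samples minus the number of fitted parameters. The F-ratio test itself is not shown.

```python
import numpy as np

def residual_variance(X, y):
    """S^2 of a least-squares fit with intercept: RSS / (samples - fitted parameters)."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    residuals = y - A @ coef
    return float(residuals @ residuals) / (len(y) - A.shape[1])

# Synthetic data: the output depends on inputs 0 and 1 only; input 2 is redundant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

s2_full = residual_variance(X, y)
# Refit with each input omitted in turn (equivalent to forcing its beta to 0).
s2_without = {j: residual_variance(np.delete(X, j, axis=1), y) for j in range(3)}

print("S^2 with all inputs:", s2_full)
for j, s2 in s2_without.items():
    print(f"S^2 without input {j}:", s2)
```

Omitting input 0 or 1 should raise S² sharply, while omitting input 2 should leave it almost unchanged.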
5.6 Logistic Regression • Linear regression is used to model continuous-value functions. • Generalized regression models extend linear regression to model categorical response variables. • Logistic regression models the log-odds of some (YES/NO) event occurring as a linear function of a set of predictor (input) variables. • It estimates the probability p that the dependent (output) variable will have a given value. • If p is greater than 0.5, then the prediction is closer to YES. • It supports a more general input data set by allowing both categorical and quantitative inputs.
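Logistic Regression • P(yj=1) = pj, P(yj=0) = 1 − pj • The linear logistic model: $\operatorname{logit}(p_j) = \log\frac{p_j}{1 - p_j} = \alpha + \beta_1 x_{1j} + \beta_2 x_{2j} + \dots + \beta_n x_{nj}$ • The logit transformation prevents pj from going out of the range [0, 1] • Example (p. 107) • Suppose logit(p) = 1.5 − 0.6 x1 + 0.4 x2 − 0.3 x3 • With (x1, x2, x3) = (1, 0, 1), logit(p) = 1.5 − 0.6 − 0.3 = 0.6 • p = e^0.6 / (1 + e^0.6) ≈ 0.65 • Y=1 is more probable than Y=0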
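The arithmetic of this example can be checked by evaluating the inverse of the standard logit, log(p / (1 − p)), for the stated coefficients and inputs. The variable and function names are hypothetical; the coefficients and the input vector come from the slide.

```python
import math

def inverse_logit(z):
    """Map a logit value z = log(p / (1 - p)) back to the probability p."""
    return 1.0 / (1.0 + math.exp(-z))

intercept = 1.5
coefficients = [-0.6, 0.4, -0.3]   # weights of x1, x2, x3 from the slide
x = [1, 0, 1]                      # input vector from the slide

z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
p = inverse_logit(z)

print("logit(p):", z)
print("p:", p)
print("prediction closer to:", "YES (Y=1)" if p > 0.5 else "NO (Y=0)")
```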
5.7 Log-Linear Models • Log-linear modeling is a generalized linear model where the output Yj is assumed to have a Poisson distribution, with expected value μj • It is used to analyze the relationship between categorical (or quantitative) variables • It approximates discrete, multi-dimensional probability distributions
Log-Linear Models • log(μj) is assumed to be a linear function of the inputs • We need to find which β's are 0 • If βi is 0, then Xi is not related to other input variables • Correspondence Analysis: • Log-linear models when no output variable is defined • Use contingency tables to answer the question: Is there any relationship between the attributes?
Correspondence Analysis • Transform a given contingency table into a table with expected values, under the assumption that the input variables are independent • Compare these two matrices using the squared distance measure and the chi-square test • Example: p. 108, p. 111
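A short NumPy sketch of these two steps follows. The 2×2 contingency table is hypothetical and is not the example from p. 108 or p. 111. Expected counts under independence are computed as (row total × column total) / grand total, and the two tables are compared with the chi-square statistic.

```python
import numpy as np

# Hypothetical 2x2 contingency table of observed counts.
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
total = observed.sum()

# Expected counts if the row and column attributes were independent.
expected = row_totals @ col_totals / total

# Chi-square statistic: sum of squared distances scaled by the expected counts.
chi_square = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print("expected:\n", expected)
print("chi-square:", chi_square, "degrees of freedom:", dof)
```

The statistic is then compared with the chi-square critical value for the given degrees of freedom (3.841 at the 0.05 level for one degree of freedom); a larger value suggests the attributes are not independent.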
5.8 Linear Discriminant Analysis • Linear Discriminant Analysis (LDA) is for classification problems where the dependent variable is categorical (nominal or ordinal) and the independent variables are metric • LDA constructs a discriminant function that yields different scores when computed with data from different output classes (Fig. 5.3)
Linear Discriminant Analysis • LDA tries to find a set of weight values wi that maximizes the ratio of the between-class to the within-class variance of the discriminant score for a pre-classified set of samples. The resulting discriminant function is then used to predict the class of new samples. • Cutting scores serve as the criteria against which each individual discriminant score is judged. Their choice depends on the distribution of samples in classes. • Let zA and zB be the mean discriminant scores of pre-classified samples from classes A and B. • If the two classes of samples are of equal size and are uniformly distributed, the cutting score is the midpoint: $z_{\text{cut}} = (z_A + z_B)/2$ • If the two classes of samples are not of equal size, the cutting score is a weighted average of zA and zB, with weights given by the class sizes • Multiple discriminant analysis (p. 113)
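A two-class sketch of this idea, using Fisher's classical solution for the weights, is given below. The synthetic data, the function names, and the midpoint cutting score for two equal-size classes are assumptions for illustration; the multi-class case from p. 113 is not covered.

```python
import numpy as np

def fit_lda(A, B):
    """Fisher discriminant weights and midpoint cutting score for two classes."""
    mean_a, mean_b = A.mean(axis=0), B.mean(axis=0)
    # Pooled within-class scatter matrix.
    S_w = (A - mean_a).T @ (A - mean_a) + (B - mean_b).T @ (B - mean_b)
    # Weights maximizing between-class over within-class variance of the score.
    w = np.linalg.solve(S_w, mean_a - mean_b)
    z_a, z_b = float(mean_a @ w), float(mean_b @ w)   # mean discriminant scores
    z_cut = (z_a + z_b) / 2.0                         # equal-size classes assumed
    return w, z_cut

def predict(x, w, z_cut):
    """Class A scores lie above the cutting score because w points from B to A."""
    return "A" if float(x @ w) > z_cut else "B"

# Synthetic, well-separated classes of equal size.
rng = np.random.default_rng(0)
A = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
B = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(50, 2))

w, z_cut = fit_lda(A, B)
print("weights:", w, "cutting score:", z_cut)
print(predict(np.array([0.5, -0.2]), w, z_cut), predict(np.array([3.8, 4.1]), w, z_cut))
```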