DATA REDUCTION (Lecture# 03) Dr. Tahseen Ahmed Jilani Assistant Professor Member IEEE-CIS, IFSA, IRSS Department of Computer Science University of Karachi
Chapter Objectives • Identify the differences among dimensionality-reduction techniques based on features, cases, and reduction of values. • Explain the advantages of dimensionality reduction as a preprocessing step performed prior to applying the data-mining techniques. • Understand the basic principles of feature-selection and feature-composition tasks using the corresponding statistical methods. • Apply principal component analysis and entropy-based techniques and compare them. Dr. Tahseen A. Jilani-DCS-Uok
Data Reduction • The data-preprocessing steps discussed so far are sufficient for moderate data sets. • For really large data sets, there is an increased likelihood that an intermediate, additional step, data reduction, should be performed prior to applying the data-mining techniques. • Although large data sets have the potential for better mining results, there is no guarantee that they will yield better knowledge than small data sets. • For large databases (data sets), it is possible that huge amounts of data carry relatively little information (knowledge). Dr. Tahseen A. Jilani-DCS-Uok
Types of Data Reduction • The three basic operations in a data-reduction process are: delete a column, delete a row, and reduce the number of values in a column. • These operations attempt to preserve the character of the original data by deleting only data that are nonessential. • There are other operations that reduce dimensions, but the new data are unrecognizable when compared to the original data set; these operations are mentioned here only briefly because they are highly application-dependent. Dr. Tahseen A. Jilani-DCS-Uok
Why Data Reduction • Computing time: simpler data mean less computation. • Predictive/descriptive accuracy: this measures how well the data are summarized and generalized into the model. We generally expect that, by using only relevant features, a data-mining algorithm can learn not only faster but also with higher accuracy. Irrelevant data may mislead the learning process and the final model, while redundant data may complicate the task of learning and cause unexpected data-mining results. • Representation: if the simplicity of representation improves, a relatively small decrease in accuracy may be tolerable. A balanced view between accuracy and simplicity is necessary, and dimensionality reduction is one of the mechanisms for obtaining this balance. Dr. Tahseen A. Jilani-DCS-Uok
Dimension Reduction • The main question is whether some of the prepared and preprocessed data can be discarded without sacrificing the quality of results (principle of parsimony). • Can the prepared data be reviewed and a subset found in a reasonable amount of time and space? • If the complexity of algorithms for data reduction increases exponentially, then there is little to gain by reducing dimensions in big data. Dr. Tahseen A. Jilani-DCS-Uok
Dimensions of Large Data Sets • The choice of data representation, and the selection, reduction, or transformation of features, is probably the most important issue that determines the quality of a data-mining solution. • A large number of features can make the available data samples relatively insufficient for mining. In practice, the number of features can be as many as several hundred. • If we have only a few hundred samples for analysis, dimensionality reduction is required for any reliable model to be mined or to be of any practical use. • On the other hand, data overload because of high dimensionality can make some data-mining algorithms non-applicable, and the only solution is again a reduction of data dimensions. Dr. Tahseen A. Jilani-DCS-Uok
Main Objectives in Data Reduction • The three basic operations in a data-reduction process are • Delete a column (Principal Component Analysis) • Delete a row (Profile Analysis, Self Organization Analysis, Classification and Clustering) • Reduce the number of values in a column (smooth a feature). • These operations attempt to preserve the character of the original data by deleting data that are nonessential Dr. Tahseen A. Jilani-DCS-Uok
ENTROPY MEASURE FOR RANKING FEATURES • A method for unsupervised feature selection or ranking based on an entropy measure is a relatively simple technique, but with a large number of features its complexity increases significantly. • The basic assumption is that all samples are given as vectors of feature values without any classification of output samples. • The approach is based on the observation that removing an irrelevant feature, a redundant feature, or both from a set may not change the basic characteristics of the data set. • The idea is to remove as many features as possible while still maintaining the level of distinction between the samples in the data set as if no features had been removed. Dr. Tahseen A. Jilani-DCS-Uok
ENTROPY MEASURE FOR RANKING FEATURES • Algorithm • The algorithm is based on a similarity measure S that is in inverse proportion to the distance D between two n-dimensional samples. • The distance measure D is small for close samples (close to zero) and large for distinct pairs (close to one). When the features are numeric, the similarity measure S of two samples can be defined as
Sij = e^(-α·Dij)
where Dij is the distance between samples xi and xj, and α is a parameter mathematically expressed as
α = -ln(0.5) / D
Dr. Tahseen A. Jilani-DCS-Uok
ENTROPY MEASURE FOR RANKING FEATURES (Continue) • D is the average distance among samples in the data set; hence, α is determined by the data. In a successfully implemented practical application, however, a constant value α = 0.5 was used. The normalized Euclidean distance measure is used to calculate the distance Dij between two samples xi and xj:
Dij = [ Σ k=1..n ( (xik - xjk) / (max(k) - min(k)) )^2 ]^(1/2)
• where n is the number of dimensions and max(k) and min(k) are the maximum and minimum values used for normalization of the k-th dimension. • Not all features are numeric. The similarity for nominal variables is measured directly using the Hamming distance:
Sij = ( Σ k=1..n |xik = xjk| ) / n
Dr. Tahseen A. Jilani-DCS-Uok
ENTROPY MEASURE FOR RANKING FEATURES (Continue) where |xik = xjk| is 1 if xik = xjk, and 0 otherwise. • The total number of variables is equal to n. For mixed data, we can discretize numeric values (binning) and transform numeric features into nominal features before applying this similarity measure. • Figure 3.1 is an example of a simple data set with three categorical features; the corresponding similarities are given in Table 3.1. Dr. Tahseen A. Jilani-DCS-Uok
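As a rough illustration of these two similarity measures, the following Python sketch computes them for small data sets. It assumes numeric samples arrive as rows of a NumPy array and nominal samples as rows of equal-length label sequences; the function names and the fallback for constant columns are choices made for this example only.

```python
import numpy as np

def numeric_similarity(X, alpha=None):
    """Similarity matrix Sij = exp(-alpha * Dij) using a normalized
    Euclidean distance; alpha defaults to -ln(0.5) / mean distance."""
    X = np.asarray(X, dtype=float)
    rng = X.max(axis=0) - X.min(axis=0)
    rng[rng == 0] = 1.0                              # guard against constant columns
    Xn = X / rng                                      # scale each dimension by its range
    D = np.sqrt(((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(axis=2))
    if alpha is None:
        iu = np.triu_indices(len(X), k=1)
        alpha = -np.log(0.5) / D[iu].mean()           # alpha determined by the data
    return np.exp(-alpha * D)

def nominal_similarity(X):
    """Similarity for nominal features: fraction of matching values per pair."""
    X = np.asarray(X, dtype=object)
    n = X.shape[1]
    matches = (X[:, None, :] == X[None, :, :])        # elementwise Hamming agreement
    return matches.sum(axis=2) / n
```

Applied to the categorical samples of Figure 3.1, nominal_similarity would produce a table of pairwise similarities of the kind shown in Table 3.1.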
ENTROPY MEASURE FOR RANKING FEATURES (Continue) A tabular representation of similarity measures S Dr. Tahseen A. Jilani-DCS-Uok
ENTROPY MEASURE FOR RANKING FEATURES (Continue) • The distribution of all similarities (distances) for a given data set is a characteristic of the organization and order of data in an n-dimensional space. • This organization may be more or less ordered. Changes in the level of order in a data set are the main criterion for the inclusion or exclusion of a feature from the feature set; these changes can be measured by entropy. • From information theory, we know that entropy is a global measure: it is lower for ordered configurations and higher for disordered configurations. • The proposed technique compares the entropy measure for a given data set before and after removal of a feature. Dr. Tahseen A. Jilani-DCS-Uok
Entropy function • If the two measures are close, then the reduced set of features will satisfactorily approximate the original set. For a data set of N samples, the entropy measure is
E = - Σ i=1..N-1 Σ j=i+1..N ( Sij·log(Sij) + (1 - Sij)·log(1 - Sij) )
• where Sij is the similarity between samples xi and xj. This measure is computed in each iteration as a basis for deciding the ranking of features. We rank features by gradually removing the feature that is least important for maintaining the order in the configurations of data. The steps of the algorithm are based on sequential backward ranking, and they have been successfully tested on several real-world applications. Dr. Tahseen A. Jilani-DCS-Uok
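A direct transcription of this entropy measure into Python might look as follows; it is a minimal sketch that reuses a precomputed similarity matrix (for example from the earlier numeric_similarity sketch) and clips similarities away from 0 and 1 so the logarithms stay finite, a detail not covered in the slides.

```python
import numpy as np

def entropy_measure(S, eps=1e-12):
    """E = -sum over i < j of (Sij*log(Sij) + (1-Sij)*log(1-Sij))."""
    iu = np.triu_indices(S.shape[0], k=1)    # each pair (i, j), i < j, counted once
    s = np.clip(S[iu], eps, 1 - eps)         # keep log() finite at 0 and 1
    return -np.sum(s * np.log(s) + (1 - s) * np.log(1 - s))
```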
Entropy function Algorithm
1. Start with the initial full set of features F.
2. For each feature f in F, remove f from F to obtain a subset Ff. Find the difference between the entropy for F and the entropy for each Ff.
3. Let fk be the feature for which the difference between the entropy for F and the entropy for Ffk is minimum.
4. Update the set of features: F = F - {fk}, where - is a set-difference operation. In our example, if the difference (EF - EF-F1) is minimum, then the reduced set of features is {F2, F3}, and F1 goes to the bottom of the ranked list.
5. Repeat steps 2-4 until there is only one feature in F. Dr. Tahseen A. Jilani-DCS-Uok
Entropy function Algorithm • The ranking process may be stopped in any iteration and may be transformed into a process of selecting features by using the additional criterion mentioned in step 4. • This criterion is that the difference between the entropy for F and the entropy for Ff should be less than an approved threshold value in order to remove feature fk from the set F. • Computational complexity is the basic disadvantage of this algorithm; a parallel implementation could overcome the problems of working sequentially with large data sets and large numbers of features. A sketch of the sequential backward ranking loop is given below. Dr. Tahseen A. Jilani-DCS-Uok
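Below is a minimal sketch of the sequential backward ranking loop referred to above. It assumes numeric features and the helper functions numeric_similarity and entropy_measure from the earlier sketches; the function name rank_features_by_entropy is illustrative rather than taken from the source.

```python
import numpy as np

def rank_features_by_entropy(X):
    """Rank features from least to most important by sequential backward
    elimination: repeatedly drop the feature whose removal changes the
    data-set entropy the least."""
    remaining = list(range(X.shape[1]))            # indices of features still in F
    ranking = []                                   # least important feature first
    while len(remaining) > 1:
        E_F = entropy_measure(numeric_similarity(X[:, remaining]))
        diffs = []
        for f in remaining:
            subset = [g for g in remaining if g != f]
            E_Ff = entropy_measure(numeric_similarity(X[:, subset]))
            diffs.append(abs(E_F - E_Ff))          # entropy change when f is removed
        f_k = remaining[int(np.argmin(diffs))]     # feature with the minimum change
        ranking.append(f_k)                        # f_k goes to the bottom of the list
        remaining.remove(f_k)
    ranking.extend(remaining)                      # the last, most important feature
    return ranking
```

Stopping the loop early, as soon as the smallest entropy difference exceeds a chosen threshold, turns the ranking into the feature-selection variant described above.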
Principal Component Analysis • Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are • Data reduction • Interpretation • Although p variables are required to describe the complete variability of the system, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measurements on p variables, is reduced to a data set consisting of n measurements on k principal components. Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. • Analysis of principal components also provides intermediate steps in much larger investigations. For example, principal components may be inputs to a multiple regression model, a cluster analysis, or a factor analysis. Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • Algebraically, principal components are particular linear combinations of the p random variables X1, X2, ..., Xp. • Geometrically, these linear combinations represent the selection of a new coordinate system obtained by rotating the original system with X1, X2, ..., Xp as the coordinate axes. The new axes represent the directions of maximum variability and provide a simpler and more parsimonious description of the covariance structure. • The principal components depend solely on the covariance matrix Σ (or the correlation matrix ρ) of X1, X2, ..., Xp. Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • An important characteristic is that principal components do not require the assumption of a multivariate normal distribution. If the data do follow a multivariate normal distribution, however, the components can additionally be interpreted in terms of constant-density contours, and inferences can be made from the sample principal components. • Let the random vector X' = [X1, X2, ..., Xp] have the covariance matrix Σ with eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0. • Consider the linear combinations
Y1 = a1'X = a11X1 + a12X2 + ... + a1pXp
Y2 = a2'X = a21X1 + a22X2 + ... + a2pXp
...
Yp = ap'X = ap1X1 + ap2X2 + ... + appXp
Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • Then, we can obtain
Var(Yi) = ai'Σai, i = 1, 2, ..., p
Cov(Yi, Yk) = ai'Σak, i, k = 1, 2, ..., p
• The principal components are those uncorrelated linear combinations Y1, Y2, ..., Yp whose variances are as large as possible. • The first principal component is the linear combination with maximum variance; that is, it maximizes Var(Y1) = a1'Σa1. • It is clear that Var(Y1) = a1'Σa1 can be increased by multiplying a1 by some constant. To eliminate this indeterminacy, it is convenient to restrict attention to coefficient vectors of unit length. We therefore define: Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • First principal component = the linear combination a1'X that maximizes Var(a1'X) subject to a1'a1 = 1. • Second principal component = the linear combination a2'X that maximizes Var(a2'X) subject to a2'a2 = 1 and Cov(a1'X, a2'X) = 0. • At the ith step, the ith principal component is the linear combination ai'X that maximizes Var(ai'X) subject to ai'ai = 1 and Cov(ai'X, ak'X) = 0 for k < i. Important results: if (λ1, e1), (λ2, e2), ..., (λp, ep) are the eigenvalue-eigenvector pairs of Σ, then the ith principal component is Yi = ei'X with Var(Yi) = λi, and σ11 + σ22 + ... + σpp = λ1 + λ2 + ... + λp (the total variance). Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • Proportion of total population variance due to the kth principal component = λk / (λ1 + λ2 + ... + λp), k = 1, 2, ..., p. • If Y1 = e1'X, Y2 = e2'X, ..., Yp = ep'X are the principal components obtained from the covariance matrix Σ, then
ρ(Yi, Xk) = eik·√λi / √σkk, i, k = 1, 2, ..., p
are the correlation coefficients between the components Yi and the variables Xk. Here (λi, ei) are the eigenvalue-eigenvector pairs of Σ. Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) Example • Suppose the random variables X1, X2, and X3 have the covariance matrix
Σ = [ 1  -2   0
     -2   5   0
      0   0   2 ]
• It may be verified that the eigenvalue-eigenvector pairs are
λ1 = 5.83, e1' = [0.383, -0.924, 0]
λ2 = 2.00, e2' = [0, 0, 1]
λ3 = 0.17, e3' = [0.924, 0.383, 0]
• Therefore, the principal components become
Y1 = e1'X = 0.383X1 - 0.924X2
Y2 = e2'X = X3
Y3 = e3'X = 0.924X1 + 0.383X2
Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • The variable X3 is itself one of the principal components, because it is uncorrelated with the other two variables. This implies Y2 = X3 with Var(Y2) = λ2 = 2. • Furthermore, Var(Y1) = λ1 = 5.83 and Var(Y3) = λ3 = 0.17, so the proportion of the total variance accounted for by the first principal component is
λ1 / (λ1 + λ2 + λ3) = 5.83 / 8.00 = 0.73
Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) • Therefore, the first two principal components account for 98% of the total variance. In this case the components Y1 and Y2 could replace the original three variables with little loss of information. The correlations of the original variables with the principal components are
ρ(Y1, X1) = e11·√λ1 / √σ11 = 0.383(2.41)/1 = 0.925
ρ(Y1, X2) = e12·√λ1 / √σ22 = -0.924(2.41)/√5 = -0.998
ρ(Y2, X3) = e23·√λ2 / √σ33 = √2 / √2 = 1
As Y3 is neglected, there is no need to calculate its correlations. Dr. Tahseen A. Jilani-DCS-Uok
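The figures in this worked example can be checked numerically. The sketch below assumes the covariance matrix given above and uses NumPy's eigendecomposition; note that numerical routines may return eigenvectors with flipped signs, which flips the signs of the corresponding correlations.

```python
import numpy as np

# Covariance matrix from the example above
Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])

lam, E = np.linalg.eigh(Sigma)            # eigenvalues (ascending) and eigenvectors
order = np.argsort(lam)[::-1]             # reorder from largest to smallest
lam, E = lam[order], E[:, order]          # columns of E are e1, e2, e3

print("eigenvalues:", np.round(lam, 2))                        # [5.83 2.   0.17]
print("proportion of first two:",
      round((lam[0] + lam[1]) / lam.sum(), 2))                 # 0.98

# Correlations rho(Yi, Xk) = eik * sqrt(lambda_i) / sqrt(sigma_kk)
sigma_kk = np.diag(Sigma)
rho = E * np.sqrt(lam) / np.sqrt(sigma_kk)[:, None]            # rho[k, i] = rho(Yi, Xk)
print("rho(Y1, X1), rho(Y1, X2):",
      np.round(rho[0, 0], 3), np.round(rho[1, 0], 3))          # 0.925, -0.998 (up to sign)
```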
Principal Component Analysis (Continue) The number of Principal Components • There is always a question of how many components to retain. There is no definitive answer to this question. Things to consider include • The amount of total sample variance explained • The relative sizes of eigenvalues (or say the variance of the sample components) • The subject-matter interpretations of the components. Dr. Tahseen A. Jilani-DCS-Uok
Principal Component Analysis (Continue) Scree Plot • A useful visual aid for determining an appropriate number of principal components is a scree plot. With the eigenvalues ordered from largest to smallest, a scree plot is a plot of the eigenvalue λi versus its number i. To determine the appropriate number of components, we look for an elbow (bend) in the scree plot, after which the remaining eigenvalues are relatively small and all about the same size. Dr. Tahseen A. Jilani-DCS-Uok
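A scree plot for the worked example takes only a few lines; this sketch uses matplotlib and the three eigenvalues computed earlier.

```python
import matplotlib.pyplot as plt
import numpy as np

# Eigenvalues from the worked example, ordered largest to smallest
lam = np.array([5.83, 2.00, 0.17])
idx = np.arange(1, len(lam) + 1)          # component number i = 1, 2, 3

plt.plot(idx, lam, marker="o")
plt.xticks(idx)
plt.xlabel("Component number i")
plt.ylabel("Eigenvalue (variance of component)")
plt.title("Scree plot")
plt.show()                                 # look for the elbow (bend) in the curve
```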
SPSS FACTOR ANALYSIS OF CUSTOMER.SAV Dr. Tahseen A. Jilani-DCS-Uok
Factor Analysis • Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables. • Factor analysis is often used in data reduction to identify a small number of factors that explain most of the variance that is observed in a much larger number of manifest variables. • Factor analysis can also be used to generate hypotheses regarding causal mechanisms or to screen variables for subsequent analysis (for example, to identify collinearity prior to performing a linear regression analysis). Dr. Tahseen A. Jilani-DCS-Uok
Factor Analysis (Continue) • The factor analysis procedure offers a high degree of flexibility: • Seven methods of factor extraction are available. • Five methods of rotation are available, including direct oblimin and promax for non-orthogonal rotations. • Three methods of computing factor scores are available, and scores can be saved as variables for further analysis. • The essential purpose of factor analysis is to describe, if possible, the covariance (correlation) relationships among many variables in terms of a few underlying, but unobserved, random quantities called factors. Dr. Tahseen A. Jilani-DCS-Uok
Factor Analysis (Continue) • Factor analysis can be considered an extension of principal component analysis. Both can be viewed as attempts to approximate the covariance matrix. However, the approximation based on the factor analysis is more elaborate. • The primary question in factor analysis is whether the data are consistent with a prescribed structure. Dr. Tahseen A. Jilani-DCS-Uok
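For readers who want to experiment outside SPSS, a comparable extraction can be sketched with scikit-learn. This is only an illustration on synthetic data, not the SPSS procedure described above; the choice of two factors, the noise level, and the sample size are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data: 6 observed variables driven by 2 latent common factors plus noise
F = rng.normal(size=(500, 2))                       # unobserved common factors
L = rng.normal(size=(2, 6))                         # "true" loadings for the simulation
X = F @ L + 0.3 * rng.normal(size=(500, 6))         # observed variables

fa = FactorAnalysis(n_components=2)                 # extract two factors
scores = fa.fit_transform(X)                        # factor scores, one row per sample

print("estimated loading matrix L (p x m):\n", np.round(fa.components_.T, 2))
print("specific (error) variances:", np.round(fa.noise_variance_, 2))
```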
The Orthogonal Factor Model • The observed random vector X, with p components, has mean μ and covariance matrix Σ. • The factor model postulates that X is linearly dependent upon a few unobserved random variables F1, F2, ..., Fm, called common factors, and p additional sources of variation ε1, ε2, ..., εp, called errors or, sometimes, specific factors (these include measurement errors). In particular, the factor analysis model is Dr. Tahseen A. Jilani-DCS-Uok
The Orthogonal Factor Model (Continue) • In particular, the factor analysis model is
X1 - μ1 = l11F1 + l12F2 + ... + l1mFm + ε1
X2 - μ2 = l21F1 + l22F2 + ... + l2mFm + ε2
...
Xp - μp = lp1F1 + lp2F2 + ... + lpmFm + εp
• or, in matrix notation,
X - μ = L F + ε
(p×1) = (p×m)(m×1) + (p×1)
• The coefficient lij is called the loading of the ith variable on the jth factor, so the matrix L is the matrix of factor loadings. Dr. Tahseen A. Jilani-DCS-Uok
The Orthogonal Factor Model (Continue) • Note that the ith specific factor εi is associated only with the ith response Xi. • Here the p deviations X1 - μ1, X2 - μ2, ..., Xp - μp are expressed in terms of p + m unobservable random variables F1, ..., Fm, ε1, ..., εp. Dr. Tahseen A. Jilani-DCS-Uok
VALUES REDUCTION (BINNING) • A reduction in the number of discrete values for a given feature is based on the second set of techniques in the data-reduction phase: the feature-discretization techniques. • The task is to discretize the values of continuous features into a small number of intervals, where each interval is mapped to a discrete symbol. • The benefits of these techniques are a simplified data description and easy-to-understand data and final data-mining results. Also, many more data-mining techniques become applicable when feature values are discrete. An "old-fashioned" discretization is made manually, based on our a priori knowledge about the feature. Dr. Tahseen A. Jilani-DCS-Uok
VALUES REDUCTION (BINNING) Example • Example: binning the age feature. At the beginning of a data-mining process, the continuous (measurable) values of an age feature (between 0 and 150 years) may be classified into categorical segments: child, adolescent, adult, middle age, and elderly. The cut-off points are subjectively defined. Two main questions exist about this reduction process: • What are the cut-off points? • How does one select the representatives of the intervals? Dr. Tahseen A. Jilani-DCS-Uok
VALUES REDUCTION (BINNING) Note: A reduction in feature values usually is not harmful for real-world data-mining applications, and it leads to a major decrease in computational complexity. Dr. Tahseen A. Jilani-DCS-Uok
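As a small illustration of the manual ("old-fashioned") binning of the age feature described above, the sketch below maps ages into the five categorical segments; the cut-off points 12, 19, 40, and 60 are assumed for this example and are exactly the kind of subjective choice the slides mention.

```python
import numpy as np

ages = np.array([3, 15, 27, 45, 71, 88])

# Subjective cut-off points and the labels of the resulting intervals
cut_points = [12, 19, 40, 60]
labels = ["child", "adolescent", "adult", "middle age", "elderly"]

bins = np.digitize(ages, cut_points)        # interval index for each age value
categories = [labels[b] for b in bins]
print(list(zip(ages.tolist(), categories)))
# [(3, 'child'), (15, 'adolescent'), (27, 'adult'), (45, 'middle age'), (71, 'elderly'), ...]
```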
Feature Discretization: CHI-MERGE Technique • An automated discretization algorithm that analyzes the quality of multiple intervals for a given feature by using the χ2 statistic. • The algorithm determines the similarity between the distributions of data in two adjacent intervals based on the output classification of the samples. • If the null hypothesis (that the class distribution is independent of the two adjacent intervals) cannot be rejected, the two consecutive intervals are merged to form a single, larger interval; the intervals are assumed to be non-overlapping. A small sketch of the procedure is given below. Dr. Tahseen A. Jilani-DCS-Uok
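The following is a minimal sketch of the ChiMerge idea under some assumptions not spelled out in the slides: a numeric feature x with class labels y, one initial interval per distinct value, and a user-chosen χ2 threshold (the default 2.71 corresponds to a 0.10 significance level with one degree of freedom, i.e., two classes). The helper names are illustrative.

```python
import numpy as np

def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic for the 2 x k contingency table formed by the
    class counts of two adjacent intervals."""
    table = np.array([counts_a, counts_b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    expected[expected == 0] = 1e-9                 # avoid division by zero
    return float(((table - expected) ** 2 / expected).sum())

def chimerge(x, y, chi2_threshold=2.71):
    """Merge adjacent intervals of x while the smallest chi-square value
    between neighbouring intervals stays below the threshold."""
    classes = sorted(set(y))
    # Start with one interval per distinct value, holding per-class counts
    intervals = []
    for v in sorted(set(x)):
        counts = [sum(1 for xi, yi in zip(x, y) if xi == v and yi == c)
                  for c in classes]
        intervals.append({"low": v, "counts": counts})
    while len(intervals) > 1:
        chis = [chi2_adjacent(intervals[i]["counts"], intervals[i + 1]["counts"])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] >= chi2_threshold:              # all neighbours differ enough: stop
            break
        merged = {"low": intervals[i]["low"],      # merge intervals i and i + 1
                  "counts": [a + b for a, b in
                             zip(intervals[i]["counts"], intervals[i + 1]["counts"])]}
        intervals[i:i + 2] = [merged]
    return [iv["low"] for iv in intervals]         # lower bounds of the final intervals
```

For example, chimerge([1, 2, 3, 7, 8, 9], ['a', 'a', 'a', 'b', 'b', 'b']) keeps merging low-χ2 neighbours and ends with two intervals whose lower bounds are 1 and 7.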
References • Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2003. • Richard A. Johnson and Dean W. Wichern, Applied Multivariate Statistical Analysis, Pearson Education, Low Price Edition, 2005. Dr. Tahseen A. Jilani-DCS-Uok
Thank You Dr. Tahseen A. Jilani-DCS-Uok