
Exploratory Data Analysis Continued


Presentation Transcript


  1. Exploratory Data Analysis Continued

  2. Exploratory Data Analysis • A set of techniques rather than a single method • The flexibility to respond to the patterns revealed by successive iterations of the discovery process is an important attribute • Free to take many paths in revealing mysteries in the data • Emphasizes visual representations and graphical techniques over summary statistics

  3. EDA • Summary statistics may obscure or conceal the underlying structure of the data • When numerical summaries are used exclusively and accepted without visual inspection, the selection of confirmatory models may be based on flawed assumptions and may produce erroneous conclusions

  4. Previously Discussed Techniques for Displaying Data • Frequency Tables • Bar Charts and Histograms • Pie Charts • Stem-and-Leaf Displays • Boxplots

  5. Resistant Statistics • Example: data set = [5, 6, 6, 7, 7, 7, 8, 8, 9] • The mean is 7 and the sample standard deviation about 1.22 • Replace the 9 with 90 and the mean becomes 16 and the standard deviation about 27.77 • Changing only one of the nine values disturbs the location and spread summaries to the point where they no longer represent the other eight values; both the mean and the standard deviation are nonresistant statistics • The median remains at 7, and the lower and upper quartiles stay at 6 and 8, respectively; these are resistant statistics
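The example above can be verified directly with Python's standard statistics module; a quick sketch:

```python
import statistics

original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
outlier = [5, 6, 6, 7, 7, 7, 8, 8, 90]  # the 9 replaced with 90

# Nonresistant statistics: mean and standard deviation shift dramatically
print(statistics.mean(original))             # 7
print(statistics.mean(outlier))              # 16
print(round(statistics.stdev(original), 2))  # 1.22
print(round(statistics.stdev(outlier), 2))   # 27.77

# Resistant statistics: median and quartiles do not move at all
print(statistics.median(original))  # 7
print(statistics.median(outlier))   # 7
print(statistics.quantiles(original, n=4, method="inclusive"))  # [6.0, 7.0, 8.0]
print(statistics.quantiles(outlier, n=4, method="inclusive"))   # [6.0, 7.0, 8.0]
```

One point in nine changed, yet the median and quartiles are untouched, which is exactly what "resistant" means.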

  6. Visual Techniques of EDA • Gain insight into the data • The more common ways of summarizing location, spread, and shape • Use resistant statistics • From these we can make decisions on test selection and on whether the data should be transformed or reexpressed before further analysis

  7. More Techniques • The last section focused primarily on single-variable distributions • This section inspects relationships between and among variables

  8. Crosstabulation • A technique for comparing two classification variables • Uses tables whose rows and columns correspond to the levels or values of each variable's categories

  9. Example of a Crosstabulation

                        Overseas Assignment
                        YES      NO       Row Total
  Gender   Male         22       40       62
           Row %        35.5     64.5     62.0
           Col %        78.6     55.6
           Tot %        22.0     40.0
           Female       6        32       38
           Row %        15.8     84.2     38.0
           Col %        21.4     44.4
           Tot %        6.0      32.0
  Column Total          28       72       100
                        28.0     72.0     100.0
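The three kinds of percentages in the table come straight from the raw counts. A minimal pure-Python sketch using the counts above (the helper names are ours):

```python
# Raw counts from the crosstab: keys are (gender, overseas assignment)
counts = {
    ("Male", "YES"): 22, ("Male", "NO"): 40,
    ("Female", "YES"): 6, ("Female", "NO"): 32,
}

row_totals = {g: sum(v for (gg, _), v in counts.items() if gg == g)
              for g in ("Male", "Female")}
col_totals = {a: sum(v for (_, aa), v in counts.items() if aa == a)
              for a in ("YES", "NO")}
grand_total = sum(counts.values())

def row_pct(g, a):   # percent within the row (the gender)
    return round(100 * counts[(g, a)] / row_totals[g], 1)

def col_pct(g, a):   # percent within the column (the assignment status)
    return round(100 * counts[(g, a)] / col_totals[a], 1)

def tot_pct(g, a):   # percent of the grand total
    return round(100 * counts[(g, a)] / grand_total, 1)

print(row_pct("Male", "YES"))  # 35.5
print(col_pct("Male", "YES"))  # 78.6
print(tot_pct("Male", "YES"))  # 22.0
```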

  10. The Use of Percentages • Simplify the data by reducing all numbers to a range of 0 to 100 • Translate the data into standard form, with a base of 100, for relative comparisons • A raw count (such as 28) has little value unless we know the base it came from; out of a sample of 100 it is 28% • While this is useful, it is even more useful when the research calls for a comparison of several distributions of the data

  11. Comparison of Crosstabulations

                        Overseas Assignment
                        YES      NO       Row Total
  Gender   Male         225      675      900
           Row %        25.0     75.0     60.0
           Col %        62.5     59.2
           Tot %        15.0     45.0
           Female       135      465      600
           Row %        22.5     77.5     40.0
           Col %        37.5     40.8
           Tot %        9.0      31.0
  Column Total          360      1140     1500
                        24.0     76.0     100.0

  12. Use of Percentages • Comparing the present sample (100) and the previous sample (1,500), we can view the relative relationships and shifts in the data • In comparing two-dimensional tables, the selection of either the row or the column percentage will accentuate a particular distribution or comparison (note that in our last tables both column and row percentages were presented)

  13. Presenting Percentages • When one variable is hypothesized to be the cause, that is, thought to affect or predict the response, label it the independent variable, and compute percentages in the direction of this variable • In which direction should the last examples, gender by overseas assignment, run?

  14. Independent Variable • (Row) The implication is that gender influences selection for overseas assignments • If you said column, you are implying that assignment status has some effect on gender, which is implausible! • Note that you can do the calculations, but they may not make sense!

  15. Other Guidelines for Percentages • Averaging percentages: percentages cannot be averaged unless each is weighted by the size of the group from which it is derived (a weighted average) • Use of too large percentages: a very large percentage is difficult to understand; rather than reporting a 1,000 percent increase, state it as a tenfold increase
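The weighted-average rule can be shown with hypothetical group sizes: averaging two percentages directly ignores how many people stand behind each one.

```python
# Hypothetical example: two offices with very different headcounts
groups = [
    {"size": 100, "pct_yes": 60.0},  # small office: 60% said yes
    {"size": 900, "pct_yes": 20.0},  # large office: 20% said yes
]

# Naive (wrong): a simple average of the two percentages
naive = sum(g["pct_yes"] for g in groups) / len(groups)

# Correct: weight each percentage by the size of its group
weighted = (sum(g["size"] * g["pct_yes"] for g in groups)
            / sum(g["size"] for g in groups))

print(naive)     # 40.0
print(weighted)  # 24.0  (240 "yes" answers out of 1000 people)
```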

  16. Other Guidelines for Percentages • Using too small a base: percentages hide the base from which they have been computed, so a percentage built on a small base can be misleading • A percentage decrease can never exceed 100 percent; the higher figure should always be used as the base

  17. Other Table-Based Analysis • Recognition of a meaningful relationship between variables generally signals a need for further investigation. • Even if one finds a statistically significant relationship, the questions of why and under what conditions remain. • Normally introduce a control variable • Statistical packages can handle complex tables

  18. Control and Nested Variables

  Control Variable
      Category 1                      Category 2
      Nested Variable                 Nested Variable
      Cat 1   Cat 2   Cat 3           Cat 1   Cat 2   Cat 3
      (cells, with labels)

  19. Data Mining • Describes the concept of discovering knowledge in databases • The idea behind it is the process of identifying valid, novel, useful, and ultimately understandable patterns in data • Provides two unique capabilities to the researcher: • pattern discovery • predicting trends and behavior

  20. Data-Mining Process

  Investigative Question
    → Sampling (yes/no)
    → Data Visualization
    → Modify: clustering, factor, correspondence analysis; variable selection, creation; data transformation
    → Model: neural networks; tree-based models; classification models; other statistical models
    → Model Assessment

  21. Sampling Yes/No • Use the entire data set or a sample of it • If fast turnaround is more important than absolute accuracy, sampling may be appropriate • Sample if the data set is very large, on the order of terabytes

  22. Modify • Based on discoveries, data may require modification • Clustering, factor, correspondence analysis • Variable selection, creation • Data transformation

  23. Factor Analysis • A general term for several specific computational techniques • All have the objective of reducing many variables that belong together and have overlapping measurement characteristics to a manageable number

  24. Factor Analysis Method • Begins with the construction of a new set of variables based on the relationships in the correlation matrix • Can be done in a variety of ways • The most popular is principal components analysis

  25. Principal Components Analysis • Transforms a set of variables into a new set that are uncorrelated with each other • These linear combinations of variables, called factors, account for the variance in the data as a whole • Each factor is the best linear combination of the variance not accounted for by the previous factors

  26. Principal Components Analysis • The process continues until all the variance is accounted for

  Extracted component    % of variance accounted for    Cumulative variance
  Component 1            63%                            63%
  Component 2            29                             92
  Component 3            8                              100
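The mechanics can be sketched with NumPy on illustrative random data (the variance shares will differ from the 63/29/8 example above): standardize the variables, eigendecompose the correlation matrix, and order the components by the variance they account for.

```python
import numpy as np

rng = np.random.default_rng(42)

# Three variables; the first two overlap heavily (illustrative data)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    x + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])

# Standardize, then eigendecompose the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]               # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each component's share of total variance; the shares sum to 100%
variance_pct = 100 * eigvals / eigvals.sum()
print(variance_pct.round(1), variance_pct.cumsum().round(1))

# The component scores are mutually uncorrelated, as the slide states
scores = z @ eigvecs
assert np.all(np.abs(np.corrcoef(scores, rowvar=False) - np.eye(3)) < 1e-8)
```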

  27. Cluster Analysis • Unlike the techniques for analyzing relationships between variables, this is a set of techniques for grouping similar objects • Clustering starts with an undifferentiated group • Different from discriminant analysis, where one searches for a set of variables to separate known groups

  28. Cluster Analysis Method • Select the sample (employees, buyers) • Define the variables on which to measure the objects • Compute similarities among entities through correlation, Euclidean distances, and other techniques • Select mutually exclusive clusters (maximizing within-cluster similarity and between-cluster differences) • Compare and validate the clusters
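Steps two through four can be sketched as a tiny k-means loop, one common way to form mutually exclusive clusters from Euclidean distances (pure Python, hypothetical two-dimensional scores):

```python
import math

def k_means(points, centers, iterations=10):
    """Assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = coordinate-wise mean of its cluster (keep old if empty)
        centers = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else ctr
            for cluster, ctr in zip(clusters, centers)
        ]
    return centers, clusters

# Two visually obvious groups (hypothetical scores for six employees)
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7.5)]
centers, clusters = k_means(points, centers=[points[0], points[3]])
print(clusters)  # first three points in one cluster, last three in the other
```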

  29. Clustering Different methods produce different solutions • Cluster analysis methods are not clearly established. There are many options one may select when doing a cluster analysis using a statistical package. Cluster analysis is thus open to the criticism that a statistician may mine the data trying different methods of computing the proximities matrix and linking groups until he or she "discovers" the structure that he or she originally believed was contained in the data. One wonders why anyone would bother to do a cluster analysis for such a purpose.

  30. A Very Simple Cluster Analysis • In cases of one or two measures, a visual inspection of the data using a frequency polygon or scatterplot often provides a clear picture of grouping possibilities. For example, "Example Assignment" is data from a cluster analysis homework assignment.

  31. It is fairly clear from this picture that two subgroups, the first including Julie, John, and Ryan and the second including everyone else except Dave, describe the data fairly well. • When faced with complex multivariate data, such visualization procedures are not available, and computer programs assist in assigning objects to groups.

  32. Dendrogram The clusters and their relative distances are displayed in a diagram called a dendrogram
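If SciPy is available, the merge history behind a dendrogram can be computed directly; a minimal sketch on one-dimensional toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy data: two obvious groups on a number line
X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Single-linkage agglomeration; Z records each merge and its distance
Z = linkage(X, method="single")
print(Z)

# Cutting the tree into two clusters recovers the obvious grouping
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]

# dendrogram(Z) draws the diagram itself when matplotlib is available
```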

  33. The following HTML page describes the logic involved in cluster analysis algorithms. http://www.cs.bsu.edu/homepages/dmz/cs689/ppt/entire_cluster_example.html

  34. Correspondence Analysis • A descriptive/exploratory technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns • Provides information similar in nature to that produced by factor analysis techniques • Allows one to explore the structure of categorical variables included in the table • The most common table of this type is the two-way frequency crosstabulation table • See http://www.statsoft.com/textbook/stcoran.html

  35. Variable Selection, Creation • If important constructs are discovered, new factors may be introduced to categorize the data • Some variables may be dropped

  36. WinSTAT (http://www.winstat.com/) Welcome! (text from their home page) WinSTAT is the statistics Add-In for Microsoft Excel, and this is the place to find out all about it. Tired of your hard-to-use, need-to-be-a-fulltime-expert statistics package? Find out why WinSTAT is the program for you. Wondering if WinSTAT covers the functions and graphics you need? Let the function reference page surprise you, complete with sample outputs of tables and graphics for all functions. Still not convinced? There's no way to be sure until you've tried WinSTAT for yourself. We've got the demo download right here. DMZ note: WinSTAT also does clustering, factor analysis, and the usual EDA techniques

  37. Model • If a complex predictive model is needed, the researcher moves to the next step of the process, building a model • Modeling techniques include neural networks, decision trees, sequence-based methods, classification, and estimation

  38. Neural Networks • Also called artificial neural networks (ANN) • Collections of simple processing nodes that are connected • Each node operates only on its local data and on the inputs it receives through connections • The result is a nonlinear predictive model that resembles biological neural networks and learns through training

  39. Neural Networks • The neural model has to train its network on a training data set.
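Training can be illustrated with the simplest possible network, a single node (perceptron) learning the logical AND function from a small training set; this is a sketch of the idea, not a production ANN:

```python
def step(x):
    return 1 if x >= 0 else 0  # threshold activation

def train(samples, epochs=25, lr=0.1):
    """Perceptron learning: nudge the weights toward each training target."""
    w1 = w2 = bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = step(w1 * x1 + w2 * x2 + bias)
            error = target - output          # zero when the node is right
            w1 += lr * error * x1
            w2 += lr * error * x2
            bias += lr * error
    return w1, w2, bias

# Training data set: the AND truth table
training_set = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, bias = train(training_set)

for (x1, x2), target in training_set:
    print((x1, x2), step(w1 * x1 + w2 * x2 + bias))  # matches each target
```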

  40. Tree Models • Segregates data by using a hierarchy of if-then statements based on the values of variables and creates a tree-shaped structure that represents the segregation decisions.
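Such an if-then hierarchy can be as small as a few nested conditions; the variables and thresholds below are made up for illustration:

```python
def classify(income, years_employed):
    """A tree model is a hierarchy of if-then rules (thresholds hypothetical)."""
    if income < 30_000:
        return "decline"
    else:
        if years_employed < 2:
            return "review"
        return "approve"

print(classify(25_000, 5))   # decline
print(classify(50_000, 1))   # review
print(classify(50_000, 10))  # approve
```

Each path from the root condition to a return value is one segregation decision, which is why the structure draws as a tree.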

  41. Classification: Sky Survey Cataloging • Goal: predict the class (star or galaxy) of sky objects, especially faint ones, based on telescopic survey images (from the Palomar Observatory) • 3,000 images with 23,040 x 23,040 pixels per image • Approach: • Segment the image • Measure image attributes (features), 40 of them per object • Model the class based on these features • Success story: found 16 new high-redshift quasars, some of the farthest objects, which are difficult to find

  42. Estimation • A variation of classification • Instead of a simple "yes" or "no" outcome, generates a score

  43. Other Mining Techniques • Association – finds patterns across transactions, such as bundling of services • Sequence-based analysis – takes into account not only the combination of items but also their order; in health care it can be used to predict the course of a disease and to order preventive care • Fuzzy logic – an extension of Boolean logic in which truth values can fall between completely true and completely false • Fractal-based transformation – works on gigabytes of data, offering the possibility of identifying tiny subsets of data that share common characteristics
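The fuzzy-logic bullet can be made concrete with the standard Zadeh operators, where truth values range over [0, 1]:

```python
def fuzzy_and(a, b):
    return min(a, b)   # standard (Zadeh) fuzzy conjunction

def fuzzy_or(a, b):
    return max(a, b)   # fuzzy disjunction

def fuzzy_not(a):
    return 1 - a       # fuzzy negation

# Truth values between completely false (0) and completely true (1)
hot = 0.7    # "the day is hot" is mostly true
humid = 0.4  # "the day is humid" is somewhat true

print(fuzzy_and(hot, humid))       # 0.4
print(fuzzy_or(hot, humid))        # 0.7
print(round(fuzzy_not(hot), 1))    # 0.3
```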
