COSC 6335 Fall 2013
Post Analysis Project1
Christoph F. Eick
Post Analysis Project1

Disclaimer
• The main purpose of these slides is not to criticize groups but rather to learn how to do a better job when analyzing data and interpreting data mining results.
• Most of you do not have much experience in these tasks.
• Learning without making errors is impossible; therefore, students can benefit from discussing the errors of other students.

Visualization
• Use large, high-resolution displays; some students used displays that did not reveal much because their density was too high.
• The quality of the visualization affects what you are able to see.
• If you compare displays, put them next to each other!
• Use the same coordinate system and scale in displays you compare.
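One way to follow the last point is to compute the axis range once, over all the data being compared, and reuse it for every display. A minimal Python sketch (the project itself used R; the function name `shared_limits`, the `pad` parameter and the sample values are hypothetical and only serve as illustration):

```python
def shared_limits(*series, pad=0.05):
    """Return one (low, high) range covering every series, with a small margin."""
    low = min(min(s) for s in series)
    high = max(max(s) for s in series)
    margin = (high - low) * pad
    return low - margin, high + margin

# Hypothetical values of one attribute for two groups shown in separate plots.
group_a = [21.5, 24.0, 30.2, 33.1]
group_b = [26.4, 35.0, 41.8, 47.9]

limits = shared_limits(group_a, group_b)
print(limits)  # use this same range for both plots
```

The resulting pair can then be handed to whatever axis-limit setting the plotting tool offers, for example the `xlim`/`ylim` arguments of R's `plot()`.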
Post Analysis Project1 Part2

Interpretation
• Scatterplot: the key question is whether the attribute or pair of attributes provides some evidence for the dominance of a particular class in a particular region of the attribute space, not whether the attribute pair clearly separates the classes.
• Vague interpretation of quantitative results; e.g. “Att1 seems to be more important than Att2” versus “the fact that the regression coefficient of Att1 is 12 times as large as the regression coefficient of Att2 suggests that attribute Att1 has a much stronger impact on class membership”.
• Overlooking patterns in displays; e.g. regions that are dominated by one class, or only looking for patterns in the E/W direction when there are also clear patterns in the N/S direction.
• Not giving summaries at all or giving very “quick” summaries.
Regression Results

Mean Value

No Scaling:
R2: Multiple R-squared: 0.286, Adjusted R-squared: 0.282
Coefficients:
(Intercept)         V2         V3         V6         V7
 -0.9930791  0.0066490  0.0006933  0.0126270  0.1399540

With Scaling:

The fact that the R2 is 0.28 suggests that the results are suggestive but do not indicate a strong finding about the importance of the attributes.
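Why unscaled coefficients are hard to compare can be illustrated with a small sketch. It is written in Python rather than the R used in the project, and everything in it is an assumption made for illustration: the data are synthetic (not the project's data), and `fit_two_predictors` and `zscore` are hypothetical helper names. It fits a least-squares model with two predictors on very different scales, once on the raw values and once after z-scoring the predictors.

```python
from statistics import mean, pstdev

def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y ~ b0 + b1*x1 + b2*x2 via the 2x2 normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Synthetic data: x1 ranges over hundreds, x2 stays close to 1.
x1 = [85, 90, 100, 110, 120, 135, 150, 165, 180, 195]
x2 = [0.2, 0.9, 0.4, 1.1, 0.3, 0.8, 0.5, 1.2, 0.6, 1.0]
y = [0.01 * a + 0.3 * b for a, b in zip(x1, x2)]

raw = fit_two_predictors(x1, x2, y)
scaled = fit_two_predictors(zscore(x1), zscore(x2), y)
print("raw coefficients:   ", raw[1], raw[2])
print("scaled coefficients:", scaled[1], scaled[2])
```

On this synthetic data the raw coefficient of `x2` is 30 times that of `x1`, yet after scaling `x1` has the larger coefficient. A raw coefficient reflects the unit of its attribute, whereas a coefficient on z-scored data describes the effect of a change of one standard deviation, which is why only the scaled coefficients can be compared directly.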
Box Plots Thanks to Group 10!
Post Analysis Project1 Part3

Statistical Summaries
• If there were minor disagreements, I took away 1 point.
• If the results did not make any sense, I took away a lot of points (this only happened once).
• If it was not clear how the results were generated (no R code, incomplete R code, or lack of explanation), I also took away points.

Other
• You were also supposed to interpret the histograms, but the project specification failed to ask you to do that! Discuss another example in Review2.

Importance of Attributes
• GC is definitely very helpful for diagnosing diabetes (scatter plot, regression); e.g. if it is quite low, it is very unlikely that the person has diabetes (useful for a diabetes test).
• BMI (boxplot, scatterplot, regression coefficients) and, to a lesser extent, Pedigree have some usefulness in diagnosing diabetes.
• No group has presented any evidence that DBP is useful in diagnosing diabetes, although it has a weak positive correlation of 0.28 with BMI.
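A correlation such as the 0.28 quoted above can be computed and put into perspective with a few lines. The sketch below is in Python rather than the R used in the project; the function name `pearson` and the sample readings are hypothetical and are not the project's data. Squaring the coefficient gives the share of variance a simple linear fit would explain, which for 0.28 is only about 0.08.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical readings, used only to show the call.
dbp = [62, 70, 74, 80, 66, 90, 72, 84]
bmi = [31.0, 24.5, 33.2, 28.1, 26.0, 35.4, 22.9, 30.3]

r = pearson(dbp, bmi)
explained = 0.28 ** 2  # share of variance for the correlation quoted on the slide
print(round(r, 2), round(explained, 3))
```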
Post Analysis Project1 Part4

Linear Regression
• If you do not scale the data, interpretation of the observed coefficients is quite complicated (see previous slide).
• Lack of quantitative assessment of results.

Star Plots
• What is, in your opinion, the usefulness of this technique?
• I myself have difficulties making sense of those, but some of you do seem to like Star Plots much more...

Conclusion/Other Findings
• Half of the groups gave quite short conclusions, and most summaries are somewhat vague; e.g. they do not write about
  • The importance/usefulness of the attributes
  • The usefulness of the employed techniques
  • Knowledge about diabetes generated in Project1
  • …

Project Weights Fall 2013
Project2>Project3??>Project4 Project1