
Perspectives on the Interestingness of Data Patterns

Presentation Transcript


  1. V. E. Middleton Enterprises, LLC. Perspectives on the Interestingness of Data Patterns. Victor E. Middleton, 16 April 2009

  2. Briefing Outline • Data Mining & Data Farming Background • My Perspective • Some Miscellaneous Musings on Statistical Tools • Some Research Results • Data Mining: You and Dr. Liu

  3. Data Farming & Data Mining • Data Farming - generation of new data from execution of simulations • Data Mining - collecting/processing the information “nuggets” from these simulation data Note: The World is drowning in data. We need to provide Information!

  4. Brief History of Data Mining and Data Farming. Adapted from: An Introduction to Data Mining http://www.thearling.com/text/dmwhite/dmwhite.htm

  5. Early Data Mining John Snow and the Broad Street Cholera Outbreak of 1854 http://www.winwaed.com/sci/cholera/john_snow.shtml

  6. The Age of the Petabyte The End of Theory: The Data Deluge Makes the Scientific Method Obsolete* Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn't just more. More is different. * http://www.wired.com/science/discoveries/magazine/16-07/pb_intro#ixzz0kW6HU5JP

  7. What’s a Petabyte http://www.wired.com/science/discoveries/magazine/16-07/pb_intro#ixzz0kW6HU5JP

  8. My Perspective: Data Roles in Model Development & Application • Phenomenology Characterization • Model Component Definition • Model Calibration • Model Validation • Scenario Generation and Exploration

  9. Turning Data into Information • Analysis • Assessment • Model Development • Application

  10. The Path from Data to Information • Collect the Data - Experiment & Field Trials • Summarize the Data - Descriptive Statistics • Interpret the Data - Exploratory Data Analysis • Reach Conclusions about the Data - Inferential Statistics • Extend the Data - Modeling and Simulation • Harvest the Data - Data Farming & Data Mining

  11. Tools: Statistical Analysis • Descriptive Statistics • Measures of central tendency • Measures of variation • Inferential Statistics • Point estimates and confidence intervals • ANOVA • Maximum likelihood • Hypothesis testing (confirmatory data analysis and statistical significance) • Time series & trend analysis • Regression and prediction • Exploratory Data Analysis • Blending of descriptive and inferential statistics for qualitative characterization of data • Emphasizes visualization of the data, e.g., scatter plots and histograms
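The snippet below is a minimal sketch of a few items from this toolbox using NumPy/SciPy; the measurement values and the nominal comparison value of 12.0 are made up for illustration.

```python
# Hypothetical measurement values, made up for illustration.
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.4, 12.0, 11.6, 12.3, 12.2, 11.9])

# Descriptive statistics: central tendency and variation
print("mean:", x.mean(), "median:", np.median(x), "std:", x.std(ddof=1))

# Inferential statistics: 95% confidence interval for the mean
ci = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print("95% CI for the mean:", ci)

# Hypothesis test: is the mean different from a nominal value of 12.0?
t_stat, p_value = stats.ttest_1samp(x, popmean=12.0)
print("t =", t_stat, "p =", p_value)
```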

  12. Why Qualitative Analysis? Anscombe's four data sets share nearly identical summary statistics, yet their scatter plots are entirely different:
  • Data Set 1: N = 11, mean of X = 9.0, mean of Y = 7.5, intercept = 3, slope = 0.5, residual standard deviation = 1.237, correlation = 0.816
  • Data Set 2: N = 11, mean of X = 9.0, mean of Y = 7.5, intercept = 3, slope = 0.5, residual standard deviation = 1.237, correlation = 0.816
  • Data Set 3: N = 11, mean of X = 9.0, mean of Y = 7.5, intercept = 3, slope = 0.5, residual standard deviation = 1.236, correlation = 0.816
  • Data Set 4: N = 11, mean of X = 9.0, mean of Y = 7.5, intercept = 3, slope = 0.5, residual standard deviation = 1.236, correlation = 0.817
  Source: F. J. Anscombe, "Graphs in Statistical Analysis", The American Statistician, Feb 1973, pp. 17-21.
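As a quick illustration of Anscombe's point, here is a minimal sketch using seaborn's bundled copy of the quartet (assuming seaborn can load its example datasets): the per-set summary statistics match, but the plots do not.

```python
# Anscombe's quartet via seaborn's bundled example data.
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y

# Summary statistics by data set -- all four look essentially the same
for name, group in df.groupby("dataset"):
    print(name,
          "mean x = %.2f" % group["x"].mean(),
          "mean y = %.2f" % group["y"].mean(),
          "corr = %.3f" % group["x"].corr(group["y"]))

# The scatter plots, by contrast, are strikingly different
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
```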

  13. Graphical Methods

  14. Data “Goodness”: Error, Applicability, and Usefulness. Picture source: Taylor, An Introduction to Error Analysis, 1982

  15. Does Data Mining Have a Role in Quality Assurance? • Did we get the data we need? • Were our experimental designs adequate? • Were our experimental protocols good? • Can we rely on the data we got? • What do the data tell us about the phenomena we want to model? • What is the range of applicability of empirical models? • Can we use the data to support first-principles models? What is the validity/range of applicability of those models? • What do our derivative models provide us that the raw data don’t?

  16. Data “Goodness” • Experimental variation & error • Representational robustness - how well does it capture the phenomena of interest • Data density, scale, and resolution • Applicability/validity • “Delta benefit” – do these data confirm/improve previous work or tell us something we didn’t know before Bottom Line: Confidence in the data available to support operational and programmatic decision-makers

  17. Terminology • Accuracy - uncertainty in the measurements made with a given instrument; the degree to which information differs from the true value • Precision - degree of stability in measurement; the expected scatter or spread of repeated measurements of a fixed quantity • Bias - difference between the measurement mean and the true value or accepted standard

  18. Accuracy vs. Precision (four illustrated cases): accurate and precise; accurate but imprecise; biased (precise but inaccurate); inaccurate and imprecise
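A minimal numeric sketch of the bias-versus-spread distinction, using simulated repeated measurements of a known true value; the instruments, offsets, and spreads are all invented for illustration.

```python
# Simulated repeated measurements of a known true value (made-up numbers).
import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0

# Instrument A: small spread, offset from truth -> precise but biased
# Instrument B: centered on truth, large spread -> accurate but imprecise
a = rng.normal(loc=103.0, scale=0.5, size=50)
b = rng.normal(loc=100.0, scale=3.0, size=50)

for label, m in [("A", a), ("B", b)]:
    bias = m.mean() - true_value   # accuracy: offset from the true value
    spread = m.std(ddof=1)         # precision: scatter of repeated measurements
    print(f"Instrument {label}: bias = {bias:+.2f}, spread (sd) = {spread:.2f}")
```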

  19. More Terminology • Resolution - smallest change or increment in the measured quantity that can be detected by the measuring instrument • Robustness - degree to which a measurement instrument can be applied outside its development domain, and its relative stability/brittleness outside of domain boundaries

  20. Resolution, Robustness, and Applicability • Experimental/Field Data • Limited, in both factor levels and sample sizes • More suited to interpolation than extrapolation - inference to new conditions may be difficult • Model Results • Validity derives from both underlying science & supporting data • Model inputs may be based on incomplete, imprecise, or uncertain data • Usefulness is a function of confidence and accessibility

  21. Assessment Levels: Multiple Tiers of Assessment • Levels: Data Collection (“Eaches”), Interface/Aggregation, Model Development, Model Verification & Validation, Model Output & Application • Assessment Criteria: Adequacy of Experimental Design; Data Quality (Quantity/Sufficiency, Applicability, Validity, Robustness); Degree of Fit; Agreement; Delta Contributions

  22. A Final Caution: “How good is good enough?” Accuracy?

  23. Integration of Simulation and Data Mining • Goals: • (i) Developing a basic understanding of a particular model or system: seeking insights into high-dimensional space; identifying significant factors and interactions; finding regions, ranges, and thresholds where interesting things happen • (ii) Finding robust decisions, tactics, or strategies • (iii) Comparing the merits of various decisions or policies (Kleijnen, Sanchez, Lucas & Cioppa 2005). “Models are for thinking” - Sir Maurice Kendall

  24. A simple example. Without examining multiple factors simultaneously, we: • Limit the insights possible (can’t look for “interactions” - places where interesting things happen for specific combinations of factors), so… • …we only tell part of the story • Less chance for surprises. Ex: which is more important, stealth or range? Ex: suppose your factors include “fuel”, “air”, and “spark” - you’ll NEVER find “fire” by examining only two at a time; excursions from the base case wouldn’t show anything
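A toy sketch of the fuel/air/spark point in Python: one-factor-at-a-time excursions from the base case never produce "fire", while a full factorial design exposes the three-way interaction. The 0/1 factor encoding is purely illustrative.

```python
# Toy 0/1 factor levels; "fire" is a pure three-way interaction.
from itertools import product

def fire(fuel, air, spark):
    # Fire only when all three are present -- no main effects to find.
    return int(fuel and air and spark)

# One-factor-at-a-time excursions from the base case (0, 0, 0):
base = (0, 0, 0)
for i in range(3):
    run = list(base)
    run[i] = 1
    print("excursion", tuple(run), "-> fire =", fire(*run))  # always 0

# Full factorial design: the interesting corner finally shows up.
for levels in product([0, 1], repeat=3):
    print(levels, "-> fire =", fire(*levels))
```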

  25. Data Mining Through Visualization [Slide graphics: effect tests and Student’s t means-comparison plots] • Standard statistical graphics tools (regression trees, 3-D scatter plots, contour plots, plots of average results for single factors, interaction profiles) can be used to gain insights from the data • Step-wise regression and regression trees identify important factors, interactions, and thresholds
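As a rough illustration (not the study's actual analysis), the sketch below fits a regression tree to synthetic simulation output to surface an important factor, a threshold, and an interaction; the factor names, the response, and the UAV threshold are invented, loosely echoing the next slide.

```python
# Synthetic "simulation output"; factor names, response, and the uavs >= 2
# threshold are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
n = 500
uavs = rng.integers(0, 5, n)          # hypothetical factor
training = rng.uniform(0, 1, n)       # hypothetical factor
comms = rng.uniform(0, 1, n)          # hypothetical factor
casualties = 20 - 6 * (uavs >= 2) - 4 * training * comms + rng.normal(0, 1, n)

X = np.column_stack([uavs, training, comms])
tree = DecisionTreeRegressor(max_depth=3).fit(X, casualties)

# The printed tree surfaces the threshold; importances rank the factors.
print(export_text(tree, feature_names=["uavs", "training", "comms"]))
print("importances:", dict(zip(["uavs", "training", "comms"],
                               tree.feature_importances_.round(2))))
```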

  26. MOE: Percent Threat A Casualties • Count - total design points • Mean - mean number of Threat A casualties (lower is better) • Significant decrease in Threat A casualties when UAVs >= 2 • No Threat B intelligence produced lower Threat A casualties • Increasing the number of UAVs decreases Threat B casualties • Training and communication impact • Decreased foot speed in urban environment • Internal communications becomes more important with fewer UAVs

  27. Mobility Study Data: Grade vs. Speed Preliminary Assessment: Grade Alone is a Poor Predictor of Speed Data from Mastroianni and Middleton Mobility Study presented at 20th Army Science Conference

  28. However, When Energy Expenditure is Considered Data from Mastroianni and Middleton Mobility Study presented at 20th Army Science Conference

  29. Energy Cost vs. Speed • Energy Expenditure is a Better Predictor of Speed than Grade Data from Mastroianni and Middleton Mobility Study presented at 20th Army Science Conference

  30. Rules Derived From Fuzzy Clusters If Energy cost is Very Low, Speed is Very Fast If Energy cost is Low, Speed is Fast If Energy cost is High, Speed is Slow If Energy cost is Moderate, Speed is March If Energy cost is Very High, Speed is Very Slow Data from Mastroianni and Middleton Mobility Study presented at 20th Army Science Conference
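A minimal sketch of how fuzzy rules like these can be evaluated: hand-drawn triangular membership functions over an energy-cost scale, with the rule consequents combined as a membership-weighted average. This is not the study's clustering code, and every number in it is made up.

```python
# Hand-drawn triangular membership functions over a made-up energy-cost
# scale; each rule's consequent is mapped to an illustrative speed (m/s).
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function peaking at b, zero outside [a, c]."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

rules = [
    ("Very Low",  lambda e: tri(e, 0, 1, 2), 2.0),   # -> Very Fast
    ("Low",       lambda e: tri(e, 1, 2, 3), 1.7),   # -> Fast
    ("Moderate",  lambda e: tri(e, 2, 3, 4), 1.4),   # -> March
    ("High",      lambda e: tri(e, 3, 4, 5), 1.0),   # -> Slow
    ("Very High", lambda e: tri(e, 4, 5, 6), 0.6),   # -> Very Slow
]

def predict_speed(energy_cost):
    # Membership-weighted average of the rule consequents
    weights = np.array([mf(energy_cost) for _, mf, _ in rules])
    speeds = np.array([s for _, _, s in rules])
    return float((weights * speeds).sum() / weights.sum())

print(predict_speed(2.5))  # partway between "Low" and "Moderate"
```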

  31. Speed Prediction Comparisons Correlation of Observed with “Fuzzy” Speeds is 0.91 Data from Mastroianni and Middleton Mobility Study presented at 20th Army Science Conference

  32. Chemical Persistence by Temp, Drop Size & Wind Speed

  33. One Way Analysis of Time to LNZ by Predictor Variable Levels

  34. Time Series Phases: concentration over time

  35. Data and Model Comparison: percent mass remaining over time (linear and log scales)

  36. Approach • View agent behavior curves as drop-size-dependent time series • Decompose series into lag sequences • Characterize sequences according to the rate of change of percent mass remaining • Use Weka to explore the effectiveness of classifier systems [Sample file fragment shown on slide]
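The sketch below illustrates the windowing idea only (not the study's code): slice a synthetic persistence curve into fixed-length lag windows, label each window by its rate of change in percent mass remaining, and fit a decision tree to the windows. scikit-learn's DecisionTreeClassifier stands in here for Weka's J48, and the lag length, rate thresholds, and labels are all assumptions.

```python
# Synthetic persistence curve; lag length, thresholds, and labels are
# illustrative, and DecisionTreeClassifier stands in for Weka's J48.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
t = np.arange(200)
mass = 100 * np.exp(-0.02 * t) + rng.normal(0, 0.5, t.size)  # % mass remaining

LAG = 5
windows, labels = [], []
for i in range(len(mass) - LAG):
    seq = mass[i:i + LAG]
    rate = (seq[-1] - seq[0]) / LAG   # change in percent mass per time step
    label = "fast" if rate < -1.0 else "slow" if rate < -0.2 else "flat"
    windows.append(seq)
    labels.append(label)

clf = DecisionTreeClassifier(max_depth=3).fit(np.array(windows), labels)
print(clf.score(np.array(windows), labels))  # training accuracy only
```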

  37. Weka J48 Tree Classifier: KU Model Training Data

  38. Weka J48 Tree Classifier

  39. Weka J48 Tree Classifier: Data

  40. Conclusions & Future Potential • Method shows promise, but many issues remain: • How to represent real-world concentration data consistent with model data? • How to handle real-world noise? • Is it possible to decompose real-world signals from multiple drop size distributions? • What do we do about cloud drift and dynamic environmental conditions?

  41. Data Mining and You: Looking for Patterns • What is a pattern? • Association rules • Relationship models, e.g., regression equations, response surfaces • Attribute hierarchies • Classification and categorization • Pattern interestingness measures • Simplicity • Certainty or confidence • Utility or support • Novelty • Usefulness • Correlation vs. cause and effect
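Support and confidence, two of the interestingness measures listed above, are simple to compute directly; the sketch below does so for a single candidate association rule over a handful of made-up transactions (the item names and data are purely illustrative).

```python
# Made-up "transactions"; the rule, item names, and counts are illustrative.
transactions = [
    {"uav", "intel", "comms"},
    {"uav", "comms"},
    {"intel", "comms"},
    {"uav", "intel"},
    {"uav", "intel", "comms"},
]

antecedent, consequent = {"uav"}, {"comms"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # how often the whole rule occurs
confidence = both / ante             # how often it holds when the antecedent occurs
print(f"uav -> comms: support = {support:.2f}, confidence = {confidence:.2f}")
```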

  42. Pattern Recognition • Closely linked to machine learning • Supervised pattern recognition • Given a set of K pre-determined classes • Given example measurements - features • Use a pattern recognition machine - a classifier - which reports: • this example is from class Ci • this example is from none of the classes - outliers • this example is too hard - rejects or doubts. This material is taken from B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 2005
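A minimal sketch of a classifier that can report "doubt" as well as a class, in the spirit of Ripley's taxonomy above: it commits to a class only when the predicted probability clears a threshold. The data, model choice, and threshold are illustrative assumptions.

```python
# Illustrative two-class data, model choice, and confidence threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

def classify_with_doubt(x, threshold=0.8):
    probs = clf.predict_proba([x])[0]
    best = int(probs.argmax())
    return f"class C{best}" if probs[best] >= threshold else "doubt / reject"

print(classify_with_doubt([3.0, 3.0]))  # deep inside one class -> confident
print(classify_with_doubt([1.5, 1.5]))  # near the boundary -> doubt
```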

  43. Avoid Post Hoc Ergo Propter Hoc

  44. Examples: The Hemline Index of Stocks (DJIA vs. hemline in inches off the floor)

  45. Analysis Study Review Criteria • Readability • Internal Consistency • Repeatability of Results • Reliability of Results • Support for conclusions • Significance of conclusions • Contribution to the literature • “So what”?

  46. Project Ideas? • Do tornados prefer trailer parks? • Is the “Avis effect” in sports real? • What combination of team statistics most correlates to winning and losing? • Is there a reliable terrorist profile? • What national characteristics best correlate to success in different educational fields? • What can you say about pre-election polls and actual election results? • Can you find evidence of national/regional diet and/or activity patterns that correlate to health issues? • What kinds of advertising are most effective for which products?

  47. Bottom Line Data Mining and Data Farming require getting your hands dirty with the data
