
Perspectives on Data Mining for SE


Presentation Transcript


  1. Perspectives on Data Mining for SE Tim Menzies tim@menzies.us http://menzies.us June 2010

  2. “For all is but a woven web of guesses” (Xenophanes, as quoted by Karl Popper) • If we drill to the bottom of all theories, the floor drops away • Ultimately, all theories rest on assumptions that you do not have the resources to test • E.g., try to repeat all the experiments that led to current atomic theory • Data miners make guesses too • Danger of spurious model creation • Correlation, not causation • Shotgun correlation (a small sketch follows below) • The context variable problem: some extra variable, not in your data set, that controls everything • The instability problem
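A minimal Python sketch of the shotgun-correlation danger mentioned above, assuming numpy and scipy are available; the data, sizes, and the 0.05 threshold are purely illustrative and are not taken from the talk:

    # "Shotgun correlation": test enough random attributes against a target
    # and some will look statistically significant purely by chance.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)
    n_rows, n_attrs = 50, 200                    # few rows, many candidate metrics
    target = rng.normal(size=n_rows)             # e.g. defect counts (pure noise here)
    attrs = rng.normal(size=(n_attrs, n_rows))   # candidate attributes (also noise)

    hits = [i for i in range(n_attrs)
            if pearsonr(attrs[i], target)[1] < 0.05]   # "significant" at p < 0.05
    print(f"{len(hits)} of {n_attrs} random attributes 'correlate' with the target")
    # Expect roughly 5% spurious hits even though no real relationship exists.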

  3. Heh, Nobody’s Perfect • Sure, there are problems with machine learners • But are they worse than humans? • Long history of fallacies in human reasoning • Wikipedia’s list of 110+ biases in human cognition: Bandwagon effect, Base rate fallacy, Bias blind spot, Choice-supportive bias, Confirmation bias, Congruence bias, Contrast effect, Déformation professionnelle, Denomination effect, Distinction bias, Endowment effect, Expectation bias, Extraordinarity bias, Focusing effect, Framing, Hyperbolic discounting, Illusion of control, Impact bias, Information bias, Interloper effect, Irrational escalation, Just-world phenomenon, Loss aversion, Mere exposure effect, Money illusion, Moral credential effect, Need for closure, Negativity bias, Neglect of probability, Normalcy bias, Not Invented Here, Omission bias, Outcome bias, Planning fallacy, Post-purchase rationalization, Pseudo-certainty effect, Reactance, Restraint bias, Selective perception, Semmelweis reflex, Status quo bias, Von Restorff effect, Wishful thinking, Zero-risk bias • http://en.wikipedia.org/wiki/List_of_cognitive_biases • Jorgensen’s study of expert effort estimation for software development (TSE 2009) • Human experts do not review and improve their own estimates

  4. Gorillas in our Midst http://www.youtube.com/watch?v=hwCzasHBXNc • This is a test: how many times does the white team pass the rubber band ball? • In controlled studies, over half the observers did not see a five-foot hairy gorilla walk onto the set • Daniel J. Simons and Christopher F. Chabris, “Gorillas in our midst: sustained inattentional blindness for dynamic events”, Perception, 1999, volume 28, pages 1059–1074 • Lesson: humans can miss the “obvious” • Code inspections: teams inspecting software with 15 bugs • 60 lines of code: a simple text formatter • Unlimited time for review • Each team found five • No combination of teams found all bugs

  5. Harder to use expertise from humans • In a court of law, engineers who do not apply best practices are criminally liable • The project is late, the initial estimates were wrong, and you are in court testifying • Can you convince a judge/jury that you did the best you could with the initial estimates? • If human-based, where is the audit trail? • For this reason, many US government contracts require all estimates to be audited by some external source • If human-based, how do you do (and document) that audit in a way that is reproducible and defensible in N years’ time?

  6. Anyway, you can’t stop the music • “What’s fair got to do with it? It’s going to happen” – Lawrence of Arabia • Web 2.0 is here • Software engineering 2.0 is coming • Massive data collection, e.g. Microsoft instrumenting all activity in Visual Studio • Lots of interest in applying machine learning to SE data • Menzies 2007 (TSE): 125 citations • Not bad for a recent article • So, it’s going to happen • What is the best we can do with it?

  7. A partnership (wholes & holes) • Humans paint the whole picture • Business-level constraints • Identify an initial set of possibly useful variables • State a set of initial hypotheses • Data miners fill in the holes • Find the details of the relationship between X and Y • Cover over, bury, or ignore the irrelevancies • Learners propose, humans dispose • Automatic methods summarize data spaces too large for human reflection • Find the diamonds in the dust • Humans audit the results and suggest the next round of investigations • Requirement: humans can read and understand the learned theory (a readable-model sketch follows below) • Bye-bye neural nets, Naïve Bayes, and all other methods that generate no model or generate verbose output
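The "learners propose, humans dispose" partnership hinges on readable output. Below is a minimal sketch using scikit-learn's decision tree and export_text purely as an example of a model a domain expert can audit; the slides do not prescribe this learner or dataset:

    # A learner whose output a human can read, critique, and audit.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(data.data, data.target)

    # A small, printable rule set -- contrast with a neural net's weight matrix.
    print(export_text(tree, feature_names=list(data.feature_names)))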

  8. A new hope • Key variables: in most models, a few variables set the rest • So instead of learning complex models, just chase the keys • E.g., shown at right are the relative influences of different attribute ranges in one process model toward predicting high defects (a ranking sketch follows below) • Note that most ranges are relatively useless at making that prediction
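A small Python sketch of "chasing the keys": rank discretized attribute ranges by how strongly they select for the target class and keep only the top few. The project data and the frequency-based score are hypothetical illustrations, not the measure behind the chart in the talk:

    # Rank attribute ranges by their pull toward the "defective" class.
    from collections import Counter

    # Hypothetical discretized project data: ({attribute: range, ...}, class)
    rows = [
        ({"coupling": "high", "size": "large", "reuse": "low"},  "defective"),
        ({"coupling": "high", "size": "small", "reuse": "low"},  "defective"),
        ({"coupling": "low",  "size": "large", "reuse": "high"}, "ok"),
        ({"coupling": "low",  "size": "small", "reuse": "high"}, "ok"),
        ({"coupling": "high", "size": "large", "reuse": "high"}, "defective"),
    ]

    in_target, overall = Counter(), Counter()
    for attrs, label in rows:
        for pair in attrs.items():
            overall[pair] += 1
            if label == "defective":
                in_target[pair] += 1

    # Score = (support in target class)^2 / total support; higher = more "key".
    scores = {pair: in_target[pair] ** 2 / overall[pair] for pair in overall}
    for (attr, rng), score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
        print(f"{attr}={rng}: {score:.2f}")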

  9. Btw, constantly (re)building local models is a general strategy • Kolodner’s theory of reconstructive memory • The Yale group: Schank & Riesbeck et al. • Memory, not models • Don’t “think”, remember • Case-based reasoning (a minimal retrieval sketch follows below)
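A minimal case-based-reasoning sketch of "don't think, remember": retrieve the nearest past cases for a new problem and reuse their outcomes, instead of fitting one global model. The cases, features, and distance function below are hypothetical:

    # Case-based reasoning in the small: retrieve similar past cases, reuse them.
    import math

    # Past cases: ((team size, KLOC), observed effort in person-months)
    cases = [
        ((2.0, 10.0), 12.0),
        ((3.5, 40.0), 30.0),
        ((1.0,  5.0),  6.0),
        ((4.0, 55.0), 44.0),
    ]

    def retrieve(new_case, k=2):
        """Return the k past cases closest to new_case (Euclidean distance)."""
        return sorted(cases, key=lambda c: math.dist(c[0], new_case))[:k]

    # Reuse: estimate by averaging the outcomes of the retrieved neighbours.
    neighbours = retrieve((3.0, 35.0))
    estimate = sum(effort for _, effort in neighbours) / len(neighbours)
    print(f"nearest cases: {neighbours}; estimated effort ~ {estimate:.1f}")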
