The deviance problem in effort estimation … and defect prediction … and software engineering
tim@menzies.us
PROMISE-2
Software effort estimation
• Jorgensen: most effort estimation is “expert-based”, so “model-based” estimation is a waste of time
• model- vs expert-based studies: 5 better, 5 even, 5 worse
Software defect prediction
• Shepperd & Ince: static code measures are uninformative for software quality
• dumb LOC vs McCabe studies: 6 better, 6 even, 6 worse
I smell a rat
• Variance: confusing prior results?
• Selecting Best Practices for Effort Estimation - Menzies, Chen, Hihn, Lum, IEEE TSE 200X
• Data Mining Static Code Attributes to Learn Defect Predictors - Menzies, Greenwald, Frank, IEEE TSE 200X
What you never want to hear… • “This isn't right. This isn't even wrong.” • Wolfgang Pauli
Standard disclaimer
• An excessive focus on empiricism …
• … stunts the development of novel, pre-experimental speculations
• But currently:
  • there is no danger of an excess of empiricism in SE
  • SE = a field flooded by pre-experimental speculations
Sample experiments (effort estimation, defect prediction)
• Public domain data
• Don’t test using your training data: N-way cross-val
• M * randomize order
• Straw man
• Feature subset selection
• Thou shalt script: you will run it again
• Study mean and variance over M * N
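Read as pseudocode, that recipe is a short loop. Below is a minimal Python sketch, assuming caller-supplied learner and score functions (placeholders for illustration, not anything from the original scripts):

import random
import statistics

def m_by_n_crossval(data, learner, score, m=10, n=10):
    # M * randomize order; N-way cross-val; study mean and variance over M * N
    scores = []
    for _ in range(m):
        random.shuffle(data)                          # randomize order
        bins = [data[i::n] for i in range(n)]         # N-way split
        for i in range(n):
            test = bins[i]
            train = [row for j, b in enumerate(bins) if j != i for row in b]
            model = learner(train)                    # never test on the training data
            scores.append(score(model, test))
    return statistics.mean(scores), statistics.stdev(scores)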
Massive FSS
• Singletons, including LOC, are not enough
Data summation: K.I.S.S.
• Combine PD/PF
• Compute & sort combined performance deltas, method A vs all others
• Summarize as quartiles
• 400,000 runs
• Nb = Naive Bayes
• J48 = entropy-based decision tree learner
• oneR = straw man
• logNums = log the numerics
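The data-summation step can also be sketched in a few lines. This assumes a dictionary of per-run combined PD/PF scores per method, which is one reading of the recipe above, not the original 400,000-run harness:

import statistics

def delta_quartiles(scores_by_method):
    # scores_by_method: {method name: [combined PD/PF score, one per run]}
    out = {}
    for a, a_scores in scores_by_method.items():
        deltas = sorted(sa - sb                       # performance deltas, method A vs all others
                        for b, b_scores in scores_by_method.items() if b != a
                        for sa, sb in zip(a_scores, b_scores))
        out[a] = statistics.quantiles(deltas, n=4)    # summarize as quartiles (25th, 50th, 75th)
    return out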
Sources of variance
Software effort estimation (target class: continuous):
  30 * { shuffle data;
         test = data[1..10];
         train = data - test;
         <a,b> = LC(train);
         MRE = Estimate(a, b, test) }
• Large deviations confuse comparisons of competing methods
Software defect prediction (target class: discrete):
  10 * { randomly select 90% of data;
         score each attribute via “INFOGAIN” }
• Numerous candidates for “most informative” attributes
• Can be reduced by FSS
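For readers who prefer runnable code to shorthand, here is a hedged Python sketch of the effort-estimation loop above. LC is read as a COCOMO-style log-log calibration of <a,b> and MRE as the magnitude of relative error; both readings, and the "size"/"effort" row fields, are assumptions for illustration rather than the original scripts.

import math
import random
import statistics

def LC(train):
    # calibrate effort = a * size^b by least squares in log space
    xs = [math.log(p["size"]) for p in train]
    ys = [math.log(p["effort"]) for p in train]
    n = len(train)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

def Estimate(a, b, test):
    # MRE = magnitude of relative error on each held-out project
    return [abs(p["effort"] - a * p["size"] ** b) / p["effort"] for p in test]

def variance_study(data, repeats=30, holdout=10):
    mres = []
    for _ in range(repeats):
        random.shuffle(data)
        test, train = data[:holdout], data[holdout:]
        a, b = LC(train)
        mres.extend(Estimate(a, b, test))
    return statistics.mean(mres), statistics.stdev(mres)   # the large deviations in question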
What is Feature Subset Selection?
• “wiggle” in x, y, z causes “wiggle” in “a”:  a = 10.1 + 0.3x + 0.9y - 1.2z
• Removing x, y, z reduces “wiggle” in “a”, but can damage mean performance
• INFOGAIN: fastest; useful for defect detection, e.g. 10,000 modules in defect logs
• WRAPPER: slowest; performs best; practical for effort estimation, e.g. dozens of past projects in company databases
• PCA: worse (empirically)
(figure: results “turned blue to green”)
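As a concrete reference point, a minimal INFOGAIN-style attribute ranker looks like the sketch below; the discrete-valued attributes and the "defective" class field are assumptions made for illustration.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, klass="defective"):
    # gain = entropy of the class minus expected entropy after splitting on attr
    labels = [r[klass] for r in rows]
    split = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[klass] for r in rows if r[attr] == value]
        split += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - split

def rank_attributes(rows, attrs, klass="defective"):
    # most informative attributes first
    return sorted(attrs, key=lambda a: info_gain(rows, a, klass), reverse=True)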
Warning: no single “best” theory
(figures: effort estimation, defect prediction)
Committee-based learning
• Ensemble-based learning: bagging, boosting, stacking, etc.
• Conclusions by voting across a committee
• 10 identical experts are a waste of money
• 10 slightly different experts can offer different insights into a problem
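A bagging committee is only a few lines. In this sketch the learner is a caller-supplied function that returns a callable model; that interface is an assumption, not a specific library API.

import random
from collections import Counter

def bagging(train, learner, committee_size=10):
    committee = []
    for _ in range(committee_size):
        sample = [random.choice(train) for _ in train]   # bootstrap: sample with replacement
        committee.append(learner(sample))                # ten slightly different experts
    return committee

def vote(committee, example):
    predictions = [model(example) for model in committee]
    return Counter(predictions).most_common(1)[0][0]     # conclude by majority vote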
Using committees
• Classification ensembles: “majority vote does as good as anything else” (Tom Dietterich)
• Numeric prediction ensembles can use other measures: “heuristic rejection rules”
• Theorists: “gasp, horror”; seasoned cost-estimation practitioners: “of course”
• Standard statistics failing: t-tests report that none of these are “worse”
For any pair of treatments:
• if one is “worse”, vote it off
• repeat till none are “worse”
• what remains: the survivors
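A hedged sketch of that vote-off loop; the “worse” test is left as a caller-supplied predicate (a rank test, a rejection heuristic), since the slide does not fix one.

def vote_off(treatments, results, worse):
    # treatments: names; results[t]: scores for t; worse(a, b): True if a is worse than b
    alive = set(treatments)
    changed = True
    while changed:                       # repeat till none are "worse"
        changed = False
        for a in list(alive):
            if any(worse(results[a], results[b]) for b in alive if b != a):
                alive.discard(a)         # vote it off
                changed = True
    return alive                         # the survivors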
So…
• Those M*N-way cross-vals: time to use them
• New research area: automatic model selection methods are now required
• Data fusion in biometrics: the technical problem is not the challenge; the issues are explanation and expectation
Why so many unverified ideas in software engineering? • Humans use language to mark territory • Repeated effect: linguistic drift • Villages, separated by just a few miles, evolve different dialects • Language choice = who you talk to, what tools you buy • US vs THEM: SUN built JAVA as a weapon against Microsoft • Result: never-ending stream of new language systems • Vendors want to sell new tools, not assess them.
Yes Timmy, senior forums endorse empirical rigor… but the tide is turning:
• Text mining of NFRs, traceability:
  • The Detection and Classification of Non-Functional Requirements. Cleland-Huang, Settimi, Zou, Solc. IEEE RE’06 (Minneapolis, 2006)
  • Advancing Candidate Link Generation for Requirements Tracing. Hayes, Dekhtyar, Sundaram. IEEE TSE, Jan 2006, pp. 4-19
• Software effort estimation:
  • Selecting Best Practices for Effort Estimation. Menzies, Chen, Hihn, Lum. IEEE TSE 200?
• Software defect prediction:
  • Data Mining Static Code Attributes to Learn Defect Predictors. Menzies, Greenwald, Frank. IEEE TSE 200?
(“Best paper” callout on the slide)