Information theory concepts in software engineering Richard Torkar richard.torkar@bth.se
A YOUNG INSTITUTE • Founded in 1989 • One of three independent institutes of technology • Three campuses
ME • Richard Torkar • Former officer • PhD in software engineering at BTH (studied at University West and Chalmers) • REFASTEN project • Director for SWELL • Project manager for CONES • Programme manager for POKAL and EMSE • Participating in EASE and RUAG • Prof. Claes Wohlin's research group SERL
Partner with problems… • Millions of test cases constantly running (24/7) • Tests a system containing 25-30 large subsystems • Contractors and divisions all over the world use the same test bed
NORMALISED COMPRESSION DISTANCE [2] • Kolmogorov complexity • Cilibrasi and Vitányi used a compression algorithm [3] for approximating K • Non-negative number 0 ≤ NCD ≤ 1 + ε, where ε depends on how well the compressor C approximates K (see the sketch below)
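In practice NCD is computed from compressed sizes. A minimal Python sketch, assuming zlib as a stand-in for the compressor C (any real-world compressor could be substituted; zlib is just the handiest illustration):

import zlib

def csize(data: bytes) -> int:
    # Compressed length: a computable approximation of Kolmogorov complexity K
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar inputs compress well together, so their NCD is close to 0
print(ncd(b"abcabcabc" * 50, b"abcabcabc" * 50))      # near 0
print(ncd(b"abcabcabc" * 50, bytes(range(256)) * 2))  # closer to 1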
WHAT'S BEEN DONE? • Information distance (ID) • The ID between two binary strings x and y is the length of the shortest program that translates x to y and, conversely, y to x • ID is the universal distance metric • Minimal among computable distance functions • Uncovers all effective similarities
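Formally, following [1] (K denotes Kolmogorov complexity; equalities hold up to additive logarithmic terms):

E(x, y) = \max\{K(x \mid y),\, K(y \mid x)\}, \qquad \mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}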
What to try? H_cog: Ordering tests based on their ∆VAT distance cannot be distinguished from how a human would order the tests based on their 'cognitive similarity'.
DEFINITIONS • A complete VAT trace of a test is a string with all the information about the actual execution of a test for all the variation points in the VAT model. • The Universal Test Distance, denoted ∆VAT, in the VAT model is the information distance between the complete VAT traces of two tests.
UNIVERSAL TEST DISTANCE • Universal Test Distance (UTD): the information distance of the complete VAT traces of n tests (where n ≥ 2) • Should discover any similarities [1] between tests… • But ID is non-computable!
USING NCD AS A TEST DISTANCE • Uncover "meaningful" distances? • Three engineers ordered 25 tests applied to the triangle problem • Coded these tests in Bacon (Ruby) • Saved the execution trace of each test • Calculated an NCD matrix (distance tree), as sketched below
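The matrix step is just pairwise NCD over the saved traces. A sketch reusing the ncd function above (the traces/ directory and file naming are assumptions for illustration, not the original tooling):

from pathlib import Path

# Hypothetical layout: one saved execution trace per test, e.g. traces/SE_3_3_3.txt
traces = {p.stem: p.read_bytes() for p in sorted(Path("traces").glob("*.txt"))}
names = list(traces)

# Symmetric NCD matrix; a hierarchical clusterer turns it into the distance tree
matrix = [[ncd(traces[a], traces[b]) for b in names] for a in names]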
RESULTS • Humans and NCD classified in the same way (rooted non-binary trees) • NCD: • Argument permutations grouped together • Float case close to int • Division between valid and invalid triangles
CONCLUSIONS • NCD can cluster tests on cognitive similarities • The differences we see are mainly explained by "white-boxness" (traces include implementation details) • Input data alone is not sufficient • NCD calculations are costly • Could NCD be used as a way to smooth the search space?
WHAT WILL BE DONE? • If we can measure distance, basically any distance, then why not measure: • Scientific real-world propagation • Quality of alternative information sources • Quality of individual engineers… • Clustering trouble reports using the Calinski–Harabasz (CH) or Silhouette index (and then doing root cause analysis (RCA) on clusters instead of individual reports, to get indications regarding faulty modules; see the sketch below)
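A sketch of the trouble-report idea, assuming a precomputed NCD matrix over report texts (scipy/scikit-learn are my choice here, not necessarily the project's tooling):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

D = np.asarray(matrix)  # an NCD matrix, e.g. built as in the sketch above

# Average-linkage hierarchical clustering on the condensed distance matrix
Z = linkage(squareform(D, checks=False), method="average")

# Pick the cut that maximises the Silhouette index, then do RCA per cluster
best_k = max(
    range(2, len(D)),
    key=lambda k: silhouette_score(
        D, fcluster(Z, k, criterion="maxclust"), metric="precomputed"
    ),
)
labels = fcluster(Z, best_k, criterion="maxclust")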
OTHER TODOs • Statistical tests for cumulative voting • Semi-automated systematic literature reviews via abstract clustering http://www.torkar.se
NODE DESCRIPTIONS • For p. 12: • XY_A1_A2_A3 • X = S/L: short/long integer arguments • X = F: Float arguments • Y = E: Equilateral triangle • Y = S: Scalene triangle • Y = I: Isosceles triangle • Y = X: Invalid triangle
References
[1] M. Li, X. Chen, X. Li, B. Ma, and P.M.B. Vitányi, "The similarity metric," in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, Maryland, January 12–14, 2003), Society for Industrial and Applied Mathematics, Philadelphia, PA, pp. 863–872.
[2] R.L. Cilibrasi and P.M.B. Vitányi, "The Google similarity distance," IEEE Transactions on Knowledge and Data Engineering, pp. 370–383, March 2007.
[3] P.M.B. Vitányi and M. Li, "Minimum description length induction, Bayesianism, and Kolmogorov complexity," IEEE Transactions on Information Theory, vol. 46, pp. 446–464, 2000.
[4] P.M.B. Vitányi, F.J. Balbach, R.L. Cilibrasi, and M. Li, "Normalized information distance," pp. 45–82 in: Information Theory and Statistical Learning, F. Emmert-Streib and M. Dehmer, Eds., Springer-Verlag, New York, 2008.