WCRE 1999 / 2009

WCRE 1999 / 2009 Experiments with clustering as a software remodularization method Nicolas Anquetil Timothy C. Lethbridge

Forewarning Nicolas: • After this research I became suspicious of the usefulness of clustering for remodularization. I still am.

You have been warned (although note that Tim has a less gloomy view)

Agenda • Background of the research • Overview of the paper • From then until now • And now what? • An analogy • Another analogy

Background of the research Context: • KBRE group, U. of Ottawa, Canada • CSER project (Consortium for Software Engineering Research) • Pairs: university/company (U. of Ottawa/Telecom. company) • Focus on real problems and/or real situations

Background of the research The project: One company's PBX • 2+ MLOC • 2+ K files • 10+ possible configurations • 10+ years old (in 1999) • 2 proprietary languages • 1 directory • 0 packages

Background of the research Company situation: • High turnover (18 months) • High entry barrier (6+ months to be productive) • Aging software (and languages) • Configuration management difficulties

Overview of the paper ”providing solutions to help software engineers understand, restructure or migrate old software towards more modern architecture and/or languages”

Overview of the paper Possible solution: ”Clustering is used to gather software components into modules significant to the software engineers.”

Overview of the paper • Seminal paper by Theo Wiggerts, “Using Clustering Algorithms in Legacy Systems Remodularization”, WCRE'97 • Summary of the literature on clustering • Lists all the possible choices • Lists some advantages and drawbacks of these choices

Overview of the paper ”Clustering is a sophisticated research domain with many methods [...] Reverse engineering is a young domain [...] Clustering has been used with no deep understanding of all the issues involved.”

Overview of the paper ”Conclusions of Wiggerts' paper are those of the literature which may not entirely hold for reverse engineering.”

Overview of the paper • For example: • Living things naturally fit in an evolution tree (more or less) • Not so with software modularization • This must impact the tools we use and how we use them

Overview of the paper • Three issues • What clustering algorithms to use? • How to compute cohesion? • How to describe entities? • How to evaluate the results?

Overview of the paper • Algorithms • We tested mainly hierarchical agglomerative algorithms • Some tests with hill-climbing algorithms (”Bunch” tool: Mancoridis)
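To make "hierarchical agglomerative" concrete, here is a minimal sketch, not the setup used in the paper. It assumes complete linkage, a Jaccard similarity over feature sets, and a similarity threshold as the stopping rule; the file names and features are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity between two feature sets; shared absences are ignored."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(features, threshold):
    """Complete-linkage agglomerative clustering (illustrative sketch).

    Starts with one cluster per file and repeatedly merges the two most
    similar clusters until no pair reaches `threshold`.
    """
    clusters = [frozenset([name]) for name in sorted(features)]

    def linkage(c1, c2):
        # Complete linkage: two clusters are only as similar as their
        # least similar pair of members.
        return min(jaccard(features[x], features[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        best_pair = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(*best_pair) < threshold:
            break
        clusters = [c for c in clusters if c not in best_pair]
        clusters.append(best_pair[0] | best_pair[1])
    return sorted(sorted(c) for c in clusters)

# Hypothetical files, each described by the identifiers it references.
files = {
    "call_setup.src": {"open_line", "dial", "ring"},
    "call_teardown.src": {"open_line", "dial", "hang_up"},
    "log_writer.src": {"fmt", "write_log"},
    "log_reader.src": {"fmt", "read_log"},
}
print(agglomerate(files, threshold=0.3))
```

Lowering or raising `threshold` corresponds to cutting the merge hierarchy at a different height, which is one of the choices a user of such an algorithm has to make.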

Overview of the paper • Entities • We clustered files (into packages) • Description • Elements contained in the files: • Types, variables, routines, macros, comments, identifiers

Overview of the paper Reminder: ”Clustering algorithms do not discover some hidden structure in a system, but impose a structure on the set of entities they are given.”

Overview of the paper Some results • Redundancies among description schemes: • File, routine, variable, macro, type • Comments, identifiers

Overview of the paper Some results • Combining features (routine + variable + ...) improves the results

Overview of the paper Some results • Direct/sibling links • Sibling more used and better
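The difference between the two kinds of link can be sketched as follows, under the usual reading that a direct link relates two files when one references the other, while a sibling link relates two files that reference the same things. The `uses` relation and file names are hypothetical.

```python
# Hypothetical "uses" relation: file -> set of files it references.
uses = {
    "a.src": {"util.src"},
    "b.src": {"util.src", "c.src"},
    "c.src": set(),
    "util.src": set(),
}

def direct_link(x, y):
    """True when one of the two files references the other."""
    return y in uses[x] or x in uses[y]

def sibling_similarity(x, y):
    """Jaccard overlap of what the two files reference."""
    union = uses[x] | uses[y]
    return len(uses[x] & uses[y]) / len(union) if union else 0.0

# a.src and b.src never reference each other, yet both rely on util.src.
print(direct_link("a.src", "b.src"), sibling_similarity("a.src", "b.src"))
# b.src references c.src, but the two share nothing they both reference.
print(direct_link("b.src", "c.src"), sibling_similarity("b.src", "c.src"))
```

In this toy example the two views disagree on both pairs, which is why the choice between them changes the clusters an algorithm produces.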

Overview of the paper Some results • Avoid “sparse” descriptive features • Avoid similarity metrics that consider absence of a feature as significant
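A small illustration of why counting shared absences is a problem with sparse features. It compares the Jaccard coefficient (ignores features absent from both entities) with the simple matching coefficient (counts them as agreement). The two feature vectors are made up for the example.

```python
def jaccard(x, y):
    """Shared features divided by features present in at least one entity."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

def simple_matching(x, y):
    """Fraction of positions that agree, including features absent from both."""
    agree = sum(1 for a, b in zip(x, y) if a == b)
    return agree / len(x)

# Two hypothetical files over 10 possible features; each has exactly one
# feature, and they have none in common.
f1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
f2 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(jaccard(f1, f2))          # 0.0: nothing in common
print(simple_matching(f1, f2))  # 0.8: "similar" only through shared absences
```

With thousands of possible features and only a handful present per file, the second kind of metric makes nearly every pair of files look alike.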

From then until now • Raw numbers • What extensions?

From then until now References (volume) [data from Google scholar]

From then until now References (authors) • P.Tonella(8), F.Ricca(7), C.Girardi(5), E.Pianta(5) • O.Maqbool(7), HA.Babri(6) • C.Tjortjis(5) • N.Anquetil(5) • S.Ducasse(5) • K.Sartipi(4) [data from Google scholar]

From then until now References (venue) • Thesis = 11 • CSMR = 6 • IWPC = 6 • WCRE = 5 • J.Soft.Maint.Evol. = 4 • J.Syst.Soft. = 4 • ICSM = 3 • ICSE = 2 • Trans.Syst.Eng. = 2 [data from Google scholar]

From then until now Some extensions • Clustering, how? • New/improved algorithms • New/improved distance metrics • Clustering what? • New entities (and/or description) • Clustering, why? • Other extensions

From then until now New algorithm • Genetic algorithm • [Mahdavi] • “Combined algorithm” • [Saeed, Maqbool, Babri, Hassan, Sarwar]

From then until now New distance metric • Minimization of information loss • [Andritsos, Tzerpos]

From then until now New entities • Static web pages • [Di Lucca, Fasolino, Tramontana] • [Tonella, Ricca, Pianta, Girardi] • Association rules • [Maqbool, Babri] • Data vs. control • [Davey, Burd], [Sartipi, Kontogiannis] • Dynamic data • [Stroulia, Systä] • Co-change records

From then until now Other extensions • Evaluations / comparisons • [Tonella], [Wu, Holt], [Parsa, Bushehrian] • Framework

From then until now Other extensions • Needs of maintainers? • [Tjortjis, Layzell] • Input for visualization tools • [Ducasse] • Naming clusters • [Tzerpos], [Maqbool, Babri]

And now what? • Back to paper's results • Wild ideas in clustering • Related topics

And now what? Paper's results • Choice of (traditional) algorithm matters little • It will give a result • Not significantly better or worse than other

And now what? Paper's results • Choice of similarity metric matters little • As long as they don't consider absence of a feature as a sign of similarity

And now what? Paper's results • Choice of description scheme for entity matters a bit more • May be source of short term progress? • Using dynamic information?

And now what? Wild ideas • Consider new entities? • Individual instructions? • Non code: requirements, model elements, tests, … ? • Process-wise modularization? • Clustering requirements, model elements, ...

And now what? Related topics • Problem without solution? • Software modularization is highly subjective • Packages are not mutually exclusive • Decisions must be made that are always wrong (and always correct)

And now what? Related topics • Modularization is a logical (virtual) decomposition based on semantics • High cohesion, low coupling may only be an (imperfect) by-product of pre-chosen modularization • Cohesion/coupling not a driving force but a secondary goal? • Other forces, e.g. packages of “comparable” sizes

And now what? Related topics • Typical example: Utility package • Low cohesion, high coupling • java.util • BitSet, Calendar, Currency, Dictionary, EventListenerProxy, Formatter, Observable, Random, ResourceBundle, Scanner, UUID, TimeZone, ...

And now what? Related topics • How to evaluate results? • Open question in the paper • Cohesion/coupling • Normally useless because it is the function optimized by the algorithms • Gold standard • Manually: expensive, not precise • Automatically: biased
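As a rough illustration of the cohesion/coupling criterion mentioned above, the sketch below counts dependency edges that stay inside a module (a proxy for cohesion) and edges that cross module boundaries (a proxy for coupling). The graph, the two partitions, and the function name are hypothetical; real tools use more elaborate formulas.

```python
def intra_inter_edges(edges, module_of):
    """Return (edges inside a module, edges crossing module boundaries)."""
    intra = sum(1 for a, b in edges if module_of[a] == module_of[b])
    return intra, len(edges) - intra

# Hypothetical dependency graph between six files.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

# Two candidate modularizations of the same files.
partition_1 = {"a": "m1", "b": "m1", "c": "m1", "d": "m2", "e": "m2", "f": "m2"}
partition_2 = {"a": "m1", "b": "m1", "c": "m2", "d": "m2", "e": "m3", "f": "m3"}

print(intra_inter_edges(edges, partition_1))  # (6, 1)
print(intra_inter_edges(edges, partition_2))  # (3, 4)
```

An algorithm that searches for the partition maximizing such a count will score well on it by construction, so the same count says little when reused afterwards as an evaluation of the result.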

And now what? Related topics • How to evaluate results? • Other metrics, e.g. Stability, Non-extremity [Wu]

And now what? Paper's results • ”The fact that all six algorithms are ranked low on authoritativeness suggests that they may not be mature enough for use in production on large systems undergoing evolutionary change. However ...” [Wu, Holt, 2005]

An analogy • A short story of Belo Horizonte: • In 1893 a new capital is planned in the state of Minas Gerais (Brazil) • The architects/urbanists get inspiration from Washington D.C.

An analogy • The initial architecture: • Planned Belo Horizonte

An analogy • The city grew (2.5 million inhabitants, area=5.1 Mh.)

An analogy • The city grew (2.5 million inhabitants)
