Identifying Objects Using Cluster and Concept Analysis
Arie van Deursen, Tobias Kuipers
CWI, The Netherlands
Motivation
• Legacy code incomprehensible
• Lack of structure
• Case: >100,000 LOC Banking System
• Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system
General Plan
• Find interesting data
  • Data selection
  • Candidate attributes
• Find interesting functionality
  • Program selection (procedure)
  • Candidate methods
• Combine the two
  • Candidate classes
Input Selection
• Domain related v. Implementation specific
• Persistent data stores
• Only records written to/read from file
• Refine by CRUD (Create/Read/Update/Delete)
• Records too big for one class
• Analysis of Program Call Graph
  • high fan-out: control-programs
  • high fan-in: low-level technical
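The call-graph filter above can be sketched in a few lines. This is only an illustration: the program names, the call graph, and both thresholds are invented assumptions, not data from the banking system in the case study.

```python
from collections import defaultdict

# Hypothetical call graph (caller -> callees); all program names are invented.
CALLS = {
    "MAINMENU": ["CUSTUPD", "MORTCALC", "PAYREG", "DATEFMT"],
    "CUSTUPD": ["DATEFMT", "ERRLOG"],
    "MORTCALC": ["DATEFMT", "ERRLOG"],
    "PAYREG": ["DATEFMT", "ERRLOG"],
}

def fan_counts(calls):
    """Return (fan_in, fan_out) dictionaries for every program in the graph."""
    fan_out = {p: len(set(cs)) for p, cs in calls.items()}
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)
            fan_out.setdefault(callee, 0)  # programs that call nothing
    fan_in = {p: len(callers.get(p, ())) for p in fan_out}
    return fan_in, fan_out

def select_candidates(calls, max_fan_in=2, max_fan_out=3):
    """Drop likely control programs (high fan-out) and technical utilities
    (high fan-in). The thresholds are arbitrary assumptions."""
    fan_in, fan_out = fan_counts(calls)
    return sorted(p for p in fan_out
                  if fan_in[p] <= max_fan_in and fan_out[p] <= max_fan_out)

print(select_candidates(CALLS))
```

In this toy graph MAINMENU is filtered out as a control program and DATEFMT and ERRLOG as low-level utilities, leaving the three programs in between as candidates for methods.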
Combining Data & Functionality
• Cluster analysis -- technique for finding groups in data
  • Relies on a metric to measure the distance between data items
• Concept analysis -- also a technique for finding groups
  • Relies on maximal subsets of data items sharing a set of features
Cluster Analysis
• Calculate a distance (similarity) number between each pair of data items (record fields)
• Use clustering to find a hierarchy
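A minimal sketch of these two steps, under stated assumptions: each field is described by the set of programs that use it, the distance is the Euclidean distance between the resulting 0/1 vectors, and clusters are merged by single linkage. The usage table is hypothetical (the slides do not give one), and the choice of metric and linkage is an assumption rather than the authors' configuration.

```python
import math
from itertools import combinations

# Hypothetical usage table: record field -> programs that use it.
USAGE = {
    "Name": {"P1"}, "Title": {"P1"}, "Initial": {"P1"}, "Prefix": {"P1"},
    "Number": {"P4"}, "Nb-Ext": {"P4"}, "Zipcode": {"P4"},
    "City": {"P2", "P4"}, "Street": {"P3", "P4"},
}

def distance(a, b, usage=USAGE):
    """Euclidean distance between the binary usage vectors of two fields,
    which equals sqrt(number of programs using exactly one of them)."""
    return math.sqrt(len(usage[a] ^ usage[b]))

def cluster(items, dist, threshold):
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters until the closest pair is
    farther apart than `threshold`. Returns the remaining clusters and the
    list of merge distances (the heights in a dendrogram)."""
    clusters = [frozenset([i]) for i in items]
    heights = []
    while len(clusters) > 1:
        d, a, b = min(
            ((min(dist(x, y) for x in a for y in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if d > threshold:
            break
        clusters = [c for c in clusters if c != a and c != b] + [a | b]
        heights.append(d)
    return sorted(sorted(c) for c in clusters), heights

groups, heights = cluster(list(USAGE), distance, threshold=1.0)
for g in groups:
    print(g)
```

Cutting at a threshold of 1.0 groups the five address-like fields and the four name-like fields in this toy table; a lower or higher cut yields finer or coarser candidate classes.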
[Figure: Dendrogram, built up step by step over the record fields Name, Title, Initial, Prefix, Number, Nb-Ext, Zipcode, City and Street, on a distance axis from 0 to 1; one step is annotated "Distance is 1".]
[Figure: Dendrogram from Real Data, on a distance axis from 0 to 2, over the fields Amount, OfficeName, BankCity, IntAccount, OfficeType, PaymentKind, RelationNr, ChangeDate, Account, MortSeqNr, MortNr, TitleCd, Prefix, Initial, Name, ZipCd, CountyCd, StreetNr, City and Street.]
Concept Analysis
• Relies on maximal subsets of data items sharing a set of features
• Concept analysis arranges these groups (concepts) in a lattice
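To make "maximal subsets of items sharing a set of features" concrete, here is a naive sketch that enumerates all concepts of a small context. Items are field names and features are the programs that use them. The usage table is a hypothetical assumption, and the brute-force enumeration over feature subsets is chosen for clarity; it is not claimed to be the algorithm used by the authors.

```python
from itertools import combinations

# Hypothetical context: item (field name) -> features (programs using it).
USAGE = {
    "Name": {"P1"}, "Title": {"P1"}, "Initial": {"P1"}, "Prefix": {"P1"},
    "Number": {"P4"}, "Nb-Ext": {"P4"}, "Zipcode": {"P4"},
    "City": {"P2", "P4"}, "Street": {"P3", "P4"},
}

def items_sharing(features, usage):
    """All items that have every feature in `features`."""
    return frozenset(i for i, fs in usage.items() if features <= fs)

def features_shared(items, usage, all_features):
    """All features common to every item in `items`."""
    shared = set(all_features)
    for i in items:
        shared &= usage[i]
    return frozenset(shared)

def concepts(usage):
    """Enumerate all (items, features) concepts by closing every feature
    subset. Exponential in the number of features, so only for toy inputs."""
    all_features = frozenset().union(*usage.values())
    found = set()
    for r in range(len(all_features) + 1):
        for combo in combinations(sorted(all_features), r):
            extent = items_sharing(frozenset(combo), usage)
            intent = features_shared(extent, usage, all_features)
            found.add((extent, intent))
    return found

for extent, intent in sorted(concepts(USAGE), key=lambda c: -len(c[0])):
    print(sorted(intent), sorted(extent))
```

Each printed pair is a concept: no item can be added without losing a shared feature, and no feature can be added without losing an item. Ordering the concepts by inclusion of their item sets gives the lattice, with the concept containing all variables at the top.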
[Figure: Concept Lattice, built up step by step. Each node pairs a set of features (programs P1 to P4) with a set of items (field names). Labels on the slide: "top", "All Variables", "bottom", "P1 P2 P3 P4"; a node P1 with Name, Title, Initial, Prefix; a node P4 with Number, Nb-Ext, Zipcode, Street, City; nodes "P2 P4" and "P3 P4" next to Street and City.]
[Figure: Real Concept Lattice, with nodes numbered 1 to 14 and labels A to X.]
Concluding Remarks
• Variable Selection - Input filtering
• Records are natural starting point in data-intensive applications
• Legacy/Cobol domain
• Records are too big: Decompose them
• Cluster analysis v. Concept analysis
Cluster v Concept Analysis
• Multiple partitionings
• Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
• Origin of cluster decision is lost
• Concept analysis is computationally more efficient
• Clustering needs more filtering
Current Approaches
• Subsystem classification techniques
  • Survey: Lakhotia ('97). Do not work for Cobol: Cimitile ('99)
• Record as data part of a class
  • Newcomb & Kotik ('95) take level 01 records; Fergen et al. ('94) compare the structure of records for reuse
• Manual methodology
  • Sneed ('92) provides a manual methodology for migration of code; Sneed & Nyári ('95) derive 'OO' documentation from legacy code