
Identifying Objects Using Cluster and Concept Analysis

Identifying Objects Using Cluster and Concept Analysis. Arie van Deursen and Tobias Kuipers, CWI, The Netherlands. Motivation: legacy code is incomprehensible and lacks structure. Case: a banking system of more than 100,000 LOC, written in Cobol with VSAM data files. The customer wanted an OO redesign. Data is central to the system.


Presentation Transcript


  1. Identifying Objects Using Cluster and Concept Analysis • Arie van Deursen, Tobias Kuipers • CWI, The Netherlands

  2. Motivation • Legacy code incomprehensible • Lack of structure • Case: >100,000 LOC Banking System • Cobol + VSAM data files • Customer wanted OO redesign • Data central to the system

  3. General Plan • Find interesting data • Data selection • Candidate attributes • Find interesting functionality • Program selection (procedure) • Candidate methods • Combine the two • Candidate classes

  4. Input Selection • Domain related v. Implementation specific • Persistent data stores • Only records written to/read from file • Refine by CRUD (Create/Read/Update/Delete) • Records too big for one class • Analysis of Program Call Graph • high fan-out: control-programs • high fan-in: low-level technical
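The call-graph filter on slide 4 can be sketched as follows. This is a minimal illustration, not the authors' tooling: the program names, the `CALLS` edge list, and the threshold parameters `max_fan_in` and `max_fan_out` are all hypothetical, since the slides give no concrete cut-off values.

```python
from collections import defaultdict

def fan_metrics(calls):
    """Return {program: (fan_in, fan_out)} for a list of (caller, callee) pairs."""
    callees = defaultdict(set)
    callers = defaultdict(set)
    for caller, callee in calls:
        callees[caller].add(callee)
        callers[callee].add(caller)
    programs = set(callees) | set(callers)
    return {p: (len(callers[p]), len(callees[p])) for p in programs}

def select_programs(calls, max_fan_in, max_fan_out):
    """Keep programs that are neither control programs (high fan-out)
    nor low-level technical routines (high fan-in)."""
    metrics = fan_metrics(calls)
    return sorted(p for p, (fan_in, fan_out) in metrics.items()
                  if fan_in <= max_fan_in and fan_out <= max_fan_out)

# Hypothetical call graph: one menu program, three domain programs, one utility.
CALLS = [
    ("MAINMENU", "CUSTUPD"), ("MAINMENU", "MORTCALC"), ("MAINMENU", "ADDRFMT"),
    ("CUSTUPD", "DATEUTIL"), ("MORTCALC", "DATEUTIL"), ("ADDRFMT", "DATEUTIL"),
]

candidates = select_programs(CALLS, max_fan_in=2, max_fan_out=2)
```

With these assumed thresholds, the menu program (fan-out 3) and the date utility (fan-in 3) are dropped, leaving the three domain programs as candidate methods.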

  5. Combining Data & Functionality • Cluster analysis -- technique for finding groups in data • Relies on metrics to compare distance between data items • Concept analysis -- for finding groups too • Relies on maximal subsets of data items sharing a set of features

  6. Cluster Analysis • Calculate distance (similarity) number between all data items (record fields) • Use clustering to find hierarchy
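A small sketch of the two steps on slide 6, under stated assumptions: the slides do not name the distance metric or the linkage rule, so this example uses Jaccard distance over the set of programs that use each field, and single linkage. The `USAGE` table is hypothetical; it only borrows field names and program labels that appear elsewhere in the deck.

```python
from itertools import combinations

# Hypothetical usage table: which programs (P1..P4) use which record field.
USAGE = {
    "Name": {"P1"}, "Title": {"P1"}, "Initial": {"P1"}, "Prefix": {"P1"},
    "Number": {"P4"}, "Nb-Ext": {"P4"}, "Zipcode": {"P4"},
    "Street": {"P2", "P4"}, "City": {"P3", "P4"},
}

def jaccard_distance(a, b):
    """0.0 for identical usage sets, 1.0 for disjoint ones (assumed metric)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def linkage(c1, c2, usage):
    """Single linkage: distance between the closest pair of fields (assumed rule)."""
    return min(jaccard_distance(usage[x], usage[y]) for x in c1 for y in c2)

def agglomerate(usage, max_distance=1.0):
    """Repeatedly merge the two closest clusters until none are within max_distance.

    Returns the remaining clusters and the merge history (distance, merged fields),
    which is the information a dendrogram displays.
    """
    clusters = [frozenset([field]) for field in sorted(usage)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], usage))
        d = linkage(a, b, usage)
        if d > max_distance:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((d, sorted(a | b)))
    return sorted(sorted(c) for c in clusters), merges

clusters, merges = agglomerate(USAGE, max_distance=0.9)
```

Cutting the hierarchy below distance 1 leaves two clusters in this toy data, one with the name fields and one with the address fields. Each field ends up in exactly one cluster, whatever cut is chosen.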

  7-13. [Figure: Dendrogram, built up over seven slides on an axis from 0 to 1. Fields: Name, Title, Initial, Prefix (slide 7); Number, Nb-Ext, Zipcode added (slide 8); City added (slide 10); Street added (slide 11). Slides 9 and 10 carry the annotation "Distance is 1".]

  14. Dendrogram from Real Data [Figure: dendrogram with axis ticks 0, 1, 2 over the fields Amount, OfficeName, BankCity, IntAccount, OfficeType, PaymentKind, RelationNr, ChangeDate, Account, MortSeqNr, MortNr, TitleCd, Prefix, Initial, Name, ZipCd, CountyCd, StreetNr, City, Street.]

  15. Concept Analysis • Relies on maximal subsets of data items sharing a set of features • Concept analysis finds a lattice
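To make "maximal subsets of items sharing a set of features" concrete, here is a naive sketch that enumerates every concept of a small context. It assumes items are record fields and features are the programs that use them. The `CONTEXT` table is hypothetical (in particular, which of P2 and P3 uses Street or City is an assumption), and the enumeration is a textbook closure-by-intersection approach, not necessarily the algorithm used by the authors.

```python
# Hypothetical context: record field -> set of programs that use it.
CONTEXT = {
    "Name": {"P1"}, "Title": {"P1"}, "Initial": {"P1"}, "Prefix": {"P1"},
    "Number": {"P4"}, "Nb-Ext": {"P4"}, "Zipcode": {"P4"},
    "Street": {"P2", "P4"}, "City": {"P3", "P4"},
}

def concepts(context):
    """Return all (extent, intent) pairs, i.e. all concepts of the context.

    An intent is a feature set obtained by intersecting the feature sets of
    some items; its extent is every item that has all of those features.
    """
    all_features = frozenset().union(*context.values())
    intents = {all_features}
    for features in context.values():
        fs = frozenset(features)
        # Close the collection under intersection with this item's features.
        intents |= {fs & intent for intent in intents}
    result = []
    for intent in intents:
        extent = frozenset(item for item, fs in context.items() if intent <= fs)
        result.append((extent, intent))
    # Order from the smallest extent (bottom) to the largest (top).
    return sorted(result, key=lambda c: (len(c[0]), sorted(c[0])))

lattice = concepts(CONTEXT)
for extent, intent in lattice:
    print(sorted(intent), "->", sorted(extent))
```

In this toy context Street belongs both to the concept for P4 and to the smaller concept for P2 and P4, so an item can appear in more than one group, and each group stays tied to the features that define it.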

  16. [Figure: Concept Lattice. Labels: set of features (P1, P2, P3, P4), set of items (field names), top (All Variables), bottom.]

  17. [Figure: Concept Lattice between top (All Variables) and bottom, with a node for P1 and a node for P4 over the fields Name, Title, Initial, Prefix, Number, Nb-Ext, Zipcode, Street, City.]

  18. [Figure: Concept Lattice with nodes P1 (Name, Title, Initial, Prefix) and P4 (Number, Nb-Ext, Zipcode, Street, City), plus nodes labelled P2 P4 and P3 P4 over Street and City.]

  19. [Figure: Concept Lattice, same nodes as slide 18 with the P2 P4 and P3 P4 nodes rearranged.]

  20. Real Concept Lattice [Figure: lattice with nodes numbered 1 to 14 and labelled A to X.]

  21. Concluding Remarks • Variable Selection - Input filtering • Records are natural starting point in data-intensive applications • Legacy/Cobol domain • Records are too big: Decompose them • Cluster analysis v. Concept analysis

  22. Cluster v. Concept Analysis • Multiple partitionings • Clustering does not show all possibilities • Items in multiple groups • Features and clusters • Origin of cluster decision is lost • Concept analysis is more efficient computationally • Clustering needs more filtering

  23. Questions

  24. Current Approaches • Subsystem classification techniques • Survey: Lakhotia ('97). Don't work for Cobol: Cimitile ('99) • Record as data part of a class • Newcomb & Kotik ('95) take level 01 records; Fergen et al. ('94) compare structure of records for reuse • Manual methodology • Sneed ('92) provides a manual methodology for migration of code; Sneed & Nyári ('95) derive 'OO' documentation from legacy code.
