Reverse Architecting. Arie van Deursen. Outline. Legacy systems Reverse architecting Architecture exploration Extraction Abstraction Presentation Evaluation. Motivation. Multi-channel distribution Web enable existing applications Due diligence / QA Company merger
Reverse Architecting Arie van Deursen
Outline • Legacy systems • Reverse architecting • Architecture exploration • Extraction • Abstraction • Presentation • Evaluation
Motivation • Multi-channel distribution • Web enable existing applications • Due diligence / QA • Company merger • Helping software immigrants • Estimating new functionality • Documentation is at best out of date
Legacy Systems Definition: • Any information system that significantly resists evolution • to meet new and changing business requirements Characteristics • Large • Geriatric • Outdated languages • Outdated databases • Isolated
Software Volume • Capers Jones software size estimate: • 700,000,000,000 lines of code • (7 * 10^9 function points) • (1 fp ~ 110 lines of code) • Total number of programmers: • 10,000,000 • 40% new development, 45% enhancements, 15% repair • (2020: 30%, 55%, 15%)
Reverse Architecting: Motivation • Architecture description lost or outdated • Obtain the advantages of an explicit architecture: • Stakeholder communication • Explicit design decisions • Transferable abstraction • Architecture conformance checking • Quality attribute analysis
Software Architecture • Structure(s) of a system which comprise • the software components • the externally visible properties of those components • and the relationships among them
Architectural Structures • Module structure • Data model structure • Process structure • Call structure • Type structure • GUI flow • ...
The 4 + 1 View Model • Logical view • Development view • Use case view • Physical view • Process view • Extract & compare!
Reverse Engineering • The process of analyzing a subject system with two goals in mind: • to identify the system's components and their interrelationships; and, • to create representations of the system in another form or at a higher level of abstraction. Decompilation Reverse Architecting
Reengineering • The examination and alteration of a subject system • to reconstitute it in a new form • and the subsequent implementation of that new form Beyond analysis -- actually improve.
Program Understanding • the task of building mental models of an underlying software system • at various abstraction levels, ranging from • models of the code itself to • ones of the underlying application domain, • for software maintenance, evolution, and reengineering purposes 50% of maintenance effort!!
Cognitive Processes • Building a mental model • Top down / bottom up / opportunistic • Generate and validate hypotheses • Chunking: create higher structures from chunks of low-level information • Cross referencing: understand relationships
Supporting Program Understanding • Architects build up mental models: • various abstractions of software system • hierarchies for varying levels of detail • graph-like structures for dependencies • How can we support this process? • infer number of predefined abstractions • enrich system’s source code with abstractions • let architect explore result
Architecture Exploration • Lesson from compiler construction: split processing into separate stages • Similarity with compilers: translate source code into a form that can be processed by machines • parsing turns source code into intermediate form • optimisation improves intermediate form • code generation emits the machine code • Goal: translate source code into a form that can easily be processed by humans
Architecture Exploration • Extract source models from system artifacts • Query/manipulate to infer new knowledge • Present different views on results (figure: artifacts, repository and results, connected by the extract, query and view steps)
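The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the artifact contents, the program names and the function names extract, query_reachable and view are all hypothetical, and a real repository would usually be a database rather than an in-memory set.

```python
import re
from collections import defaultdict

# Hypothetical artifacts: file name -> source text (made up for illustration).
ARTIFACTS = {
    "ORDERS.CBL": "CALL CUSTLOOK\nCALL PRICING\n",
    "PRICING.CBL": "CALL TAXCALC\n",
    "CUSTLOOK.CBL": "",
    "TAXCALC.CBL": "",
}

def extract(artifacts):
    """Extract: derive (caller, callee) facts from the artifacts."""
    facts = set()
    for name, text in artifacts.items():
        caller = name.split(".")[0]
        for callee in re.findall(r"\bCALL\s+([A-Z][A-Z0-9]*)", text):
            facts.add((caller, callee))
    return facts

def query_reachable(repository, start):
    """Query: infer new knowledge, here every program reachable from start."""
    graph = defaultdict(set)
    for caller, callee in repository:
        graph[caller].add(callee)
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def view(results):
    """View: present the result, here as a sorted text listing."""
    return "\n".join(sorted(results))

repository = extract(ARTIFACTS)              # the repository of extracted facts
reachable = query_reachable(repository, "ORDERS")
print(view(reachable))
```

Keeping the stages separate means a different extractor or a different view can be swapped in without touching the query.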
Source Model Extraction • Derive information from system artifacts • variable usage, call graphs, file dependencies, database access, … • Challenges • Accurate & complete results • Flexible: easy to write and adapt • Robust: deal with irregularities in input
Grammar Challenges • Syntax errors • Language dialects • Local idioms • Missing parts • Embedded languages • Preprocessing • Additional problem: grammar availability • process languages without grammar (e.g. undisclosed proprietary languages) • development of full grammar is expensive (Cobol: 1500 productions, 4-5 months)
Processing Artifacts • Syntactical analysis • generate / hand-code / reuse parser • Lexical analysis • tools like perl, grep, Awk or LSME, MultiLex • generally easier to develop
              accurate  complete  flexible  robust
syntactical      +         +         -         -
lexical          -         -         +         +
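The trade-off can be seen in a small lexical extractor. The sketch below is hypothetical: the Cobol-like fragment, the convention that a line starting with "*" is a comment, and the function names are assumptions made for illustration. It is quick to write and survives malformed input, but it reports calls that are not calls.

```python
import re

# Hypothetical Cobol-like fragment; lines starting with "*" are treated as comments.
SOURCE = """\
* CALL OLDPROG was removed in 1998
    CALL PRICING
    MOVE 'CALL HELP' TO MSG
    CALL TAXCALC
"""

CALL_RE = re.compile(r"\bCALL\s+([A-Z][A-Z0-9]*)")

def lexical_calls(text):
    """Grep-style extraction: every 'CALL <name>' anywhere in the text."""
    return CALL_RE.findall(text)

def lexical_calls_skip_comments(text):
    """Same, after dropping comment lines; string literals are still scanned."""
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("*")]
    return CALL_RE.findall("\n".join(lines))

print(lexical_calls(SOURCE))                # includes OLDPROG (comment) and HELP (string)
print(lexical_calls_skip_comments(SOURCE))  # still includes HELP
```

Each patch to the pattern removes one class of false positives, which is the flexibility the table credits to lexical analysis, while full accuracy still needs knowledge of the syntax.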
Island Grammars • Grammar containing: • detailed productions for constructs of interest • liberal productions that catch remainder • Islands: accuracy & completeness • Water: robustness
Island Grammars (figure: an input shown with its parse tree under a “standard” grammar and its parse tree under an island grammar)
Island Grammars • Accept larger language: the island grammar's language L_island contains the original language L • catch dialects, syntax errors, embedded languages, …
Island Grammars • Often smaller grammar • can share productions • can have different structure (figure labels: G_L, G_i, G_i’)
Example (Water)
  lexical syntax
    ~[] -> Water {avoid}
  context-free syntax
    Water -> Part
    Part* -> Input
Water is the “fall-back”
Example (Program Calls)
  lexical syntax
    ~[] -> Water {avoid}
    [A-Z][A-Z0-9]* -> Id
  context-free syntax
    Water -> Part
    Part* -> Input
    “CALL” Id -> Call
    Call -> Part
Water is the “fall-back”
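The same idea can be imitated in Python without a parser generator. The sketch below is not the grammar formalism used on the slide: it assumes whitespace-separated tokens rather than character-level water, and the function names are hypothetical. It does keep the essential rule that the island production is tried first and water is only the fall-back.

```python
import re

def is_id(token):
    """Id, following the lexical syntax [A-Z][A-Z0-9]*."""
    return re.fullmatch(r"[A-Z][A-Z0-9]*", token) is not None

def parse(text):
    """Return Input as a list of Parts: ('Call', id) or ('Water', token)."""
    tokens = text.split()
    parts, i = [], 0
    while i < len(tokens):
        # Island: "CALL" Id -> Call. Tried first, so water is only a fall-back.
        if tokens[i] == "CALL" and i + 1 < len(tokens) and is_id(tokens[i + 1]):
            parts.append(("Call", tokens[i + 1]))
            i += 2
        else:
            # Water: any token that is not part of an island.
            parts.append(("Water", tokens[i]))
            i += 1
    return parts

def calls(text):
    """Project the parse result onto the islands of interest."""
    return [value for kind, value in parse(text) if kind == "Call"]

print(calls("MOVE X TO Y. CALL PRICING ??garbage CALL 42"))
```

The unparseable "??garbage" and the malformed "CALL 42" simply become water, so the extractor keeps going where a full grammar would reject the input.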
Query and Manipulate • Goals: • infer new knowledge & abstractions • filter information • Example structures: • Perform graph • Call graph (OI, PVL) • Screen flow • Batch job • Subsystem dbs • In search of more abstraction
Combining Data & Functionality • Cluster analysis • technique for finding groups in data • Relies on metrics to compare distance between data items • Concept analysis • for finding groups too • Relies on maximal subsets of data items sharing a set of features
Cluster Analysis • Calculate distance (similarity) number between all data items (record fields) • Use clustering to find hierarchy
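A minimal sketch of these two steps in plain Python follows. Everything specific in it is an assumption: the usage table is invented, the distance is the Jaccard distance between the sets of programs using each field, and clusters are merged with average linkage. The slides do not say which metric or linkage was actually used.

```python
from itertools import combinations

# Hypothetical usage table: record field -> programs that use it.
USAGE = {
    "Name":    {"P1"},
    "Title":   {"P1"},
    "Street":  {"P3", "P4"},
    "City":    {"P2", "P4"},
    "Zipcode": {"P4"},
}

def jaccard_distance(a, b):
    """0 when two fields are used by the same programs, 1 when they share none."""
    return 1 - len(a & b) / len(a | b)

def cluster_distance(c1, c2):
    """Average linkage: mean distance over all cross-cluster field pairs."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(jaccard_distance(USAGE[x], USAGE[y]) for x, y in pairs) / len(pairs)

def agglomerate(items):
    """Repeatedly merge the two closest clusters; the merge list is the dendrogram."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        d = cluster_distance(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), round(d, 2)))
    return merges

merges = agglomerate(USAGE)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Fields that are always used together merge at distance 0, and groups with no program in common only merge at distance 1, which is where a dendrogram would be cut to obtain candidate records.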
Dendrogram (figure, built up over several slides: distance axis from 0 to 1; leaves Name, Title, Initial, Prefix, Number, Nb-Ext, Zipcode, City, Street; annotation “Distance is 1”)
Dendrogram from Real Data (figure: distance axis from 0 to 2; leaves Amount, OfficeName, BankCity, IntAccount, OfficeType, PaymentKind, RelationNr, ChangeDate, Account, MortSeqNr, MortNr, TitleCd, Prefix, Initial, Name, ZipCd, CountyCd, StreetNr, City, Street)
Concept Analysis • Relies on maximal subsets of data items sharing a set of features • Concept analysis finds a lattice
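A brute-force sketch shows what a concept is. The context below is hypothetical (field names as items, program names P1 to P4 as features), and enumerating every feature subset is only workable for tiny examples; it is not the algorithm a real tool would use.

```python
from itertools import combinations

# Hypothetical context: item (field) -> features (programs that use it).
CONTEXT = {
    "Name":    {"P1"},
    "Title":   {"P1"},
    "Zipcode": {"P4"},
    "Street":  {"P3", "P4"},
    "City":    {"P2", "P4"},
}
FEATURES = {"P1", "P2", "P3", "P4"}

def extent(features):
    """All items that have every one of the given features."""
    return frozenset(i for i, fs in CONTEXT.items() if features <= fs)

def intent(items):
    """All features shared by every one of the given items."""
    common = set(FEATURES)
    for item in items:
        common &= CONTEXT[item]
    return frozenset(common)

def concepts():
    """Every (items, features) pair that is maximal in both directions."""
    found = set()
    names = sorted(FEATURES)
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            items = extent(set(combo))
            found.add((items, intent(items)))
    return found

# Print from the top (all items, no shared feature) down to the bottom.
for items, features in sorted(concepts(), key=lambda c: -len(c[0])):
    print(sorted(items), sorted(features))
```

Ordering the concepts by inclusion of their item sets gives the lattice: the top concept holds all items, and the bottom concept holds the items that have every feature, which here is none.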
Concept Lattice (figure, built up over several slides: each node pairs a set of features with a set of items (field names); the top node holds all variables and the bottom node holds P1 P2 P3 P4; labelled nodes include P1 with Name, Title, Initial, Prefix; P4 with Number, Nb-Ext, Zipcode, Street, City; and P2 P4 and P3 P4 with City, Street)
Concept lattice (figure: each concept is labelled with program numbers and fields; concepts range from many fields to one field)
System Views • Grouping method based on feature table • Metrics or subset based • Find alternative system views: • Kruchten’s logical view • Object-based view on procedural code • Starting point for “objectification” • Keep “human in the loop”
Types • A type describes a set of possible values • A type groups variables • A type encapsulates representation • Parameter types provide interfaces • Types provide component connectors Types are architectural structures
But types are already available... • Not in a legacy language like Cobol: • Data division declares variables + structure • No separation between type/variable. • Repeated structure per variable. • No enumeration types, no ranges. • No parameters for sections • Similar problems with other legacy languages
Automatic Type Inference • Group variables based on usage • Initially: • Each variable unique primitive type • From statements infer equivalencies: • Assignment v := e • Comparison e1 > e2 • Computation e1 + e2
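One way to sketch this grouping is with a union-find structure. The class, the variable names and the statements below are hypothetical, and treating assignment, comparison and computation all as plain equivalences is a simplifying assumption of this sketch.

```python
class TypeInference:
    """Group variables into inferred types with a union-find structure."""

    def __init__(self, variables):
        # Initially each variable has its own unique primitive type.
        self.parent = {v: v for v in variables}

    def find(self, v):
        """Return the representative of v's type, compressing the path."""
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def unify(self, a, b):
        """Record that a and b must have the same type."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def types(self):
        """Return the inferred types as sorted groups of variable names."""
        groups = {}
        for v in self.parent:
            groups.setdefault(self.find(v), set()).add(v)
        return sorted(sorted(g) for g in groups.values())

ti = TypeInference(["A", "B", "C", "D", "E", "F"])
ti.unify("A", "B")   # assignment   A := B
ti.unify("B", "C")   # comparison   B > C
ti.unify("D", "E")   # computation  D + E
print(ti.types())    # [['A', 'B', 'C'], ['D', 'E'], ['F']]
```

A variable that never meets another one in a statement, such as F here, keeps its own type.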