Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example

Dependency Analysis: - CSE681 Pr#2- an Interesting Design Example Jim Fawcett CSE784 – Software Engineering Studio Fall 2002

Summary of Project Requirements • This year in Software Modeling and Analysis we are developing a source code dependency analyzer. • Program must analyze C++ and C# source code files and determine code dependency relationships between them. • Easy for C++, just extract include statements. • Not so easy for C#. Have to catalog all classes and structs defined by each source file and see which files use those types. • Program must display all the dependency results to the user and also save in an XML file.

Evolving View • First reaction: • This is simple, easy to implement, no issues. • Second reaction: • Oooohhhhh! Not so simple, many issues. • Final reaction – if we do the design well: • O.K. • All the parts are simple. • They play together well. • Performance issues are manageable.

Too Soon to Construction • We start with the first reaction. • Only get the second reaction when we have committed to a lot of code. • We now have a mess. • We’re very likely to hold onto the mess because of our investment of time and effort. • That’s a mistake, as we just get mired deeper and deeper into the mess.

Too Soon to Code! Rising tide of chaos and drudge. (Drudge is very ugly and fragile code)

Design Issue #1 – Client Focus • Program should operate quickly • We need to think about algorithms and data structures. • It should address user goals • Use case studies are needed. • User interface should be designed to support uses. • What information does the user supply? Options: • search current directory • tree rooted at specified directory (current default) • specified named path/pattern • What information does the program supply and how is it displayed? • There are likely to be very large file sets to analyze. • User needs options to select important information

Use Cases • Dependency analysis generates information for: • Building test plans: • Don’t test a module until all the modules on which it depends have been tested. • Software maintenance: • What modules depend on the module we plan to change? We need to test them after the change to see if they have been adversely affected. • Documentation: • Documenting dependency information is an integral part of the design exposition.

Scope of Analysis • This project is concerned with dependencies between a program’s modules. • A module is a relatively small partition of a program’s source code into a cohesive part. • A typical module should consist of about 400 source lines of code (SLOC). • Obviously some will be smaller, some larger, but this is a good target size • Typical project sizes are: • Modest size research project – 10,000 sloc 25 modules • Modest size commercial product – 600 kslocs 1,500modules

Conclusions from Use Case Analysis • Even for relatively modest sized research projects, there is too much information to do an adequate analysis by hand. • We need automated tools. • The tools need to show dependencies in both ways, e.g.: • What files does this file depend on? • What files depend on this file? • The tools need to disclose dependencies between all files in the project.

Critical Issues • Scanning for Dependencies in C# modules • Data structure used to hold dependencies • Displaying large amounts of information to user • False dependencies due to unneeded includes in C++ modules • Dependence on System Libraries

Displaying Large Sets of Dependencies • User will probably want to: • Show dependencies sorted: • Display all files with no dependencies • Then all files that depend only on those already displayed • Useful for building test plans • Enter a name and get list of dependencies. • Useful for maintenance work. • In popup window show list of files entered so far and list of files not entered yet. • Show a scrolling list of files with their dependencies. • Select subset of files for display. • Show a compressed (bitmap?) matrix. • Show list of names, not matrix row. Matrix row may be far too long to view (e.g., 1500 elements).

Design Issue #1 – Client Focus Revisited • Program should operate quickly • We need to think about algorithms and data structures. • Can we process 1500 files in a reasonable amount of time? • What is reasonable? Results must be worth the wait. • It should address user goals • Use cases have shown that: • Program will be useful in a number of contexts. • Program may have to process a very large number of files. • How do we find them? • How do we display large amounts of resulting information? • User interface should be designed to support uses. • What information does the user supply? Options: • search current directory • tree rooted at specified directory (current default) • specified named path/pattern • What information does the program supply and how is it displayed? • There are likely to be very large file sets to analyze. • User needs options to select important information • Design must insure that the program displays information, not just data!

Design Issue #2 – Organizing Principles • Those design ideas, data structures, operation sequences, and partitions that make the design appear to be simple. • You know you have a good set of Organizing Principles when you grow confident that you can successfully implement the program.

Dependency Scanning for C# source • Will naïve scanning work for 1500 files? • If opening and scanning a single file takes 25 msec, then: • Finding dependencies for 1 file takes: 0.025 X1500 / 60 = 0.625 minutes • Finding dependencies for all files takes: 0.625 X 1500 / 60 = 15.6 hours! • So let’s scan each file once and store all its identifiers in hash table in RAM. • If that takes 30 msec per file: • Then making hash tables for all files takes: 0.03 X 1500 / 60 = 0.75 minutes • If hash table lookup takes 10 sec per file then finding dependencies between all files takes: 0.00001 X 1500 X 1500 / 60 + 0.75 = 1.125 minutes!

Prototype Code • Scanning – critically important • How much time to open file and scan for class, struct identifiers? • How much time to build HashTables and HashedFile objects? • How much time to evaluate dependencies between two files by HashTable lookup? • Sizes - important • How big is HashedFile object for typical files? • User Display – could leave to design team with requirement for early evaluation. • Mockup display alternatives.

Timing Results Parsing Prototype Source

Comparison of Estimated with Measured • Naïve scanning – scan each file 1500 times: • Estimated time to complete scanning of 1500 files: 15.6 hours • Measured time to complete scanning of 1500 files: 4.4 hours • Processing each file once and storing in Hashtable, then doing lookups for each file: • Estimated time to complete processing: 1.1 minutes • Measured time to complete processing: 0.2 minutes

Hash Table Layout

Memory to Store Hash Tables • Assume each file is about 500 lines of source code  about 30 chars X 500 = 15 KB • Assume that 1/3 of that is identifiers • The rest is comments, whitespace, keywords, and punctuators  5 KB of indentifier storage • Assume HashTable takes 10 KB per file, so the total RAM required for this data is: 0.01 X 1500 = 15 MB. • That’s large, but acceptable on today’s desktop machines.

File Scanning • For each file in C# file set: • For each class and struct identifer in file • Look in every other file’s HashTable for those identifiers • If found, other file depends on current file • Record dependency • Complexity is O(n2) • For each file in C++ file set: • #include statements completely capture dependency. • Record dependency • Complexity is O(n)

C# Scanning Process

C# Scanning Activities • Define file set • User supplies by browsing, selection, patterns • User may wish to scan subdirectory • Extract token information from each file: • Extract tokens from each file and store in HashTable. • Save list of Class and Struct identifiers from scan • Create HashedFile type with filename, class and struct list, and HashTable as data. • Store HashedFiles in ArrayList • For each HashedFile in list: • Walk through ArrayList searching HashTables for the identifiers in class and struct list (note that this is very fast). • First time one is found, stop processing file – dependency found.

C# Scan Activity Diagram

C++ Scan Activity Diagram

Memory to Hold Dependencies • Naïve storage uses a dense matrix. With 1500 files, that’s 2,250,000 elements. • Assume each path name is stored only once and we save 75 bytes of path information, so with 1500 files  112.5 KB • Dependency is a boolean and takes 1 byte to store  2.25 MB. • So, the total dependency matrix takes 2.36 MB. • Therefore, naïve storage is acceptable.

Storing Dependencies

Sorting Dependencies • Dependencies can be sorted easily. • Need to define a partial order on the set of dependencies. • Not every pair of files has an order. • Topological sort will bring set of files into an order in which any files that have dependency are ordered. • Have to be careful with mutual dependencies. • Must detect so that sorter does not cycle endlessly. • Reordering is fast and trivial: • Just swap index values in the FileIndexer Hashtable.

Dependency Inversion • Use cases have pointed out that a user may need two kinds of information: • What files depend on this file? • What files does this file depend on? • If we provide a simple interface for the dependency object like this: • Dependency(ulong size) • bool Depends(string file1, string file2) • void SetDep(string file1, string file2) • Void Swap(string file1, file2) • String[] GetDep(string file) Then inversion is just a matter of changing the order of two file names. • Swap allows dependency relations to be sorted without exposing internal data members.

Design Issue #2 – Organizing Principles Revisited • Organizing Principles for Dependency Analysis are mainly structural: • Encapsulate C# scanning in a set of HashedFile objects held in memory in an ArrayList. • Puts data (only) where it is needed. • Easily accessible to the scanner, hidden from other parts. • Encapsulate dependency storage in a Dependency object. • Inversion is trivial. • Accessible to scanner and GUI view manager, but clients don’t need to know any of the details. • These components make the design appear to be simple.

C# Scanner Class Diagram

Design Issue #3 – Program Structure • Should have an Executive module. • Makes program organization simpler. • Handles program requirements using specialized server modules. • May allow the servers to be reuseable, e.g., application agnostic. • Each server is focused on a single cohesive activity. • When this happens, server module is much more likely to be reuseable. • Much easier to test small, cohesive modules.

Partitions

Design Issue #3a - Communication • Partition to make data transactions as small and simple as possible. • FileHandler module creates a file list and sends a list reference to executive. • Executive passes reference to scanner. • Scanner passes dependency relation between two files to Depends module. • Many simple transactions. • View module sorts the Depends relationships. • Uses interface provided for that purpose. • Avoids polluting Depends module with user interface processing.

Design Issue #3b - Ownership • Ownership determines: • What objects are visible to what other objects. • Who creates each object. • Who determines an object’s lifetime. • Usually we want the creator to also control lifetime. • The nature of associations between objects in C++ affects ownership: • One object contains an instance of another. • Containing object ownes the contained, creates and destroys it. • One object contains a reference to another. • Containing object is not the owner, neither creating nor destroying contained object. • An object has a pointer to another. • May be the owner – it created the object on the heap and will destroy. • May not be the owner – another object passed it the pointer allowing access, but not transferring ownership.

Partitions

Dependency Program Ownership • Executive owns fileHandler, Scanner, DataViewManager objects. • fileHandler owns DirectoryNavigator. • The Dependency object could be owned by either the specialized scanners or by the DataViewManager. • The later seems simpler. • Could make FileIndexer and Depends matrix static. Then any user can simply create an instance of Dependency. • All get the same data since data is static. • Note that communication and ownership are determined by a “need-to-know” principle.

Design Issue #3c - Visibility • Visibility affects reuseability: • If we let the FileHandler module use the Scanner module we compromise its reuseability. • FileHandler will be useful for a lot of applications that have nothing to do with scanning. • Visibility affects simplicity: • If we provide a base Scanner class to provide a protocol for Executive to use, the Executive does not have to know anything about the differences between specialized scanners. • Should we decide later to add dependency analysis between script files, nothing changes except that we add a new specialized scanner.

Design Issue #4 - Performance • Usually, by performance, we mean efficient use of both CPU cycles and memory. • CPU cycles are often very sensitive to the algorithms we use. • Changing from naïve scanning to use of hash tables decreased running time by almost 3 orders of magnitude! • Encapsulating dependency relations behind a simple interface allows us to easily change to a more memory efficient data structure: • It would be relatively simple to replace the dense matrix (array of arrays) with a sparse matrix (list of lists). • That definitely would improve memory use, but degrade running time.

False Dependencies in C++ files • Need to scan both .h and .cpp files. • Could programmatically comment out each include – one at a time – and attempt to compile, thus finding the ones actually needed. • We would probably do this with a seperate tool. • We could also just scan, as we do for C#, but that is harder for C++ since we need to check dependencies on global functions and data as well as classes and structs.

Dependence on System Libraries • Not practical to scan for system dependencies in C#. • Can’t find source modules. • System dependencies can be found using reflection, but are not particularly useful. • System dependencies in C++ are easy to find from #include<someSystemHeader> • This information is often useful, so why not provide it?

Summary of Critical Issues • Scanning for Dependencies in C# modules √ • Data structure used to hold dependencies √ • Displaying large amounts of information ~√ • False C++ dependencies ~√ • Dependence on System Libraries • C# X • C++ √

Design Attributes • Abstraction - model of responsibilities • Modularity - physical packaging of program • Encapsulation - public interface, private implem. • Hierarchy - delegation of responsibility • Cohesion - focus on single activity • Coupling - minimize data transfer • Locality of reference - use data where defined • Size and Complexity - small testable components • Use of objects - self contained management

Partitions

Attribute #1 - Abstraction • Program modules have clearly defined responsibilities: • FileHandler - generate file set • Scanner - evaluate dependencies • Depends - store dependency relations • View - present dataviews as options • Executive - orchestrate program operations

Attribute #2 - Modularity • Module boundaries promote reuseability • FileHandler and Depends can be reused without changing a single character in their source code. • Responsibilities are well factored: • Two modules manage data - FileHandler and Depends. • Two modules manage user interface - Executive and View • One module conducts analysis – Scanner • Construction can be incremental: • Build in order: Executive, FileHandler, Dependency, Scanner, View • Testing can be incremental: • FileHandler and Dependency can be tested in isolation. • Scanner and View can be tested using FileHandler and Dependency as black boxes.

Attribute #3 - Encapsulation • Each server module has well defined interface and private implementation • FileHandler accepts path/pattern and returns file set • Needs no other interactions • Dependency stores and retrieves dependency relation from two file names. • Needs no other interactions. • Scanner accepts file list and Dependency reference, returns filled Dependency. • Only scanner needs the details of analysis context. • DataView • Presents views using only external interface of Depends. • Don’t need to pass around pointers or references to anything accept the file set.

Attribute #4 - Hierarchy • Delegation is effectively handled: • Modules: • Executive delegates to FileHandler, Scanner, and View • Scanner delegates to Depends • Dataview delegates to Depends • Classes: • fileHandler delegates to DirectoryNavigator • Polymorphism is used effectively: • Scanner class provides protocol for Executive. • CsScanner and CppScanner provide specialized behavior • Could easily add new scanner for a new analysis context.

Design Attribute #5 - Cohesion • Each of the modules are focused on a narrow set of activities: • Executive - manage servers • FileHandler - generate file set • Scanner - analyze file dependencies • Depends - store dependency • View - present data views • Their activities can all be summarized in a sentence. • That does not mean that their operations are simple, e.g., Scanner.

Design Attribute #6 - Coupling • Data transfer between modules is simple in almost all cases: • Executive to FileHandler: path/pattern string, file set reference • Executive to Scanner: file set reference • Scanner to Depends: two file names • View to Depends: sends file name, gets back array of dependent file names • Executive to View: see next slide!

Executive to View Coupling • Issue here is which module owns the display device - console or form. • It seems most natural to me to: • Have View return a, possibly sorted, set of selected rows. • Have Executive own the display, accept and handle all user requests, and simply ask View for the appropriate data by sending it a set of file names • allow patterns so “*” gets everything • View will return a reference to Dependency after sorting or will create a new instance populated with a subset of rows.

Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example