C-REX: An Evolutionary Code Extractor for C

Ahmed E. Hassan and Richard C. Holt, University of Waterloo

Traditional Code Extractor • Examines a snapshot of the code • Examples: • Rigiparse, CIA, CFX, CPPX • Produces facts such as: • function_1 calls function_2 • function_1 uses variable_1 • [Diagram: Source Code Snapshot → Traditional Extractor → Facts]

Evolutionary Code Extractor • Examines multiple snapshots of the code • [Diagram: snapshots S0, S1, .., St, St+1 → Traditional Extractor → facts F0, F1, .., Ft, Ft+1 → Compare Snapshot Facts → Evolutionary Change Data]

Evolutionary Code Extractor • Produces facts about: • Addition, removal, or modification of code entities (functions/variables/macros): • function_1 is added/removed/modified • Dependencies between code entities: • function_1 no longer calls function_2
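The comparison step described above can be illustrated with a minimal sketch. It assumes each snapshot has already been reduced to a mapping from entity name to entity text plus a set of (caller, callee) pairs; the function name `compare_snapshots` and the wording of the emitted facts are hypothetical and are not taken from C-REX itself.

```python
from typing import Dict, List, Set, Tuple

# Assumed snapshot representation (not from the slides):
# entity name -> entity text, and a set of (caller, callee) pairs.
Entities = Dict[str, str]
Calls = Set[Tuple[str, str]]


def compare_snapshots(old_entities: Entities, old_calls: Calls,
                      new_entities: Entities, new_calls: Calls) -> List[str]:
    """Return change facts describing how the new snapshot differs from the old one."""
    facts = []
    # Entities present only in the new snapshot were added.
    for name in sorted(new_entities.keys() - old_entities.keys()):
        facts.append(f"{name} is added")
    # Entities present only in the old snapshot were removed.
    for name in sorted(old_entities.keys() - new_entities.keys()):
        facts.append(f"{name} is removed")
    # Entities present in both with different text were modified.
    for name in sorted(old_entities.keys() & new_entities.keys()):
        if old_entities[name] != new_entities[name]:
            facts.append(f"{name} is modified")
    # Dependency changes are set differences over the call pairs.
    for caller, callee in sorted(new_calls - old_calls):
        facts.append(f"{caller} now calls {callee}")
    for caller, callee in sorted(old_calls - new_calls):
        facts.append(f"{caller} no longer calls {callee}")
    return facts
```

Comparing whole entity texts is the simplest possible notion of "modified"; a real extractor would likely compare at a finer granularity.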

The Need for Evolutionary Change Data • Assist in understanding how source code evolves and how code changes propagate: • Build better software development tools: • Measure the benefits of not-yet-existing tools • Measure the value of adopting development tools or methodologies • Monitor the quality and state of software systems: • Examine if changes are localized or scattered

Motivating Example • V1: main() { int a; /*call help*/ helpInfo(); } -- undefined function (link error) • V2: main() { int a; /*call help*/ helpInfo(); } helpInfo() { errorString! } -- syntax error • V3: main() { int a; /*call help*/ helpInfo(); } helpInfo() { int b; } -- valid code

Challenges • Robustness -- code is always changing: • Release frequency - too coarse • Change frequency - code may be invalid/incomplete (lookahead techniques) • Accuracy of extracted data (AST level is too complex): • Do not go to the AST level, to make life easier • Scalability/performance when extracting large, long-lived systems • Complexity of building the extractor and effort required: • Adopt off-the-shelf components

Evolutionary Code Extractor • Using a traditional snapshot extractor is not feasible • [Diagram: snapshots S0, S1, .., St, St+1 → Traditional Extractor → facts F0, F1, .., Ft, Ft+1 → Compare Snapshot Facts → Evolutionary Change Data]

ctags • Can tag start and end of code entities • Used by source editors for highlighting and navigational support • Contains a variety of heuristics to handle incomplete and complex code (ifdef’s and K&R vs. ANSI C) • Supports over 30 languages • Actively maintained and highly optimized
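To make the "start and end of code entities" point concrete, here is a small sketch that parses one line of a tags file. It assumes the extended tags format (name, file, address, then tab-separated `key:value` fields) with `line` and `end` fields present, as some ctags implementations can emit when those fields are enabled; the ctags version and options used by C-REX are not stated in the slides, and `parse_tag_line` is a hypothetical helper.

```python
from typing import Dict, Optional


def parse_tag_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one extended-format tags line into a dict; return None if it is not an entity."""
    if line.startswith("!_TAG_"):
        return None  # pseudo-tags describe the tags file itself, not code entities
    # The address is terminated by ;" and followed by tab-separated extension fields.
    head, sep, extension = line.rstrip("\n").partition(';"\t')
    parts = head.split("\t", 2)
    if not sep or len(parts) != 3:
        return None
    name, path, _address = parts
    tag = {"name": name, "file": path}
    for field in extension.split("\t"):
        key, colon, value = field.partition(":")
        if colon:
            tag[key] = value      # e.g. line:5, end:7
        elif field:
            tag["kind"] = field   # a bare field is the kind letter, e.g. f
    return tag


# Hypothetical tags line for the helpInfo function used elsewhere in the slides.
sample = 'helpInfo\tmain.c\t/^helpInfo() {$/;"\tf\tline:5\tend:7\n'
tag = parse_tag_line(sample)
```

With `line` and `end` known, the text of an entity can be cut out of the file without building a full parse tree, which is the property the extractor relies on.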

Our Solution – C-REX • Use CVS to acquire thousands of code snapshots • Use ctags to assist in the parsing and code analysis • Use Perl scripts to drive the ctags analysis • Attach additional CVS data

C-REX Evolutionary Change Data Schema (1/2)

C-REX Schema (2/2)

C-REX Implementation Overview: Simple Example

Implementation Overview • Revision Data Extraction • Retrieve Revision Details from CVS: • File revisions • Developer name • Change message • Other changed files • Entity Extraction Using ctags • Record start and end of each entity and contents • Build Historical Symbol Table: • helpInfo • Main • helpInfo2
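One way to picture the historical symbol table is as a map from each entity name to the revisions in which it was defined. The sketch below assumes the per-revision entity names have already been extracted; the revision numbers are invented for illustration, and the actual table layout in C-REX may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_historical_symbol_table(
        snapshots: Iterable[Tuple[str, Iterable[str]]]) -> Dict[str, List[str]]:
    """Map each entity name to the ordered list of revisions in which it was defined."""
    table = defaultdict(list)
    for revision, entity_names in snapshots:
        # De-duplicate names within a revision, then record the revision once per entity.
        for name in sorted(set(entity_names)):
            table[name].append(revision)
    return dict(table)


# Hypothetical history: helpInfo appears in 1.2 and helpInfo2 in 1.3.
history = [
    ("1.1", ["main"]),
    ("1.2", ["main", "helpInfo"]),
    ("1.3", ["main", "helpInfo", "helpInfo2"]),
]
table = build_historical_symbol_table(history)
```

Because the table covers every entity ever seen, a name can be recognized as a known entity even in a revision where its definition is missing or does not compile.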

Implementation Overview • Entity Analysis – Create 3 buckets: • Code bucket • Comment bucket • Control bucket • Token Change Analysis • Dependency Change Analysis using Historical Symbol Table • Attaching Revision Data to Recovered Change Data
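The dependency change analysis step can be sketched as a token comparison against the historical symbol table. This is an illustration under assumptions, not the C-REX implementation: comments are stripped with a simple regular expression (string literals are not handled), any identifier that matches a known entity name counts as a dependency, and the function names are hypothetical.

```python
import re
from typing import List, Set

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
# Simplified C comment matcher; it does not account for comment markers inside strings.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def dependencies(body: str, symbol_table: Set[str], owner: str) -> Set[str]:
    """Return identifiers in the code (comments removed) that name known entities."""
    code = COMMENT.sub(" ", body)
    return {tok for tok in IDENTIFIER.findall(code)
            if tok in symbol_table and tok != owner}


def dependency_changes(owner: str, old_body: str, new_body: str,
                       symbol_table: Set[str]) -> List[str]:
    """Compare the dependencies of two revisions of one entity and report the differences."""
    old = dependencies(old_body, symbol_table, owner)
    new = dependencies(new_body, symbol_table, owner)
    facts = [f"{owner} now depends on {name}" for name in sorted(new - old)]
    facts += [f"{owner} no longer depends on {name}" for name in sorted(old - new)]
    return facts
```

Working on tokens instead of an AST means the analysis still produces output for revisions that do not compile, at the cost of possible false matches when a local name shadows an entity name.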

C-REX Output Size (in MB)

C-REX Limitations • Performance: • Extracting a 10-year project (NetBSD) takes 12 hours • RAID drives to improve performance • Parallelize the extraction • Dependency Analysis: • Does not consider the build system (Makefiles) • Dependency linking windows • Beyond C and CVS

Conclusions • Introduced the evolutionary code extractor -- a new type of code extractor that extracts the evolutionary history of a project • Discussed the challenges associated with building such an extractor • Presented the implementation of C-REX and highlighted its limitations
