reading & understanding code

reading & understanding code • experts are better at code comprehension because they focus on higher level patterns • patterns can be considered “discourse rules” • naming conventions, design patterns, schemas • experts work significantly better when reading & writing code according to these patterns

reading & understanding code • program comprehension • expertise effects • mental models • tools

outline • mental models • types • models • conventions & “discourse rules” • expertise effects • tool implications • interesting tools

mental model • explanation of someone’s thought process when carrying out a task • our someone: programmers • our task: program comprehension • several models exist

mental model classes • bottom-up • read code statement by statement then ascend for a higher-level picture • top-down • start with a high-level picture of what the code is doing then descend into code • mixed • incorporate elements from both, based on the situation

bottom-up mental models • 1st: read code statements • 2nd: chunking: group statements as abstractions • 3rd: repeat
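The three steps above can be sketched on a small function. This is an illustrative example only: the function `normalize` and the chunk labels in the comments are assumptions made for this sketch, not material from the cited studies.

```python
def normalize(scores):
    """Convert raw scores into fractions of their total."""
    # chunk 1: three statements read one by one, then grouped as "sum the scores"
    total = 0
    for s in scores:
        total += s

    # chunk 2: three more statements grouped as "divide each score by the sum"
    result = []
    for s in scores:
        result.append(s / total)

    # repeat: chunks 1 and 2 combine into "express scores as shares of the whole"
    return result
```

A reader working bottom-up would reach the docstring's summary last, after building it from the two lower-level chunks.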

chunking • [figure: a sequence broken into chunk 1, chunk 2, ..., chunk n; a chunk broken into element 1, element 2, ..., element k] • modified from wikipedia

chunking • program model • reasoning about the order of computation, how control moves throughout a program • “control flow” • situation model • reason about how data moves through atomic models • “data flow” N. Pennington Stimulus Structures and Mental Representations in Expert Comprehension of Computer Programs Cognitive Psychology, 1987
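A minimal sketch of how the same code can be read under either model. The function and the comment annotations are hypothetical and only mark which kind of question each line invites.

```python
def count_large_orders(orders, threshold):
    count = 0
    for amount in orders:        # control flow: the body runs once per order
        if amount > threshold:   # control flow: the branch is taken only for large orders
            count += 1           # data flow: the new count depends on the old count
    return count                 # data flow: the result derives from orders and threshold
```

A control-flow question about this code might be "how many times can the increment execute?"; a data-flow question might be "which inputs influence the returned value?".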

program & situation model studies • participants first primed for either control flow or data flow • shown a piece of code, asked to recall another piece of code which is related through either control flow or data flow • participants then asked a question that relates to either control or data flow • participants primed to think about control flow answered other control-flow questions faster, same with data flow N. Pennington Stimulus Structures and Mental Representations in Expert Comprehension of Computer Programs Cognitive Psychology, 1987

types of programmer knowledge • semantic: general programming concepts • low-level knowledge, e.g. what a=1 means • high-level knowledge, e.g. sorting algorithms • syntactic: language detail • overlaps between languages • stylistic: programming conventions • “discourse rules” B. Shneiderman and R. Mayer Syntactic/Semantic Interactions in Programmer Behavior: A Model and Experimental Results Journal of Computer & Information Sciences, 1979 E. Soloway, K. Ehrlich Empirical Studies of Programming Knowledge IEEE Transactions on Software Engineering, 1984

[figure: model components • problem statement • short term memory • internal semantics (working memory): program, high level concepts, low level concepts • knowledge (long term memory): semantic knowledge (high level concepts, low level concepts) and syntactic knowledge (COBOL, FORTRAN, PL/I, LISP)] B. Shneiderman and R. Mayer Syntactic/Semantic Interactions in Programmer Behavior: A Model and Experimental Results Journal of Computer & Information Sciences, 1979

evidence for semantic & syntactic knowledge • lab studies using FORTRAN • participants: programmers and non-programmers • asked to perform tasks that used one type of knowledge • six studies (will describe two) B. Shneiderman and R. Mayer Syntactic/Semantic Interactions in Programmer Behavior: A Model and Experimental Results Journal of Computer & Information Sciences, 1979

program memorization • study • two subject types: non-programmers & programmers • two program versions: normal & shuffled • participants asked to memorize a program • results • non-programmers performed equally poorly with normal & shuffled programs • programmers performed poorly with shuffled program, well with normal • were able to remember semantic details with syntactic variations • conclusion • programmers were not memorizing the program, but internal semantics to represent its function B. Shneiderman and R. Mayer Syntactic/Semantic Interactions in Programmer Behavior: A Model and Experimental Results Journal of Computer & Information Sciences, 1979
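To make the two program versions concrete, here is a hypothetical sketch of how a shuffled stimulus could be produced from a normal one. The study itself used FORTRAN programs; the Python lines, the function name, and the seed below are assumptions for illustration.

```python
import random

# A "normal" program, stored one statement per entry.
NORMAL_PROGRAM = [
    "total = 0",
    "for value in values:",
    "    total += value",
    "average = total / len(values)",
    "print(average)",
]


def shuffled_version(lines, seed=0):
    """Return the same statements in a random order (hypothetical stimulus generator)."""
    rng = random.Random(seed)   # fixed seed so the stimulus is reproducible
    shuffled = list(lines)      # copy, so the normal version stays intact
    rng.shuffle(shuffled)
    return shuffled
```

Both versions contain exactly the same statements, so any recall advantage on the normal version has to come from its structure rather than from its individual lines.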

commenting • study • two program versions • one with a 5-line high-level block comment at top • one with numerous interspersed low-level comments • participants asked to make modifications to program & memorize program • result • high-level comment participants performed better • strong correlation between ability to make modifications and ability to memorize • conclusion • memorization correlates strongly with comprehension • hierarchical chunking, which organizes statements into a unit, facilitates the comprehension process B. Shneiderman and R. Mayer Syntactic/Semantic Interactions in Programmer Behavior: A Model and Experimental Results Journal of Computer & Information Sciences, 1979
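The two commenting styles might look like the following. This is a hypothetical pair written for illustration; the study's actual FORTRAN programs and comments are not reproduced here.

```python
def mean_high_level(values):
    # Compute the arithmetic mean of a list of numbers:
    # accumulate the sum in a single pass over the list,
    # then divide by the number of items.
    # Assumes the list is not empty.
    total = 0
    for v in values:
        total += v
    return total / len(values)


def mean_low_level(values):
    total = 0                    # set total to zero
    for v in values:             # loop over the values
        total += v               # add v to total
    return total / len(values)   # divide total by the length
```

The block comment hands the reader a ready-made chunk for the whole function, while the interspersed comments mostly restate individual statements.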

top-down models • 1st: develop hypotheses about the program • 2nd: evaluate and refine hypotheses • with the help of beacons • 3rd: repeat • a process of “reconstructing knowledge”

beacons • “indexes into existing knowledge” • recognizable features in code that are cues to the presence of certain structures • e.g., looking for a listener pattern M. Storey Theories, Methods, and Tools in Program Comprehension: Past, Present, and Future Int. Workshop on Program Comprehension, 2005 R. Brooks Towards a theory of the comprehension of computer programs International Journal of Man-Machine Studies, 1981
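As a rough illustration of a beacon, a swap of adjacent elements inside nested loops may cue an experienced reader that the code is sorting, before the rest has been read. The function below is a plain bubble sort chosen for this sketch; it is an assumed example, not one taken from the cited papers.

```python
def sort_ascending(items):
    items = list(items)          # work on a copy
    n = len(items)
    for i in range(n):
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                # possible beacon: a swap inside nested loops hints at sorting
                items[j], items[j + 1] = items[j + 1], items[j]
    return items
```

A top-down reader who hypothesizes "this sorts the list" can use the swap line to confirm the hypothesis without tracing every iteration.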

beacon types • semantic knowledge “plans” • reusable generic program fragments • high-level or low-level • programming discourse conventions • “rules” that make program comprehension easier • found across programmers E. Soloway, K. Ehrlich Empirical Studies of Programming Knowledge IEEE Transactions on Software Engineering, 1984

brooks’ model • [figure: problem • external representation: requirement documentation, program code, design document • beacons from each external representation are matched against syntactic knowledge and semantic knowledge • verify internal schema vs external representation • internal representation: hypotheses and subgoals] R. Brooks Towards a theory of the comprehension of computer programs International Journal of Man-Machine Studies, 1981 modified from Jonathan I. Maletic’s slides: An Overview of Mental Models for Program Understanding

opportunistic & systematic strategies • study of programmers enhancing an existing program • two strategies: • systematic: read code in detail, tracing through control and data flow manually • developed control and data flow knowledge • opportunistic: focus only on code relevant to the task • developed only control flow knowledge, resulting in a weaker understanding Margaret-Anne Storey Theories, Methods, and Tools in Program Comprehension: Past, Present, and Future Int. Workshop on Program Comprehension, 2005

integrated model • maintainers switch between top-down and bottom-up comprehension • top-down if code or code type is familiar • program model (control-flow) when code is completely unfamiliar • situation model (data-flow) after a partial data-flow understanding is developed through top-down or program model methods • knowledge base: information from previous three models Margaret-Anne Storey Theories, Methods, and Tools in Program Comprehension: Past, Present, and Future Int. Workshop on Program Comprehension, 2005 A. von Mayrhauser and A.M. Vans From Program Comprehension to Tool Requirements for an Industrial Environment IEEE Workshop on Program Comprehension, 1993

validating the integrated model • taped professional maintenance programmers • worked with a large code base • classified as domain and language experts • tape transcriptions classified into model types • one of few studies with real world tasks

programming discourse rules • specify the conventions of programming • e.g., a variable’s name should reflect its function • e.g., don’t include code that won’t be used • similar to writing discourse rules, as outlined in books like Elements of Style • e.g., you expect to find the description for fig. 7 between those for fig. 6 and fig. 8 E. Soloway, K. Ehrlich Empirical Studies of Programming Knowledge IEEE Transactions on Software Engineering, 1984

rules of programming discourse • variable names should reflect function • don’t include code that won’t be used • if there is a test for a condition, then the condition must have the potential of being true • a variable that is initialized via an assignment statement should be updated via an assignment statement • don’t do double duty with code in a non-obvious way • an if should be used when a statement body is guaranteed to be executed only once, and a while used when a statement body may need to be repeatedly executed E. Soloway, K. Ehrlich Empirical Studies of Programming Knowledge IEEE Transactions on Software Engineering, 1984
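The last rule (if versus while) can be sketched with two functions that behave identically. Both functions and their names are hypothetical examples written for this note, not code from the paper.

```python
def clamp_plan_like(x, limit):
    if x > limit:       # body runs at most once, so an if matches the rule
        x = limit
    return x


def clamp_unplan_like(x, limit):
    while x > limit:    # same result, but the while suggests repetition that never happens
        x = limit
    return x
```

The second version is correct, yet a reader who trusts the convention will look for a reason the body might repeat and find none.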

testing discourse rules • lab study with expert & novice programmers • two program types • α (plan-like): obeyed discourse rules • β (un-plan-like): disobeyed discourse rules • participants given either α or β code, with one blank • task: fill the blank with what seems “natural” • participants were not told about α or β code • conclusion: experts fared best with α code

why have un-plan-like (β) code? • machine limitations • limited memory, processing, bandwidth, etc. • language limitations • less common: bugs, efficiency issues, etc. • programmer limitations • programmer does not have full mastery of discourse rules • historical traces • resistance to changing legacy code, permanent “temporary” code source: The Psychology of Computer Programming

XXX: PROCEDURE OPTIONS(MAIN);
  DECLARE B(1000) FIXED(7,2),
          C FIXED(11,2),
          (I, J) FIXED BINARY;
  C = 0;
  DO I = 1 TO 10;
    GET LIST((B(J) DO J = 1 TO 1000));
    DO J = 1 TO 1000;
      C = C + B(J);
    END;
  END;
  PUT LIST('RESULT IS ', C);
END XXX;
modified from The Psychology of Computer Programming

XXX: PROCEDURE OPTIONS(MAIN);
  DECLARE A(10000) FIXED(7,2),
          C FIXED(11,2),
          I FIXED BINARY;
  C = 0;
  GET LIST((A(I) DO I = 1 TO 10000));
  DO I = 1 TO 10000;
    C = C + A(I);
  END;
  PUT LIST('RESULT IS ', C);
END XXX;
modified from The Psychology of Computer Programming


naming conventions • meaningful names • variable naming reflects cognitive structure • grammatical sensibility • interact with language spec. to form expressions • containers & paths • objects & pointers • polysemy, homonymy, & overloading • operators, name sharing B. Liblit, A. Begel, and E. Sweetser Cognitive Perspectives on the Role of Naming in Computer Programs Psychology of Programming Interest Group, 2006

meaningful names • metaphors for domain tasks • e.g. pushing objects onto a stack • keywords for grouping • e.g. common prefixes & suffixes • informative names • balanced with name length A. Blackwell Metaphor or analogy: how should we see programming abstractions? Psychology of Programming Interest Group, 1996 B. Liblit, A. Begel, and E. Sweetser Cognitive Perspectives on the Role of Naming in Computer Programs Psychology of Programming Interest Group, 2006

name length • longer names harm readability and recall • idioms and memory ties improve readability and recall • takeaway: names built from a consistent, abbreviated vocabulary are optimal • (i.e., variable names that concisely express a metaphor) D. Binkley, D. Lawrie, S. Maex, and C. Morrell Identifier length and limited programmer memory Science of Computer Programming, 2009

grammatical sensibility • names as phrase fragments • methods as actions (change state of program) • e.g. addElement, setSize, removeAll • methods as mathematical functions (compute result, don’t alter state) • e.g. true/false: contains, equals, isEmpty • e.g. data: capacity, indexOf, size • valence cues (phrase fragments w/ open slot) • e.g. roster.contains(player) • smalltalk makes use of this extensively: • roster insert: player at: position B. Liblit, A. Begel, and E. Sweetser Cognitive Perspectives on the Role of Naming in Computer Programs Psychology of Programming Interest Group, 2006
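The naming categories above can be sketched as a small class. The `Roster` class and its Python-style method names are assumptions chosen to mirror the slide's addElement / contains / size examples; they do not come from the cited paper.

```python
class Roster:
    """Hypothetical container used to illustrate name grammar."""

    def __init__(self):
        self._players = []

    def add_player(self, player):    # action: verb phrase, changes state
        self._players.append(player)

    def remove_all(self):            # action: verb phrase, changes state
        self._players.clear()

    def contains(self, player):      # true/false query, leaves state alone
        return player in self._players

    def is_empty(self):              # true/false query, leaves state alone
        return not self._players

    def size(self):                  # data query: noun, returns a value
        return len(self._players)
```

At a call site such as `roster.contains(player)`, the receiver and the argument fill the open slots of the phrase, which is the valence cue described on the slide.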

20:1 programmer performance • Sackman et al.: best programmers are 20x better than worst programmers at bug fixing • study originally meant to evaluate the effectiveness of time-shared systems H. Sackman, W. J. Erikson, and E. E. Grant Exploratory experimental studies comparing online and offline programming performance Communications of the ACM, 1968

10:1 programmer performance • there are substantial programmer efficiency differences, but not as dramatic as initially reported • what makes experts so much better at understanding code?

testing discourse rules • lab study with expert & novice programmers • two program types • α (plan-like): obeyed discourse rules • β (un-plan-like): disobeyed discourse rules • participants given either α or β code, with one blank • task: fill the blank with what seems “natural” • participants were not told about α or β code

α problem
PROGRAM Magenta(input, output);
VAR Max, I, Num : INTEGER;
BEGIN
  Max := 0;
  FOR I := 1 TO 10 DO
    BEGIN
      READLN(Num);
      IF Num ____ Max THEN Max := Num
    END;
  WRITELN(Max)
END.
E. Soloway, K. Ehrlich Empirical Studies of Programming Knowledge IEEE Transactions on Software Engineering, 1984

α solution
PROGRAM Magenta(input, output);
VAR Max, I, Num : INTEGER;
BEGIN
  Max := 0;
  FOR I := 1 TO 10 DO
    BEGIN
      READLN(Num);
      IF Num > Max THEN Max := Num
    END;
  WRITELN(Max)
END.
E. Soloway, K. Ehrlich Empirical Studies of Programming Knowledge IEEE Transactions on Software Engineering, 1984
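For contrast, here is a Python rendering of the α program next to a hypothetical un-plan-like variant. The slides do not show the β program, so `beta_like` is an assumed example of breaking the "variable names should reflect function" rule, not the study's actual stimulus.

```python
def alpha_max(nums):
    max_value = 0                  # like the slide's program, assumes non-negative input
    for num in nums:
        if num > max_value:        # the "natural" operator for a variable named max
            max_value = num
    return max_value


def beta_like(nums):
    max_value = 999999             # hypothetical: the name says "max" ...
    for num in nums:
        if num < max_value:        # ... but the code keeps the smallest value
            max_value = num
    return max_value
```

A reader who relies on the variable name would be inclined to fill the blank with `>` in both versions, which is only right for the plan-like one.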
