

    1. Course Notes for CS1621 Structure of Programming Languages Part A By John C. Ramirez Department of Computer Science University of Pittsburgh

    2. 2 These notes are intended for use by students in CS1621 at the University of Pittsburgh and no one else These notes are provided free of charge and may not be sold in any shape or form Material from these notes is obtained from various sources, including, but not limited to, the textbooks: Concepts of Programming Languages, Seventh Edition, by Robert W. Sebesta (Addison Wesley) Programming Languages, Design and Implementation, Fourth Edition, by Terrence W. Pratt and Marvin V. Zelkowitz (Prentice Hall) Compilers Principles, Techniques, and Tools, by Aho, Sethi and Ullman (Addison Wesley)

    3. 3 Goals of Course To survey the various programming languages, their purposes and their histories Why do we have so many languages? How did these languages develop? Are some languages “better” than others for some things? To examine methods for describing language syntax and semantics

    4. 4 Goals of Course Syntax indicates structure of program code How can language designer specify this? How can programmer learn language? How can compiler recognize this? Lexical analysis Parsing (syntax analysis) Brief discussion of parsing techniques Semantics indicate meaning of the code What code will actually do Can we effectively do this in a formal way? Static semantics Dynamic semantics

    5. 5 Goals of Course To examine some language features and constructs and how they are used and implemented in various languages Variables and constants Types, binding and type checking Scope and lifetime Data Types Primitive types Array types Structured data types

    6. 6 Goals of Course Pointer (reference) types Assignment statements and expressions Operators, precedence and associativity Type coercions and conversions Boolean expressions and short-circuit evaluation Control statements Selection Iteration Unconditional branching – goto

    7. 7 Goals of Course Process abstraction: procedures and functions Parameters and parameter-passing Generic subprograms Nonlocal environments and side-effects Implementing subprograms Subprograms in static-scoped languages Subprograms in dynamic-scoped languages

    8. 8 Goals of course Data abstraction and abstract data types Object-oriented programming Design issues Implementations in various object-oriented languages Concurrency Concurrency issues Subprogram level concurrency Implementations in various languages Statement level concurrency

    9. 9 Goals of Course Exception handling Issues and implementations IF TIME PERMITS Functional programming languages Logic programming languages

    10. 10 Language Development Issues Why do we have high-level programming languages? Machine code is too difficult for us to read, understand and debug Machine code is not portable between architectures Why do we have many high-level programming languages? Different people and companies developed them

    11. 11 Language Development Issues Different languages are either designed to or happen to meet different programming needs Scientific applications FORTRAN Business applications COBOL AI LISP, Scheme (& Prolog) Systems programming C Web programming Perl, PHP, Javascript “General” purpose C++, Ada, Java

    12. 12 Language Development Issues Programming language qualities and evaluation criteria Readability How much can non-author understand logic of code just by reading it? Is code clear and unambiguous to reader? These are often subjective, but sometimes it is fairly obvious Examples of features that help readability: Comments Long identifier names Named constants What are some issues that are important when designing, choosing or evaluating programming languages? Readability example:
    C++:
        if (score <= 100)
            if (score >= 90)
                cout << "A" << endl;
        else
            cout << "A+";
    Ada:
        if (score <= 100) then
            if (score >= 90) then
                put ("A");
            end if;
        else
            put ("A+");
        end if;

    13. 13 Language Development Issues Clearly understood control statements Language orthogonality Simple features combine in a consistent way But it can go too far, as explained in the text about Algol 68 Writability Not dissimilar to readability How easy is it for programmer to use language effectively? Can depend on the domain in which it is being used Ex: LISP is very writable for AI applications but would not be so good for systems programming Also somewhat subjective Orthogonality example: Class declaration in Java allows programmer to create new type Array declaration in Java can be of any Java type So: Now programmer can make arrays of primitive types, arrays of objects, arrays of arrays, etc.

    14. 14 Language Development Issues Examples of features that help writability: Clearly understood control statements Subprograms Also orthogonality Reliability Two different ideas of reliability Programs are less susceptible to logic errors Ex: Assignment vs. comparison in C++ See assign.cpp and assign.java Programs have the ability to recover from exceptional situations Exception handling – we will discuss more later Ex: C error with = meant as comparison Ex: Compare to Java
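
The assign.cpp file isn't reproduced in the transcript; a minimal sketch, assuming it illustrates the classic = vs. == error:

    #include <iostream>
    using namespace std;

    int main()
    {
        int score = 0;
        // Typo: = assigns 100 to score instead of comparing.
        // The assignment's value (100) is non-zero, so C++
        // treats the condition as always true.
        if (score = 100)
            cout << "perfect score!" << endl;
        // Java rejects the equivalent code at compile time,
        // because an if condition must be boolean.
        return 0;
    }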

    15. 15 Language Design Issues Many factors influence language design Architecture Most languages were designed for single processor “von Neumann” type computers CPU to execute instructions Data and instructions stored in main memory General language approach Imperative languages Fit well with von Neumann computers Focus is on variables, assignment, selection and iteration Examples: FORTRAN, Pascal, C, Ada, C++, Java Draw picture of von Neumann machine on board Draw a few statements (assignment, etc)

    16. 16 Language Design Issues Imperative language evolution Simple “straight-line” code Top-down design and process abstraction Data abstraction and ADTs Object-oriented programming Some consider object-oriented languages not to be imperative, but most modern oo languages have imperative roots (ex. C++, Java) Functional languages Focus is on function and procedure calls Mimics mathematical functions Less emphasis on variables and assignment In strictest form has no iteration at all – recursion is used instead Examples: LISP, Scheme See example Give short example of a LISP program segment (equal ‘(a b) (car ‘((a b) c d e)))

    17. 17 Language Design Issues Logic programming languages Symbolic logic used to express propositions, rules and inferences Programs are in a sense theorems User enters a proposition and system uses programmer’s rules and propositions in an attempt to “prove” it Typical outputs: Yes: Proposition can be established by program No: Proposition cannot be established by program Example: Prolog – see example program Cost What are the overall costs associated with a given language? How does the design affect that cost? Give simple prolog example

    18. 18 Language Design Issues Training programmers How easy is it to learn? Writing programs Is language a good fit for the task? Compiling programs How long does it take to compile programs? This is not as important now as it once was Executing programs How long does program take to run? Often there is a trade-off here Ex: Java is slower than C++ but it has many run-time features (array bounds checking, security manager) that are lacking in C++

    19. 19 Language Implementation Issues How is HLL code processed and executed on the computer? Compilation Source code is converted by the compiler into binary code that is directly executable by the computer Compilation process can be broken into 4 separate steps: Lexical Analysis Breaks up code into lexical units, or tokens Examples of tokens: reserved words, identifiers, punctuation Feeds the tokens into the syntax analyzer

    20. 20 Language Implementation Issues Syntax Analysis Tokens are parsed and examined for correct syntactic structure, based on the rules for the language Programmer syntax errors are detected in this phase Semantic Analysis/Intermediate Code Generation Declaration and type errors are checked here Intermediate code generated is similar to assembly code Optimizations can be done here as well, for example: Unnecessary statements eliminated Statements moved out of loops if possible Recursion removed if possible

    21. 21 Language Implementation Issues Code Generation Intermediate code is converted into executable code Code is also linked with libraries if necessary Note that steps 1) and 2) are independent of the architecture – depend only upon the language – Front End Step 3) is somewhat dependent upon the architecture, since, for example, optimizations will depend upon the machine used Step 4) is clearly dependent upon the architecture – Back End

    22. 22 Language Implementation Issues Interpreting Program is executed in software, by an interpreter Source level instructions are executed by a “virtual machine” Allows for robust run-time error checking and debugging Penalty is speed of execution Example: Some LISP implementations, Unix shell scripts and Web server scripts

    23. 23 Language Implementation Issues Hybrid First 3 phases of compilation are done, and intermediate code is generated Intermediate code is interpreted Faster than pure interpretation, since the intermediate codes are simpler and easier to interpret than the source codes Still much slower than compilation Examples: Java and Perl However, now Java uses JIT Compilation also Method code is compiled as it is called so if it is called again it will be faster

    24. 24 Brief, Incomplete PL History Early 50’s Early HLL’s started to emerge FORTRAN Stands for FORmula TRANslating system Developed by a team led by John Backus at IBM for the IBM 704 machine Successful in part because of support by IBM Designed for scientific applications The root of the imperative language tree

    25. 25 Brief PL History Lacked many features that we now take for granted in programming languages: Conditional loops Statement blocks Recursive abilities Many of these features were added in future versions of FORTRAN FORTRAN II, FORTRAN IV, FORTRAN 77, FORTRAN 90 Had some interesting features that are now obsolete COMMON, EQUIVALENCE, GOTO We may discuss what these are later

    26. 26 Brief PL History Late 50’s COBOL COmmon Business Oriented Language Developed by US DoD Separated data and procedure divisions But didn’t allow functions or parameters Still widely used, due in part to the large cost of rewriting software from scratch Big companies would rather maintain COBOL programs than rewrite them in a different language

    27. 27 Brief PL History LISP LISt Processing Developed by John McCarthy of MIT Functional language Good for symbolic manipulation, list processing Had recursion and conditional expressions Not in original FORTRAN At one time used extensively for AI Today most widely used version, COMMON LISP, has included some imperative features

    28. 28 Brief PL History ALGOL ALGOL 58 and then ALGOL 60 both designed by international committee Goals for the language: Syntax should be similar to mathematical notation and readable Should be usable for algorithms in publications Should be compilable into machine code Included some interesting features Pass by value and pass by name (wacky!) parameters Recursion (first in an imperative language) Dynamic arrays Block structure and local variables

    29. 29 Brief PL History Introduced Backus-Naur Form (BNF) as a way to describe the language syntax Still commonly used today, but not well-accepted at the time Never widely used, but influenced virtually all imperative languages after it

    30. 30 Brief PL History Late 60’s Simula 67 Designed for simulation applications Introduced some interesting features Classes for data abstraction Coroutines for re-entrant subprograms ALGOL 68 Emphasized orthogonality and user-defined data types Not widely used

    31. 31 Brief PL History 70’s Pascal Developed by Niklaus Wirth No major innovations, but due to its simplicity and emphasis on good programming style, became widely used for teaching C Developed by Dennis Ritchie to help implement the Unix operating system Has a great deal of flexibility, esp. with types Incomplete type checking

    32. 32 Brief PL History Void pointers Coerces many types Many programmers (esp. systems programmers) love it Language purists hate it Easy to miss logic errors Prolog Logic programming We discussed it a bit already Still used somewhat, mostly in AI May discuss in more detail later

    33. 33 Brief PL History 80’s Ada Developed over a number of years by DoD Goal was to have one language for all DoD applications Especially for embedded systems Contains some important features Data encapsulation with packages Generic packages and subprograms Exception handling Tasks for concurrent execution We will discuss some of these later

    34. 34 Brief PL History Very large language – difficult to program reliably, even though reliability was one of its goals! Early compilers were slow and error-prone Did not have the widespread general use that was hoped Eventually the government stopped requiring it for DoD applications Use faded after this Not used widely anymore Ada 95 added object-oriented features Still wasn't used much, especially with the advent of Java and other OO languages

    35. 35 Brief PL History Smalltalk Designed and developed by Alan Kay Concepts developed in 60’s, but language did not come to fruition until 1980 Designed to be used on a desktop computer 15 years before desktop computers existed First true object-oriented language Language syntax is geared toward objects “messages” passed between objects “methods” are invoked as responses to messages Always dynamically bound All classes are subclasses of Object Also included software devel. environment Had large impact on future OOLs, esp. Java

    36. 36 Brief PL History C++ Developed largely by Bjarne Stroustrup as an extension to C Backward compatible Added object-oriented features and some additional typing features to improve C Very powerful and very flexible language But still has reliability problems Ex. no array bounds checking Ex. dynamic memory allocation Widely used and likely to be used for a while longer

    37. 37 Brief PL History Perl Developed by Larry Wall Takes features from C as well as scripting languages awk, sed and sh Some features: Regular expression handling Associative arrays Implicit data typing Originally used for data extraction and report generation Evolved into the archetypal Web scripting language Has many proponents and detractors

    38. 38 Brief PL History 90’s Java Interestingly enough, just like Ada, Java was originally developed to be used in embedded systems Developed at Sun by a team headed by James Gosling Syntax borrows heavily from C++ But many features (flaws?) of C++ have been eliminated No explicit pointers or pointer arithmetic Array bounds checking Garbage collection to reclaim dynamic memory

    39. 39 Brief PL History Object model of Java actually more resembles that of Smalltalk rather than that of C++ All variables are references Class hierarchy begins with Object Dynamic binding of method names to operations by default But not as pure in its OO features as Smalltalk, due to its imperative control structures Interpreted for portability and security Also JIT compilation now Growing in popularity, largely due to its use on Web pages

    40. 40 Brief PL History 00's (aughts? oughts? naughts?) See http://www.randomhouse.com/wotd/index.pperl?date=19990803 C# Main roots in C++ and Java with some other influences as well Used with the MS .NET programming environment Some improvements and some deprovements compared to Java Likely to succeed given MS support

    41. 41 Program Syntax Recall job of syntax analyzer: Groups (parses) tokens (fed in from lexical analyzer) into meaningful phrases Determines if syntactic structure of token stream is legal based on rules of the language Let’s look at this in more detail How does compiler “know” what is legal and what is not? How does it detect errors?

    42. 42 Program Syntax To answer these questions we must look at programming language syntax in a more formal way Language: Set of strings of lexemes from some alphabet Lexemes are the lowest level syntactic elements Lexemes are made up of characters, as defined by the character set for the language

    43. 43 Program Syntax Lexemes are categorized into different tokens and processed by the lexical analyzer Ex: Lexemes: if, (, width, <, height, ), {, cout, <<, width, <<, endl, ;, } Tokens: iftok, lpar, idtok, lt, idtok, rpar, lbrace, idtok, llt, idtok, llt, idtok, semi, rbrace Note that some tokens correspond to single lexemes (ex. iftok) whereas some correspond to many (ex. idtok)

    44. 44 Program Syntax How do we formally define a language? Assume we have a language, L, defined over an alphabet, Σ. 2 related techniques: Recognition An algorithm or mechanism, R, will process any given string, S, of lexemes and correctly determine if S is within L or not Not used for enumeration of all strings in L Used by parser portion of compiler

    45. 45 Program Syntax Generation Produces valid “sentences” of L Not as useful as recognition for compilation, since the valid “sentences” could be arbitrary More useful in understanding language syntax, since it shows how the sentences are formed Recognizer only says if sentence is valid or not – more of a trial and error technique

    46. 46 Program Syntax So recognizers are what compilers need, but generators are what programmers need to understand language Luckily there are systematic ways to create recognizers from generators Thus the programmer “reads” the generator to understand the language, and a recognizer is created from the generator for the compiler

    47. 47 Language Generators Grammar A mechanism (or set of rules) by which a language is generated Defined by the following: A set of non-terminal symbols, N Do not actually appear in strings A set of terminal symbols, T Appear in strings A set of productions, P Rules used in string generation A starting symbol, S

    48. 48 Language Generators Noam Chomsky described four classes of grammars (used to generate four classes of languages) – the Chomsky Hierarchy: Unrestricted Context-sensitive Context-free Regular More info on unrestricted and context-sensitive grammars in a theory course The last two will be useful to us

    49. 49 Language Generators Regular Grammars Productions must be of the form: <non> → <ter><non> | <ter> where <non> is a nonterminal, <ter> is a terminal, and | represents either-or Can be modeled by a Finite-State Automaton (FSA) Also equivalent to Regular Expressions Provide a model for building lexical analyzers

    50. 50 Language Generators Have following properties (among others) Can generate strings of the form α^n, where α is a finite sequence and n is an integer Pattern recognition Can count to a finite number Ex. { a^n | n = 85 } But we need at least 86 states to do this Cannot count to arbitrary number Note that { a^n } for any n (i.e. 0 or more occurrences) is easy – do not have to count Important to realize that the number of states is finite: cannot recognize patterns with an arbitrary number of possibilities

    51. 51 Language Generators Example: Regular grammar to recognize Pascal identifiers (assume no caps)
        N = {Id, X}  T = {a..z, 0..9}  S = Id
        P = Id → aX | bX | … | a | b | … | z
            X → aX | bX | … | 0X | … | 9X | a | … | z | 0 | … | 9
    Consider equiv. FSA:
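
The equivalent FSA was drawn in class; as a sketch (mine, not the notes'), the two-state machine can be coded directly in C++, with the states named after the grammar's nonterminals:

    #include <cctype>
    #include <string>
    using namespace std;

    // State Id: must see a letter first (assume no caps, per the slide).
    // State X: any mix of letters and digits thereafter; both states
    // accept once at least one valid character has been consumed.
    bool isIdentifier(const string& s)
    {
        if (s.empty() || !islower((unsigned char)s[0]))
            return false;                       // reject in state Id
        for (size_t i = 1; i < s.size(); i++)
            if (!islower((unsigned char)s[i]) &&
                !isdigit((unsigned char)s[i]))
                return false;                   // reject in state X
        return true;
    }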

    52. 52 Language Generators Example: Regular grammar to generate a binary string containing an odd number of 1s
        N = {A,B}  T = {0,1}  S = A
        P = A → 0A | 1B | 1
            B → 0B | 1A | 0
    Example: Regular grammars CANNOT generate strings of the form a^n b^n Grammar needs some way to count number of a’s and b’s to make sure they are the same Any regular grammar (or FSA) has a finite number, say k, of different states If n > k, not possible

    53. 53 Language Generators If we could add a “memory” of some sort we could get this to work Context-free Grammars Can be modeled by a Push-Down Automaton (PDA) FSA with added push-down stack Productions are of the form: <non> → α, where <non> is a nonterminal and α is any sequence of terminals and nonterminals (note the RHS is more flexible now)

    54. 54 Language Generators So how to generate a^n b^n? Let a=0, b=1
        N = {A}  T = {0,1}  S = A
        P = A → 0A1 | 01
    Note that now we can have a terminal after the nonterminal as well as before Can also have multiple nonterminals in a single production Example: Grammar to generate sets of balanced parentheses
        N = {A}  T = {(,)}  S = A
        P = A → AA | (A) | ()
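
A sketch (not from the notes) of the PDA idea in code: a recognizer for the balanced-parentheses language, where a counter serves as the push-down stack that an FSA lacks:

    #include <string>
    using namespace std;

    bool balanced(const string& s)
    {
        int depth = 0;                    // the "stack": count of open parens
        for (char c : s) {
            if (c == '(')
                depth++;                  // push
            else if (c == ')') {
                if (depth == 0)
                    return false;         // pop from an empty stack
                depth--;                  // pop
            } else
                return false;             // alphabet is only ( and )
        }
        return depth == 0;                // accept only if stack is empty
    }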

    55. 55 Language Generators Context-free grammars are also equivalent to BNF grammars Developed by Backus and modified by Naur Used initially to describe Algol 60 Given a (BNF) grammar, we can derive any string in the language from the start symbol and the productions A common way to derive strings is using a leftmost derivation Always replace leftmost nonterminal first Complete when no nonterminals remain

    56. 56 Language Generators Example: Leftmost derivation of nested parens: (()(()))
        A ⇒ (A) ⇒ (AA) ⇒ (()A) ⇒ (()(A)) ⇒ (()(()))
    We can view this derivation as a tree, called a parse tree for the string

    57. 57 Language Generators Parse tree for (()(()))

    58. 58 Language Generators If, for a given grammar, a string can be derived by two or more different parse trees, the grammar is ambiguous Some languages are inherently ambiguous All grammars that generate that language are ambiguous Many other languages are not themselves ambiguous, but can be generated by ambiguous grammars It is generally better for use with compilers if a grammar is unambiguous Semantics are often based on syntactic form

    59. 59 Language Generators Ambiguous grammar example: Generate strings of the form 0^n 1^m, where n,m >= 1
        N = {A,B,C}  T = {0,1}  S = A
        P = A → BC | 0A1
            B → 0B | 0
            C → 1C | 1
    Consider the string: 00011

    60. 60 Language Generators We can easily make this grammar unambiguous: Remove production: A → 0A1 Note that nonterminal B can generate an arbitrary number of 0s and nonterminal C can generate an arbitrary number of 1s Now only one parse tree

    61. 61 Language Generators Let’s look at a few more examples Grammar to generate: { WW^R | W ∈ {0,1}* }
        N = {A}  T = {0,1}  S = A
        P = ?
    Grammar to generate: strings in {0,1}* of the form WX such that |W| = |X| but W != X This one is a little trickier How to approach this problem? We need to guarantee two things Overall string length is even At least one bit differs in the two “halves”

    62. 62 Language Generators See board Ok, now how do we make a grammar to do this? Make every string (even length) the result of two odd-length strings appended to each other Assume odd-length strings are Ol and Or Make sure that either Ol has a 1 in the middle and Or has a 0 in the middle or Ol has a 0 in the middle and Or has a 1 in the middle Productions:

    63. 63 Language Generators Let’s look at an example more relevant to programming languages: Grammar to generate simple assignment statements in a C-like language (diff. from one in text):
        <assign stmt> ::= <var> = <arith expr>
        <arith expr> ::= <term> | <arith expr> + <term> | <arith expr> - <term>
        <term> ::= <primary> | <term> * <primary> | <term> / <primary>
        <primary> ::= <var> | <num> | (<arith expr>)
        <var> ::= <id> | <id>[<subscript list>]
        <subscript list> ::= <arith expr> | <subscript list>, <arith expr>

    64. 64 Language Generators Parse tree for: X = (A[2]+Y) * 20

    65. 65 Language Generators Wow – that seems like a very complicated parse tree to generate such a short statement Extra non-terminals are often necessary to remove ambiguity Extra non-terminals are often necessary to create precedence Precedence in previous grammar has * and / higher than + and - They would be “lower” in the parse tree What about associativity? Left recursive productions == left associativity Right recursive productions == right associativity

    66. 66 Language Generators But Context-free grammars cannot generate everything Ex: Strings of the form WW in {0,1}* Cannot guarantee that arbitrary string is the same on both sides Compare to WW^R These we can generate from the “middle” and build out in each direction For WW we would need separate productions for each side, and we cannot coordinate the two with a context-free grammar Need Context-Sensitive in this case

    67. 67 Language Generators Let’s look at one more grammar example Grammar to generate all postfix expressions involving binary operators * and -. Assume <id> is predefined and corresponds to any variable name Ex: v w x y - * z * - How do we approach this problem? Terminals – easy Nonterminals/Start – require some thought Productions – require a lot of thought

    68. 68 Language Generators
        T = { <id>, *, - }  N = { A }  S = { A }
        P = A → AA* | AA- | <id>
    Show parse tree for previous example Is this grammar LL(1)? We will discuss what this means soon

    69. 69 Parsers Ok, we can generate languages, but how to recognize them? We need to convert our generators into recognizers, or parsers We know that a Context-free grammar corresponds to a Push-Down Automaton (PDA) However, the PDA may be non-deterministic As we saw in examples, to create a parse tree we sometimes have to “guess” at a substitution

    70. 70 Parsers May have to “guess” a few times before we get the correct answer This does not lend itself to programming language parsing We’d like parser to never have to “guess” To eliminate guessing, we must restrict the PDAs to deterministic PDAs, which restricts the grammars that we can use Must be unambiguous Some other, less obvious restrictions, depending upon parsing technique used

    71. 71 Parsers There are two general categories of parsers Bottom-up parsers Can parse any language generated by a Deterministic PDA Build the parse trees from the leaves up back to the root as the tokens are processed At each step, a substring that matches the right-hand side of a production is substituted with the left side of the production “Reduces” input string all the way back to the start symbol for the grammar Also called “shift-reduce” parsing

    72. 72 Parsers Correspond to LR(k) grammars Left to right processing of string Rightmost derivation of parse tree (in reverse) k symbols lookahead required LR parsers are difficult to write by hand, but can be produced systematically by programs such as YACC (Yet Another Compiler Compiler). Primary variations of LR grammars/parsers SLR (Simple LR) LALR (Look Ahead LR) LR – most general but also most complicated to implement We'll leave details to CS 1622

    73. 73 Parsers Top-down parsers Build the parse trees from the root down as the tokens are processed Also called predictive parsers, or LL parsers Left-to-right processing of string Leftmost derivation of parse tree The LL(1) that we saw before means we can parse with only one token lookahead More restrictive than LR parsers: there are languages recognized by Deterministic PDAs that have no LL grammar (i.e. cannot be parsed by an LL parser) Some restrictions on productions allowed Cannot handle left-recursion – we'll see why shortly

    74. 74 Parsers Implementing a top-down parser One technique is Recursive Descent Can think of each production as a function As string of tokens is parsed, terminal symbols are consumed/processed and non-terminal symbols generate function calls Now we can see why left-recursive productions cannot be handled From Example 3.4 <expr> → <expr> + <term> Recursion will continue indefinitely without consuming any symbols

    75. 75 Parsers Luckily, in most cases a grammar with left recursion can be converted into one with only right-recursion Recursive Descent parsers can be written by hand, or generated Think of a program that processes the grammar by creating a function shell for each non-terminal Then details of function are filled in based upon the various right-hand sides the non-terminal generates See example
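
The example referenced above isn't in the transcript; here is a hedged sketch of a recursive descent parser for a simplified version of the earlier expression grammar, with the left recursion already rewritten as iteration (the single token 'x' stands in for <id>):

    #include <string>
    using namespace std;

    // One function per nonterminal; a global position consumes
    // the token string left to right.
    //   expr    -> term { '+' term }
    //   term    -> primary { '*' primary }
    //   primary -> 'x' | '(' expr ')'
    static string toks;
    static size_t cur;

    bool expr();   // forward declaration

    bool primary()
    {
        if (cur < toks.size() && toks[cur] == 'x') { cur++; return true; }
        if (cur < toks.size() && toks[cur] == '(') {
            cur++;                        // consume '('
            if (!expr()) return false;
            if (cur < toks.size() && toks[cur] == ')') { cur++; return true; }
        }
        return false;
    }

    bool term()
    {
        if (!primary()) return false;
        while (cur < toks.size() && toks[cur] == '*') {
            cur++;                        // consume '*'
            if (!primary()) return false;
        }
        return true;
    }

    bool expr()
    {
        if (!term()) return false;
        while (cur < toks.size() && toks[cur] == '+') {
            cur++;                        // consume '+'
            if (!term()) return false;
        }
        return true;
    }

    bool parse(const string& s)
    {
        toks = s; cur = 0;
        return expr() && cur == toks.size();
    }

For instance, parse("x+(x*x)") returns true while parse("x+") returns false.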

    76. 76 LL(1) Grammars So how can we tell if a grammar is LL(1)? Given the current non-terminal (or left side of a production) and the next terminal we must be able to uniquely determine the right side of the production to follow Remember that a non-terminal can have multiple productions As we previously mentioned, the grammar must not be left recursive However, not having left recursion is necessary but not sufficient for an LL(1) grammar

    77. 77 LL(1) Grammars Ex: A → aX | aY We cannot determine which right side to follow without more information than just "a" How can we process a grammar to determine if this situation occurs? Calculate the First set for each RHS of productions First set of a sequence of symbols, S, is the set of terminals that begin the strings derived from S Given multiple RHS for nonterminal N: N → α1 | α2 If First(α1) and First(α2) intersect, the grammar is not LL(1)

    78. 78 LL(1) Grammars So how do we calculate First() sets? Algorithm is given in Aho (see Slide 2) Consider symbol X
        If X is a terminal, First(X) = {X}
        If X → ε is a production, add ε to First(X)
        If X is a nonterminal and X → Y1Y2…Yk is a production
            Add a to First(X) if, for some i, a is in First(Yi) and ε is in all of First(Y1) … First(Yi-1)
            Add ε to First(X) if ε is in First(Yj) for all j = 1, 2, … k
    To calculate First(X1X2…Xn) for some X1X2…Xn
        Add non-ε symbols of First(X1)
        If ε is in First(X1), add non-ε symbols of First(X2) …
        If ε is in all First(Xi), add ε
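
A hedged C++ sketch of that algorithm as a fixed-point computation (the representation is my own, not from Aho: uppercase letters are nonterminals, lowercase are terminals, and '#' stands in for ε):

    #include <cctype>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>
    using namespace std;

    const char EPS = '#';    // stands in for epsilon

    // grammar maps each nonterminal to its right-hand sides.
    // Apply the rules repeatedly until no First set changes.
    map<char, set<char>> firstSets(const map<char, vector<string>>& grammar)
    {
        map<char, set<char>> first;
        bool changed = true;
        while (changed) {
            changed = false;
            for (const auto& [lhs, rhss] : grammar) {
                for (const string& rhs : rhss) {
                    bool allEps = true;   // do Y1..Yk all derive epsilon?
                    for (char y : rhs) {
                        // First of a terminal (or of '#') is itself
                        set<char> fy = isupper((unsigned char)y)
                                           ? first[y] : set<char>{y};
                        for (char a : fy)
                            if (a != EPS && first[lhs].insert(a).second)
                                changed = true;
                        if (!fy.count(EPS)) { allEps = false; break; }
                    }
                    if (allEps && first[lhs].insert(EPS).second)
                        changed = true;
                }
            }
        }
        return first;
    }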

    79. 79 LL(1) Grammars Examples – is each grammar LL(1)?
        1) A → aB | b | cBB                LL(1)
        2) B → aB | bA | aBb               Not LL(1)
        3) A → aB | CD | E | ε
           B → b
           C → cA | ε
           D → dA
           E → dB                          Not LL(1)

    80. 80 Semantics Semantics indicate the “meaning” of a program What do the symbols just parsed actually say to do? Two different kinds of semantics: Static Semantics Almost an extension of program syntax Deals with structure more than meaning, but at a “meta” level Handles structural details that are difficult or impossible to handle with the parser Ex: Has variable X been declared prior to its use? Ex: Do variable types match?

    81. 81 Semantics Dynamic Semantics (often just called semantics) What does the syntax mean? Ex: Control statements Ex: Parameter passing Programmer needs to know meaning of statements before he/she can use language effectively

    82. 82 Semantics Static Semantics One technique for determining/checking static semantics is Attribute Grammars Start with a context-free grammar, and add to it: Attributes (for the grammar symbols) Indicate some properties of the symbols Attribute computation functions (semantic functions) Allow attributes to be determined Predicate functions Indicate the static semantic rules

    83. 83 Semantics Attributes – made up of synthesized attributes and inherited attributes Synthesized Attributes Formed using attributes of grammar symbols lower in the parse tree Ex: Result type of an expression is synthesized from the types of the subexpressions Inherited Attributes Formed using attributes of grammar symbols higher in the parse tree Ex: Type of RHS of an assignment is expected to match that of LHS – the type is inherited from the type of the LHS variable

    84. 84 Semantics Semantic Functions Indicate how attributes are derived, based on the static semantics of the language Ex: A = B + C; Assume A, B and C can be integers or floats If B and C are both integers, RHS result type is integer, otherwise it is float Predicate functions Test attributes of symbols processed to see if they match those defined by language Ex: A = B + C If RHS type attribute is not equal to LHS type attribute, error (in some languages)

    85. 85 Semantics Detailed Example in text Grammar Rules 1) <assign> ? <var> = <expr> 2) <expr> ? <var> + <var> 3) <expr> ? <var> 4) <var> ? A | B | C Attributes actual_type: actual type of <var> or <expr> in question (synthesized, but for a <var> we say this is an intrinsic attribute) expected_type: associated with <expr>, indicating the type that it SHOULD be – inherited from actual_type of <var>

    86. 86 Semantics Semantic functions Parallel to syntax rules of the grammar See Ex. 3.6 in text
        1) <assign> → <var> = <expr>
           <expr>.expected_type ← <var>.actual_type
        2) <expr> → <var>[2] + <var>[3]
           <expr>.actual_type ← if (<var>[2].actual_type = int) and (<var>[3].actual_type = int) then int else real end if
        3) <expr> → <var>
           <expr>.actual_type ← <var>.actual_type
        4) <var> → A | B | C
           <var>.actual_type ← look-up(<var>.string)
    Predicate functions Only one needed here – do the types match? <expr>.actual_type == <expr>.expected_type
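
A hedged sketch of how these attribute rules play out for A = B + C, with the look-up table and type values invented for illustration:

    #include <iostream>
    #include <map>
    #include <string>
    using namespace std;

    enum Type { INT, REAL };

    int main()
    {
        // look-up(<var>.string): the intrinsic actual_type of each <var>
        map<string, Type> table = { {"A", REAL}, {"B", INT}, {"C", INT} };

        // Rule 2: <expr>.actual_type is synthesized from the operands
        Type actual =
            (table["B"] == INT && table["C"] == INT) ? INT : REAL;

        // Rule 1: <expr>.expected_type is inherited from the LHS <var>
        Type expected = table["A"];

        // Predicate function: do the types match?
        cout << (actual == expected ? "ok" : "type error") << endl;
    }

With these values the predicate fails (the RHS synthesizes int while the LHS expects real), since this example grammar defines no coercions.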

    87. 87 Semantics Ex: A = B + C

    88. 88 Semantics Attribute grammars are useful but not typically used in their pure form for full-scale languages – makes the grammars more complicated and compilers more difficult to generate

    89. 89 Semantics Dynamic Semantics (semantics) Clearly vital to the understanding of the language In early languages they were simply informal – like manual pages Efforts have been made in later years to formalize semantics, just as syntax has been formalized But semantics tend to be more complex and less precisely defined More difficult to formalize

    90. 90 Semantics Some techniques have gained support however Operational Semantics Define meaning by result of execution on a primitive machine, examining the state of the machine before and after the execution Axiomatic Semantics Preconditions and postconditions define meaning of statements Used in conjunction with proofs of program correctness Denotational Semantics Map syntactic constructs into mathematical objects that model their meaning Quite rigorous and complex

    91. 91 Identifiers, Reserved Words and Keywords Identifier String of characters used to name an entity within a program Most languages have similar rules for ids, but not always C++ and Java are case-sensitive, while Ada is not Can be a good thing: mixing case allows for longer, more readable names, ala Java: NoninvertibleTransformException Can be a bad thing: should that first i be upper or lower case?

    92. 92 Identifiers, Reserved Words and Keywords C++, Ada and Java allow underscores, while standard Pascal does not FORTRAN originally allowed only 6 chars Reserved Word Name whose definition is part of the syntax of the language Cannot be used by programmer in any other way Most newer languages have reserved words Make parsing easier, since each reserved word will be a different token

    93. 93 Identifiers, Reserved Words and Keywords Ex: end if in Ada Interesting extension topic If we extend a language and add new reserved words, we may make some old programs syntactically incorrect Ex: C subprogram using class as an id will not compile with a C++ compiler Ex: Ada 83 program using abstract as an id will not compile with an Ada 95 compiler Keywords To some, keyword = reserved word Ex: C++, Java

    94. 94 Identifiers, Reserved Words and Keywords To others, there is a difference Keywords are only special in certain contexts Can be redefined in other contexts Ex: FORTRAN keywords may be redefined Predefined Identifiers Identifiers defined by the language implementers, which may be redefined cin, cout in C++ real, integer in Pascal predefined classes in Java

    95. 95 Identifiers, Reserved Words and Keywords Programmer may wish to redefine for a specific application Ex: Change a Java interface to include an extra method Problem: predefined version no longer applies, so program segments that depend on it are invalid Better to extend a class or compose a new class than to redefine a predefined class Ex: Comparable interface can be implemented as we see fit by a new class

    96. 96 Variables Simple (naïve) definition: a name for a memory location In fact, it is really much more Six attributes Name Address Value Type Lifetime Scope

    97. 97 Variables Name: Identifier In most languages the same name may be used for different variables, as long as there is no ambiguity Some exceptions Ex: A method variable name may be declared only once within a Java method

    98. 98 Variables Address Location in memory Also called the l-value Some situations that are possible: Different variables with the same name have different addresses Declared in different blocks of the program Same variable has different addresses at different points in time Declared in a subprogram and allocated based on run-time stack We’ll discuss this more shortly

    99. 99 Variables Different variables share same address: aliasing Occurs with FORTRAN EQUIVALENCE, Pascal and Ada record variants, C and C++ unions, pointer variables, reference parameters Adds to the flexibility of a language, especially with pointers and reference parameters Can also save memory in some situations Many references to a single copy rather than having multiple copies Can be quite problematic if programmer does not handle them correctly Ex: copy constructor and = operator for classes with dynamic components in C++ Ex: shallow copy of arrays in Java We’ll discuss most of these things more in later chapters
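
A small sketch (mine, not the course's) of aliasing through pointers in C++:

    #include <iostream>
    using namespace std;

    int main()
    {
        int x = 5;
        int* p = &x;        // p aliases x: two names, one address
        *p = 10;            // modifies x through the alias
        cout << x << endl;  // prints 10

        int* a = new int[3]{1, 2, 3};
        int* b = a;         // shallow copy: b aliases the same array
        b[0] = 99;
        cout << a[0] << endl;   // prints 99
        delete[] a;         // freeing through b as well would be an error
        return 0;
    }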

    100. 100 Variables Type Modern data types include both the structure of the data and the operations that go with it Important in determining legality of some expressions Value Contents of memory locations allocated for that variable Also called the r-value

    101. 101 Variables Lifetime Time during which the variable is bound to a specific memory location We’ll discuss this more shortly Scope Section of the program in which a variable is visible Accessible to the programmer/code in that section We’ll discuss this more shortly

    102. 102 Binding Binding of variables deals with an association of variable attributes with actual values The time when each of these occurs in a program is the binding time of the attribute Static binding: occurs before runtime and does not change during program execution Dynamic binding: occurs during runtime or changes during runtime

    103. 103 Binding Name: Occurs when program is written (chosen by programmer) and for most languages will not change: static Address and Lifetime: A memory location for a variable must be allocated and deallocated at some point in a program The lifetime is the period between the allocation and deallocation of the memory location for that variable

    104. 104 Binding We are ignoring bindings associated with a computer’s virtual memory This in fact could cause a variable to be bound and unbound to different memory locations many times throughout the execution of a program Each time the data is swapped or paged out, the location is physically unbound, and then rebound when it is swapped or paged back in We will ignore these issues since they are more related to how the operating system and hardware execute programs than they are to the way the variables are declared and used within the program

    105. 105 Binding Text puts lifetimes of variables into 4 categories Static: bound to same memory cell for entire program Ex: static C++ variables Stack-dynamic: bound and unbound based on run-time stack Ex: Pascal, C++, Ada subprogram variables Explicit heap-dynamic: nameless cells that can only be accessed using other variables (pointers) Allocated (and poss. deallocated) explicitly by the programmer Ex: result of new in C++ or Java

    106. 106 Binding Implicit heap-dynamic variables Binding of all attributes (except name) changed upon each assignment Much overhead to maintain all of the dynamic information Used in Algol 68 and APL Not used in most newer languages See lifetimes.cpp Value Dynamic by the nature of a variable – can change during run-time
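
The lifetimes.cpp file isn't reproduced here; a hedged sketch of the first three lifetime categories:

    #include <iostream>
    using namespace std;

    int* heapCell;      // global pointer, itself static

    void f()
    {
        static int s = 0;   // static: one cell for the whole run
        int d = 0;          // stack-dynamic: new cell on every call
        s++; d++;
        cout << "static " << s << ", stack " << d << endl;
        heapCell = new int(42);   // explicit heap-dynamic: nameless cell
    }

    int main()
    {
        f();    // prints: static 1, stack 1
        f();    // prints: static 2, stack 1
                // the cell allocated in the first call is now
                // unreachable yet still allocated: heap lifetime
                // ends only with an explicit delete
        cout << *heapCell << endl;
        delete heapCell;
        return 0;
    }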

    107. 107 Binding Type Dynamic Binding Type associated with a variable is determined at run-time A single variable could have many different types at different points in a program Static Binding Type associated with a variable is determined at compile-time (based on var. declaration) Once declared, type of a variable does not change

    108. 108 Binding Advantage of dynamic binding More flexibility in programming Can use same variable for different types Can make operations generic Disadv. of dynamic binding Type-checking is limited and must be done at run-time To understand this better, we must discuss type-checking at more length We’ll return to our last binding topic (scope) afterward

    109. 109 Type-checking Type checking Determination that operands and operators in an expression have types that are compatible with each other By compatible we mean the types match or they can be coerced (implicitly converted) to match If they are not compatible, a type-error occurs

    110. 110 Type-checking Why is type-checking important? Let’s review programming errors Compilation error: detectable when the program is compiled Usually a syntax error or static semantic error Run-time error: detectable as the program is being run Often an illegal instruction or I/O error Logic error: error in meaning of the program Often only detectable through debugging and/or testing program on known data Program could seem to run perfectly, but produce incorrect results

    111. 111 Type-checking We’d like an environment that is not conducive to logic errors Consider dynamic type binding again Assignments cannot be type checked Since type may change, any assignment is legal If object being assigned is an erroneous type, we have a logic error Type checking that is done must be done at run-time This requires type information to be stored and accessed at run-time Must be done in software – i.e. the language must be interpreted Increases both memory and run-time overhead

    112. 112 Type-checking Now consider static type binding Since types are set at compile-time, most (but not usually all) type checking can be done at compile-time Assignments can be checked to avoid logic errors Type information does not need to be kept at run-time Program can run in hardware

    113. 113 Type-checking STRONGLY TYPED language 2 slightly different definitions Traditional definition: If ALL type checking can be done at compile-time (i.e. statically), a language is strongly typed Sebesta definition: If ALL type errors can always be detected (either at compile-time or at run-time) a language is strongly typed First definition is more reliable but also more restrictive

    114. 114 Type-checking No commonly used languages are truly strongly typed, but some come close Let’s look at two: C++ and Ada C++ union construct allows programmer to access same memory as different types with no checking Ada record variants contain a discriminant to determine which type is being used Can only access the type indicated by the discriminant Cannot change the discriminant unless you change the entire record – prevents inconsistency Checking can be turned off, however

    115. 115 Type-checking See union.cpp and variant.adb Pascal also has a discriminant, but does not require entire record to be assigned when discriminant is changed Suggests type-safety, but does not enforce it We’ll look more at these types of structures in Chapter 6
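
A sketch in the spirit of union.cpp (the actual file isn't in the transcript):

    #include <iostream>
    using namespace std;

    union Num {
        int i;
        float f;        // i and f share the same memory
    };

    int main()
    {
        Num n;
        n.f = 3.14f;
        // C++ performs no checking here: reading n.i yields the raw
        // bits of the float reinterpreted as an integer, not 3.
        cout << n.i << endl;
        return 0;
    }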

    116. 116 Type-checking So what does “compatible” mean? We said type-checking involves determining if operands and operators are compatible Different languages define it differently Name compatible (or name equivalent) The types have the same name Easy to check and enforce Simply record the type of each variable in the symbol table Compare when necessary

    117. 117 Type-checking Somewhat limiting for programmer Ex. in Pascal:
        A1: array[1..10] of integer;
        A2: array[1..10] of integer;
        A2 := A1; { not allowed }
    Assignment is not legal even though they have the same structure Also cannot pass either variable as a parameter to a subprogram Variables above actually each have an ANONYMOUS TYPE – not name compatible with any other type Generally not a good idea to use

    118. 118 Type-checking Structurally compatible (equivalent) The types are compatible if they have the same structure Ex:
        A1: array[1..10] of integer;
        A2: array[1..10] of integer;
        A2 := A1; { this would be allowed }
    Since both have same size and base type, it works Much more flexible than name compatible, but also much more difficult for compiler and not as clear to programmer Compiler must compare structures of every variable involved in the expression It is not obvious what is and is not compatible

    119. 119 Type-checking
        record                            record
            X: float;                         A: array[1..10] of int;
            A: array[1..10] of int;           X: float;
        end record                        end record
    Can compiler tell that components are reversed? Could it “match” the types with more complex records?
        A1: array[1..10] of float;
        A2: array[3..12] of float;
    Structure is the same; index values are changed

    120. 120 Type Checking So how about some languages? Ada: name equivalence but allows subtypes to be compatible with parent types Pascal: almost name equivalence, but considers variables in the same declaration to also be of the same type (A1, A2: array[1..10] of integer; – not compatible in Ada), and allows one type name to be set equal to another (type newint = integer;) C++: name equivalence, but a lot of coercion is done – we will look at coercion later See types.p, types.adb, types.cpp

    121. 121 Binding Ok, back to binding Scope (visibility) Static scope: determined at compile-time Dynamic scope: determined at run-time Implications: Most languages have the notion of “local” variables (i.e. within a block or a subprogram) If these are the only variables we use, scope is not important

    122. 122 Binding Scope becomes significant when dealing with non-local variables How and where are these accessible? Most modern languages use static scope If variable is not locally declared, proceed out to the “textual parent” (static parent) of the block/subprogram until the declaration is found Fairly clear to programmer – can look at code to see scope We’ll discuss implementation later

    123. 123 Binding Examples: Pascal Subprograms are only scope blocks, but can be nested to arbitrary depth All declarations must be at the beginning of a subprogram Somewhat restrictive, although not the fault of static scope Ada Subprograms can be nested Also allow declare blocks to nest scope within the same subprogram Useful for variable length arrays All declarations must be at the beginning of a subprogram or declare block See arraytest.adb

    124. 124 Binding C++ Subprograms CANNOT be nested New declaration blocks can be made with {} Declarations can be made anywhere within a block, and last until the end of the block Interesting note: What about the scope of loop control variables in for loops? Ada always implicitly declares LCVs and scope (and lifetime) is LOCAL to the loop body C++ and Java for is more general and does not require a locally declared LCV, but if included, is also LOCAL to the loop body Wasn’t always the case in C++

    125. 125 Binding Dynamic scope Non-local variables are accessed via calls on the run-time stack (going from top to bottom until declaration is found) A non-local reference could be declared in different blocks in different runs Used in APL, SNOBOL4 and through local variables in Perl Flexible but very tricky Difficult for programmer to anticipate different definitions of non-local variables

    126. 126 Binding Type-checking must be dynamic, since types of non-locals are not known until run-time More time-consuming See scope.pl

    127. 127 Scope, Lifetime and Referencing Environments Concepts of Scope and Lifetime are not comparable Lifetime deals with existence and association of memory WHEN Scope deals with visibility of variable names WHERE However, sometimes they seem to be “the same”

    128. 128 Scope, Lifetime and Referencing Environments Ex. stack-dynamic variables in some places Ex. global variables in some situations
        #include <iostream>
        using namespace std;
        int i = 10;
        int main()
        {
            {
                int i = 11, j = 20;
                {
                    int i = 12, k = 30;
                    cout << i << " " << j << " " << k << endl;
                }
                cout << i << " " << j << endl;
            }
            cout << i << endl;
        }

    129. 129 Scope, Lifetime and Referencing Environments More often they are not “the same” Lifetime of stack-dynamic variables continues when a subsequent function is called, whereas scope does not include the body of the subsequent function Lifetime of heap-dynamic variables continues even if they are not accessible at all

    130. 130 Scope, Lifetime and Referencing Environments Referencing Environment Given a statement in a program, what variables are visible there? Clearly this depends on the type of scoping used Once we know the scope rules, we can always figure this out From code if static scoping is used From call chains (at run-time) if dynamic scoping is used We will look more at non-local variable access when we discuss subprograms

    131. 131 Primitive Data Types Most languages provide these Numeric types Integer Usually 2’s complement Exact representation of the number In some languages (ex. C++) size is not specified Depends on hardware In others (ex. Java) size is specified Regardless of hardware

    132. 132 Primitive Data Types Float Usually exponent/mantissa form, each coded in binary Often an approximation of the number Only a limited number of bits for mantissa Decimal point “floats” Many numbers cannot be represented exactly Irrational numbers (ex. PI, e, 2^(1/2)) Infinite repeating decimals (ex. 1/3) Some terminating decimals as well (ex. 0.1) Instead we use digits of precision In most new computers, this is defined by IEEE Standard 754 See p. 255 in text See rounding.cpp

    133. 133 Primitive Data Types Fixed point Called “Decimal” in text Store rational numbers with a fixed number of decimal places Useful if we need decimal point to stay put Ex. Working with money Can be stored as Binary Coded Decimal – each digit encoded in binary Similar to strings, but we can put 2 digits into one byte, since each digit needs only 4 bits However, the 10 digit values do not actually use all 16 patterns that 4 bits provide So memory is somewhat wasted in this representation

    134. 134 Primitive Data Types Can also be stored as integers, with an extra attribute in the descriptor Scale factor indicates how many places over to locate decimal point Ex: X = 102.53 and Y = 32.65 Stored as 10253 and 3265 (in binary) with scale factor of 2 Z = X + Y Add the integers and keep the same scale factor (involves normalization if the number of decimal places are not the same – think about this) = 13518 with scale factor 2 = 135.18 Z = X * Y Multiply the integers and add the scale factors, then normalize to the correct number of digits = 33476045 with scale factor 4 = 3347.6045 normalized to 2 digits = 3347.60
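
A hedged sketch of this scheme, storing each value as an integer plus a scale factor:

    #include <iostream>
    using namespace std;

    struct Fixed {
        long value;     // the digits, e.g. 10253
        int  scale;     // decimal places, e.g. 2 means 102.53
    };

    // Assumes both operands already share a scale factor; otherwise
    // one would be normalized (shifted) to match first.
    Fixed add(Fixed x, Fixed y)
    {
        return { x.value + y.value, x.scale };   // keep the same scale
    }

    Fixed mul(Fixed x, Fixed y)
    {
        // Multiply the integers and add the scale factors; the caller
        // can then renormalize to the desired number of places.
        return { x.value * y.value, x.scale + y.scale };
    }

    int main()
    {
        Fixed X = {10253, 2}, Y = {3265, 2};   // 102.53 and 32.65
        Fixed Z1 = add(X, Y);   // {13518, 2}    -> 135.18
        Fixed Z2 = mul(X, Y);   // {33476045, 4} -> 3347.6045
        cout << Z1.value << " scale " << Z1.scale << endl;
        cout << Z2.value << " scale " << Z2.scale << endl;
        return 0;
    }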

    135. 135 Primitive Data Types But clearly limited in size by integer size In Java the BigDecimal class (not primitive) stores them as (very long) integers with a 32 bit scale factor – see BigD.java Very Long Numbers Typically NOT primitive types Implemented using arrays via a predefined class (ex. BigInteger and BigDecimal in Java) or via an add-on library (NTL library for C++) Boolean Used in most languages (except C) Values are true or false

    136. 136 Primitive Data Types Though stored as integer values (0 or 1), boolean values are typically not compatible with integer values Exception is C++, where all non-zero values are true, 0 is false This adds flexibility, but can cause logic errors Recurring theme! Remember assign.cpp

    137. 137 Primitive Data Types Character In early computers, character codes were not standardized Caused trouble for transferring text files Now most computers use ASCII character set (or extended ASCII set) But ASCII is not adequate for international character sets Unicode is used in Java, perhaps in other languages soon 16 bit code

    138. 138 Primitive Data Types ASCII is also not the same for Unix platforms and Windows platforms Unix uses only a <LF> while Windows uses a <CR><LF> combination Can cause problems for programs not equipped to handle the difference We can easily convert, however FTP in ASCII mode Simple script to convert
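
The "simple script" isn't shown in the notes; a hedged C++ equivalent that drops each <CR> so that <CR><LF> endings become plain <LF>:

    #include <fstream>
    using namespace std;

    int main(int argc, char* argv[])
    {
        if (argc < 3) return 1;               // usage: prog infile outfile
        ifstream in(argv[1], ios::binary);    // binary: no translation
        ofstream out(argv[2], ios::binary);
        char c;
        while (in.get(c))
            if (c != '\r')                    // skip carriage returns
                out.put(c);
        return 0;
    }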

    139. 139 More Data Types Strings Not a primitive type in many languages C, C++, Pascal, Ada Primitive type in others BASIC, FORTRAN, Perl Issues to consider: Should a string be considered a single entity, or a collection of characters? Should the length of a string be fixed or variable? Which operations can be used?

    140. 140 More Data Types Single entity vs. collection of characters More an issue of access than of structure In languages with no primitive string type, a string is generally perceived as a collection of characters Ex: In C and C++ a string is an array of char If we want we can treat it like any other array If we use operations in string.h (or <cstring>) we can treat it like a string C++ also has string class Ex: In Ada String is a predefined unconstrained array of characters String variables must be constrained Some simple operations can be used (ex. assignment, comparison, concatenation)

    141. 141 More Data Types In Perl, string is a primitive type Strings are accessed as single entities We can use functions to get character index values, but strings are stored as single, primitive variable values Many operations can be used (later) In Java, we have two string types Neither is really primitive, since they are classes built using an underlying array of char String Immutable objects (cannot be changed once created) Allows strings to be stored and accessed more efficiently, since multiple objects with the same value can be replaced with a single object (at compile-time – usually done for literals)

    142. 142 More Data Types StringBuffer Objects that can be modified after creation More efficient if programmer is altering string values, since new objects do not have to be created for each operation String length 3 variations Static (fixed) length – length of string is set when the object is created and cannot change Limited dynamic length – length of string can vary up to a preset maximum Typically the dynamic aspect is logical – physically the string size is preset, but some of the space may not be used Dynamic length – length of string can vary without bound

    143. 143 More Data Types Used in Pascal, Ada String variable is declared to be of a fixed size, with all locations being part of the string It is up to programmer to either pad string at end or somehow ignore unused locations However, Ada improves on Pascal strings by making the String type an unconstrained array String objects must have a fixed length, but they are all of the same type (ex: for params) See astrings.adb and pstrings.p Used in C, C++ (c strings) String variable is of a fixed maximum size, but any number of characters up to the maximum length may be used at any time In C and C++ a special sentinel character (\0) is used to indicate the end of the logical string Implementation is VERY dangerous (what else is new!) See cstrings.cpp

    144. 144 More Data Types Used in Perl, Java StringBuffer, C++ string class (among others) Physical length of the string object is automatically adjusted by the system to hold the logical string at its current length Memory is reallocated if necessary Only limit on string size is amount of memory available Realize that this is not free: each time the string has to be resized the following occurs: New memory of the new size is allocated Old data is copied to the new memory Variable is rebound to the new memory Old memory is discarded It is smart for the system to double memory each time so resizing is infrequent We’ll discuss allocation later See Strings.java
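The growth can be observed with the C++ string class (a sketch; the exact growth policy is implementation-dependent, though many implementations roughly double the capacity each time):

    #include <iostream>
    #include <string>
    using namespace std;

    int main() {
        string s;
        size_t last = s.capacity();
        for (int i = 0; i < 1000; i++) {
            s += 'x';                      // may force a reallocation
            if (s.capacity() != last) {    // report each resize
                last = s.capacity();
                cout << "length " << s.length()
                     << " -> capacity " << last << endl;
            }
        }
        return 0;
    }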

    145. 145 More Data Types Operations What can we do with our strings? These are affected by how the previous issue (length) is resolved If the length must remain constant, we cannot have any operations that would change the length Some possibilities: Slicing (give a substring) Search for a string Index Regular expression matching
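A few of these operations, using the C++ string class (regular-expression matching is not in the classic C++ library and would require an additional library):

    #include <iostream>
    #include <string>
    using namespace std;

    int main() {
        string s = "structure of programming languages";
        cout << s.substr(13, 11) << endl;  // slicing: "programming"
        cout << s.find("gram") << endl;    // searching: prints 16
        cout << s[0] << endl;              // indexing: 's'
        return 0;
    }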

    146. 146 More Data Types Enumeration Types Used primarily for readability Ex: January, February… vs. 0, 1, … Typically map identifiers to integer values Pascal Identifiers can be used only once I/O not predefined (user must write I/O operations for each new enum. type) Major negative Can be used in for loops and case statements

    147. 147 More Data Types Ada Ids can be overloaded as long as the correct definition can be determined through context
    type colors is (red, blue, yellow, purple, green);
    type signals is (red, yellow, green);
    Now the literal green (or yellow or red) is ambiguous in and of itself We must qualify it in the code to the desired type
    for C in signals'red..signals'green
    I/O can be done through a generic I/O package – very helpful Allows values to be output without having to write new functions each time

    148. 148 More Data Types C++ Enum values are converted to ints by the compiler Lose the type checking that Ada and Pascal provide Java – added Enum types in 1.5 All are a subclass of Enum so are objects Subrange types Improve readability (new name id) and logic-error checking (restricting ranges) Provided in Pascal and Ada
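A short C++ sketch of what is lost (the type names here are hypothetical; note that, unlike Ada, the same literal cannot even appear in two C++ enums):

    #include <iostream>
    using namespace std;

    enum colors  { red, blue, yellow, purple, green };
    enum signals { sig_red, sig_yellow, sig_green };

    int main() {
        colors c = green;              // green is 4
        int i = c + 1;                 // legal: c converts to int; i = 5
        bool same = (c == sig_green);  // legal across enum types -- the
                                       // checking Ada/Pascal give is lost
        cout << i << " " << same << endl;   // prints 5 0
        return 0;
    }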

    149. 149 More Data Types Arrays Traditionally a homogeneous collection of sequential locations Many issues to consider with arrays – a few are: How are they indexed? How/when is memory allocated? How are multidimensional arrays handled? What operations can be done on arrays?

    150. 150 More Data Types Indexing Mapping that takes an array ID and an index value and produces an element within the array Y = A[X]; Range of index values depends on two things: How many items are in the array? This depends on how allocation is allowed in a language We will discuss more with memory allocation Are any bounds implicit? C, C++, Java arrays start at 0 Ada, Pascal can start anywhere

    151. 151 More Data Types Whether range of indexes is static or dynamic, the actual subscripts must be evaluated at run-time Subscript could be a variable or expression whose value is not known until run-time If value is not within the range of the array, a range-error occurs VERY COMMON logic error made in programs Checked in Ada, Pascal and Java Not checked in FORTRAN, C, C++, much to the chagrin of programmers Q: Why aren’t range-errors checked in C, C++?

    152. 152 More Data Types Memory allocation Remember discussion of allocation during Binding lecture Basically those bindings hold, with some slight additions Static: size of array is fixed and allocation is done at compile-time Fixed stack-dynamic: size of array is fixed but allocation is done on run-time stack during execution Pascal arrays "Normal" C++ arrays

    153. 153 More Data Types Stack-dynamic: size of array is determined at run-time and array is allocated on run-time stack; however, once allocated, size is fixed Ada arrays See prev. Ada array example Heap-dynamic: Explicit: programmer allocates and deallocates arrays C++, Java, FORTRAN 90 dynamic arrays Implicit: array is resized automatically as needed by system Perl, Java ArrayList See examples on board

    154. 154 More Data Types Multiple Dimension Arrays 2-D Array: array of vectors – plane – Matrix 3-D Array: array of matrices or planes Higher dimension arrays are not so easily envisioned, but in most languages are legal Subscripts for each dimension can be different ranges, and, in fact, even different types Ex. Pascal: type matrix = array[1..10, 'A'..'Z'] of real; Indexing function is a bit more complicated, as we will see next

    155. 155 More Data Types Implementing Arrays (1-D) Two questions we need to answer: What information do we need at compile-time? What information do we need at run-time? The answers to these questions depend somewhat on other language properties If dynamic type-binding is used, most information must be kept at run-time Even with static type-binding, some information may be needed at run-time

    156. 156 More Data Types Let’s assume static type-binding, since most languages use this What do we need to know at compile-time? Element type Index type Above two needed for type checking Element size Index range (lower and upper bounds) Array base address Above items are needed to construct the indexing function for the array Let’s see how the indexing function is created Given an array, A, we want a function to return the Lvalue (address) of the ith location of A

    157. 157 More Data Types Assume: E = Element size LB = Lower index bound UB = Upper index bound S = start (base) address of array Lvalue(A[i]) = S + (i – LB) × E = (S – LB × E) + (i × E) Note that once S is known (array has been bound to memory), left part of the equation is constant So for each array access, only (i × E) must be calculated If i is a constant, this too can be precalculated (ex. A[5])
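As a worked example with assumed values: let S = 1000, LB = 0 and E = 4 (a 4-byte element). Then Lvalue(A[5]) = (1000 – 0 × 4) + (5 × 4) = 1020, and the constant part (S – LB × E) = 1000 is computed once, when the array is bound to memory.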

    158. 158 More Data Types What do we need to know at run-time? This depends on what kind of checking we are doing Ex: C/C++ need only the start address (S) of the array and the element size (E) LB is always 0 UB is not checked Ex: Ada, Pascal need more: LB and UB needed for bounds checking Run-time information for an array is typically stored in a run-time descriptor (sometimes called a dope-vector) This may be stored in the same memory block as the data itself (ex. at the beginning), or elsewhere in memory

    159. 159 More Data Types Implementing Arrays (multi-D) This is more complicated than 1-D, since computer memory is typically linear Thus multi-D arrays are still stored physically as 1-D arrays Language must allow them to be accessed in a multi-D way Ex. Two-D arrays In most languages stored in row-major order Line up rows end-to-end to produce the linear physical array

    160. 160 More Data Types Column-major order also exists – used in FORTRAN Index function for 2-D arrays is a bit more complicated than for 1-D, but is still not that hard to calculate lvalue(A[i,j]) = S + (i – LB1)×ROWLENGTH + (j – LB2)×E where ROWLENGTH = (UB2 – LB2 + 1)×E Example (worked below): given int A[10][5], calculate the address of A[4][3]
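Worked example (assuming 4-byte ints and C-style bounds, so LB1 = LB2 = 0 and UB2 = 4 for the 5 columns): ROWLENGTH = (4 – 0 + 1) × 4 = 20, so lvalue(A[4][3]) = S + (4 – 0) × 20 + (3 – 0) × 4 = S + 92.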

    161. 161 More Data Types Can we access arrays in any other way? In some languages (ex: C, C++) we can use pointers for array access In this case rather than calculating the index of the location as an offset from the base address, we move the pointer to the address of the location we want to access For sequential access of the items in an array, this is much more efficient than using traditional indexing Of course, it is also dangerous! See ptrarrays.cpp
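A minimal sketch of pointer-based traversal (in the spirit of ptrarrays.cpp, not its actual contents):

    #include <iostream>
    using namespace std;

    int main() {
        int a[5] = {10, 20, 30, 40, 50};
        // move a pointer through the array instead of indexing: each
        // step is a single pointer increment, not a multiply-and-add
        for (int *p = a; p < a + 5; p++)
            cout << *p << " ";
        cout << endl;
        // the danger: nothing stops the loop bound from being a + 6,
        // which would quietly read past the end of the array
        return 0;
    }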

    162. 162 More Data Types Array Operations What can be done to an array as a whole? Assignment Allowed in languages in which array is actually a data type (ex. Ada, Pascal, Java) Usually bitwise copy between arrays (shallow copy) C/C++ do not allow assignment since array variables are simply (constant) pointers Comparison Equal and not equal allowed in Ada With Java references are compared More complex operations are allowed in some languages (ex. APL)

    163. 163 More Data Types Associative Arrays Instead of being indexed on integers (or simple enumeration values), index on arbitrary keys (usually strings) Ex: Perl hash (also exists in Smalltalk and Java) and PHP arrays Java and other langs also have Hashtable classes Typically done following hashing procedures Hash function used to map key to an index, mod the table size Keys are dispersed in a random fashion, to reduce collisions
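C++ has no built-in associative array, but the idea can be approximated with the standard map class (note the substitution: std::map is tree-based rather than hashed, whereas a Perl hash or a Java Hashtable uses hashing underneath):

    #include <iostream>
    #include <map>
    #include <string>
    using namespace std;

    int main() {
        map<string, int> age;          // index on arbitrary string keys
        age["alice"] = 30;
        age["bob"] = 25;
        cout << age["alice"] << endl;  // prints 30
        return 0;
    }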

    164. 164 More Data Types Typically in these language-implemented hashes, the table size is not shown to the programmer Programmer does not need to know details in order to use it Usually it grows and shrinks dynamically as the table fills or becomes smaller Resizing is expensive – all items must be rehashed into new sized table Would like to not have to resize too often Doubling size with each expansion is a good idea

    165. 165 More Data Types Records Heterogeneous collections of data, accessed by component names Very useful in creating structured data types Forerunners of objects (do not have the operations, just the data) Access fairly uniform across languages with dot notation Fields accessed by name since they may have varying size
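A small hypothetical record in C++, with fields of varying types and sizes accessed by name:

    #include <iostream>
    using namespace std;

    struct Student {       // heterogeneous collection of data
        char   name[30];
        int    id;
        double gpa;
    };

    int main() {
        Student s = {"Ada Lovelace", 1621, 4.0};
        cout << s.name << " " << s.id << " " << s.gpa << endl; // dot notation
        return 0;
    }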

    166. 166 More Data Types Pointers Variables that store addresses of other variables In most high-level languages, their primary use is to allow access of dynamic memory Exception: C, where pointers are required for “reference” parameters and where they are also used heavily for array access We’ll talk about parameters later

    167. 167 More Data Types To access memory being pointed to, a pointer is dereferenced Usually explicit: C++ Ex: *P = 6.02; Implicit in some contexts: Ada Ex: rec_ptr.field

    168. 168 More Data Types Pointer Issues: Should pointers be typed like other variables? How do the scope and lifetime of the pointer variable relate (if at all) to the scope and lifetime of the heap-dynamic variable? What are the safety/access issues a programmer using pointers must be concerned with? How is heap-dynamic memory maintained? What are reference types and how are they similar/different to pointer types?

    169. 169 More Data Types Types Most languages with pointers require them to be typed This allows type-checking of heap-dynamic memory Scope/Lifetime/Safety/Access In most languages the scope and lifetime of heap-dynamic variables are distinct from those of the pointer variables If pointer variable is a stack-dynamic variable, its scope and lifetime are as we discussed previously Heap-dynamic variables are also as we discussed previously

    170. 170 More Data Types Thus we have the potential for 2 different problems: Lifetime of pointer variable accessing a heap-dynamic variable ends, but heap-dynamic variable still exists Now heap-dynamic variable is no longer accessible This problem can also occur by simply reassigning the pointer variable Now we have formed GARBAGE – heap-dynamic variable that can no longer be accessed Different languages handle garbage in different ways – we will discuss them shortly
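A two-line way to manufacture garbage in C++ (a hedged sketch; the leaked block simply remains allocated for the rest of the run):

    #include <iostream>
    using namespace std;

    int main() {
        float *p = new float(3.5);
        p = new float(6.0);   // the first float is still allocated, but
                              // no pointer reaches it any more: garbage
        cout << *p << endl;
        return 0;
    }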

    171. 171 More Data Types Heap-dynamic variable is in some way deallocated (returned to system), but pointer variable is still storing its address Now address stored in pointer is invalid Can cause problems if it is dereferenced, especially if the memory has been reallocated for some other use This is a dangling reference Can be catastrophic to a program, as most C++ programmers know To deal with these problems, we must discuss how heap-dynamic memory is maintained
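And the mirror-image problem, sketched in C++ (the fatal dereference is left commented out so the sketch stays safe to run):

    #include <iostream>
    using namespace std;

    int main() {
        float *p = new float(3.5);
        float *q = p;     // two pointers to one heap-dynamic variable
        delete p;         // the variable is returned to the system...
        // *q = 8.5;      // ...but q still holds its address: a dangling
                          // reference; dereferencing it is undefined
                          // behavior, worse if the memory was reallocated
        return 0;
    }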

    172. 172 Heap Maintenance There are two issues to consider for heap-dynamic memory maintenance How are blocks of memory stored and accessed? What is the structure of the heap itself? How is memory allocated and deallocated? What protocols are used?

    173. 173 Heap Maintenance Structure of the heap Simplest option is to make it a linked-list of identical cells Each cell is the same size and points to the next cell in the list (so it contains a pointer) Available cells are in a free list and memory is always allocated from that list Works in a language like LISP, but a language like C++ requires large dynamic arrays and objects of different sizes Also cells do not always have pointers

    174. 174 Heap Maintenance Another option is to allow the cell sizes within the heap to vary More complicated than single-size cells Free space must be maintained differently Initially entire heap is stored in a single block As memory requests are granted, chunks of the heap are partitioned off First Fit, Best Fit, Worst Fit As memory is returned and more is given out, the heap begins to fragment Eventually, fragments are too small to be of use Common problem in CS with many things (ex: disk) We must try to reunify small adjacent blocks into larger blocks

    175. 175 Heap Maintenance Heap Compaction Rearrangement of heap to create larger free memory blocks May become necessary after long use and much fragmentation Partial compaction No active blocks are moved Compaction occurs through combination of adjacent free blocks Fairly simple to do – each time a block is freed we can check for adjacent free blocks Full compaction Active blocks are moved to create more contiguous free space Complex and time-consuming – if a block is moved ALL references to it must be sought out and changed How?

    176. 176 Heap Maintenance Memory Allocation/Deallocation 2 basic strategies: Make the programmer responsible for allocation and deallocation Everything is explicit Let the system be responsible for deallocation Reference counts (eager) Garbage collection (lazy)

    177. 177 Heap Maintenance Programmer is responsible This is the approach taken by C, C++ and (more or less) Pascal C: malloc and free C++: new and delete Pascal: new and dispose Very flexible – programmer has complete control Very efficient – reasonable system overhead (unless full compaction is used)

    178. 178 Heap Maintenance Very dangerous – programmer can get into serious trouble Remember two problems we mentioned before – dangling references and garbage Avoidance of these problems is completely up to the programmer If programmer does NOT avoid these, serious problems can result Dangling references can lead to memory corruption and program crash Garbage creates a memory leak that eventually could consume all of the heap space So programmer must be very careful and test/debug the code extensively

    179. 179 Heap Maintenance System is responsible Depending on language and variables, programmer either explicitly or implicitly allocates dynamic memory Explicit: new in Java Implicit: append to StringBuffer in Java, many operations in Lisp But now the deallocation is either partially or fully implemented implicitly by the system Idea is to avoid the problems just shown: dangling references and garbage

    180. 180 Heap Maintenance Reference Counts For each block of memory allocated from the heap, a counter is kept of the number of references to that block Any assignment of a pointer to that address increments the counter Any change of assignment from that address, or any delete() decrements the counter If counter reaches 0, deallocate memory delete or dispose should also set pointer in question back to NULL Note now (for the most part) both dangling references and garbage are eliminated

    181. 181 Heap Maintenance
    p = new float(3.5);
    q = p;
    r = q;               // reference count to float = 3
    delete(q);           // rather than actually deleting the float, the
                         // count is decremented and q is set to NULL
    r = new float(8.5);
    p = new float(6.0);  // ref. count to orig. float now 0 and it is
                         // removed
    Looks great! So why doesn’t C++ use this?

    182. 182 Heap Maintenance Implementing Reference Counts Cost of reference counts is high Statements that were simple and cheap without reference counts become complex and expensive with them Ex: q = p; from previous slide In C++, copy the address With reference counts: Find q’s previous address and decrement its reference count (deleting if now equal to 0) Copy address from p to q Increment reference count of new address (this bookkeeping is sketched below) The counts themselves must also be stored Reference counts also do not (easily) work in all situations
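A minimal sketch of that bookkeeping (a hypothetical RefPtr type built purely for illustration, not how any real compiler implements it); note how much work a plain q = p has become:

    #include <iostream>
    using namespace std;

    struct RefFloat { float value; int count; };  // count stored with block

    struct RefPtr {
        RefFloat *block;
        RefPtr() : block(0) {}
        void bind(float v) {              // point at a fresh heap block
            release();
            block = new RefFloat;
            block->value = v;
            block->count = 1;
        }
        void assign(RefPtr &other) {      // what a simple "q = p" costs:
            release();                    // 1. decrement old target's count
            block = other.block;          // 2. copy the address
            if (block) block->count++;    // 3. increment new target's count
        }
        void release() {
            if (block && --block->count == 0)
                delete block;             // count hit 0: deallocate eagerly
            block = 0;                    // pointer set back to null
        }
    };

    int main() {                  // mirrors the scenario on slide 181
        RefPtr p, q, r;
        p.bind(3.5);
        q.assign(p);
        r.assign(q);              // count is now 3
        q.release();              // count 2; the float survives
        r.bind(8.5);              // count 1
        p.bind(6.0);              // count 0: original float is removed
        cout << p.block->value << endl;
        return 0;
    }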

    183. 183 Heap Maintenance Ex:
    p = new node;
    p->next = new node;
    p->next->next = p;
    delete p;
    Now we have a garbage loop Each dynamic node has a reference to it, but neither is accessible Clearly, circular references like this could be bigger Can be resolved, but not easily Reference counts are not used much in typical high-level languages, but are used in some parallel-processing applications Cost of using reference counts is spread out

    184. 184 Heap Maintenance Garbage Collection Different philosophy than reference counts Once allocated, memory remains allocated indefinitely, whether accessible or not Dangling references not possible Garbage is allowed to accumulate Eventually, if a program runs long enough and consumes enough memory, the program will run out With new computers and virtual memory this may not always occur, especially in programs with only a limited number of objects But if it does occur it is a problem – there is no more memory available to allocate

    185. 185 Heap Maintenance When memory runs out, garbage must be collected so that it can be reused How is this done? Typical technique is “mark and sweep” Mark – each item allocated has a garbage bit. This bit is initialized to 1 for all items, indicating that they are ALL garbage. Then all “active” items are found, and marked as non-garbage by setting the bit back to 0 Sweep – unmarked items (bits still 1) are returned to the free list Again we have eliminated both problems Dangling references don’t exist, since a reference would cause the item to be marked as active Garbage accumulates but is reclaimed when needed
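A schematic sketch of the idea over a toy heap of fixed-size cells (hypothetical structure; real collectors are far more involved -- note that marking stops at cells already marked, so even circular chains terminate):

    #include <cstddef>

    struct Cell {
        Cell *next;        // the one pointer field a cell may hold
        bool  garbage;     // the garbage bit
    };

    const int HEAP_SIZE = 100;
    Cell  heap[HEAP_SIZE];          // the toy heap
    Cell *roots[10];                // pointers from outside the heap
    int   numRoots = 0;

    void mark(Cell *c) {            // trace a chain of active cells
        while (c != NULL && c->garbage) {
            c->garbage = false;     // reachable, so not garbage after all
            c = c->next;
        }
    }

    void collect(Cell *&freeList) {
        for (int i = 0; i < HEAP_SIZE; i++)  // mark, part 1: assume
            heap[i].garbage = true;          // everything is garbage
        for (int i = 0; i < numRoots; i++)   // mark, part 2: trace from
            mark(roots[i]);                  // every outside pointer
        for (int i = 0; i < HEAP_SIZE; i++)  // sweep: unmarked cells go
            if (heap[i].garbage) {           // back on the free list
                heap[i].next = freeList;
                freeList = &heap[i];
            }
    }

    int main() {                    // a minimal exercise of the collector
        Cell *freeList = NULL;
        roots[numRoots++] = &heap[0];
        heap[0].next = &heap[1];    // heap[0] and heap[1] are active
        heap[1].next = NULL;
        collect(freeList);          // heap[2..99] end up on the free list
        return 0;
    }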

    186. 186 Heap Maintenance Implementing Garbage Collection Which operation seems more difficult, mark or sweep? It is not at all obvious or simple to identify active memory We want to make sure we don’t miss anything, or else we can cause serious problems On the other hand, we don’t want to leave garbage behind either So how can we reliably detect active memory? As we saw with reference counts, it is not enough to have a reference to it

    187. 187 Heap Maintenance We can do this in a systematic (time-consuming) way, assuming three conditions are true: Any active element must be reachable through a chain of pointers originating outside of heap Every pointer outside the heap that points into the heap can be identified Pointer fields within active elements that point to other heap elements must be able to be identified Basic idea: Start with pointers outside the heap, trace them to active heap elements within the heap, and trace those elements to others, etc.

    188. 188 Heap Maintenance The necessary conditions are not always met Ex. Met in LISP, Java, but not C or C++ C and C++ allow too much flexibility with pointer arithmetic Ex: pointer P may not hold a valid address of an element in the heap, but maybe P+offset does:
    p = new node;
    p += 4;
    If the garbage collector cannot “find” this node amongst the active nodes it will be deleted, and p-4 will be a dangling reference Note: Time required for garbage collection is inversely proportional to amount of garbage that is reclaimed Discuss

    189. 189 Heap Maintenance Ada memory management Remember a goal of Ada is to make the language as reliable as possible How to allow pointers but maintain reliability? Ada prevents dangling references by having no explicit delete in the language Scope of dynamic memory is not longer than that of pointers that point to it Prevents some garbage, but not all Allows for garbage collection, but does not require it

    190. 190 More Data Types Reference Types We can think of a reference type as a pointer that is implicitly dereferenced, with restrictions on its use No explicit dereferencing No arithmetic In C++ they are also constant (cannot change the address) Originally used for parameters and some return values in C++ functions void foo(float &param)
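A small C++ illustration of such a reference parameter (the function name is hypothetical):

    #include <iostream>
    using namespace std;

    void foo(float &param) {   // implicitly dereferenced inside
        param = 6.02;          // no explicit * as a pointer would need
    }

    int main() {
        float x = 0.0;
        foo(x);                // and no explicit & at the call site
        cout << x << endl;     // prints 6.02
        return 0;
    }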

    191. 191 More Data Types In Java they are used for all non-primitive values Have the benefits of allowing access to dynamic memory in the heap Since no arithmetic is allowed, garbage collection can be used and no dangling references occur But we still have to be careful, since they are addresses Shallow vs. deep copy Comparisons
