1. Course Notes for CS1621 Structure of Programming Languages, Part A. By John C. Ramirez, Department of Computer Science, University of Pittsburgh
2. 2 These notes are intended for use by students in CS1621 at the University of Pittsburgh and no one else
These notes are provided free of charge and may not be sold in any shape or form
Material from these notes is obtained from various sources, including, but not limited to, the textbooks:
Concepts of Programming Languages, Seventh Edition, by Robert W. Sebesta (Addison Wesley)
Programming Languages, Design and Implementation, Fourth Edition, by Terrence W. Pratt and Marvin V. Zelkowitz (Prentice Hall)
Compilers: Principles, Techniques, and Tools, by Aho, Sethi and Ullman (Addison Wesley)
3. 3 Goals of Course To survey the various programming languages, their purposes and their histories
Why do we have so many languages?
How did these languages develop?
Are some languages “better” than others for some things?
To examine methods for describing language syntax and semantics
4. 4 Goals of Course Syntax indicates structure of program code
How can language designer specify this?
How can programmer learn language?
How can compiler recognize this?
Lexical analysis
Parsing (syntax analysis)
Brief discussion of parsing techniques
Semantics indicate meaning of the code
What code will actually do
Can we effectively do this in a formal way?
Static semantics
Dynamic semantics
5. 5 Goals of Course To examine some language features and constructs and how they are used and implemented in various languages
Variables and constants
Types, binding and type checking
Scope and lifetime
Data Types
Primitive types
Array types
Structured data types
6. 6 Goals of Course Pointer (reference) types
Assignment statements and expressions
Operators, precedence and associativity
Type coercions and conversions
Boolean expressions and short-circuit evaluation
Control statements
Selection
Iteration
Unconditional branching – goto
7. 7 Goals of Course Process abstraction: procedures and functions
Parameters and parameter-passing
Generic subprograms
Nonlocal environments and side-effects
Implementing subprograms
Subprograms in static-scoped languages
Subprograms in dynamic-scoped languages
8. 8 Goals of course Data abstraction and abstract data types
Object-oriented programming
Design issues
Implementations in various object-oriented languages
Concurrency
Concurrency issues
Subprogram level concurrency
Implementations in various languages
Statement level concurrency
9. 9 Goals of Course Exception handling
Issues and implementations
IF TIME PERMITS
Functional programming languages
Logic programming languages
10. 10 Language Development Issues Why do we have high-level programming languages?
Machine code is too difficult for us to read, understand and debug
Machine code is not portable between architectures
Why do we have many high-level programming languages?
Different people and companies developed them
11. 11 Language Development Issues Different languages are either designed to or happen to meet different programming needs
Scientific applications
FORTRAN
Business applications
COBOL
AI
LISP, Scheme (& Prolog)
Systems programming
C
Web programming
Perl, PHP, Javascript
“General” purpose
C++, Ada, Java
12. 12 Language Development Issues Programming language qualities and evaluation criteria
Readability
How much can non-author understand logic of code just by reading it?
Is code clear and unambiguous to reader?
These are often subjective, but sometimes it is fairly obvious
Examples of features that help readability:
Comments
Long identifier names
Named constants
What are some issues that are important when designing, choosing or evaluating programming languages?
Readability example:
C++:
if (score <= 100)
    if (score >= 90)
        cout << "A" << endl;
else
    cout << "A+";
Ada:
if (score <= 100) then
    if (score >= 90) then
        put ("A");
    end if;
else
    put ("A+");
end if;
Note the point: in C++ the else silently pairs with the nearest if (so "A+" prints for scores below 90), while Ada's explicit end if ties the else to the outer if
13. 13 Language Development Issues Clearly understood control statements
Language orthogonality
Simple features combine in a consistent way
But it can go too far, as explained in the text about Algol 68
Writability
Not dissimilar to readability
How easy is it for programmer to use language effectively?
Can depend on the domain in which it is being used
Ex: LISP is very writable for AI applications but would not be so good for systems programming
Also somewhat subjective
Orthogonality example:
Class declaration in Java allows programmer to create new type
Array declaration in Java can be of any Java type
So:
Now programmer can make arrays of primitive types, arrays of objects, arrays of arrays, etc.
14. 14 Language Development Issues Examples of features that help writability:
Clearly understood control statements
Subprograms
Also orthogonality
Reliability
Two different ideas of reliability
Programs are less susceptible to logic errors
Ex: Assignment vs. comparison in C++
See assign.cpp and assign.java
Programs have the ability to recover from exceptional situations
Exception handling – we will discuss more later
Ex: C error with = meant as comparison
Ex: Compare to Java
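A minimal sketch of the kind of error assign.cpp illustrates (the actual file is not reproduced here; this is an assumed illustration). The condition compiles in C++ but performs an assignment, not a comparison:

#include <iostream>

int main() {
    int score = 0;
    // Intended: score == 100. Actual: assigns 100 to score; the assignment
    // expression evaluates to 100, which is nonzero and therefore "true",
    // so the branch is always taken. Java rejects this outright because
    // an if condition must be boolean.
    if (score = 100)
        std::cout << "perfect score" << std::endl;
    return 0;
}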
15. 15 Language Design Issues Many factors influence language design
Architecture
Most languages were designed for single processor “von Neumann” type computers
CPU to execute instructions
Data and instructions stored in main memory
General language approach
Imperative languages
Fit well with von Neumann computers
Focus is on variables, assignment, selection and iteration
Examples: FORTRAN, Pascal, C, Ada, C++, Java
Draw picture of von Neumann machine on board
Draw a few statements (assignment, etc.)
16. 16 Language Design Issues Imperative language evolution
Simple “straight-line” code
Top-down design and process abstraction
Data abstraction and ADTs
Object-oriented programming
Some consider object-oriented languages not to be imperative, but most modern oo languages have imperative roots (ex. C++, Java)
Functional languages
Focus is on function and procedure calls
Mimics mathematical functions
Less emphasis on variables and assignment
In strictest form has no iteration at all – recursion is used instead
Examples: LISP, Scheme
See example: a short LISP program segment
(equal '(a b) (car '((a b) c d e)))
This evaluates to T: (car '((a b) c d e)) returns the first element of the list, (a b), which is equal to '(a b)
17. 17 Language Design Issues Logic programming languages
Symbolic logic used to express propositions, rules and inferences
Programs are in a sense theorems
User enters a proposition and system uses programmer’s rules and propositions in an attempt to “prove” it
Typical outputs:
Yes: Proposition can be established by program
No: Proposition cannot be established by program
Example: Prolog – see example program
Cost
What are the overall costs associated with a given language?
How does the design affect that cost?
Give simple prolog example
18. 18 Language Design Issues Training programmers
How easy is it to learn?
Writing programs
Is language a good fit for the task?
Compiling programs
How long does it take to compile programs?
This is not as important now as it once was
Executing programs
How long does program take to run?
Often there is a trade-off here
Ex: Java is slower than C++ but it has many run-time features (array bounds checking, security manager) that are lacking in C++
19. 19 Language Implementation Issues How is HLL code processed and executed on the computer?
Compilation
Source code is converted by the compiler into binary code that is directly executable by the computer
Compilation process can be broken into 4 separate steps:
Lexical Analysis
Breaks up code into lexical units, or tokens
Examples of tokens: reserved words, identifiers, punctuation
Feeds the tokens into the syntax analyzer
20. 20 Language Implementation Issues Syntax Analysis
Tokens are parsed and examined for correct syntactic structure, based on the rules for the language
Programmer syntax errors are detected in this phase
Semantic Analysis/Intermediate Code Generation
Declaration and type errors are checked here
Intermediate code generated is similar to assembly code
Optimizations can be done here as well, for example:
Unnecessary statements eliminated
Statements moved out of loops if possible
Recursion removed if possible
21. 21 Language Implementation Issues Code Generation
Intermediate code is converted into executable code
Code is also linked with libraries if necessary
Note that steps 1) and 2) are independent of the architecture – depend only upon the language – Front End
Step 3) is somewhat dependent upon the architecture, since, for example, optimizations will depend upon the machine used
Step 4) is clearly dependent upon the architecture – Back End
22. 22 Language Implementation Issues Interpreting
Program is executed in software, by an interpreter
Source level instructions are executed by a “virtual machine”
Allows for robust run-time error checking and debugging
Penalty is speed of execution
Example: Some LISP implementations, Unix shell scripts and Web server scripts
23. 23 Language Implementation Issues Hybrid
First 3 phases of compilation are done, and intermediate code is generated
Intermediate code is interpreted
Faster than pure interpretation, since the intermediate code is simpler and easier to interpret than the source code
Still much slower than compilation
Examples: Java and Perl
However, now Java uses JIT Compilation also
Method code is compiled as it is called so if it is called again it will be faster
24. 24 Brief, Incomplete PL History Early 50’s
Early HLL’s started to emerge
FORTRAN
Stands for FORmula TRANslating system
Developed by a team led by John Backus at IBM for the IBM 704 machine
Successful in part because of support by IBM
Designed for scientific applications
The root of the imperative language tree
25. 25 Brief PL History Lacked many features that we new take for granted in programming languages:
Conditional loops
Statement blocks
Recursive abilities
Many of these features were added in future versions of FORTRAN
FORTRAN II, FORTRAN IV, FORTRAN 77, FORTRAN 90
Had some interesting features that are now obsolete
COMMON, EQUIVALENCE, GOTO
We may discuss what these are later
26. 26 Brief PL History Late 50’s
COBOL
COmmon Business Oriented Language
Developed by US DoD
Separated data and procedure divisions
But didn’t allow functions or parameters
Still widely used, due in part to the large cost of rewriting software from scratch
Big companies would rather maintain COBOL programs than rewrite them in a different language
27. 27 Brief PL History LISP
LISt Processing
Developed by John McCarthy of MIT
Functional language
Good for symbolic manipulation, list processing
Had recursion and conditional expressions
Not in original FORTRAN
At one time used extensively for AI
Today most widely used version, COMMON LISP, has included some imperative features
28. 28 Brief PL History ALGOL
ALGOL 58 and then ALGOL 60 both designed by international committee
Goals for the language:
Syntax should be similar to mathematical notation and readable
Should be usable for algorithms in publications
Should be compilable into machine code
Included some interesting features
Pass by value and pass by name (wacky!) parameters
Recursion (first in an imperative language)
Dynamic arrays
Block structure and local variables
29. 29 Brief PL History Introduced Backus-Naur Form (BNF) as a way to describe the language syntax
Still commonly used today, but not well-accepted at the time
Never widely used, but influenced virtually all imperative languages after it
30. 30 Brief PL History Late 60’s
Simula 67
Designed for simulation applications
Introduced some interesting features
Classes for data abstraction
Coroutines for re-entrant subprograms
ALGOL 68
Emphasized orthogonality and user-defined data types
Not widely used
31. 31 Brief PL History 70’s
Pascal
Developed by Niklaus Wirth
No major innovations, but due to its simplicity and emphasis of good programming style, became widely used for teaching
C
Developed by Dennis Ritchie to help implement the Unix operating system
Has a great deal of flexibility, esp. with types
Incomplete type checking
32. 32 Brief PL History Void pointers
Coerces many types
Many programmers (esp. systems programmers) love it
Language purists hate it
Easy to miss logic errors
Prolog
Logic programming
We discussed it a bit already
Still used somewhat, mostly in AI
May discuss in more detail later
33. 33 Brief PL History 80’s
Ada
Developed over a number of years by DoD
Goal was to have one language for all DoD applications
Especially for embedded systems
Contains some important features
Data encapsulation with packages
Generic packages and subprograms
Exception handling
Tasks for concurrent execution
We will discuss some of these later
34. 34 Brief PL History Very large language – difficult to program reliably, even though reliability was one of its goals!
Early compilers were slow and error-prone
Did not have the widespread general use that was hoped
Eventually the government stopped requiring it for DoD applications
Use faded after this
Not used widely anymore
Ada 95 added object-oriented features
Still wasn't used much, especially with the advent of Java and other OO languages
35. 35 Brief PL History Smalltalk
Designed and developed by Alan Kay
Concepts developed in 60’s, but language did not come to fruition until 1980
Designed to be used on a desktop computer 15 years before desktop computers existed
First true object-oriented language
Language syntax is geared toward objects
“messages” passed between objects
“methods” are invoked as responses to messages
Always dynamically bound
All classes are subclasses of Object
Also included software devel. environment
Had large impact on future OOLs, esp. Java
36. 36 Brief PL History C++
Developed largely by Bjarne Stroustrup as an extension to C
Backward compatible
Added object-oriented features and some additional typing features to improve C
Very powerful and very flexible language
But still has reliability problems
Ex. no array bounds checking
Ex. dynamic memory allocation
Widely used and likely to be used for a while longer
37. 37 Brief PL History Perl
Developed by Larry Wall
Takes features from C as well as scripting languages awk, sed and sh
Some features:
Regular expression handling
Associative arrays
Implicit data typing
Originally used for data extraction and report generation
Evolved into the archetypal Web scripting language
Has many proponents and detractors
38. 38 Brief PL History 90’s
Java
Interestingly enough, just like Ada, Java was originally developed to be used in embedded systems
Developed at Sun by a team headed by James Gosling
Syntax borrows heavily from C++
But many features (flaws?) of C++ have been eliminated
No explicit pointers or pointer arithmetic
Array bounds checking
Garbage collection to reclaim dynamic memory
39. 39 Brief PL History Object model of Java actually more resembles that of Smalltalk rather than that of C++
All variables are references
Class hierarchy begins with Object
Dynamic binding of method names to operations by default
But not as pure in its OO features as Smalltalk, due to its imperative control structures
Interpreted for portability and security
Also JIT compilation now
Growing in popularity, largely due to its use on Web pages
40. 40 Brief PL History 00's (aughts? oughts? naughts?)
See http://www.randomhouse.com/wotd/index.pperl?date=19990803
C#
Main roots in C++ and Java with some other influences as well
Used with the MS .NET programming environment
Some improvements and some deprovements compared to Java
Likely to succeed given MS support
41. 41 Program Syntax Recall job of syntax analyzer:
Groups (parses) tokens (fed in from lexical analyzer) into meaningful phrases
Determines if syntactic structure of token stream is legal based on rules of the language
Let’s look at this in more detail
How does compiler “know” what is legal and what is not?
How does it detect errors?
42. 42 Program Syntax To answer these questions we must look at programming language syntax in a more formal way
Language:
Set of strings of lexemes from some alphabet
Lexemes are the lowest level syntactic elements
Lexemes are made up of characters, as defined by the character set for the language
43. 43 Program Syntax Lexemes are categorized into different tokens and processed by the lexical analyzer
Ex:
Lexemes: if, (, width, <, height, ), {, cout, <<, width, <<, endl, ;, }
Tokens: iftok, lpar, idtok, lt, idtok, rpar, lbrace, idtok, llt, idtok, llt, idtok, semi, rbrace
Note that some tokens correspond to single lexemes (ex. iftok) whereas some correspond to many (ex. idtok)
44. 44 Program Syntax How do we formally define a language?
Assume we have a language, L, defined over an alphabet, Σ.
2 related techniques:
Recognition
An algorithm or mechanism, R, will process any given string, S, of lexemes and correctly determine if S is within L or not
Not used for enumeration of all strings in L
Used by parser portion of compiler
45. 45 Program Syntax Generation
Produces valid “sentences” of L
Not as useful as recognition for compilation, since the valid “sentences” could be arbitrary
More useful in understanding language syntax, since it shows how the sentences are formed
Recognizer only says if sentence is valid or not – more of a trial and error technique
46. 46 Program Syntax So recognizers are what compilers need, but generators are what programmers need to understand language
Luckily there are systematic ways to create recognizers from generators
Thus the programmer “reads” the generator to understand the language, and a recognizer is created from the generator for the compiler
47. 47 Language Generators Grammar
A mechanism (or set of rules) by which a language is generated
Defined by the following:
A set of non-terminal symbols, N
Do not actually appear in strings
A set of terminal symbols, T
Appear in strings
A set of productions, P
Rules used in string generation
A starting symbol, S
48. 48 Language Generators Noam Chomsky described four classes of grammars (used to generate four classes of languages) – Chomsky Hierarchy
Unrestricted
Context-sensitive
Context-free
Regular
More info on unrestricted and context-sensitive grammars in a theory course
The last two will be useful to us
49. 49 Language Generators Regular Grammars
Productions must be of the form:
<non> → <ter><non> | <ter>
where <non> is a nonterminal, <ter> is a terminal, and | separates alternatives
Can be modeled by a Finite-State Automaton (FSA)
Also equivalent to Regular Expressions
Provide a model for building lexical analyzers
50. 50 Language Generators Have following properties (among others)
Can generate strings of the form αⁿ, where α is a finite sequence and n is an integer
Pattern recognition
Can count to a finite number
Ex. { aⁿ | n = 85 }
But we need at least 86 states to do this
Cannot count to arbitrary number
Note that { aⁿ } for any n (i.e. 0 or more occurrences) is easy – do not have to count
Important to realize that the number of states is finite: cannot recognize patterns with an arbitrary number of possibilities
51. 51 Language Generators Example: Regular grammar to recognize Pascal identifiers (assume no caps)
N = {Id, X} T = {a..z, 0..9} S = Id
P =
Id → aX | bX | … | zX | a | b | … | z
X → aX | … | zX | 0X | … | 9X | a | … | z | 0 | … | 9
Consider equiv. FSA:
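A minimal C++ sketch of that equivalent FSA (an assumed illustration, not part of the original notes): state 0 is the start state (nonterminal Id) and state 1 is the accepting state (nonterminal X), reached once a letter has been seen.

#include <string>

bool isPascalIdentifier(const std::string& s) {
    int state = 0;                      // 0 = start (Id), 1 = accepting (X)
    for (char c : s) {
        bool letter = (c >= 'a' && c <= 'z');   // lowercase only, as assumed above
        bool digit  = (c >= '0' && c <= '9');
        if (state == 0) {
            if (!letter) return false;  // identifier must begin with a letter
            state = 1;                  // Id -> aX | ... | z
        } else if (!letter && !digit) {
            return false;               // X allows only letters and digits
        }
    }
    return state == 1;                  // accept iff at least one letter was seen
}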
52. 52 Language Generators Example: Regular grammar to generate a binary string containing an odd number of 1s
N = {A,B} T = {0,1} S = A P =
A → 0A | 1B | 1
B → 0B | 1A | 0
Example: Regular grammars CANNOT generate strings of the form aⁿbⁿ
Grammar needs some way to count number of a’s and b’s to make sure they are the same
Any regular grammar (or FSA) has a finite number, say k, of different states
If n > k, not possible
53. 53 Language Generators If we could add a “memory” of some sort we could get this to work
Context-free Grammars
Can be modeled by a Push-Down Automaton (PDA)
FSA with added push-down stack
Productions are of the form:
<non> → α, where <non> is a nonterminal and α is any sequence of terminals and nonterminals
Note the RHS is more flexible now
54. 54 Language Generators So how to generate anbn ? Let a=0, b=1
N = {A} T = {0,1} S = A P =
A → 0A1 | 01
Note that now we can have a terminal after the nonterminal as well as before
Can also have multiple nonterminals in a single production
Example: Grammar to generate sets of balanced parentheses
N = {A} T = {(,)} S = A P =
A → AA | (A) | ()
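A minimal sketch of the corresponding recognizer (an assumed illustration): a counter suffices here and plays the role of the PDA's push-down stack, pushing on '(' and popping on ')'.

#include <string>

bool balanced(const std::string& s) {
    if (s.empty()) return false;        // the grammar generates only nonempty strings
    int depth = 0;                      // the "stack": count of unmatched '('
    for (char c : s) {
        if (c == '(') ++depth;          // push
        else if (c == ')') {
            if (depth == 0) return false;   // pop from an empty stack: reject
            --depth;
        } else return false;            // character not in T = {(, )}
    }
    return depth == 0;                  // accept iff the stack is empty at the end
}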
55. 55 Language Generators Context-free grammars are also equivalent to BNF grammars
Developed by Backus and modified by Naur
Used initially to describe Algol 60
Given a (BNF) grammar, we can derive any string in the language from the start symbol and the productions
A common way to derive strings is using a leftmost derivation
Always replace leftmost nonterminal first
Complete when no nonterminals remain
56. 56 Language Generators Example: Leftmost derivation of nested parens: (()(()))
A ⇒ (A)
⇒ (AA)
⇒ (()A)
⇒ (()(A))
⇒ (()(()))
We can view this derivation as a tree, called a parse tree for the string
57. 57 Language Generators Parse tree for (()(()))
58. 58 Language Generators If, for a given grammar, a string can be derived by two or more different parse trees, the grammar is ambiguous
Some languages are inherently ambiguous
All grammars that generate that language are ambiguous
Many other languages are not themselves ambiguous, but can be generated by ambiguous grammars
It is generally better for use with compilers if a grammar is unambiguous
Semantics are often based on syntactic form
59. 59 Language Generators Ambiguous grammar example: Generate strings of the form 0n1m, where n,m >= 1
N = {A,B,C} T = {0,1} S = A P =
A → BC | 0A1
B → 0B | 0
C → 1C | 1
Consider the string: 00011
60. 60 Language Generators We can easily make this grammar unambiguous:
Remove production: A → 0A1
Note that nonterminal B can generate an arbitrary number of 0s and nonterminal C can generate an arbitrary number of 1s
Now only one parse tree
61. 61 Language Generators Let’s look at a few more examples
Grammar to generate: { WWᴿ | W ∈ {0,1}* }
N = {A} T = {0,1} S = A P = ?
Grammar to generate: strings in {0,1} of the form WX such that |W| = |X| but W != X
This one is a little trickier
How to approach this problem?
We need to guarantee two things
Overall string length is even
At least one bit differs in the two “halves”
62. 62 Language Generators See board
Ok, now how do we make a grammar to do this?
Make every string (even length) the result of two odd-length strings appended to each other
Assume odd-length strings are Ol and Or
Make sure that either
Ol has a 1 in the middle and Or has a 0 in the middle or
Ol has a 0 in the middle and Or has a 1 in the middle
Productions:
63. 63 Language Generators Let’s look at an example more relevant to programming languages:
Grammar to generate simple assignment statements in a C-like language (diff. from one in text):
<assig stmt> ::= <var> = <arith expr>
<arith expr> ::= <term> | <arith expr> + <term> | <arith expr> - <term>
<term> ::= <primary> | <term> * <primary> | <term> / <primary>
<primary> ::= <var> | <num> | (<arith expr>)
<var> ::= <id> | <id>[<subscript list>]
<subscript list> ::= <arith expr> | <subscript list>, <arith expr>
64. 64 Language Generators Parse tree for: X = (A[2]+Y) * 20
65. 65 Language Generators Wow – that seems like a very complicated parse tree to generate such a short statement
Extra non-terminals are often necessary to remove ambiguity
Extra non-terminals are often necessary to create precedence
Precedence in previous grammar has * and / higher than + and -
They would be “lower” in the parse tree
“Lower” is correct: operators deeper in the parse tree are evaluated first
What about associativity?
Left recursive productions == left associativity
Right recursive productions == right associativity
66. 66 Language Generators But Context-free grammars cannot generate everything
Ex: Strings of the form WW in {0,1}*
Cannot guarantee that arbitrary string is the same on both sides
Compare to WWᴿ
These we can generate from the “middle” and build out in each direction
For WW we would need separate productions for each side, and we cannot coordinate the two with a context-free grammar
Need Context-Sensitive in this case
67. 67 Language Generators Let’s look at one more grammar example
Grammar to generate all postfix expressions involving binary operators * and -. Assume <id> is predefined and corresponds to any variable name
Ex: v w x y - * z * -
How do we approach this problem?
Terminals – easy
Nonterminals/Start – require some thought
Productions – require a lot of thought
68. 68 Language Generators T = { <id>, *, - }
N = { A }
S = A
P =
A → AA* | AA- | <id>
Show parse tree for previous example
Is this grammar LL(1)?
We will discuss what this means soon
69. 69 Parsers Ok, we can generate languages, but how to recognize them?
We need to convert our generators into recognizers, or parsers
We know that a Context-free grammar corresponds to a Push-Down Automaton (PDA)
However, the PDA may be non-deterministic
As we saw in examples, to create a parse tree we sometimes have to “guess” at a substitution
70. 70 Parsers May have to “guess” a few times before we get the correct answer
This does not lend itself to programming language parsing
We’d like parser to never have to “guess”
To eliminate guessing, we must restrict the PDAs to deterministic PDAs, which restricts the grammars that we can use
Must be unambiguous
Some other, less obvious restrictions, depending upon parsing technique used
71. 71 Parsers There are two general categories of parsers
Bottom-up parsers
Can parse any language generated by a Deterministic PDA
Build the parse trees from the leaves up back to the root as the tokens are processed
At each step, a substring that matches the right-hand side of a production is substituted with the left side of the production
“Reduces” input string all the way back to the start symbol for the grammar
Also called “shift-reduce” parsing
72. 72 Parsers Correspond LR(k) grammars
Left to right processing of string
Rightmost derivation of parse tree (in reverse)
k symbols lookahead required
LR parsers are difficult to write by hand, but can be produced systematically by programs such as YACC (Yet Another Compiler Compiler).
Primary variations of LR grammars/parsers
SLR (Simple LR)
LALR (Look Ahead LR)
LR – most general but also most complicated to implement
We'll leave details to CS 1622
73. 73 Parsers Top-down parsers
Build the parse trees from the root down as the tokens are processed
Also called predictive parsers, or LL parsers
Left-to-right processing of string
Leftmost derivation of parse tree
The LL(1) that we saw before means we can parse with only one token lookahead
More restrictive than LR parsers: there are grammars generated by Deterministic PDAs that are not LL grammars (i.e. cannot be parsed by an LL parser)
Some restrictions on productions allowed
Cannot handle left-recursion – we'll see why shortly
74. 74 Parsers Implementing a top-down parser
One technique is Recursive Descent
Can think of each production as a function
As string of tokens is parsed, terminal symbols are consumed/processed and non-terminal symbols generate function calls
Now we can see why left-recursive productions cannot be handled
From Example 3.4
<expr> → <expr> + <term>
Recursion will continue indefinitely without consuming any symbols
75. 75 Parsers Luckily, in most cases a grammar with left recursion can be converted into one with only right-recursion
Recursive Descent parsers can be written by hand, or generated
Think of a program that processes the grammar by creating a function shell for each non-terminal
Then details of function are filled in based upon the various right-hand sides the non-terminal generates
See example
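A minimal hand-written sketch of such a recursive-descent recognizer (an assumed illustration, not the example file referenced above), for a tiny expression grammar with the left recursion already removed: <expr> → <term> { + <term> }, <term> → <factor> { * <factor> }, <factor> → digit | ( <expr> ). Each nonterminal becomes a function; terminals are consumed from the input.

#include <cctype>
#include <stdexcept>
#include <string>

struct Parser {
    std::string input;
    size_t pos = 0;

    char peek() const { return pos < input.size() ? input[pos] : '\0'; }
    void consume(char expected) {
        if (peek() != expected) throw std::runtime_error("syntax error");
        ++pos;
    }
    // <expr> -> <term> { + <term> }
    void expr() { term(); while (peek() == '+') { consume('+'); term(); } }
    // <term> -> <factor> { * <factor> }
    void term() { factor(); while (peek() == '*') { consume('*'); factor(); } }
    // <factor> -> digit | ( <expr> )
    void factor() {
        if (std::isdigit((unsigned char)peek())) ++pos;
        else { consume('('); expr(); consume(')'); }
    }
    bool parse() {                       // accept only if all input is consumed
        try { expr(); return pos == input.size(); }
        catch (const std::runtime_error&) { return false; }
    }
};
// Parser{"1+2*3"}.parse() is true; Parser{"1+*3"}.parse() is false.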
76. 76 LL(1) Grammars So how can we tell if a grammar is LL(1)?
Given the current non-terminal (or left side of a production) and the next terminal we must be able to uniquely determine the right side of the production to follow
Remember that a non-terminal can have multiple productions
As we previously mentioned, the grammar must not be left recursive
However, not having left recursion is necessary but not sufficient for an LL(1) grammar
77. 77 LL(1) Grammars Ex:
A → aX | aY
We cannot determine which right side to follow without more information than just "a"
How can we process a grammar to determine if this situation occurs?
Calculate the First set for each RHS of productions
First set of a sequence of symbols, S, is the set of terminals that begin the strings derived from S
Given multiple RHS for nonterminal N
N → α1 | α2
If First(α1) and First(α2) intersect, the grammar is not LL(1)
78. 78 LL(1) Grammars So how do we calculate First() sets?
Algorithm is given in Aho (see Slide 2)
Consider symbol X
If X is a terminal, First(X) = {X}
If X → ε is a production, add ε to First(X)
If X is a nonterminal and X → Y1Y2…Yk is a production
Add a to First(X) if, for some i, a is in First(Yi) and ε is in all of First(Y1) … First(Yi-1)
Add ε to First(X) if ε is in First(Yj) for all j = 1, 2, … k
To calculate First(X1X2…Xn) for some X1X2…Xn
Add non-ε symbols of First(X1)
If ε is in First(X1), add non-ε symbols of First(X2)
…
If ε is in all First(Xi), add ε
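A minimal C++ sketch of this First-set computation (an assumed illustration of the algorithm above, not the book's code). Conventions assumed here: uppercase letters are nonterminals, lowercase letters are terminals, and '#' stands for ε.

#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

using Grammar = std::map<char, std::vector<std::string>>;

// Iterate the rules above to a fixed point: keep applying them until
// no First set changes.
std::map<char, std::set<char>> firstSets(const Grammar& g) {
    std::map<char, std::set<char>> first;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& [lhs, rhsList] : g)
            for (const std::string& rhs : rhsList) {
                bool allNullable = true;         // have all Yi so far derived ε?
                for (char x : rhs) {
                    // First(terminal) = {terminal}; '#' (our ε) counts as nullable
                    std::set<char> fx = std::isupper((unsigned char)x)
                                        ? first[x] : std::set<char>{ x };
                    for (char a : fx)
                        if (a != '#' && first[lhs].insert(a).second)
                            changed = true;
                    if (!fx.count('#')) { allNullable = false; break; }
                }
                if (allNullable && first[lhs].insert('#').second)
                    changed = true;
            }
    }
    return first;
}
// Example: for A -> aB | # and B -> b, this yields First(A) = {a, #}, First(B) = {b}.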
79. 79 LL(1) Grammars A → aB | b | cBB
B → aB | bA | aBb
A → aB | CD | E | ε
B → b
C → cA | ε
D → dA
E → dB
Not LL(1): in the first grammar, B has two productions beginning with a
Not LL(1): in the second grammar, First(CD) and First(E) both contain d (since C can derive ε)
80. 80 Semantics Semantics indicate the “meaning” of a program
What do the symbols just parsed actually say to do?
Two different kinds of semantics:
Static Semantics
Almost an extension of program syntax
Deals with structure more than meaning, but at a “meta” level
Handles structural details that are difficult or impossible to handle with the parser
Ex: Has variable X been declared prior to its use?
Ex: Do variable types match?
81. 81 Semantics Dynamic Semantics (often just called semantics)
What does the syntax mean?
Ex: Control statements
Ex: Parameter passing
Programmer needs to know meaning of statements before he/she can use language effectively
82. 82 Semantics Static Semantics
One technique for determining/checking static semantics is Attribute Grammars
Start with a context-free grammar, and add to it:
Attributes (for the grammar symbols)
Indicate some properties of the symbols
Attribute computation functions (semantic functions)
Allow attributes to be determined
Predicate functions
Indicate the static semantic rules
83. 83 Semantics Attributes – made up of synthesized attributes and inherited attributes
Synthesized Attributes
Formed using attributes of grammar symbols lower in the parse tree
Ex: Result type of an expression is synthesized from the types of the subexpressions
Inherited Attributes
Formed using attributes of grammar symbols higher in the parse tree
Ex: Type of RHS of an assignment is expected to match that of LHS – the type is inherited from the type of the LHS variable
84. 84 Semantics Semantic Functions
Indicate how attributes are derived, based on the static semantics of the language
Ex: A = B + C;
Assume A, B and C can be integers or floats
If B and C are both integers, RHS result type is integer, otherwise it is float
Predicate functions
Test attributes of symbols processed to see if they match those defined by language
Ex: A = B + C
If RHS type attribute is not equal to LHS type attribute, error (in some languages)
85. 85 Semantics Detailed Example in text
Grammar Rules
1) <assign> → <var> = <expr>
2) <expr> → <var> + <var>
3) <expr> → <var>
4) <var> → A | B | C
Attributes
actual_type: actual type of <var> or <expr> in question (synthesized, but for a <var> we say this is an intrinsic attribute)
expected_type: associated with <expr>, indicating the type that it SHOULD be – inherited from actual_type of <var>
86. 86 Semantics Semantic functions
Parallel to syntax rules of the grammar
See Ex. 3.6 in text
1) <assign> → <var> = <expr>
<expr>.expected_type ← <var>.actual_type
2) <expr> → <var>[2] + <var>[3]
<expr>.actual_type ← if (<var>[2].actual_type = int) and
(<var>[3].actual_type = int) then int
else real
end if
3) <expr> → <var>
<expr>.actual_type ← <var>.actual_type
4) <var> → A | B | C
<var>.actual_type ← look-up(<var>.string)
Predicate functions
Only one needed here – do the types match?
<expr>.actual_type == <expr>.expected_type
87. 87 Semantics Ex: A = B + C
88. 88 Semantics Attribute grammars are useful but not typically used in their pure form for full-scale languages – makes the grammars more complicated and compilers more difficult to generate
89. 89 Semantics Dynamic Semantics (semantics)
Clearly vital to the understanding of the language
In early languages they were simply informal – like manual pages
Efforts have been made in later years to formalize semantics, just as syntax has been formalized
But semantics tend to be more complex and less precisely defined
More difficult to formalize
90. 90 Semantics Some techniques have gained support however
Operational Semantics
Define meaning by result of execution on a primitive machine, examining the state of the machine before and after the execution
Axiomatic Semantics
Preconditions and postconditions define meaning of statements
Used in conjunction with proofs of program correctness
Denotational Semantics
Map syntactic constructs into mathematical objects that model their meaning
Quite rigorous and complex
91. 91 Identifiers, Reserved Words and Keywords Identifier
String of characters used to name an entity within a program
Most languages have similar rules for ids, but not always
C++ and Java are case-sensitive, while Ada is not
Can be a good thing: mixing case allows for longer, more readable names, à la Java: NoninvertibleTransformException
Can be a bad thing: should that first i be upper or lower case?
92. 92 Identifiers, Reserved Words and Keywords C++, Ada and Java allow underscores, while standard Pascal does not
FORTRAN originally allowed only 6 chars
Reserved Word
Name whose definition is part of the syntax of the language
Cannot be used by programmer in any other way
Most newer languages have reserved words
Make parsing easier, since each reserved word will be a different token
93. 93 Identifiers, Reserved Words and Keywords Ex: end if in Ada
Interesting extension topic
If we extend a language and add new reserved words, we may make some old programs syntactically incorrect
Ex: C subprogram using class as an id will not compile with a C++ compiler
Ex: Ada 83 program using abstract as an id will not compile with an Ada 95 compiler
Keywords
To some, keyword ≡ reserved word
Ex: C++, Java
94. 94 Identifiers, Reserved Words and Keywords To others, there is a difference
Keywords are only special in certain contexts
Can be redefined in other contexts
Ex: FORTRAN keywords may be redefined
Predefined Identifiers
Identifiers defined by the language implementers, which may be redefined
cin, cout in C++
real, integer in Pascal
predefined classes in Java
95. 95 Identifiers, Reserved Words and Keywords Programmer may wish to redefine for a specific application
Ex: Change a Java interface to include an extra method
Problem: predefined version no longer applies, so program segments that depend on it are invalid
Better to extend a class or compose a new class than to redefine a predefined class
Ex: Comparable interface can be implemented as we see fit by a new class
96. 96 Variables Simple (naïve) definition: a name for a memory location
In fact, it is really much more
Six attributes
Name
Address
Value
Type
Lifetime
Scope
97. 97 Variables Name:
Identifier
In most languages the same name may be used for different variables, as long as there is no ambiguity
Some exceptions
Ex: A method variable name may be declared only once within a Java method
98. 98 Variables Address
Location in memory
Also called the l-value
Some situations that are possible:
Different variables with the same name have different addresses
Declared in different blocks of the program
Same variable has different addresses at different points in time
Declared in a subprogram and allocated based on run-time stack
We’ll discuss this more shortly
99. 99 Variables Different variables share same address: aliasing
Occurs with FORTRAN EQUIVALENCE, Pascal and Ada record variants, C and C++ unions, pointer variables, reference parameters
Adds to the flexibility of a language, especially with pointers and reference parameters
Can also save memory in some situations
Many references to a single copy rather than having multiple copies
Can be quite problematic if programmer does not handle them correctly
Ex: copy constructor and = operator for classes with dynamic components in C++
Ex: shallow copy of arrays in Java
We’ll discuss most of these things more in later chapters
100. 100 Variables Type
Modern data types include both the structure of the data and the operations that go with it
Important in determining legality of some expressions
Value
Contents of memory locations allocated for that variable
Also called the r-value
101. 101 Variables Lifetime
Time during which the variable is bound to a specific memory location
We’ll discuss this more shortly
Scope
Section of the program in which a variable is visible
Accessible to the programmer/code in that section
We’ll discuss this more shortly
102. 102 Binding Binding of variables deals with an association of variable attributes with actual values
The time when each of these occurs in a program is the binding time of the attribute
Static binding: occurs before runtime and does not change during program execution
Dynamic binding: occurs during runtime or changes during runtime
103. 103 Binding Name:
Occurs when program is written (chosen by programmer) and for most languages will not change: static
Address and Lifetime:
A memory location for a variable must be allocated and deallocated at some point in a program
The lifetime is the period between the allocation and deallocation of the memory location for that variable
104. 104 Binding We are ignoring bindings associated with a computer’s virtual memory
This in fact could cause a variable to be bound and unbound to different memory locations many times throughout the execution of a program
Each time the data is swapped or paged out, the location is physically unbound, and then rebound when it is swapped or paged back in
We will ignore these issues since they are more related to how the operating system and hardware execute programs than they are to the way the variables are declared and used within the program
105. 105 Binding Text puts lifetimes of variables into 4 categories
Static: bound to same memory cell for entire program
Ex: static C++ variables
Stack-dynamic: bound and unbound based on run-time stack
Ex: Pascal, C++, Ada subprogram variables
Explicit heap-dynamic: nameless cells that can only be accessed using other variables (pointers)
Allocated (and poss. deallocated) explicitly by the programmer
Ex: result of new in C++ or Java
106. 106 Binding Implicit heap-dynamic variables
Binding of all attributes (except name) changed upon each assignment
Much overhead to maintain all of the dynamic information
Used in Algol 68 and APL
Not used in most newer languages
See lifetimes.cpp
Value
Dynamic by the nature of a variable – can change during run-time
107. 107 Binding Type
Dynamic Binding
Type associated with a variable is determined at run-time
A single variable could have many different types at different points in a program
Static Binding
Type associated with a variable is determined at compile-time (based on var. declaration)
Once declared, type of a variable does not change
108. 108 Binding Advantage of dynamic binding
More flexibility in programming
Can use same variable for different types
Can make operations generic
Disadv. of dynamic binding
Type-checking is limited and must be done at run-time
To understand this better, we must discuss type-checking at more length
We’ll return to our last binding topic (scope) afterward
109. 109 Type-checking Type checking
Determination that operands and operators in an expression have types that are compatible with each other
By compatible we mean
the types match or
they can be coerced (implicitly converted) to match
If they are not compatible, a type-error occurs
110. 110 Type-checking Why is type-checking important?
Let’s review programming errors
Compilation error: detectable when the program is compiled
Usually a syntax error or static semantic error
Run-time error: detectable as the program is being run
Often an illegal instruction or I/O error
Logic error: error in meaning of the program
Often only detectable through debugging and/or testing program on known data
Program could seem to run perfectly, but produce incorrect results
111. 111 Type-checking We’d like an environment that is not conducive to logic errors
Consider dynamic type binding again
Assignments cannot be type checked
Since type may change, any assignment is legal
If object being assigned is an erroneous type, we have a logic error
Type checking that is done must be done at run-time
This requires type information to be stored and accessed at run-time
Must be done in software – i.e. the language must be interpreted
Increases both memory and run-time overhead
112. 112 Type-checking Now consider static type binding
Since types are set at compile-time, most (but not usually all) type checking can be done at compile-time
Assignments can be checked to avoid logic errors
Type information does not need to be kept at run-time
Program can run in hardware
113. 113 Type-checking STRONGLY TYPED language
2 slightly different definitions
Traditional definition: If ALL type checking can be done at compile-time (i.e. statically), a language is strongly typed
Sebesta definition: If ALL type errors can always be detected (either at compile-time or at run-time) a language is strongly typed
First definition is more reliable but also more restrictive
114. 114 Type-checking No commonly used languages are truly strongly typed, but some come close
Let’s look at two: C++ and Ada
C++ union construct allows programmer to access same memory as different types with no checking
Ada record variants contain a discriminant to determine which type is being used
Can only access the type indicated by the discriminant
Cannot change the discriminant unless you change the entire record – prevents inconsistency
Checking can be turned off, however
115. 115 Type-checking See union.cpp and variant.adb
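A minimal sketch in the spirit of union.cpp (the actual file is not reproduced here; this is an assumed illustration): a C++ union gives two names, of different types, to the same memory, with no checking.

#include <iostream>

union Cell {
    int   i;
    float f;
};

int main() {
    Cell c;
    c.f = 3.14f;
    // Reading c.i after writing c.f is not caught by the compiler (and is
    // formally undefined behavior in C++): we simply see the float's bit
    // pattern reinterpreted as an integer.
    std::cout << c.i << std::endl;
    return 0;
}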
Pascal also has a discriminant, but does not require entire record to be assigned when discriminant is changed
Suggests type-safety, but does not enforce it
We’ll look more at these types of structures in Chapter 6
116. 116 Type-checking So what does “compatible” mean?
We said type-checking involves determining if operands and operators are compatible
Different languages define it differently
Name compatible (or name equivalent)
The types have the same name
Easy to check and enforce
Simply record the type of each variable in the symbol table
Compare when necessary
117. 117 Type-checking Somewhat limiting for programmer
Ex. in Pascal
A1: array[1..10] of integer;
A2: array[1..10] of integer;
A2 := A1; { not allowed }
Assignment is not legal even though they have the same structure
Also cannot pass either variable as a parameter to a subprogram
Variables above actually each have an ANONYMOUS TYPE – not name compatible with any other type
Generally not a good idea to use
118. 118 Type-checking Structurally compatible (equivalent)
The types are compatible if they have the same structure
Ex:
A1: array[1..10] of integer;
A2: array[1..10] of integer;
A2 := A1; { this would be allowed }
Since both have same size and base type, it works
Much more flexible than name compatible, but also much more difficult for compiler and not as clear to programmer
Compiler must compare structures of every variable involved in the expression
It is not obvious what is and is not compatible
119. 119 Type-checking record
X: float;
A: array[1..10] of int;
end record
record
A: array[1..10] of int;
X: float;
end record
Structure is the same – can compiler tell that components are reversed?
Could it “match” the types with more complex records?
A1: array[1..10] of float;
A2: array[3..12] of float;
Index values are changed
120. 120 Type Checking So how about some languages?
Ada: name equivalence but allows subtypes to be compatible with parent types
Pascal: almost name equivalence, but considers variables in the same declaration to also be of the same type, and allows one type name to be set equal to another
A1, A2: array[1..10] of integer;
above not compatible in Ada
type newint = integer;
C++: name equivalence, but a lot of coercion is done – we will look at coercion later
See types.p, types.adb, types.cpp
121. 121 Binding Ok, back to binding
Scope (visibility)
Static scope: determined at compile-time
Dynamic scope: determined at run-time
Implications:
Most languages have the notion of “local” variables (i.e. within a block or a subprogram)
If these are the only variables we use, scope is not important
122. 122 Binding Scope become significant when dealing with non-local variables
How and where are these accessible?
Most modern languages use static scope
If variable is not locally declared, proceed out to the “textual parent” (static parent) of the block/subprogram until the declaration is found
Fairly clear to programmer – can look at code to see scope
We’ll discuss implementation later
123. 123 Binding Examples:
Pascal
Subprograms are the only scope blocks, but can be nested to arbitrary depth
All declarations must be at the beginning of a subprogram
Somewhat restrictive, although not the fault of static scope
Ada
Subprograms can be nested
Also allow declare blocks to nest scope within the same subprogram
Useful for variable length arrays
All declarations must be at the beginning of a subprogram or declare block
See arraytest.adb
124. 124 Binding C++
Subprograms CANNOT be nested
New declaration blocks can be made with {}
Declarations can be made anywhere within a block, and last until the end of the block
Interesting note:
What about the scope of loop control variables in for loops?
Ada always implicitly declares LCVs and scope (and lifetime) is LOCAL to the loop body
C++ and Java for is more general and does not require a locally declared LCV, but if included, is also LOCAL to the loop body
Wasn’t always the case in C++
125. 125 Binding Dynamic scope
Non-local variables are accessed via calls on the run-time stack (going from top to bottom until declaration is found)
A non-local reference could be declared in different blocks in different runs
Used in APL, SNOBOL4 and through local variables in Perl
Flexible but very tricky
Difficult for programmer to anticipate different definitions of non-local variables
126. 126 Binding Type-checking must be dynamic, since types of non-locals are not known until run-time
More time-consuming
See scope.pl
127. 127 Scope, Lifetime and Referencing Environments Concepts of Scope and Lifetime are not comparable
Lifetime deals with existence and association of memory
WHEN
Scope deals with visibility of variable names
WHERE
However, sometimes they seem to be “the same”
128. 128 Scope, Lifetime and Referencing Environments Ex. stack-dynamic variables in some places
Ex. global variables in some situations
#include <iostream>
using namespace std;
int i = 10;                          // global i
int main()
{
    {
        int i = 11, j = 20;          // this i shadows the global i
        {
            int i = 12, k = 30;      // innermost i shadows both outer i's
            cout << i << " " << j << " " << k << endl;   // prints 12 20 30
        }
        cout << i << " " << j << endl;                   // prints 11 20
    }
    cout << i << endl;               // global i visible again: prints 10
}
129. 129 Scope, Lifetime and Referencing Environments More often they are not “the same”
Lifetime of stack-dynamic variables continues when a subsequent function is called, whereas scope does not include the body of the subsequent function
Lifetime of heap-dynamic variables continues even if they are not accessible at all
130. 130 Scope, Lifetime and Referencing Environments Referencing Environment
Given a statement in a program, what variables are visible there?
Clearly this depends on the type of scoping used
Once we know the scope rules, we can always figure this out
From code if static scoping is used
From call chains (at run-time) if dynamic scoping is used
We will look more at non-local variable access when we discuss subprograms
131. 131 Primitive Data Types Most languages provide these
Numeric types
Integer
Usually 2’s complement
Exact representation of the number
In some languages (ex. C++) size is not specified
Depends on hardware
In others (ex. Java) size is specified
Regardless of hardware
132. 132 Primitive Data Types Float
Usually exponent/mantissa form, each coded in binary
Often an approximation of the number
Only a limited number of bits for mantissa
Decimal point “floats”
Many numbers cannot be represented exactly
Irrational numbers (ex. π, e, √2)
Infinite repeating decimals (ex. 1/3)
Some terminating decimals as well (ex. 0.1)
Instead we use digits of precision
In most new computers, this is defined by IEEE Standard 754
See p. 255 in text
See rounding.cpp
133. 133 Primitive Data Types Fixed point
Called “Decimal” in text
Store rational numbers with a fixed number of decimal places
Useful if we need decimal point to stay put
Ex. Working with money
Can be stored as Binary Coded Decimal – each digit encoded in binary
Similar to strings, but we can put 2 digits into one byte, since each digit needs only 4 bits
However, the 10 possible digit values do not use all 16 patterns of the 4 bits
So memory is somewhat wasted in this representation
134. 134 Primitive Data Types Can also be stored as integers, with an extra attribute in the descriptor
Scale factor indicates how many places over to locate decimal point
Ex: X = 102.53 and Y = 32.65
Stored as 10253 and 3265 (in binary) with scale factor of 2
Z = X + Y
Add the integers and keep the same scale factor (involves normalization if the number of decimal places are not the same – think about this)
= 13518 with scale factor 2 = 135.18
Z = X * Y
Multiply the integers and add the scale factors, then normalize to the correct number of digits
= 33476045 with scale factor 4
= 3347.6045 normalized to 2 digits
= 3347.60
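A minimal C++ sketch of this scaled-integer scheme (an assumed illustration that reproduces the arithmetic above):

#include <iostream>

struct Fixed {
    long value;   // the digits with the decimal point removed
    int  scale;   // how many places from the right the point sits
};

// Addition: scales must match (normalize first if they do not)
Fixed add(Fixed x, Fixed y) {
    return { x.value + y.value, x.scale };   // assumes x.scale == y.scale
}

// Multiplication: multiply the integers and add the scale factors,
// then normalize back to the desired number of places (truncating here)
Fixed mul(Fixed x, Fixed y, int resultScale) {
    long v = x.value * y.value;
    int  s = x.scale + y.scale;
    while (s > resultScale) { v /= 10; --s; }
    return { v, s };
}

int main() {
    Fixed x = { 10253, 2 };              // 102.53
    Fixed y = {  3265, 2 };              //  32.65
    Fixed sum  = add(x, y);              // 13518, scale 2 = 135.18
    Fixed prod = mul(x, y, 2);           // 33476045, scale 4 -> 334760, scale 2 = 3347.60
    std::cout << sum.value  << " scale " << sum.scale  << std::endl;
    std::cout << prod.value << " scale " << prod.scale << std::endl;
    return 0;
}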
135. 135 Primitive Data Types But clearly limited in size by integer size
In Java the BigDecimal class (not primitive) stores them as (very long) integers with a 32 bit scale factor – see BigD.java
Very Long Numbers
Typically NOT primitive types
Implemented using arrays via a predefined class (ex. BigInteger and BigDecimal in Java) or via an add-on library (NTL library for C++)
Boolean
Used in most languages (except C)
Values are true or false
136. 136 Primitive Data Types Though stored as integer values (0 or 1), boolean values are typically not compatible with integer values
Exception is C++, where all non-zero values are true, 0 is false
This adds flexibility, but can cause logic errors
Recurring theme!
Remember assign.cpp
137. 137 Primitive Data Types Character
In early computers, character codes were not standardized
Caused trouble for transferring text files
Now most computers use ASCII character set (or extended ASCII set)
But ASCII is not adequate for international character sets
Unicode is used in Java, perhaps in other languages soon
16 bit code
138. 138 Primitive Data Types ASCII is also not the same for Unix platforms and Windows platforms
Unix uses only a <LF> while Windows uses a <CR><LF> combination
Can cause problems for programs not equipped to handle the difference
We can easily convert, however
FTP in ASCII mode
Simple script to convert
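A minimal sketch of such a conversion (an assumed illustration; the file names are placeholders): strip each <CR> so a Windows <CR><LF> file uses Unix <LF> line endings.

#include <fstream>

int main() {
    std::ifstream in("dosfile.txt", std::ios::binary);
    std::ofstream out("unixfile.txt", std::ios::binary);
    char c;
    while (in.get(c))
        if (c != '\r') out.put(c);   // drop each <CR>, keep everything else
    return 0;
}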
139. 139 More Data Types Strings
Not a primitive type in many languages
C, C++, Pascal, Ada
Primitive type in others
BASIC, FORTRAN, Perl
Issues to consider:
Should a string be considered a single entity, or a collection of characters?
Should the length of a string be fixed or variable?
Which operations can be used?
140. 140 More Data Types Single entity vs. collection of characters
More an issue of access than of structure
In languages with no primitive string type, a string is generally perceived as a collection of characters
Ex: In C and C++ a string is an array of char
If we want we can treat it like any other array
If we use operations in string.h (or <cstring>) we can treat it like a string
C++ also has string class
Ex: In Ada String is a predefined unconstrained array of characters
String variables must be constrained
Some simple operations can be used (ex. assignment, comparison, concatenation)
141. 141 More Data Types In Perl, string is a primitive type
Strings are accessed as single entities
We can use functions to get character index values, but strings are stored as single, primitive variable values
Many operations can be used (later)
In Java, we have two string types
Neither is really primitive, since they are classes built using an underlying array of char
String
Immutable objects (cannot be changed once created)
Allows strings to be stored and accessed more efficiently, since multiple objects with the same value can be replaced with a single object (at compile-time – usually done for literals)
142. 142 More Data Types StringBuffer
Objects that can be modified after creation
More efficient if programmer is altering string values, since new objects do not have to be created for each operation
String length
3 variations
Static (fixed) length – length of string is set when the object is created and cannot change
Limited dynamic length – length of string can vary up to a preset maximum
Typically the dynamic aspect is logical – physically the string size is preset, but some of the space may not be used
Dynamic length – length of string can vary without bound
143. 143 More Data Types Used in Pascal, Ada
String variable is declared to be of a fixed size, with all locations being part of the string
It is up to programmer to either pad string at end or somehow ignore unused locations
However, Ada improves on Pascal strings by making the String type an unconstrained array
String objects must have a fixed length, but they are all of the same type (ex: for params)
See astrings.adb and pstrings.p
Used in C, C++ (c strings)
String variable is of a fixed maximum size, but any number of characters up to the maximum length may be used at any time
In C and C++ a special sentinel character (\0) is used to indicate the end of the logical string
Implementation is VERY dangerous (what else is new!)
See cstrings.cpp
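A minimal sketch in the spirit of cstrings.cpp (the actual file is not reproduced here; an assumed illustration of why the implementation is dangerous):

#include <cstring>
#include <iostream>

int main() {
    char small[8];
    // strcpy copies until it finds '\0', so this writes 27 characters plus
    // the sentinel into an 8-byte array, silently trampling adjacent memory.
    std::strcpy(small, "this string is far too long");
    std::cout << small << std::endl;
    return 0;
}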
144. 144 More Data Types Used in Perl, Java StringBuffer, C++ string class (among others)
Physical length of the string object is automatically adjusted by the system to hold the current logical string length
Memory is reallocated if necessary
Only limit on string size is amount of memory available
Realize that this is not free: each time the string has to be resized the following occurs:
New memory of new size is allocated
Old data is copied to new memory
Variable is rebound to new memory
Old memory is discarded
It is smart for system to double memory each time so resizing is infrequent
We’ll discuss allocation later
See Strings.java
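A minimal C++ sketch of the resize-with-doubling idea above (an assumed illustration of the mechanism, not how Strings.java or any particular library actually implements it; copying and error handling omitted for brevity):

#include <cstring>

class GrowableString {
    char*  data;
    size_t len, cap;
public:
    GrowableString() : data(new char[8]), len(0), cap(8) {}
    ~GrowableString() { delete[] data; }
    void append(char c) {
        if (len == cap) {                        // out of room: resize
            size_t newCap = cap * 2;             // double so resizing is infrequent
            char* bigger = new char[newCap];     // 1) allocate new, larger memory
            std::memcpy(bigger, data, len);      // 2) copy old data to new memory
            delete[] data;                       // 3) discard old memory
            data = bigger;                       // 4) rebind to new memory
            cap  = newCap;
        }
        data[len++] = c;
    }
    size_t length() const { return len; }
};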
145. 145 More Data Types Operations
What can we do with our strings?
These are affected by the previous issue (length)
If the length must remain constant, we cannot have any operations that would change the length
Some possibilities:
Slicing (give a substring)
Search for a string
Index
Regular expression matching
146. 146 More Data Types Enumeration Types
Used primarily for readability
Ex: January, February… vs. 0, 1, …
Typically map identifiers to integer values
Pascal
Identifiers can be used only once
I/O not predefined (user must write I/O operations for each new enum. type)
Major negative
Can be used in for loops and case statements
147. 147 More Data Types Ada
Ids can be overloaded as long as the correct definition can be determined through context
type colors is (red, blue, yellow, purple, green);
type signals is (red, yellow, green);
Now the literal green (or yellow or red) is ambiguous in and of itself
We must qualify it in the code to the desired type
for C in signals'red..signals'green
I/O can be done through a generic I/O package – very helpful
Allows values to be output without having to write new functions each time
148. 148 More Data Types C++
Enum values are converted to ints by the compiler
Lose the type checking that Ada and Pascal provide (see the sketch below)
Java – added Enum types in 1.5
All are a subclass of Enum so are objects
Subrange types
Improve readability (new name id) and logic-error checking (restricting ranges)
Provided in Pascal and Ada
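As referenced above, a minimal C++ sketch of the enum-to-int conversion (pre-C++11 style enums; the names are hypothetical):
enum Color  { RED, BLUE, YELLOW, PURPLE, GREEN };
enum Signal { STOP, CAUTION, GO };
int main() {
    Color c = GREEN;
    int   i = c;               // legal: enum value silently becomes an int (4)
    bool  b = (c == CAUTION);  // also compiles: both sides promote to int --
                               // exactly the mix Ada would reject
    return i + b;
}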
149. 149 More Data Types Arrays
Traditionally a homogeneous collection of sequential locations
Many issues to consider with arrays – a few are:
How are they indexed?
How/when is memory allocated?
How are multidimensional arrays handled?
What operations can be done on arrays?
150. 150 More Data Types Indexing
Mapping that takes an array ID and an index value and produces an element within the array
Y = A[X];
Range of index values depends on two things:
How many items are in the array?
This depends on how allocation is allowed in a language
We will discuss more with memory allocation
Are any bounds implicit?
C, C++, Java arrays start at 0
Ada, Pascal can start anywhere
151. 151 More Data Types Whether range of indexes is static or dynamic, the actual subscripts must be evaluated at run-time
Subscript could be a variable or expression whose value is not known until run-time
If value is not within the range of the array, a range-error occurs
VERY COMMON logic error made in programs
Checked in Ada, Pascal and Java
Not checked in FORTRAN, C, C++, much to the chagrin of programmers
Q: Why aren’t range-errors checked in C, C++?
152. 152 More Data Types Memory allocation
Remember discussion of allocation during Binding lecture
Basically those bindings hold, with some slight additions
Static: size of array is fixed and allocation is done at compile-time
Fixed stack-dynamic: size of array is fixed but allocation is done on run-time stack during execution
Pascal arrays
"Normal" C++ arrays
153. 153 More Data Types Stack-dynamic: size of array is determined at run-time and array is allocated on run-time stack; however, once allocated, size is fixed
Ada arrays
See prev. Ada array example
Heap-dynamic:
Explicit: programmer allocates and deallocates arrays
C++, Java, FORTRAN 90 dynamic arrays
Implicit: array is resized automatically as needed by system
Perl, Java ArrayList
See examples on board
154. 154 More Data Types Multiple Dimension Arrays
2-D Array: array of vectors – plane – Matrix
3-D Array: array of matrices or planes
Higher dimension arrays are not so easily envisioned, but in most languages are legal
Subscripts for each dimension can be different ranges and, in fact, even different types
Ex. Pascal:
type matrix = array[1..10, 'A'..'Z'] of real;
Indexing function is a bit more complicated, as we will see next
155. 155 More Data Types Implementing Arrays (1-D)
Two questions we need to answer:
What information do we need at compile-time?
What information do we need at run-time?
The answers to these questions depend somewhat on other language properties
If dynamic type-binding is used, most information must be kept at run-time
Even with static type-binding, some information may be needed at run-time
156. 156 More Data Types Let’s assume static type-binding, since most languages use this
What do we need to know at compile-time?
Element type
Index type
Above two needed for type checking
Element size
Index range (lower and upper bounds)
Array base address
Above items are needed to construct the indexing function for the array
Let’s see how the indexing function is created
Given an array, A, we want a function to return the Lvalue (address) of the ith location of A
157. 157 More Data Types Assume:
E = Element size
LB = Lower index bound
UB = Upper index bound
S = start (base) address of array
Lvalue(A[i]) = S + (i – LB) × E
= (S – LB × E) + (i × E)
Note that once S is known (array has been bound to memory), left part of the equation is constant
So for each array access, only (i × E) must be calculated
If i is a constant, this too can be precalculated (ex. A[5])
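A minimal C++ sketch of the indexing function with hypothetical values (4-byte elements, lower bound 0, base address 1000):
const int E  = 4;      // element size
const int LB = 0;      // lower index bound
const int S  = 1000;   // start (base) address once A is bound to memory
int lvalueA(int i) {
    return (S - LB * E) + i * E;   // left term is a compile-time constant
}
// lvalueA(5) == 1020; for a constant subscript like A[5],
// the whole address can be precalculated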
158. 158 More Data Types What do we need to know at run-time?
This depends on what kind of checking we are doing
Ex: C/C++ need only the start address (S) of the array and the element size (E)
LB is always 0
UB is not checked
Ex: Ada, Pascal need more:
LB and UB needed for bounds checking
Run-time information for an array is typically stored in a run-time descriptor (sometimes called a dope-vector)
This may be stored in the same memory block as the data itself (ex. at the beginning), or elsewhere in memory
159. 159 More Data Types Implementing Arrays (multi-D)
This is more complicated than 1-D, since computer memory is typically linear
Thus multi-D arrays are still stored physically as 1-D arrays
Language must allow them to be accessed in a multi-D way
Ex. Two-D arrays
In most languages stored in row-major order
Line up rows end-to-end to produce the linear physical array
160. 160 More Data Types Column-major order also exists – used in FORTRAN
Index function for 2-D arrays is a bit more complicated than for 1-D, but is still not that hard to calculate
lvalue(A[i,j]) = S + (i – LB1)×ROWLENGTH
+ (j – LB2)×E
where ROWLENGTH = (UB2 – LB2 + 1)×E
Example: given int A[10][5], calculate the address of A[4][3]
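Worked out, assuming 0-based indexing and hypothetical 4-byte ints (E = 4):
ROWLENGTH = (UB2 – LB2 + 1) × E = (4 – 0 + 1) × 4 = 20
lvalue(A[4][3]) = S + (4 – 0) × 20 + (3 – 0) × 4 = S + 92
So A[4][3] sits 92 bytes past the array's base address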
161. 161 More Data Types Can we access arrays in any other way?
In some languages (ex: C, C++) we can use pointers for array access
In this case rather than calculating the index of the location as an offset from the base address, we move the pointer to the address of the location we want to access
For sequential access of the items in an array, this is much more efficient than using traditional indexing
Of course, it is also dangerous!
See ptrarrays.cpp
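A minimal sketch of sequential access through a moving pointer (not the ptrarrays.cpp example itself):
int a[100] = {0};
long sum = 0;
for (int *p = a; p < a + 100; ++p)
    sum += *p;    // advance the pointer itself: one increment per element,
                  // no per-access (i × E) multiplication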
162. 162 More Data Types Array Operations
What can be done to array as a whole?
Assignment
Allowed in languages in which array is actually a data type (ex. Ada, Pascal, Java)
Usually bitwise copy between arrays (shallow copy)
C/C++ do not allow assignment since array variables are simply (constant) pointers
Comparison
Equal and not equal allowed in Ada
With Java, references are compared (== tests identity, not contents)
More complex operations are allowed in some languages (ex. APL)
163. 163 More Data Types Associative Arrays
Instead of being indexed on integers (or simple enumeration values), index on arbitrary keys (usually strings)
Ex: Perl hash (also exists in Smalltalk and Java) and PHP arrays
Java and other langs also have Hashtable classes
Typically done following hashing procedures
Hash function used to map key to an index, mod the table size
Keys are dispersed in a random fashion, to reduce collisions
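A minimal C++ sketch using the standard library's hash table (std::unordered_map), analogous to a Perl hash; the keys here are hypothetical:
#include <iostream>
#include <string>
#include <unordered_map>
int main() {
    std::unordered_map<std::string, int> age;  // indexed on arbitrary strings
    age["alice"] = 30;    // key is hashed to a slot, mod the table size
    age["bob"]   = 25;    // table grows automatically as it fills
    std::cout << age["alice"] << "\n";         // prints 30
}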
164. 164 More Data Types Typically in these language-implemented hashes, the table size is not exposed to the programmer
Programmer does not need to know details in order to use it
Usually it grows and shrinks dynamically as the table fills or becomes smaller
Resizing is expensive – all items must be rehashed into new sized table
Would like to not have to resize too often
Doubling size with each expansion is a good idea
165. 165 More Data Types Records
Heterogeneous collections of data, accessed by component names
Very useful in creating structured data types
Forerunners of objects (do not have the operations, just the data)
Access fairly uniform across languages with dot notation
Fields accessed by name since they may have varying size
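A minimal sketch of a record in C++ (a struct; the field names are hypothetical):
#include <string>
struct Employee {
    std::string name;    // fields may differ in type and size,
    double      salary;  // so they are selected by name, not index
};
int main() {
    Employee e;
    e.name   = "Ada";       // dot notation names the field
    e.salary = 50000.0;
    return 0;
}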
166. 166 More Data Types Pointers
Variables that store addresses of other variables
In most high-level languages, their primary use is to allow access of dynamic memory
Exception: C, where pointers are required for “reference” parameters and where they are also used heavily for array access
We’ll talk about parameters later
167. 167 More Data Types To access memory being pointed to, a pointer is dereferenced
Usually explicit:
C++ Ex: *P = 6.02;
Implicit in some contexts:
Ada Ex: rec_ptr.field
168. 168 More Data Types Pointer Issues:
Should pointers be typed like other variables?
How do the scope and lifetime of the pointer variable relate (if at all) to the scope and lifetime of the heap-dynamic variable?
What are the safety/access issues a programmer using pointers must be concerned with?
How is heap-dynamic memory maintained?
What are reference types and how are they similar/different to pointer types?
169. 169 More Data Types Types
Most languages with pointers require them to be typed
This allows type-checking of heap-dynamic memory
Scope/Lifetime/Safety/Access
In most languages the scope and lifetime of heap-dynamic variables are distinct from those of the pointer variables
If pointer variable is a stack-dynamic variable, its scope and lifetime are as we discussed previously
Heap-dynamic variables are also as we discussed previously
170. 170 More Data Types Thus we have the potential for 2 different problems:
Lifetime of pointer variable accessing a heap-dynamic variable ends, but heap-dynamic variable still exists
Now heap-dynamic variable is no longer accessible
This problem can also occur by simply reassigning the pointer variable
Now we have formed GARBAGE – heap-dynamic variable that can no longer be accessed
Different languages handle garbage in different ways – we will discuss them shortly
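A minimal C++ sketch of how garbage is formed by reassignment:
float *p = new float(1.0f);
p = new float(2.0f);   // the first float still exists on the heap,
                       // but nothing points to it: it is garbage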
171. 171 More Data Types Heap-dynamic variable is in some way deallocated (returned to system), but pointer variable is still storing its address
Now address stored in pointer is invalid
Can cause problems if it is dereferenced, especially if the memory has been reallocated for some other use
This is a dangling reference
Can be catastrophic to a program, as most C++ programmers know
To deal with these problems, we must discuss how heap-dynamic memory is maintained
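A minimal C++ sketch of a dangling reference (undefined behavior if actually run):
float *p = new float(3.0f);
float *q = p;    // two pointers, one heap-dynamic variable
delete p;        // variable returned to the system
*q = 7.0f;       // q still holds the stale address: a dangling reference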
172. 172 Heap Maintenance There are two issues to consider for heap-dynamic memory maintenance
How are blocks of memory stored and accessed?
What is the structure of the heap itself?
How is memory allocated and deallocated?
What protocols are used?
173. 173 Heap Maintenance Structure of the heap
Simplest option is to make it a linked-list of identical cells
Each cell is the same size and points to the next cell in the list (so it contains a pointer)
Available cells are in a free list and memory is always allocated from that list
Works in a language like LISP, but a language like C++ requires large dynamic arrays and objects of different sizes
Also cells do not always have pointers
174. 174 Heap Maintenance Another option is to allow the cell sizes within the heap to vary
More complicated than single-size cells
Free space must be maintained differently
Initially entire heap is stored in a single block
As memory requests are granted, chunks of the heap are partitioned off
First Fit, Best Fit, Worst Fit
As memory is returned and more is given out, the heap begins to fragment
Eventually, fragments are too small to be of use
Common problem in CS with many things (ex: disk)
We must try to reunify small adjacent blocks into larger blocks
175. 175 Heap Maintenance Heap Compaction
Rearrangement of heap to create larger free memory blocks
May become necessary after long use and much fragmentation
Partial compaction
No active blocks are moved
Compaction occurs through combination of adjacent free blocks
Fairly simple to do – each time a block is freed we can check for adjacent free blocks
Full compaction
Active blocks are moved to create more contiguous free space
Complex and time-consuming – if a block is moved ALL references to it must be sought out and changed
How?
176. 176 Heap Maintenance Memory Allocation/Deallocation
2 basic strategies:
Make the programmer responsible for allocation and deallocation
Everything is explicit
Let the system be responsible for deallocation
Reference counts (eager)
Garbage collection (lazy)
177. 177 Heap Maintenance Programmer is responsible
This is the approach taken by C, C++ and (more or less) Pascal
C: malloc and free
C++: new and delete
Pascal: new and dispose
Very flexible – programmer has complete control
Very efficient – reasonable system overhead (unless full compaction is used)
178. 178 Heap Maintenance Very dangerous – programmer can get into serious trouble
Remember two problems we mentioned before – dangling references and garbage
Avoidance of these problems is completely up to the programmer
If programmer does NOT avoid these, serious problems can result
Dangling references can lead to memory corruption and program crash
Garbage creates a memory leak that eventually could consume all of the heap space
So programmer must be very careful and test/debug the code extensively
179. 179 Heap Maintenance System is responsible
Depending on language and variables, programmer either explicitly or implicitly allocates dynamic memory
Explicit: new in Java
Implicit: append to StringBuffer in Java, many operations in Lisp
But now the deallocation is either partially or fully implemented implicitly by the system
Idea is to avoid the problems just shown: dangling references and garbage
180. 180 Heap Maintenance Reference Counts
For each block of memory allocated from the heap, a counter is kept of the number of references to that block
Any assignment of a pointer to that address increments the counter
Any change of assignment from that address, or any delete() decrements the counter
If counter reaches 0, deallocate memory
delete or dispose should also set the pointer in question back to NULL
Note now (for the most part) both dangling references and garbage are eliminated
181. 181 Heap Maintenance p = new float(3.5);
q = p;
r = q;
// reference count to float = 3
delete(q);
// rather than actually deleting the float, the
// count is decremented and q is set to NULL
r = new float(8.5);
p = new float(6.0);
// ref. count to orig. float now 0 and it is
// removed
Looks great! So why doesn’t C++ use this?
182. 182 Heap Maintenance Implementing Reference Counts
Cost of reference counts is high
Statements that were simple and cheap without reference counts become complex and expensive with them
Ex: q = p; from previous slide
In C++, copy the address
With reference counts:
Find q’s previous address and decrement its reference count (deleting if now equal to 0)
Copy address from p to q
Increment reference count of new address
The counts themselves must also be stored
Reference counts also do not (easily) work in all situations
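A minimal sketch of what q = p costs once counts are kept, using a hypothetical RefCell type (not a real smart-pointer class):
struct RefCell { float value; int count; };
void assign(RefCell *&q, RefCell *p) {
    if (q == p) return;          // self-assignment: nothing to do
    if (q && --q->count == 0)    // find q's old referent, decrement its count
        delete q;                // reclaim it if no references remain
    q = p;                       // the only step plain C++ needs: copy address
    if (q) ++q->count;           // increment the new referent's count
}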
183. 183 Heap Maintenance Ex: p = new node;
p -> next = new node;
p -> next -> next = p;
delete p;
Now we have a garbage loop
Each dynamic node has a reference to it, but neither is accessible
Clearly, circular references like this could be bigger
Can be resolved but not easily
Reference counts are not used much in typical high-level languages, but are used in some parallel-processing applications
Cost of using reference counts is spread out
184. 184 Heap Maintenance Garbage Collection
Different philosophy than reference counts
Once allocated, memory remains allocated indefinitely, whether accessible or not
Dangling references not possible
Garbage is allowed to accumulate
Eventually, if a program runs long enough and consumes enough memory, the program will run out
With new computers and virtual memory this may not always occur, especially in programs with only a limited number of objects
But if it does occur it is a problem – there is no more memory available to allocate
185. 185 Heap Maintenance When memory runs out, garbage must be collected so that it can be reused
How is this done?
Typical technique is “mark and sweep”
Mark – each item allocated has a garbage bit. This bit is initialized to 1 for all items, indicating that they are ALL garbage. Then all “active” items are found, and marked as non-garbage by setting the bit back to 0
Sweep – unmarked items (bits still 1) are returned to the free list
Again we have eliminated both problems
Dangling references don’t exist, since a reference would cause the item to be marked as active
Garbage accumulates but is reclaimed when needed
186. 186 Heap Maintenance Implementing Garbage Collection
Which operation seems more difficult, mark or sweep?
It is not at all obvious or simple to identify active memory
We want to make sure we don’t miss anything, or else we can cause serious problems
On the other hand, we don’t want to leave garbage behind either
So how can we reliably detect active memory?
As we saw with reference counts, it is not enough to have a reference to it
187. 187 Heap Maintenance We can do this in a systematic (time-consuming) way, assuming three conditions are true:
Any active element must be reachable through a chain of pointers originating outside of heap
Every pointer outside the heap that points into the heap can be identified
Pointer fields within active elements that point to other heap elements must be able to be identified
Basic idea: Start with pointers outside the heap, trace them to active heap elements within the heap, and trace those elements to others, etc.
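A minimal mark-and-sweep sketch under the three conditions above (hypothetical Cell structure, not any particular collector):
#include <vector>
struct Cell {
    bool garbage = true;        // the "garbage bit"
    std::vector<Cell*> refs;    // identifiable pointer fields (condition 3)
};
void mark(Cell *c) {
    if (c == nullptr || !c->garbage) return;
    c->garbage = false;                 // reachable, so not garbage
    for (Cell *r : c->refs) mark(r);    // trace on to other heap elements
}
void collect(std::vector<Cell*> &heap, std::vector<Cell*> &roots) {
    for (Cell *c : heap) c->garbage = true;  // mark: assume ALL are garbage
    for (Cell *r : roots) mark(r);           // roots = pointers outside heap
    for (Cell *c : heap)
        if (c->garbage) { /* sweep: return c to the free list */ }
}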
188. 188 Heap Maintenance The necessary conditions are not always met
Ex. Met in LISP, Java, but not C or C++
C and C++ allow too much flexibility with pointer arithmetic
Ex: pointer P may not have a valid address of an element in the heap, but maybe P+offset does have a valid address
p = new node;
p += 4;
If garbage collector cannot “find” this node amongst the active nodes it will be deleted, and p-4 will be a dangling reference
Note: Time required for garbage collection is inversely proportional to amount of garbage that is reclaimed
Discuss
189. 189 Heap Maintenance Ada memory management
Remember a goal of Ada is to make the language as reliable as possible
How to allow pointers but maintain reliability?
Ada prevents dangling references by having no explicit delete in the language
Scope of dynamic memory is no longer than that of the pointers that point to it
Prevents some garbage, but not all
Allows for garbage collection, but does not require it
190. 190 More Data Types Reference Types
We can think of a reference type as a pointer that is implicitly dereferenced, with restrictions on its use
No explicit dereferencing
No arithmetic
In C++ they are also constant (cannot change the address)
Originally used for parameters and some return values in C++ functions
void foo(float &param)
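A minimal sketch of how such a reference parameter behaves:
void foo(float &param) { param = 6.02f; }  // implicit dereference on use
int main() {
    float x = 0.0f;
    foo(x);    // no & or * at the call site; x is now 6.02
    return 0;
}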
191. 191 More Data Types In Java they are used for all non-primitive values
Have the benefits of allowing access to dynamic memory in the heap
Since no arithmetic is allowed, garbage collection can be used and no dangling references occur
But we still have to be careful, since they are addresses
Shallow vs. deep copy
Comparisons