Language Translation Issues Lecture 5: Dolores Zage
Programming Language Syntax • The arrangement of words as elements in a sentence to show their relationship • In C, X = Y + Z represents a valid sequence of symbols; XY +- does not • syntax provides significant information for • understanding a program • translation into an object program • precedence rules: 2 + 3 x 4 is 14, not 20 • write (2+3) x 4 to get the other grouping - interpretation is specified by syntax - syntax guides the translator
General Syntactic Criteria • Provide a common notation between the programmer and the programming language processor • the choice is constrained only slightly by the necessity to communicate particular items of information • for example: the fact that a variable is a real may be conveyed by an explicit declaration as in Pascal or by an implicit naming convention as in FORTRAN • general criteria: easy to read, write, translate and unambiguous
Readability • Algorithm is apparent from inspection of text • self-documenting • natural statement formats • liberal use of key words and noise words • provision for embedded comments • unrestricted length identifiers • mnemonic operator symbols • COBOL design emphasizes readability often at the expense of ease of writing and translation
Writability • Enhanced by concise and regular structures (note the tension with readability: readable code tends to be verbose; the two criteria pull in different directions) • FORTRAN's implicit naming does not help us catch misspellings (indx and index are both valid integer variables, even though the programmer meant indx to be index) • redundancy can be good • easier to read and allows for error checking
Translation • Ease of translation • the key to easy translation is regularity of structure • LISP can be translated by a few short, easy rules, but it is a bear to read • COBOL has a large number of syntactic constructs -> hard to translate
Lack of ambiguity • Central problem in every language design! • An ambiguous construction allows two or more different interpretations • these usually arise not in the structure of individual program elements but in the interplay between structures • The dangling else is a classic example:
If then else • if (B1) then if (B2) then S1 else S2 • Reading 1: the else pairs with the inner if, so S2 runs when B1 is true and B2 is false • Reading 2: the else pairs with the outer if, so S2 runs when B1 is false
Resolve dangling else • Include begin … end delimiters around the embedded conditional - ALGOL • Ada -> closing delimiter end if • C and Pascal -> an else is paired with the nearest preceding unmatched then
Character set • ASCII • 26 letters • other languages have hundreds of letters • identifiers, key words and reserved words • blanks may be insignificant except in literal character-string data (FORTRAN) or may be used as separators • delimiters -> begin, end, { }
Other elements • Identifiers, operators, key words, reserved words • Free vs. fixed format • free - statements may be written anywhere • fixed - FORTRAN - the first five columns are reserved for labels • statements • simple - no embedding • structured or nested - embedded
Overall Program-Subprogram Structure • Separate subprogram definitions (COMMON blocks in FORTRAN) • separate data definitions (class mechanism) • nested subprogram definitions (Pascal nests one subprogram within another) • separate interface definitions - package interface in Ada - in C you can do this with an include file • data descriptions separated from executable statements (COBOL data and environment divisions) • unseparated subprogram definitions - no organization - early BASIC and SNOBOL
Stages in Translation • The process of translating a program from its original syntax into executable form is central to every programming language implementation • translation can be quite simple, as in LISP and Prolog, but is more often quite complex • most languages could be implemented with only trivial translation if you wrote a software interpreter and were willing to accept slow execution speeds
Stages in Translation • Syntactic recognition parts of compiler theory are fairly standard • Analysis of the Source Program • the structure of the program must be laboriously built up character by character during translation • Synthesis of the Object Program • construction of the executable program from the output of the semantic analysis
Structure of a Compiler • Source-program recognition phases: • source program -> lexical analysis -> lexical tokens • lexical tokens -> syntactic analysis -> parse tree (consulting the symbol table and other tables) • parse tree -> semantic analysis -> intermediate code • Object-code generation phases: • intermediate code -> optimization -> optimized intermediate code • optimized intermediate code -> code generation -> object code • object code (with object code from other compilations) -> linking -> executable code
Analysis of the Source Program • lexical analysis (tokenizing) • parsing (syntactic analysis) • semantic analysis • symbol-table maintenance • insertion of implicit information (default settings) • macro processing and compile-time operations (#ifdefs)
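The first of these stages, lexical analysis, can be sketched with a toy tokenizer. This is an illustrative sketch, not any particular compiler's scanner; the token categories and regular expressions are assumptions chosen for the C-style example X = Y + Z used earlier.

```python
import re

# Toy lexical analyzer: token categories and patterns are illustrative.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),          # integer literals
    ("NAME",   r"[A-Za-z_]\w*"), # identifiers
    ("OP",     r"[-+*/=()]"),    # single-character operators/delimiters
    ("SKIP",   r"\s+"),          # whitespace: recognized but discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Turn a character stream into a list of (kind, lexeme) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("X = Y + Z"))
# [('NAME', 'X'), ('OP', '='), ('NAME', 'Y'), ('OP', '+'), ('NAME', 'Z')]
```

Note that the scanner rejects XY +- only later, at parsing: lexically it is a fine token sequence, which is why syntactic analysis is a separate phase.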
Synthesis of the Object Program • Optimization • code generation - internal representation must be formed into assembly language statements, machine code or other object form • linking and loading - references to external data or other subprograms
Translator Groupings • Crudely grouped by the number of passes they make over the source code • standard - two passes • the first decomposes the program into its components and records variable-name usage • the second generates an object program from the collected information • one pass - fast compilation - Pascal was designed so that it could be compiled in one pass • three or more passes - used when execution speed of the generated code is paramount
Formal Translation Models
Formal Translation Models • Based on the context-free theory of languages • the formal definition of the syntax of a programming language is called a grammar • a grammar consists of a set of rules (productions) that specify the sequences of characters (lexical items) that form allowable programs in the language being defined
Chomsky Hierarchy • Language syntax was one of the earliest formal models to be applied to programming language design • in 1959 Chomsky outlined a model of grammars
Classes of grammar and abstract machines

Chomsky Level | Grammar Class     | Machine Class
0             | Unrestricted      | Turing machine
1             | Context sensitive | Linear-bounded automaton
2             | Context free      | Pushdown automaton
3             | Regular           | Finite-state automaton

Type 2 grammars are our BNF grammars; types 2 and 3 are what we use in programming languages. A type n language is one that is generated by a type n grammar when there is no grammar of type n + 1 that also generates it. Every grammar of type n is, by definition, also a grammar of type n - 1.
Grammar • To Chomsky a grammar is a 4-tuple (V, T, P, Z) where • V is an alphabet • T, a subset of V, is the alphabet of terminal symbols • P is a finite set of rewriting rules (productions) • Z, the distinguished (start) symbol, is a member of V - T • The language of a grammar is the set of terminal strings which can be derived from Z • The difference between the four types is in the form of the rewriting rules allowed in P
Type 0 or phrase structure • Rules can have the form • u ::= v with • u in V+ and v in V* • That is, the left part u can be any nonempty sequence of symbols and the right part can be empty • abc -> dca • a -> nil
Type 1 or context sensitive or context dependent • Restrict the rewriting rules • xUy ::= xuy • we are allowed to rewrite U as u only in the context x…y • equivalently, in every production a -> b the length of a must be less than or equal to the length of b
G = ( {S,B,C}, {a,b,c}, S, P) • P = • S -> aSBC • S -> abC • bB -> bb • bC -> bc • CB -> BC • cC -> cc • What language is generated by this context sensitive grammar?
Deciding the language? • always start with the start symbol: in this case it is S, but it can be any nonterminal (look at the 4-tuple definition) • create a derivation starting with the start symbol and apply the productions, finally finishing with all terminals • “generalize” the pattern
Identifying L given G • P = 1. S -> aSBC 2. S -> abC 3. bB -> bb 4. bC -> bc 5. CB -> BC 6. cC -> cc • n = 1: S => abC => abc • n = 2: S => aSBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc • n = 3: S => aSBC => aaSBCBC => aaabCBCBC => … => aaabbbccc • L = { a^n b^n c^n | n >= 1 }
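The derivations above can be mechanized. The sketch below does a breadth-first search over sentential forms of the slide's grammar (encoding assumed: uppercase letters are nonterminals, lowercase are terminals) and collects every terminal string it can reach up to a length bound; since the grammar is noncontracting, a length bound on sentential forms is safe.

```python
from collections import deque

# The context-sensitive grammar from the slide, as (lhs, rhs) pairs.
RULES = [
    ("S", "aSBC"), ("S", "abC"),
    ("bB", "bb"), ("bC", "bc"),
    ("CB", "BC"), ("cC", "cc"),
]

def derive(max_len=9):
    """BFS over sentential forms; return terminal strings of length <= max_len."""
    seen, terminal = {"S"}, set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        for lhs, rhs in RULES:
            i = form.find(lhs)
            while i != -1:                      # apply lhs -> rhs at each position
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    if new.islower():           # no nonterminals left
                        terminal.add(new)
                    else:
                        queue.append(new)
                i = form.find(lhs, i + 1)
    return terminal

print(sorted(derive()))  # ['aaabbbccc', 'aabbcc', 'abc']
```

The output is exactly the members of a^n b^n c^n that fit in nine characters, matching the language identified on the slide.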
Type 2 or context free • U can be rewritten as u regardless of the context in which it appears • this grammar has only one symbol on the left-hand side of each rule • it also allows a rule to go to the empty string
Context Free Expression Grammar • E -> E + T | E - T | T • T -> T * F | T / F | F • F -> number | name | (E)
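A sketch of how this grammar fixes the interpretation of 2 + 3 * 4: a recursive-descent evaluator, with the left recursion in E and T replaced by iteration (a standard transformation; the tokenizer, function names, and the restriction to numbers are assumptions for illustration).

```python
import re

# Token pattern: integers and single-character operators/parentheses.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(s):
    pos, out = 0, []
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if not m:
            raise SyntaxError(f"bad character at {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

def evaluate(s):
    """Evaluate per E -> E+T | E-T | T,  T -> T*F | T/F | F,  F -> number | (E)."""
    toks, i = tokenize(s), 0
    def peek():
        return toks[i] if i < len(toks) else None
    def factor():                    # F -> number | (E)
        nonlocal i
        if peek() == "(":
            i += 1
            v = expr()
            assert peek() == ")"
            i += 1
            return v
        v = int(toks[i]); i += 1
        return v
    def term():                      # T -> T*F | T/F | F, by iteration
        nonlocal i
        v = factor()
        while peek() in ("*", "/"):
            op = toks[i]; i += 1
            v = v * factor() if op == "*" else v / factor()
        return v
    def expr():                      # E -> E+T | E-T | T, by iteration
        nonlocal i
        v = term()
        while peek() in ("+", "-"):
            op = toks[i]; i += 1
            v = v + term() if op == "+" else v - term()
        return v
    return expr()

print(evaluate("2 + 3 * 4"))    # 14: * binds tighter because T sits below E
print(evaluate("(2 + 3) * 4"))  # 20: parentheses reach back up to E
```

The precedence is not programmed in explicitly; it falls out of the grammar's shape, which is the point of the slide.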
Type 3 - regular grammars • Restrict the rules once more • all rules must have the form • N ::= t or N ::= tW • a single nonterminal on the left; on the right, a single terminal, optionally followed by one nonterminal
Grammars • As we moved from type 3 to type 2 to type 1 to type 0, the resulting languages became more complex • type 2 and type 3 became important in programming languages • type 3 provided a model (FSM) for building lexical analyzers • type 2 (BNF) for developing parse trees of programs
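The type 3 / FSM correspondence can be sketched directly: each nonterminal becomes a state, a rule A -> tB becomes a transition A --t--> B, and a rule A -> t accepts after consuming t. The grammar below (unsigned binary numerals, S -> 0S | 1S | 0 | 1) is a made-up example, not from the lecture.

```python
# Rules encoded as state -> list of (terminal, next_state); None = accept.
RULES = {
    "S": [("0", "S"), ("1", "S"), ("0", None), ("1", None)],
}

def accepts(string, start="S"):
    """Run the grammar as a (nondeterministic) finite-state machine."""
    states = {start}
    for ch in string:
        # follow every transition labeled ch from every live state
        states = {nxt for st in states if st is not None
                  for (t, nxt) in RULES.get(st, []) if t == ch}
        if not states:
            return False
    return None in states   # some derivation ended exactly at end of input

print(accepts("1011"))  # True
print(accepts(""))      # False: the grammar has no empty production
```

This is why lexical analyzers are driven by finite-state machines: token shapes are regular.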
BNF Grammars • Consider the structure of an English sentence. We usually describe it as a sequence of categories: subject / verb / object • Examples: The girl / played / baseball. The boy / cooked / dinner.
BNF Grammars • Each category can be further divided. For example, the subject can be represented by article / noun, giving article / noun / verb / object • There are other possible sentence structures besides the simple declarative ones, such as questions: auxiliary verb / subject / predicate • Is / the boy / cooking dinner?
Represent sentences by a set of rules • <sentence> ::= <declarative> | <question> • <declarative> ::= <subject> <verb> <object>. • <subject> ::= <article><noun> • <question> ::= <auxiliary verb> <subject> <predicate> • This notation is called BNF (Backus-Naur form) and was developed in the late 1950s by John Backus as a way to express the syntactic definition of ALGOL. At about the same time Chomsky developed a similar grammatical form, the context-free grammar. BNF and context-free grammars are equivalent in power; the differences are only in notation. For this reason the terms BNF grammar and context-free grammar are interchangeable.
Syntax • A BNF grammar is composed of a finite set of BNF grammar rules, which define a language • syntax is concerned with form rather than meaning; a (programming) language consists of a set of syntactically correct programs, each of which is simply a sequence of characters
Production Rules • A grammar -> a set of production rules • <real-number> ::= <integer_part> . <fraction> • <integer_part> ::= <digit> | <integer_part> <digit> • <fraction> ::= <digit> | <digit> <fraction> • <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • the bracketed names are nonterminals; the symbols on the right, such as the digits and the decimal point, are tokens (terminals)
Doesn’t Have to Make Sense! • A syntactically correct program need not make any sense semantically. • If executed, it would not have to compute anything useful • it might not compute anything at all • For example, look at our simple declarative sentence: the syntax subject / verb / object is fulfilled but the sentence doesn’t make any sense: The home / ran / the girl.
Parse Trees • Production rules are rules for building strings of tokens • beginning with the starting nonterminal, you can use the rules to build a tree • The parse tree • each leaf either holds a terminal or is empty • nonleaf nodes are labeled with nonterminals • the tree generates the string formed by reading the terminals at its leaves from left to right • a string is in the language only if it is generated by some parse tree
Parse tree • <real-number> ::= <integer_part> . <fraction> • <integer_part> ::= <digit> | <integer_part> <digit> • <fraction> ::= <digit> | <digit> <fraction> • <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Parse tree for the string 13.13:
<real-number>
  <integer_part>
    <integer_part> -> <digit> -> 1
    <digit> -> 3
  .
  <fraction>
    <digit> -> 1
    <fraction> -> <digit> -> 3
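As a sketch, the <real-number> grammar can be checked by a tiny recognizer. The left recursion in <integer_part> and the right recursion in <fraction> both just describe "one or more digits", so the check collapses to that; the function name is invented for illustration.

```python
def is_real_number(s):
    """True iff s matches <real-number> ::= <integer_part> . <fraction>."""
    if s.count(".") != 1:
        return False                       # exactly one decimal point
    integer_part, fraction = s.split(".")
    # <integer_part> (left-recursive) and <fraction> (right-recursive)
    # each generate one or more digits; str.isdigit rejects "".
    return integer_part.isdigit() and fraction.isdigit()

print(is_real_number("13.13"))  # True
print(is_real_number("13."))    # False: <fraction> needs at least one digit
```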
Use of Formal Grammar • Important to the language user and language implementor • the user may consult it to answer subtle questions about program form, punctuation, and structure • the implementor may use it to determine all the possible cases of input program structures that are allowed • a common, agreed-upon definition
BNF grammar or Context free • Assigns a structure to each string in the language • the structure is always a tree because of the restrictions on BNF grammar rules • the parse tree provides an intuitive semantic structure • BNF does a good job of defining the syntax of a language
Syntax not defined by BNF notation • Despite the elegance, power and simplicity of BNF grammars, there are areas of language syntax that cannot be expressed (contextual dependence) • ex: the same identifier may not be defined twice in the same scope • also, every language can be defined by multiple grammars • problem: ambiguity (the dangling else) • They / are / flying planes or They / are flying / planes
Ambiguity • Ambiguity is often a property of a given grammar • G : S -> SS | 0 | 1 • the grammar that generates binary strings is ambiguous because there is a string in the language that has two distinct parse trees
Ambiguous Grammar • Two parse trees for the string 001: • S -> S S, where the left S -> 0 and the right S -> S S -> 0 1, grouping the string as 0(01) • S -> S S, where the left S -> S S -> 0 0 and the right S -> 1, grouping the string as (00)1
Ambiguous Grammar If every grammar for a given language is ambiguous, then the language is inherently ambiguous. However, the language of binary strings is not inherently ambiguous, because there is a grammar for it that is unambiguous: G: T -> 0T | 1T | 0 | 1
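Ambiguity can be made concrete by counting parse trees. The sketch below counts distinct trees under the ambiguous grammar S -> SS | 0 | 1 by trying every split point for the rule S -> S S; the function name is invented.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(s):
    """Number of distinct parse trees deriving the binary string s."""
    total = 1 if s in ("0", "1") else 0          # base rules S -> 0 | 1
    for i in range(1, len(s)):                   # S -> S S at each split point
        total += count_trees(s[:i]) * count_trees(s[i:])
    return total

print(count_trees("001"))  # 2: the trees group the string as 0(01) and (00)1
print(count_trees("0"))    # 1: a single-symbol string is unambiguous
```

Under the unambiguous grammar T -> 0T | 1T | 0 | 1 the same count would be 1 for every string, since each rule consumes the leftmost symbol and leaves no choice of split.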
Expressions • We need sequence control within expressions • Implicit (default) control - in effect unless modified by the programmer through some explicit structure • explicit - modifies the implicit sequence
Sequencing with Arithmetic Expressions • Root = (-B + sqrt(B^2 - 4 * A * C)) / (2 * A) • There are 15 separate operations in this formula • In a programming language this can be stated as a single expression
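Written as a single expression in a programming language, the formula might look like the sketch below (it assumes real roots, i.e., a nonnegative discriminant; the function name is invented).

```python
import math

def root(A, B, C):
    """One root of A*x^2 + B*x + C = 0, stated as a single expression."""
    return (-B + math.sqrt(B**2 - 4 * A * C)) / (2 * A)

print(root(1, -3, 2))  # 2.0: a root of x^2 - 3x + 2
```

The implicit sequence control of the language (precedence, associativity, parenthesization) decides the order of all the operations packed into that one line, which is exactly the subtlety the next slide raises.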
Sequencing with Arithmetic Expressions • Expressions are a powerful and natural device for expressing sequences of operations; however, they raise new problems • the sequence-control mechanisms that operate to determine the order of operations within an expression are complex and subtle