Syntax and Semantics

Syntax and Semantics
• Syntax gives the structure of statements in a language
  • Allowed ordering, nesting, repetition, omission of symbols
  • Can automate the process of checking correct syntax
• Semantics give meaning to the (structured) symbols
  • E.g., kinds of labels, types of variables, layout of classes
  • For example, what does 11 mean? (at least 3 good answers)
• Separating syntactic and semantic evaluation helps
  • Can isolate the problem of syntactic recognition in an engine
  • Can use the structure produced by the engine directly
  • Sometimes called syntax-directed (compiler in charge)
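To make the "what does 11 mean?" question concrete, here is a minimal sketch in which the same two symbols get different meanings depending on an assumed base. The function name `interpret` is hypothetical, and the three bases shown are only one plausible set of answers; the slide does not say which answers it has in mind.

```cpp
#include <string>

// Hypothetical helper: the syntax ("11") is fixed, while the semantics
// depend on the base the caller assumes. std::stoi's third argument
// selects the radix used to interpret the digits.
int interpret(const std::string& symbols, int base) {
    return std::stoi(symbols, nullptr, base);
}
```

Under these assumptions, `interpret("11", 2)` yields 3, `interpret("11", 10)` yields 11, and `interpret("11", 16)` yields 17. The recognition step (two digit characters) is identical in all three cases, which is why it can be separated from the semantic step.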

Syntax and Lexical Structure
• Syntax gives the structure of statements in a language
  • E.g., the format of tokens and how they can be arranged
  • Lexical structure also describes how to recognize them
• Scanning obtains tokens from a stream of characters
  • E.g., whitespace delimited vs. regular-expression based
  • Tokens include keywords, constants, symbols, identifiers
  • Usually based on the assumption of taking the longest substring
• Parsing recognizes more complex expressions
  • E.g., well-formed statements in logic, arithmetic, etc.
• Free-format languages ignore indentation, etc., while fixed-format languages have specific restrictions/requirements
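The scanning bullets above can be sketched as a small hand-written scanner. This is an illustration under stated assumptions, not the course's reference solution: the names `Kind`, `Token`, and `scan` are hypothetical, the keyword set is limited to `if` and `while`, and every non-alphanumeric character is treated as a one-character symbol.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories, following the slide's list.
enum class Kind { Keyword, Identifier, Constant, Symbol };

struct Token {
    Kind kind;
    std::string text;
};

// Scan a string into tokens, always taking the longest substring that
// still forms a token before deciding what kind of token it is.
std::vector<Token> scan(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {  // whitespace only separates tokens
            ++i;
            continue;
        }
        std::size_t start = i;
        if (std::isalpha(c)) {
            // Consume the whole run of letters and digits first.
            while (i < input.size() &&
                   std::isalnum(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            std::string text = input.substr(start, i - start);
            // Assumed keyword set, for illustration only.
            Kind kind = (text == "if" || text == "while") ? Kind::Keyword
                                                          : Kind::Identifier;
            tokens.push_back({kind, text});
        } else if (std::isdigit(c)) {
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({Kind::Constant, input.substr(start, i - start)});
        } else {
            ++i;  // any other character is a one-character symbol
            tokens.push_back({Kind::Symbol, input.substr(start, 1)});
        }
    }
    return tokens;
}
```

Because the scanner consumes the longest run before classifying it, `ifx` comes out as a single identifier rather than the keyword `if` followed by `x`.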

Scanning vs. Parsing Roles
• It is often possible to simplify a grammar’s structure by making its tokens more sophisticated
  • For example, scanning for the terminal token NUMBER vs. parsing for the non-terminal number → nonzerodigit digit*
• Such simplification delegates work to a scanner
  • Often this is a good separation of concerns, especially since scanning may appropriately specialize its logic, etc.
  • E.g., a fairly general scanner built from classification functions (which look for all digits, all alphabetic, etc.) can be reused or refactored easily for scanning different grammars
  • E.g., the C++11 <regex> library is worth studying and using
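As a rough sketch of the two scanner styles mentioned above, the code below recognizes the NUMBER token once with the C++11 <regex> library and once with reusable classification functions. The names `is_number_regex`, `is_digit`, `is_alpha`, and `all_of_class` are hypothetical, and the pattern `[1-9][0-9]*` is our transcription of number → nonzerodigit digit*.

```cpp
#include <cctype>
#include <regex>
#include <string>

// number -> nonzerodigit digit*, written as a regular expression.
bool is_number_regex(const std::string& s) {
    static const std::regex number("[1-9][0-9]*");
    return std::regex_match(s, number);  // the whole string must match
}

// Classification functions: each answers one question about a character.
bool is_digit(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

bool is_alpha(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// General building block: true if s is non-empty and every character
// belongs to the class tested by classify.
bool all_of_class(const std::string& s, bool (*classify)(char)) {
    if (s.empty()) {
        return false;
    }
    for (char c : s) {
        if (!classify(c)) {
            return false;
        }
    }
    return true;
}
```

Note that the two approaches answer slightly different questions: `all_of_class("042", is_digit)` is true, while `is_number_regex("042")` is false because the grammar requires a non-zero leading digit. The classification version would need one extra check on the first character to match the grammar exactly.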

Regular Expressions, DFAs, NDFAs
• Regular expressions capture lexical structure of symbols that can be built using 3 composition rules
  • Concatenation (ab), selection (a | b), repetition (b*)
• Finite automata can recognize the strings that regular expressions generate
  • Deterministic finite automata (DFAs) associate exactly one state with each input sequence
  • Non-deterministic finite automata (NDFAs) allow multiple states to be reached by the same input sequence (adding “choice”)
• Can generate a unique (minimal) DFA in 3 steps
  • Generate NDFA from the regular expression (Scott p. 56)
  • Convert NDFA to (possibly larger) DFA (Scott pp. 56-58)
  • Minimize the DFA (Scott p. 59) to get a unique automaton
• C++11 <regex> library automates this kind of recognition for you
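To show what the end product of those three steps looks like in code, here is a minimal table-driven DFA for the regular expression (a | b)*abb, which uses all three composition rules. The transition table was derived by hand for this illustration and is not taken from Scott; the function name `accepts` is hypothetical.

```cpp
#include <string>

// Four-state DFA for (a | b)*abb over the alphabet {a, b}.
// State k (0..3) means "the last characters read match the first k
// characters of abb"; state 3 is the only accepting state.
bool accepts(const std::string& input) {
    // next[state][symbol], where symbol 0 is 'a' and symbol 1 is 'b'.
    static const int next[4][2] = {
        {1, 0},  // state 0: nothing useful seen yet
        {1, 2},  // state 1: seen "a"
        {1, 3},  // state 2: seen "ab"
        {1, 0},  // state 3: seen "abb" (accepting)
    };
    int state = 0;
    for (char c : input) {
        if (c != 'a' && c != 'b') {
            return false;  // character outside the alphabet
        }
        state = next[state][c == 'a' ? 0 : 1];
    }
    return state == 3;
}
```

Each input character causes exactly one table lookup and there is never a choice of next state, which is the determinism the slide describes. An NDFA for the same expression could instead "guess" where the final abb begins.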

Today’s Studio Exercises
• We’ll code up some ideas from Scott Chapter 2.1-2.2
  • Looking at mechanisms for recognizing tokens and for parsing basic CFGs with straightforward recursion
  • Next studio we’ll look at more complicated variations
• Today’s exercises are all in C++
  • We’ll write our own code, but check out the <regex> library too, since you’ll be allowed to use it for lab assignments!
  • Please take advantage of the online tutorial and reference manual pages that are linked on the course web site
• As always, please ask us for help as needed
• When done, email your answers to the course account with “Syntax Studio I” in the subject line
