This presentation explains why the scanner and parser are separated when processing programming language syntax. It looks at the computation the scanner performs, walks through an ad hoc scanner, and then covers the steps for generating a scanner from finite automata: building an NFA, converting it to a DFA, minimizing the DFA, and implementing the resulting automaton.
Programming Language Syntax 3 http://flic.kr/p/zCyMp
Why separate scanner and parser? ANTLR generates for you … • The parser is much more computationally intensive than the scanner • The scanner considerably reduces the number of items that the parser must inspect But what computation is involved in the scanner? That’s today’s topic…
Complete this ad hoc scanner Tokens
Here’s a solution Tokens
Ad hoc scanner implementations are common for production languages • Fast, compact code • But… • Finite automata can be generated automatically from a set of regular expressions • Good for languages still under development • Easy to regenerate the scanner
Calculator language scanner automaton • Note that this is a deterministic finite automaton (DFA) • Only ever one possible transition for an input character
Three steps to generate a scanner • Generate a nondeterministic finite automaton (NFA) • Multiple transitions out of a state for the same character • Epsilon transitions (ε) • Convert the NFA to a DFA • No need to search all paths in a DFA • Optimize by minimizing the number of states in the DFA
Construct an NFA for this regex Solve in this order: “.”, “d”, “d*”, “.d”, “d.”, “.d|d.”
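The bottom-up order above can be sketched in code. The following is a minimal illustration, assuming a Thompson-style construction in which each regex piece becomes a small NFA fragment with one start and one final state. The helper names (literal, concat, union, star, accepts) are hypothetical, the character "d" is used literally as a stand-in for "any digit", and only the pieces listed on the slide are built, since the full regex is not reproduced here.

```python
import itertools

EPS = None                      # marker for an epsilon transition
_ids = itertools.count()        # source of fresh state numbers


def literal(symbol):
    """Fragment that consumes exactly one symbol."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, symbol, f)]


def concat(a, b):
    """Fragment for 'a followed by b': link a's final state to b's start."""
    (sa, fa, ea), (sb, fb, eb) = a, b
    return sa, fb, ea + eb + [(fa, EPS, sb)]


def union(a, b):
    """Fragment for 'a | b': a new start branches to both alternatives."""
    (sa, fa, ea), (sb, fb, eb) = a, b
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, sa), (s, EPS, sb), (fa, EPS, f), (fb, EPS, f)]
    return s, f, ea + eb + extra


def star(a):
    """Fragment for 'a*': allow skipping a entirely or repeating it."""
    sa, fa, ea = a
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, sa), (s, EPS, f), (fa, EPS, sa), (fa, EPS, f)]
    return s, f, ea + extra


def accepts(nfa, text):
    """Simulate the NFA by tracking every state it might be in."""
    start, final, edges = nfa

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for src, sym, dst in edges:
                if src == s and sym is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for ch in text:
        step = {dst for src, sym, dst in edges if src in current and sym == ch}
        current = closure(step)
    return final in current


# Build the pieces in the order given on the slide.
d_star = star(literal("d"))                              # d*
dot_d = concat(literal("."), literal("d"))               # .d
d_dot = concat(literal("d"), literal("."))               # d.
either = union(dot_d, d_dot)                             # .d|d.
```

Each helper calls literal afresh so that fragments never share states. The accepts function searches all possible paths at once, which is exactly the cost that converting to a DFA avoids.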
NFA to DFA conversion • “Set of subsets” construction • The state of the DFA after reading a given input represents the set of states the NFA might have reached • Example: the start state is {1, 2, 4}; on input d the DFA moves to {2, 3, 4}
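As a rough sketch of the construction, the code below converts a small NFA to a DFA in which every DFA state is a set of NFA states. The NFA here is a hypothetical one for the regex .d|d. (with "d" standing for any digit); its state numbers are made up for illustration and do not match the numbering in the slide's example.

```python
from collections import deque

EPS = None  # marker for epsilon transitions

# Hypothetical NFA for  .d | d.  : (state, symbol) -> set of next states
NFA = {
    (0, EPS): {1, 4},
    (1, "."): {2},
    (2, "d"): {3},
    (4, "d"): {5},
    (5, "."): {6},
}
NFA_START = 0
NFA_FINAL = {3, 6}


def epsilon_closure(states):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, symbol):
    """NFA states reachable from `states` by consuming one `symbol`."""
    result = set()
    for s in states:
        result |= NFA.get((s, symbol), set())
    return result


def nfa_to_dfa(alphabet):
    """Return (start, transitions, finals); DFA states are frozensets."""
    start = epsilon_closure({NFA_START})
    transitions = {}
    seen = {start}
    work = deque([start])
    while work:
        current = work.popleft()
        for symbol in alphabet:
            target = epsilon_closure(move(current, symbol))
            if not target:
                continue  # no NFA state can continue: the DFA has no edge
            transitions[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    finals = {s for s in seen if s & NFA_FINAL}
    return start, transitions, finals


start, transitions, finals = nfa_to_dfa(["d", "."])
```

Using frozenset for DFA states lets them serve as dictionary keys, so two paths that reach the same set of NFA states collapse into a single DFA state automatically.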
Minimizing the DFA Steps: • Merge all non-final states into a single state and merge all final states into a single state (expect ambiguity: a merged state may have transitions on the same input to more than one state) • For each ambiguous input, split the merged states back toward their original division until the input is no longer ambiguous
Minimizing the DFA Step 1: Merge states into non-final and final
Minimizing the DFA Step 2 (repeated): Disambiguate by splitting First, we disambiguate “d” Now, how to disambiguate “.”?
Minimizing the DFA Step 2 (repeated): Disambiguate by splitting
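The merge-then-split procedure can be sketched as partition refinement. This is a minimal illustration under some assumptions: the DFA below is a made-up five-state example (not the one pictured on the slides), a missing transition is treated as "no move", and the code splits on all input symbols in one pass rather than one symbol at a time as the slides do. Both variants stop at the same final grouping.

```python
def minimize(states, alphabet, delta, finals):
    """Group equivalent DFA states; returns a set of frozensets."""
    # Step 1: merge into (at most) two groups, non-final and final.
    partition = [g for g in (frozenset(states - finals), frozenset(finals)) if g]

    changed = True
    while changed:
        changed = False
        group_of = {s: i for i, group in enumerate(partition) for s in group}
        refined = []
        for group in partition:
            # Step 2: states in a group stay together only if, for every
            # symbol, they move into the same group (None = no transition).
            buckets = {}
            for s in group:
                signature = tuple(
                    group_of.get(delta.get((s, a))) for a in alphabet
                )
                buckets.setdefault(signature, set()).add(s)
            if len(buckets) > 1:
                changed = True  # this group was ambiguous and got split
            refined.extend(frozenset(b) for b in buckets.values())
        partition = refined
    return set(partition)


# Hypothetical DFA: states 1 and 2 behave identically and should merge.
STATES = {0, 1, 2, 3, 4}
FINALS = {4}
DELTA = {
    (0, "d"): 1,
    (1, "d"): 2, (1, "."): 3,
    (2, "d"): 2, (2, "."): 3,
    (3, "d"): 4,
    (4, "d"): 4,
}

groups = minimize(STATES, ("d", "."), DELTA, FINALS)
```

In this example the first pass separates states 0, 3, and the pair {1, 2} from one another, and the second pass finds nothing left to split, so the loop stops with four groups.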
How to implement a scanner based on an automaton? Two common approaches: • Nested case statements • Tend to be ad hoc • Table and driver • Tend to be generated
Nested-case approach • Outer cases handle states • Inner cases handle transitions (set a new state) • Note: Look-ahead may be necessary to accept the longest possible token
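A minimal sketch of the nested-case structure, assuming a hypothetical number token of the form digits, optionally followed by "." and more digits. The state names and the scan_number function are illustrative only. The outer if/elif chain dispatches on the state, the inner tests pick the transition, and last_accept remembers the end of the longest token seen so far so the scanner can back up after reading ahead too far.

```python
DIGITS = "0123456789"


def scan_number(text, pos):
    """Return the end index of the longest number starting at pos.

    Returns pos itself when no number starts there.
    """
    state = "START"
    last_accept = pos  # end of the longest token accepted so far
    i = pos
    while i < len(text):
        ch = text[i]
        if state == "START":            # outer case: current state
            if ch in DIGITS:            # inner case: transition
                state = "INT"
            else:
                break
        elif state == "INT":
            if ch in DIGITS:
                state = "INT"
            elif ch == ".":
                state = "DOT"           # look ahead: a digit must follow
            else:
                break
        elif state == "DOT":
            if ch in DIGITS:
                state = "FRAC"
            else:
                break
        elif state == "FRAC":
            if ch in DIGITS:
                state = "FRAC"
            else:
                break
        i += 1
        if state in ("INT", "FRAC"):    # accepting states
            last_accept = i
    return last_accept
```

For the input "12.x" the scanner consumes the "." while looking ahead, finds no digit after it, and falls back to the last accepting position, so only "12" is returned as the token.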
Table and driver approach Good news! We’ll let ANTLR handle the implementation!
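For intuition only, here is a rough sketch of what a table and driver can look like. It is not what ANTLR emits; the table, state names, token kinds, and the next_token function are all hypothetical, and the token set (integers and decimals) is chosen just to keep the table small. The automaton lives entirely in data, and the driver loop is the same no matter which language is being scanned.

```python
def char_class(ch):
    """Map a character to a column of the table (None = no column)."""
    if ch in "0123456789":
        return "d"
    if ch == ".":
        return "."
    return None


# Transition table: (state, character class) -> next state.
TABLE = {
    ("START", "d"): "INT",
    ("INT", "d"): "INT",
    ("INT", "."): "DOT",
    ("DOT", "d"): "FRAC",
    ("FRAC", "d"): "FRAC",
}

# Accepting states and the token kind each one produces.
ACCEPTING = {"INT": "INTEGER", "FRAC": "DECIMAL"}


def next_token(text, pos):
    """Driver: return (kind, lexeme, end) for the longest match, or None."""
    state = "START"
    last = None  # most recent (kind, end) seen in an accepting state
    i = pos
    while i < len(text):
        state = TABLE.get((state, char_class(text[i])))
        if state is None:
            break  # no entry in the table: stop and use the last accept
        i += 1
        if state in ACCEPTING:
            last = (ACCEPTING[state], i)
    if last is None:
        return None
    kind, end = last
    return kind, text[pos:end], end
```

Changing the token definitions means regenerating TABLE and ACCEPTING while leaving next_token untouched, which is why this style suits generated scanners.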
What’s next? • Homework 1 due next class!