100 likes | 161 Views
Notes on Python Regular Expressions and parser generators (by D. Parson). These are the Python supplements to the author’s slides for Chapter 1 and Section 2.1.
E N D
Notes on Python Regular Expressions and parser generators (by D. Parson) • These are the Python supplements to the author’s slides for Chapter 1 and Section 2.1. • http://faculty.kutztown.edu/parson/spring2010/CSC310Spring2010.html has a link to the author’s slides, which are password protected by your K.U. Windows login / password used to access your student account.
Regular Expressions in Python • re module in the optional Python text. • http://docs.python.org/library/re.html • A RE is a pattern in the form of a string. • compile(pattern [, flags]) compiles an RE expression into a finite automaton object. • Return value can be used by other functions. • Flags are for case, multiline, and meta-character options. • search(pattern, string [, flags) searches string for the first match of pattern. • match(pattern, string [, flags) checks at string’s beginning. • Both return a MatchObject or None.
Regular Expressions in Python • split(pattern, string [, maxsplit = 0]) splits string into occurrences of pattern. • Returns a list of strings • sub(pattern, repl, string [, count = 0]) performs substitutions of repl for pattern occurrences. • String and sequence operations are related. • http://docs.python.org/library/string.html • >>> s = "abcde" • >>> dir(s) • ['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Python Regular Expression Examples • >>> m1 = search('a+z*(b.d)', 'abcdefghi') • >>> m1 • <_sre.SRE_Match object at 0x11c520> • >>> m1.groups() • ('bcd',) • >>> m1.start() • 0 • >>> m1.end() • 4 • >>> m1.start(0) • 0 • >>> m1.start(1) • 1 # Group 0 is the entire match, 1 is the first parenthesized subexpression, etc.
Learn the major Meta-characters! • Text – verbatim text • . – any character except newline • ^ – matches start of the string (anchor) • $ – matches end of the string • * – Kleene start, 0 or more subpattern repetitions • + – Kleene plus, 1 or more subpattern repetitions • ? – optional, 0 or 1 subpattern occurrence • | – alternation, either left or right subpattern • () – group a subexpression inside parentheses • \ – escape a meta-character (make it normal) • [set of chars], [^set of chars not matched]
More Python RE Examples • >>> m2 = search('a+z*(b.d)', 'Abcde') • >>> m2 • >>> print m2 • None • >>> split(':', "abc:cd:e:f") • ['abc', 'cd', 'e', 'f'] • >>> split('[:]', "abc:cd:e:f") • ['abc', 'cd', 'e', 'f'] • >>> split('[^:]', "abc:cd:e:f") • ['', '', '', ':', '', ':', ':', '']
More Python RE Examples (sub) • >>> sub('a([^b]+)b', 'A\\1B', 'a123b45ab67a9b aab') • 'A123B45ab67A9B AaB' • The parenthesized subexpression matches one or more occurrences of anything except for b. • The matched substring of the first parenthesized subexpression is group 1. • The replacement pattern \1 says “insert group 1 at this point.” • Effect is to re-insert characters between a and b.
Finite State Automata • A regular expression compiler translates a regular expression into a finite state automaton. • This could be a linked data structure or code. It looks like a graph of mapping steps needed for the regular expression. • There are nondeterministic and deterministic flavors. • (a|b)c+d is a simple example expression. c c a s1 ε start s3 s4 accept b c d s2
Lookahead 1 types of parsers. • LL(1) and LR(1) grammars require a parser to get at most 1 look-ahead terminal from the scanner. • LL(1) cannot handle left-recursive grammar productions. It can handle other recursion. • LR(1) and its variants can handle left, right and nested recursion; left is the most efficient. • A generated parser is essentially a deterministic finite state automaton that uses a stack to keep track of nested syntactic structures. • This topic is covered exhaustively in compiler design.
Parser generators in Python. • YAPPS2 is an LL(1) parser generator. • http://theory.stanford.edu/~amitp/yapps/ • http://pypi.python.org/pypi/Yapps2 • PLY is a Python LALR(1) (subset of LR(1)) equivalent to UNIX YACC and GNU Bison that are used to generate compilers for C code. • http://www.dabeaz.com/ply/ • Both generate Python executable parsers from stylized Python code.