Notes on Python Regular Expressions and parser generators (by D. Parson)

Notes on Python Regular Expressions and parser generators (by D. Parson) • These are the Python supplements to the author’s slides for Chapter 1 and Section 2.1. • http://faculty.kutztown.edu/parson/spring2010/CSC310Spring2010.html has a link to the author’s slides, which are password protected by your K.U. Windows login / password used to access your student account.

Regular Expressions in Python • re module in the optional Python text. • http://docs.python.org/library/re.html • A RE is a pattern in the form of a string. • compile(pattern [, flags]) compiles an RE expression into a finite automaton object. • Return value can be used by other functions. • Flags are for case, multiline, and meta-character options. • search(pattern, string [, flags) searches string for the first match of pattern. • match(pattern, string [, flags) checks at string’s beginning. • Both return a MatchObject or None.

Regular Expressions in Python • split(pattern, string [, maxsplit = 0]) splits string into occurrences of pattern. • Returns a list of strings • sub(pattern, repl, string [, count = 0]) performs substitutions of repl for pattern occurrences. • String and sequence operations are related. • http://docs.python.org/library/string.html • >>> s = "abcde" • >>> dir(s) • ['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

Python Regular Expression Examples • >>> m1 = search('a+z*(b.d)', 'abcdefghi') • >>> m1 • <_sre.SRE_Match object at 0x11c520> • >>> m1.groups() • ('bcd',) • >>> m1.start() • 0 • >>> m1.end() • 4 • >>> m1.start(0) • 0 • >>> m1.start(1) • 1 # Group 0 is the entire match, 1 is the first parenthesized subexpression, etc.

Learn the major Meta-characters! • Text – verbatim text • . – any character except newline • ^ – matches start of the string (anchor) • $ – matches end of the string • * – Kleene start, 0 or more subpattern repetitions • + – Kleene plus, 1 or more subpattern repetitions • ? – optional, 0 or 1 subpattern occurrence • | – alternation, either left or right subpattern • () – group a subexpression inside parentheses • \ – escape a meta-character (make it normal) • [set of chars], [^set of chars not matched]

More Python RE Examples • >>> m2 = search('a+z*(b.d)', 'Abcde') • >>> m2 • >>> print m2 • None • >>> split(':', "abc:cd:e:f") • ['abc', 'cd', 'e', 'f'] • >>> split('[:]', "abc:cd:e:f") • ['abc', 'cd', 'e', 'f'] • >>> split('[^:]', "abc:cd:e:f") • ['', '', '', ':', '', ':', ':', '']

More Python RE Examples (sub) • >>> sub('a([^b]+)b', 'A\\1B', 'a123b45ab67a9b aab') • 'A123B45ab67A9B AaB' • The parenthesized subexpression matches one or more occurrences of anything except for b. • The matched substring of the first parenthesized subexpression is group 1. • The replacement pattern \1 says “insert group 1 at this point.” • Effect is to re-insert characters between a and b.

Finite State Automata • A regular expression compiler translates a regular expression into a finite state automaton. • This could be a linked data structure or code. It looks like a graph of mapping steps needed for the regular expression. • There are nondeterministic and deterministic flavors. • (a|b)c+d is a simple example expression. c c a s1 ε start s3 s4 accept b c d s2

Lookahead 1 types of parsers. • LL(1) and LR(1) grammars require a parser to get at most 1 look-ahead terminal from the scanner. • LL(1) cannot handle left-recursive grammar productions. It can handle other recursion. • LR(1) and its variants can handle left, right and nested recursion; left is the most efficient. • A generated parser is essentially a deterministic finite state automaton that uses a stack to keep track of nested syntactic structures. • This topic is covered exhaustively in compiler design.

Parser generators in Python. • YAPPS2 is an LL(1) parser generator. • http://theory.stanford.edu/~amitp/yapps/ • http://pypi.python.org/pypi/Yapps2 • PLY is a Python LALR(1) (subset of LR(1)) equivalent to UNIX YACC and GNU Bison that are used to generate compilers for C code. • http://www.dabeaz.com/ply/ • Both generate Python executable parsers from stylized Python code.

Notes on Python Regular Expressions and parser generators (by D. Parson)

Notes on Python Regular Expressions and parser generators (by D. Parson)

Presentation Transcript

Regular Expressions

Parser Generators

Regexes vs Regular Expressions; and Recursive Descent Parser

Python Regular Expressions

Python regular expressions

Regular Expressions

Regular Expressions in Python

Introduction to Python and Regular Expressions in Python

Regular Expressions

Regular Expressions

Notes on Python Regular Expressions and parser generators (by D. Parson)

Regular Expressions ( in Python)

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Python regular expressions

Python regular expressions

Python for NLP Regular Expressions