310 likes | 425 Views
Text Parsing in Python. - Gayatri Nittala - Madhubala Vasireddy. Text Parsing. The three W’s! Efficiency and Perfection. What is Text Parsing?. common programming task extract or split a sequence of characters. Why is Text Parsing?. Simple file parsing
E N D
Text Parsing in Python - Gayatri Nittala - Madhubala Vasireddy
Text Parsing • The three W’s! • Efficiency and Perfection
What is Text Parsing? • common programming task • extract or split a sequence of characters
Why is Text Parsing? • Simple file parsing • A tab separated file • Data extraction • Extract specific information from log file • Find and replace • Parsers – syntactic analysis • NLP • Extract information from corpus • POS Tagging
Text Parsing Methods • String Functions • Regular Expressions • Parsers
String Functions • String module in python • Faster, easier to understand and maintain • If you can do, DO IT! • Different built-in functions • Find-Replace • Split-Join • Startswith and Endswith • Is methods
Find and Replace • find, index, rindex, replace • EX: Replace a string in all files in a directory files = glob.glob(path) for line in fileinput.input(files,inplace=1): lineno = 0 lineno = string.find(line, stext) if lineno >0: line =line.replace(stext, rtext) sys.stdout.write(line)
startswith and endswith • Extract quoted words from the given text myString = "\"123\""; if (myString.startswith("\"")) print "string with double quotes“ • Find if the sentences are interrogative or exclamative • What an amazing game that was! • Do you like this? endings = ('!', '?') sentence.endswith(endings)
isMethods • to check alphabets, numerals, character case etc • m = 'xxxasdf ‘ • m.isalpha() • False
Regular Expressions • concise way for complex patterns • amazingly powerful • wide variety of operations • when you go beyond simple, think about regular expressions!
Real world problems • Match IP Addresses, email addresses, URLs • Match balanced sets of parenthesis • Substitute words • Tokenize • Validate • Count • Delete duplicates • Natural Language processing
RE in Python • Unleash the power - built-in re module • Functions • to compile patterns • complie • to perform matches • match, search, findall, finditer • to perform opertaions on match object • group, start, end, span • to substitute • sub, subn • - Metacharacters
Compiling patterns • re.complile() • pattern for IP Address • ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ • ^\d+\.\d+\.\d+\.\d+$ • ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$ • ^([01]?\d\d?|2[0-4]\d|25[0-])\. ([01]?\d\d?|2[0-4]\d|25[0-5])\. ([01]?\d\d?|2[0-4]\d|25[0-5])\. ([01]?\d\d?|2[0-4]\d|25[0-5])$
Compiling patterns • pattern for matching parenthesis • \(.*\) • \([^)]*\) • \([^()]*\)
Substitute • Perform several string substitutions on a given string import re def make_xlat(*args, **kwargs): adict = dict(*args, **kwargs) rx = re.compile('|'.join(map(re.escape, adict))) def one_xlate(match): return adict[match.group(0)] def xlate(text): return rx.sub(one_xlate, text) return xlate
Count • Split and count words in the given text • p = re.compile(r'\W+') • len(p.split('This is a test for split().'))
Tokenize • Parsing and Natural Language Processing • s = 'tokenize these words' • words = re.compile(r'\b\w+\b|\$') • words.findall(s) • ['tokenize', 'these', 'words']
Common Pitfalls • operations on fixed strings, single character class, no case sensitive issues • re.sub() and string.replace() • re.sub() and string.translate() • match vs. search • greedy vs. non-greedy
PARSERS • Flat and Nested texts • Nested tags, Programming language constructs • Better to do less than to do more!
Parsing Non flat texts • Grammar • States • Generate tokens and Act on them • Lexer - Generates a stream of tokens • Parser - Generate a parse tree out of the tokens • Lex and Yacc
Grammar Vs RE • Floating Point #---- EBNF-style description of Python ---# floatnumber ::= pointfloat | exponentfloat pointfloat ::= [intpart] fraction | intpart "." exponentfloat ::= (intpart | pointfloat) exponent intpart ::= digit+ fraction ::= "." digit+ exponent ::= ("e" | "E") ["+" | "-"] digit+ digit ::= "0"..."9"
Grammar Vs RE pat = r'''(?x) ( # exponentfloat ( # intpart or pointfloat ( # pointfloat (\d+)?[.]\d+ # optional intpart with fraction | \d+[.] # intpart with period ) # end pointfloat | \d+ # intpart ) # end intpart or pointfloat [eE][+-]?\d+ # exponent ) # end exponentfloat | ( # pointfloat (\d+)?[.]\d+ # optional intpart with fraction | \d+[.] # intpart with period ) # end pointfloat '''
PLY - The Python Lex and Yacc • higher-level and cleaner grammar language • LALR(1) parsing • extensive input validation, error reporting, and diagnostics • Two moduoles lex.py and yacc.py
Using PLY - Lex and Yacc • Lex: • Import the [lex] module • Define a list or tuple variable 'tokens', the lexer is allowed to produce • Define tokens - by assigning to a specially named variable ('t_tokenName') • Build the lexer • mylexer = lex.lex() • mylexer.input(mytext) # handled by yacc
Lex t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*' def t_NUMBER(t): r'\d+' try: t.value = int(t.value) except ValueError: print "Integer value too large", t.value t.value = 0 return t t_ignore = " \t"
Yacc • Import the 'yacc' module • Get a token map from a lexer • Define a collection of grammar rules • Build the parser • yacc.yacc() • yacc.parse('x=3')
Yacc • Specially named functions having a 'p_' prefix def p_statement_assign(p): 'statement : NAME "=" expression' names[p[1]] = p[3] def p_statement_expr(p): 'statement : expression' print p[1]
Summary • String Functions A thumb rule - if you can do, do it. • Regular Expressions Complex patterns - something beyond simple! • Lex and Yacc Parse non flat texts - that follow some rules
References • http://docs.python.org/ • http://code.activestate.com/recipes/langs/python/ • http://www.regular-expressions.info/ • http://www.dabeaz.com/ply/ply.html • Mastering Regular Expressions by Jeffrey E F. Friedl • Python Cookbook by Alex Martelli, Anna Martelli & David Ascher • Text processing in Python by David Mertz