
Text Parsing in Python




  1. Text Parsing in Python - Gayatri Nittala - Madhubala Vasireddy

  2. Text Parsing • The three W’s! • Efficiency and Perfection

  3. What is Text Parsing? • common programming task • extract or split a sequence of characters

  4. Why Text Parsing? • Simple file parsing • A tab-separated file • Data extraction • Extract specific information from a log file • Find and replace • Parsers – syntactic analysis • NLP • Extract information from a corpus • POS tagging

  5. Text Parsing Methods • String Functions • Regular Expressions • Parsers

  6. String Functions • String module in python • Faster, easier to understand and maintain • If you can do, DO IT! • Different built-in functions • Find-Replace • Split-Join • Startswith and Endswith • Is methods

  7. Find and Replace • find, index, rindex, replace • EX: Replace a string in all files in a directory

      import fileinput, glob, sys
      files = glob.glob(path)
      for line in fileinput.input(files, inplace=1):
          if line.find(stext) >= 0:
              line = line.replace(stext, rtext)
          sys.stdout.write(line)

  8. startswith and endswith • Extract quoted words from the given text

      myString = '"123"'
      if myString.startswith('"'):
          print("string with double quotes")

  • Find if the sentences are interrogative or exclamative • What an amazing game that was! • Do you like this?

      endings = ('!', '?')
      sentence.endswith(endings)

  9. is methods • to check alphabets, numerals, character case etc.

      m = 'xxxasdf '
      m.isalpha()      # False (the trailing space is not alphabetic)
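The common is-methods can be tried directly in the interpreter; a quick sketch (sample strings invented):

```python
# str "is" methods answer yes/no questions about every character in the string.
print("xxxasdf".isalpha())   # True  - letters only
print("xxxasdf ".isalpha())  # False - trailing space breaks it
print("12345".isdigit())     # True  - digits only
print("Hello".isupper())     # False - not all uppercase
print("HELLO".isupper())     # True
```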

  10. Regular Expressions • concise way for complex patterns • amazingly powerful • wide variety of operations • when you go beyond simple, think about regular expressions!

  11. Real world problems • Match IP addresses, email addresses, URLs • Match balanced sets of parentheses • Substitute words • Tokenize • Validate • Count • Delete duplicates • Natural language processing

  12. RE in Python • Unleash the power - built-in re module • Functions • to compile patterns • compile • to perform matches • match, search, findall, finditer • to perform operations on a match object • group, start, end, span • to substitute • sub, subn • Metacharacters

  13. Compiling patterns • re.compile() • patterns for an IP address • ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ • ^\d+\.\d+\.\d+\.\d+$ • ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$ • ^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
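The strict last pattern can be checked quickly; building it by joining one per-octet sub-pattern (an equivalent rearrangement of the slide's alternation) keeps it readable:

```python
import re

# Each octet must be 0-255: 250-255, 200-249, or 0-199 (with optional leading 0/1).
octet = r'(25[0-5]|2[0-4]\d|[01]?\d\d?)'
ip_re = re.compile(r'^' + r'\.'.join([octet] * 4) + r'$')

print(bool(ip_re.match('192.168.0.1')))  # True
print(bool(ip_re.match('256.1.1.1')))    # False - 256 is out of range
print(bool(ip_re.match('10.0.0')))       # False - only three octets
```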

  14. Compiling patterns • patterns for matching parentheses • \(.*\) • \([^)]*\) • \([^()]*\)
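The three patterns behave very differently on nested input; a small demonstration (the sample string is invented):

```python
import re

s = 'f(a) + g(b(c))'
# Greedy '.*' runs from the first '(' to the last ')'.
print(re.findall(r'\(.*\)', s))      # ['(a) + g(b(c))']
# '[^)]*' stops at the first ')', but can still start inside a nested pair.
print(re.findall(r'\([^)]*\)', s))   # ['(a)', '(b(c)']
# '[^()]*' allows no parentheses inside: only innermost balanced pairs.
print(re.findall(r'\([^()]*\)', s))  # ['(a)', '(c)']
```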

  15. Substitute • Perform several string substitutions on a given string

      import re
      def make_xlat(*args, **kwargs):
          adict = dict(*args, **kwargs)
          rx = re.compile('|'.join(map(re.escape, adict)))
          def one_xlate(match):
              return adict[match.group(0)]
          def xlate(text):
              return rx.sub(one_xlate, text)
          return xlate
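The recipe above can be exercised like this; the swap dictionary is just an illustrative input, and the single-pass regex is what lets 'cat' and 'dog' trade places without clobbering each other:

```python
import re

def make_xlat(*args, **kwargs):
    # One regex matching any key; all substitutions happen in a single pass.
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate

translate = make_xlat({'cat': 'dog', 'dog': 'cat'})
print(translate('the cat chased the dog'))  # the dog chased the cat
```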

  16. Count • Split and count words in the given text • p = re.compile(r'\W+') • len(p.split('This is a test for split().'))
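One caveat worth noting: when the text ends in punctuation, splitting on \W+ leaves a trailing empty string, so len() over-counts the words by one. A sketch:

```python
import re

p = re.compile(r'\W+')
parts = p.split('This is a test for split().')
print(parts)       # ['This', 'is', 'a', 'test', 'for', 'split', ''] - note the ''
# Filter out empty strings to get the true word count.
words = [w for w in parts if w]
print(len(words))  # 6
```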

  17. Tokenize • Parsing and Natural Language Processing • s = 'tokenize these words' • words = re.compile(r'\b\w+\b|\$') • words.findall(s) • ['tokenize', 'these', 'words']

  18. Common Pitfalls • operations on fixed strings, a single character class, no case-sensitivity issues - string methods suffice • re.sub() and string.replace() • re.sub() and string.translate() • match vs. search • greedy vs. non-greedy
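Two of these pitfalls fit in a few lines (the sample strings are invented):

```python
import re

# match anchors at the start of the string; search scans anywhere.
print(re.match(r'\d+', 'abc123'))           # None - digits are not at the start
print(re.search(r'\d+', 'abc123').group())  # '123'

# Greedy '.*' takes the longest possible match; '.*?' the shortest.
html = '<b>bold</b> and <i>italic</i>'
print(re.findall(r'<.*>', html))   # ['<b>bold</b> and <i>italic</i>']
print(re.findall(r'<.*?>', html))  # ['<b>', '</b>', '<i>', '</i>']
```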

  19. PARSERS • Flat and Nested texts • Nested tags, Programming language constructs • Better to do less than to do more!

  20. Parsing non-flat texts • Grammar • States • Generate tokens and act on them • Lexer - generates a stream of tokens • Parser - generates a parse tree out of the tokens • Lex and Yacc

  21. Grammar vs. RE • Floating point

      #---- EBNF-style description of Python ----#
      floatnumber   ::= pointfloat | exponentfloat
      pointfloat    ::= [intpart] fraction | intpart "."
      exponentfloat ::= (intpart | pointfloat) exponent
      intpart       ::= digit+
      fraction      ::= "." digit+
      exponent      ::= ("e" | "E") ["+" | "-"] digit+
      digit         ::= "0"..."9"

  22. Grammar vs. RE

      pat = r'''(?x)
          (                        # exponentfloat
              (                    # intpart or pointfloat
                  (                # pointfloat
                      (\d+)?[.]\d+ # optional intpart with fraction
                    | \d+[.]       # intpart with period
                  )                # end pointfloat
                | \d+              # intpart
              )                    # end intpart or pointfloat
              [eE][+-]?\d+         # exponent
          )                        # end exponentfloat
        | (                        # pointfloat
              (\d+)?[.]\d+         # optional intpart with fraction
            | \d+[.]               # intpart with period
          )                        # end pointfloat
      '''
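A condensed version of this pattern (regrouped slightly, same alternatives) can be checked against the grammar with re.fullmatch:

```python
import re

# Same alternatives as the verbose pattern above, flattened for brevity.
pat = r'''(?x)
    (                                # exponentfloat
      ( (\d+)?[.]\d+ | \d+[.] | \d+ )   # intpart or pointfloat
      [eE][+-]?\d+                      # exponent
    )
  | ( (\d+)?[.]\d+ | \d+[.] )        # pointfloat
'''

floats = [s for s in ['3.14', '10.', '.5', '1e10', '42']
          if re.fullmatch(pat, s)]
print(floats)  # ['3.14', '10.', '.5', '1e10'] - '42' is an int, not a float
```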

  23. PLY - The Python Lex and Yacc • higher-level and cleaner grammar language • LALR(1) parsing • extensive input validation, error reporting, and diagnostics • Two modules: lex.py and yacc.py

  24. Using PLY - Lex and Yacc • Lex: • Import the lex module • Define a list or tuple variable 'tokens' naming the tokens the lexer is allowed to produce • Define tokens - by assigning to specially named variables ('t_tokenName') • Build the lexer

      mylexer = lex.lex()
      mylexer.input(mytext)  # token stream consumed by yacc

  25. Lex

      t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

      def t_NUMBER(t):
          r'\d+'
          try:
              t.value = int(t.value)
          except ValueError:
              print("Integer value too large", t.value)
              t.value = 0
          return t

      t_ignore = ' \t'

  26. Yacc • Import the 'yacc' module • Get a token map from a lexer • Define a collection of grammar rules • Build the parser • yacc.yacc() • yacc.parse('x=3')

  27. Yacc • Grammar rules are specially named functions with a 'p_' prefix

      def p_statement_assign(p):
          'statement : NAME "=" expression'
          names[p[1]] = p[3]

      def p_statement_expr(p):
          'statement : expression'
          print(p[1])

  28. Summary • String Functions - a rule of thumb: if you can do it, do it • Regular Expressions - complex patterns, something beyond simple! • Lex and Yacc - parse non-flat texts that follow some rules

  29. References • http://docs.python.org/ • http://code.activestate.com/recipes/langs/python/ • http://www.regular-expressions.info/ • http://www.dabeaz.com/ply/ply.html • Mastering Regular Expressions by Jeffrey E. F. Friedl • Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft & David Ascher • Text Processing in Python by David Mertz

  30. Thank You • Q & A
