Learn about parsing text, extracting relevant information, and using regular expressions for various applications. Explore parsing web pages, computer players for word games, and algorithms to extract specific data. Dive into a detailed guide on parsing HTML and creating abstract data structures. Master the art of parsing for practical applications.
Parsing
• Analyze text: split it into meaningful units (tokens)
• Extract relevant information, disregard irrelevant information
• ‘Meaningful’ and ‘relevant’ depend on the application: what are we looking for?
• Phone book: meaningful tokens are words and numbers
  • Search the phone book for all people named “Ole Hansen”
  • Search the phone book for phone numbers starting with 86
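The phone book example can be sketched with a regular expression. This is a minimal illustration only: the entry layout (name, address, 8-digit number, separated by commas) and all entries in `PHONE_BOOK` are made up for the example.

```python
import re

# Hypothetical phone book text; the layout and the entries are invented.
PHONE_BOOK = """\
Ole Hansen, Nørregade 1, 86123456
Anne Jensen, Søndergade 2, 75123456
Ole Hansen, Vestergade 3, 98123456
Peter Olesen, Østergade 4, 86654321
"""

# One entry per line: the tokens we care about are the name and the number.
ENTRY = re.compile(
    r"^(?P<name>[^,]+), (?P<address>[^,]+), (?P<number>\d{8})$",
    re.MULTILINE,
)

entries = [m.groupdict() for m in ENTRY.finditer(PHONE_BOOK)]

# All people named "Ole Hansen".
ole_hansens = [e for e in entries if e["name"] == "Ole Hansen"]

# All phone numbers starting with 86.
numbers_86 = [e["number"] for e in entries if e["number"].startswith("86")]

print(len(ole_hansens))  # 2
print(numbers_86)        # ['86123456', '86654321']
```

The same tokens serve both searches; only the filter applied to them changes.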
Parsing using regular expressions: the Torleif game
Sort of like Master Mind with words and letters:
• Two players, each picks a secret 5-letter noun
• Take turns guessing
• Score each guess by reporting
  • the number of correctly placed letters
  • the number of incorrectly placed letters that are also present in the secret word
Example (secret word: sport):
  trofæ: 1 correct, 2 incorrect
  frygt: 1 correct, 1 incorrect
Note: the opponent is not told which letters are correct/incorrect
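The scoring rule can be sketched as below. This is not the code from the slides: the function name `score` is an assumption, and so is the handling of repeated letters, which are counted as in Master Mind (each letter of the secret word matches at most one letter of the guess).

```python
from collections import Counter

def score(secret, guess):
    """Return (correct, incorrect) for a guess against the secret word."""
    # Letters that sit in the right position.
    correct = sum(s == g for s, g in zip(secret, guess))
    # Letters shared by both words, counted with multiplicity.
    shared = sum((Counter(secret) & Counter(guess)).values())
    # Shared letters that are not correctly placed count as "incorrect".
    return correct, shared - correct

print(score("sport", "trofæ"))  # (1, 2)
print(score("sport", "frygt"))  # (1, 1)
```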
Let’s write a computer player:
1. Pick a random word (from the homepage of Dansk Sprognævn).
2. Ask for a guess.
3. Was the guess correct? If so, stop.
4. Otherwise score the guess.
5. Go to 2.
Dansk Sprognævn, dictionary web page
• Ask for all words starting with ..
• The page displays at most 50 words at a time
• We are looking for 5-letter strings in bold followed by the string sb in italics (Danish substantiv = noun)
Parsing the web page
The source HTML of the web page has 370 lines. Some of it looks like this:
<META name="KeyWords" content="RO2001, Retskrivningsordbogen, ordbog, dictionary, orthography, Dansk Sprognævn">
<LINK rel="STYLESHEET" href="http://www.dsn.dk/ordbog.aux/ro2001ie.css" type="text/css">
<SCRIPT language="JavaScript" type="text/javascript">
if (document.searchForm && document.searchForm.P)
src="http://www.dsn.dk/ordbog.aux/lowerRight.gif"></td></tr></table></TD>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR><TD rowspan="2" valign="top"><TABLE BORDER=0 CELLSPACING=0 CELLPADDING=7 WIDTH=390>
<TR BGCOLOR="#d0e0d0"><TD> <B>spondæisk </B><I>adj., itk. d.s.</I> </TD></TR>
<TR><TD> <B>sporstof </B><I>sb., </I>-fet, -fer. </TD></TR>
<TR BGCOLOR="#d0e0d0"><TD> <B>sport </B><I>sb., </I>-en, <I>i sms. </I>sports-, <I>fx </I>sportsstævne. </TD></TR>
</HTML>
Pattern to look for: bold tag, 5-letter string, space, bold end-tag, italics tag, sb
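The pattern "bold tag, 5-letter string, space, bold end-tag, italics tag, sb" translates fairly directly into a regular expression. The sketch below is one possible encoding, not necessarily the one used in get_random_word.py; the name `NOUN` is an assumption.

```python
import re

# Bold tag, exactly five letters, space, bold end-tag, italics tag, "sb".
# [^\W\d_] matches a single letter, including Danish letters such as æ, ø, å.
NOUN = re.compile(r"<B>([^\W\d_]{5}) </B><I>sb\b")

# Three table rows taken from the HTML excerpt above.
html = """\
<TR BGCOLOR="#d0e0d0"><TD> <B>spondæisk </B><I>adj., itk. d.s.</I> </TD></TR>
<TR><TD> <B>sporstof </B><I>sb., </I>-fet, -fer. </TD></TR>
<TR BGCOLOR="#d0e0d0"><TD> <B>sport </B><I>sb., </I>-en, <I>i sms. </I>sports-, <I>fx </I>sportsstævne. </TD></TR>
"""

five_letter_nouns = NOUN.findall(html)
print(five_letter_nouns)  # ['sport']
```

Of the three rows, spondæisk is rejected because it is an adjective (and too long), and sporstof because it has eight letters, so only sport remains.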
Algorithm for picking a random word
• Pick a random initial letter x (weighted: count the total number of words beginning with each letter)
• Pick a random index in the list of all words starting with x
• Ask the website for the web page with the next 50 x-words starting at the chosen index
• Parse the web page and look for the first 5-letter noun
• If none is found, ask for the next 50 (with wrap-around)
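The steps above can be sketched as follows. This is an illustration under stated assumptions, not the contents of get_random_word.py: `counts` (number of words per initial letter) and `fetch_page(letter, index)` (returns the HTML for up to 50 words starting at `index`) are hypothetical stand-ins for the website access, and wrap-around is simplified to restarting at index 0.

```python
import random
import re

# Bold tag, five letters, space, bold end-tag, italics tag, "sb".
NOUN = re.compile(r"<B>([^\W\d_]{5}) </B><I>sb\b")

def pick_random_word(counts, fetch_page, rng=random, page_size=50):
    """Pick a random 5-letter noun, or return None if there is none.

    counts maps each initial letter to its number of words (hypothetical);
    fetch_page(letter, index) returns the HTML of one result page (hypothetical).
    """
    # Weighted choice of the initial letter.
    letters = list(counts)
    weights = [counts[x] for x in letters]
    letter = rng.choices(letters, weights=weights)[0]

    # Random start index among the words beginning with that letter.
    total = counts[letter]
    index = rng.randrange(total)

    # Enough pages to cover every word once, allowing for the wrap-around.
    pages = (total + page_size - 1) // page_size + 1
    for _ in range(pages):
        match = NOUN.search(fetch_page(letter, index))
        if match:
            return match.group(1)
        index += page_size
        if index >= total:
            index = 0  # wrap around to the start of the list
    return None
```

Passing `fetch_page` in as a parameter keeps the selection logic separate from the network access, so it can be tried out with a fake page source.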
This URL and these parameters are needed to retrieve the HTML: get_random_word.py (part 2)
Words shown on the slide: vægur sepia crack lampe afart hoppe kinin rubin havre garde
Game program (torleif.py)
? sport
sport 1c 1i
? stang
stang 1c 2i
sport 1c 1i
? satin
satin 3c 0i
stang 1c 2i
sport 1c 1i
? salon
salon 5c 0i
satin 3c 0i
stang 1c 2i
sport 1c 1i
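A game loop that produces this kind of transcript might look like the sketch below. It is not torleif.py itself: the names `score` and `play` are assumptions, and the guesses are passed in as a list instead of being read from the keyboard so that the example is easy to run.

```python
from collections import Counter

def score(secret, guess):
    """Return (correct, incorrect), counting repeated letters as in Master Mind."""
    correct = sum(s == g for s, g in zip(secret, guess))
    shared = sum((Counter(secret) & Counter(guess)).values())
    return correct, shared - correct

def play(secret, guesses):
    """Return the lines printed after each guess, newest guess first."""
    history, output = [], []
    for guess in guesses:
        correct, incorrect = score(secret, guess)
        # Put the newest scored guess on top of the history.
        history.insert(0, f"{guess} {correct}c {incorrect}i")
        output.extend(history)
        if correct == len(secret):
            break  # the guess was correct, the game is over
    return output

for line in play("salon", ["sport", "stang", "satin", "salon"]):
    print(line)
```

With the secret word salon, this reproduces the score lines of the session above.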
Two-pass parsing
Evolutionary tree of life (animal kingdom)
• Huge hierarchy of groups and subgroups
• Each node in the tree has a name and a (possibly empty) list of descendant trees (sons)
Source: The origin and evolution of model organisms, Nature Reviews Genetics, Nov. 2002, vol. 3.
Abstract data structure to represent a tree (phylogeny) phylogeny.py
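A tree node of this kind could be sketched as below. Only the class name `Phylogeny_node` appears in the slides; the attribute and method names (`name`, `father`, `sons`, `add_son`, `siblings`) are assumptions chosen for illustration.

```python
class Phylogeny_node:
    """A labeled tree node with a (possibly empty) list of sons."""

    def __init__(self, name, father=None):
        self.name = name
        self.father = father  # None for the root
        self.sons = []        # descendant trees

    def add_son(self, name):
        """Create a new node below this one and return it."""
        son = Phylogeny_node(name, father=self)
        self.sons.append(son)
        return son

    def siblings(self):
        """Return the other sons of this node's father."""
        if self.father is None:
            return []
        return [node for node in self.father.sons if node is not self]

reptilia = Phylogeny_node("Reptilia")
diapsida = reptilia.add_son("Diapsida")
testudines = reptilia.add_son("Testudines")
testudines.add_son("Turtle")
print([node.name for node in diapsida.siblings()])  # ['Testudines']
```

Storing a reference to the father in each node is what makes moving up the tree and listing siblings straightforward.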
How can we write a tree to a sequential file?
• Informally:
  • A tree is a labeled node containing a (possibly empty) list of other trees
  • Write a tree node using start and end tags: <N="Insects"> [sons] </N>
• Formally (context-free grammar):
  T → <N="L">S</N>
  S → λ | TS
  L → string label
(Figure: example tree rooted at Insects, with subtrees Flies and Beetles and leaves A to E.)
Recursive method for the string representation of a tree (phylogeny.py)
• This method is called when printing a Phylogeny_node object
• First obtain the string representation of the sons (empty string if there are no sons) by calling the function recursively
• Then create a string with start tag, label, the sons’ representation, and end tag
Example output: .. <N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N> ..
(Figure: the example Insects tree.)
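The recursive method can be sketched as a `__str__` method, which Python calls when an object is printed. This is a minimal version under assumptions: the constructor signature and the attribute names `name` and `sons` are not taken from phylogeny.py.

```python
class Phylogeny_node:
    def __init__(self, name, sons=None):
        self.name = name
        self.sons = sons if sons is not None else []

    def __str__(self):
        # First the sons, recursively (empty string if there are no sons) ...
        sons_repr = "".join(str(son) for son in self.sons)
        # ... then start tag with label, the sons' representation, and end tag.
        return f'<N="{self.name}">{sons_repr}</N>'

beetles = Phylogeny_node("Beetles", [Phylogeny_node(x) for x in "CDE"])
print(beetles)  # <N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N>
```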
Larger tree – How can we read a tree from a sequential file? <N="Terrestrialvertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N> part_of_the_tree_of_life.txt We need a parser!
Two-pass parsing
Complex parsing is often split into two passes:
• Lexical analysis
  • Identify and assemble tokens: logical units of text
• Structural analysis
  • Determine the structural hierarchy of the tokens
Lexical analysis (phylogenyparser.py)
• Match either a start tag or an end tag
• Define a group containing the start tag’s label
• Search the text from the index pointer
• Create a token of the right type
• Move the index pointer
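The lexical pass might be sketched like this. The regular expression, the token representation (tuples tagged "START" or "END") and the function name `tokenize` are assumptions; the code in phylogenyparser.py may differ.

```python
import re

# Either a start tag (the group "label" holds its label) or an end tag.
TAG = re.compile(r'<N="(?P<label>[^"]*)">|</N>')

def tokenize(text):
    """Split the text into ("START", label) and ("END", None) tokens."""
    tokens = []
    pos = 0  # index pointer into the text
    while True:
        # Search the text from the index pointer.
        match = TAG.search(text, pos)
        if match is None:
            break
        # Create a token of the right type.
        if match.group("label") is not None:
            tokens.append(("START", match.group("label")))
        else:
            tokens.append(("END", None))
        # Move the index pointer past the tag just read.
        pos = match.end()
    return tokens

print(tokenize('<N="Kangaroo"></N><N="Koala"></N>'))
# [('START', 'Kangaroo'), ('END', None), ('START', 'Koala'), ('END', None)]
```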
Structural analysis (phylogenyparser.py)
Example input: .. <N="Kangaroo"></N><N="Koala"></N> ..
(Figure: three numbered steps showing how current_node and new_node change as the Kangaroo node is processed.)
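The structural pass can be sketched with a single `current_node` pointer: a start tag creates a `new_node` below the current node and descends into it, and an end tag moves back up to the father. The class `Node` and the function `parse` are illustrative assumptions, not the code from phylogenyparser.py, and for brevity the tags are found with `finditer` instead of a separate token list.

```python
import re

TAG = re.compile(r'<N="(?P<label>[^"]*)">|</N>')

class Node:
    def __init__(self, name, father=None):
        self.name, self.father, self.sons = name, father, []

    def __str__(self):
        return f'<N="{self.name}">{"".join(map(str, self.sons))}</N>'

def parse(text):
    """Build a tree from the tagged text and return its root."""
    root = None
    current_node = None
    for match in TAG.finditer(text):
        label = match.group("label")
        if label is not None:
            # Start tag: create a new node below the current one, descend.
            new_node = Node(label, father=current_node)
            if current_node is None:
                root = new_node
            else:
                current_node.sons.append(new_node)
            current_node = new_node
        else:
            # End tag: the current node is complete, move back up.
            if current_node is None:
                raise ValueError("end tag without matching start tag")
            current_node = current_node.father
    if current_node is not None:
        raise ValueError("missing end tag")
    return root

text = '<N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N>'
tree = parse(text)
print([son.name for son in tree.sons])  # ['Kangaroo', 'Koala']
```

Writing the parsed tree back out with `str(tree)` gives the original text again, which is a convenient check that reading and writing agree.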
Terrestrial vertebrates
  Synapsida
    Therapsida
      Mammalia
        Marsupialia
          Kangaroo
          Koala
        Eutheria
          Primates
            Human
            Gorilla
            Chimpanzee
          Carnivora
            Walrus
            Wolf
          Proboscidea
            Elephant
  Reptilia
    Diapsida
      Archosauromorpha
        Tyrannosaurus
        Penguin
        Owl
      Lepidosauromorpha
        Lizard
        Snake
    Testudines
      Turtle
Test program (phylogenyparser.py)
Navigating in the tree
Name: Diapsida
Father: Reptilia
Siblings: Testudines
Sons: Archosauromorpha Lepidosauromorpha
(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b
Number of sibling (0-0)? 0
Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle
(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p
<N="Testudines"><N="Turtle"></N></N>
Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle
(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f
Name: Reptilia
Father: Terrestrial vertebrates
Siblings: Synapsida
Sons: Diapsida Testudines
(Figure: the Reptilia subtree, with Diapsida (sons Archosauromorpha and Lepidosauromorpha) and Testudines (son Turtle).)