Announcements • All groups have been assigned • Homework: • By this evening email everyone in your group and set up a meeting time to discuss project 4 • Project 4 will be released tomorrow • You will have roughly 3 weeks to work on it
How do I work in a team? • Communication • Teams that do not communicate well do poorly on the project • Understanding the assignment • Teams that sit down and go over the assignment together do well • Battle plan • Outline the project in your own English text • Code together • Difficult parts of the project are best done together
Parsing Text • The vast majority of the information present on the internet is in text form • Data, webpages, etc • We want to transform the data into a more usable form • Examples we have seen thus far: • Encoding of a matrix • Encoding of a tree • Project 3, changing text (encrypting and decrypting)
Example: Finding a nucleotide sequence • We can find DNA sequences of parasites on the internet (typically in databases) • Problem: we want to know if a sequence of nucleotides is in a particular parasite • We not only want to know “yes” or “no” but which parasite
What the data looks like
>Schisto unique AA825099
gcttagatgtcagattgagcacgatgatcgattgaccgtgagatcgacga
gatgcgcagatcgagatctgcatacagatgatgaccatagtgtacg
>Schisto unique mancons0736
ttctcgctcacactagaagcaagacaatttacactattattattattatt
accattattattattattattactattattattattattactattattta
ctacgtcgctttttcactccctttattctcaaattgtgtatccttccttt
How are we going to do it? • First, we get the sequences in a big string. • Next, we find where the small subsequence is in the big string. • From there, we need to work backwards until we find “>” which is the beginning of the line with the sequence name. • From there, we need to work forwards to the end of the line. From “>” to the end of the line is the name of the sequence • Yes, this is hard to get right.
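Let's Review Some Python • string.find(sub) – returns the lowest index where the substring sub is found, or -1 if it is not found • string.find(sub, start) – same as above, except the search is limited to the slice string[start:] • string.find(sub, start, end) – same as above, except the search is limited to the slice string[start:end] • In every case the returned index counts from the beginning of the whole string, not from the start of the slice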
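Let's Review Some Python • string.rfind(sub) – returns the highest index where the substring sub is found, or -1 if it is not found • string.rfind(sub, start) – same as above, except the search is limited to the slice string[start:] • string.rfind(sub, start, end) – same as above, except the search is limited to the slice string[start:end] • In every case the returned index counts from the beginning of the whole string, not from the start of the slice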
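A small sketch of how find and rfind behave with the optional start and end arguments. The string and variable names here are made up for illustration; the point is that the returned index is always a position in the whole string.

```python
text = "cat hat cat"

first = text.find("cat")            # lowest index of "cat"
last = text.rfind("cat")            # highest index of "cat"
missing = text.find("dog")          # "dog" is not present, so -1

# Search only text[1:]; the answer is still an index into text itself
after_first = text.find("cat", 1)

# Search only text[0:8]; the second "cat" starts at 8, so it is out of range
before_second = text.rfind("cat", 0, 8)

print(first, last, missing, after_first, before_second)
```

Running this prints 0 8 -1 8 0.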
Clicker Question: are these programs equivalent?
String = "two plus two is four"
Program 1: String.find("two")
Program 2: String.rfind("two")
A: yes B: no
def findSequence(seq):
    sequencesFile = "parasites.txt"
    file = open(sequencesFile, "r")
    sequences = file.read()
    file.close()
    seqloc = sequences.find(seq)
    if seqloc != -1:
        # Now, find the ">" that starts the line with the name of the sequence
        nameloc = sequences.rfind(">", 0, seqloc)  # using rfind() here!!
        # The name runs from ">" to the end of that line
        endline = sequences.find("\n", nameloc)
        print("Found in", sequences[nameloc:endline])
    else:
        print("Not found")
Why -1? • If .find or .rfind don't find something, they return -1 • If they return 0 or more, then it's the index where the search string was found • Note: last week we saw the urllib module • It contains a function that lets you download a file from the internet • How might you modify your program to first download the file from the internet prior to opening it?
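One possible answer to the question above, as a sketch only. It assumes Python 3, where the download function lives in urllib.request. The names download_text and find_sequence_name are hypothetical helpers, and the URL in the comment is a placeholder, not a real data source.

```python
import urllib.request

def download_text(url):
    # Fetch the resource at url and return it as one big string.
    # Assumes the content is UTF-8 text.
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def find_sequence_name(sequences, seq):
    # Same search as findSequence, but on a string that is already in memory.
    seqloc = sequences.find(seq)
    if seqloc == -1:
        return None
    nameloc = sequences.rfind(">", 0, seqloc)
    endline = sequences.find("\n", nameloc)
    if endline == -1:
        # The name is on the last line and has no trailing newline
        endline = len(sequences)
    return sequences[nameloc:endline]

# Hypothetical usage (placeholder URL):
# sequences = download_text("http://example.com/parasites.txt")
# print(find_sequence_name(sequences, "tagatgtcagattgagc"))
```

Separating the download from the search means the search can be tried on a short string first, without any network access.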
Running the program
>>> findSequence("tagatgtcagattgagcacgatgatcgattgacc")
Found in >Schisto unique AA825099
>>> findSequence("agtcactgtctggttgaaagtgaatgcttccaccgatt")
Found in >Schisto unique mancons0736
One More Note on Parsing • We saw how to read a file as a string or list of strings • We saw how to leverage how data was structured to find specific information we were interested in • What if there are many pieces we want to extract?
Revisiting Split • String.split(delimiter) breaks the string String into parts, separated by the delimiter • print("a b c d".split(" ")) would print: ['a', 'b', 'c', 'd'] • There are some quirky cases for string.split() • Explained in pre lab 10
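A few edge cases of split that are easy to trip over. These are standard Python behavior; whether they are the same quirky cases covered in the pre lab is an assumption.

```python
# Splitting on a single space keeps an empty string where spaces repeat
double_space = "a  b".split(" ")      # two spaces between a and b

# With no argument, split() treats any run of whitespace as one separator
any_whitespace = "a  b\tc\n".split()

# Splitting an empty string on a delimiter gives a list with one empty string
empty_input = "".split(",")

print(double_space)
print(any_whitespace)
print(empty_input)
```

This prints ['a', '', 'b'], then ['a', 'b', 'c'], then [''].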
Why is this useful? • When reading in a file, we may have many interesting data items on a given line (or in the file) • Example: Lab 10
How to glue everything together • Step 1) get some interesting data • Step 2) open the file • Step 3) read the data from the file, either as one large string or a list of strings • Step 4) break this string (or list of strings) into the data we want (rfind, find, split)
Abstract Example
• Getting values from a text file
text = file.read()
lines = text.split("\n")          # list of strings, one per line
for element in lines:
    items = element.split(" ")    # list of strings, one per item on the line
Concrete Example
foo = "bab cad eag"
elem = foo.split(" ")
for i in elem:
    print(i.split("a"))
Prints:
['b', 'b']
['c', 'd']
['e', 'g']
CQ: How can I parse all the words in a file? • Assume we have read the file in as one big string (we used file.read()) and the file contains no punctuation • A) first split on "\n" and for each element in the result, we split on " " • B) only split on " "
Concrete Clicker Example
file = open("text.txt", "r")
content = file.read()
file.close()
line = content.split("\n")
for i in line:
    print(i.split(" "))
Contents of text.txt:
This is
a file
Prints:
['This', 'is']
['a', 'file']
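The two strategies from the clicker question can be compared directly. This sketch uses an in-memory string in place of the file so it runs anywhere; the variable names are illustrative.

```python
content = "This is\na file"

# Strategy A: split into lines first, then split each line on spaces
words_a = []
for line in content.split("\n"):
    for word in line.split(" "):
        words_a.append(word)

# Strategy B: split only on spaces; the newline stays inside one "word"
words_b = content.split(" ")

print(words_a)
print(words_b)
```

Strategy A gives ['This', 'is', 'a', 'file'], while strategy B gives ['This', 'is\na', 'file'] because nothing separates "is" from "a" except the newline.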
Example: Get the temperature • The weather is always available on the Internet. • Can we write a function that takes the current temperature out of a source like • http://www.ajc.com/weather or • http://www.weather.com?
The Internet is mostly text • Web pages are actually text in the format called HTML (HyperText Markup Language) • HTML isn't a programming language, it's an encoding language. • It defines a set of meanings for certain characters, but one can't program in it. • We can ignore the HTML meanings for now, and just look at patterns in the text.
Where's the temperature?
The word "temperature" doesn't really show up. But the temperature always follows the word "Currently", and always comes before the "<b>°</b>"
<td ><img src="/shared-local/weather/images/ps.gif" width="48" height="48" border="0"><font size=-2><br></font><font size="-1" face="Arial, Helvetica, sans-serif"><b>Currently</b><br>
Partly sunny<br>
<font size="+2">54<b>°</b></font><font face="Arial, Helvetica, sans-serif" size="+1">F</font></font></td>
</tr>
We can use the same algorithm we've seen previously • Grab the content out of a file in a big string. • We've saved the HTML page previously. • We've seen how to grab it directly. • Find the starting indicator ("Currently") • Find the ending indicator ("<b>°") • Read the previous characters
def findTemperature():
    weatherFile = "ajc-weather.html"
    file = open(weatherFile, "r")
    weather = file.read()
    file.close()
    # Find the Temperature
    curloc = weather.find("Currently")
    if curloc != -1:
        # Now, find the "<b>°" following the temp
        temploc = weather.find("<b>°", curloc)
        # The temperature sits between the preceding ">" and "<b>°"
        tempstart = weather.rfind(">", 0, temploc)
        print("Current temperature:", weather[tempstart+1:temploc])
    else:
        print("Can't find the temp")
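A slightly more defensive variant of the same idea, as a sketch. It takes the page text as a parameter and also checks that the ending indicator was found. The name extract_temperature and the shortened sample string are assumptions for illustration.

```python
def extract_temperature(weather):
    # Find the starting indicator
    curloc = weather.find("Currently")
    if curloc == -1:
        return None
    # Find the ending indicator that follows it
    temploc = weather.find("<b>°", curloc)
    if temploc == -1:
        return None
    # Work backwards from the ending indicator to the closest ">"
    tempstart = weather.rfind(">", curloc, temploc)
    if tempstart == -1:
        return None
    return weather[tempstart + 1:temploc]

# Shortened sample based on the HTML shown earlier
sample = '<b>Currently</b><br> Partly sunny<br> <font size="+2">54<b>°</b></font>'
print(extract_temperature(sample))
```

For the sample string this prints 54. Returning None instead of printing lets the caller decide what to do when the page layout changes.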
Homework • Email your group members • Read through the project 4 description when it becomes available
Dictionaries in Python • Useful Analogy: an actual Dictionary! • English dictionaries provide an association between a Word and a Definition • We use the Word to look up the Definition • Given a definition it would be very hard to look up the word
Dictionaries in Python • Much like a dictionary for the English language, Python dictionaries create an association between a key and a value • Key corresponds to a Word in our analogy • Value corresponds to a Definition
Dictionary Syntax • A dictionary is a collection of elements • Each element is a key/value pair, written key : value • Just like a list is defined by [ ], a dictionary is defined by { }
{'key1': value1, 'key2': value2, 'key3': value3}
Keys • A key can be any immutable type (we will consider two types) • Strings and Integers • Much like the [index] is used to select out an element from a list, for a dictionary we use [key]
A = {'key1': value1, 'key2': value2, 'key3': value3}
print(A['key2'])
Example: Simple Phone Book
phoneBook = {'Luke': '123 4567', 'Dr. Martino': '456 7890'}
# names are keys, phone numbers are values
def lookup(key):
    return phoneBook[key]
lookup('Dr. Martino')
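Looking up a name that is not in the phone book with [key] raises a KeyError. A sketch of one way to handle that, using the built-in dict.get method; the function name lookup_or_default and the default text are made up for this example.

```python
phoneBook = {'Luke': '123 4567', 'Dr. Martino': '456 7890'}

def lookup_or_default(key):
    # dict.get returns its second argument when the key is absent,
    # instead of raising KeyError the way phoneBook[key] would
    return phoneBook.get(key, 'no number on file')

known = lookup_or_default('Dr. Martino')
unknown = lookup_or_default('Nobody')
print(known)
print(unknown)
```

This prints 456 7890 and then no number on file.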
Clicker Question: are these programs equivalent?
Program 1:
A = ['mike', 'mary', 'marty']
print(A[1])
Program 2:
A = {0: 'mike', 1: 'mary', 2: 'marty'}
print(A[1])
A: yes B: no
Clicker Question: are these programs equivalent?
Program 1:
A = ['mike', 'mary', 'marty']
print(A[1])
Program 2:
A = {1: 'mary', 2: 'marty', 0: 'mike'}
print(A[1])
A: yes B: no
Key Differences from Lists • Lists are ordered • Index is implicit based on the list ordering • Dictionaries are not indexed by position • Keys are specified and do not depend on order (since Python 3.7 a dictionary remembers insertion order, but elements are still looked up by key) • Lists are useful for storing ordered data, dictionaries are useful for storing relational data • Motivating example from book: databases!
Updating a Dictionary • Much like a list we can assign to a dictionary
Abstract: dictionary[key] = newValue
Concrete Example:
A = {0: 'mike', 1: 'mary', 2: 'marty'}
print(A[1])
A[1] = 'alex'
print(A[1])
Adding to a Dictionary • Much like a list we can append to a dictionary
Abstract: dictionary[newKey] = newValue
Concrete Example:
A = {0: 'mike', 1: 'mary', 2: 'marty'}
print(A[1])
A[3] = 'alex'
print(A)
Prints:
mary
{0: 'mike', 1: 'mary', 2: 'marty', 3: 'alex'}
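Updating and adding use the same syntax; which one happens depends only on whether the key already exists. A short sketch (the names in the dictionary are illustrative):

```python
A = {0: 'mike', 1: 'mary', 2: 'marty'}

A[1] = 'alex'    # key 1 already exists, so its value is replaced
A[3] = 'sam'     # key 3 is new, so a new element is added

size = len(A)                 # number of key/value pairs
has_key_3 = 3 in A            # `in` checks the keys
has_alex_key = 'alex' in A    # 'alex' is a value, not a key

print(size, has_key_3, has_alex_key)
```

This prints 4 True False.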
Clicker Question: What is the output of this code?
A = {0: 'mike', 1: 'mary', 2: 'marty', 'marty': 2, 'mike': 0, 'mary': 1}
A[3] = 'mary'
A['mary'] = 5
A[2] = A[0] + A[1]
A: {'mike': 0, 'marty': 2, 3: 'mary', 'mary': 5, 2: 'mikemary', 1: 'mary', 0: 'mike'}
B: {'mike': 0, 'marty': 2, 'mary': 3, 'mary': 5, 2: 'mikemary', 1: 'mary', 0: 'mike'}
C: {'mike': 0, 'marty': 2, 'mary': 3, 'mary': 5, 2: 1, 1: 'mary', 0: 'mike'}
Printing a Dictionary
A = {0: 'mike', 1: 'mary', 2: 'marty'}
for k in A:
    print(k)
Prints (in insertion order on Python 3.7+; older versions may use a different order):
0
1
2
A = {0: 'mike', 1: 'mary', 2: 'marty'}
for k, v in A.items():
    print(k, ":", v)
Prints:
0 : mike
1 : mary
2 : marty
Project 4: Frequency Analysis Intuition • We can leverage a dictionary to calculate the number of times a particular letter occurs in a message • We can use characters as the keys • The number of times that character occurs is the value • Increment the value each time we see a character • Initially the value starts at 0
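The counting idea above can be sketched as follows. This is only an illustration of the intuition, not the Project 4 specification; the function name letter_frequencies is hypothetical.

```python
def letter_frequencies(message):
    counts = {}
    for ch in message:
        # Initially the value starts at 0
        if ch not in counts:
            counts[ch] = 0
        # Increment the value each time we see the character
        counts[ch] = counts[ch] + 1
    return counts

result = letter_frequencies("hello")
print(result)
```

For "hello" the result maps 'h', 'e', and 'o' to 1 and 'l' to 2.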
Some Additional Notation: Pairs in Python • We can create pairs in Python • Example: pair = ('name', 3) • Such pairs are called tuples (see page 291) • Tuples support [] for selecting their elements • Tuples are immutable (like strings) • Further reading (section 5.3): • http://docs.python.org/tutorial/datastructures.html#tuples-and-sequences
Tuples • We can think of tuples as an immutable list • They do not support assignment • Example:
A = ('me', 5, 32, 'joe')
print(A[0])
print(A[3])
A[2] = 4   # <--- this raises a TypeError
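A sketch that catches the error from the example above and shows one common workaround: building a new tuple rather than changing the old one. The variable names outcome and B are illustrative.

```python
A = ('me', 5, 32, 'joe')

try:
    A[2] = 4
    outcome = "assignment worked"
except TypeError:
    # Tuples do not support item assignment
    outcome = "TypeError"

# To "change" an element, build a new tuple from slices of the old one
B = A[:2] + (4,) + A[3:]

print(outcome)
print(B)
```

This prints TypeError and then ('me', 5, 4, 'joe'); the original tuple A is unchanged.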
Creating a dictionary from a list • Python provides the dict function to create a dictionary out of a list of pairs
Example: dict([(0, 'mike'), (1, 'mary'), (2, 'marty')])
• Why do I care? • We can leverage list creation shortcuts to populate dictionaries!
Example: dict([(x, x**2) for x in range(10)])
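Two related shortcuts, shown as a sketch next to the dict call from the slide. The dictionary comprehension and zip are standard Python features, though they are not mentioned in the slides themselves.

```python
# The form from the slide: a list of (key, value) pairs passed to dict
squares_from_pairs = dict([(x, x**2) for x in range(10)])

# The same dictionary written as a dictionary comprehension
squares_from_comprehension = {x: x**2 for x in range(10)}

# zip pairs up two lists element by element, which dict can then consume
names = dict(zip([0, 1, 2], ['mike', 'mary', 'marty']))

print(squares_from_pairs == squares_from_comprehension)
print(names)
```

This prints True and then {0: 'mike', 1: 'mary', 2: 'marty'}.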