Python Strings, chapter 8. From Think Python: How to Think Like a Computer Scientist.
Strings
• A string is a sequence of characters. You can access the individual characters one at a time with the bracket operator.
    >>> name = 'Simpson'
    >>> FirstLetter = name[0]
• In 'Simpson', name[0] is 'S', name[1] is 'i', name[4] is 's', and name[-1] == name[6] == name[len(name)-1] is 'n'.
• Also remember that len(name) is 7    # number of characters

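The indexing rules above can be checked with a short sketch. The variable names first_letter, last_letter and same_char are made up for this illustration and are not from the slides.

```python
name = 'Simpson'

first_letter = name[0]     # 'S'
last_letter = name[-1]     # 'n', the same character as name[len(name) - 1]

# A negative index -k refers to the same character as index len(name) - k.
same_char = all(name[-k] == name[len(name) - k]
                for k in range(1, len(name) + 1))
```

Note that name[len(name)] is one position past the end and raises an IndexError.
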
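Traversing a string several ways
    name = 'Richard Simpson'

    # 1. while loop with an explicit index
    index = 0
    while index < len(name):
        letter = name[index]
        print letter
        index = index + 1

    # 2. for loop over the indexes
    for i in range(len(name)):
        letter = name[i]
        print letter

    # 3. for loop directly over the characters
    for char in name:
        print char

    # 4. for loop over the indexes, without a temporary variable
    for i in range(len(name)):
        print name[i]
• Does this make sense?
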
Concatenation
• The + operator concatenates two strings.
    first = 'Monty'
    second = 'Python'
    full = first + second
    print full                        # MontyPython
• Reversing a string
    word = 'Hello Monty'
    rev_word = ''
    for char in word:
        rev_word = char + rev_word    # put each new character in front
    print rev_word                    # ytnoM olleH

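As a cross-check on the reversing loop, here is a small sketch that compares it with a slice using a step of -1. The step form of slicing is not covered on these slides. It is a standard Python feature, shown here only for comparison.

```python
word = 'Hello Monty'

# Loop version from the slide: prepend each character to the result.
rev_loop = ''
for char in word:
    rev_loop = char + rev_loop

# Slice version: a step of -1 walks the string from the end to the start.
rev_slice = word[::-1]
```

Both rev_loop and rev_slice hold 'ytnoM olleH'; the space between the two words is kept.
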
String Slices
• A slice is a connected subsegment (substring) of a string.
    s = 'Did you say shrubberies? '
    a = s[0:7]     # slice from 0 to 6 (not including 7); a is 'Did you'
    b = s[8:11]    # slice from 8 to 10; b is 'say'
    c = s[12:]     # slice from 12 to the end (returns a suffix); c is 'shrubberies? '
    d = s[:3]      # slice from 0 to 2 (returns a prefix); d is 'Did'

Strings are immutable
• You can only build new strings; you CANNOT modify an existing one. You can, however, rebind the name to a new string. For example:
    name = 'Superman'
    name[0] = 's'             # generates an error (TypeError)
    name = 's' + name[1:]     # this works: it builds a new string
    print name                # superman

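A minimal sketch of what the failed assignment looks like in practice, catching the error instead of letting the program stop. The names error_message and new_name are made up for this illustration.

```python
name = 'Superman'

try:
    name[0] = 's'                  # strings do not support item assignment
    error_message = None
except TypeError as err:
    error_message = str(err)       # the assignment was rejected

# Build a new string instead; the original is left untouched.
new_name = 's' + name[1:]
```

After this runs, name is still 'Superman' and new_name is 'superman'.
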
Methods vs. functions
• value.do_something()     # here do_something is a method
    'Hello'.upper()            # returns 'HELLO'
    value.isdigit()            # returns True if the string is non-empty and all its characters are digits
    name.startswith('Har')     # returns True if name begins with 'Har'
• do_something(value)      # here do_something is a function
• Examples:
    len("TATATATA")            # returns the length of the string (8)
    math.sqrt(34.5)            # returns the square root of 34.5 (needs import math)

String methods
• Methods are similar to functions, except that they are called with a different syntax: dot notation.
    word = 'rabbit'
    uword = word.upper()    # returns the string in upper case: 'RABBIT'
• upper() is a string method that takes no arguments.
• There are a lot of string methods. Here is another:
• string.capitalize() returns a copy of the string with only its first character capitalized.

find(): a string method
• string.find(sub[, start[, end]]) returns the lowest index in the string where substring sub is found, such that sub is contained in the range [start, end]. Optional arguments start and end are interpreted as in slice notation. Returns -1 if sub is not found.
    statement = "What makes you think she's a witch? Well she turned me into a newt"
    index = statement.find('witch')
    print index                              # 29
    index2 = statement.find('she')
    print index2                             # 21
    index3 = statement.find('she', index)    # start searching at index 29
    print index3                             # 41

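The optional start argument makes it possible to walk through every occurrence of a substring, not just the first. The sketch below shows one way to do that; find_all is a hypothetical helper written for this illustration, not a built-in string method.

```python
def find_all(text, sub):
    """Return every index at which sub starts in text (overlaps included)."""
    positions = []
    index = text.find(sub)
    while index != -1:                      # find returns -1 when nothing is left
        positions.append(index)
        index = text.find(sub, index + 1)   # resume just past the last hit
    return positions


statement = "What makes you think she's a witch? Well she turned me into a newt"
she_positions = find_all(statement, 'she')  # [21, 41]
```

Restarting at index + 1 rather than index + len(sub) means overlapping matches are also reported, which matters for repetitive strings such as DNA.
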
The in operator with strings
• The word in is a boolean operator that takes two strings and returns True if the first appears as a substring in the second.
    >>> 'a' in 'King Arthur'
    False
    >>> 'Art' in 'King Arthur'
    True
• What does this function do?
    def mystery(word1, word2):
        for letter in word1:
            if letter in word2:
                print letter
• It prints the letters of word1 that also occur in word2.

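A variant of the mystery function that returns the shared letters instead of printing them can be easier to reuse. This is only a sketch; common_letters is a name chosen for this illustration.

```python
def common_letters(word1, word2):
    """Return the letters of word1 that also occur in word2, in order."""
    return [letter for letter in word1 if letter in word2]


shared = common_letters('apples', 'oranges')   # ['a', 'e', 's']
```

A letter that appears twice in word1 is reported twice, just as the printing version would print it twice.
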
More examples
    >>> 'TATA' in 'TATATATATATA'
    True
    >>> 'AA' in 'TATATATATATATATA'
    False
    >>> 'AC' + 'TG'
    'ACTG'
    >>> 5 * 'TA'
    'TATATATATA'
    >>> 'MNKMDLVADVAEKTDLS'[1:4]
    'NKM'
    >>> 'MNKMDLVADVAEKTDLS'[8:-1]
    'DVAEKTDL'
    >>> 'MNKMDLVADVAEKTDLS'[-5:-4]
    'K'
    >>> 'MNKMDLVADVAEKTDLS'[10:]
    'AEKTDLS'
    >>> 'MNKMDLVADVAEKTDLS'[5:5]
    ''
    >>> 'MNKMDLVADVAEKTD'.find('LV')
    5

String comparisons
• The relational operators also work on strings.
    if word == 'bananas':
        print 'Yes I want one'
• Put words in alphabetical order:
    if word1 < word2:
        print word1, word2
    else:
        print word2, word1
• NOTE: in Python all upper-case letters come before all lower-case letters, i.e. 'Hello' comes before 'hello'.

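Because upper-case letters sort first, 'Zebra' < 'apple' is True, which is usually not what "alphabetical order" means. One common workaround, sketched here, is to compare lower-cased copies. The function name in_order is made up for this illustration.

```python
def in_order(word1, word2):
    """Return the two words in alphabetical order, ignoring case."""
    if word1.lower() <= word2.lower():
        return word1, word2
    return word2, word1


plain_compare = 'Zebra' < 'apple'        # True: 'Z' comes before 'a'
ordered = in_order('Zebra', 'apple')     # ('apple', 'Zebra')
```
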
Let's download a book and analyze it
• Go to http://www.gutenberg.org/ and download the first edition of On the Origin of Species by Charles Darwin. Be sure to download the plain-text version and save it as oots.txt
• (http://www.gutenberg.org/files/1228/1228.txt)
• This little program reads the file and prints it to the screen.
    file = open('oots.txt', 'r')    # open for reading
    print file.read()
• NOTE: file.read() reads the entire file into memory and returns it as a single string!
• See: http://www.pythonforbeginners.com/systems-programming/reading-and-writing-files-in-python/

I don't want the whole file!
• The readline() method reads from a file line by line (rather than pulling the entire file in at once).
• The first call to readline() returns the first line of the file; subsequent calls return successive lines.
• Each call reads a single line from the file and returns a string containing the characters up to and including \n.
    # prints first 100 lines
    file = open('oots.txt', 'r')
    for i in range(100):
        print file.readline()

    # prints first line
    file = open('newfile.txt', 'r')
    print file.readline()

    # prints first line, storing it in a variable first
    file = open('newfile.txt', 'r')
    line = file.readline()
    print line

    # prints entire file using the in operator
    file = open('oots.txt', 'r')
    for line in file:
        print line

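The "first 100 lines" idea can also be sketched with itertools.islice, which stops early if the file has fewer lines than requested. The helper first_lines is hypothetical, and a plain list of strings stands in for an open file so the example runs without oots.txt.

```python
from itertools import islice


def first_lines(lines, count):
    """Return up to count lines, with the trailing newline removed."""
    return [line.rstrip('\n') for line in islice(lines, count)]


# A list of strings stands in for an open file here; an open file object
# can be passed in the same way because it is also iterated line by line.
sample = ['line one\n', 'line two\n', 'line three\n']
head = first_lines(sample, 2)    # ['line one', 'line two']
```
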
Does the Origin have the word evolution in it?
    # Searches for the word 'evolution' in the file. It checks every
    # line individually in this program. This saves space over
    # reading the entire book into memory.
    file = open('oots.txt', 'r')
    for line in file:
        if line.find('evolution') != -1:    # find returns -1 if not in line
            print line
    print 'done'
    # Is this true of the 6th edition? Check it out.
• What if we want to know which line the string occurs in?

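One possible answer to the line-number question is to count lines with enumerate while searching. This is a sketch only: lines_containing is a made-up helper, and the sample list stands in for the book so the example runs without the download.

```python
def lines_containing(lines, word):
    """Return (line_number, text) pairs for every line that contains word."""
    matches = []
    for number, line in enumerate(lines, 1):    # start counting at line 1
        if word in line:
            matches.append((number, line.rstrip('\n')))
    return matches


# Stand-in data; with the real book, pass open('oots.txt', 'r') instead.
sample = ['On the origin\n', 'no mention here\n', 'a theory of evolution\n']
hits = lines_containing(sample, 'evolution')    # [(3, 'a theory of evolution')]
```
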
Let's download some DNA
• Where do we get DNA? Well, http://en.wikipedia.org/wiki/List_of_biological_databases contains a nice list.
• Let's use this one: http://www.ncbi.nlm.nih.gov/
• Under Nucleotide, type in Neanderthal and download KC879692.1 (it was the fifth one in my search). This is the entire mitochondrial sequence for a Neanderthal found in the Denisova cave in the Altai mountains.
• Here it is: http://www.ncbi.nlm.nih.gov/nuccore/KC879692.1
• Here is the Denisovan mitochondria.

Stripping the annotation info
• The annotation info for a GenBank file is everything written above the ORIGIN line. Let's get rid of this stuff using a flag variable.
    file = open('neanderMito.gb', 'r')
    fileout = open("stripNeander.txt", "w")
    # This code strips all lines above and including the ORIGIN line.
    # It uses a flag variable called originFlag.
    originFlag = False
    for line in file:
        if originFlag == True:
            print line,                   # the comma suppresses the line feed
            fileout.write(line)
        if line.find('ORIGIN') != -1:     # once ORIGIN has been found, start
            originFlag = True             # writing to the output file
    fileout.close()                       # an absolute requirement, to flush the buffer

Stripping the annotation info 2
• The annotation info for a GenBank file is everything written above the ORIGIN line. Another method:
    # Assumes the file contains an ORIGIN line and ends with a // line.
    file = open('neanderMito.gb', 'r')
    fileout = open("stripNeander.txt", "w")
    line = file.readline()
    while not line.startswith('ORIGIN'):    # skip up to ORIGIN
        line = file.readline()
    line = file.readline()                  # first line after ORIGIN
    while not line.startswith('//'):        # // marks the end of the record
        print line,
        fileout.write(line)
        line = file.readline()
    fileout.close()                         # an absolute requirement, to flush the buffer
• startswith() is another string method. Look it up!

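Both versions above can be folded into one small function that works on any sequence of lines, which makes the logic easy to try without a GenBank download. This is a sketch under assumptions: sequence_lines is a hypothetical name, and the sample record is invented and far shorter than a real file.

```python
def sequence_lines(lines):
    """Return the lines between the ORIGIN line and the closing // line."""
    kept = []
    in_sequence = False                  # flag: have we passed ORIGIN yet?
    for line in lines:
        if line.startswith('//'):        # end-of-record marker
            break
        if in_sequence:
            kept.append(line)
        elif line.startswith('ORIGIN'):
            in_sequence = True
    return kept


# Invented miniature record, for illustration only.
record = ['LOCUS       EXAMPLE\n', 'ORIGIN\n',
          '        1 gatcacaggt\n', '       11 ctatcaccct\n', '//\n']
body = sequence_lines(record)
```

Unlike the readline loop, this version simply returns an empty list when the input has no ORIGIN line.
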
What we have now
    1 gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtatttt
    61 cgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcaccctatgtc
    121 gcagtatctgtctttgattcctgccccatcctattatttatcgcacctacgttcaatatt
    181 acagacgagcatacctactaaagtgtgttaattaattaatgcttgtaggacataataata
    241 acgattaaatgtctgcacagccgctttccacacagacatcataacaaaaaatttccacca
    301 aaccccccccctccccccgcttctggccacagcacttaaacatatctctgccaaacccca
    361 aaaacaaagaaccctaacaccagcctaaccagatttcaaattttatcttttggcggtata
    421 cacttttaacagtcaccccctaactaacacattattttcccctcccactcccatactact
    481 aatctcatcaatacaacccccgcccatcctacccagcacacaccgctgctaaccccatac
    541 cccgagccaaccaaaccccaaagacaccccccacagtttatgtagcttacctcctcaaag
• We want to get rid of the numbers and spaces. How does one do this?
• What types of characters are left in this file? Digits, a, c, t, g, spaces, and newlines.

So let's strip everything but a, c, t, g
    file = open("stripNeander.txt", "r")
    fileout = open('neanderMitostripped.txt', 'w')
    # This code strips all characters but a, c, t, g
    for line in file:
        for char in line:
            if char in ['a', 'c', 't', 'g']:    # I'm using a list here
                fileout.write(char)
    fileout.close()
• What is in fileout now? One very long line, i.e. there are NO spaces or newlines.

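The same filtering can be sketched on a single string with join, which avoids writing one character at a time. The helper name keep_bases is made up for this illustration, and the sample line is a shortened stand-in for a line of the stripped file.

```python
def keep_bases(text):
    """Return only the characters a, c, g and t from text, in order."""
    return ''.join(char for char in text if char in 'acgt')


line = '        1 gatcacaggt ctatcaccct\n'
bases = keep_bases(line)    # 'gatcacaggtctatcaccct'
```

Testing a single character against the string 'acgt' does the same job here as the list ['a', 'c', 't', 'g'] on the slide.
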
What if we want to do all this on a lot of files?
• The easiest way would be to turn the previous processing into a function. Then we can use the function on the files.
    # This function strips all characters but a, c, t, g from the file called name
    # and returns the result as a string.
    def stripGenBank(name):
        file = open(name, "r")
        sequence = ''
        originFlag = False
        for line in file:
            if originFlag == True:
                for char in line:
                    if char in ['a', 'c', 't', 'g']:    # I'm using a list here
                        sequence = sequence + char      # attach the new char on the end
            if line.find('ORIGIN') != -1:
                originFlag = True
        return sequence

    print stripGenBank('neanderMito.gb')

Let's compare Neanderthal with Denisovan
    # Assumes both sequences have at least 10000 bases and differ
    # somewhere in that range; otherwise index is never set.
    neander = stripGenBank('neanderMito.gb')
    denison = stripGenBank('denosovanMito.gb')
    for i in range(10000):
        if neander[i] != denison[i]:
            print '\nFiles first differ at location ', i
            index = i + 1
            print 'Neanderthal is ', neander[i], ' and Denisovan is ', denison[i]
            break
    print neander[:index]    # dump up to where they differ
    print denison[:index]

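The loop above relies on both sequences being at least 10000 characters long and on a difference actually existing. A more defensive sketch is shown below. The function first_difference is a hypothetical helper, and the short strings are invented test data, not the real sequences.

```python
def first_difference(seq1, seq2):
    """Return the first index where the sequences differ, or -1 if they match."""
    shorter = min(len(seq1), len(seq2))
    for i in range(shorter):
        if seq1[i] != seq2[i]:
            return i
    if len(seq1) != len(seq2):
        return shorter       # one sequence is a prefix of the other
    return -1                # identical sequences


position = first_difference('gatcacc', 'gatcact')    # 6
```

With the real data, first_difference(neander, denison) could replace the range(10000) loop, and a result of -1 would signal that no difference was found.
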
The OUTPUT of this comparison
    Files first differ at location 145
    Neanderthal is c and Denisovan is t
    gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcaccctatgtcgcagtatctgtctttgattcctgccc
    gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcaccctatgtcgcagtatctgtctttgattcctgcct