WARNING Python 2.x Not Acceptable
Warning • This course uses Python 3 only • Do not turn in code written for Python 2.x • Thus, you may not use raw_input (use input instead) and you may not use print as a statement; it is a function • Thus you cannot write print "Hello" • You must write print("Hello")
Cells, DNA, RNA and Proteins Simplified!
Cells • The fundamental unit of life is the cell • A cell consists of a protective membrane surrounding a collection of organelles (subcellular structures) and large and complex molecules that provide cellular structure, energy, and the means for the cell to reproduce • In plants and animals, individual cells cooperate to form multicellular tissues and organ systems that meet the biological needs of the organism • We are interested in biological sequences that regulate all biological processes in cells and organisms • Our primary concern is the set of instructions for the organization of cells during the development of an organism
DNA • The instruction sequences are stored in very long chemical strings called DNA • DNA is the main information carrier molecule in a cell • DNA may be single or double stranded. • A single stranded DNA molecule, also called a polynucleotide, is a chain of small molecules, called nucleotides. • There are four different nucleotides grouped into two types, • purines: adenine and guanine and • pyrimidines: cytosine and thymine. • They are usually referred to as bases and denoted by their initial letters, A, C, G and T
DNA • Different nucleotides can be linked together in any order to form a polynucleotide, for instance, like this A-G-T-C-C-A-A-G-C-T-T • Polynucleotides can be of any length and can have any sequence • The two ends of this molecule are chemically different, i.e., the sequence has a directionality, like this A->G->T->C->C->A->A->G->C->T->T-> • The two ends of a polynucleotide are marked 5' and 3'. • By convention DNA is usually written with 5' left and 3' right, with the coding strand at top.
DNA • Two strands are said to be complementary if one can be obtained from the other by • mutually exchanging A with T and C with G, and • changing the direction of the molecule to the opposite.
A->G->T->C->C->A->A->G->C->T->T->
<-T<-C<-A<-G<-G<-T<-T<-C<-G<-A<-A
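The two-step rule above (swap A with T and C with G, then reverse the direction) can be sketched in a few lines of Python. The names COMPLEMENT and reverse_complement are made up for this illustration; this is only a sketch that assumes the input contains just the letters A, C, G and T.

```python
# Pairing rule from the slide: A <-> T and C <-> G.
# COMPLEMENT and reverse_complement are illustrative names, not course code.
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def reverse_complement(strand):
    """Return the complementary strand, read in the opposite direction."""
    # reversed() changes the direction; the dictionary exchanges the bases.
    return ''.join(COMPLEMENT[base] for base in reversed(strand))

# The example strand from the slide, written left to right.
print(reverse_complement('AGTCCAAGCTT'))  # AAGCTTGGACT
```

Reading the printed result from right to left gives T-C-A-G-G-T-T-C-G-A-A, the lower strand shown in the diagram above.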
DNA • Specific pairs of nucleotides can form weak bonds between them • A binds to T, C binds to G. • Although such interactions are individually weak, when two longer complementary polynucleotide chains meet, they tend to stick together
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
• Vertical lines between two strands represent the forces between them as shown above. • The A-T and G-C pairs are called base-pairs (bp). • The length of a DNA molecule is usually measured in base-pairs or nucleotides (nt), which in this context is the same thing.
DNA Double Helix Two complementary polynucleotide chains form a stable structure, which resembles a helix and is known as the DNA double helix. About 10 bp in this structure make a full turn, which is about 3.4 nm long.
DNA • It is remarkable that two complementary DNA polynucleotides form a stable double helix almost regardless of the sequence of the nucleotides • This makes the DNA molecule a perfect medium for information storage • Note that as the strands are complementary, either one of the strands of the genome molecule contains all the information • Thus, for many information related purposes, the molecule used in the example above can be represented as CGATTGCAACGATGC • Each position holds one of four bases, so the maximal amount of information that can be encoded in such a molecule is 2 bits times the length of the sequence • Noting that the distance between nucleotide pairs in DNA is about 0.34 nm, we can calculate that the linear information storage density in DNA is about 6x10^7 bits/cm • Which is approximately 7 MB per centimeter
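The density estimate can be checked with a short calculation. This sketch uses only the two figures quoted above (2 bits per base and 0.34 nm between neighboring base pairs); the variable names are made up for the illustration.

```python
# Figures taken from the slide; the variable names are illustrative.
BITS_PER_BASE = 2        # four possible bases, and log2(4) = 2
SPACING_NM = 0.34        # distance between neighboring base pairs, in nm
NM_PER_CM = 10**7        # 1 cm = 10,000,000 nm

bases_per_cm = NM_PER_CM / SPACING_NM
bits_per_cm = BITS_PER_BASE * bases_per_cm
megabytes_per_cm = bits_per_cm / 8 / 10**6   # 8 bits per byte, 10^6 bytes per MB

print(f'{bits_per_cm:.1e} bits/cm')
print(f'{megabytes_per_cm:.1f} MB/cm')
```

Running it gives a value a little under 6x10^7 bits per centimeter, or a little over 7 MB per centimeter.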
DNA • Regions in the DNA sequence encode instructions for the manufacture of proteins in the cell • Proteins are linear chains whose elements come from a set of 20 chemically active building blocks known as amino acids. • Each protein has a unique sequence of amino acids that is determined by a DNA sequence on the chromosomes. • The proteins enable an organism to build needed structures and to carry out its biological functions. • Using a specific biological mechanism – transcription – the DNA is “read” and searched for specific patterns that mark the beginning and end of hereditary information • That information is the gene
RNA • Transcription produces another long string called messenger RNA (mRNA) • The mRNA is what is actually used to build the amino acid sequence. • mRNA molecules are very similar structurally and chemically to DNA • Differences: they are single-stranded and have a different base – uracil (U) – in place of thymine (T). They also have a different backbone sugar.
Translation • mRNA also has specific regions indicating the start of the code for a protein • Large organelles in the cytoplasm (ribosomes) bind to the start sites • They then move in a defined chemical direction, reading length-three base sequences (codons) one at a time • Each codon specifies an amino acid • The corresponding amino acid is then added to a growing chain that makes up the protein • This continues until one of several stop codons is reached
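As a rough illustration of the two ideas above, the sketch below turns a DNA coding strand into mRNA and then cuts the mRNA into codons. It is a deliberate simplification: it assumes the input is the coding strand written 5' to 3' (so the mRNA has the same letters with U in place of T) and it ignores the search for start sites. The function names and the sample sequence are made up for the example.

```python
def transcribe(coding_strand):
    """Return the mRNA for a coding strand: same sequence, U instead of T.

    Assumption for this sketch: the argument is the coding strand, 5' to 3'.
    """
    return coding_strand.replace('T', 'U')

def split_into_codons(mrna):
    """Split an mRNA string into consecutive length-three pieces.

    Any one or two leftover bases at the end are dropped.
    """
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

mrna = transcribe('ATGTTTGGATAA')        # made-up sample sequence
print(mrna)                              # AUGUUUGGAUAA
print(split_into_codons(mrna))           # ['AUG', 'UUU', 'GGA', 'UAA']
```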
Transcription and Translation • Once formed, proteins rapidly fold from a linear string into simple helical and stranded elements • These new components are then organized into a complex three-dimensional structure • The resulting protein molecule may serve as a tissue building block or have a very specific chemical activity • The collection of proteins produced by an organism, the proteome, is responsible for the organism’s structure and biological behavior.
Python Dictionaries “mappings”
Python Dictionaries • The literals used in directly defining a dictionary are a sequence of “key:value” pairs enclosed in curly braces.
D = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}
• We “index” the dictionary by key to fetch and change the key's associated value:
>>> D['food']
'Spam'
>>> D['quantity'] += 1
>>> D
{'food': 'Spam', 'color': 'pink', 'quantity': 5}
• We may use dict as a function to convert certain collections to dictionaries, but the collection must consist of two-element lists or tuples • Moreover, the first entry of each pair (the key) cannot be of a mutable type
>>> D = dict([('food','Spam'),('color','pink'),('quantity',5)])
yields the same dictionary as above. Note that dict takes a single collection of pairs, so the pairs are wrapped in a list.
Python Dictionaries • A dictionary can also be built up one item at a time • First, create an empty dictionary:
>>> D = {}
• Then create keys by assignment:
>>> D['name'] = 'Bob'
>>> D['job'] = 'dev'
>>> D['age'] = 40
>>> D
{'age':40,'job':'dev','name':'Bob'}
>>> print(D['name'])
Bob
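The points about building dictionaries can be tried out with a small sketch. The keys and values reuse the slide's examples; the 'price' key and the list used as a key are made up to show what happens with a missing key and with a mutable key.

```python
# dict() accepts ONE collection of two-element pairs.
pairs = [('food', 'Spam'), ('quantity', 4), ('color', 'pink')]
D = dict(pairs)

D['quantity'] += 1            # fetch and change a value through its key
print(D['quantity'])          # 5

# Asking for a key that is absent: test with "in", or use get with a default.
print('price' in D)           # False
print(D.get('price', 0))      # 0

# A mutable object such as a list cannot be used as a key.
try:
    D[['a', 'list']] = 1
    mutable_key_allowed = True
except TypeError:
    mutable_key_allowed = False
print(mutable_key_allowed)    # False
```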
Dictionary Methods • The result of a method such as d.keys() may be used like a sequence (for example, in a for loop); to convert it to an actual sequence, use list(d.keys())
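A short sketch of this behavior, using the small dictionary built earlier plus one made-up extra key ('lab'). It relies on Python 3.7 or later, where dictionaries keep their insertion order.

```python
d = {'name': 'Bob', 'job': 'dev', 'age': 40}

keys = d.keys()              # a "view" of the keys, not a list
before = list(keys)          # convert to an actual list
print(before)                # ['name', 'job', 'age']

d['lab'] = 'genomics'        # made-up extra entry
after = list(keys)           # the same view now shows the new key too
print(after)                 # ['name', 'job', 'age', 'lab']

# Views can be used directly in a for loop.
for key, value in d.items():
    print(key, '->', value)
```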
Dictionaries are the natural Python representation of tabular data. • We will illustrate this by using a dictionary to represent the codon table (the "Genetic Code")
Codon Table as Dictionary
>>> RNA_codon_table = {
#                        Second Base
#        U             C             A             G
# U
    'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys',   # UxU
    'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys',   # UxC
    'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---',   # UxA
    'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp',   # UxG
# C
    'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg',   # CxU
    'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg',   # CxC
    'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg',   # CxA
    'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg',   # CxG
# A
    'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser',   # AxU
    'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser',   # AxC
    'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg',   # AxA
    'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg',   # AxG
# G
    'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly',   # GxU
    'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly',   # GxC
    'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly',   # GxA
    'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly'    # GxG
}
Codon Table as Dictionary
>>> RNA_codon_table
{'ACC': 'Thr', 'GUC': 'Val', 'ACA': 'Thr', 'AAA': 'Lys', 'GUU': 'Val', 'AAC': 'Asn', 'CCU': 'Pro', 'UGG': 'Trp', 'AGC': 'Ser', 'AUC': 'Ile', 'CAU': 'His', 'AAU': 'Asn', 'AGU': 'Ser', 'ACU': 'Thr', 'GUG': 'Val', 'CAC': 'His', 'ACG': 'Thr', 'CAA': 'Gln', 'CCA': 'Pro', 'CCG': 'Pro', 'CCC': 'Pro', 'GGU': 'Gly', 'UCU': 'Ser', 'GCG': 'Ala', 'UGC': 'Cys', 'CAG': 'Gln', 'UGA': '---', 'UAU': 'Tyr', 'CGG': 'Arg', 'UCG': 'Ser', 'AGG': 'Arg', 'GGG': 'Gly', 'UCC': 'Ser', 'UCA': 'Ser', 'GAA': 'Glu', 'UAA': '---', 'GGA': 'Gly', 'UAC': 'Tyr', 'CGU': 'Arg', 'UGU': 'Cys', 'AUA': 'Ile', 'GCA': 'Ala', 'CUU': 'Leu', 'GGC': 'Gly', 'AUG': 'Met', 'CUG': 'Leu', 'GAG': 'Glu', 'CUC': 'Leu', 'AGA': 'Arg', 'CUA': 'Leu', 'GCC': 'Ala', 'AAG': 'Lys', 'GAU': 'Asp', 'UUU': 'Phe', 'GAC': 'Asp', 'GUA': 'Val', 'CGA': 'Arg', 'GCU': 'Ala', 'UAG': '---', 'AUU': 'Ile', 'UUG': 'Leu', 'UUA': 'Leu', 'CGC': 'Arg', 'UUC': 'Phe'}
“Pretty Printing” Requires import pprint
>>> pprint.pprint(RNA_codon_table)
{'AAA': 'Lys',
 'AAC': 'Asn',
 'AAG': 'Lys',
 'AAU': 'Asn',
 'ACA': 'Thr',
 'ACC': 'Thr',
 'ACG': 'Thr',
 'ACU': 'Thr',
 'AGA': 'Arg',
 . . .
 'UCU': 'Ser',
 'UGA': '---',
 'UGC': 'Cys',
 'UGG': 'Trp',
 'UGU': 'Cys',
 'UUA': 'Leu',
 'UUC': 'Phe',
 'UUG': 'Leu',
 'UUU': 'Phe'}
Using the RNA_codon_table
>>> def translate_RNA_codon(codon):
...     return RNA_codon_table[codon]
...
>>> translate_RNA_codon('AGA')
'Arg'
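The single-codon lookup extends naturally to a whole mRNA string. The sketch below is one possible way to do it, not the course's own solution: translate_rna and CODON_EXCERPT are made-up names, and the table is only a four-entry excerpt so that the example stays self-contained. With the full RNA_codon_table the same function would work unchanged.

```python
# A small excerpt of the codon table, just enough for the sample below.
# '---' marks a stop codon, as in the full table.
CODON_EXCERPT = {'AUG': 'Met', 'UUU': 'Phe', 'GGA': 'Gly', 'UAA': '---'}

def translate_rna(mrna, table):
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    # Step through the string three bases at a time; a trailing
    # fragment of one or two bases is ignored.
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = table[mrna[i:i + 3]]
        if amino_acid == '---':      # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return protein

print(translate_rna('AUGUUUGGAUAAUUU', CODON_EXCERPT))  # ['Met', 'Phe', 'Gly']
```

A codon that is missing from the table raises KeyError, which is reasonable for a sketch because it points at bad input right away.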
Streams • A stream is a temporally ordered sequence of indefinite length • Usually limited to one type • Two ends: • source, provides the elements • sink, absorbs the elements • Examples of Python stream sources: • files • network connections • output of special functions called generators • Examples of stream sinks: • files • network connections
Streams • Input to a command-line shell or the Python interpreter becomes a stream of characters • When Python prints to the terminal, also a stream of characters • Illustrates the temporal nature of streams • Keystrokes don't "come from" anywhere • They are events that happen in time • Implementation detail: buffering
Files • Depending on a parameter used on creation, the elements “flowing” to/from the file stream are either bytes or Unicode characters • Some methods treat files as streams of bytes or characters, others as lines of bytes or characters • Most of the time, a file is a one-way sequence – it can be read from or written to • While it is possible to create a two-way file object, better to think of it as two separate streams • Files opened for reading are assumed to already exist • An attempt to open a non-existent file for reading results in an error • Files can also be opened for appending to an existing file • When a file is opened for writing it is created if it does not exist • If the file does exist, its contents are erased as the result of being opened
Creating File Objects • File objects are created by a call to the built-in function open(path,mode) • path is a string that specifies the location for the physical file represented by the Python file object • mode is a string of length one, two or three which specifies the type of file interaction • A substring of the mode string specifying the intended use of the file is mandatory • The use options are 'r', 'w', 'a', 'r+', 'w+', 'a+'; 'r' is the default • An optional single character specifies the file object's value type • The value options are: 't', 'b'. If neither is present, it is assumed to be 't' • The meaning of the mode string contents is given in the following tables
File Modes (Unicode) Correction
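The behavior of the most common modes can be observed with a throwaway file. This is only a sketch: the file name demo.txt and the helper read_text are made up, the file lives in a temporary directory that is deleted at the end, and the with statement used here is described later in these notes.

```python
import os
import tempfile

def read_text(path):
    """Return the whole file as a string ('r' means the same as 'rt')."""
    with open(path, 'r') as f:
        return f.read()

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')   # made-up file name

    # 'r' on a file that does not exist is an error.
    try:
        open(path, 'r')
        opened_missing = True
    except FileNotFoundError:
        opened_missing = False

    with open(path, 'w') as f:        # 'w' creates the file
        f.write('first\n')
    with open(path, 'a') as f:        # 'a' adds to the end
        f.write('second\n')
    after_append = read_text(path)

    with open(path, 'w') as f:        # 'w' again erases the old contents
        f.write('third\n')
    after_rewrite = read_text(path)

    with open(path, 'rb') as f:       # 'b' gives bytes instead of a string
        raw = f.read()

print(opened_missing)       # False
print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'
print(type(raw))            # <class 'bytes'>
```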
Creating File Objects • Simple use of the open function: f = open(r'C:\Users\rtindell\myfile','r') • The r prefix makes the path a raw string, so its backslashes are not treated as escape characters (without it, '\U' is a syntax error in Python 3) • When you are finished using the file object, you must close it: f.close() • There are hazards to using this approach, which, although relatively rare, can have dire consequences • If your script crashes before the close statement is executed, there may be writes whose data was not actually written to the physical file • Due to the way the underlying hardware works, transfers to and from external drives are done in fairly large chunks (blocks) • The chunks are kept in special pieces of computer memory called buffers • Requests for reading or writing are satisfied by the buffers until the entire content of the buffer has been used • At that point, entire blocks are transferred between main memory and the drive • If the write buffer was half-full when your script crashed, the buffer data would never be written to the disk
The with Statement • Python provides a way to make sure files are closed regardless of other events
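A minimal sketch of the with statement in action. The "crash" is simulated by raising an exception on purpose, and the file name notes.txt is made up; the point is that the file is closed, and its data written out, even though the block did not finish normally.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'notes.txt')   # made-up file name

    try:
        with open(path, 'w') as f:
            f.write('saved before the crash\n')
            raise RuntimeError('simulated crash')   # deliberate error
    except RuntimeError:
        pass

    # Leaving the with block closed the file, even though an error occurred.
    closed_after_error = f.closed

    with open(path, 'r') as f:
        contents = f.read()

print(closed_after_error)   # True
print(repr(contents))       # 'saved before the crash\n'
```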
File Read Methods In the following, f is a file object; counts are in characters for a text-mode file and in bytes for a binary-mode file • f.read(count) • Treats the file as an input stream of characters • Reads up to count characters from the current file position into a string and returns that string • The file position is then the next character in the file • If there were fewer than count characters left in the file, returns just the remaining characters • If the file position is the end of the file, returns the empty string • f.read() • Reads the characters from the current position to the end of the file into a string and returns that string • The file position is then the end of the file
File Read Methods In the following, f is a file object • f.readline() • Treats the file as an input stream of lines • Reads one line from the file and returns the entire line, including the end-of-line character • The file position is then the beginning of the next line of the file or the end of file if the contents of the file were exhausted • If the initial file position is the end of the file, returns the empty string • f.readline(count) • Same as f.readline(), but limits the number of characters read to count.
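The read methods can be compared on a tiny two-line file. This is a sketch with a made-up file (bases.txt) created in a temporary directory; each call continues from wherever the previous one stopped.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'bases.txt')   # made-up file name
    with open(path, 'w') as f:
        f.write('ACGT\nTTAA\n')

    with open(path, 'r') as f:
        first_two = f.read(2)         # up to 2 characters
        rest_of_line = f.readline()   # from the current position to end of line
        limited = f.readline(3)       # at most 3 characters of the next line
        remainder = f.read()          # everything that is left
        at_end = f.read()             # already at end of file
        at_end_line = f.readline()    # same for readline

print(repr(first_two))      # 'AC'
print(repr(rest_of_line))   # 'GT\n'
print(repr(limited))        # 'TTA'
print(repr(remainder))      # 'A\n'
print(repr(at_end))         # ''
print(repr(at_end_line))    # ''
```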
File Write Methods In the following, f is a file object, s is a string and seq is a sequence object • f.write(s) • Treats the file as an output stream of characters • Writes the string s to the file represented by f • f.writelines(seq) • Treats the file as an output stream of lines • All elements of seq must be strings • Writes each element of seq to the file • Despite the name, it does not add newline characters to the elements • The print function also provides a mechanism for writing to a file • You just use the keyword argument file as the last argument • Example: print('Hello','young','bioinformaticians',file=f)
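The three ways of writing can be seen side by side in a small sketch. The file name out.txt and the sample strings are made up; note in particular that writelines runs its elements together unless they already end in a newline.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'out.txt')   # made-up file name

    with open(path, 'w') as f:
        f.write('>seq1 demo\n')
        f.writelines(['ACGT', 'TTAA'])       # no newlines are added
        f.write('\n')
        # Add the newlines yourself if you want one element per line.
        f.writelines(piece + '\n' for piece in ['GGCC', 'AATT'])
        # print adds spaces between arguments and a newline at the end.
        print('Hello', 'young', 'bioinformaticians', file=f)

    with open(path, 'r') as f:
        text = f.read()

print(text)
```

The file ends up with five lines: the header, ACGTTTAA, GGCC, AATT, and the greeting.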
FASTA Format • FASTA formatted files are widely used in bioinformatics • They consist of one or more base or amino acid sequences broken up into fixed size lines, each preceded by a single header line • The header line starts with a ">" symbol. • The first word on this line is the name of the sequence. The rest of the line optionally provides a description of the sequence. • Meaning of the header line entries is given in the first line below and does not exist in the FASTA file. The second line is the actual entry for our example.
Identifier    Molecule Type    Gene Name    Sequence Length
FOSB_MOUSE    Protein          fosB         338 bp
FASTA Format • FASTA sequence identifiers are usually more complex than previously shown and distinguish various possible sources for the sequence • Below is a table of identifier formats • accession = Accession Number • In a genomic context, locus refers to position on a chromosome. It may, therefore, refer to a marker, a gene, or any other landmark that can be described.
FASTA Example
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Creating a Sequence Dictionary from a FASTA File • We next present ways to read the contents of a file containing FASTA format data into a dictionary whose keys are the sequence identifiers • A dictionary value will be a length-two list of strings • The first string will contain the sequence description if present in the FASTA file, otherwise it will be the empty string • The second string will be a single string containing the sequence itself • If D were the name of our dictionary and seqid a variable holding the identifier of a sequence from the file, we could access that sequence as D[seqid][1]
Method 1: Reading the Entire File into a String
# file fasta_dict1.py
def fasta_to_dictionary(fpath):
    D = {}
    with open(fpath,'r') as f:
        # Separate entries
        S = f.read()
        J0 = S.split('>')
        J = [j for j in J0 if j != '']  # Eliminate empty entries
    # The with statement has closed f; no explicit f.close() is needed
    # J is now a list of strings, each of which contains one of
    # the sequence specifications from the FASTA file
    for B in J:
        C = [k for k in B.splitlines() if k != '']
        # C[0] is the first line of B
        # and is thus the name-description line
        comps = C[0].split()
        key = comps[0]  # First word is the identifier for the sequence
        # Remaining words in comps are sequence description components
        if len(comps) > 1:
            descr = ' '.join(comps[1:])
        else:
            descr = ''
        # Remaining lines of B contain the split-up sequence
        # so join them into a single line
        seq = ''.join(C[1:])
        D[key] = [descr,seq]
    return D
Main Body of fasta_dict1.py
# file fasta_dict1.py continued
# Test the function
D = fasta_to_dictionary('fdata')
if len(D) == 0:
    print('No FASTA data found')
else:
    for k in D:
        print('Sequence Identifier:\t\t',k)
        print('Sequence Description:\t',D[k][0])
        print('Sequence:')
        print(D[k][1],'\n')
Contents of Test File fdata
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
>FOSB_RAT Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Output of fasta_dict1.py Sequence Identifier: FOSB_MOUSE Sequence Description: Protein fosB. 338 bp Sequence: MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL Sequence Identifier: FOSB_RAT Sequence Description: Protein fosB. 338 bp Sequence: MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Possible Problems with Script fasta_dict1.py • If we were trying to process a very large file, reading its entire contents into a string in memory might not be possible • We will modify the script so that it processes one line at a time by looping over the file object, which delivers the same lines that repeated calls to the readline method would • Of course, this is only one part of a correction since the dictionary we build would be at least as large as the file! • We will address that problem later • We again use the with statement so that you can see it in use • Note that all statements that use the file object f must be in the block of the with statement • Why? After you leave the with block, f has been closed.
Method 2: Reading the File One Line at a Time • Since the two scripts only differ in the fasta_to_dictionary function, we only show that part here.
# file fasta_dict2.py
def fasta_to_dictionary(fpath):
    D = {}
    with open(fpath,'r') as f:
        key = ''
        descr = ''
        seq = ''
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if key != '':  # Finished with a sequence
                    D[key] = [descr,seq]
                # Drop the leading '>' so that the key is the bare
                # identifier, as in fasta_dict1.py
                comps = line[1:].split()
                key = comps[0]
                if len(comps) > 1:
                    descr = ' '.join(comps[1:])
                else:
                    descr = ''
                seq = ''
            else:
                seq += line
                # If there are lines preceding the first
                # '>' line they will accumulate here and
                # be discarded when we start processing
                # the first '>' line
    # END OF with SUITE
    # Save the final sequence, which was terminated by file's end
    if key != '':
        D[key] = [descr,seq]
    return D
Exploring the Preceding Examples • The files fasta_dict1.py and fasta_dict2.py have been posted in the Practice Problems page of the course website (not Blackboard) • One way to get an understanding of the scripts is to insert print statements to print intermediate data that appear in the script
Exploring the Preceding Examples • For example, in fasta_dict1.py, you could replace
    with open(fpath,'r') as f:
        S = f.read()
        J0 = S.split('>')
        J = [j for j in J0 if j != '']
with
    with open(fpath,'r') as f:
        S = f.read()
        print(S)
        J0 = S.split('>')
        print(J0)
        J = [j for j in J0 if j != '']
        print(J)
Generators • A generator is an object that returns values from a series it computes • Example: the object returned by calling a function whose body uses yield, or a generator expression such as (x*x for x in range(10)) • generator objects produce values only on request • Advantages: • can produce a potentially infinitely large series of values and callers can use only as many as they need • can reduce computation by only doing computation needed to produce the desired value • can take the place of a list when creating the entire list would use huge amounts of memory
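As one possible illustration tied to the FASTA examples, the sketch below is a generator that hands back one sequence record at a time instead of building a whole dictionary. It is not taken from the course files: the name fasta_records and the sample lines are made up, and the function accepts any source of lines (an open file object would work the same way as the list used here).

```python
def fasta_records(lines):
    """Yield (identifier, description, sequence) for each FASTA record.

    lines can be any iterable of strings, for example an open file.
    Only one record is held in memory at a time.
    """
    key, descr, chunks = None, '', []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if key is not None:              # finished the previous record
                yield key, descr, ''.join(chunks)
            comps = line[1:].split()         # drop the leading '>'
            key = comps[0] if comps else ''
            descr = ' '.join(comps[1:])
            chunks = []
        elif key is not None:                # ignore text before the first '>'
            chunks.append(line)
    if key is not None:                      # the last record ends at end of input
        yield key, descr, ''.join(chunks)

# Made-up sample data standing in for the lines of a file.
sample = ['>seqA first demo\n', 'ACGT\n', 'TT\n', '>seqB\n', 'GGCC\n']

records = fasta_records(sample)   # nothing is computed yet
print(next(records))              # ('seqA', 'first demo', 'ACGTTT')
print(next(records))              # ('seqB', '', 'GGCC')
```

Each call to next does just enough work to produce one record, which is the "values only on request" behavior described above.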