160 likes | 266 Views
Perl scripting. Computer Basics. CPU. CPU, RAM, Hard drive CPU can only use data in the register directly. RAM. HARD DRIVE. Computer languages. Machine languages: binary code directly taken by the CPU. Usually CPU model specific. Fast.
E N D
Computer Basics CPU • CPU, RAM, Hard drive • CPU can only use data in the register directly RAM HARDDRIVE
Computer languages • Machine languages: binary code directly taken by the CPU. Usually CPU model specific. Fast. • Assembly language: mapping binary code to three-letter instructions; Platform-dependent. Fast • High-level language: “human-like” syntax, often non-CPU dependent. Compiled into machine code before use. Fast. E.g. C, C++, Fotran, Pascal, Basic. • Scripting language: usually not compiled into binary code. Interpreted and executed on request. Slow. E.g. Perl, Php, PythonJavascript, Bash script,Ruby • Byte-code language: source code converted to platform independent, intermediate code for rapid compilation. Java, Microsoft .NET. Speed intermediate.
Two elements of a program • Data structure & Algorithm • Different data structures may have corresponding, well optimized algorithms for information processing and extraction. (computer science) • For example: Inserting (algorithm) a node (data structure) in a linked list (data structure).
Basic Types • Bit: 1 bit has 2 states, 1 or 0 • 1 Byte = 8 bits, i.e. max(1 Byte) = (binary)11111111 = 255 • Characters in the ASCII encoding can be encoded by 1 byte. In C, data type byte is in fact written as “char” • Byte is the smallest unit of storage. • Boolean (true/false) theoretically takes only 1 bit, but in reality it takes 1 Byte. • How many Boolean states can you store using 1 byte?
BASIC TYPES • Integer: 32 bit, signed -216 + 1 ~ +216 - 1; unsigned +232 -1 • Long integer: 64 bit. • Float: 32 bit. 24bit for significand, the rest for the exponent. • Float point numbers could lose precision, try this in perl: • print 0.6/0.2-3; • Correct way: • sub round { • my($n) = @_; • return int($n + $n/abs($n*2)); • } • print round(0.6/0.2)-3;
Pointers / reference • Pointers (or reference in other languages) are essentially an integer. • This integer stores a memory address. • This memory address refers to another variable. • http://perldoc.perl.org/perlref.html
Complex types • Set: unordered values. • Array (vector): a set of ordered values of the same basic type. • Index starting from 0 in most langs, last index = length -1 • Hash: key => value pairs. Key must be unique. Array can be thought of as a special Hash where key values are ordered, consecutive integers. • String * : in C, a string is simply an array of characters. In many other languages, strings are treated as a “basic type”. Most algorithms for arrays also works for strings.
Complex types • Classes: objected-oriented programming • A class packages related data of different datatypes, as well as algorithms associated with them into a nice blackbox for you to use. • Objected-oriented programming.
PERL • PERL lumps all “basic types” as “Scalar”, “$” • PERL interpreter decides on what it “looks like” • Convenient, but sometimes problematic, especially when you parse in a user-provided data file. • Arrays, definition: @, reference $. • Hash, definition: %, reference $ • RegExp • Handlers. • use strict; • PERL has an ugly grammar. • PERL has many short-cuts, such as $_ • DO NOT USE THEM!
Flow control • for, foreach, while, unless, until, if elsif else • http://perldoc.perl.org/perlsyn.html#Compound-Statements
Functions (subroutines) • Traditionally, “subroutines” do not accept parameters • Function is a better term, but b/c perl is ugly so it continues to use sub. • sub functionname { • my($param1, $param2) = @_; #get the parameters • return xxxx. • } • Call: functionname($param1, $param2); • I prefix all private functions with “fn”. But you don’t need to do that. • However, capitalize first letter of each word! • Use Verb + Noun phrases as function names • fnGetFileName(), fnDownloadPicture.
How to name variables • Variable names should reflex their basic types. • Descriptive names should be given, with each word capitalized • I use the c-style prefix on them
Start with the DNA sequence: ATGGAAATGGAGAGGCCTCTGCAAATGATGCCGGATTGTTTCAGACATATAGAAATGTCT, report its length and check if its length can be divided by 3, also check if it's a valid DNA sequence. If check fails, do not continue. • Translate it into Peptide sequences using universal codon table. • Display it on screen in the following format where DNA is on first line, translated amino acids aligns with the middle letter at each codon at the second line: • This DNA sequence goes through generation after generation of replication. • At each replication, it has a user-specified probability (0-1) of single-nucleotide mutation. This mutational probability is specified through the command line.
If mutation happens, 1 random letter in the DNA will be changed to A,T,C or G with equal probability. It's okay if the letter "changes" to the same letter. • Display at each generation the DNA and protein sequence as described in step 3, also display the generation. • Check if a stop codon has occured at each generation. If so the protein has lost its function, stop the evolution and output the generation at which the stop codon occurs. • This program should be able to deal with DNA sequence with upper or lowercase letters.
Create a shell script called getdistr.sh • Run the simulation mutation.pl for 1000 times with mutational probabilities of 0.01, 0.1 and 0.5 respectively • Collect all DNA and protein sequence outputs to dist_$mutationprob.log • Collect the stopping generation at which stop codon first occurs in dist_$mutationprob.txt • Use R to plot dist_0.01.txt, dist_0.1.txt and dist_0.5.txt on a histogram (each parameter with different colors). X axis should be log10(Generation).