Perl (2) Hongkang Mei, Ph.D. March 10, 2002
Review of Perl (1) • More on I/O • Regular Expression basics • More on regular expression • Using regular expressions • File and directory handles
Scalar data: something single, just one number or string (the two are interchangeable), acted upon with operators (a scalar variable holds a scalar value) • List data: a list of scalars (an array is a variable that contains a list; a hash, or associative array, is a variable that contains a list of paired scalars associated with each other)
Scalar variables: '$' followed by a Perl identifier • Should be descriptive • Perl built-in scalar variables • $ARGV, $_, ”…… * Perl identifier: letters, '_', and digits; must not begin with a digit
Numeric operators • ++ incrementing the value • $counter++; • $v = $counter++; # $v gets the old value • is different from • $v = ++$counter; # $v gets the new value • -- decrementing the value
List data • list literals: scalars separated by ',' in () • (1, 2, 3, 4, 5) • ("dnaA", "argC", "rnpA") • qw/ dnaA argC rnpA / • range operator .. • (1..5) # (1, 2, 3, 4, 5) • (1.2..5.7) # same; the endpoints are truncated to integers • (5..1) # empty • ($a..$b) # depends on current values • * The qw shortcut treats its contents like a single-quoted string and accepts any punctuation pair as delimiters: / /, "", {}, [], (), <>, ##, !!
Array variables • @ + identifier • no unnecessary limit on size • Array elements: scalar variables, indexed from 0 (indices 0, 1, 2, 3, 4, …)
Accessing array elements • achieved by referring to each element as a scalar variable: • $seq[0] • print $seq[3]; • $seq[1] = 'acg'; • $seq is a different thing! • scalars ($) and arrays (@) have different namespaces
Hashes • A hash is a variable containing a list of paired scalar values (keys and values) associated with each other • % + identifier • no unnecessary limit on size • keys and values are scalars
Hash element access • $hash{$key} • $seq{"dnaA"} = "CAGACTCGAT"; • foreach $gene (qw/dnaA argC rnt/) { • print "The sequence for $gene is $seq{$gene}.\n"; • } • $key can be any expression • $seq{"unknown"} # undef
Interpolation of variables into strings • Scalar: • print $aa_seq; • print "The sequence is $aa_seq.\n"; • print "The file contains $count ${type}s.\n"; • Array: • print "The list contains @array\n"; • print @array; • print 3 * @array; # scalar context: 3 times the number of elements • Hash: • print "The AC# for 'dnaA' is $g_ac{'dnaA'}"; • NO interpolation for the whole hash!! • printf "The %s has %d AAs.\n", $prot, $len;
SCALAR and LIST CONTEXT • Using the same variable in different contexts means different things, depending on what Perl is expecting • 5 + @aa; # scalar • sort @aa; # list • @list = @aa; • @list = $aa; • $aa[0] = @list; • print "The full aa list is @aa.\n"; • print "The number of aa is " . @aa . ".\n"; • print @aa;
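A minimal sketch of the scalar versus list context rules on this slide. The array contents and the variable names ($count, $first, $sum, $joined) are made up for illustration.

```perl
use strict;
use warnings;

my @aa = qw(Ala Gly Ser);     # hypothetical three-element list

my $count   = @aa;            # scalar context: the number of elements
my @copy    = @aa;            # list context: copies every element
my ($first) = @aa;            # list assignment: takes the first element
my $sum     = 5 + @aa;        # arithmetic forces scalar context
my $joined  = "@aa";          # interpolation joins elements with spaces

print "count=$count first=$first sum=$sum joined=$joined\n";
print "The number of aa is " . @aa . ".\n";   # concatenation is scalar context
```

The parentheses around $first are what turn the assignment into a list assignment; without them $first would receive the element count.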
Control structures • if (cond) {…} elsif (cond) {…} else {…} • while (cond) {…} • foreach $line (list) {…} • for ($i=1; $i<11; $i++) {…} • unless (cond) {…} # same as if (!cond) {…} • until (cond) {…} # same as while (!cond) {…}
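Control structures • autoincrement, autodecrement • $n++; $n--; • ++$n; --$n; • $m = $n++; # $m gets the old value • $m = ++$n; # $m gets the new value • $m = $n; $n++; # same as $m = $n++; • logical operators • &&, ||, ! • and, or, not (same meaning, lower precedence)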
Control structures • expression modifier • print "Acidic\n" if $pH < 7; • print " ", ($n += 2) while $n < 10; • print "$codon{$_}\t" foreach (keys %codon); • short-circuit operator • my $n_aa = $aa{$codon} || "not in the list"; • the ternary operator ?: • $type = ($pI{$aa} < 7) ? "acidic" : • ($pI{$aa} == 7) ? "neutral" : • "basic";
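A small sketch of the chained ternary and the short-circuit default. The %pI values and the helper names classify and lookup_pI are assumptions made for the example, not part of the lecture.

```perl
use strict;
use warnings;

# Hypothetical isoelectric points, for illustration only.
my %pI = (Asp => 2.77, Gly => 5.97, Lys => 9.74);

# Chained ternary: the first true test decides the result,
# and the final branch needs no test of its own.
sub classify {
    my ($pi) = @_;
    return ($pi < 7)  ? "acidic"
         : ($pi == 7) ? "neutral"
         :              "basic";
}

# Short-circuit ||: fall back to a default when the lookup is undef.
sub lookup_pI {
    my ($aa) = @_;
    return $pI{$aa} || "not in the list";
}

print "$_: ", classify($pI{$_}), "\n" foreach sort keys %pI;
print lookup_pI("Xaa"), "\n";
```

Note that || also replaces legitimate false values such as 0, so it only suits lookups whose valid values are always true.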
Subroutines • functions or subroutines • define: • sub my_funct { • $dna_length = 3 * length($aa_seq); • print "DNA is $dna_length basepairs\n"; • } • invoke: • &my_funct;
Built-in functions • print • chomp • defined • chop • reverse, sort • pop, push, shift, unshift • return • length • scalar # a fake function: it only forces scalar context • …… • see the perlfunc manpage
Review of Perl (1) • More on I/O • Regular Expression basics • More on regular expression • Using regular expressions • File and directory handles
<STDIN>: get user input • from the command line: • chomp ($a = <STDIN>); print $a; • # input ends at the newline • file redirection: • %>myprog.pl < my_input.txt • …… • $line_n = 1; • while (<STDIN>){ • print "$line_n\t$_"; • $line_n++; • }
<>: get user input from files named on the command line • %>myprog.pl input1 - input2 # '-' stands for standard input • …… • $line_n = 1; • while (<>){ • print "$line_n\t$_"; • $line_n++; • } • The difference between <> and <STDIN>: • <> works from @ARGV
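The line-numbering loop can be sketched as a reusable subroutine. The name number_lines and the in-memory filehandle are assumptions chosen so the example runs without an input file; a real script would read from <> or <STDIN> as on the slides.

```perl
use strict;
use warnings;

# Prefix each line read from a filehandle with its line number and a tab.
sub number_lines {
    my ($fh) = @_;
    my $line_n = 1;
    my @out;
    while (my $line = <$fh>) {
        push @out, "$line_n\t$line";
        $line_n++;
    }
    return @out;
}

# Stand-in input: an in-memory filehandle instead of a real file.
my $text = "ATGC\nGGCC\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
print number_lines($fh);
close $fh;
```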
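more on print • output is buffered • print <>; # list context: prints every input line • # works like cat on the command line • print with parentheses looks like a function call • print (3+4)*5; # prints 7; the *5 applies to print's return value • print "The result is: ", (3+4)*5; # prints 35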
printf • printf "The mutation is at %s position.\n", • $count_mut; • %s, %f, %d, %g…… • %2d • %-12s (left justified) • %12.3f (right justified) • % in a string does not interpolate a whole hash • %% to print '%'
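A brief sketch of the format codes above, using sprintf (which takes the same formats as printf but returns the string) so the results can be inspected. The values are made up for illustration.

```perl
use strict;
use warnings;

my $line  = sprintf "The %s has %d AAs.", "insulin", 51;  # illustrative values
my $left  = sprintf "[%-6s]", "arg";      # left-justified in 6 columns
my $right = sprintf "[%8.3f]", 3.14159;   # right-justified, 3 decimals
my $pct   = sprintf "%d%%", 42;           # %% yields a literal percent sign

print "$line\n$left\n$right\n$pct\n";
```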
Review of Perl (1) • More on I/O • Regular Expression basics • More on regular expression • Using regular expressions • File and directory handles
regular expression or pattern • a mini-program • either matches or doesn't match a given string • one pattern can match any number of different strings • it doesn't matter how many times it matches within a string • works like grep • $_ = "ADCSFTSCGNYEQ"; • if (/SFT/){ • print "It has the motif \"SFT\".\n"; • }
metacharacters • . matches any character except "\n" • \ escape (/3\.14/) • () grouping
simple quantifiers • the following quantifiers repeat the preceding pattern • * 0 or more times • + 1 or more times • ? 0 or 1 times
the '|' alternative pattern • /T|S/ • /protein(and|or)DNA/ • /arg(ser|cys)lys/
Review of Perl (1) • More on I/O • Regular Expression basics • More on regular expression • Using regular expressions • File and directory handles
character classes • [] matches any single character listed inside • [AGCT] # any one deoxynucleotide • [a-zA-Z0-9]+ # 1 or more letters or digits • [;\-,] # '-' needs to be escaped
character classes shortcuts • \d [0-9] • \w [A-Za-z0-9_] # only a char, \w+ a word • \s [\f\t\n\r ] # whitespace • negating the shortcuts • \D [^\d] • \W [^\w] • \S [^\s] • can be part of a larger class • [\dA-F] • [\d\D] (any char) • [^\d\D] (nothing)
general quantifiers • * 0 or more repetitions • + 1 or more • ? 0 or 1 • {3,5} 3 to 5 • {3,} 3 or more • {3} exactly 3 repetitions • /U{5,8}/ • /\w{8}/ • /A{15,100}/ • /(arg){2,}/ • * is the same as {0,} • how about + and ?
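A short sketch of the counted quantifiers, run against made-up sequences. The last two lines answer the closing question under the assumption that the intended equivalents are {1,} for + and {0,1} for ?.

```perl
use strict;
use warnings;

my $six_u   = ("AUUUUUUG"  =~ /U{5,8}/)    ? 1 : 0;   # six U's: within 5 to 8
my $three_u = ("AUUUG"     =~ /U{5,8}/)    ? 1 : 0;   # three U's: too few
my $repeat  = ("argargarg" =~ /(arg){2,}/) ? 1 : 0;   # "arg" at least twice
my $single  = ("arg"       =~ /(arg){2,}/) ? 1 : 0;   # only one "arg"

# + behaves like {1,} and ? behaves like {0,1}
my $plus_eq  = ("AAA" =~ /^A{1,}$/   && "AAA" =~ /^A+$/)  ? 1 : 0;
my $quest_eq = ("AC"  =~ /^AB{0,1}C$/ && "AC" =~ /^AB?C$/) ? 1 : 0;

print "$six_u $three_u $repeat $single $plus_eq $quest_eq\n";
```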
anchors • ^ marks the beginning of the string • /^ATG/ # initiation codon • [^AGCT] # ? (inside a character class, ^ negates the class instead) • $ marks the end • /(UA[AG]|UGA)$/ # stop codons • /^\s*$/ # a blank line
word anchors • \b word boundary anchor • matches at either end of a word • /\barg/ # arg, arginine, arginyl, argue…… • /\barg\b/ # arg • \B nonword boundary anchor • matches at any point where \b would not • /\barg\B/ # arginine, arginyl, argue……
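The three word-anchor patterns can be compared side by side with grep. The word list is made up, and "target" is included only to show that \b rejects "arg" in the middle of a word.

```perl
use strict;
use warnings;

my @words = qw(arg arginine argue target);

my @starts = grep { /\barg/ }    @words;   # words that begin with "arg"
my @whole  = grep { /\barg\b/ }  @words;   # exactly the word "arg"
my @prefix = grep { /\barg\B/ }  @words;   # "arg" followed by more word characters

print "starts: @starts\nwhole: @whole\nprefix: @prefix\n";
```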
memory () • () grouping • matched part kept in memory • /A(ACGT)T/ # ACGT in memory • backreferences • \1 \2 • /(GAATTC).*\1/ # can EcoRI cut the insert out? • /(.)\1/ NOT /../ # two identical chars, not just any two chars • memory variables • $1
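A minimal sketch of backreferences, assuming two hypothetical helpers: site_twice checks whether a given site occurs at least twice in a sequence, and doubled_char returns the first character that is immediately repeated.

```perl
use strict;
use warnings;

# True if $site occurs at least twice in $dna; \1 must repeat the captured text.
sub site_twice {
    my ($dna, $site) = @_;
    return ($dna =~ /(\Q$site\E).*\1/) ? 1 : 0;
}

# Returns the first character that appears twice in a row, or undef.
sub doubled_char {
    my ($s) = @_;
    return ($s =~ /(.)\1/) ? $1 : undef;
}

print site_twice("ccGAATTCaaaaGAATTCcc", "GAATTC"), "\n";
print doubled_char("ACGGT"), "\n";
```

Here /(.)\1/ requires the second character to equal the first, whereas /../ would accept any two characters.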
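precedence • which parts of the pattern stick together more tightly (tightest first) • () parentheses • *+?{} quantifiers • ^$\b\B anchors and sequence • | alternation • atoms: chars, classes, backreferences • examples • /^fred|barney$/ # ^fred or barney$, not ^(fred|barney)$ • /^(\w+)\s+(\w+)$/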
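Because alternation binds more loosely than anchors and sequence, an unparenthesized alternative splits the whole pattern in two. A small sketch with made-up names shows the difference:

```perl
use strict;
use warnings;

my @names = qw(fred frederick barney xbarney wilma);

# Without parentheses: "starts with fred" OR "ends with barney".
my @loose  = grep { /^fred|barney$/ }   @names;

# With parentheses: the whole string is exactly fred or barney.
my @strict = grep { /^(fred|barney)$/ } @names;

print "loose:  @loose\nstrict: @strict\n";
```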
Review of Perl (1) • More on I/O • Regular Expression basics • More on regular expression • Using regular expressions • File and directory handles
m// • a more general pattern match operator • can use any pairs of delimiters • // • m,, m!! m^^ m## • m<> m{} m[] m() • example: • m%^http://% is better than /^http:\/\//
option modifiers • /i case insensitive • matches both cases for all letters • /\byes\b/i # Yes yes YES • /s lets . match any character, including "\n" • without /s, use [\d\D] to match any character • $_ = "ACGTTTGCG\nAACACGT"; • /^(ACG).*(CGT)$/s • do not confuse with the \s shortcut
combining option modifiers • /si # both /s and /i • $_ = "aCGTTTGCG\nAACAcGT"; • if (/^(ACG).*(CGT)$/si){ • print "That sequence begins with ACG ", • "and ends with CGT.\n"; • } • other options
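A sketch of how /s and /i change the outcome for the two-line sequence above. The variable names are illustrative; each match is reduced to 1 or 0 so the three cases are easy to compare.

```perl
use strict;
use warnings;

my $seq = "aCGTTTGCG\nAACAcGT";

my $plain  = ($seq =~ /^(ACG).*(CGT)$/)   ? 1 : 0;  # fails: case differs
my $i_only = ($seq =~ /^(ACG).*(CGT)$/i)  ? 1 : 0;  # fails: . stops at "\n"
my $both   = ($seq =~ /^(ACG).*(CGT)$/si) ? 1 : 0;  # succeeds

# After the successful match, $1 and $2 hold the text as it appears in $seq.
my ($head, $tail) = ($1, $2);

print "$plain $i_only $both $head $tail\n";
```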
the binding operator =~ • if (/\w+/i){……} # only works on $_ • if ($seq =~ /^(ACG).*(CGT)$/si){ • print "That sequence begins with ACG ", • "and ends with CGT.\n"; • } • $is_prot = <STDIN> =~ /[^ACGT\s]/i; # true if the input has a non-DNA character • if ($is_prot) {……} # e.g. run blastp
interpolating into patterns • my $p = "arg"; • if ($seq =~ /($p)$/si){ • print "That sequence ends with $p.\n"; • } • $profile = shift @ARGV; # get the command-line argument • if ($prot_seq =~ /$profile/si) { • print "$prot_seq has motif $profile"; • }
the match variables • /(A)\1/ # use \1 inside the pattern • $1 # holds the memory value in Perl code • if ($seq =~ /(g.)\1/si){ • print "That sequence has a $1 repeat.\n"; • } • if ($prot_seq =~ /([stavli]{3,}).*([Deq]{3,})/si) { • print "$prot_seq has hydrophobic region $1 ", • "followed by hydrophilic region $2.\n"; • }
the persistence of match • the next successful match will overwrite the earlier one • store your $1 away! • if ($prot_seq =~ /([cstv]+)/si) { • my $motif = $1; • } • test your match before using $1 • it could be a leftover • $prot_seq =~ /([cstv]+)/si; # result not tested • print "I found the motif $1, correct?\n"; # $1 may be left over from an earlier match
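The leftover problem can be sketched as follows. The sequences and the helper name find_motif are made up for illustration; the point is that a failed match leaves the previous $1 in place.

```perl
use strict;
use warnings;

my $seq_a = "MKCSTV";
my $seq_b = "MKLLAG";

$seq_a =~ /([cstv]+)/i;      # succeeds and sets $1
my $first = $1;
$seq_b =~ /([cstv]+)/i;      # fails, so $1 is not updated
my $leftover = $1;           # still the value from $seq_a

# Safer: only read $1 inside a successful test, and copy it away at once.
sub find_motif {
    my ($prot_seq) = @_;
    if ($prot_seq =~ /([cstv]+)/i) {
        return $1;
    }
    return;                  # no match: nothing to report
}

print "first=$first leftover=$leftover\n";
```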
automatic match variables (PP121) • $& matched part of the string • $` part before the match • $' part after the match • $`$&$' the whole string • …… • print "The matched string is the following, ", • "the part matched is in <>:\n", • "$`<$&>$'\n";
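A short sketch of the three automatic match variables, using a made-up protein sequence. They are copied into ordinary variables straight away, since a later successful match would overwrite them.

```perl
use strict;
use warnings;

my $seq = "ADCSFTSCGNYEQ";
my ($before, $matched, $after);

if ($seq =~ /SFT/) {
    ($before, $matched, $after) = ($`, $&, $');
    print "$before<$matched>$after\n";
}
```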
substitutions with s/// • m// search • s/// search and replace • s/match_pattern/replacement_string/ • returns true if successful, false if not • replacement string: • $1 • empty • $& • words • whitespaces • ……
examples of s/// • if (s/([a-z]{3})([cstyleu]{3})/$2/){ • print "The mutant protein has a $1 deletion ", • "before $2.\n"; • } • s/(arg)/$1$1/; # arg insertion • s/arg/cys/; # cys substitution • s/\s+//g; # get rid of all whitespace • s/\s+/ /g; # single space delimiters • s/[^acgt]//gi; # clean up DNA sequence • s/[tT]/U/g; # transcribe to RNA sequence • s/_END_.*//s; # chop off everything from the _END_ mark onward
s/// different delimiters • just like: • m// • qw// • can use unpaired or paired delimiters • ,, "" {} [] %% ## • s#^https://#http://# • s{T}{U} • s<T>#U#
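Two of the substitutions above, wrapped in hypothetical helper subroutines (dna_to_rna and collapse_ws are names chosen for this sketch) so they can be applied to any string rather than only to $_.

```perl
use strict;
use warnings;

# Remove non-base characters, then replace every T with U.
# Returns the new sequence and the number of T's replaced.
sub dna_to_rna {
    my ($seq) = @_;
    $seq =~ s/[^acgt]//gi;
    my $n = ($seq =~ s/[tT]/U/g) || 0;   # s///g returns the replacement count
    return ($seq, $n);
}

# Collapse every run of whitespace into a single space.
sub collapse_ws {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text;
}

my ($rna, $n) = dna_to_rna("ACG T\nTTx");
print "$rna ($n replaced)\n";
print collapse_ws("a   b\t\nc"), "\n";
```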
binding operator for s/// • works for non-default variables • $dna_seq =~ s/[^acgt]/n/gis;
case shifting • $dna =~ s/(.+)/\U$1/; • $prot =~ s/(.+)/\L$1/; • $prot =~ s/(\w+)/\u\L$1/gi;
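A quick sketch of the case-shifting escapes on made-up strings. The (my $copy = $original) =~ s/// idiom is used here only so the inputs stay unchanged.

```perl
use strict;
use warnings;

(my $upper = "acgt")        =~ s/(.+)/\U$1/;        # \U: upper-case what follows
(my $lower = "ARG SER")     =~ s/(.+)/\L$1/;        # \L: lower-case what follows
(my $title = "ARG ser cYs") =~ s/(\w+)/\u\L$1/g;    # \u\L: capitalize each word

print "$upper\n$lower\n$title\n";
```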
split • split /separator/, $string; • @aa = split / /, $aa; • split /separator/; • @aa = split /:/; # uses $_, e.g. "a:b:c:d" • split //; • @data = split //; # still $_, splits into single characters • split; # like split /\s+/, $_; • @data = split; # splits $_ at whitespace
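The three forms of split can be sketched on made-up strings. The bare split is shown on a line with leading and trailing whitespace, which it skips.

```perl
use strict;
use warnings;

my @fields = split /:/, "a:b:c:d";     # explicit separator and string
my @chars  = split //,  "ACGT";        # empty pattern: one character per field

$_ = "  dnaA  argC\trnpA\n";
my @genes = split;                     # default: split $_ at whitespace

print scalar(@fields), " fields, ", scalar(@chars), " chars, genes: @genes\n";
```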