Perl (2) Hongkang Mei, Ph.D. March 10, 2002

Review of Perl (1) • More on I/O • Regular Expression basics • More on regular expression • Using regular expressions • File and directory handles

Scalar data: a single thing, one number or one string, used interchangeably and acted upon with operators (a scalar variable holds one scalar value) • List data: an ordered list of scalars (an array is a variable that contains a list; a hash, or associative array, is a variable that contains a list of scalar pairs, each key associated with a value)

Scalar variables: '$' followed by a Perl identifier • Names should be descriptive • Perl built-in scalar variables: $ARGV, $_, …… • *Perl identifier: letters, '_' and digits, not beginning with a digit

Numeric operators • ++ incrementing the value • $counter++; • $v = $counter++; • is different from • $v = ++$counter; • -- decrementing the value
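
The difference between the two assignment forms above can be sketched as follows; the starting value 5 is an arbitrary choice for illustration.

```perl
use strict;
use warnings;

my $counter = 5;
my $post = $counter++;   # $post gets the old value (5), then $counter becomes 6
my $pre  = ++$counter;   # $counter becomes 7 first, then $pre gets 7

print "post=$post pre=$pre counter=$counter\n";
```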

List data • list literals: scalars separated by ',' inside () • (1, 2, 3, 4, 5) • ("dnaA", "argC", "rnpA") • qw/ dnaA argC rnpA / • range operator .. • (1..5) # (1, 2, 3, 4, 5) • (1.2..5.7) # same; the operands are truncated to integers • (5..1) # empty; the range only counts up • ($a..$b) # depends on the current values • *The qw shortcut treats its contents like a single-quoted ('') string and accepts any punctuation pair as delimiters: / /, "", {}, [], (), <>, ##, !!
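
A minimal sketch of the list literals on this slide, using the slide's gene names; the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my @genes = qw/dnaA argC rnpA/;   # same as ("dnaA", "argC", "rnpA")
my @up    = (1 .. 5);             # (1, 2, 3, 4, 5)
my @trunc = (1.2 .. 5.7);         # operands are truncated, so also (1, 2, 3, 4, 5)
my @empty = (5 .. 1);             # the range only counts upward, so this is empty

print "genes: @genes\n";
print "up: @up\n";
print "trunc: @trunc\n";
print "empty has ", scalar(@empty), " elements\n";
```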

Array variables • @ + identifier • no unnecessary limits • Array elements are scalar variables, addressed by integer indices starting at 0 (0, 1, 2, 3, 4, ……)

Accessing array elements • done by naming the element as a scalar variable: • $seq[0] • print $seq[3]; • $seq[1] = 'acg'; • $seq is a different thing! • arrays (@) and scalars ($) have different namespaces

Hashes • A hash is a variable containing a list of paired scalar values, each key associated with a value • % + identifier • no unnecessary limits • keys and values are both scalars

Hash element access • $hash{$key} • $seq{"dnaA"} = "CAGACTCGAT"; • foreach $gene (qw/dnaA argC rnt/) { • print "The sequence for $gene is $seq{$gene}.\n"; • } • $key can be any expression • $seq{"unknown"} # undef
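
The loop on this slide prints an empty value for genes that are not in the hash. A possible way to guard against that is sketched below; the argC sequence is made up for illustration, and the @report array is a hypothetical helper.

```perl
use strict;
use warnings;

my %seq = (dnaA => "CAGACTCGAT");   # value taken from the slide
$seq{"argC"} = "ATGGCA";            # made-up sequence, for illustration only

my @report;
foreach my $gene (qw/dnaA argC rnt/) {
    if (defined $seq{$gene}) {
        push @report, "The sequence for $gene is $seq{$gene}.";
    } else {
        # a key that was never stored gives undef
        push @report, "No sequence stored for $gene.";
    }
}
print "$_\n" foreach @report;
```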

Interpolation of variables into strings • Scalar: • print $aa_seq; • print "The sequence is $aa_seq.\n"; • print "The file contains $count ${type}s.\n"; • Array: • print "The list contains @array\n"; • print @array; • print 3 * @array; • Hash: • print "The AC# for 'dnaA' is $g_ac{'dnaA'}"; • NO interpolation for the whole hash!! • printf "The %s has %d AAs.\n", $prot, $len;

SCALAR and LIST CONTEXT • Using the same variable in different contexts means different things, depending on what Perl is expecting • 5 + @aa; # scalar • sort @aa; # list • @list = @aa; • @list = $aa; • $aa[0] = @list; • print "The full aa list is @aa.\n"; • print "The number of aa is " . @aa . ".\n"; • print @aa;
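
A short sketch of how the same array behaves in each context; the three-element list is an arbitrary example.

```perl
use strict;
use warnings;

my @aa = qw/ala arg asn/;

my $count   = @aa;        # scalar context: the number of elements
my $sum     = 5 + @aa;    # a numeric operator also forces scalar context
my @copy    = @aa;        # list context: all elements are copied
my ($first) = @aa;        # list assignment: only the first element is kept
my $joined  = "@aa";      # interpolation joins the elements with spaces
my $message = "The number of aa is " . @aa . ".";   # concatenation is scalar context

print "$count $sum $first\n";
print "$joined\n$message\n";
```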

Control structures • if (true) {…} elsif (true) {…} else {…} • while (true) {…} • foreach $line (list) {…} • for ($i=1; $i<11; $i++) {…} • unless (true) {…} # like if (false) {…} • until (true) {…} # like while (false) {…}

Control structures • autoincrement autodecrement • $n++; $n--; • ++$n; --$n; • $m = $n++; • $m = ++$n; • $m = $n; $n++; • logical operators • &&, ||, ! • and, not, or

Control structures • expression modifiers • print "Acidic\n" if $pH < 7; • print " ", ($n += 2) while $n < 10; • print "$aa{$_}\t" foreach (keys %codon); • short-circuit operator • my $n_aa = $aa{$codon} || "not in the list"; • the ternary operator ?: • $aa = ($pI{$aa} < 7) ? "acidic" : • ($pI{$aa} == 7) ? "neutral" : • "basic";
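
A minimal sketch of the chained ternary and the short-circuit default. The subroutine name classify_pI and the one-entry codon table are assumptions made for this example, not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a pI value with a chained ternary.
# Note the numeric comparison ==, not the assignment =.
sub classify_pI {
    my ($pI) = @_;
    return ($pI < 7)  ? "acidic"
         : ($pI == 7) ? "neutral"
         :              "basic";
}

my %aa = (AUG => "Met");   # one-entry codon table, for illustration

# || supplies a default when the lookup is false (here: missing)
my $known   = $aa{"AUG"} || "not in the list";
my $unknown = $aa{"XXX"} || "not in the list";

print classify_pI(5.5), " ", classify_pI(7), " ", classify_pI(9.7), "\n";
print "$known, $unknown\n";
```

One caveat with || as a default: it also replaces legitimate false values such as 0 or the empty string.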

Subroutines • functions or subroutines • define: • sub my_funct { • $dna_length = 3 * length($aa_seq); • print "DNA is $dna_length basepairs\n"; • } • Invoke: • &my_funct;
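
The slide's subroutine reads and writes global variables. One alternative, sketched here under the assumption that passing the sequence as an argument is acceptable, uses @_ and return; the name dna_length is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical variant of my_funct: takes the protein sequence as an
# argument and returns the coding length instead of setting a global.
sub dna_length {
    my ($aa_seq) = @_;
    return 3 * length($aa_seq);
}

my $len = dna_length("MKV");   # 3 residues -> 9 bases
print "DNA is $len basepairs\n";
```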

Built-in functions • print • chomp • defined • chop • reverse, sort • pop, push, shift, unshift • return • length • scalar # a fake one • …… • perlfunc manpage

<STDIN>: get user input • from the command line: • chomp ($a = <STDIN>); print $a; • # a line of input ends at the newline • file redirection: • %> myprog.pl < my_input.txt • …… • $line_n = 1; • while (<STDIN>) { • print "$line_n\t$_"; • $line_n++; • }

<>: get user input from files named on the command line • %> myprog.pl input1 - input2 • ('-' stands for standard input) • …… • $line_n = 1; • while (<>) { • print "$line_n\t$_"; • $line_n++; • } • The difference between <> and <STDIN>: • <> works from @ARGV

more on print • buffer • print <>; #string operator • # works like cat on the command line • print() as a function • print (3+4)*5; # prints 7: parsed as (print(3+4))*5 • print "The result is: ", (3+4)*5; # prints 35 after the text

printf • printf "The mutation is at %s position.\n", $count_mut; • %s, %f, %d, %g…… • %2d • %-12s (left justified) • %12.3f (right justified) • %: a % in a string does not interpolate a whole hash • %% to print '%'
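
A small sketch of the format specifiers above. It uses sprintf, which takes the same format string as printf but returns the result instead of printing it; the values are arbitrary.

```perl
use strict;
use warnings;

# %-12s left-justifies in 12 columns, %2d pads to 2 columns,
# %12.3f right-justifies with 3 decimals, %% is a literal percent sign
my $line = sprintf "%-12s|%2d|%12.3f|%d%%", "dnaA", 7, 3.14159, 90;

print "$line\n";
```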

regular expression or pattern • a mini-program • either matches or doesn't match a given string • can match any number of different strings • it doesn't matter how many times it matches within a string • works like grep • $_ = "ADCSFTSCGNYEQ"; # a bare /pattern/ tests $_ • if (/SFT/) { • print "It has the motif \"SFT\".\n"; • }

metacharacters • . matches any character but "\n" • \ escape (/3\.14/) • () grouping

simple quantifiers • the following quantifiers repeat the preceding item • * 0 or more times • + 1 or more times • ? 0 or 1 time

the '|' alternative pattern • /T|S/ • /protein(and|or)DNA/ • /arg(ser|cys)lys/

character classes • [] matches any single character inside • [AGCT] # any deoxynucleotides • [a-zA-Z0-9]+ # 1 or more of letters or digits • [;\-,] # ‘-’ needs to be escaped

character classes shortcuts • \d [0-9] • \w [A-Za-z0-9_] # only a char, \w+ a word • \s [\f\t\n\r ] # whitespace • negating the shortcuts • \D [^\d] • \W [^\w] • \S [^\s] • can be part of a larger class • [\dA-F] • [\d\D] (any char) • [^\d\D] (nothing)

general quantifiers • * 0 or more repetitions • + 1 or more • ? 0 or 1 • {3,5} 3 to 5 (no space inside the braces) • {3,} 3 or more • {3} exactly 3 repetitions • /U{5,8}/ • /\w{8}/ • /A{15,100}/ • /(arg){2,}/ • * is {0,} • how about + and ?
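
A brief sketch of the general quantifiers, which also answers the closing question: + is {1,} and ? is {0,1}. The test strings are made up.

```perl
use strict;
use warnings;

my $poly_a = "A" x 20;
my $long_tail  = ($poly_a     =~ /A{15,100}/)     ? 1 : 0;  # 20 As: matches
my $short_tail = (("A" x 10)  =~ /A{15,100}/)     ? 1 : 0;  # only 10 As: no match
my $repeat     = ("argargarg" =~ /^(arg){2,}$/)   ? 1 : 0;  # group repeated 3 times
my $single     = ("arg"       =~ /^(arg){2,}$/)   ? 1 : 0;  # only once: no match

# + behaves like {1,} and ? behaves like {0,1}
my $plus_equiv = ("AAA" =~ /^A{1,}$/   && "AAA" =~ /^A+$/)   ? 1 : 0;
my $opt_equiv  = ("AC"  =~ /^AG{0,1}C$/ && "AC"  =~ /^AG?C$/) ? 1 : 0;

print "$long_tail $short_tail $repeat $single $plus_equiv $opt_equiv\n";
```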

anchors • ^ marks beginning of the string • /^ATG/ # initiation codon • [^AGCT] # ? • $ marks the end • /(UA[AG]|UGA)$/ # stop codons • /^\s*$/ # a blank line

word anchors • \b word boundary anchor • matches at either end of a word • /\barg/ # arg, arginine, arginyl, argue…… • /\barg\b/ # arg • \B nonword boundary anchor • matches at any point where \b would not • /\barg\B/ # arginine, arginyl, argue……
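
The three patterns can be compared on a small word list, as sketched below. Perl's built-in grep is used to filter the list; the word "sarg" is a made-up test case showing that \b rejects a match in the middle of a word.

```perl
use strict;
use warnings;

my @words = qw/arg arginine argue sarg/;

my @starts = grep { /\barg/ }    @words;   # words beginning with arg
my @whole  = grep { /\barg\b/ }  @words;   # the whole word arg only
my @prefix = grep { /\barg\B/ }  @words;   # arg followed by more word characters

print "starts: @starts\n";
print "whole: @whole\n";
print "prefix: @prefix\n";
```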

memory () • () grouping • matched part kept in memory • /A(ACGT)T/ # ACGT in memory • backreferences • \1 \2 • /(GAATTC).*\1/ # can EcoRI cut the insert out? • /(.)\1/ NOT /../ # two identical chars vs. any two chars • memory variables • $1
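
A short sketch of backreferences. It uses GAATTC, the EcoRI recognition site, and made-up sequences; the variable names are illustrative.

```perl
use strict;
use warnings;

my $insert = "TTGAATTCACGTACGTGAATTCAA";   # made-up insert with two EcoRI sites

# \1 must match the same text that the first group captured
my $flanked = ($insert =~ /(GAATTC).*\1/) ? 1 : 0;
my $one_site = ("TTGAATTCACGT" =~ /(GAATTC).*\1/) ? 1 : 0;   # only one site

# /(.)\1/ needs two identical characters; /../ accepts any two
my $doubled_char = ("ACCT" =~ /(.)\1/) ? $1 : "";
my $no_double    = ("ACGT" =~ /(.)\1/) ? 1 : 0;
my $any_two      = ("ACGT" =~ /../)    ? 1 : 0;

print "$flanked $one_site $doubled_char $no_double $any_two\n";
```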

precedence • which parts of the pattern stick together more tightly • () parentheses • *+?{} quantifiers • ^$\b\B anchors and sequence • | alternation • atoms: chars, classes, backreferences • examples • /^fred|barney$/ • /^(\w+)\s+(\w+)$/

m// • a more general pattern match operator • can use any pairs of delimiters • // • m,, m!! m^^ m## • m<> m{} m[] m() • example: • m%^http://% is better than /^http:\/\//

option modifiers • /i case insensitive • matches both cases for all letters • /\byes\b/i # Yes yes YES • /s lets . match any character, including "\n" • without /s, [\d\D] does the same • $_ = "ACGTTTGCG\nAACACGT"; • /^(ACG).*(CGT)$/s • do not confuse with the \s shortcut

combining option modifiers • /si # both /s and /i • $_ = "aCGTTTGCG\nAACAcGT"; • if (/^(ACG).*(CGT)$/si) { • print "That sequence begins with ACG ", • "and ends with CGT.\n"; • } • other options
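
The effect of each modifier on the slide's two-line sequence can be checked one step at a time, as in this sketch.

```perl
use strict;
use warnings;

my $seq = "aCGTTTGCG\nAACAcGT";   # the sequence from the slide

# No modifiers: the lowercase 'a' already prevents a match
my $plain  = ($seq =~ /^(ACG).*(CGT)$/)   ? 1 : 0;
# /i alone: case is ignored, but . still cannot cross the newline
my $i_only = ($seq =~ /^(ACG).*(CGT)$/i)  ? 1 : 0;
# /si: . crosses the newline and case is ignored
my ($begin, $end) = ("", "");
my $both = 0;
if ($seq =~ /^(ACG).*(CGT)$/si) {
    $both = 1;
    ($begin, $end) = ($1, $2);    # the text as it appears in the string
}

print "$plain $i_only $both $begin $end\n";
```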

the binding operator =~ • if (/\w+/i){……} # only works on $_ • if ($seq =~ /^(ACG).*(CGT)$/si){ • print "That sequence begins with ACG ", • "and ends with CGT.\n"; • } • $prot_seq = (<STDIN> =~ /[^ACGT]/i); # true or false, not the input line • if ($prot_seq) {blastp;}

interpolating into patterns • my $p = "arg"; • if ($seq =~ /($p)$/si){ • print "That sequence ends with $p.\n"; • } • $profile = shift @ARGV; # get the first command-line argument • if ($prot_seq =~ /$profile/si) { • print "$prot_seq has motif $profile.\n"; • }

the match variables • /(A)\1/ # use \1 inside the pattern • $1 # holds the memory value in Perl code • if ($seq =~ /(g.)\1/si){ • print "That sequence has a $1 repeat.\n"; • } • if ($prot_seq =~ /([stavli]{3,}).*([Deq]{3,})/si) { • print "$prot_seq has hydrophobic region $1 ", • "followed by hydrophilic region $2.\n"; • }
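
A runnable sketch of the two matches on this slide. Both sequences are made up for illustration, and the captured values are copied into named variables inside the if block.

```perl
use strict;
use warnings;

my $seq      = "TTGAGACC";      # made-up DNA with a "GA" repeat
my $prot_seq = "MAVLISTDEQK";   # made-up protein sequence

my $repeat = "";
if ($seq =~ /(g.)\1/si) {
    $repeat = $1;               # \1 is used in the pattern, $1 in the code
}

my ($hydrophobic, $hydrophilic) = ("", "");
if ($prot_seq =~ /([stavli]{3,}).*([deq]{3,})/si) {
    ($hydrophobic, $hydrophilic) = ($1, $2);
}

print "repeat: $repeat\n";
print "hydrophobic: $hydrophobic, hydrophilic: $hydrophilic\n";
```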

the persistence of match • the next successful match will overwrite the earlier one • store your $1 away! • if ($prot_seq =~ /([cstv]+)/si) { • my $motif = $1; • } • test your match before using $1 • it could be a leftover • $prot_seq =~ /([cstv]+)/si; • print "I found the motif $1, correct?\n";
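
The leftover problem can be reproduced in a few lines, as sketched here with made-up sequences: a failed match leaves $1 untouched, so it still holds the value from the earlier success.

```perl
use strict;
use warnings;

my $first_ok = ("MKCSTVAG" =~ /([cstv]+)/i) ? 1 : 0;   # succeeds
my $motif    = $1;                                     # store it away: "CSTV"

my $second_ok = ("GGGG" =~ /([cstv]+)/i) ? 1 : 0;      # fails
my $leftover  = $1;                                    # still the old value

print "first=$first_ok motif=$motif\n";
print "second=$second_ok leftover=$leftover\n";
```

Checking the return value of the match (here $second_ok) before reading $1 avoids reporting a motif that was never found in the second sequence.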

automatic match variables (PP121) • $& the matched part of the string • $` the part before the match • $' the part after the match • $`$&$' the whole string • …… • print "The matched string is the following, ", • "the part matched is in <>:\n", • "$`<$&>$'\n";

substitutions with s/// • m// search • s/// search and replace • s/match_pattern/replacement_string/ • returns true if successful, false if not • the replacement string can contain: • $1 • nothing (empty) • $& • words • whitespace • ……

examples of s/// • if (s/([a-z]{3})([cstyleu]{3})/$2/){ • print "The mutant protein has a $1 deletion ", • "before $2.\n"; • } • s/(arg)/$1$1/; # arg insertion • s/arg/cys/; # cys substitution • s/\s+//g; # get rid of all whitespace • s/\s+/ /g; # single space delimiters • s/[^acgt]//gi; # clean up DNA sequence • s/[tT]/U/g; # translate to RNA sequence • s/_END_.*//s; # chop off after END mark
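
Several of the substitutions above, applied to made-up strings. The (my $copy = $original) =~ s/…/…/ form is used so that the original string stays unchanged; that idiom is an addition for this sketch.

```perl
use strict;
use warnings;

my $raw = " ac gt\nNNacgTT x";             # made-up messy DNA input
(my $dna = $raw) =~ s/[^acgt]//gi;         # keep only a, c, g, t (either case)
(my $rna = $dna) =~ s/[tT]/U/g;            # T -> U

my $text = "arg   lys\t\tser";
(my $single = $text) =~ s/\s+/ /g;         # collapse runs of whitespace
my $changed = ($text =~ s/arg/cys/);       # true when a substitution happened

my $record = "ACGT_END_junk\nmore junk";
$record =~ s/_END_.*//s;                   # /s lets .* run past the newline

print "$dna $rna\n$single\n$text\n$record\n";
```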

s/// different delimiters • just like: • m// • qw// • can use unpaired or paired delimiters • ,, “” {} [] %% ## • s#^https://#http://# • s{T}{U} • s<T>#U#

binding operator for s/// • works for non-default variables • $dna_seq =~ s/[^acgt]/n/gis;

case shifting • $dna =~ s/(.+)/\U$1/; • $prot =~ s/(.+)/\L$1/; • $prot =~ s/(\w+)/\u\L$1/gi;
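
A small sketch of the three case-shifting substitutions, applied to made-up strings.

```perl
use strict;
use warnings;

(my $dna   = "acgt")    =~ s/(.+)/\U$1/;          # \U: upper-case everything after it
(my $prot  = "ARG LYS") =~ s/(.+)/\L$1/;          # \L: lower-case everything after it
(my $names = "aRG lYS") =~ s/(\w+)/\u\L$1/gi;     # lower-case, then capitalize the first letter

print "$dna\n$prot\n$names\n";
```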

split • split /separator/, $string; • @aa = split / /, $aa; • split /separator/; • @aa = split /:/; # uses $_, e.g. "a:b:c:d" • split //; • @data = split //; # still $_, splits into single chars • split; # like split /\s+/, $_ but leading whitespace is skipped • @data = split; # splits $_ at whitespace
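
The forms of split above, tried on made-up strings. The last two lines show the one difference between a bare split and split /\s+/: the bare form skips leading whitespace, while /\s+/ produces an empty first field.

```perl
use strict;
use warnings;

my @fields = split /:/, "a:b:c:d";   # four fields
my @chars  = split //, "ACGT";       # one element per character

$_ = "  ala  arg\tasn\n";
my @aa    = split;                   # bare split on $_: leading whitespace skipped
my @by_ws = split /\s+/, $_;         # explicit pattern: empty leading field

print scalar(@fields), " fields, chars: @chars\n";
print scalar(@aa), " vs ", scalar(@by_ws), " elements\n";
```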
