Strings and Regular Expressions in PHP, or “PCRE, POSIX, and Bears, Oh My!” • UPHPU Meeting, January 18, 2005 • Mac Newbold, mac@macnewbold.com
Who am I? • Full-time self-employed computer geek • MNE, LLC (macnewbold.com, owner) and • Digital Media Consulting, LLC (a.k.a. Dmedia, www.dmedia.ws, partner) • Wide variety of PHP-driven web sites, mostly with MySQL and without Javascript and Flash • Background: B.S. C.S. ’01, M.S. C.S. ’05 • University of Utah – Go Utes! UPHPU - Mac Newbold
Campaign Promises • Intro to Strings in PHP • (Feel free to tell me how fast or slow to go) • Functions relating to HTML, SQL, etc. • Regular Expressions • PCRE • POSIX • Performance/Speed considerations • Grab bag of cool string functions
Introducing: Strings in PHP • Much like strings in any other language • Major difference: the boundary between string, integer, float, and boolean is very blurred • Actually a benefit: if a value is not a string but should be, PHP converts it automatically • Though this can lead to some unexpected results • Info in PHP Manual: • www.php.net/strings • www.php.net/manual/en/language.types.string.php
String Syntax • Single quotes: 'a string' • No variable interpolation; \' and \\ are the only escape codes • Double quotes: "a $better string\n" • Variables work, standard escape codes work • “Here-doc” syntax: $foo = <<<END … END; • Great for large multi-line blocks of text or HTML • Variables are interpolated • Gotchas: a newline must follow <<<END • The closing END; must be the entire line, with no whitespace (PHP 7.3 and later also accept an indented closing marker)
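The three quoting styles on this slide can be compared side by side. This is a minimal sketch; the variable names and sample text are made up for illustration.

```php
<?php
$name = 'Mac';

// Single quotes: no interpolation, and \n stays as two literal characters.
$single = 'Hi $name\n';

// Double quotes: $name is interpolated and \n becomes a real newline.
$double = "Hi $name\n";

// Here-doc: behaves like a double-quoted string that spans several lines.
$heredoc = <<<END
Hello, $name.
Second line.
END;

echo strlen($single), "\n";   // 10 characters, all literal
echo strlen($double), "\n";   // 7 characters: "Hi Mac" plus one newline
echo $heredoc, "\n";
```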
String Operators • Array-like character access (offsets start at 0): • $str = "MyBigString" => $str{2} == "B" (PHP 8 accepts only the $str[2] form) • Concatenation: the dot operator • "This lets you join strings into " . "bigger ones" • Note: Avoiding embedded newlines “in strings that wrap onto multiple lines” is a good idea • Concatenating assignment: .= • $str = "My name is"; $str .= " Mac.\n";
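A short sketch of the operators above. It uses square-bracket offsets, which is an assumption about the PHP version in use (the curly-brace form no longer parses in PHP 8). Counting from zero, the "B" in "MyBigString" sits at offset 2.

```php
<?php
$str = "MyBigString";

// Offsets are zero-based: M=0, y=1, B=2.
$third = $str[2];

// The dot operator joins strings; .= appends to an existing string.
$greeting = "My name is";
$greeting .= " Mac.";

echo $third, "\n";      // B
echo $greeting, "\n";   // My name is Mac.
```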
Variables in Strings • "Simple string with a $var in it\n" • "You can use $an_array[$var] too\n" • "Sometimes you need ${curl}ies to mark where the {$var}iable ends" • "Curlies help on {$big['fancy'][$stuff]} too" • "Where it's confusing to embed " . $big['ugly'][$var] . "iables, break it up as needed with concatenation."
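The interpolation forms above, as a runnable sketch. The array contents are hypothetical sample data chosen so that each line prints something recognizable.

```php
<?php
$var = 'color';
$an_array = ['color' => 'red'];
$big = ['fancy' => ['color' => 'blue']];

$simple  = "Simple string with a $var in it";
$element = "You can use $an_array[$var] too";

// Without the braces PHP would look for a variable named $varful.
$braced = "A {$var}ful string";

// Braces are required for multi-level or quoted-key array access.
$nested = "Curlies help on {$big['fancy'][$var]} too";

// Concatenation produces the same text and is sometimes easier to read.
$joined = "Curlies help on " . $big['fancy'][$var] . " too";

echo $simple, "\n", $element, "\n", $braced, "\n", $nested, "\n";
```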
Must-Have String Functions • www.php.net/strings • echo/print – (print $foo)==1, echo "can", $take, "more than one", "argument"; • Echo shortcut: <b><?=$foo?></b> • trim, ltrim, rtrim/chop – remove whitespace • explode, implode/join • $arr = explode(" ", "List of words"); • $str = implode(",", $arr);
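A small sketch chaining trim, explode, and implode. The input line is invented for the example.

```php
<?php
$line = "  List of words \n";

$clean = trim($line);            // strips whitespace from both ends
$words = explode(" ", $clean);   // split on single spaces
$csv   = implode(",", $words);   // join the pieces back with commas

echo count($words), "\n";   // 3
echo $csv, "\n";            // List,of,words
```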
Obligatory C-like Functions • All your old favorites are in there: • printf, sprintf, sscanf, fprintf • strcmp, strlen, strpos, strtok • They all do just what you expect, though many of them have easier alternatives • Gotcha: Some of them (like strpos and friends) return boolean false on failure, because 0 is a valid result (a match at offset 0). Always test with "=== false".
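Why the strict comparison matters, as a minimal sketch: a match at offset 0 and a failed search look the same under loose comparison.

```php
<?php
$haystack = "foobar";

$found   = strpos($haystack, "foo");   // int 0: match at the very start
$missing = strpos($haystack, "xyz");   // bool false: no match

// Loose comparison treats 0 and false as equal, so this check misfires.
$looseSaysMissing = ($found == false);

// Strict comparison distinguishes "found at 0" from "not found".
$strictSaysMissing = ($found === false);

var_dump($looseSaysMissing);    // bool(true)
var_dump($strictSaysMissing);   // bool(false)
var_dump($missing === false);   // bool(true)
```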
Basic String Manipulation • Any of this can be done with regular expressions as well… • and in more complex cases, can only be done with regular expressions • But regular expressions are slower (more later) • str_replace("bar", "baz", "foobar"); • str_repeat("1234567890", 8);
Formatting functions • strtolower, strtoupper • ucfirst, ucwords – uppercase first char, or first char of each word • wordwrap – wrap text to a given width • str_pad("tooshort", 15, " "); • vprintf, vfprintf, vsprintf – formatted output, with the arguments passed as an array • number_format – add thousands grouping • money_format – format as currency
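A few of the formatting functions in action. This is a sketch with made-up inputs; the pad character is a dot rather than a space only so that the padding is visible.

```php
<?php
$first   = ucfirst("hello world");                    // Hello world
$each    = ucwords("hello world");                    // Hello World
$padded  = str_pad("tooshort", 15, ".");              // padded on the right to 15 chars
$grouped = number_format(1234567.891, 2);             // 1,234,567.89
$wrapped = wordwrap("The quick brown fox", 10, "\n"); // breaks between words

echo $first, "\n", $each, "\n", $padded, "\n", $grouped, "\n", $wrapped, "\n";
```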
Special-Purpose Functions • One of PHP’s strengths is the way it caters to the common things people need • Many string functions are specifically for use with things like dates/times, URLs, HTML, and SQL databases • Advice: When you need them, use them. “Rolling your own” doesn’t usually work out the way you plan it.
Date and Time Functions • www.php.net/datetime • A variety of functions to not only do calculations with dates, but to convert dates to strings – date(), strftime() • And more importantly, to convert strings to dates – strtotime(), strptime() • Great example of why not to “roll your own”, even if it doesn’t seem that complex at first
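A sketch of the string-to-date and date-to-string round trip using strtotime() and date(). The 19:00 meeting time is a hypothetical value, and the time zone is pinned to UTC only to make the output reproducible.

```php
<?php
date_default_timezone_set('UTC');

// String -> timestamp.
$meeting = strtotime('2005-01-18 19:00:00');

// Timestamp -> formatted string.
$formatted = date('D, d M Y H:i', $meeting);

// Relative expressions are resolved against a base timestamp.
$nextWeek = date('Y-m-d', strtotime('+1 week', $meeting));

echo $formatted, "\n";   // Tue, 18 Jan 2005 19:00
echo $nextWeek, "\n";    // 2005-01-25
```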
URL Functions • www.php.net/url • urlencode, urldecode • Turn non-alphanumerics to %[hex] and ' ' -> '+' • rawurl{en,de}code do the same, except that a space becomes %20 instead of '+' • parse_url – break into host, path, query, etc. • http_build_query – turn array to URL query • base64_{en,de}code – base64 conversions for use with MIME, etc.
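The URL helpers side by side, as a sketch. The example.com URL and the query values are placeholders.

```php
<?php
$encoded    = urlencode("a b&c");      // space becomes +
$rawEncoded = rawurlencode("a b&c");   // space becomes %20

$parts = parse_url("http://www.example.com/path/page.php?q=strings");

// The separator is passed explicitly so the result does not depend on ini settings.
$query = http_build_query(['q' => 'php strings', 'page' => 2], '', '&');

echo $encoded, "\n";         // a+b%26c
echo $rawEncoded, "\n";      // a%20b%26c
echo $parts['host'], "\n";   // www.example.com
echo $parts['path'], "\n";   // /path/page.php
echo $parts['query'], "\n";  // q=strings
echo $query, "\n";           // q=php+strings&page=2
```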
HTML Functions • htmlspecialchars – encode &, ", <, and > as &amp;, &quot;, &lt;, and &gt; • htmlentities is the same but for every character that has a named entity • html_entity_decode is the reverse • nl2br – insert <br /> tags before newlines (\n) • parse_str – parse GET query into variables or an array (see also: extract) • strip_tags – strip html tags [selectively]
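A sketch of the HTML helpers on a made-up input string. parse_str is called in its array form, which is the only form PHP 8 accepts.

```php
<?php
$input = '<b>"Tom" & Jerry</b>';

$escaped  = htmlspecialchars($input);   // markup becomes harmless text
$stripped = strip_tags($input);         // tags removed, text kept
$restored = html_entity_decode($escaped);
$breaks   = nl2br("one\ntwo");          // <br /> inserted before the newline

parse_str("a=1&b=two", $vars);          // fills $vars with the query fields

echo $escaped, "\n";    // &lt;b&gt;&quot;Tom&quot; &amp; Jerry&lt;/b&gt;
echo $stripped, "\n";   // "Tom" & Jerry
echo $breaks, "\n";
print_r($vars);
```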
SQL Functions • “Magic Quotes” – on by default • Misnamed – adds magic slashes, not quotes • addslashes, stripslashes – escape ', ", and \ • Advice: do db queries first, then use $var = htmlspecialchars(stripslashes($input)) for use in <input value='$var'> tags • quotemeta – escape . \ + * ? [ ^ ] ( $ ) • For commands run through system() and `backticks`, escapeshellarg and escapeshellcmd are the functions built for the job; quotemeta leaves quotes, spaces, and semicolons untouched
Now for the fun stuff… • Intro to Strings in PHP • (Feel free to tell me how fast or slow to go) • Functions relating to HTML, SQL, etc. • Regular Expressions • PCRE • POSIX • Performance/Speed considerations • Grab bag of cool string functions
Regular Expressions • Extremely powerful tool for pattern matching – the same technique compilers and interpreters use to break your programs into tokens • Two flavors in PHP: • PCRE – Perl-Compatible Regular Expressions • POSIX Extended • I favor PCRE – multiple languages, more features, faster, and binary-safe
Basics of RE’s • They match patterns – the magic is in the pattern you tell them to match • They have to be precise, including and excluding exactly what you want • People get scared of them because the details can be tricky • But they’re one of the best tools you have for doing some pretty fancy string stuff
RE Patterns • Start with strings and grouping: “abc(def)” • Add alternative branches: “abc(def|123)” • Wildcard: . matches any char but \n • Quantifiers/Repeating: • * = “0 or more”, + = “1 or more”, ? = “0 or 1” • {n} = “n times”, {n,m} = “n to m times” • “(abc)+(def|123)*(.{2})*” • At least one abc, maybe some triplets, then an even number of characters
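The combined pattern from this slide can be tried with preg_match. In this sketch the ^ and $ anchors are an addition so that the whole string has to fit the pattern; without them the pattern would match any string that merely contains "abc". The test strings are invented.

```php
<?php
// At least one "abc", then any number of "def"/"123" triplets,
// then an even number of arbitrary characters.
$pattern = '/^(abc)+(def|123)*(.{2})*$/';

$results = [];
foreach (['abcabcdef123', 'abcxy', 'abcx', 'xyzabc'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
    echo $candidate, ': ', $results[$candidate], "\n";
}
```

The first two candidates match; "abcx" leaves an odd number of trailing characters and "xyzabc" does not begin with "abc", so both are rejected.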
Character Classes and Types • [] makes character classes • List of characters and ranges: [a-zA-Z0-9] • If you want a literal -, put it at the beginning or end of the class • Escape any special chars with \ as usual • If first char is ^, class is negated • \d = [0-9], \D = [^0-9] • \s = whitespace, \S = non-whitespace • \w = [a-zA-Z0-9_], \W = [^a-zA-Z0-9_] • \b = word boundary – “zero-width assertion”
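A sketch exercising the classes and shorthand types above. The sample strings are made up.

```php
<?php
$alnum     = preg_match('/^[a-zA-Z0-9]+$/', 'abc123');      // only letters and digits
$phone     = preg_match('/^\d{3}-\d{4}$/', '555-1234');     // \d is shorthand for [0-9]
$nonDigit  = preg_match('/[^0-9]/', '12345');               // negated class finds nothing
$hyphen    = preg_match('/^[-a-z]+$/', 'well-known');       // leading - is a literal hyphen
$wordAlone = preg_match('/\bcat\b/', 'the cat sat');        // "cat" as a whole word
$wordInner = preg_match('/\bcat\b/', 'concatenate');        // "cat" buried inside a word

echo "$alnum $phone $nonDigit $hyphen $wordAlone $wordInner\n";   // 1 1 0 1 1 0
```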
Anchors • What if you want to force it to match only at the beginning of the string? Or to match the entire string? • Use an anchor! • ^ as the first char anchors the beginning • $ as the last char anchors the end • (Varies slightly in multi-line mode)
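The effect of each anchor, including the multi-line variation, as a short sketch with invented subjects.

```php
<?php
$unanchored = preg_match('/foo/', 'a foo b');     // matches anywhere
$atStart    = preg_match('/^foo/', 'a foo b');    // "foo" is not at the start
$atEnd      = preg_match('/foo$/', 'bar foo');    // "foo" is at the end
$whole      = preg_match('/^foo$/', 'foo bar');   // the string is more than "foo"

// In multi-line mode ^ and $ also match at line breaks inside the subject.
$singleLine = preg_match('/^b$/', "a\nb\nc");
$multiLine  = preg_match('/^b$/m', "a\nb\nc");

echo "$unanchored $atStart $atEnd $whole $singleLine $multiLine\n";   // 1 0 1 0 0 1
```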
Greediness and Modifiers • Regular Expressions are Greedy • They’ll keep eating characters as long as they can keep matching. • Consider: “<.*>” vs. “<[^>]*>” when matching against “<b>Hi</b>”: the first grabs the whole string, the second only “<b>” • PCRE has modifiers: /<pattern>/<mods> • /i = case insensitive • /U = un-greedy • /m = multi-line
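The greedy and non-greedy behavior can be observed directly by capturing what each pattern matched. A minimal sketch:

```php
<?php
$html = "<b>Hi</b>";

// Greedy: .* runs to the end, then backs up to the last ">".
preg_match('/<.*>/', $html, $greedy);

// Negated class: cannot cross a ">", so it stops at the first tag.
preg_match('/<[^>]*>/', $html, $negated);

// /U flips the quantifiers to un-greedy.
preg_match('/<.*>/U', $html, $ungreedy);

echo $greedy[0], "\n";     // <b>Hi</b>
echo $negated[0], "\n";    // <b>
echo $ungreedy[0], "\n";   // <b>
```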
Back References • Most commonly used in replace operations, but can be used in match patterns as well • Parentheses not only group, but capture too • Use \ followed by the number of the capture • “ab(.)\1(.)\2” will match abccdd or abxxyy, but not abcccd or abdcdc • Can get tricky to count which backref goes where with nested parentheses
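A sketch that checks the four strings from the slide and also shows a back-reference in a replacement. The anchors and the "hello world" swap are additions for illustration.

```php
<?php
// \1 and \2 must repeat whatever the first and second groups captured.
$pattern = '/^ab(.)\1(.)\2$/';

$results = [];
foreach (['abccdd', 'abxxyy', 'abcccd', 'abdcdc'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
    echo $candidate, ': ', $results[$candidate], "\n";
}

// In a replacement string, $1 and $2 refer to the captured groups.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');
echo $swapped, "\n";   // world hello
```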
Modifiers for Parentheses • PCRE Only – makes some things possible that otherwise couldn’t be done • Non-capturing grouping: (?: ) • Can simplify back-reference counting • Look-ahead Assertions: • They don’t advance the matching position • Positive: (?= ), or Negative: (?! ) • Very powerful, but not always easy to understand. Trial and error can be your friend!
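A sketch of the three constructs on small, invented inputs. Note how the look-ahead checks what follows without including it in the match.

```php
<?php
// Non-capturing group: (?:Mr|Ms) groups the alternatives but takes no
// capture slot, so the surname is still capture 1.
preg_match('/(?:Mr|Ms)\. (\w+)/', 'Ms. Smith', $title);

// Positive look-ahead: "foo" only when "bar" follows; "bar" is not consumed.
preg_match('/foo(?=bar)/', 'foobar', $positive);
$positiveMiss = preg_match('/foo(?=bar)/', 'foobaz');

// Negative look-ahead: "foo" only when "bar" does not follow.
$negativeHit  = preg_match('/foo(?!bar)/', 'foobaz');
$negativeMiss = preg_match('/foo(?!bar)/', 'foobar');

echo $title[1], "\n";      // Smith
echo $positive[0], "\n";   // foo
echo "$positiveMiss $negativeHit $negativeMiss\n";   // 0 1 0
```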
PCRE Specifics • www.php.net/pcre • preg_match, preg_match_all, preg_replace, preg_split, preg_grep (filter an array) • Perl RE’s have a delimiter, usually /, but can be anything: • preg_match("/foo/", $bar); • preg_match("%/usr/local/bin/%", $path);
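One call to each of the preg_ functions listed above, as a sketch. All inputs are invented; the % delimiter avoids having to escape the slashes in the path.

```php
<?php
$hasPath = preg_match('%/usr/local/bin/%', '/usr/local/bin/php');

// preg_match_all returns the number of matches and fills $all[0] with them.
$count = preg_match_all('/\d+/', 'a1 b22 c333', $all);

$collapsed = preg_replace('/\s+/', ' ', "too   many\tspaces");
$pieces    = preg_split('/[\s,]+/', 'a, b  c,d');

// preg_grep keeps the original array keys of the entries that match.
$numeric = preg_grep('/^\d+$/', ['1', 'x', '22']);

echo $hasPath, ' ', $count, "\n";   // 1 3
echo $collapsed, "\n";              // too many spaces
print_r($pieces);
print_r($numeric);
```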
POSIX Specifics • www.php.net/regex • ereg, ereg_replace, split, eregi, spliti, etc. • [Only] Advantage over PCRE: It doesn’t require the PCRE library to be installed, so it’s always there in any PHP installation • Other regex engines support this specification, though the Perl style seems to be more popular.
Almost there… • Intro to Strings in PHP • (Feel free to tell me how fast or slow to go) • Functions relating to HTML, SQL, etc. • Regular Expressions • PCRE • POSIX • Performance/Speed considerations • Grab bag of cool string functions
Performance/Speed • Rule of thumb: use the simplest function that will get the job done right • strpos instead of strstr when you only need to know whether a substring is present • str_replace instead of preg_replace • And so forth… • The PHP manual online usually includes notes about speed differences • PCRE is faster than POSIX Regex
Grab Bag • md5, md5_file – Calculate md5 hashes • Great for passwords in databases, etc. • levenshtein, similar_text – calculate the “similarity” of two strings • metaphone, soundex – calculate how similar two strings sound when spoken out loud • str_rot13 – Encryption algorithm • Protected by the DMCA
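A quick sketch of a few grab-bag functions. The sample words are the usual textbook examples, not taken from the talk.

```php
<?php
// Edit distance: substitutions, insertions, and deletions needed.
$distance = levenshtein('kitten', 'sitting');

// soundex maps similar-sounding names to the same four-character key.
$soundsAlike = (soundex('Robert') === soundex('Rupert'));

// rot13 shifts each letter 13 places, so applying it twice undoes it.
$rotated  = str_rot13('Hello');
$restored = str_rot13($rotated);

// md5 returns a 32-character hexadecimal digest.
$digestLength = strlen(md5('hello'));

echo $distance, "\n";       // 3
var_dump($soundsAlike);     // bool(true)
echo $rotated, "\n";        // Uryyb
echo $digestLength, "\n";   // 32
```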
Grab Bag 2 • str_shuffle – words are much more fun once they’ve been randomized • count_chars, str_word_count – statistics about your strings • strrev – if it doesn’t make sense forward, try it backwards
Grand Finale • Any questions?
Group Practice • 8.3 filenames - anything but zip files • /^.{0,8}(\.[^z][^i]?[^p]?)?$/i – fails filename.ftp • /^.{0,8}\.(?!zip).{0,3}$/i – PCRE only • Sometimes easier to match rejects rather than keepers • Apache access log example: • 4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] "GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0" • preg_match("/^(\d{1,3}(?:\.\d{1,3}){3}) ". #IP • "- - \[(.+)\] \"\w+ (\S+) (\S+)\" (\d+) (\d+) ". • "\"-\" \"([^\"]*)\"$/",$row,$matches);
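Both exercises as a runnable sketch. It assumes the look-ahead is written (?!zip) followed by up to three extension characters, that the inner IP group is the non-capturing (?: ) form, and that the quote inside the character class is part of the pattern; single-quoted PHP strings are used so the double quotes need no escaping. The filenames other than filename.ftp are invented.

```php
<?php
// Exercise 1: 8.3 filenames, anything but zip files.
$classBased = '/^.{0,8}(\.[^z][^i]?[^p]?)?$/i';   // character-class attempt
$lookAhead  = '/^.{0,8}\.(?!zip).{0,3}$/i';       // negative look-ahead attempt

$classOnFtp     = preg_match($classBased, 'filename.ftp');   // wrongly rejected
$lookOnFtp      = preg_match($lookAhead, 'filename.ftp');    // accepted
$lookOnZip      = preg_match($lookAhead, 'archive.zip');     // rejected
$lookOnUpperZip = preg_match($lookAhead, 'PHOTOS.ZIP');      // rejected because of /i

echo "$classOnFtp $lookOnFtp $lookOnZip $lookOnUpperZip\n";   // 0 1 0 0

// Exercise 2: pull the fields out of one Apache access log line.
$row = '4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] '
     . '"GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"';

$pattern = '/^(\d{1,3}(?:\.\d{1,3}){3}) '                    // IP address
         . '- - \[(.+)\] "\w+ (\S+) (\S+)" (\d+) (\d+) '      // time, path, protocol, status, bytes
         . '"-" "([^"]*)"$/';                                 // empty referer, user agent

if (preg_match($pattern, $row, $matches)) {
    echo "ip:     ", $matches[1], "\n";
    echo "time:   ", $matches[2], "\n";
    echo "path:   ", $matches[3], "\n";
    echo "status: ", $matches[5], "\n";
    echo "agent:  ", $matches[7], "\n";
}
```

Because the pattern expects a literal "-" referer, log lines that carry a real referer would need that part loosened before they match.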