Perl 6 Update - PGE and Pugs Dr. Patrick R. Michaud April 26, 2005
Rules and Grammars • Perl 6 completely redesigns the regular expression syntax • Regular expressions are now "rules" • Rules can call/embed other rules • Groups of rules can be combined into Grammars
Current events in Perl 6 • Parrot 0.1.2 released • The Perl Foundation receives $25,000 for completion of Parrot milestones • New Parrot pumpking - Chip Salzenberg • New version of Parrot Grammar Engine (PGE / Perl 6 rules) to be released this week • Pugs - Autrijus Tang • Perl 6 test suite
Pugs • Perl 6 compiler written in Haskell • Started by Autrijus Tang • Compiles directly to Haskell or to Parrot AST • Being used to develop Perl 6 tests and experiment with Perl 6 design • Available at http://pugscode.org • Discussion on perl6-compiler@perl.org mailing list
Perl 6 rules / Parrot Grammar Engine • The heart of the Perl 6 compiler is the Perl/Parrot Grammar Engine (PGE) • Implements the Perl 6 rules syntax, compiles to Parrot code • Perl 6 rules compiler currently written in C • Bootstrap to Perl 6
Steps to Perl 6 compiler • Finish PGE bootstrap in C • Parse p6 "rule" statements and grammars • Use p6 rules to define the Perl 6 grammar • P6 grammar can be used to generate Parrot abstract syntax trees from Perl 6 programs • Compile, (optimize), execute the abstract syntax tree to get working Perl 6 program • Use Perl 6 to rewrite the grammar engine in Perl 6 (faster)
Current state of PGE • Handles concatenation, alternation, quantifiers, captures*, subpatterns, subrules • Capture semantics redefined in Dec 2004, still not final • To be added next • Character classes (note: Unicode) • Patterns containing scalars, arrays, hashes
P6 rule syntax • Changes from perl 5 • No more trailing /e, /x, /s options • [...] denotes non-capturing groups • ^ and $ are beginning/end of string • ^^ and $$ are beginning/end of line • . matches any character, including newline • \n and \N match newline/non-newline • # marks a comment (to end of line) • Quantifiers are *, +, ?, and **{m..n}
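For readers who know Perl 5 style regexes better, the anchor and dot changes above can be sketched with Python's re module. This is only an analogy under the assumption that Python's flags are a fair stand-in; it is not Perl 6 code, and the sample text is made up.

```python
import re

text = "first line\nsecond line"

# Perl 6  ^ and $  anchor to the whole string, like \A and \Z in Python.
starts_string = re.search(r"\Asecond", text) is not None

# Perl 6  ^^ and $$  anchor to lines, like ^ and $ under re.MULTILINE.
starts_line = re.search(r"^second", text, re.MULTILINE) is not None

# Perl 6  .  also matches newline, like . under re.DOTALL.
dot_crosses_newline = re.fullmatch(r"first.*line", text, re.DOTALL) is not None

# Perl 6  \N  (non-newline) behaves like Python's default dot.
default_dot_crosses_newline = re.fullmatch(r"first.*line", text) is not None

# Perl 6  \d**{1..3}  is written \d{1,3} in Python.
short_number = re.fullmatch(r"\d{1,3}", "255") is not None
```

The point of the comparison is that Perl 6 makes the common meanings the defaults, where Perl 5 style engines need flags or special escapes.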
Character classes • [aeiou] changed to <[aeiou]> • [^0-9] now <-[0..9]> • Properties defined as • <alpha> • <digit> • <alnum> • Combine classes using +/- syntax: • <+<alpha>-[aeiou]>
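The +/- class arithmetic has no direct equivalent in Perl 5 style engines. A rough Python approximation, assuming that [^\W\d_] is close enough to <alpha> and that a negative lookahead can stand in for the subtraction, might look like this:

```python
import re

# <-[0..9]> : any character that is not an ASCII digit.
NON_DIGIT = re.compile(r"[^0-9]")

# <+<alpha>-[aeiou]> : letters, minus the lowercase vowels.
# [^\W\d_] is a common Python idiom that approximates <alpha>;
# the negative lookahead subtracts [aeiou] from it.
CONSONANT = re.compile(r"(?![aeiou])[^\W\d_]")

non_digits = NON_DIGIT.findall("a1b2")
consonants = CONSONANT.findall("perl6 rules")
```

The Perl 6 form states the set operation directly, which is why the slide's version is shorter than the workaround.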
Subrules • Patterns are now called "rules" • Analogous to subroutines and closures • Like {...}, /.../ compiles into a "rule" subroutine • P6 rule statement allows named rules: rule ident / [<alpha>|_] \w* /; • Named rules can be easily used in other rules: m / <ident> \:= (.*) /; rule expr / <term> [ <[+-]> <term> ]* /;
Interpolation • Variables no longer interpolate directly, so / $var / matches the contents of $var literally, even if it contains rule metacharacters (no need for \Q and \E) • To treat $var as a rule, use / <$var> / • Interpolated arrays match as an alternation: / @cmds / is equivalent to / [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
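The three interpolation behaviours above can be imitated in Python, where literal matching has to be requested explicitly. The variable names and sample values here are hypothetical, and the sketch is an analogy rather than Perl 6 semantics.

```python
import re

var = "a.b*"                      # contains regex metacharacters
cmds = ["get", "set", "delete"]   # hypothetical command list

# / $var / matches the contents literally; Python needs re.escape for that.
literal = re.compile(re.escape(var))

# / <$var> / treats the contents as a pattern; in Python, compile it unescaped.
as_pattern = re.compile(var)

# / @cmds / matches as an alternation of the array elements.
alternation = re.compile("|".join(re.escape(c) for c in cmds))
```

In Perl 6 the safe behaviour (literal match) is the default and the pattern behaviour is opt-in, which is the reverse of Perl 5.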
Interpolation, cont'd • Hashes match against the keys of the hash; what happens next depends on the value stored under the matched key • A closure is executed • A string or rule object is treated as a subrule • A value of 1 succeeds • Any other value fails • Useful for parsed languages: rule expr / <term> [ %infixop <expr> ]? /
< metasyntax > • The < ... > introduce various forms of metasyntax • A leading alphabetic character indicates a subrule or grammatical assertion <alpha> <expr> <before pattern> <after pattern> • A leading ! negates the match <!before pattern>
< metasyntax > • Leading ' matches a literal string <'match this exactly (whitespace matters)'> • Leading " matches an interpolated string <"match $THIS exactly (whitespace matters)"> • Leading '+' or '-' are character classes /<-[a..z]> <-<alpha>>/
< metacharacters > • Leading '(' indicates code assertion /(\d**{1..3}) <( $1 < 256 )>/ # (fail if $1 is not less than 256) • A $, @, or % indicates a variable subrule, where each value (or key) is a subrule to be matched <$myrule> <@cmds> <%commands>
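A code assertion mixes ordinary code into the match. In an engine without that feature, the check has to run after the match, as in this Python sketch. The function name is hypothetical, and unlike a real Perl 6 rule the sketch does not backtrack to try a shorter number when the check fails.

```python
import re

def match_octet(text):
    """Return the leading 1-3 digit number if it is below 256, else None."""
    m = re.match(r"(\d{1,3})", text)
    if m is None:
        return None
    # Stand-in for <( $1 < 256 )>: reject the match if the check fails.
    return m.group(1) if int(m.group(1)) < 256 else None
```

For example, match_octet("192.168") accepts "192", while match_octet("300") is rejected.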
A cool and somewhat scary example
    %cmd{'^\d+'} = { say "You entered a number" };
    %cmd{'^hello'} = { say "world" };
    %cmd{'^print \s (.*)'} = { say $1; };
    %cmd{'^exit'} = { exit() };
    while =$*IN {
        /<%cmd>/ || say "Unrecognized command";
    }
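The same idea, a table that maps patterns to actions, can be sketched in Python with a dict of regexes and callables. This is an analogy only: the names are hypothetical, the exit command is left out, patterns are tried in insertion order (an assumption, not something the slide specifies), and results are collected in a list instead of printed.

```python
import re

output = []  # collected instead of printed so the behaviour is easy to check

# Pattern -> action table, standing in for the %cmd hash.
commands = {
    r"^\d+": lambda m: output.append("You entered a number"),
    r"^hello": lambda m: output.append("world"),
    r"^print\s(.*)": lambda m: output.append(m.group(1)),
}

def dispatch(line):
    """Run the action of the first matching pattern; report failure otherwise."""
    for pattern, action in commands.items():
        m = re.search(pattern, line)
        if m:
            action(m)
            return True
    output.append("Unrecognized command")
    return False
```

In Perl 6 the loop disappears because the rule engine itself walks the hash and runs the closure.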
Backtracking control • A single colon prevents backtracking into the previous atom: m/ \( <expr> [ , <expr> ]* : \) / (if we don't find the closing paren, there is no point in trying to match fewer <expr>s) • Two colons commit to the current branch of an alternation: m:w/ [ if :: <expr> <block> | for :: <list> <block> | loop :: <loop_controls>? <block> ] / (once we've found "if", "for", or "loop", there is no point in trying the other branches of the alternation)
Backtracking control, cont'd • Three colons (:::) fail the current rule if backtracked over • The <commit> assertion, if backtracked over, fails the entire match (including any rules that called the current rule) • The <cut> assertion matches successfully, removes the matched portion of the string up to the <cut>, and if backtracked over fails the match entirely • Useful for throwing away successfully processed input when matching from an input stream • Like, say, when writing a compiler :-)
Backslash • \L, \U, \Q, \E, \A, \z gone from rules • \n and \N match newline/not newline • \s matches any Unicode space • backreferences are gone, use $1, $2, $3 (non-interpolated) • Perl 6 allows defining custom backslash sequences for use in rules
Closures • Anything in curlies is executed as a Perl 6 closure / (\w+) { say "Got $1"; } /
Capture semantics • Captures are different in Perl 6 • The result of a match is a "match object" • If a match succeeds, the match object has: • Boolean value true • Numeric value 1 (except for global matches) • String value the matched substring • Array component is matched subpatterns • Hash component is matched subrules
Subpattern captures • Part of a rule in parentheses is a subpattern • Each subpattern produces its own match object: in /Scooby (dooby) (doo)!/, (dooby) is $1 and (doo) is $2 • Quantified subpatterns produce arrays of match objects: in /Scooby (\w+ \s+)* (doo)!/, (\w+ \s+)* is $1 and (doo) is $2, and $1 is a (possibly empty) array of matches
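For contrast, a Perl 5 style engine keeps only the last repetition of a quantified group. The Python sketch below shows that behaviour and one way to recover the full list by hand; the helper name and sample text are hypothetical, and this imitates only the array-of-matches idea, not Perl 6 match objects.

```python
import re

text = "Scooby dooby dooby doo!"

# A quantified group in Python keeps only its last repetition.
m = re.fullmatch(r"Scooby (\w+\s+)*(doo)!", text)
last_only = m.group(1)

def repeated_captures(s):
    """Return (list of repeated words, final word), or None if no match."""
    # Capture the whole repeated region once, then split it with findall.
    whole = re.fullmatch(r"Scooby ((?:\w+\s+)*)(doo)!", s)
    if whole is None:
        return None
    return re.findall(r"\w+\s+", whole.group(1)), whole.group(2)
```

The Perl 6 design removes the need for the second pass, since $1 is already the array.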
Non-capturing groups • Brackets do not capture, so they don't result in a match object: in /Scooby [ (\w+ \s+)* (doo) ]!/, (\w+ \s+)* is $1 and (doo) is $2 • Quantified brackets replace nested subpatterns with the last component matched: in /Scooby [ (\w+ \s+)* (doo) ]+ !/, $1 and $2 come from the last repetition of the bracketed group
Nested capturing subpatterns • Each capturing subpattern introduces a new lexical scope, with nested captures inside the new match object: in /Scooby ( (\w+ \s+)* (doo) ) !/, the outer parentheses are $1, (\w+ \s+)* is $1[0], and (doo) is $1[1]
Alternations • Alternations introduce a new lexical scope, so subpattern numbering restarts for each alternative branch (unlike p5): in m/ Scooby (dooby)* (doo)! | Yabba (dabba)* (doo) /, (dooby)* and (dabba)* are both $1, and each (doo) is $2. This avoids lots of empty subpatterns when an alternation doesn't match.
Subrules • Subrules capture into a hash keyed by the name of the subrule: rule ident / [<alpha>|_] \w* /; rule num / \d+ /; m/ <ident> \:= <num> /; places match objects into $<ident> and $<num>
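Named groups in Python give a rough feel for how subrule captures are keyed by name. In this sketch the pattern fragments IDENT and NUM are hypothetical stand-ins for the named rules, composed by string interpolation; real Perl 6 subrules are callable rules, not pasted text.

```python
import re

# Hypothetical stand-ins for the named rules on the slide.
IDENT = r"[A-Za-z_]\w*"
NUM = r"\d+"

# m/ <ident> \:= <num> / built from the pieces above, with named groups
# playing the role of $<ident> and $<num>.
ASSIGN = re.compile(rf"(?P<ident>{IDENT})\s*:=\s*(?P<num>{NUM})")

m = ASSIGN.search("count := 42")
captures = m.groupdict()
```

Here captures plays the part of the hash component of the match object.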
Quantified subrules • Like subpatterns, quantified subrules produce arrays of matches m:w / dir <file>* / produces matches in $<file>[0], $<file>[1], etc. • Nested parens in a subrule capture to the subrule's match object
Named captures • Portions of a match can be captured directly into a match object without a subrule: m:w/ $<name> := \w+ , $<val> := \d+ / captures the first sequence of alphanumerics into $<name>, and the digits following the comma into $<val>.
Grammars • Rules can be packaged together into separate name spaces to form Grammars grammar Perl6 { rule ident { ... }; rule term { ... }; rule expr { ... }; }
:parsetree • The :parsetree flag to a rule causes the grammar engine to keep all information about a match. • Thus, one can do something like $parse = ($source ~~ Perl6::program); to get the entire parsetree for a program (including comments)