ReBug: A Regex Debugger
Michel Lambert, mlambert@mit.edu
Massachusetts Institute of Technology
http://perl.jall.org/rebug/
Perl Conference 5, Grande Ballroom B

The Basic Idea
• Three basic parts:
  • Instrument the regex (aka: debuggerizing)
  • Run the regex
  • Analyze the data returned

A New Feature
• Perl 5.6.0’s regex code-block construct: (?{ })
  • Perl-in-a-regex
  • Executed every time the engine reaches that point in the pattern
• To find out how far ‘through’ a regex we are, we can study the order in which the callbacks are called
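
As a minimal sketch of the idea above, the following snippet places two code blocks in a small pattern and records when each one fires. The `callback` sub, its arguments, and the `@trace` array are assumptions made for illustration; they are not taken from ReBug itself.

```perl
use strict;
use warnings;

our @trace;   # package variable, visible from inside the code blocks

# Hypothetical callback: records which code block fired and where the engine was.
sub callback { my ($label, $pos) = @_; push @trace, "$label at pos $pos" }

our $text    = "xab";
our $matched = $text =~ /a(?{ callback('after a', pos()) })b(?{ callback('after b', pos()) })/;

# The order of the entries shows how far through the pattern the engine got.
print "$_\n" for @trace;
```

Passing `pos()` from inside the block is one simple way to capture the engine's current position in the target string.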

Instrument the Regex
• Adding the callback tokens
  • Many tokens are needed to see the match progress
  • /a/ becomes:
    • /(?{callback()})a(?{callback()})/
  • /c*d/ becomes:
    • /(?:c(?{callback()}))*d(?{callback()})/
• Requires that we parse the regex entirely
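
To make the /c*d/ rewrite concrete, here is a small sketch that runs the instrumented form against a sample string. The arguments passed to `callback` and the `@calls` array are assumptions added here so that each call site can identify itself; the slide only shows a bare `callback()`.

```perl
use strict;
use warnings;

our @calls;   # [token label, position] pairs in firing order

# Hypothetical callback signature; the slide's version takes no arguments.
sub callback { push @calls, [@_] }

our $text = "ccd";

# /c*d/ rewritten so that every token is followed by a code block.
our $ok = $text =~ /(?:c(?{ callback('c', pos()) }))*d(?{ callback('d', pos()) })/;

printf "%-2s matched, engine now at %d\n", @$_ for @calls;
```

Because the code block sits inside the quantified group, it fires once per iteration of the `*`, which is what lets a debugger show each repetition separately.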

Parsing the Regex
• Regexes have a simple language
  • Linear token stream
  • Insert (?{callback()}) after each token
• A parenthesized expression is a ‘nested’ token
  • Parse it recursively, tokenizing the subexpression
• Regex::Tokenizer creates a stream of tokens
• Regex::Debuggerizer creates the instrumented regex

Regex::Tokenizer
• Regex ‘language’:
  • regex = item*
  • item = token quantifier?
  • token = char | char-class | nested token
  • quantifier = * + ? *? +? ?? {3} {3,} {3,5}
  • nested token = (?:a*) (?>b) (abc) (?!d)

Regex::Tokenizer
• quantifier:
  [*+?]?\?? | \{ \d+(?:,\d*)? \}
• nested token prefix:
  \?(?: [:=!>] | <[=!] ) | (?=[^?])
• matching parentheses (lazily evaluated regexes):
  $parens = qr{ \( (?: (?>(?:\\. | [^()] )+ ) | (??{ $parens }) )* \)}x;
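
The grammar and patterns above could be combined into a tokenizer along the following lines. This is only a sketch under stated assumptions: the `tokenize` sub, the `$quant` and `$class` patterns, and the returned hash layout are hypothetical and are not the actual Regex::Tokenizer interface; only `$parens` follows the slide.

```perl
use strict;
use warnings;

# Balanced parentheses, as on the slide; (??{ }) re-enters the pattern lazily.
our $parens;
$parens = qr{ \( (?: (?>(?:\\. | [^()] )+ ) | (??{ $parens }) )* \)}x;

# Assumed helper patterns: a required quantifier and a bracketed char class.
my $quant = qr/ (?: [*+?] | \{ \d+ (?: , \d* )? \} ) \?? /x;
my $class = qr/ \[ \^? \]? (?: \\. | [^\]\\] )* \] /x;

# Split a regex string into items: one token plus an optional quantifier.
sub tokenize {
    my ($re) = @_;
    my @items;
    while ($re =~ m/ \G ( $parens | $class | \\. | . ) ( $quant? ) /gcxs) {
        push @items, { token => $1, quantifier => $2 };
    }
    return @items;
}

printf "token=%-8s quantifier=%s\n", $_->{token}, $_->{quantifier}
    for tokenize('(?:c*)+d{3,5}?');
```

A nested token comes back as one item here; a recursive pass would strip its prefix and call `tokenize` again on the inner subexpression.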

Extracting Information
• Dependent on the debugger’s feature set
• Target string information
  • $`, $&, $', $1
  • Querying these variables during the regex match works perfectly fine
• Regex information
  • Current place in the regex, and the current token
  • No easy way to query this directly, but by encoding data during debuggerizing we can give the callback additional information about the state of the regex at that point
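
One way to picture "encoding data during debuggerizing" is to bake each token's index into the code block that follows it, so the callback knows where in the regex it was called from. The sketch below assumes a hand-written token list and a hypothetical `callback` signature; it also wraps each quantified token in a group, so the callback fires once per token rather than once per repetition.

```perl
use strict;
use warnings;
use re 'eval';   # needed because the pattern is built from a string at run time

our @states;     # one snapshot per callback invocation

# Hypothetical callback: $id was written into the pattern when it was built.
sub callback {
    my ($id, $pos, $target) = @_;
    push @states, { token => $id, pos => $pos, prefix => substr($target, 0, $pos) };
}

# Hand-written token list for /ab*c/; a real tokenizer would produce this.
our @tokens = ('a', 'b*', 'c');

# Append a code block carrying the token's index after every token.
our $instrumented = join '', map {
    "(?:$tokens[$_])(?{ callback($_, pos(), \$_) })"
} 0 .. $#tokens;

our $text = "xabbc";
our $ok   = $text =~ /$instrumented/;

printf "token %d (%s) done at pos %d, target so far: %s\n",
    $_->{token}, $tokens[ $_->{token} ], $_->{pos}, $_->{prefix} for @states;
```

Inside a code block `$_` refers to the target string, which is why it can be handed to the callback alongside `pos()`.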

Additional Features
• Step/Go forwards... and backwards
  • A regex machine carries less state than a running program
  • Can easily record a series of regex snapshots to allow freeform time travel through the match
• Should be independent of flaws in the regex
  • ‘Infinite loop’ regexes should be debuggable
  • By forking the program into a regex-matching backend and a responsive Tk frontend, IPC lets the two communicate during the regex match
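
The backend/frontend split can be sketched with a plain `fork` and `pipe`, with the child running the instrumented match and the parent reading its progress reports. This is an illustration only: it does not use IPC::Meiosis or IPC::Talk, the `report` sub and the line format are assumptions, and it presumes a platform where `fork` is available.

```perl
use strict;
use warnings;
use IO::Handle;

our $to_frontend;   # write end of the pipe, used only in the backend

# Hypothetical callback: sends one line per matched token to the frontend.
sub report { print {$to_frontend} join(' ', @_), "\n" }

pipe(my $from_backend, $to_frontend) or die "pipe failed: $!";
my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Backend: run the instrumented match and stream each step out.
    close $from_backend;
    $to_frontend->autoflush(1);
    my $text = "ccd";
    $text =~ /(?:c(?{ report('c', pos()) }))*d(?{ report('d', pos()) })/;
    close $to_frontend;
    exit 0;
}

# Frontend: stays in its own process and reads whatever the backend reports.
close $to_frontend;
our @events = <$from_backend>;
chomp @events;
close $from_backend;
waitpid($pid, 0);

print "backend reported: $_\n" for @events;
```

Because the match runs in a separate process, a frontend built this way could stay responsive, or kill the backend, even if the regex never finishes.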

How it Works
• The debugger engine waits for a ‘match-next-token’ request
• The frontend asks for new state data as needed and stores the retrieved data (real regex matches can’t be rewound)
• The user drives the debugger with VCR-style controls
• Colored text highlighting displays the data
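
The frontend's store-and-replay behaviour might look roughly like the sketch below. The backend is replaced by a stand-in list of states, and `fetch_next_state`, `step_forward`, and `step_backward` are hypothetical names chosen for this example rather than ReBug's own.

```perl
use strict;
use warnings;

# Stand-in for the backend: hands out one recorded state per request.
# In the design above this would be a 'match-next-token' request over IPC.
our @backend_states = map { +{ step => $_ } } 1 .. 3;
sub fetch_next_state { return shift @backend_states }

our @history;      # every state the frontend has retrieved so far
our $cursor = -1;  # index of the state currently displayed

# Move forwards, asking the backend only when we run past stored history.
sub step_forward {
    if ($cursor == $#history) {
        my $state = fetch_next_state() or return;   # match has finished
        push @history, $state;
    }
    return $history[++$cursor];
}

# Move backwards purely from stored history; the match itself is never rewound.
sub step_backward {
    return if $cursor <= 0;
    return $history[--$cursor];
}

step_forward() for 1 .. 2;
print "showing step ", step_backward()->{step}, "\n";
```

Keeping the history on the frontend side is what makes backwards stepping cheap, since the backend only ever moves forwards.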

Demonstration
• The fun part
• Files:
  • /rebug.pl – the simple wrapper around the modules
  • IPC::Meiosis – splits program into front/backend
  • IPC::Talk – communication interface
  • Regex::Tokenizer – tokenizes the regex
  • Regex::Debuggerizer – instruments regex with Regex::Tokenizer
  • Regex::Debugger::State – debugger’s state object
  • Regex::Debugger – backend: handles the regex match and state encapsulation
  • Regex::Interface – frontend: provides the Tk interface code and querying logic

Screenshots (Plan B)
• To be completed this week…