Explore the essential concepts of pattern matching using regular expressions, including Chomsky Hierarchy, regex syntax, recording construction, ambiguity, and more.
Pattern Matching on Strings using Regular Expressions
    Num = 0 | [1-9][0-9]*
    Email = [a-z]+ "@" [a-z]+ ("." [a-z]+ )*
Claus Brabrand [ brabrand@itu.dk ], IT University of Copenhagen
Jakob G. Thomsen [ gedefar@cs.au.dk ], Aarhus University
Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion
Introduction & Motivation
• Pattern matching is an indispensable problem
• Many applications need to "parse" dynamic input:
• 1) URLs (protocol, host, path, query-string; the query-string is a list of key-value pairs):
    http://first.dk/index.php?id=141&view=details
• 2) Log Files:
    13/02/2010 66.249.65.107 get /support.html
    20/02/2010 42.116.32.64 post /search.html
• 3) DBLP:
    <article>
      <title>Three Models for the...</title>
      <author>Noam Chomsky</author>
      <year>1956</year>
    </article>
The Chomsky Hierarchy (1956) • Language classes (+formalisms): • Type-3 regular expressions "enough" for: • URLs, log files, DBLP, ... • "Trade" (excess) expressivity for: • declarativity, simplicity, and static safety !
Type-0: java.net.URL • Turing-Complete programming (e.g., Java) • [ "unrestricted grammars" (e.g., rewriting systems) ] • Cyclomatic complexity (of official "java.net.URL"): • 88 bug reports on Sun's Bug Repository ! • Bug reports span more than a decade !
Type-1: Context-Sensitivity • Not widely used (or studied?) formalism • Presumably because: • Restricts expressivity w/o offering extra safety?
Type-2: Context-Free Grammars • Conceptually harder than regexps • Essentially (Type-3) Regular Expressions + recursion • The ultimate end-all scientific argument: • We d: (conjecture!) regexps 12 times more popular !
Type-?: Regexp Capture Groups
• Capturing groups (Perl, PHP, Java regex, ...):
• Syntax: (R) (i.e., in parentheses)
• Back-references:
• Syntax: \7 (i.e., "index of" capturing group)
• Beyond regularity !:
• (a*)b\1 is non-regular: \( \{ a^n b a^n \mid n \geq 0 \} \)
• In fact, not even context-free !!!:
• (.*).\1 is non-context-free: \( \{ \omega c \omega \mid c \in \Sigma,\ \omega \in \Sigma^* \} \)
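The back-reference example above can be tried out directly. The following is a minimal sketch using the standard java.util.regex API (not the authors' tool); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class BackRefDemo {
    // (a*)b\1 : group 1 captures a run of a's, and \1 must repeat that run verbatim.
    static final Pattern ANBAN = Pattern.compile("(a*)b\\1");

    /** True iff s has the form a^n b a^n for some n >= 0. */
    static boolean isAnBAn(String s) {
        return ANBAN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "aba", "aabaa", "aab", "abaa"}) {
            System.out.println(s + " -> " + isAnBAn(s));
        }
    }
}
```

Because the two runs of a's must have equal length, no finite automaton can recognise this language, which is why back-references fall outside Type-3.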
Type-?: Regexp Capture Groups
• Interpretation with back-tracking:
• NP-complete (exponential worst-case): :-(
• regexp \( a?^{n} a^{n} \) vs. string \( a^{n} \): 1 minute vs. 0.02 msecs (3,000,000:1) on strings of length 29 !!!
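To get a feel for the worst case described above, here is a small sketch that times java.util.regex on the pattern a?^n a^n against the string a^n for growing n. The class and helper names are hypothetical, and the timings depend on the JVM and regex engine version (some engines mitigate this particular case), so no numbers are claimed here.

```java
import java.util.regex.Pattern;

class BacktrackDemo {
    /** Repeats s exactly n times. */
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(s);
        return sb.toString();
    }

    /** Matches a^n against a?^n a^n and returns the elapsed time in nanoseconds. */
    static long timeMatchNanos(int n) {
        Pattern p = Pattern.compile(repeat("a?", n) + repeat("a", n));
        String input = repeat("a", n);
        long start = System.nanoTime();
        boolean ok = p.matcher(input).matches();
        long elapsed = System.nanoTime() - start;
        if (!ok) throw new IllegalStateException("a^n must match a?^n a^n");
        return elapsed;
    }

    public static void main(String[] args) {
        // n is kept small on purpose: a back-tracking matcher may need on the
        // order of 2^n steps before it finds the only successful split.
        for (int n = 10; n <= 20; n += 2) {
            System.out.println("n = " + n + ": " + timeMatchNanos(n) / 1000 + " microseconds");
        }
    }
}
```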
Type-3: Regular Expressions
• Closure properties: Union, Concatenation, Iteration, Restriction, Intersection, Complement, ...
• Decidability properties: ..., Containment: \( L(R) \subseteq L(R') \), Ambiguity, ...
• Simple ! Declarative ! Safe !
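Regular Expressions
• Syntax:
• Semantics: where:
• \( L_1 L_2 \) is concatenation (i.e., \( \{ \omega_1 \omega_2 \mid \omega_1 \in L_1,\ \omega_2 \in L_2 \} \))
• \( L^* = \bigcup_{i \geq 0} L^i \) where \( L^0 = \{ \varepsilon \} \) and \( L^i = L\, L^{i-1} \)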
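The two set-level definitions above (concatenation and iteration) can be sketched over finite sets of strings. This is only an illustration: the class name is made up, and every language is truncated to words of at most maxLen characters so that the star stays finite.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class LanguageSemantics {
    /** L1 L2 = { w1 w2 | w1 in L1, w2 in L2 }, truncated to words of length <= maxLen. */
    static Set<String> concat(Set<String> l1, Set<String> l2, int maxLen) {
        Set<String> out = new TreeSet<>();
        for (String w1 : l1)
            for (String w2 : l2)
                if (w1.length() + w2.length() <= maxLen) out.add(w1 + w2);
        return out;
    }

    /** L* = union of L^i for i >= 0, truncated to words of length <= maxLen. */
    static Set<String> star(Set<String> l, int maxLen) {
        Set<String> result = new TreeSet<>();
        Set<String> power = new TreeSet<>();
        power.add("");                          // L^0 = { epsilon }
        while (result.addAll(power)) {          // stop once a power adds no new word
            power = concat(l, power, maxLen);   // L^i = L L^(i-1)
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> l = new TreeSet<>(Arrays.asList("a", "bb"));
        System.out.println(star(l, 3));
    }
}
```

Stopping as soon as a power contributes nothing new is safe here, because every later power is a concatenation of L with words that are already in the result.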
Common Extensions (sugar)
• Any character (aka, dot): "." as \( c_1 | c_2 | \dots | c_n \), \( c_i \in \Sigma \)
• Character ranges: "[a-z]" as a|b|...|z
• One-or-more regexps: "R+" as RR*
• Optional regexp: "R?" as \( \varepsilon | R \)
• Various repetitions; e.g.: "R{2,3}" as RRR?
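The desugarings listed above can be sanity-checked by brute force. The sketch below uses java.util.regex (not the authors' tool) and compares a sugared pattern with its expansion on every string over {a, b} up to a given length; the class and method names are made up, and an empty alternative such as (|ab) stands in for ε|R.

```java
import java.util.regex.Pattern;

class SugarCheck {
    /** True if both patterns accept exactly the same strings over {a,b} of length <= maxLen. */
    static boolean sameLanguage(String sugared, String desugared, int maxLen) {
        return agree(Pattern.compile(sugared), Pattern.compile(desugared), "", maxLen);
    }

    // Depth-first enumeration of all strings over {a,b} that extend w.
    private static boolean agree(Pattern p, Pattern q, String w, int maxLen) {
        if (p.matcher(w).matches() != q.matcher(w).matches()) return false;
        if (w.length() == maxLen) return true;
        return agree(p, q, w + "a", maxLen) && agree(p, q, w + "b", maxLen);
    }

    public static void main(String[] args) {
        System.out.println(sameLanguage("(ab)+", "(ab)(ab)*", 6));          // R+ as RR*
        System.out.println(sameLanguage("(ab)?", "(|ab)", 6));              // R? as eps|R
        System.out.println(sameLanguage("(ab){2,3}", "(ab)(ab)(ab)?", 6));  // R{2,3} as RRR?
        System.out.println(sameLanguage("[a-b]", "a|b", 6));                // range as choice
    }
}
```

A bounded check like this is no proof of equivalence; it only confirms that the expansions agree on short inputs.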
Recording
• Syntax:
• "x" is a recording identifier
• (it "remembers" the substring it matches)
• Semantics:
• Example (simplified emails):
    <user= [a-z]+ > "@" <domain= [a-z]+ ("." [a-z]+)* >
• Matching against string:
    "obama@whitehouse.gov"
• yields:
    user = "obama" & domain = "whitehouse.gov"
• NB: cannot use DFAs / NFAs ! They give only recognition (yes / no), not how (i.e., "the structure")
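For comparison, the email example above can be approximated with named capturing groups in java.util.regex. This is a sketch of the same idea in a different system, not the authors' recording construct; the class name and the Map-based result are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailRecording {
    // <user= [a-z]+ > "@" <domain= [a-z]+ ("." [a-z]+)* > written with named groups.
    static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(?:\\.[a-z]+)*)");

    /** Returns the recorded substrings, or null if the input does not match. */
    static Map<String, String> match(String s) {
        Matcher m = EMAIL.matcher(s);
        if (!m.matches()) return null;
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("user", m.group("user"));
        rec.put("domain", m.group("domain"));
        return rec;
    }

    public static void main(String[] args) {
        // prints {user=obama, domain=whitehouse.gov}
        System.out.println(match("obama@whitehouse.gov"));
    }
}
```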
Recording (structured)
• Another example (with nested recordings):
    <date= <day= [0-9]{2} > "/" <month= [0-9]{2} > "/" <year= [0-9]{4} > >
• Matching against string:
    "26/06/1992"
• yields:
    date = 26/06/1992
    date.day = 26
    date.month = 06
    date.year = 1992
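The nested date example above can be imitated with nested named groups in java.util.regex. Note the difference this sketch makes visible: Java group names live in one flat namespace, so the hierarchy date.day, date.month, date.year is not reflected in the result. The class name and the array layout are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRecording {
    // Outer group "date" encloses the inner groups "day", "month" and "year".
    static final Pattern DATE = Pattern.compile(
        "(?<date>(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4}))");

    /** Returns { date, day, month, year }, or null if the input does not match. */
    static String[] match(String s) {
        Matcher m = DATE.matcher(s);
        if (!m.matches()) return null;
        return new String[] {
            m.group("date"), m.group("day"), m.group("month"), m.group("year")
        };
    }

    public static void main(String[] args) {
        String[] r = match("26/06/1992");
        System.out.println("date = " + r[0]);
        System.out.println("day = " + r[1] + ", month = " + r[2] + ", year = " + r[3]);
    }
}
```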
Recording (structured, lists)
• Yet another example (yielding lists):
    <name= [a-z]+ > " & " <name= [a-z]+ >
    ( <name= [a-z]+ > "\n" )*
    <name= [a-z]+ > (" & " <name= [a-z]+ > )*
• Matching against string:
    "obama & bush"
• yields a list structure:
    name = [obama,bush]
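List-valued recordings have no direct counterpart in java.util.regex, where a capturing group inside a repetition keeps only its last iteration. The sketch below contrasts that behaviour with a manual work-around (validate the whole string, then collect names with find()). The class name, method names, and the third name in the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameList {
    static final Pattern WHOLE = Pattern.compile("[a-z]+(?: & [a-z]+)*");
    static final Pattern NAME = Pattern.compile("[a-z]+");

    /** Emulates name = [ ... ]: all names in order, or null if s does not match. */
    static List<String> names(String s) {
        if (!WHOLE.matcher(s).matches()) return null;
        List<String> out = new ArrayList<>();
        Matcher m = NAME.matcher(s);
        while (m.find()) out.add(m.group());
        return out;
    }

    /** A capturing group under a star only remembers its final iteration. */
    static String lastOnly(String s) {
        Matcher m = Pattern.compile("([a-z]+)(?: & ([a-z]+))*").matcher(s);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(names("obama & bush"));             // prints [obama, bush]
        System.out.println(lastOnly("obama & bush & carter")); // prints carter
    }
}
```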
Ambiguity
• Definition:
• R ambiguous iff \( \exists\, T, T' \in AST_R : T \neq T' \wedge \|T\| = \|T'\| \)
• where \( \|\cdot\| : AST \to \Sigma^* \) (the flattening) is:
Characterization of Ambiguity • Theorem: • R unambiguous iff • NB: sound & complete ! • \( R^* = \varepsilon \mid R R^* \)
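Examples
• Ambiguous:
• a|a : \( L(a) \cap L(a) = \{ a \} \neq \emptyset \)
• a*a* : the overlap of \( L(a^*) \) and \( L(a^*) \) is \( \{ a^n \} \neq \emptyset \)
• Unambiguous:
• a|aa : \( L(a) \cap L(aa) = \emptyset \)
• a*ba* : the overlap of \( L(a^*) \) and \( L(ba^*) \) is \( \emptyset \)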
Ambiguity Examples
• a?b+|(ab)*
    *** ambiguous choice: a?b+ <-|-> (ab)*
    shortest ambiguous string: "ab"
• (a|ab)(ba|a)
    *** ambiguous concatenation: (a|ab) <--> (ba|a)
    shortest ambiguous string: "aba"
• (aa|aaa)*
    *** ambiguous star: (aa|aaa)*
    shortest ambiguous string: "aaaaa"
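The three shortest ambiguous strings above can be reproduced by brute force: count the distinct parse trees (ASTs) a regexp admits for a string, and search for the shortest string with at least two. This is a sketch for checking small examples only, not the authors' syntax-directed static analysis; all names are made up, and star assumes its operand does not match the empty string.

```java
import java.util.ArrayList;
import java.util.List;

class ParseCounter {
    /** A regexp node that can count its distinct parse trees for a string. */
    interface Re { long parses(String s); }

    static Re chr(char c) { return s -> (s.length() == 1 && s.charAt(0) == c) ? 1L : 0L; }
    static Re eps() { return s -> s.isEmpty() ? 1L : 0L; }
    static Re alt(Re l, Re r) { return s -> l.parses(s) + r.parses(s); }
    static Re cat(Re l, Re r) {
        return s -> {
            long n = 0;
            for (int i = 0; i <= s.length(); i++)   // every way to split s in two
                n += l.parses(s.substring(0, i)) * r.parses(s.substring(i));
            return n;
        };
    }
    static Re star(Re r) {
        return new Re() {
            @Override public long parses(String s) {
                if (s.isEmpty()) return 1L;          // zero iterations
                long n = 0;
                for (int i = 1; i <= s.length(); i++) // non-empty first iteration
                    n += r.parses(s.substring(0, i)) * parses(s.substring(i));
                return n;
            }
        };
    }
    static Re opt(Re r) { return alt(eps(), r); }
    static Re plus(Re r) { return cat(r, star(r)); }
    static Re str(String w) {                        // literal word, w must be non-empty
        Re r = chr(w.charAt(0));
        for (int i = 1; i < w.length(); i++) r = cat(r, chr(w.charAt(i)));
        return r;
    }

    /** Shortest (then alphabetically first) string with >= 2 parses, or null if none up to maxLen. */
    static String shortestAmbiguous(Re r, String alphabet, int maxLen) {
        List<String> level = new ArrayList<>();
        level.add("");
        for (int len = 0; len <= maxLen; len++) {
            List<String> next = new ArrayList<>();
            for (String w : level) {
                if (r.parses(w) >= 2) return w;
                for (char c : alphabet.toCharArray()) next.add(w + c);
            }
            level = next;
        }
        return null;
    }

    public static void main(String[] args) {
        Re choice = alt(cat(opt(chr('a')), plus(chr('b'))), star(str("ab")));
        Re concat = cat(alt(chr('a'), str("ab")), alt(str("ba"), chr('a')));
        Re iter = star(alt(str("aa"), str("aaa")));
        System.out.println(shortestAmbiguous(choice, "ab", 6)); // prints ab
        System.out.println(shortestAmbiguous(concat, "ab", 6)); // prints aba
        System.out.println(shortestAmbiguous(iter, "ab", 6));   // prints aaaaa
    }
}
```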
Disambiguation
• 1) Manual rewriting: Always possible :-) Tedious :-( Error-prone :-( Not structure-preserving :-(
• 2) Restriction: R1 - R2. And then encode...: \( R^C \) as \( \Sigma^* - R \); R1 & R2 as \( (R_1^C | R_2^C)^C \)
• 3) Disambiguators: From characterization: concat: 'L', 'R'; choice: '|L', '|R'; star: '*L', '*R' (partial-order on ASTs)
• 4) Default disambiguation: concat, choice, and star are all left-biased (by default) ! (Our tool does this)
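Two of the points above lend themselves to a quick experiment. The sketch below models languages as membership predicates to show the encoding of complement and intersection in terms of restriction, and it shows how java.util.regex resolves the ambiguous (a|ab)(ba|a) by trying alternatives left to right. That ordered choice is analogous to, but not necessarily identical with, the left-bias of the authors' tool; all class and method names are hypothetical.

```java
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DisambiguationDemo {
    /** Membership test for a regexp, via java.util.regex. */
    static Predicate<String> lang(String regex) {
        Pattern p = Pattern.compile(regex);
        return s -> p.matcher(s).matches();
    }

    /** Restriction R1 - R2: strings in R1 that are not in R2. */
    static Predicate<String> minus(Predicate<String> r1, Predicate<String> r2) {
        return s -> r1.test(s) && !r2.test(s);
    }

    /** R^C encoded as Sigma* - R. */
    static Predicate<String> complement(Predicate<String> r) {
        Predicate<String> sigmaStar = s -> true;
        return minus(sigmaStar, r);
    }

    /** R1 & R2 encoded as (R1^C | R2^C)^C. */
    static Predicate<String> intersect(Predicate<String> r1, Predicate<String> r2) {
        return complement(complement(r1).or(complement(r2)));
    }

    /** How java.util.regex splits s between the two groups of (a|ab)(ba|a). */
    static String[] orderedSplit(String s) {
        Matcher m = Pattern.compile("(a|ab)(ba|a)").matcher(s);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        Predicate<String> evenLength = lang("(..)*");
        Predicate<String> onlyAs = lang("a*");
        System.out.println(intersect(evenLength, onlyAs).test("aaaa")); // prints true
        System.out.println(intersect(evenLength, onlyAs).test("aaa"));  // prints false
        String[] parts = orderedSplit("aba");
        System.out.println(parts[0] + " | " + parts[1]);                // prints a | ba
    }
}
```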
Type Inference • Type Inference: • R:(L,S)
Examples (Type Inference)
• Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
• compile (our tool):
    class Person { // auto-generated
        String name;
        int age;
        static Person match(String s) { ... }
        public String toString() { ... }
    }
• Usage:
    String s = "obama (48)";
    Person p = Person.match(s);
    print(p.name + " is " + p.age + "y old");
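The bodies of the auto-generated methods are elided on the slide. As a rough idea of what an equivalent hand-written class could look like, here is a sketch built on java.util.regex named groups. It is an assumption about the shape of such a class, not the code the tool generates, and the name PersonSketch is made up to avoid suggesting otherwise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonSketch {
    final String name;
    final int age;   // [0-9]+ is inferred as a number on the slide; int is assumed here

    // <name = [a-z]+ > " (" <age = [0-9]+ > ")" with named groups.
    private static final Pattern P =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    /** Returns the parsed person, or null if s does not match the pattern. */
    static PersonSketch match(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) return null;
        // Integer.parseInt throws NumberFormatException for values beyond the int range.
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = PersonSketch.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // prints obama is 48y old
    }
}
```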
Examples (Type Inference)
• Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    People = ( $Person "\n" )*
• compile (our tool):
    class People { // auto-generated
        String[] name;
        int[] age;
        static People match(String s) { ... }
        public String toString() { ... }
    }
• Usage:
    String s = "obama (48)\nbush (63)\n";
    People p = People.match(s);
    println("Second name is " + p.name[1]);
Examples (Type Inference)
• Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    People = ( <person= $Person > "\n" )* ;
• compile (our tool):
    class People { // auto-generated
        Person[] person;
        class Person { // nested class
            String name;
            int age;
        }
        ...
    }
• Usage:
    String s = "obama (48)\nbush (63)\n";
    People people = People.match(s);
    for (People.Person p : people.person) println(p.name);
URLs
• URLs (protocol, host, path, query-string):
    "http://www.google.com/search?q=record&hl=en"
• Regexp:
    Host = <host = [a-z]+ ("." [a-z]+ )* > ;
    Path = <path = [a-z/.]* > ;
    Query = <query= [a-z&=]* > ;
    URL = "http://" $Host "/" $Path "?" $Query ;
• Query string further structured (list of key-value pairs):
    KeyVal = <key= [a-z]* >"="<val= [a-z]* > ;
    Query = $KeyVal ("&" $KeyVal)* ;
URLs (Usage Example)
• Regexp:
    Host = <host = [a-z]+ ("." [a-z]+ )* > ;
    Path = <path = [a-z/.]* > ;
    KeyVal = <key= [a-z]* >"="<val= [a-z]* > ;
    Query = $KeyVal ("&" $KeyVal)* ;
    URL = "http://" $Host "/" $Path "?" $Query ;
• Usage (example):
    String s = "http://www.google.com/search?q=record";
    URL url = URL.match(s);
    print("Host is: " + url.host);
    if (url.key.length>0) print("1st key: " + url.key[0]);
    for (String val : url.val) println("value = " + val);
Log Files
• Format:
    13/02/2010 66.249.65.107 /support.html
    20/02/2010 42.116.32.64 /search.html
    ...
• Regexp:
    Date = <date= <day= $Day > "/" <month= $Month > "/" <year= [0-9]{4} >>;
    IP = <ip= [0-9]{1,3} ("." [0-9]{1,3} ){3} >;
    Entry = <entry= $Date " " $IP " " $Path "\n">;
    Log = $Entry * ;
• Usage:
    Log log = Log.match(log_file);
    for (Entry e : log.entry)
        if (e.date.month == 02 && e.date.day == 29)
            print("Access on LEAP YEAR from IP# " + e.ip);
Log Files (cont'd, ambiguity)
• Assume we forgot "/" (between day & month):
• Regexp:
    Day = 0?[1-9] | [1-2][0-9] | 30 | 31 ;
    Month = 0?[1-9] | 10 | 11 | 12 ;
    Date = <date= <day=$Day> // no slash !
                  <month=$Month> "/" <year= [0-9]{4} > > ;
• Error:
    *** ambiguous concatenation: <day> <--> <month>
    shortest ambiguous string: "101"
• Ambiguity: i.e. "1/01" (January 1) vs. "10/1" (January 10) :-)
DBLP (Format)
• DBLP (XML) Format:
    <article>
      <author>Noam Chomsky</author>
      <title>Three Models for the Description of Language</title>
      <year>1956</year>
      <journal>IRE Transactions on Information Theory</journal>
    </article>
    <article>
      <author>Claus Brabrand</author>
      <author>Jakob G Thomsen</author>
      <title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title>
      <year>2010</year>
      <note>Submitted</note>
    </article>
    ...
DBLP (Regexp)
• DBLP Regexp:
    Author = "<author>" <author= [a-z]* > "</author>" ;
    Title = "<title>" <title= [a-z]* > "</title>" ;
    Article = "<article>" $Author* $Title .* "</article>" ;
    DBLP = <pub= $Article > * ;
• Ambiguity !:
    *** ambiguous star: <pub>*
    shortest ambiguous string: "<article><title></title></article><article><title></title></article>"
• EITHER 2 publications (.* = "")
• OR 1 publication (.* = "</article><article><title></title>", the part shown in gray on the slide) !!!
DBLP (Disambiguated)
• DBLP Regexp:
    Author = "<author>" <author= [a-z]* > "</author>" ;
    Title = "<title>" <title= [a-z]* > "</title>" ;
    Article = "<article>" $Author* $Title .* "</article>" ;
    DBLP = <pub= $Article > * ;
• Disambiguated (using "(R1-R2)"):
    Article = "<article>" $Author* $Title (.*-(.* "</article>" .*)) "</article>" ;
• Unambiguous! :-)
DBLP (Usage Example)
• DBLP Regexp:
    Author = "<author>" <author= [a-z]* > "</author>" ;
    Title = "<title>" <title= [a-z]* > "</title>" ;
    Article = "<article>" $Author* $Title .* "</article>" ;
    DBLP = <article= $Article > * ;
• Usage (example):
    DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
    for (Article a : dblp.article) print("Title: " + a.title);
Evaluation • Evaluation summary: • Also, (Type-3) regexps expressive "enough" • for: URLs, Log files, DBLP, ... [ Frisch&Cardelli'04 ] [ NP-Complete ] [ MatMult ]
Type-3 vs. Type-0 (URLs) • Regexps vs. Java: Regexps are 8 times more concise !
java.util.regex vs. Our approach • Efficiency (on DBLP): • java.util.regex: • Exponential \( O(2^{|\omega|}) \): 2,500 chars in 2 mins ! • In contrast; ours: • Linear (on DBLP): 1,200,000 chars in 6 secs ! • 2 mins vs. 10 msecs
Related Work • Recording (with lists in general): • "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP • Ambiguity: • [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed) • Disambiguation: • [Vansummeren'06], but with global, not local disambiguation • Type inference: • Exact type inference in XDuce & CDuce (soundness+completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)
Conclusion
• For string pattern matching, it is possible to: "trade (excess) expressivity for safety+simplicity"
• i.e., ambiguity checking and type inference !
• + stand-alone & non-intrusive language integration (Java) !
• In conclusion: We conclude that if regular expressions are sufficiently expressive, they provide a simple, declarative, and safe means for pattern matching on strings, capable of extracting highly structural information in a statically type-safe and unambiguous manner.
</Talk> [ http://www.cs.au.dk/~gedefar/reg-exp-rec/ ] Questions ? Complaints ?