Typed and Unambiguous Pattern Matching on Strings using Regular Expressions. Claus Brabrand (ITU Copenhagen) & Jakob G. Thomsen (Aarhus University). DANSAS 2010 (In proc. of PPDP 2010). [http://xkcd.com/208/]
Main Message
For regular expressions:
• Pattern matching
• Precise syntax-directed ambiguity analysis
• Typed mapping into a target language
Introduction & Motivation
• Parsing dynamic input is a ubiquitous problem
• URLs: http://www.cs.au.dk/index.php?id=141&view=details
  (protocol, host, path, query-string; the query-string is a list of key-value pairs)
• Log files:
  13/02/2010 66.249.65.107 get /support.html
  20/02/2010 42.116.32.64 post /search.html
• The solution is pattern matching
Example
• Example (date): <day = [0-9]{1,2}> "/" <month = [0-9]{1,2}> "/" <year = [0-9]{4}>
  (the underlying regular expression is [0-9]{1,2} "/" [0-9]{1,2} "/" [0-9]{4})
• Matching against string: "26/06/1992"
• yields: day = 26, month = 06, year = 1992
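The date example above can be approximated with Java's own regex library. This is only a sketch: it uses named capturing groups of java.util.regex as a stand-in for recordings, not the authors' tool, and the class name DateRecordingDemo is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRecordingDemo {
    // Named groups play the role of the recordings <day=...>, <month=...>, <year=...>.
    static final Pattern DATE = Pattern.compile(
        "(?<day>[0-9]{1,2})/(?<month>[0-9]{1,2})/(?<year>[0-9]{4})");

    // Returns {day, month, year} as strings, or null if the whole input does not match.
    static String[] match(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("day"), m.group("month"), m.group("year") };
    }

    public static void main(String[] args) {
        String[] r = match("26/06/1992");
        // prints "day = 26, month = 06, year = 1992"
        System.out.println("day = " + r[0] + ", month = " + r[1] + ", year = " + r[2]);
    }
}
```

Note that every recorded value comes back as a String here, and a misspelled group name is only detected at run time.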
Example
• Example (date), without separators: <day = [0-9]{1,2}> <month = [0-9]{1,2}> <year = [0-9]{4}>
• String "2082010" can be matched in two ways:
  • day = 2 and month = 08 (i.e., 2nd of August)
  • day = 20 and month = 8 (i.e., 20th of August)
• Compare with separators: <day = [0-9]{1,2}> "/" <month = [0-9]{1,2}> "/" <year = [0-9]{4}>
Why regular expressions?
• Expressive (enough)
• Declarative
• Decidable properties
• Well known
Outline
• Our setup
• Regular Expressions:
  • The Recording Construction
• Ambiguity:
  • Disambiguation
• Type Mapping
• Conclusion
Our setup
• url.rex: <URL = [a-z]*>; ...
• Compile (our tool): url.rex → URL.java
• Foo.java: import URL; class Foo { ... }
• Compile (javac): URL.java, Foo.java, ... → URL.class, Foo.class, ...
Regular Expressions
• Syntax: [grammar not preserved]
• Semantics: [definition not preserved], where:
  • L1 L2 is concatenation (i.e., { ω1 ω2 | ω1 ∈ L1, ω2 ∈ L2 })
  • L* = ∪(i≥0) L^i, where L^0 = { ε } and L^i = L L^(i-1)
• Usual extensions:
  • Any character "." as c1|c2|...|cn, ci ∈ Σ
  • Character ranges "[a-z]" as a|b|...|z
  • Repetitions "R{2,3}" as RR|RRR
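The definitions of concatenation and star above can be made concrete for finite languages. The following is a minimal sketch, assuming languages are represented as sets of strings; since L* is infinite in general, it only computes the union of L^0 through L^k. The names StarDemo, concat, and starUpTo are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

class StarDemo {
    // L1 L2 = { w1 w2 | w1 in L1, w2 in L2 }
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> result = new TreeSet<>();
        for (String w1 : l1) {
            for (String w2 : l2) {
                result.add(w1 + w2);
            }
        }
        return result;
    }

    // Union of L^0 .. L^k, with L^0 = { "" } and L^i = L L^(i-1).
    static Set<String> starUpTo(Set<String> l, int k) {
        Set<String> power = new TreeSet<>();
        power.add("");                       // L^0 contains only the empty string
        Set<String> result = new TreeSet<>(power);
        for (int i = 1; i <= k; i++) {
            power = concat(l, power);        // L^i = L L^(i-1)
            result.addAll(power);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> l = new TreeSet<>();
        l.add("ab");
        l.add("c");
        // prints "[, ab, abab, abc, c, cab, cc]"
        System.out.println(starUpTo(l, 2));
    }
}
```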
Recording
• Syntax: <x = R>, where "x" is a recording identifier (it "remembers" the substring it matches)
• Semantics: [definition not preserved]
• Example (simplified emails): <user = [a-z]+> "@" <domain = [a-z]+ ("." [a-z]+)*>
• Matching against string "obama@whitehouse.gov" yields: user = "obama" & domain = "whitehouse.gov"
• Related: "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP
Ambiguity
• Example from before: <day = [0-9]{1,2}> <month = [0-9]{1,2}>
• Matched on the string "208", it gives rise to:
  • day = 2 and month = 08 (i.e., 2nd of August)
  • day = 20 and month = 8 (i.e., 20th of August)
• Multiple ways of matching => ambiguous
• Problem: concatenation
[Figure: the string 2 0 8 split between day and month]
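The two readings of "208" can be reproduced by brute force. This sketch simply tries every cut point and keeps those where both halves match [0-9]{1,2}; it illustrates why the concatenation is ambiguous and is not the syntax-directed analysis of the paper. The class name SplitAmbiguityDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAmbiguityDemo {
    // Lists every way to cut the input into day + month so that
    // both parts match [0-9]{1,2}.
    static List<String> splits(String input) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i <= input.length(); i++) {
            String day = input.substring(0, i);
            String month = input.substring(i);
            if (day.matches("[0-9]{1,2}") && month.matches("[0-9]{1,2}")) {
                result.add("day = " + day + ", month = " + month);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "day = 2, month = 08" and then "day = 20, month = 8"
        for (String s : splits("208")) {
            System.out.println(s);
        }
    }
}
```

More than one result for a single input is exactly the "multiple ways of matching" situation; an input such as "2608" yields only one split.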
Ambiguity analysis (NB: sound & complete!)
• Theorem: R unambiguous iff [conditions not preserved]
• Related work:
  • [Brabrand+Giegerich+Møller'09]: similar approach for context-free grammars.
  • [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFA, not directly (syntax-directed).
Disambiguation
Running example: <foo = a> | <bar = a*>
1) Manual rewriting: Always possible :-) Tedious :-( Error-prone :-( Not structure-preserving :-(
   The example is rewritten to <foo = a> | <bar = ε|aaa*>
2) Restriction: R1 - R2, with L(R1 - R2) = L(R1) \ L(R2)
   Using restriction we get <foo = a> | <bar = a*-a>
3) Disambiguators: three basic operators
   choice: '|L', '|R'
   concat: 'L', 'R'
   star: '*L', '*R'
   The example becomes <foo = a> |L <bar = a*>
4) Default disambiguation: concat, choice, and star are all left-biased (by default)! (Our tool does this)
   <foo = a> | <bar = a*>: no need to rewrite
• Related work: [Vansummeren'06], but with global, not local disambiguation
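Two of the options above can be tried out with plain java.util.regex. The sketch below is an analogy only: Java's backtracking matcher tries alternatives left to right, which behaves like a left-biased choice on this example, and restriction is emulated by two separate membership tests because Java regexes have no difference operator. All names (DisambiguationDemo, winner, inRestriction, inRewritten) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DisambiguationDemo {
    // Reports which of the named groups foo / bar took part in a full match.
    // Java tries the left alternative first, similar to a left-biased '|'.
    static String winner(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return "none";
        }
        return m.group("foo") != null ? "foo" : "bar";
    }

    // Membership in L(a* - a) = L(a*) \ L(a).
    static boolean inRestriction(String s) {
        return s.matches("a*") && !s.matches("a");
    }

    // Membership in the manually rewritten language: empty string or aaa*.
    static boolean inRewritten(String s) {
        return s.isEmpty() || s.matches("aaa*");
    }

    public static void main(String[] args) {
        System.out.println(winner("(?<foo>a)|(?<bar>a*)", "a")); // prints "foo"
        System.out.println(winner("(?<bar>a*)|(?<foo>a)", "a")); // prints "bar"
        String s = "";
        for (int n = 0; n <= 5; n++) {
            // The restriction and the manual rewriting agree on a^n.
            System.out.println(n + ": " + (inRestriction(s) == inRewritten(s)));
            s += "a";
        }
    }
}
```

Swapping the order of the alternatives flips the result for the input "a", which is the kind of local choice the '|L' and '|R' operators make explicit.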
Type Mapping
• Our date example: <day = [0-9]{2}> "/" <month = [0-9]{2}> "/" <year = [0-9]{4}>
• Type of the recordings day, month, and year?
  • Strings (=> many type casts)
  • Infer the type
Type Mapping
• A recording has three type components:
  • a linguistic type (language of the recording; maps to String, int, float, etc.)
  • a structural type (nested recordings; maps to (nested) classes)
  • a type modifier (maps to lists)
• Related work: exact type inference in XDuce & CDuce (soundness + completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)
Type Mapping
• Example: Person = <name = [a-z]+> " (" <age = [0-9]+> ")"
• compile (our tool) yields:
    class Person { // auto-generated
      String name;
      int age;
      static Person match(String s) { ... }
      public String toString() { ... }
    }
• Usage:
    String s = "obama (48)";
    Person p = Person.match(s);
    System.out.println(p.name + " is " + p.age + "y old");
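To give a feel for what a class of this shape does, here is a hand-written approximation. It is not the tool's generated code: the body of match is an assumption built on java.util.regex, the toString format is invented, and the class is named PersonSketch to keep it apart from the generated Person.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonSketch {
    // Hand-written counterpart of <name = [a-z]+> " (" <age = [0-9]+> ")".
    private static final Pattern PATTERN =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    String name; // linguistic type [a-z]+ mapped to String
    int age;     // linguistic type [0-9]+ mapped to int

    // Returns a populated object, or null if the input does not match.
    static PersonSketch match(String s) {
        Matcher m = PATTERN.matcher(s);
        if (!m.matches()) {
            return null;
        }
        PersonSketch p = new PersonSketch();
        p.name = m.group("name");
        // Assumption: the digit string fits in an int; very long inputs
        // would make parseInt throw NumberFormatException.
        p.age = Integer.parseInt(m.group("age"));
        return p;
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // prints "obama is 48y old"
    }
}
```

The point of the typed mapping is visible in the usage: age is already an int, so client code needs no casts or parsing.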
Conclusion
Regular expressions are alive and well.
This paper:
• Used for pattern matching
• Precise ambiguity analysis
• Type mapping
Future work: improve performance, subtyping of recordings
"trade (excess) expressivity for safety + simplicity"
Thank you. Questions?
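Ambiguity
• Definition: R ambiguous iff ∃ T, T' ∈ AST_R: T ≠ T' ∧ ||T|| = ||T'||
• where ||·||: AST → Σ* (the flattening) is: [definition not preserved]
[Figure: two distinct syntax trees T and T' for R with the same flattening]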
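The definition can be illustrated with a toy tree type. This is a sketch under assumptions: the slide's own definition of the flattening is not available here, so flatten is taken to mean "concatenate the leaves from left to right", and the tree classes (Leaf, Concat) are hypothetical simplifications of the ASTs of a regular expression.

```java
class FlattenDemo {
    interface Tree {
        String flatten(); // assumed flattening: concatenation of the leaves
        String show();    // structural rendering, used to compare tree shapes
    }

    static final class Leaf implements Tree {
        final String text;
        Leaf(String text) { this.text = text; }
        public String flatten() { return text; }
        public String show() { return "\"" + text + "\""; }
    }

    static final class Concat implements Tree {
        final Tree left;
        final Tree right;
        Concat(Tree left, Tree right) { this.left = left; this.right = right; }
        public String flatten() { return left.flatten() + right.flatten(); }
        public String show() { return "(" + left.show() + " . " + right.show() + ")"; }
    }

    // True if the two trees differ structurally but flatten to the same string.
    // Comparing renderings is adequate for this demo's digit-only leaves.
    static boolean witnessesAmbiguity(Tree t1, Tree t2) {
        return !t1.show().equals(t2.show()) && t1.flatten().equals(t2.flatten());
    }

    public static void main(String[] args) {
        Tree t1 = new Concat(new Leaf("2"), new Leaf("08"));  // day = 2,  month = 08
        Tree t2 = new Concat(new Leaf("20"), new Leaf("8"));  // day = 20, month = 8
        System.out.println(t1.flatten());                     // prints "208"
        System.out.println(witnessesAmbiguity(t1, t2));       // prints "true"
    }
}
```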
Characterization of Ambiguity (NB: sound & complete!)
• Theorem: R unambiguous iff [conditions not preserved]
• R* = ε | RR*
Type Inference
• Type inference: R : (L, S)