Typed and Unambiguous Pattern Matching on Strings using Regular Expressions. Claus Brabrand (ITU Copenhagen) & Jakob G. Thomsen (Aarhus University). PPDP 2010. [http://xkcd.com/208/]
Introduction & Motivation
• Parsing dynamic input is a ubiquitous problem
• URLs: http://www.cs.au.dk/index.php?id=141&view=details
  (protocol, host, path, query-string (list of key-value pairs))
• Log files:
  13/02/2010 66.249.65.107 get /support.html
  20/02/2010 42.116.32.64 post /search.html
• The solution is pattern matching
Motivating example
• Example: <day = [0-9]{2}> "/" <month = [0-9]{2}> "/" <year = [0-9]{4}>
• Matching against string: "26/06/1992"
• yields: day = 26, month = 06, year = 1992
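As a rough point of comparison, the date example can be approximated with named capturing groups in `java.util.regex`. This is only a sketch in plain Java, not the authors' tool or its generated code; the class name `DateRecording` and the method `match` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: named groups stand in for the recordings
// <day=...>, <month=...>, and <year=...> from the slide.
final class DateRecording {
    private static final Pattern DATE = Pattern.compile(
        "(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4})");

    // Returns the recorded substrings, or null if the whole input does not match.
    static Map<String, String> match(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> recordings = new LinkedHashMap<>();
        recordings.put("day", m.group("day"));
        recordings.put("month", m.group("month"));
        recordings.put("year", m.group("year"));
        return recordings;
    }

    public static void main(String[] args) {
        System.out.println(match("26/06/1992")); // {day=26, month=06, year=1992}
    }
}
```

Note that every recording comes back as a `String` here, which is the "many type casts" situation the type mapping later in the talk is meant to avoid.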
Our setup
• url.rex (<URL = [a-z]*>; ...) => Compile (our tool) => URL.java
• URL.java + Foo.java (import URL; class Foo { ... }) => Compile (javac) => URL.class, Foo.class, ...
Outline
• The Chomsky Hierarchy (1956)
• Regular Expressions:
• The Recording Construction
• Ambiguity:
• Disambiguation
• Type Mapping
• Conclusion
The Chomsky Hierarchy (1956)
• Language classes (+ formalisms):
  • No static guarantees. Example: java.net.URL has had 88 bugs spanning a decade, and the source code still contains a //fixme
  • Conceptually harder than regular expressions (regular expressions plus recursion). Not widely used.
  • Simple, declarative, and with decidable properties (containment, ambiguity, etc.). Oldie but goodie.
• Type-3 regular expressions are "enough" for: URLs, log files, ...
• "Trade" (excess) expressivity for: declarativity, simplicity, and static safety!
Regular Expressions
• Syntax:
• Semantics: where:
  • $L_1 L_2$ is concatenation (i.e., $\{ \omega_1 \omega_2 \mid \omega_1 \in L_1, \omega_2 \in L_2 \}$)
  • $L^* = \bigcup_{i \geq 0} L^i$ where $L^0 = \{ \epsilon \}$ and $L^i = L L^{i-1}$
• Usual extensions:
  • Any character "." as c1|c2|...|cn, $c_i \in \Sigma$
  • Character ranges "[a-z]" as a|b|...|z
  • Repetitions "R{2,3}" as RR|RRR
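The "usual extensions" are pure shorthand, which can be spot-checked with `java.util.regex`. The sketch below compares a sugared and a desugared pattern on a handful of sample strings; it is a sanity check on samples, not an equivalence proof, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: checks that two patterns accept the same sample strings.
final class RepetitionDesugaring {
    // True if both patterns agree (accept or reject) on every sample.
    static boolean agreeOn(String sugared, String desugared, List<String> samples) {
        Pattern p = Pattern.compile(sugared);
        Pattern q = Pattern.compile(desugared);
        for (String s : samples) {
            if (p.matcher(s).matches() != q.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // R{2,3} as RR|RRR, with R = a
        System.out.println(agreeOn("a{2,3}", "aa|aaa", List.of("", "a", "aa", "aaa", "aaaa")));
        // A small character range, [a-c] as a|b|c
        System.out.println(agreeOn("[a-c]", "a|b|c", List.of("a", "b", "c", "d", "ab")));
    }
}
```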
Recording
• Syntax: <x = R>
• "x" is a recording identifier (it "remembers" the substring it matches)
• Semantics:
• Example (simplified emails): <user = [a-z]+> "@" <domain = [a-z]+ ("." [a-z]+)*>
• Matching against string: "obama@whitehouse.gov"
• yields: user = "obama" & domain = "whitehouse.gov"
• Related: "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP
Recording (lists)
• Another example (yielding lists):
  (<name = [a-z]+> "\n")*
  <name = [a-z]+> " & " <name = [a-z]+>
  <name = [a-z]+> (" & " <name = [a-z]+>)*
• Matching against string: "obama & bush"
• yields a list structure: name = [obama, bush]
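For contrast, a capturing group under a repetition in `java.util.regex` retains only its last match, so a list-valued recording has to be collected by hand. The sketch below first validates the whole input against the third expression above and then gathers every name with `find()`; the class `NameList` is hypothetical and is not the authors' implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a list-valued recording.
final class NameList {
    // Whole-input shape: <name=[a-z]+> (" & " <name=[a-z]+>)*
    private static final Pattern WHOLE = Pattern.compile("[a-z]+( & [a-z]+)*");
    // Body of a single <name=...> recording.
    private static final Pattern NAME = Pattern.compile("[a-z]+");

    // Returns all recorded names in order, or null if the input does not match.
    static List<String> match(String input) {
        if (!WHOLE.matcher(input).matches()) {
            return null;
        }
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(match("obama & bush")); // [obama, bush]
    }
}
```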
Recording (structured)
• Yet another example: <person = <name = [a-z]+> ", " <age = [0-9]+>>
• Matching against string: "obama, 48"
• yields: person = obama, 48; person.name = obama; person.age = 48
Ambiguity
• Some regular expressions are ambiguous: <day = [0-9]{1,2}> <month = [0-9]{1,2}>
• matched on the string "101" gives rise to:
  • day = 1 and month = 01 (i.e., 1st of January)
  • day = 10 and month = 1 (i.e., 10th of January)
• Multiple ways of matching => ambiguous
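The two readings of "101" can be made visible by brute force: try every split point and keep the splits where both halves match `[0-9]{1,2}`. This is only an illustration of the ambiguity on one input, not the static analysis presented in the talk; the class `AmbiguousDate` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration: enumerate all ways to match
// <day=[0-9]{1,2}> <month=[0-9]{1,2}> against a given string.
final class AmbiguousDate {
    private static final Pattern PART = Pattern.compile("[0-9]{1,2}");

    // Each result is a {day, month} pair for one way of matching the input.
    static List<String[]> allMatches(String input) {
        List<String[]> result = new ArrayList<>();
        for (int split = 0; split <= input.length(); split++) {
            String day = input.substring(0, split);
            String month = input.substring(split);
            if (PART.matcher(day).matches() && PART.matcher(month).matches()) {
                result.add(new String[] {day, month});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] m : allMatches("101")) {
            System.out.println("day = " + m[0] + ", month = " + m[1]);
        }
        // Prints:
        // day = 1, month = 01
        // day = 10, month = 1
    }
}
```

More than one result for some input means the expression is ambiguous; a four-digit input such as "1001" has exactly one split.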
Characterization of Ambiguity
• Theorem: R is unambiguous iff [conditions not preserved in this transcript]
• NB: sound & complete!
• Choice: <foo = a> | <bar = a*>. For the string "a", 2 ways: foo = "a" or bar = "a"
• Concatenation: <foo = a*> <bar = a*>. For the string "a", 2 ways: foo = "a" or bar = "a"
• Star (R* = ε | RR*): <foo = a|aa>*. For the string "aa", 2 ways: foo = [a,a] or foo = [aa]
• Related work: [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed).
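The star example can be checked by counting decompositions: the number of ways to write the input as a concatenation of pieces drawn from {a, aa}. The sketch below is a small dynamic program over prefixes. It only counts parses for one concrete string and is not the syntax-directed characterization from the theorem; the names are hypothetical.

```java
import java.util.List;

// Hypothetical illustration: count the ways a string splits into given pieces,
// mirroring <foo = a|aa>* where each iteration records one piece.
final class StarSplits {
    // Pieces are assumed to be distinct and non-empty.
    static long countSplits(String input, List<String> pieces) {
        long[] ways = new long[input.length() + 1];
        ways[0] = 1; // the empty prefix has exactly one (empty) decomposition
        for (int end = 1; end <= input.length(); end++) {
            for (String piece : pieces) {
                int start = end - piece.length();
                if (!piece.isEmpty() && start >= 0 && input.startsWith(piece, start)) {
                    ways[end] += ways[start];
                }
            }
        }
        return ways[input.length()];
    }

    public static void main(String[] args) {
        System.out.println(countSplits("aa", List.of("a", "aa"))); // 2: [a,a] and [aa]
        System.out.println(countSplits("aa", List.of("a")));       // 1: [a,a]
    }
}
```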
What to do about it?
1) Manual rewriting: Always possible :-) Tedious :-( Error-prone :-( Not structure-preserving :-(
   <foo = a> | <bar = a*> is rewritten to <foo = a> | <bar = ε|aaa*>
2) Restriction: R1 - R2. And then encode...: $R^C$ as: $\Sigma^* - R$; R1 & R2 as: $(R_1^C | R_2^C)^C$
   <foo = a> | <bar = a*>: using restriction we get <foo = a> | <bar = a* - a>
3) Disambiguators: Three basic operators: choice: '|L', '|R'; concat: '·L', '·R'; star: '*L', '*R'
   <foo = a> |L <bar = a*>
   • Related work: [Vansummeren'06] but with global, not local disambiguation
4) Default disambiguation: concat, choice, and star are all left-biased (by default)! (Our tool does this)
   <foo = a> | <bar = a*>: no need to rewrite
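Two of these options have loose analogues in `java.util.regex`, sketched below. Java's backtracking matcher tries alternatives in order, which behaves like a left-biased choice, and a negative lookahead can emulate the restriction `a* - a`. This is an analogy under those assumptions, not the semantics of the authors' tool; the class `ChoiceBias` and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of biased choice and restriction.
final class ChoiceBias {
    // Reports which named group ("foo" or "bar") took part in a whole-input match,
    // or null if the input does not match. The regex must define both groups.
    static String winner(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group("foo") != null ? "foo" : "bar";
    }

    // Membership in the restricted language a* - a: the lookahead rejects
    // the case where the remaining input is exactly "a".
    static boolean inRestrictedBar(String input) {
        return Pattern.matches("(?!a\\z)a*", input);
    }

    public static void main(String[] args) {
        System.out.println(winner("(?<foo>a)|(?<bar>a*)", "a"));  // foo (first alternative wins)
        System.out.println(winner("(?<bar>a*)|(?<foo>a)", "a"));  // bar (order swapped)
        System.out.println(winner("(?<foo>a)|(?<bar>a*)", "aa")); // bar (only bar can match)
        System.out.println(inRestrictedBar("a"));                 // false
        System.out.println(inRestrictedBar("aa"));                // true
    }
}
```

With the restricted `bar`, the string "a" can only be matched by `foo`, so the choice no longer depends on the order of the alternatives.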
Type Mapping
• Our date example: <date = <day = [0-9]{2}> "/" <month = [0-9]{2}> "/" <year = [0-9]{4}>>
• Type of the recordings date, day, month, and year?
  • Strings (=> many type casts)
  • Infer the type
Type Mapping
• A recording has three type components:
  • a linguistic type (language of the recording; maps to String, int, float, etc.)
  • a structural type (nested recordings; maps to (nested) classes)
  • a type modifier (maps to lists)
• Related work: Exact type inference in XDuce & CDuce (soundness + completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)
Type Mapping
• Example: Person = <name = [a-z]+> " (" <age = [0-9]+> ")"
• compile (our tool):
    class Person { // auto-generated
      String name;
      int age;
      static Person match(String s) { ... }
      public String toString() { ... }
    }
• Usage:
    String s = "obama (48)";
    Person p = Person.match(s);
    print(p.name + " is " + p.age + "y old");
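The slide elides the bodies of `match` and `toString`. The following hand-written sketch shows one way a class with that interface could behave, using `java.util.regex` named groups. It is an assumption for illustration only and says nothing about the code the authors' tool actually generates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for the auto-generated class on the slide (hypothetical bodies).
final class Person {
    // Person = <name = [a-z]+> " (" <age = [0-9]+> ")"
    private static final Pattern PERSON =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name; // linguistic type of [a-z]+ maps to String
    final int age;     // linguistic type of [0-9]+ maps to int

    private Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Returns null when the input does not match. Integer.parseInt throws
    // NumberFormatException for digit strings that do not fit in an int.
    static Person match(String s) {
        Matcher m = PERSON.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new Person(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        Person p = Person.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // obama is 48y old
    }
}
```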
Type Mapping
• Person = <name = [a-z]+> " (" <age = [0-9]+> ")"
• People = ( $Person "\n" )*
• compile (our tool):
    class People { // auto-generated
      String[] name;
      int[] age;
      static People match(String s) { ... }
      public String toString() { ... }
    }
• Usage:
    String s = "obama (48)\nbush (63)\n";
    People p = People.match(s);
    println("Second name is " + p.name[1]);
Type Mapping
• Person = <name = [a-z]+> " (" <age = [0-9]+> ")"
• People = ( <person = $Person> "\n" )* ;
• compile (our tool):
    class People { // auto-generated
      Person[] person;
      class Person { // nested class
        String name;
        int age;
      }
      ...
    }
• Usage:
    String s = "obama (48)\nbush (63)\n";
    People people = People.match(s);
    for (People.Person p : people.person) println(p.name);
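A hand-written approximation of the nested variant is sketched below: the structural type becomes a nested class and the star becomes a list. The regex-based `match` body and the use of `List` instead of an array are assumptions made for illustration; the generated code is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for the generated People class (hypothetical bodies).
final class People {
    // Nested recording <person = $Person> maps to a nested class.
    static final class Person {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // People = ( <person = $Person> "\n" )*
    private static final Pattern WHOLE = Pattern.compile("([a-z]+ \\([0-9]+\\)\n)*");
    private static final Pattern LINE =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)\n");

    final List<Person> person; // the star maps to a list

    private People(List<Person> person) {
        this.person = person;
    }

    // Validates the whole input first, then collects one Person per line.
    static People match(String s) {
        if (!WHOLE.matcher(s).matches()) {
            return null;
        }
        List<Person> list = new ArrayList<>();
        Matcher m = LINE.matcher(s);
        while (m.find()) {
            list.add(new Person(m.group("name"), Integer.parseInt(m.group("age"))));
        }
        return new People(list);
    }

    public static void main(String[] args) {
        People people = People.match("obama (48)\nbush (63)\n");
        for (Person p : people.person) {
            System.out.println(p.name);
        }
    }
}
```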
Conclusion
Regular expressions are alive and well.
This paper:
• Precise ambiguity analysis
• Type mapping
Future work: improve performance, subtyping of recordings
"trade (excess) expressivity for safety + simplicity"
Thank you. Questions?
Ambiguity
• Definition: R is ambiguous iff $\exists\, T, T' \in AST_R : T \neq T' \wedge \|T\| = \|T'\|$
• where $\|\cdot\| : AST \to \Sigma^*$ (the flattening) is:
Type Inference
• R : (L, S)