UPA and Restriction for All-Groups and Numeric Exponents Matthew Fuchs, PhD Westbridge Technology matt@westbridgetech.com Allen Brown, PhD Microsoft
Why Bother? • Numeric Exponents introduced by the W3C XML Schema WG. • Restriction is a subsumption relation among content models. • And-groups long cherished by Markup Community. • UPA is an old constraint on content models in WXS. • What is the cost of combination?
Naïve Algorithms • Exponential or worse: • All-groups try all exponential cases. • Numeric exponents – unroll - doubly exponential: • First unroll: (a{0,3} | b){10, 20} => ((a | aa | aaa | b)…(a |…)…. • Then determinise. • Used by XSV, Xerces, Sun. • To not try to do better is simply remiss.
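A rough sketch of the first cost in the naive approach, assuming a hypothetical tuple encoding of content models (the encoding and the function name are illustrative, not from the slides). It counts the terminals left after every p{lo,hi} is replaced by hi copies of p. The count is the product of the nested upper bounds, so it grows exponentially in the number of digits used to write the exponents, and determinising the result can then blow up again.

```python
# Hypothetical encoding (an assumption for illustration):
#   ("sym", name) | ("seq", [p, ...]) | ("alt", [p, ...]) | ("rep", p, lo, hi)

def unrolled_size(p):
    """Count terminal occurrences after unrolling every p{lo,hi} into hi copies."""
    kind = p[0]
    if kind == "sym":
        return 1
    if kind in ("seq", "alt"):
        return sum(unrolled_size(q) for q in p[1])
    if kind == "rep":
        _, body, _lo, hi = p
        # lo mandatory copies plus (hi - lo) optional copies: hi copies in total.
        return hi * unrolled_size(body)
    raise ValueError(f"unknown particle kind: {kind}")

# (a{0,3} | b){10,20} from the slide
example = ("rep", ("alt", [("rep", ("sym", "a"), 0, 3), ("sym", "b")]), 10, 20)
size = unrolled_size(example)
print(size)  # 80
```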
UPA Testing • Generally just need to check follow sets. • Problem for numeric exponents for a{m,m}. • For example: • (a1,b2){2,2},a3 => ababa • ((a1, b2){1,3},a3) => aba or ababa or abababa • Is a1 in follow(b2)?
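The two example languages can be checked with Python's standard `re` module, writing the expressions as ordinary regexes (the candidate list is only an illustration). With the fixed exponent, the iteration count decides whether the `a` after a `b` is a1 or a3. With the range, the `a` after the first `ab` could be either, which is the question the slide raises.

```python
import re

candidates = ["a", "aba", "ababa", "abababa", "ababababa"]

fixed = re.compile(r"(ab){2}a")     # (a1,b2){2,2},a3
ranged = re.compile(r"(ab){1,3}a")  # ((a1,b2){1,3},a3)

fixed_matches = [s for s in candidates if fixed.fullmatch(s)]
ranged_matches = [s for s in candidates if ranged.fullmatch(s)]

print(fixed_matches)   # ['ababa']
print(ranged_matches)  # ['aba', 'ababa', 'abababa']
```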
Problem for All-groups • Again, are different branches in each other’s follow sets? • (a & b & c) => follow(a) = {b, c} • (a & b? & c) => follow(a) = {b, c} union follow(b) => {a, b, c} • ((a,b?) & b & c) => violates UPA
Five properties of particles • particles(p) => all particles within p, recursively defined. • opaque(p) => a particle is opaque if it can’t match the empty string. • first(p) => particles in p that can match first letter in a string matching p • follow(p) => particles in the outer expression that can match a letter in a string after substring matched by p. • confusion(p) => particles in p which could conflict with follow(p) • Example: (a, b?) => b is in confusion((a, b?))
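A minimal sketch of two of these properties, opaque(p) and first(p), over a hypothetical tuple encoding (the encoding is an assumption for illustration, and terminals are identified by name only). follow(p) and confusion(p) depend on the enclosing expression and are left out.

```python
# Hypothetical encoding (an assumption, not from the slides):
#   ("sym", name) | ("seq", [p, ...]) | ("alt", [p, ...])
#   ("all", [p, ...]) | ("rep", p, lo, hi)

def opaque(p):
    """True if p cannot match the empty string."""
    kind = p[0]
    if kind == "sym":
        return True
    if kind in ("seq", "all"):
        return any(opaque(q) for q in p[1])
    if kind == "alt":
        return all(opaque(q) for q in p[1])
    if kind == "rep":
        _, body, lo, _hi = p
        return lo > 0 and opaque(body)
    raise ValueError(f"unknown particle kind: {kind}")

def first(p):
    """Names of terminals that can match the first letter of a string matching p."""
    kind = p[0]
    if kind == "sym":
        return {p[1]}
    if kind in ("alt", "all"):
        return set().union(*(first(q) for q in p[1]))
    if kind == "seq":
        result = set()
        for q in p[1]:
            result |= first(q)
            if opaque(q):  # nothing after an opaque member can come first
                break
        return result
    if kind == "rep":
        return first(p[1])
    raise ValueError(f"unknown particle kind: {kind}")

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
opt = lambda p: ("rep", p, 0, 1)

print(first(("seq", [opt(a), b])))     # a and b
print(opaque(("seq", [a, opt(b)])))    # True
print(first(("all", [a, opt(b), c])))  # a, b and c
```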
Special Considerations • follow(p) restricted as follows: • (((a?,b){m,m}),c) => follow(b) = {c} • (((a?,b){m, n}),c) => follow(b) = {c, a, b} • ((a & b & c), d) => follow(c) = {d}
Sources of UPA Violation • Consider P in: • (α, β{0,1}, P, γ) • (α, (β | P), γ) • (α, (β & P), γ) • UPA violation requires 2 terminals: • One before P, one inside P – need first(P) • Both inside P – in a moment • One inside P, one after P – need confusion(P) • One before P, one after P – opaque(P) is false
Internal Consistency • P{m, m} – even if P obeys UPA, there is a violation when confusion(P) intersection first(P) != {} • If P is (α & β & γ) then there is a violation when: • the first sets overlap, or • confusion(α) intersection (first(β) U first(γ)) != {} • And so on for β and γ
UPA Algorithm • UPA(α) holds in each case as follows: • α = a (a terminal): if b_i, b_j are in follow(a), then i = j • α = β{m,n}: UPA(β) and first(β) ∩ confusion(β) = {} • α = (β_1 | … | β_n): ∩_{i=1..n} first(β_i) = {} and ∧_{i=1..n} UPA(β_i) • α = (β_1 & … & β_n): ∩_{i=1..n} first(β_i) = {} and ∧_{i=1..n} (UPA(β_i) and confusion(β_i) ∩ (∪_{j≠i} first(β_j)) = {}) • α = (β_1 , … , β_n): UPA(β_1) ∧ UPA((β_2, …, β_n))
Subsumption for Exponents • Two steps • For fixed exponents • For exponent ranges • Most equipment carries over • Will use B or b to refer to base model, and R or r to refer to restricted model
Traditional • Subsumption through transformation into automaton. • The intersection of the automata R and not(B) should be empty (not(B) is B with its accepting states inverted). • Once again, too huge when everything is unrolled.
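A minimal sketch of the traditional check, assuming both models have already been turned into deterministic automata. The tuple layout `(start, accepting, delta)` and the function name are illustrative assumptions, and state names must not be `None`, which stands for B's dead state. Walking the product of R and B and looking for a pair where R accepts and B does not is the same as testing that R intersected with not(B) is empty.

```python
from collections import deque

def included(r, b):
    """Return True if every string accepted by DFA r is also accepted by DFA b."""
    r_start, r_accept, r_delta = r
    b_start, b_accept, b_delta = b
    start = (r_start, b_start)
    seen = {start}
    queue = deque([start])
    while queue:
        rs, bs = queue.popleft()
        # R accepts here but B does not: a witness string exists.
        if rs in r_accept and bs not in b_accept:
            return False
        for (state, sym), r_next in r_delta.items():
            if state != rs:
                continue
            b_next = b_delta.get((bs, sym))  # None means B is stuck (dead state)
            pair = (r_next, b_next)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True

# R = (a, b)   B = (a, b?)
R = (0, {2}, {(0, "a"): 1, (1, "b"): 2})
B = (0, {1, 2}, {(0, "a"): 1, (1, "b"): 2})

print(included(R, B))  # True
print(included(B, R))  # False
```

The cost the slide points at comes from building these automata in the first place: with numeric exponents unrolled, the state sets become very large before the product is even formed.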
Our Machines • Represent regex as graph. • Forward edges, matching terminals, form a DAG • Back edges, matching exponents, form connected components. • Each back edge marked with its arity.
Execution Model • Letters are matched going forward by edges. • Machine is “trapped” when a back-edge is entered. • Can’t leave until obligation (value of back edges) fulfilled. • Edge constraints fulfilled in LIFO order. • Stack maintains current iterations.
Example • (a, ((a,b)^2 | b))^2 • [Figure: machine for this expression, with two back edges labeled 2]
Subsumption Checking • Start as usual. • When entering head of a back edge, add entry to machine’s stack. • When both reach repeated state: • Tail of a back edge • Previously seen in list of traversed states • Determine if there is a matched component • Maximally reduce exponents for matched edges
For Example • (a,(a,b,a,b)^6,b^3,c) <= (a,((a,b)^2|b)^9,c)
• Trace columns: state pair (r, b), letter, next state pair (r, b), r stack, b stack
• (0,0) a (1,1) [], []
• (1,1) a (2,2) [0], [0,0]
• (2,2) b (3,3) [0], [1,0]
• (3,1) a (4,2) [0], [1,0]
• (4,2) b (5,3) [1], [2,0]
• (5,3) dXd (5,1) [], [6]
• (5,1) b (6,3) [1], []
• (6,3) dXd (6,3) [], []
• (6,3) c (7,4) [], []
• [Figure: the two machines for the restricted and base models]
Reducing Exponents • Find cross-product back-edge (startr and startb) • Get dr and db (number iterations each) • Get leftover (totalr – startr) = lr • lr div dr = quotr and remr, etc. • newr = lr – (dr * min(quotr, quotb)) +startr
Why So Complicated • Compare (a,a,a)^7 and (a,a)^12 • Must go 3 rounds of (a,a) for 2 rounds of (a,a,a). • lr = 7, lb = 12 • dr = 2, db = 3 • lr div dr = 3 rem 1, lb div db = 4 rem 0 • newr = 7 – (2*3) + 0 = 1, newb = 12 – (3*3) + 0 = 3 • Hence, max 6 rounds of (a,a,a) and 9 of (a,a).
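The reduction step can be sketched directly from the formula newr = lr – (dr * min(quotr, quotb)) + startr, applied symmetrically to the base side. The function name and argument order are illustrative, and the example call assumes both start values are 0 and takes dr = 2 and db = 3 as given on the slide.

```python
def reduce_exponents(total_r, start_r, d_r, total_b, start_b, d_b):
    """Absorb as many matched rounds as both loops allow; return (new_r, new_b)."""
    l_r = total_r - start_r            # leftover iterations on the restricted side
    l_b = total_b - start_b            # leftover iterations on the base side
    quot_r, _rem_r = divmod(l_r, d_r)  # whole matched components that fit in r
    quot_b, _rem_b = divmod(l_b, d_b)  # whole matched components that fit in b
    k = min(quot_r, quot_b)            # components both sides can absorb
    new_r = l_r - d_r * k + start_r
    new_b = l_b - d_b * k + start_b
    return new_r, new_b

# (a,a,a)^7 against (a,a)^12: 2 rounds of (a,a,a) line up with 3 rounds of (a,a).
result = reduce_exponents(7, 0, 2, 12, 0, 3)
print(result)  # (1, 3)
```

With k = 3 matched components, 6 rounds of (a,a,a) and 9 rounds of (a,a) are absorbed, leaving 1 and 3 iterations respectively.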
Generalized Exponents • Must keep track of minimum and maximum possible transitions. • An edge can contribute to max alone or to both min and max. • Can’t exit until max > min allowed. • Must exit before min > max allowed.
So…. • Generate as few minr/b as possible. • If they exceed maxr/b, you’re screwed • Generate as many maxr/b as possible • Means you can use a forward transition • Use parsimoniously to maximize the amount matched
More Complex Machinery • Back edge constraints have min and max. • Some back edges increment just the max value. • Other back edges increment both min and max values. • Max means maximum possible match. • Min means minimum possible match.
Example • ((a, b?){3, 5}, c) • [Figure: machine for this expression, with back edges labeled 3,5]
Four Kinds of Pairs • When hitting a min-edge/min-edge: • Calculate min/min values (prev. algorithm with min exponents) • Calculate max/max values (prev. algorithm with max exponents) • Move forward when possible • If min ever exceeds max, fail. • When hitting a max-edge/max-edge • Calculate min/min values • Calculate max/max values • When max > min, you can progress (when leaving a cycle set min to passing value) • Else fail. • Etc.
After exiting loop, some iterations remain. • As all “unabsorbed” transitions attempted, all possibilities tried. • Given ( ){mb,nb} • And ( ){m’r,n’r}, ( ){m”r,n”r} • Ensure m’r+m”r >= mb and n’r+n”r <= nb
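A small sketch of the final range check, assuming the bounds are meant inclusively so that a model restricts itself. The function name and the `(min, max)` tuple layout are illustrative.

```python
def pieces_fit(base, pieces):
    """True if the summed (min, max) ranges of the restricted pieces lie within base."""
    min_b, max_b = base
    min_sum = sum(m for m, _ in pieces)
    max_sum = sum(n for _, n in pieces)
    return min_b <= min_sum and max_sum <= max_b

print(pieces_fit((3, 5), [(1, 2), (2, 3)]))  # True: sums are 3 and 5
print(pieces_fit((3, 5), [(1, 2), (1, 2)]))  # False: minimum 2 is below 3
print(pieces_fit((3, 5), [(2, 3), (2, 3)]))  # False: maximum 6 is above 5
```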
If “rest of expression” matches longest and shortest (i.e., matched m or matched n) then will match all iterations. • Matching longest will try all alternatives. • Matching shortest will try fewest alternatives. • As first sets repeat, UPA shows there must be optionality or iteration.
Nested Exponents • (a){m,n}{m’,n’} • (a{m,n} | b){m’, n’} • Edges in machine have multiple exponents. • Depth of n makes 2^(n-1) ranges • Each must be tried • Requires tracking scope. • Requires lookahead.
Cost • Without nesting, algorithm is exponential in number of exponents – each exponent requires testing min and max. • With nesting, remains exponential, as this doesn’t affect the number of exponents. • Still a huge improvement over unrolling.
Example • ((a?,b{8,9}){2,3},c) > (a,(b,b){3,3},(b,b){6,6},c) • First 6 b’s at level 2, remaining 12 iterate both levels • At higher levels ranges overlap – need to check all possibilities • [Figure: machines for both expressions; restricted back edges labeled 3 and 6, base edge labeled {8,9}{2,3}]
((a?,b{8,9}){2,9},c) > (a,(b,b){3,3},(b,b){6,6},c) • 8*9=72, 9*8=72 • Need to check ending of 8 and start of 9 • Need lookahead to choose. • Represented as ranges at all levels. • [Figure: machines for both expressions; base edge labeled {8,9}{2,9}]
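To see where the ranges overlap, here is a small sketch that lists the possible total counts of a single terminal under a nested exponent x{m,n}{m2,n2}, one range per outer iteration count. The function names are illustrative, and the optional a? in the slide's expression is ignored: only the b's are counted.

```python
def nested_ranges(m, n, m2, n2):
    """(min, max) total repetitions of x in x{m,n}{m2,n2}, per outer count k."""
    return [(k * m, k * n) for k in range(m2, n2 + 1)]

def touching(ranges):
    """Consecutive ranges that share at least one value."""
    return [(lo, hi) for lo, hi in zip(ranges, ranges[1:]) if hi[0] <= lo[1]]

small = nested_ranges(8, 9, 2, 3)
large = nested_ranges(8, 9, 2, 9)

print(small)            # [(16, 18), (24, 27)]
print(touching(small))  # []
print(touching(large))  # [((64, 72), (72, 81))]
```

For {8,9}{2,9}, the ranges for 8 and 9 outer iterations meet at 72 (9 inner iterations 8 times, or 8 inner iterations 9 times), so a count of 72 b's does not by itself tell the machine which outer count it is in.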
Conclusions • Numeric exponents are hard to work with for subsumption. • All-groups are not that difficult. • Interaction will be even more annoying. • Need to implement and test.