
UPA and Restriction for All-Groups and Numeric Exponents

This article discusses the introduction of numeric exponents by the W3C XML Schema WG, the subsumption relation of the restriction on content models, and the challenges and algorithms for UPA testing. It also explores special considerations, internal consistency, and subsumption for exponents.

lonniem




Presentation Transcript


  1. UPA and Restriction for All-Groups and Numeric Exponents Matthew Fuchs, PhD Westbridge Technology matt@westbridgetech.com Allen Brown, PhD Microsoft

  2. Why Bother? • Numeric Exponents introduced by the W3C XML Schema WG. • Restriction is a subsumption relation among content models. • And-groups long cherished by Markup Community. • UPA is an old constraint on content models in WXS. • What is the cost of combination?

  3. Naïve Algorithms • Exponential or worse: • All-groups try all exponential cases. • Numeric exponents – unroll - doubly exponential: • First unroll: (a{0,3} | b){10, 20} => ((a | aa | aaa | b)…(a |…)…. • Then determinise. • Used by XSV, Xerces, Sun. • To not try to do better is simply remiss.
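To make the blow-up concrete, here is a minimal sketch of how large the naive expansions get. It assumes that an all-group is expanded into every ordering of its members, and that a{0,k} is unrolled as (a | aa | …) and the outer exponent as one copy per permitted iteration, as in the slide's example. The function names are hypothetical and are not taken from XSV, Xerces, or Sun's code.

```python
from itertools import permutations

def all_group_orders(members):
    """Every ordering a naive expansion of (m1 & m2 & ...) has to list."""
    return ["".join(order) for order in permutations(members)]

def unrolled_positions(inner_max, outer_max):
    """Terminal positions after unrolling (a{0,inner_max} | b){..,outer_max}.

    Assumption: the inner particle becomes (a | aa | ... | a^inner_max | b)
    and the outer exponent becomes outer_max copies of that choice.
    """
    a_positions = sum(range(1, inner_max + 1))  # a + aa + aaa + ...
    return outer_max * (a_positions + 1)        # +1 for the single b

print(len(all_group_orders("abc")))   # orderings of (a & b & c)
print(unrolled_positions(3, 20))      # size of the slide's example before determinising
```

The position count grows with the numeric value of the exponents, which is exponential in the number of digits needed to write them down, and determinising the unrolled expression can add a second blow-up on top.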

  4. UPA Testing • Generally just need to check follow sets. • Problem for numeric exponents for a{m,m}. • For example: • (a1,b2){2,2},a3 => ababa • ((a1, b2){1,3},a3) => aba or ababa or abababa • Is a1 in follow(b2)?

  5. Problem for All-groups • Again, are different branches in each other’s follow sets? • (a & b & c) => follow(a) = {b, c} • (a & b? & c) => follow(a) = {b, c} union follow(b) => {a, b, c} • ((a, b?) & b & c) => violates UPA

  6. Five properties of particles • particles(p) => all particles within p, recursively defined. • opaque(p) => a particle is opaque if it can’t match the empty string. • first(p) => particles in p that can match first letter in a string matching p • follow(p) => particles in the outer expression that can match a letter in a string after substring matched by p. • confusion(p) => particles in p which could conflict with follow(p) (a, b?) => b is in confusion((a, b?))
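Two of these properties, opaque(p) and first(p), can be sketched directly over a small expression tree. This is an illustration only: the tuple encoding and the helper names (sym, seq, alt, opt, rep) are assumptions made for the example, all-groups are left out, and terminals are distinguished by position-tagged names such as "a1" in the style of the slides.

```python
def sym(name): return ("sym", name)
def seq(*parts): return ("seq", parts)
def alt(*parts): return ("alt", parts)
def rep(body, lo, hi): return ("rep", body, lo, hi)
def opt(body): return rep(body, 0, 1)

def opaque(p):
    """True if p cannot match the empty string."""
    kind = p[0]
    if kind == "sym":
        return True
    if kind == "seq":
        return any(opaque(q) for q in p[1])
    if kind == "alt":
        return all(opaque(q) for q in p[1])
    _, body, lo, _ = p          # "rep": opaque only if at least one
    return lo > 0 and opaque(body)  # opaque iteration is required

def first(p):
    """Terminals that can match the first letter of a string matching p."""
    kind = p[0]
    if kind == "sym":
        return {p[1]}
    if kind == "alt":
        return set().union(*(first(q) for q in p[1]))
    if kind == "seq":
        out = set()
        for q in p[1]:
            out |= first(q)
            if opaque(q):       # later parts cannot come first
                break
        return out
    _, body, _, hi = p
    return first(body) if hi > 0 else set()

# ((a?, b){2,2}, c) with position-tagged terminals
p = seq(rep(seq(opt(sym("a1")), sym("b2")), 2, 2), sym("c3"))
print(first(p), opaque(p))
```

follow(p) and confusion(p) depend on the enclosing expression, so they would need the parent context threaded through the traversal; they are omitted here.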

  7. Special Considerations • follow(p) restricted as follows: • (((a?,b){m,m}),c) => follow(b) = {c} • (((a?,b){m, n}),c) => follow(b) = {c, a, b} • ((a & b & c), d) => follow(c) = {d}

  8. Sources of UPA Violation • Consider P in: • (a, b{0,1}, P, g) • (a, (b | P), g) • (a, (b & P), g) • UPA violation requires 2 terminals: • One before P, one inside P – need first(P) • Both inside P – in a moment • One inside P, one after P – need confusion(P) • One before P, one after P – opaque(P) is false

  9. Internal Consistency • P{m, m} – even if P obeys UPA, there is a violation when confusion(P) intersection first(P) != {} • If P is (a & b & g) then a violation arises from: • overlap in first sets • confusion(a) intersects (first(b) U first(g)) != {} • And so on for b and g

  10. UPA Algorithm UPA(p) holds as follows, by the form of p: • p is a terminal a: if b_i, b_j are both in follow(a), then i = j • p = q{m,n}: UPA(q) and first(q) intersection confusion(q) = {} • p = (q1 | … | qn): first(qi) intersection first(qj) = {} for all i != j, and UPA(qi) for every i • p = (q1 & … & qn): first(qi) intersection first(qj) = {} for all i != j, UPA(qi) for every i, and confusion(qi) intersection (U over j != i of first(qj)) = {} • p = (q1 , … , qn): UPA(q1) and UPA((q2, …, qn))

  11. Subsumption for Exponents • Two steps • For fixed exponents • For exponent ranges • Most equipment carries over • Will use B or b to refer to base model, and R or r to refer to restricted model

  12. Traditional • Subsumption through transformation into automata. • The intersection of automata (R intersects not(B)) should be empty (not(B) is B with its accepting states inverted). • Once again, far too large when everything is unrolled.
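The traditional check can be sketched as a search over the product of two deterministic automata: R restricts B exactly when no reachable state pair is accepting in R but not in B. The encoding below (a start state, a set of accepting states, and a dict from (state, letter) to state) and the name restricts are assumptions made for this illustration. A missing transition in B is treated as a dead state, represented by None, so None must not be used as a state name.

```python
from collections import deque

def restricts(r, b):
    """True if every string accepted by DFA r is also accepted by DFA b."""
    r_start, r_accept, r_delta = r
    b_start, b_accept, b_delta = b
    seen = {(r_start, b_start)}
    queue = deque(seen)
    while queue:
        r_state, b_state = queue.popleft()
        # A pair accepted by r but not by b witnesses R intersect not(B) != {}
        if r_state in r_accept and b_state not in b_accept:
            return False
        for (state, letter), r_next in r_delta.items():
            if state != r_state:
                continue
            # None stands for the dead state of the completed automaton for b
            b_next = b_delta.get((b_state, letter))
            pair = (r_next, b_next)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True

# Base (a, b?, c) and restriction (a, b, c), built by hand
base = (0, {3}, {(0, "a"): 1, (1, "b"): 2, (1, "c"): 3, (2, "c"): 3})
restricted = (0, {3}, {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3})
print(restricts(restricted, base), restricts(base, restricted))
```

The search itself is cheap. The cost the slide refers to comes from building the automata in the first place once every numeric exponent has been unrolled.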

  13. Our Machines • Represent regex as graph. • Forward edges, matching terminals, form a DAG • Back edges, matching exponents, form connected components. • Each back edge marked with its arity.

  14. Execution Model • Letters are matched going forward by edges. • Machine is “trapped” when a back-edge is entered. • Can’t leave until obligation (value of back edges) fulfilled. • Edge constraints fulfilled in LIFO order. • Stack maintains current iterations.

  15. Example • (a,((a,b)^2|b))^2 [Figure: graph of the machine, with forward edges labeled a, a, b, b and two back edges of arity 2]

  16. Subsumption Checking • Start as usual. • When entering head of a back edge, add entry to machine’s stack. • When both reach repeated state: • Tail of a back edge • Previously seen in list of traversed states • Determine if there is a matched component • Maximally reduce exponents for matched edges

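  17. For Example • (a,(a,b,a,b)^6,b^3,c) <= (a,((a,b)^2|b)^9,c) • Each step lists the current state pair (r, b), the letter matched, the next state pair, and the stacks of r and b:
      (r, b)   let   (r, b)   r stack   b stack
      (0,0)    a     (1,1)    []        []
      (1,1)    a     (2,2)    [0]       [0,0]
      (2,2)    b     (3,3)    [0]       [1,0]
      (3,1)    a     (4,2)    [0]       [1,0]
      (4,2)    b     (5,3)    [1]       [2,0]
      (5,3)    dXd   (5,1)    []        [6]
      (5,1)    b     (6,3)    [1]       []
      (6,3)    dXd   (6,3)    []        []
      (6,3)    c     (7,4)    []        []
      [Figure: graphs of the two machines, with edges labeled a, b, c and back edges of arity 2 and 9 on the base machine]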

  18. Reducing Exponents • Find cross-product back-edge (start_r and start_b) • Get d_r and d_b (number of iterations each) • Get leftover (total_r – start_r) = l_r • l_r div d_r = quot_r and rem_r, etc. • new_r = l_r – (d_r * min(quot_r, quot_b)) + start_r

  19. Why So Complicated • Compare (a,a,a)^7 and (a, a)^12 • Must go 3 rounds of (a,a) for 2 rounds of (a,a,a). • l_r = 7, l_b = 12 • d_r = 2, d_b = 3 • l_r div d_r = 3 rem 1; l_b div d_b = 4 rem 0 • new_r = 7 – (2*3) + 0 = 1; new_b = 12 – (3*3) + 0 = 3 • Hence, max 6 rounds of (a, a, a) and 9 of (a, a).
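The reduction formula from the previous two slides can be checked with a few lines of code. This is a sketch under stated assumptions: the function and parameter names are hypothetical, both start counters are taken to be 0 in the worked comparison, and rounds_per_component only covers the simple case where both loop bodies repeat the same single letter, so one matched component spans the least common multiple of the two body lengths.

```python
from math import gcd

def rounds_per_component(len_r, len_b):
    """Iterations of each loop body in one matched component (d_r, d_b)."""
    common = len_r * len_b // gcd(len_r, len_b)  # least common multiple
    return common // len_r, common // len_b

def reduce_exponents(total_r, start_r, d_r, total_b, start_b, d_b):
    """Apply new = l - d * min(quot_r, quot_b) + start to both machines."""
    l_r = total_r - start_r            # leftover iterations of r
    l_b = total_b - start_b            # leftover iterations of b
    quot_r, _rem_r = divmod(l_r, d_r)
    quot_b, _rem_b = divmod(l_b, d_b)
    absorbed = min(quot_r, quot_b)     # whole components both can absorb
    new_r = l_r - d_r * absorbed + start_r
    new_b = l_b - d_b * absorbed + start_b
    return new_r, new_b

# (a,a,a)^7 against (a,a)^12
d_r, d_b = rounds_per_component(3, 2)
print(d_r, d_b, reduce_exponents(7, 0, d_r, 12, 0, d_b))
```

With these inputs 7 - 1 = 6 rounds of (a,a,a) and 12 - 3 = 9 rounds of (a,a) are absorbed, which matches the slide's conclusion.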

  20. Generalized Exponents • Must keep track of minimum and maximum possible transitions. • Edges can contribute to min, to max, or to both. • Can’t exit until max > min allowed. • Must exit before min > max allowed.

  21. So…. • Generate as few minr/b as possible. • If they exceed maxr/b, you’re screwed • Generate as many maxr/b as possible • Means you can use a forward transition • Use parsimoniously to maximize the amount matched

  22. More Complex Machinery • Back edge constraints have min and max. • Some back edges increment just the max value. • Other back edges increment both min and max values. • Max means maximum possible match. • Min means minimum possible match.

  23. Example • ((a, b?){3, 5}, c) [Figure: graph of the machine, with edges labeled a, b, c, c and two back edges labeled 3,5]

  24. Four Kinds of Pairs • When hitting a min-edge/min-edge: • Calculate min/min values (prev. algorithm with min exponents) • Calculate max/max values (prev. algorithm with max exponents) • Move forward when possible • If min ever exceeds max, fail. • When hitting a max-edge/max-edge • Calculate min/min values • Calculate max/max values • When max > min, you can progress (when leaving a cycle set min to passing value) • Else fail. • Etc.

  25. After exiting loop, some iterations remain. • As all “unabsorbed” transitions attempted, all possibilities tried. • Given ( ){m_b, n_b} • And ( ){m'_r, n'_r}, ( ){m''_r, n''_r} • Ensure m'_r + m''_r > m_b and n'_r + n''_r < n_b

  26. If “rest of expression” matches longest and shortest (i.e., matched m or matched n) then will match all iterations. • Matching longest will try all alternatives. • Matching shortest will try fewest alternatives. • As first sets repeat, UPA shows there must be optionality or iteration.

  27. Nested Exponents • (a){m,n}{m’,n’} • (a{m,n} | b){m’, n’} • Edges in machine have multiple exponents. • Depth of n makes 2(n-1) ranges • Each must be tried • Requires tracking scope. • Requires lookahead.

  28. Cost • Without nesting, algorithm is exponential in number of exponents – each exponent requires testing min and max. • With nesting, remains exponential, as this doesn’t affect the number of exponents. • Still a huge improvement over unrolling.

  29. Example • ((a?,b{8,9}){2,3},c) > (a,(b,b){3,3},(b,b){6,6},c) • First 6 b’s at level 2, remaining 12 iterate both levels • At higher levels ranges overlap – need to check all possibilities [Figure: graphs of the two machines; one has back edges of arity 3 and 6, the other has states c0, a1, b2 and the nested exponents {8,9}{2,3}]

  30. ((a?,b{8,9}){2,9},c) > (a,(b,b){3,3},(b,b){6,6},c) • 8*9=72, 9*8=72 • Need to check ending of 8 and start of 9 • Need lookahead to choose. • Represented as ranges at all levels. [Figure: graphs of the two machines; one has states c0, a1, b2 and the nested exponents {8,9}{2,9}]
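The overlap behind "8*9=72, 9*8=72" can be made visible by listing, for each outer iteration count k, the range of total inner iterations it allows. The sketch below assumes a two-level exponent such as b{8,9}{2,9} and reads the slide as saying that k outer rounds permit between k*8 and k*9 b's. The function names are hypothetical.

```python
def level_ranges(inner, outer):
    """(k, fewest, most) total inner iterations for each outer count k."""
    inner_min, inner_max = inner
    outer_min, outer_max = outer
    return [(k, k * inner_min, k * inner_max)
            for k in range(outer_min, outer_max + 1)]

def overlapping_rounds(ranges):
    """Consecutive outer counts whose total ranges share a value."""
    return [(lo[0], hi[0])
            for lo, hi in zip(ranges, ranges[1:])
            if lo[2] >= hi[1]]

for k, fewest, most in level_ranges((8, 9), (2, 9)):
    print(k, fewest, most)
print(overlapping_rounds(level_ranges((8, 9), (2, 9))))
print(overlapping_rounds(level_ranges((8, 9), (2, 3))))
```

For {8,9}{2,9} the ranges for 8 and 9 outer rounds meet at 72, so a run of 72 b's cannot be assigned to one of them without looking ahead. For {8,9}{2,3} the two ranges (16 to 18 and 24 to 27) stay apart.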

  31. Conclusions • Numeric exponents are hard to work with for subsumption. • All-groups are not that difficult. • Interaction will be even more annoying. • Need to implement and test.
