Approximate Validity of XML Streaming Data • HUANG Cheng, LI Jun (University Paris-Sud & Huazhong University of Science and Technology) • Michel DE ROUGEMONT (University Paris II)
Motivation • Streaming Data from different sources • Approximate decisions • Correct • Robust • Statistics based computations
Plan • 1. Generalized statistics on trees • 2. Statistics allow approximate validity on words and trees, based on Property Testing (Edit Distance with Moves): Property testing for regular tree languages (ICALP 2004); Approximate Satisfiability and Equivalence (LICS 2006) • 3. Approximate validity on streaming data
Edit Distance with Moves • Classical Edit Distance: insertions, deletions, modifications • Edit Distance with Moves: a whole block of letters can also be moved in one step, e.g. 0111000011110011001 → 0111011110000011001 • Edit Distance with Moves generalizes to ordered trees
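As a quick check of the example on this slide, the two words appear to differ by a single move. The sketch below is a brute-force search over all block moves, not an efficient distance algorithm; `apply_move` and `one_move_apart` are hypothetical helper names introduced only for illustration.

```python
def apply_move(w, i, j, p):
    """Cut the block w[i:j] out of w and re-insert it at position p of the remainder."""
    block = w[i:j]
    rest = w[:i] + w[j:]
    return rest[:p] + block + rest[p:]


def one_move_apart(u, v):
    """Return True if one block move turns u into v (brute force over all moves)."""
    n = len(u)
    for i in range(n):
        for j in range(i + 1, n + 1):
            for p in range(n - (j - i) + 1):
                if apply_move(u, i, j, p) == v:
                    return True
    return False


u = "0111000011110011001"
v = "0111011110000011001"
print(one_move_apart(u, v))  # True: e.g. the block "1111" moves three positions to the left
```

Under the classical edit distance the same transformation would cost several insertions and deletions, which is why counting a move as one operation matters for the testers discussed later.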
Statistics on words (k-gram) • A word W of length n has n-k+1 blocks of length k = 1/ε • For W = 001010101110 and k = 2, there are n-k+1 = 11 blocks
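The block statistics for the word W on this slide can be worked out directly. The sketch below assumes that the statistics vector is simply the relative frequency of each length-k block among the n-k+1 blocks; the function name `ustat` mirrors the notation used later in the deck, but the code itself is only an illustration.

```python
from collections import Counter
from fractions import Fraction


def ustat(w, k):
    """Relative frequency of each length-k block among the n-k+1 blocks of w."""
    blocks = [w[i:i + k] for i in range(len(w) - k + 1)]
    counts = Counter(blocks)
    return {b: Fraction(c, len(blocks)) for b, c in sorted(counts.items())}


W = "001010101110"
for block, freq in ustat(W, 2).items():
    print(block, freq)
# 00 1/11
# 01 4/11
# 10 4/11
# 11 2/11
```

The four frequencies sum to 1, so the vector is a point in the simplex over the 2^k possible blocks.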
Statistics for unranked ordered trees • Transformation: Rabin Encoding • [Figure: an unranked tree over the labels a, b, d and its extended 2-ranked tree]
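A minimal sketch of such a transformation, assuming the encoding is the usual first-child / next-sibling one (the slide's figure is lost, so this is an assumption, and the "extended" padding of the 2-ranked tree is not modelled). The tuple representation and the name `encode` are hypothetical.

```python
def encode(tree):
    """Binary encoding of an unranked ordered tree.

    Input:  (label, [children])
    Output: (label, left, right), where left encodes the first child
            and right encodes the next sibling; None marks an absent subtree.
    """
    def enc(siblings):
        if not siblings:
            return None
        label, children = siblings[0]
        return (label, enc(children), enc(siblings[1:]))

    return enc([tree])


# a has two children b and c; c has one child d
t = ("a", [("b", []), ("c", [("d", [])])])
print(encode(t))
# ('a', ('b', None, ('c', ('d', None, None), None)), None)
```

Every node of the result has at most two children, so statistics defined on 2-ranked trees can be applied to documents of arbitrary fan-out.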
Statistics on Trees: generalized k-gram • We abbreviate “author” by a, “db” by b, “work” by w • Types of sub-paths: 00, 01, 10, 11 • [Figure: example tree over the labels a, b, w]
2. Approximate validity based on Property Testing • Let F be a property on a class K of structures U. • An ε-tester for F is a probabilistic algorithm A such that: if U |= F, A accepts; if U is ε-far from F, A rejects with high probability. • A property F is testable if there exists a probabilistic algorithm A such that: for all ε, A is an ε-tester for F; Time(A) is independent of n = |U|. • R. Rubinfeld, M. Sudan, Robust characterizations of polynomials, 1994. • O. Goldreich, S. Goldwasser and D. Ron, Property Testing and its connection to Learning and Approximation, 1996. • A tester usually implies a linear-time corrector. • (ε1, ε2)-Tolerant Tester.
Regular membership on words • H = {ustat(w) : w in r} is a union of polytopes. • [Figure: 2 polytopes for r, with the point Y(w)] • Membership Tester:
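The tester itself is not legible on this slide. A hedged sketch of its sampling step, assuming Y(w) is an estimate of the block statistics of w obtained from a fixed number of random blocks (so that the cost does not depend on n); the comparison of Y(w) with the polytopes of H is not modelled, and `estimate_ustat` and `l1_distance` are hypothetical names.

```python
import random
from collections import Counter


def estimate_ustat(w, k, samples, rng):
    """Estimate block frequencies from `samples` uniformly random length-k blocks of w."""
    n_blocks = len(w) - k + 1
    counts = Counter()
    for _ in range(samples):
        i = rng.randrange(n_blocks)
        counts[w[i:i + k]] += 1
    return {b: c / samples for b, c in counts.items()}


def l1_distance(x, y):
    """L1 distance between two sparse frequency vectors."""
    keys = set(x) | set(y)
    return sum(abs(x.get(b, 0.0) - y.get(b, 0.0)) for b in keys)


rng = random.Random(0)          # fixed seed so the run is reproducible
w = "01" * 500                  # 999 blocks of length 2: "01" x 500, "10" x 499
Y = estimate_ustat(w, 2, samples=2000, rng=rng)
exact = {"01": 500 / 999, "10": 499 / 999}
print(l1_distance(Y, exact) < 0.2)
```

A tester in this style would accept when the estimate lies close to H and reject otherwise; the number of samples depends on ε and k but not on the length of w.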
3. Streaming Data • The goal: decide if a given XML file is ε-valid for a DTD • Our work: propose an algorithm that computes a statistics matrix sustat(t), which approximates the matrix ustat(t), using constant space
Data structure for Streaming Data • Stream: <a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a> • [Figure: the tree encoded by the stream (nodes a to k) and the data structure built from it]
Unbounded data structures • Stream: <a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a> • [Figure: the same tree (nodes a to k) stored in unbounded data structures]
Bounded data structure • Suppose the length of the queues is limited to 4 (a constant) • Some of the matrix entries will be missing • [Figure: the tree (nodes a to k) stored in the bounded queues]
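The effect of bounding the memory can be illustrated on the stream from the previous slides. The sketch below is a simplification, not the slides' data structure: it uses a single bounded stack of ancestor labels instead of queues, counts only parent-child label pairs, and has no recovery step. The function name and the `max_len` parameter are hypothetical.

```python
import re
from collections import Counter, deque


def parent_child_counts(stream, max_len=None):
    """Count (parent, child) label pairs from a tag stream.

    max_len bounds the number of remembered ancestors (None = unbounded).
    Returns (counts, missed), where missed is the number of pairs lost
    because the parent label had already been dropped from memory.
    """
    stack = deque(maxlen=max_len)   # labels of the innermost open elements
    depth = 0                       # true nesting depth, kept as a plain counter
    counts, missed = Counter(), 0
    for closing, label in re.findall(r"<(/?)(\w+)>", stream):
        if closing:
            if stack:
                stack.pop()
            depth -= 1
        else:
            if stack:
                counts[(stack[-1], label)] += 1
            elif depth > 0:
                missed += 1         # the parent exists but was forgotten
            stack.append(label)     # a full deque silently drops its oldest entry
            depth += 1
    return counts, missed


stream = ("<a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d>"
          "<e></e><f></f></a>")
full, _ = parent_child_counts(stream)
bounded, missed = parent_child_counts(stream, max_len=2)
print(sum(full.values()), sum(bounded.values()), missed)  # 10 7 3
```

With only two remembered ancestors, the pairs a-d, a-e and a-f are lost because the label a has been dropped by the time those children arrive. With a limit of 4 this toy stack loses nothing on this particular stream, since the tree has depth 4.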
Streaming algorithm • Definition: a k-fork is a node with 2 distinct paths of length more than 2k. • Streaming algorithm: • Input: on an opening tag <a>: bounded push, update sustat(t); on a closing tag </a>: pop, recover, update sustat(t) • Output: matrix sustat(t) • Example with k=3: entries missed: b-f-d…; entries recovered: d-c-d…
Streaming algorithm • Key Lemma: #forks • Theorem: sustat(t) approximates ustat(t) if Memory = 2*k,
Approximate validity on streaming data • Streaming test (Memory = 2*k) • [Figure: diagram relating Y(t), ustat(t), sustat(t) and the DTD]
Results: http://www.up2.fr/xmlstream/ • Gstat(t) • XML file source: XMark, http://monetdb.cwi.nl/xml/
Results: http://www.up2.fr/xmlstream/ • Lstat(t) • XML file source: XMark, http://monetdb.cwi.nl/xml/
Conclusion • Statistics of trees: • Generalization of a k-gram • Easy to compute on a DOM • Approximate statistics on Streaming Data • Approximate validity • Data Exchange • Data Integration