
Approximate Validity of XML Streaming Data

Approximate Validity of XML Streaming Data. HUANG Cheng, LI Jun, University Paris-Sud & Huazhong University of Science and Technology; Michel DE ROUGEMONT, University Paris II. Motivation: streaming data from different sources; approximate decisions; correct; robust.


Presentation Transcript


  1. Approximate Validity of XML Streaming Data • HUANG Cheng, LI Jun, University Paris-Sud & Huazhong University of Science and Technology • Michel DE ROUGEMONT, University Paris II

  2. Motivation • Streaming Data from different sources • Approximate decisions • Correct • Robust • Statistics based computations

  3. Plan • Generalized Statistics on Trees • Statistics allow approximate validity on words and trees, based on Property Testing (Edit Distance with Moves) • Property testing for regular tree languages (ICALP 2004) • Approximate Satisfiability and Equivalence (LICS 06) • Approximate validity on Streaming Data

  4. Edit Distances with Moves • Classical Edit Distance: Insertions, Deletions, Modifications • Edit Distance with Moves: a whole block can be moved in one step, e.g. 0111000011110011001 becomes 0111011110000011001 by moving the block 000 • Edit Distance with Moves generalizes to Ordered Trees
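The move operation on the slide's two words can be checked with a few lines of Python. This is only an illustrative sketch: the helper `move_block` and its index convention are assumptions made here, not part of the presentation, and it applies a single move rather than computing the distance itself.

```python
def move_block(word, start, end, dest):
    """Cut word[start:end] out and re-insert it at index `dest`
    of the remaining word (hypothetical helper, for illustration)."""
    block = word[start:end]
    rest = word[:start] + word[end:]
    return rest[:dest] + block + rest[dest:]


source = "0111000011110011001"
target = "0111011110000011001"

# Moving the block "000" (indices 5..7) behind the run "1111"
# turns the first word of the slide into the second one.
moved = move_block(source, 5, 8, 9)
print(moved == target)
```

With classical operations only, the same transformation needs several insertions and deletions, which is why a move is counted as one step.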

  5. Statistics on words (k-gram) • A word W of length n has n-k+1 blocks of length k = 1/ε • For W = 001010101110 and k = 2, n-k+1 = 11
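The k-gram statistics of the example word can be computed directly. The sketch below assumes that the statistics vector is the frequency of each block of length k among the n-k+1 blocks; the function name `ustat` is borrowed from later slides, and the binary alphabet default is an assumption.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def ustat(word, k, alphabet="01"):
    """Frequency of every k-gram over `alphabet` among the
    n-k+1 blocks of `word` (assumed reading of the slide)."""
    blocks = len(word) - k + 1
    counts = Counter(word[i:i + k] for i in range(blocks))
    grams = ("".join(g) for g in product(alphabet, repeat=k))
    return {g: Fraction(counts[g], blocks) for g in grams}


stats = ustat("001010101110", 2)
for gram, freq in stats.items():
    print(gram, freq)
# 00 1/11
# 01 4/11
# 10 4/11
# 11 2/11
```

Exact fractions are used so that the vector visibly sums to 1 over the 11 blocks.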

  6. Statistics for unranked ordered trees • Transformation: Rabin Encoding • [Figure: an unranked tree and the corresponding extended 2-ranked tree]

  7. Statistics on Trees: generalized k-gram • We abbreviate “author” by a, “db” by b, “work” by w • Types of sub-paths: 00, 01, 10, 11 • [Figure: example tree with nodes labelled a, b, w]

  8. Statistics on Trees: generalized k-gram type

  9. 2. Approximate validity based on Property Testing • Let F be a property on a class K of structures U • An ε-tester for F is a probabilistic algorithm A such that: if U |= F, A accepts; if U is ε-far from F, A rejects with high probability • A property F is testable if there exists a probabilistic algorithm A such that: for all ε, A is an ε-tester for F; Time(A) is independent of n = |U| • R. Rubinfeld, M. Sudan, Robust characterizations of polynomials, 1994 • O. Goldreich, S. Goldwasser and D. Ron, Property Testing and its connection to Learning and Approximation, 1996 • A tester usually implies a linear time corrector • (ε1, ε2)-Tolerant Tester
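The two conditions of an ε-tester can be written out as below. This is a sketch under assumptions the slide does not state: "with high probability" is taken as the conventional constant 2/3, and "ε-far" is read as a distance dist (here, the edit distance with moves) larger than εn from every structure of K that satisfies F, with n = |U|.

```latex
\begin{align*}
U \models F \;&\Longrightarrow\; A \text{ accepts } U \\
\mathrm{dist}(U, U') > \varepsilon\, n \ \text{ for all } U' \in K \text{ with } U' \models F
  \;&\Longrightarrow\; \Pr[A \text{ rejects } U] \ge 2/3
\end{align*}
```

Inputs that are neither in F nor ε-far from it are not constrained, which is what allows a running time independent of n.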

  10. Regular membership on words • H = {ustat(w) : w in r} is a union of polytopes • [Figure: 2 polytopes for r, and Y(w)] • Membership Tester:
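A toy version of such a membership tester is sketched below. Everything specific in it is an assumption made for illustration: the statistics are estimated from randomly sampled blocks (named `y` by analogy with Y(w), whose exact definition the transcript does not give), the set H is replaced by a single limit point for the expression (01)*, and closeness is measured with the L1 norm. A real tester would check the distance to the whole union of polytopes.

```python
import random
from collections import Counter
from itertools import product


def sampled_ustat(word, k, samples, rng, alphabet="01"):
    """Estimate k-gram frequencies from `samples` random blocks,
    so the work does not grow with len(word)."""
    blocks = len(word) - k + 1
    counts = Counter()
    for _ in range(samples):
        i = rng.randrange(blocks)
        counts[word[i:i + k]] += 1
    grams = ("".join(g) for g in product(alphabet, repeat=k))
    return {g: counts[g] / samples for g in grams}


def l1_distance(p, q):
    return sum(abs(p[g] - q[g]) for g in p)


def toy_tester(word, target, k, eps, samples=2000, seed=0):
    """Accept when the sampled statistics are eps-close to `target`
    (a stand-in for the set H; hypothetical simplification)."""
    y = sampled_ustat(word, k, samples, random.Random(seed))
    return l1_distance(y, target) <= eps


# Limit statistics of long words in (01)* for k = 2 (assumed target).
TARGET_01_STAR = {"00": 0.0, "01": 0.5, "10": 0.5, "11": 0.0}

print(toy_tester("01" * 5000, TARGET_01_STAR, k=2, eps=0.2))
print(toy_tester("0" * 10000, TARGET_01_STAR, k=2, eps=0.2))
```

The number of samples depends only on k and ε, not on the length of the word, which mirrors the testability requirement that Time(A) be independent of n.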

  11. 3. Streaming Data • The goal: decide if a given XML file is ε-valid for a DTD • Our work: propose an algorithm that computes, using constant space, a statistics matrix sustat(t) which approximates the matrix ustat(t)

  12. Data structure for Streaming Data • Stream: <a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a> • [Figure: the tree encoded by the stream and the corresponding data structure]

  13. Unbounded data structures • Stream: <a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a> • [Figure: the tree encoded by the stream and the unbounded data structure]

  14. Bounded data structure • Suppose the length of the queues is limited to a constant, here 4 • Some of the matrix entries will be missing • [Figure: the bounded data structure for the example tree]

  15. Streaming algorithm • Definition: a k-fork is a node with 2 distinct paths of length more than 2k • Streaming algorithm: • Input: on an opening tag <a>: bounded push, update sustat(t); on a closing tag </a>: pop, recover, update sustat(t) • Output: the matrix sustat(t) • Example with k=3: Entries missed: b-f-d… Entries recovered: d-c-d…
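The effect of a bounded push can be seen on the example stream from the earlier slides. The sketch below is a deliberate simplification and not the authors' algorithm: it keeps one bounded stack of open labels, counts only vertical label paths of k nodes, and does not model the "recover" step. The names `stream_path_counts` and `memory`, and the regular expression for tags, are assumptions.

```python
import re
from collections import Counter, deque

TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>")


def stream_path_counts(stream, k, memory=None):
    """Count ancestor paths of k labels ending at each opened node,
    keeping at most `memory` open labels (None means unbounded)."""
    stack = deque(maxlen=memory)  # a full deque drops its oldest label
    depth = 0                     # true depth, tracked with one integer
    counts, missed = Counter(), 0
    for match in TAG.finditer(stream):
        closing, label = match.group(1) == "/", match.group(2)
        if closing:
            if stack:
                stack.pop()
            depth -= 1
            continue
        stack.append(label)       # bounded push
        depth += 1
        if depth >= k:
            if len(stack) >= k:
                counts[tuple(list(stack)[-k:])] += 1
            else:
                missed += 1       # ancestors were dropped earlier
    return counts, missed


STREAM = ("<a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d>"
          "<e></e><f></f></a>")

full, _ = stream_path_counts(STREAM, k=2)
bounded, missed = stream_path_counts(STREAM, k=2, memory=2)
print(sum(full.values()), sum(bounded.values()), missed)  # 10 7 3
```

With `memory=2`, the label a is pushed out while the subtree of c is read, so the paths a-d, a-e and a-f are missed afterwards. Recovering such entries is what the pop/recover step of the slide is for.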

  16. Streaming algorithm • Key Lemma: #forks • Theorem: sustat(t) approximates ustat(t) if Memory = 2*k

  17. Approximate validity on streaming data • Streaming test (Memory = 2*k) • [Diagram relating Y(t), ustat(t), sustat(t) and the DTD]

  18. Results: http://www.up2.fr/xmlstream/ • Gstat(t) • XML file source: Xmark, http://monetdb.cwi.nl/xml/

  19. Results: http://www.up2.fr/xmlstream/ • Lstat(t) • XML file source: Xmark, http://monetdb.cwi.nl/xml/

  20. Conclusion • Statistics of trees: • Generalization of a k-gram • Easy to compute on a DOM • Approximate statistics on Streaming Data • Approximate validity • Data Exchange • Data Integration
