
Efficient Maintenance of Semistructured Schema

Efficient Maintenance of Semistructured Schema. Katsaros Dimitrios, Aristotle University of Thessaloniki, Hellas. Introduction (1/3). Semistructured data. Sources: HTML, BibTeX, SGML, etc. Characteristics: no rigid structure, but some implicit structure, i.e., “schema”


Presentation Transcript


  1. Efficient Maintenance of Semistructured Schema Katsaros Dimitrios Aristotle University of Thessaloniki Hellas

  2. Introduction (1/3) • Semistructured data • Sources: HTML, BibTeX, SGML, etc. • Characteristics: no rigid structure, but some implicit structure, i.e., “schema” • Knowledge of the “schema” is crucial for: • Querying/browsing information sources • Building indexes/views • Storage in relational/object-oriented databases • Query processing

  3. Introduction (2/3) Figure 1: Semistructured “movie” objects (OEM graph: three Movie objects &1, &2, &3 carrying differing subsets of the labels Title, Director, Review, Award, Name, Nationality, Biography)

  4. Introduction (3/3) • Discovering the common “schema” • Challenges: large volume and irregularity of the data • Solution: mining the “schema” • Scalable, and can deal with irregularity • Association-rule approach proposed by Wang & Liu [6] • Open issue: how to deal with dynamic data?

  5. Motivation Our contributions • Maintenance of the discovered schema under insertions of new objects • Schema for the new objects. • Performance evaluation of the method.

  6. Presentation Outline • Problem definition • Algorithm’s description • Performance evaluation • Conclusion • References

  7. Object Exchange Model • An Object Exchange Model (OEM) object has: • Identifier o (written &o) • Value, which is either • Atomic (integer, float, string), or • Complex • List: ⟨l1:&o1, l2:&o2, …, lk:&ok⟩ • Bag: {l1:&o1, l2:&o2, …, lk:&ok} where the li are labels (“roles”), ? denotes the wild card matching any label, and ⊥ is the nil structure that contains no label
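
The OEM object described on this slide could be modelled roughly as follows. This is only a minimal sketch: the class name `OEMObject`, its fields, the identifiers `&4` to `&7` and the placeholder values are all assumptions made for illustration and do not come from the presentation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Atomic = Union[int, float, str]


@dataclass
class OEMObject:
    """Hypothetical container for one OEM object (names are assumptions)."""
    oid: str                                   # identifier, e.g. "&1"
    atomic: Optional[Atomic] = None            # set only for atomic objects
    # Complex value: (label, child object) pairs, i.e. l1:&o1, ..., lk:&ok
    children: List[Tuple[str, "OEMObject"]] = field(default_factory=list)
    ordered: bool = True                       # True = list, False = bag

    def is_atomic(self) -> bool:
        return not self.children

    def labels(self) -> List[str]:
        return [label for label, _ in self.children]


# A small "movie" object in the spirit of Figure 1 (placeholder values).
title = OEMObject("&4", atomic="placeholder title")
name = OEMObject("&6", atomic="placeholder name")
nationality = OEMObject("&7", atomic="placeholder nationality")
director = OEMObject("&5", children=[("Name", name), ("Nationality", nationality)])
movie = OEMObject("&1", children=[("Title", title), ("Director", director)])

print(movie.labels())
```

Keeping the labels on the edges (the pairs in `children`) rather than on the objects mirrors the `l:&o` notation, where the same object may be reached under different roles.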

  8. Tree-Expressions Definition • The nil structure is a tree-expression • Let tei be tree-expressions of objects oi. If val(o) = ⟨l1:&o1, l2:&o2, …, lk:&ok⟩ and i1, i2, …, ir is a subsequence of 1, 2, …, k, then ⟨li1:tei1, li2:tei2, …, lir:teir⟩ is a tree-expression of object o. Representation • A tree-expression ⟨li1:tei1, li2:tei2, …, lir:teir⟩ consists of r subtrees teij, each labeled lij.
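
The recursive definition above can be sketched as a membership test. The encoding is an assumption made for this example only: a complex value is a Python list of `(label, value)` pairs, a tree-expression is a tuple of `(label, sub_expression)` pairs, the nil structure is the empty tuple and is taken to match any object, and only list-valued (ordered) objects are handled. The function name `is_tree_expression_of` is hypothetical.

```python
NIL = ()  # assumed encoding of the nil structure


def is_tree_expression_of(te, value):
    """Return True if `te` is a tree-expression of the object value `value`."""
    if te == NIL:
        return True                 # assumption: nil matches every object
    if not isinstance(value, list):
        return False                # an atomic value has no labelled children
    pos = 0
    for label, sub_te in te:
        # Scan forward for the next child that matches this labelled subtree;
        # never moving backwards enforces the subsequence condition.
        while pos < len(value):
            child_label, child_value = value[pos]
            pos += 1
            if label in ("?", child_label) and is_tree_expression_of(sub_te, child_value):
                break
        else:
            return False            # ran out of children without a match
    return True


# Placeholder "movie" value with labels taken from Figure 1.
movie = [
    ("Title", "placeholder title"),
    ("Director", [("Name", "placeholder name"),
                  ("Nationality", "placeholder nationality")]),
]

in_order = (("Title", NIL), ("Director", (("Name", NIL),)))
out_of_order = (("Director", NIL), ("Title", NIL))
with_wildcard = (("?", (("Nationality", NIL),)),)

print(is_tree_expression_of(in_order, movie))
print(is_tree_expression_of(out_of_order, movie))
print(is_tree_expression_of(with_wildcard, movie))
```

Matching each subtree to the earliest admissible child is sufficient here, because choosing an earlier position never reduces the children still available for the remaining subtrees.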

  9. Incremental Schema Mining Problem definition Input • A collection of transaction objects in an OEM graph, denoted as DB • A minimum support threshold MINSUP • The frequent tree expressions for DB • A number of new objects added into the collection, denoted as db The incremental schema maintenance problem is to discover all tree expressions whose support in DB ∪ db is greater than or equal to MINSUP.
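
The condition in this problem statement can be illustrated with a small arithmetic sketch. It assumes MINSUP is given as a fraction of the number of transaction objects (the slide does not say whether it is a fraction or an absolute count), and the function name and all numbers below are hypothetical.

```python
def is_frequent_after_update(count_DB, size_DB, count_db, size_db, minsup):
    """Check support in DB ∪ db from the per-part occurrence counts.

    count_DB / count_db: number of objects in DB / db that the tree
    expression is a tree-expression of; minsup: assumed to be a fraction.
    """
    total_count = count_DB + count_db
    total_size = size_DB + size_db
    return total_count >= minsup * total_size


# Hypothetical counts: the expression is infrequent in DB alone ...
print(is_frequent_after_update(120, 3000, 0, 0, 0.05))
# ... but becomes frequent once the new objects in db are added.
print(is_frequent_after_update(120, 3000, 90, 1000, 0.05))
```

Because the counts simply add up, the count over DB can be kept from the earlier mining run, and only db has to be scanned for expressions whose old count is already known.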

  10. DeltaSSD • DeltaSSD utilizes Negative Borders Definition [Negative Border] Given a collection S ⊆ P(R) of tree expressions, closed with respect to the “weaker than” relation [6], the negative border Bd⁻ of S consists of the minimal tree expressions X ⊆ R not in S.
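
To make the negative border concrete, the sketch below computes it for a deliberately simplified setting: plain sets of labels ordered by set inclusion stand in for tree expressions ordered by “weaker than”. This substitution, the function name `negative_border` and the example collection are assumptions for illustration; the code is not the DeltaSSD procedure itself.

```python
from itertools import combinations


def negative_border(R, S):
    """Minimal subsets of R that are not in the downward-closed collection S."""
    closed = {frozenset(x) for x in S}
    closed.add(frozenset())          # the empty set belongs to any closed S
    border = set()
    for k in range(1, len(R) + 1):
        for combo in combinations(sorted(R), k):
            X = frozenset(combo)
            if X in closed:
                continue
            # X is minimal iff every subset obtained by dropping one
            # element is in S (S is closed, so smaller subsets follow).
            if all(X - {a} in closed for a in X):
                border.add(X)
    return border


R = {"Title", "Director", "Review", "Award"}
S = [
    {"Title"}, {"Director"}, {"Review"},
    {"Title", "Director"}, {"Title", "Review"},
]

for X in sorted(negative_border(R, S), key=sorted):
    print(sorted(X))
```

In this example the border is {Award} and {Director, Review}: each lies outside S, yet every proper subset is inside S. The enumeration is exponential in |R| and is meant only to illustrate the definition.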

  11. DeltaSSD (notation)

  12. DeltaSSD

  13. Experimental settings Generation of synthetic data • One dataset: • (L1, N1) = (25, 1000) • (L2, N2, T2, I2, P2) = (25, 1000, 4, 2, 50) • (N3, T3, I3, P3) = (3000, 4, 2, 50) • Relatively small database: 3000 objects. • Short and “bushy” transactions (thus, few database scans).

  14. Performance Evaluation Database scans

  15. Performance Evaluation Operations (CPU time)

  16. Conclusions • DeltaSSD is very efficient in terms of database scans • DeltaSSD incurs excessive processing in terms of tree matchings • Re-computing the frequent tree-expressions is inefficient • Future work includes: • Investigation of the complete closure approach • Techniques to reduce the processing cost of tree matching

  17. References
      [1] Y. Aumann, R. Feldman, O. Lipshtat and H. Mannila, "Borders: An Efficient Algorithm for Association Generation in Dynamic Databases", Journal of Intelligent Information Systems, vol. 12, no. 1, pp. 61-73, 1999.
      [2] R. Feldman, Y. Aumann, A. Amir and H. Mannila, "Efficient algorithms for discovering frequent sets in incremental databases", Proceedings of the ACM Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'97), 1997.
      [3] H. Mannila and H. Toivonen, "Levelwise Search and Borders of Theories in Knowledge Discovery", Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 241-258, 1997.
      [4] V. Pudi and J. Haritsa, "Quantifying the utility of the past in mining large databases", Information Systems, vol. 25, no. 5, pp. 323-343, 2000.
      [5] S. Thomas, S. Bodagala, K. Alsabti and S. Ranka, "An efficient algorithm for the incremental updation of association rules in large databases", Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'97), pp. 263-266, 1997.
      [6] K. Wang and H. Liu, "Discovering Structural Association of Semistructured Data", IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 353-371, 2000.
      [7] A. Zhou, Jinwen, S. Zhou and Z. Tian, "Incremental Mining of Schema for Semistructured Data", Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), pp. 159-168, 1999.
