Enhanced Clustering for XML Schema Analysis

XML clustering methods Sohn Jong-Soo mis026@korea.ac.kr Intelligent Information System Lab. Korea Univ. 2007.11.06

0. Index • Introduction • XML and XML schema • Relational vs. XML • Paper overview • My works

1. Introduction • XML • It has become a standard for information exchange and retrieval • With the continuous growth in the XML data • The ability to manage massive collections of XML data and to discover knowledge from them becomes essential • For web based information system • Clustering method • Database objects, text data, multimedia data • XML data is different • Semi-structured • Hierarchical

XML-1 XML-13 XML-1234 XML-2 XSLT ( DOM,SAX) XSLT ( DOM,SAX) XML-24 XML-3 XML Content XML file Structure XML schema, DTD Style XLS, CSS XML-4 2. XML and XML schema • XML • XML document • XML schema • Can be obtained separately without scanning the whole document • Style sheet • XLS, CSS

2. XML and XML schema • XML documents have elements and attributes • Elements (indicated by begin & end tags) • can be nested but cannot interleave each other • can have arbitrary number of sub-elements • can have free text as values <chap title=“Introduction To XML”> some free text <sect title= “What is XML?”> … </sect> <secttitle = “Elements”> … </sect> <secttitle = “Why XML?”> … </sect> … possibly more free text </chap> attribute end element begin element Elements w/ same name can be nested

chap sect sect sect sect sect sect 2. XML and XML schema • Database Side: XML is a new way to organize data • Relational databases organize data in tables • XML documents organize data in ordered trees • Document Side: XML is a semantic markup language • HTML focuses on presentation • XML focuses on semantics/structure in the data <html> <h1> Chapter 1… </h1> some free text <h2> Section 1… </h2> some more free text <h3> Section 1.1 </h3> </html>

3. Relational vs. XML • Relational data are well organized – fully structured (more strict): • E-R modeling to model the data structures in the application; • E-R diagram is converted to relational tables and integrity constraints (relational schemas) • XML data are semi-structured (more flexible): • Schemas may be unfixed, or unknown (flexible – anyone can author a document) • Suitable for data integration (data on the web, data exchange between different enterprises).

3. Relational vs. XML • XML is not meant to replace relational database systems • RDBMSs are well suited to OLTP applications • (e.g., electronic banking) • which has 1000+ small transactions per minute. • XML is suitable data exchange over heterogeneous data sources • (e.g., Web services) • that allow them to “talk”.

3. Relational vs. XML • Advantages of using XML • Manage large volume of XML data • Provide high-level declarative language • Efficiently evaluate complex queries • XML Data Management Issues: • XML Data Model • XML Query Languages • XML Query Processing, Optimization and Classification • I have interest in this branch !

4. Paper overview • XML schema clustering with semantic and hierarchical similarity measures • This paper presents a XML schema clustering process • By organising the heterogeneous XML schemasinto various groups • Combining the semantic and syntactic relationships • To calculate the linguistic similarity bet. Two elements • Considering the ancestor-child relationship • Generalizing a suitable schema class hierarchy • Using Xmine methodology

4. Paper overview • Evaluating Structural Similarity in XML Documents • Develop a dynamic programmingalgorithm • to find this distance for any pairof documents • It define a new method forcomputing the distance • between any two XML documents interms of their structure • The lower this distance • the more similar the twodocuments are in terms of structure • the more likely • they are to have been createdfrom the same DTD

4. Paper overview • A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications • This paper proposes a matching algorithm for measuring the structural similarity • between an XML document and a DTD • The matching algorithm • by comparing the document structure against the one the DTD requires • is able to identify commonalities and differences

4. Paper overview • This paper focused on five applications of the algorithm: (1) the classification of XML documents against a set of DTDs (2) the generation of a new schema • for a DTD by extracting structural information during the classification of XML documents; (3) the development of an XML-based search engine • able to answer approximate structural queries (4) the selective dissemination of XML documents (5) the protection of the contents of documents classified • against a set of DTDs of a database, by propagating the authorization policies specified at DTD level

4. Paper overview • Schema Matching for Transforming Structured Documents • Understanding the matching problem in the context of structured document transformations • And developing matching methods those output serves as the basis for the automatic generation of transformation scripts • Four basic matching process • linguistic matching • datatype compatibility • Designer type hierarchy • structural matching

5. My works • XML data classification • Using a XML schema and its XML files • ID3 Algorithm • By classification tool on XML data • It will contribute to XML data preprocessing for datamining • Problems • XML has hierarchical data type • It can’t present like a table • Insufficient of sample data

References • E. Bertino, G. Guerrini, M. Mesiti, A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications, Information Systems 29 (1) (2004) 23–46. • A. Boukottaya, C. Vanoirbeek, 2005, November 02–04, Schema matching for transforming structured documents. Paper presented at the The 2005 ACM Symposium on Document engineering, Bristol, United Kingdom. • A. Doan, R. Domingos, A.Y. Halevy, 2001, Reconciling schemas of disparate sources: a machine-learning approach. Paper presented at the ACM SIGMOD, Santa Barbara, California, United States. • S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Fast detection of XML structural similarities, IEEE Transaction on Knowledge and Data Engineering 7 (2) (2005) 160–175. • R. Nayak, S. Xu, XCLS: a fast and effective clustering algorithm for heterogenous XML documents. Paper presented at the The 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 2006.

References • A. Nierman, H.V. Jagadish, 2002, December, Evaluating structural similarity in XML documents. Paper presented at the fifth International Conference on Computational Science (ICCS’05), Wisconsin, USA. • Richi Nayak, Wina Iryadi 2006, XML schema clustering with semantics and hierarchical similarity measures. • http://www.w3c.org/xml

Enhanced Clustering for XML Schema Analysis

Enhanced Clustering for XML Schema Analysis

Presentation Transcript

Clustering Methods

Clustering Methods for Class Discovery

An Overview of Clustering Methods

Clustering Methods

Clustering methods used in microarray data analysis

Chapter 8: Classification and Clustering Methods

Clustering Methods

Clustering: Partition Clustering

4. Clustering Methods

Density-Based and other Clustering Methods

Data Clustering Methods

Greedy clustering methods

Datamining_3 Clustering Methods

Clustering methods Course code: 175314

Spatial Clustering Methods

Learning Element Similarity for XML Document Clustering

Clustering Methods

XML clustering methods

Clustering methods

COMP3503 Automated Discovery and Clustering Methods

4. Clustering Methods