170 likes | 189 Views
XML clustering methods. Sohn Jong-Soo mis026@korea.ac.kr Intelligent Information System Lab. Korea Univ. 2007.11.06. 0. Index. Introduction XML and XML schema Relational vs. XML Paper overview My works. 1. Introduction. XML
E N D
XML clustering methods Sohn Jong-Soo mis026@korea.ac.kr Intelligent Information System Lab. Korea Univ. 2007.11.06
0. Index • Introduction • XML and XML schema • Relational vs. XML • Paper overview • My works
1. Introduction • XML • It has become a standard for information exchange and retrieval • With the continuous growth in the XML data • The ability to manage massive collections of XML data and to discover knowledge from them becomes essential • For web based information system • Clustering method • Database objects, text data, multimedia data • XML data is different • Semi-structured • Hierarchical
XML-1 XML-13 XML-1234 XML-2 XSLT ( DOM,SAX) XSLT ( DOM,SAX) XML-24 XML-3 XML Content XML file Structure XML schema, DTD Style XLS, CSS XML-4 2. XML and XML schema • XML • XML document • XML schema • Can be obtained separately without scanning the whole document • Style sheet • XLS, CSS
2. XML and XML schema • XML documents have elements and attributes • Elements (indicated by begin & end tags) • can be nested but cannot interleave each other • can have arbitrary number of sub-elements • can have free text as values <chap title=“Introduction To XML”> some free text <sect title= “What is XML?”> … </sect> <secttitle = “Elements”> … </sect> <secttitle = “Why XML?”> … </sect> … possibly more free text </chap> attribute end element begin element Elements w/ same name can be nested
chap sect sect sect sect sect sect 2. XML and XML schema • Database Side: XML is a new way to organize data • Relational databases organize data in tables • XML documents organize data in ordered trees • Document Side: XML is a semantic markup language • HTML focuses on presentation • XML focuses on semantics/structure in the data <html> <h1> Chapter 1… </h1> some free text <h2> Section 1… </h2> some more free text <h3> Section 1.1 </h3> </html>
3. Relational vs. XML • Relational data are well organized – fully structured (more strict): • E-R modeling to model the data structures in the application; • E-R diagram is converted to relational tables and integrity constraints (relational schemas) • XML data are semi-structured (more flexible): • Schemas may be unfixed, or unknown (flexible – anyone can author a document) • Suitable for data integration (data on the web, data exchange between different enterprises).
3. Relational vs. XML • XML is not meant to replace relational database systems • RDBMSs are well suited to OLTP applications • (e.g., electronic banking) • which has 1000+ small transactions per minute. • XML is suitable data exchange over heterogeneous data sources • (e.g., Web services) • that allow them to “talk”.
3. Relational vs. XML • Advantages of using XML • Manage large volume of XML data • Provide high-level declarative language • Efficiently evaluate complex queries • XML Data Management Issues: • XML Data Model • XML Query Languages • XML Query Processing, Optimization and Classification • I have interest in this branch !
4. Paper overview • XML schema clustering with semantic and hierarchical similarity measures • This paper presents a XML schema clustering process • By organising the heterogeneous XML schemasinto various groups • Combining the semantic and syntactic relationships • To calculate the linguistic similarity bet. Two elements • Considering the ancestor-child relationship • Generalizing a suitable schema class hierarchy • Using Xmine methodology
4. Paper overview • Evaluating Structural Similarity in XML Documents • Develop a dynamic programmingalgorithm • to find this distance for any pairof documents • It define a new method forcomputing the distance • between any two XML documents interms of their structure • The lower this distance • the more similar the twodocuments are in terms of structure • the more likely • they are to have been createdfrom the same DTD
4. Paper overview • A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications • This paper proposes a matching algorithm for measuring the structural similarity • between an XML document and a DTD • The matching algorithm • by comparing the document structure against the one the DTD requires • is able to identify commonalities and differences
4. Paper overview • This paper focused on five applications of the algorithm: (1) the classification of XML documents against a set of DTDs (2) the generation of a new schema • for a DTD by extracting structural information during the classification of XML documents; (3) the development of an XML-based search engine • able to answer approximate structural queries (4) the selective dissemination of XML documents (5) the protection of the contents of documents classified • against a set of DTDs of a database, by propagating the authorization policies specified at DTD level
4. Paper overview • Schema Matching for Transforming Structured Documents • Understanding the matching problem in the context of structured document transformations • And developing matching methods those output serves as the basis for the automatic generation of transformation scripts • Four basic matching process • linguistic matching • datatype compatibility • Designer type hierarchy • structural matching
5. My works • XML data classification • Using a XML schema and its XML files • ID3 Algorithm • By classification tool on XML data • It will contribute to XML data preprocessing for datamining • Problems • XML has hierarchical data type • It can’t present like a table • Insufficient of sample data
References • E. Bertino, G. Guerrini, M. Mesiti, A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications, Information Systems 29 (1) (2004) 23–46. • A. Boukottaya, C. Vanoirbeek, 2005, November 02–04, Schema matching for transforming structured documents. Paper presented at the The 2005 ACM Symposium on Document engineering, Bristol, United Kingdom. • A. Doan, R. Domingos, A.Y. Halevy, 2001, Reconciling schemas of disparate sources: a machine-learning approach. Paper presented at the ACM SIGMOD, Santa Barbara, California, United States. • S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Fast detection of XML structural similarities, IEEE Transaction on Knowledge and Data Engineering 7 (2) (2005) 160–175. • R. Nayak, S. Xu, XCLS: a fast and effective clustering algorithm for heterogenous XML documents. Paper presented at the The 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 2006.
References • A. Nierman, H.V. Jagadish, 2002, December, Evaluating structural similarity in XML documents. Paper presented at the fifth International Conference on Computational Science (ICCS’05), Wisconsin, USA. • Richi Nayak, Wina Iryadi 2006, XML schema clustering with semantics and hierarchical similarity measures. • http://www.w3c.org/xml