A Robust System Architecture For Mining Semi-structured Data
By Aby M Mathew
CSE 6331, 11/30/1999
Introduction
A versatile system architecture for text mining that differentiates between, and maintains, both the structured and the unstructured components of the data.
Motivation
• A digital library can contain a very large number of document concepts. Using SQL, it is possible to generate quantitative rules based on given criteria.
• What about rules related to a subset, such as which journals publish articles associated with an area of interest?
Presentation Organization
• Overview of the IRIS system.
• Differences between structured and unstructured data.
• How the data is stored.
• Algorithm used for rule generation.
• Conclusion.
Overview of the IRIS system
[Diagram: GUI, Rule Generator, Concept Library, Database, IDM, Document Collection]
Brief Description Of Individual Components
• Rule Generator: parses the user request received via the GUI and determines an execution strategy.
• Database: contains the structured data, with mappings between tuples and documents.
• Concept library: maintains the unstructured data as concepts, with mappings between concepts and documents.
Brief Description Of Individual Components (contd.)
• IDM (Information Discovery Module): extracts concepts and structured values from a document collection, and updates the database and concept library.
Components of the Rule Generator
[Diagram: Parser, Optimizer, Processor]
• Parser: accepts data and reconditions it for the optimizer.
• Optimizer: uses the constraints and the rule type to generate an efficient execution plan.
• Processor: executes the plans laid out by the optimizer.
Components of the IDM
[Diagram: Discoverer, Extractor, Refresher]
• Discoverer: intelligent agent that determines domains.
• Extractor: populates the database and concept library, based on the domain knowledge.
• Refresher: helps maintain consistency of the database and concept library.
Differences between the two data types
• Structured data: certain features that form key entities, e.g., Author, Publisher, Date.
• Unstructured data: blocks of text that cannot be identified as structured, e.g., abstract headings, paragraphs.
How is the data stored?
• Structured data is stored using a relational schema that is mapped to a database.
• Unstructured data is stored in a compressed form using an extended concept hierarchy (ECH).
Extended Concept Hierarchy
• A hierarchical form of representing data.
• It is not always constrained to a tree structure.
• Relationships maintain additional links between the entities in the hierarchy.
Example University ECH
[Diagram: example ECH with the nodes University, Employees, Admin, Faculty, Provost, Dean, Full, Associate]
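A minimal Python sketch of how such a hierarchy could be held in memory, assuming it is modelled as a graph with labelled edges. The class name `ECH`, the edge labels `"child"` and `"related"`, and the specific links between the university concepts are illustrative assumptions; the slides do not show how IRIS actually implements this.

```python
from collections import defaultdict


class ECH:
    """Hypothetical extended concept hierarchy: a graph with labelled edges."""

    def __init__(self):
        # concept -> list of (edge_label, related_concept)
        self.edges = defaultdict(list)
        # concept -> set of ids of the documents in which the concept occurs
        self.documents = defaultdict(set)

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def add_occurrence(self, concept, doc_id):
        self.documents[concept].add(doc_id)

    def related(self, concept, labels):
        """Concepts one edge away from `concept` whose edge label is in `labels`."""
        return [target for (label, target) in self.edges[concept] if label in labels]


ech = ECH()
# Tree-like links (an assumed reading of the example diagram).
ech.add_edge("University", "child", "Employees")
ech.add_edge("Employees", "child", "Admin")
ech.add_edge("Employees", "child", "Faculty")
ech.add_edge("Admin", "child", "Provost")
ech.add_edge("Admin", "child", "Dean")
ech.add_edge("Faculty", "child", "Full")
ech.add_edge("Faculty", "child", "Associate")
# An extra non-tree link (assumed), the kind of edge a plain tree cannot hold.
ech.add_edge("Dean", "related", "Faculty")

print(ech.related("Employees", {"child"}))
```

Storing a set of document ids per concept keeps the concept-to-document mapping next to the hierarchy, which is what the rule generation steps need to look up.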
Calculation of minimum support (min sup) in ECH
If C1 and C2 are the two concepts found in the document, then

$$\text{min sup} = \frac{|\text{documents}(C_1) \cap \text{documents}(C_2)|}{|\text{documents}(C_1) \cup \text{documents}(C_2)|}$$

where documents(c) is the set of documents in which concept c occurs, and |·| is the number of documents in a set.
Example for calculating min sup
Say concept C1 appears in 500 documents and C2 appears in 600 documents, 100 of which also contain C1. The two concepts together cover 500 + 600 - 100 = 1000 distinct documents, so
min sup = 100 / 1000 = 0.1
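The same arithmetic can be checked with a short Python sketch, assuming each concept is represented by the set of ids of the documents that contain it. The function name `min_sup` and the document ids are made up for illustration; only the counts (500, 600, 100 shared) come from the example.

```python
def min_sup(docs_c1, docs_c2):
    """Documents containing both concepts, divided by documents containing either."""
    union = docs_c1 | docs_c2
    if not union:
        return 0.0  # neither concept occurs anywhere
    return len(docs_c1 & docs_c2) / len(union)


# Hypothetical ids chosen to reproduce the counts in the example.
docs_c1 = set(range(0, 500))     # 500 documents contain C1
docs_c2 = set(range(400, 1000))  # 600 documents contain C2, ids 400..499 are shared

print(min_sup(docs_c1, docs_c2))
```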
Algorithm used for rule generation
• Get the document ids of documents containing the structured data value, using SQL statements (set A).
• Get the document ids of documents containing the unstructured concept C1, using the ECH (set B).
• C = A ∩ B.
• Get the document ids of each concept Cr that is related to C1 via an edge of type P, C or S, provided the min sup of Cr and C1 is above the min sup threshold (set D).
• E = C ∩ D.
• confidence = (number of elements in E) / (number of elements in C).
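The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the IRIS code: the table `articles`, its columns, the journal and concept names, and the plain dictionaries standing in for the concept library are all hypothetical, and set D is assumed to be the union of the documents of every related concept that passes the threshold.

```python
import sqlite3


def min_sup(docs_a, docs_b):
    """Shared documents divided by documents containing either concept."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def rule_confidence(conn, journal, c1, concept_docs, related, threshold):
    # Set A: documents matching the structured value, fetched with SQL.
    rows = conn.execute("SELECT doc_id FROM articles WHERE journal = ?", (journal,))
    a = {row[0] for row in rows}
    # Set B: documents containing the unstructured concept C1.
    b = concept_docs.get(c1, set())
    c = a & b
    # Set D: documents of related concepts Cr whose min sup with C1 is above the threshold.
    d = set()
    for cr in related.get(c1, []):
        cr_docs = concept_docs.get(cr, set())
        if min_sup(cr_docs, b) > threshold:
            d |= cr_docs
    e = c & d
    return len(e) / len(c) if c else 0.0


# Hypothetical structured data: which journal each document appeared in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (doc_id INTEGER PRIMARY KEY, journal TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(1, "Journal A"), (2, "Journal A"), (3, "Journal A"), (4, "Journal B"), (5, "Journal A")],
)

# Hypothetical concept library: concept -> document ids, and concept -> related concepts.
concept_docs = {
    "data mining": {1, 2, 3, 4},
    "association rules": {2, 3, 4},
    "clustering": {5},
}
related = {"data mining": ["association rules", "clustering"]}

confidence = rule_confidence(conn, "Journal A", "data mining", concept_docs, related, 0.5)
print(round(confidence, 3))
```

With this toy data, C is {1, 2, 3}, only "association rules" clears the threshold so D is {2, 3, 4}, E is {2, 3}, and the confidence is 2/3.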
Advantages of Using this system
• Distinguishing between structured and unstructured data helps generate more interesting rules.
• Being domain specific improves accuracy.
• Scalable, since any database can be used as the database component.
• Only meaningful data is stored, giving a compact representation of each document.
Bibliography
• L. Singh, P. Scheuermann and B. Chen, "IRIS: Our prototype rule generation system", 1999.
• L. Singh, P. Scheuermann and B. Chen, "Generating Association Rules from Semi-structured Documents using an Extended Concept Hierarchy", 1999.