A Robust System Architecture For Mining Semi-structured Data

By Aby M Mathew, CSE 6331, 11/30/1999. A versatile system architecture for text mining that differentiates and maintains structured plus unstructured data components.


Presentation Transcript

  1. A Robust System Architecture For Mining Semi-structured Data. By Aby M Mathew, CSE 6331, 11/30/1999.

  2. Introduction
     • A versatile system architecture for text mining that distinguishes between structured and unstructured data components and maintains both.

  3. Motivation
     • A digital library can contain a very large number of document concepts. Using SQL, it is possible to generate quantitative rules based on given criteria.
     • What about rules related to a subset, such as which journals publish articles associated with an area of interest?

  4. Presentation Organization
     • Overview of the IRIS system.
     • Differences between structured and unstructured data.
     • How the data is stored.
     • Algorithm used for rule generation.
     • Conclusion.

  5. Overview of the IRIS system [Diagram: GUI, Rule Generator, Concept Library, Database, IDM, Document Collection]

  6. Brief Description Of Individual Components
     • Rule Generator - parses the user request received via the GUI and determines an execution strategy.
     • Database - contains the structured data, with mappings between tuples and documents.
     • Concept library - maintains the unstructured data as concepts, with mappings between concepts and documents.

  7. Brief Description Of Individual Components (contd.)
     • IDM (Information Discovery Module)
       • extracts concepts and structured values from a document collection
       • updates the database and concept library.

  8. Components of the Rule Generator [Diagram: parser, optimizer, processor]
     • Parser - accepts the input data and prepares it for the optimizer.
     • Optimizer - uses the constraints and the rule type to generate an efficient execution plan.
     • Processor - executes the plans laid out by the optimizer.

  9. Components of the IDM [Diagram: Discoverer, Extractor, Refresher]
     • Discoverer - an intelligent agent that determines domains.
     • Extractor - populates the database and concept library, based on the domain knowledge.
     • Refresher - helps maintain the consistency of the database and concept library.

  10. Differences between the two data types
     • Structured data type
       • Certain features that form key entities, e.g., author, publisher, date.
     • Unstructured data type
       • Blocks of text that cannot be identified as structured, e.g., abstract headings, paragraphs.

  11. How is the data stored?
     • Structured data is stored using a relational schema that is mapped to a database.
     • Unstructured data is stored in a compressed form using an ECH (extended concept hierarchy).
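
One way to picture the two stores is the minimal Python sketch below. The table name `documents`, its columns, the `concept_library` dictionary, and all sample rows are assumptions made up for illustration; the slides do not give the actual IRIS schema or the compressed ECH format.

```python
import sqlite3

# Structured side: a relational table keyed by document id (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id INTEGER PRIMARY KEY, author TEXT, publisher TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        (1, "Smith", "ACM", 1998),   # made-up sample rows
        (2, "Jones", "IEEE", 1999),
        (3, "Smith", "ACM", 1999),
    ],
)

# Unstructured side: each concept maps to the ids of the documents that mention it.
# A plain dict stands in for the concept library here; it is not compressed.
concept_library = {
    "data mining": {1, 3},
    "databases": {2, 3},
}


def docs_with_publisher(publisher):
    """Return the ids of documents whose structured 'publisher' value matches."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE publisher = ?", (publisher,)
    )
    return {row[0] for row in rows}


acm_docs = docs_with_publisher("ACM")
mining_docs = concept_library["data mining"]
```

Both lookups return sets of document ids, which is what lets the structured and unstructured sides be combined with ordinary set operations later.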

  12. Extended Concept Hierarchy
     • This is a hierarchical form of representing data. It is not always constrained to a tree structure: relationships maintain additional links between the entities in the hierarchy.

  13. Example: University ECH [Diagram nodes: Employees, Admin, Faculty, Provost, Dean, Full, Associate]
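
A hierarchy with extra, non-tree links could be sketched as below. The class `ConceptHierarchy` and its methods are hypothetical names. The parent/child layout (Employees over Admin and Faculty, Provost and Dean under Admin, Full and Associate under Faculty) is an assumption read off the order of the node labels on the slide, and the extra Dean-Faculty link is invented purely to show a relationship that breaks the tree shape.

```python
from collections import defaultdict


class ConceptHierarchy:
    """Hierarchy of concepts plus optional extra links between concepts."""

    def __init__(self):
        self.children = defaultdict(set)  # parent -> child concepts
        self.parents = defaultdict(set)   # child -> parent concepts
        self.links = defaultdict(set)     # extra, non-hierarchical relationships

    def add_child(self, parent, child):
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def add_link(self, a, b):
        # Symmetric link that does not have to follow the tree shape.
        self.links[a].add(b)
        self.links[b].add(a)

    def related(self, concept):
        """Concepts one edge away: parents, children, and linked concepts."""
        return self.parents[concept] | self.children[concept] | self.links[concept]


ech = ConceptHierarchy()
ech.add_child("Employees", "Admin")
ech.add_child("Employees", "Faculty")
ech.add_child("Admin", "Provost")
ech.add_child("Admin", "Dean")
ech.add_child("Faculty", "Full")
ech.add_child("Faculty", "Associate")
ech.add_link("Dean", "Faculty")  # hypothetical cross link, not from the slide

related_to_dean = ech.related("Dean")
```

With the cross link in place, `related_to_dean` contains both the parent `Admin` and the linked concept `Faculty`, which a strict tree could not express.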

  14. Calculation of minimum support (min sup) in ECH
     If C1 and C2 are the two concepts found in the document, then
     $$\text{min sup} = \frac{|\text{documents}(C_1) \cap \text{documents}(C_2)|}{|\text{documents}(C_1) \cup \text{documents}(C_2)|}$$
     where documents(c) is the set of documents in which concept c occurs and |·| is the number of documents in a set.

  15. Example for calculating min sup
     Say concept C1 appears in 500 documents and C2 appears in 600 documents, in 100 of which C1 also appears. The two concepts together cover 500 + 600 - 100 = 1000 documents, so min sup = 100 / 1000 = 0.1.
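
The calculation can be checked with a short Python sketch. It reads the ratio as shared documents divided by documents containing either concept, which is the reading that reproduces 100 / 1000 in the example. The function name `support` and the synthetic id ranges are assumptions for illustration.

```python
def support(docs_c1, docs_c2):
    """Shared documents divided by documents containing either concept."""
    union = docs_c1 | docs_c2
    if not union:
        return 0.0  # neither concept occurs anywhere
    return len(docs_c1 & docs_c2) / len(union)


# Synthetic ids: C1 in 500 documents, C2 in 600, with 100 in common.
docs_c1 = set(range(0, 500))
docs_c2 = set(range(400, 1000))

value = support(docs_c1, docs_c2)  # 100 shared / 1000 total
```

Working on sets of document ids rather than raw counts avoids counting the 100 shared documents twice in the denominator.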

  16. Algorithm used for rule generation
     • Get the document ids of documents containing the structured data value, using SQL statements (set ‘A’).
     • Get the document ids of documents containing the unstructured concept, using the ECH (set ‘B’).
     • C = A ∩ B.
     • Get the document ids of each concept Cr that is related to C1 via an edge P, C or S, provided the support of Cr and C1 is above min sup (set ‘D’).
     • E = C ∩ D.
     • confidence = (number of elements in E) / (number of elements in C).
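
The set arithmetic in these steps can be sketched in Python as follows. The sketch assumes that the operator combining the sets is intersection, since the confidence ratio only makes sense if E is a subset of C. The function name `rule_confidence` and the sample id sets are hypothetical, and the SQL and ECH lookups that would produce sets A, B and D are taken as given inputs.

```python
def rule_confidence(structured_docs, concept_docs, related_docs):
    """Compute rule confidence from three sets of document ids.

    structured_docs: set A, ids matching the structured value (from SQL)
    concept_docs:    set B, ids containing the unstructured concept (from the ECH)
    related_docs:    set D, ids of related concepts that passed the min sup check
    """
    c = structured_docs & concept_docs  # C: the value and the concept together
    if not c:
        return 0.0                      # nothing supports the rule
    e = c & related_docs                # E: members of C that are also in D
    return len(e) / len(c)


# Made-up document ids for illustration.
a = {1, 2, 3, 4, 5}
b = {2, 3, 4, 5, 6}
d = {3, 4, 9}

confidence = rule_confidence(a, b, d)
```

Here C is {2, 3, 4, 5} and E is {3, 4}, so the confidence is 2 / 4 = 0.5.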

  17. Advantages of Using this system
     • Distinguishing between structured and unstructured data helps generate more interesting rules.
     • Being domain specific improves accuracy.
     • The system is scalable, since any database can be used as the database component.
     • Only meaningful data is stored, giving a compact representation of each document.

  18. Bibliography
     • L. Singh, P. Scheuermann and B. Chen, “IRIS: Our prototype rule generation system”, 1999.
     • L. Singh, P. Scheuermann and B. Chen, “Generating Association Rules from Semi-structured Documents Using an Extended Concept Hierarchy”, 1999.
