Pattern Analysis & Machine Intelligence Research Group
UNIVERSITY OF WATERLOO
LORNET Theme 4
Data Mining and Knowledge Extraction for LO
TL: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco
Knowledge Extraction and LO Mining
GOAL:
• Develop data mining and knowledge extraction techniques and tools for learning object repositories.
• These tools can provide context and facilitate interaction, efficient organization, efficient delivery, navigation, and retrieval.
PAMI Research Group, University of Waterloo
Theme Overview
LO Mining
• From Text
  - Syntactic: Keyword, Keyphrase-based
  - Semantic: Concept-based
• From Images
  - Image Features, Shape Features
• From Text + Images
  - Describing Images with Text
  - Enriching Text with Images
Knowledge Extraction
• Classification (MCS, Data Partitioning, Imbalanced Classes)
• Clustering (Parallel/Distributed Clustering, Cluster Aggregation)
• LO Similarity and Ranking
• Association Rules / Social Networks
• Reinforcement Learning
Applications
• Specialized / Personalized Search
• Tagging and Organizing
• Matching and Ranking
Types of Data in LORNET
[Diagram: data sources in LORNET and the types of data each provides]
• LCMS (Course > Module > Lesson > LO): Subject Matter
  - Text, Images, Flash, Applets, Metadata, Interaction Logs
• Discussion Board (Board > Thread > Post): Discussions
  - Text, Interaction Logs
• TELOS (Semantic Layer): Resources
  - Metadata, Semantic References
• LOR (Record, Metadata): LO Descriptors
  - Metadata
LO Mining Scenarios
LO Mining and Knowledge Extraction
Projects Overview
Inputs: Text Documents, Images, Interaction Logs
• Information Extraction: analyzing content to extract relevant information
  - Keyword Extraction
  - Summarization
  - Concept Extraction
• Categorization: organizing LOs according to their content
  - Classification: Traditional, MCS, Imbalanced
  - Clustering: Traditional, Ensembles, Distributed
• Personalization: providing user-specific results
  - Social Network Analysis
  - Reinforcement Learning: Traditional, Opposition-based
• Image Mining: describing and finding relevant images
  - CBIR: Traditional, Fusion-based
• Integration and Applications
  - Software Components
  - Publications
  - Theme and Industry Collaboration
Legend: In Progress
Information Extraction: Summarization
LO Content Package Summarization
• Learning objects stored in IMS content packages are loaded and parsed. Textual content files are extracted for analysis.
• Statistical term weighting and sentence ranking are performed on each document and on the whole collection.
• The top-ranked sentences are extracted for each document.
• Planned functionality: summarization of whole modules or lessons (as opposed to single documents).
Benefits
• Provides a summarized overview of learning objects for quick browsing and access to learning material.
Scenarios
• Learning Management Systems can call the summarization component to produce summaries for content packages.
Data courtesy of the University of Saskatchewan.
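The term-weighting and sentence-ranking steps can be illustrated with a minimal Python sketch. It is not the Theme 4 component: it assumes raw term frequency as the weight, a naive regular-expression sentence splitter, and a single document rather than a whole collection. The names `summarize`, `split_sentences`, and `tokenize` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphabetic tokens only (an assumption, not the slide's tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, top_n=2):
    """Return the top_n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Term weight = raw frequency within this document (stand-in weighting).
    weights = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average term weight, so long sentences are not favored automatically.
        return sum(weights[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]


if __name__ == "__main__":
    doc = ("Clustering groups documents. "
           "Clustering of documents uses similarity. "
           "The weather is nice.")
    for sentence in summarize(doc, top_n=2):
        print(sentence)
```

Averaging the weights per token is one possible design choice; a collection-level weighting such as inverse document frequency would need the other documents in the content package.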
Information Extraction: Concept Extraction
• Concept-Based Statistical Analyser
• Conceptual Ontological Graph (COG)
• Ranking
Information Extraction: Keyword Extraction
Semantic Keyword Extraction
Tasks
• Develop tools and techniques to extract semantic keywords to facilitate metadata generation.
• Develop algorithms to enrich metadata (tags) for use in index-based multimedia retrieval.
Progress
• Proposed a new information-theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo-ontology).
• Makrehchi, M. and Kamel, ICDM07, WI 07
Information Extraction: Keyword Extraction
Rule-based Keyword Extraction
• Learn rules to find keywords in English sentences
• Rules represent sentence fragments
  - Specific enough for reliable keyword extraction
  - General enough to be applied to unseen sentences
• Rule generalization
  - Begin with an exact sentence fragment
  - Merge it with another by moving the differing words to the lowest common level in the part-of-speech hierarchy
  - Keep the merged rule if it does not reduce the precision and recall of keyword extraction; otherwise keep the original rules
• Keyword extraction
  - Find the sequence of rules that best covers an unseen sentence
  - Extract keywords according to the rules
• Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination
  - Learns 20 rules from the first 50 training rules
  - Learns 13 additional rules from the next 220 training rules
• Both precision and recall values increase during training
  - Precision (blue) increases 10%
  - Recall (red) shows slight upward trend
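The merge step of rule generalization can be sketched as follows. The part-of-speech hierarchy in `PARENT` is a toy example invented for illustration, and `merge_rules` and `matches` are hypothetical names; the precision/recall check that decides whether a merged rule is kept is not shown.

```python
# Hypothetical part-of-speech hierarchy, stored as child -> parent.
# Words are leaves; tags such as "noun" sit above them; "word" is the root.
PARENT = {
    "noun": "word", "verb": "word", "determiner": "word",
    "cat": "noun", "dog": "noun",
    "runs": "verb", "sleeps": "verb",
    "the": "determiner", "a": "determiner",
}


def ancestors(node):
    """Path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common(a, b):
    """Lowest node in the hierarchy that covers both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def merge_rules(rule_a, rule_b):
    """Merge two equal-length fragments; differing positions are generalized."""
    if len(rule_a) != len(rule_b):
        return None
    return [x if x == y else lowest_common(x, y)
            for x, y in zip(rule_a, rule_b)]


def matches(rule, fragment):
    """A rule element matches a word if it is the word or one of its ancestors."""
    return (len(rule) == len(fragment)
            and all(r in ancestors(w) for r, w in zip(rule, fragment)))


if __name__ == "__main__":
    merged = merge_rules(["the", "cat", "runs"], ["the", "dog", "sleeps"])
    print(merged)
    print(matches(merged, ["the", "dog", "runs"]))
```

In this sketch the merged rule covers a fragment that neither original rule matched exactly, which is the sense in which rules become general enough for unseen sentences.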
Categorization: Ensemble-based Clustering
Consensus Clustering
• Categorization of learning objects using the proposed consensus clustering algorithms.
• The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.
• Consensus clustering can offer several advantages over a single clustering: improved clustering accuracy, better scalability to large volumes of data objects, and greater robustness through reduced sensitivity to outliers and noisy attributes.
Tasks
• Develop techniques for producing ensembles of multiple clusterings in which diverse information about the structure of the data is likely to occur.
• Develop consensus algorithms to aggregate the individual clusterings.
• Develop solutions for the cluster symbolic-label matching problem.
• Perform empirical analysis on real-world data and validate the proposed method.
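One common way to aggregate an ensemble is a co-association matrix: count how often each pair of objects is placed in the same cluster, then group the pairs that agree in most clusterings. The sketch below uses that generic technique, not the consensus algorithms proposed by the group; the function names and the 0.5 threshold are assumptions. Because only co-membership is counted, cluster labels never need to be matched across clusterings.

```python
def co_association(labelings):
    """Fraction of clusterings in which objects i and j share a cluster."""
    n = len(labelings[0])
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m
             for j in range(n)]
            for i in range(n)]


def consensus(labelings, threshold=0.5):
    """Link objects whose co-association exceeds threshold; return components."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first traversal over the thresholded co-association graph.
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


if __name__ == "__main__":
    ensemble = [
        [0, 0, 1, 1],  # clustering 1
        [1, 1, 0, 0],  # clustering 2: same partition, labels swapped
        [0, 0, 0, 1],  # clustering 3: disagrees on the third object
    ]
    print(consensus(ensemble))
```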
Categorization using cluster ensemble
Categorization: Distributed Clustering
Hierarchical P2P Document Clustering
• Peer nodes are arranged into groups called “neighborhoods”.
• Multiple neighborhoods are formed at each level of the hierarchy.
• The size of each neighborhood is determined by a network partitioning factor.
• Each neighborhood has a designated supernode.
• Supernodes of level h form the neighborhoods for level h+1.
• Clustering is done within neighborhood boundaries, then merged up the hierarchy through the supernodes.
Benefits
• Significant speedup over centralized clustering and flat peer-to-peer clustering.
• Multiple levels of clusters.
• Distributed summarization of clusters using CorePhrase keyphrase extraction.
Scenarios
• Distributed knowledge discovery in hierarchical organizations.
[Figures: HP2PC Architecture; HP2PC Example: 3-level network, 16 nodes]
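The way supernodes of one level form the neighborhoods of the next can be sketched in a few lines. This is only an illustration of the hierarchy construction, not the HP2PC implementation: it assumes a fixed `neighborhood_size` (standing in for whatever the network partitioning factor yields), picks the first peer of each neighborhood as supernode, and omits the clustering and merging steps. The function name `build_hierarchy` is hypothetical.

```python
def build_hierarchy(nodes, neighborhood_size):
    """Return a list of levels; each level is a list of neighborhoods."""
    current = list(nodes)
    if not current:
        raise ValueError("at least one node is required")
    if neighborhood_size < 2:
        raise ValueError("neighborhood_size must be at least 2")

    levels = []
    while True:
        # Partition the peers of this level into consecutive neighborhoods.
        neighborhoods = [current[i:i + neighborhood_size]
                         for i in range(0, len(current), neighborhood_size)]
        levels.append(neighborhoods)
        if len(neighborhoods) == 1:
            break  # a single neighborhood is the top of the hierarchy
        # Assumption: the first peer in each neighborhood is its supernode.
        current = [group[0] for group in neighborhoods]
    return levels


if __name__ == "__main__":
    for h, level in enumerate(build_hierarchy(range(8), 2)):
        print(f"level {h}: {level}")
```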
Categorization: Multiple Classifier Systems
Progress
• Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
• Proposed an ensemble training algorithm called Clustering, De-clustering, and Selection (CDS).
• Proposed and optimized a cooperative training algorithm called Cooperative Clustering, De-clustering, and Selection (CO-CDS).
• Investigated the application of the proposed training methods (CDS and CO-CDS) to LO classification.
Tasks
• Investigate various aspects of cooperation in Multiple Classifier Systems (classifier ensembles).
• Develop evaluation measures to estimate the various types of cooperation in the system.
• Use the proposed evaluation measures to gain insight into how changes in the cooperative components affect system performance.
• Apply these findings to optimize existing ensemble methods.
• Apply these findings to develop novel ensemble methods that improve classification accuracy and reduce computational complexity.
Categorization: Imbalanced Class Distribution
Objective
• Advance classification of multi-class imbalanced data
Tasks
• Develop the cost-sensitive boosting algorithm AdaC2.M1
• Improve identification performance on the important classes
• Balance classification performance among the classes
Categorization: Imbalanced Class Distribution
[Figures: Class Distribution; Performance of Base Classification and AdaBoost]
• Balanced performance among classes, evaluated by G-mean
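G-mean is commonly defined as the geometric mean of the per-class recalls, so it drops to zero as soon as one class is never recognized. The sketch below assumes that definition; the helper name `g_mean` and the toy labels are illustrative and not taken from the experiments.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return math.prod(recalls) ** (1 / len(recalls))


if __name__ == "__main__":
    y_true = [0, 0, 0, 0, 1, 1, 2, 2]   # class 0 is the majority class
    y_pred = [0, 0, 0, 0, 1, 0, 2, 0]   # minority classes are half missed
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(f"accuracy = {accuracy:.3f}, G-mean = {g_mean(y_true, y_pred):.3f}")
```

A classifier that always predicts the majority class can still score a high accuracy on imbalanced data, but its G-mean is zero, which is why the measure suits the balancing objective.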
Personalization
Opposition-based Reinforcement Learning for Personalizing Image Search
• Develop a reliable technique to assist users and to facilitate and enhance the learning process.
• The personalized ORL tool helps the user find the desired images among the search results.
• The tool gathers the images returned by a search and selects a sample of them.
• By presenting the sample and interacting with the user, it learns the user’s preferences.
Personalization
Image Mining: CBIR
Content-based image retrieval
• Build an IR system that can retrieve images based on textual cues, image content, and NL queries.
Image Retrieval Tool Set
Inputs
• Query Image QI
• Query Text QT
• Query Document
Outputs
• Documents that contain QI images
• Images that contain QT
• Images that match QI
• NL description of image-rich documents
• Automated image tagging
Illustrative Example
[Figure: comparison of IZM, FD, MTAR, and the proposed approach. Accuracy values shown in the figure: 55%, 70%, 95%, 60%.]
Experimental Results (Cont’d)
[Figure: performance of the proposed approach]
Integration and Applications
Progress
• Finished core parts of the common data mining framework.
• Built components and services from theme researchers’ work around the data mining framework.
• Provided documentation for the data mining framework and software components.
• Launched a web site to host components and documentation from Theme 4: http://pami.uwaterloo.ca/projects/lornet/software/
Integration and Applications
Progress
• Core parts of the common data mining framework are available, including:
  - Vector and matrix manipulation.
  - Document parsing and tokenization.
  - Statistical term and sentence analysis.
  - Similarity calculation using multiple distance functions.
  - IMS Content Package compliant parser.
• Components and tools built around the common data mining framework:
  - Metadata extraction from single documents; supports Dublin Core encoding.
  - Document similarity calculation using cosine similarity.
  - Single document and content package summarization.
  - Building of standard text datasets from large document collections.
• Integration with TELOS:
  - Developed a C# TELOS connector for integrating Theme 4 components.
  - Worked on the component manifest specification with Theme 6.
  - Provided metadata extraction as part of a complete scenario for TELOS components integration.
  - The following components were wrapped for use by TELOS through the C# connector: Automatic Metadata Extractor, Document Similarity, and Document Summarizer.
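Document similarity by cosine similarity can be illustrated with a short, self-contained Python sketch. It does not use the framework's own API: the tokenizer, the raw term-frequency vectors, and the names `term_vector` and `cosine_similarity` are assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase alphabetic tokens mapped to raw counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    d1 = term_vector("Data mining for learning objects")
    d2 = term_vector("Mining learning object repositories")
    print(f"similarity = {cosine_similarity(d1, d2):.3f}")
```

Cosine similarity depends only on the direction of the term vectors, so a long document and a short one on the same topic can still score as similar.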
Industry Collaboration
• Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.
• Vestech provided opportunities for researchers to work on speech technologies.
• Desire2Learn opened job opportunities for LORNET researchers.
Software Components
Overview of Components
• General Tools
  - C# Connector for TELOS
  - Common Data Mining Framework
• Standard Text Mining Tools
  - Metadata Extractor
  - Document Summarizer
  - Content Package Summarizer
  - Document Similarity
  - LO Recommender
  - Metadata Harvester
  - Keyword Extractor
  - Taxonomy Extractor
  - Metadata Enrichment Tools
• Concept-based and Semantic Text Mining Tools
  - Metadata Extractor
  - LO Search Engine
  - Document Similarity
  - Document Classifier
  - Document Clusterer
  - Semantic-based Ontology Representation
  - Semantic Metadata Matching
  - POS Rule-Learning System
  - Triplet Representation System
• Categorization Tools
  - LO Classifier
  - LO Multiple Classifier
  - LO Clusterer
  - LO Ensemble Clusterer
  - LO Consensus Clusterer
  - LO Distributed Clusterer
• User-centric Tools
  - Personalized Search Engine
  - Social Network Learner
• Image Mining Tools
  - Content-based Image Search
  - Personalized Image Search
  - Consensus-based Fusion for Image Retrieval
Legend: Integrated, Ready, In Progress, Year 5
Scenarios for Use of Software Components
• Environment: TELOS
  - Data Types: Metadata, Ontology
  - Tasks: Ontology construction and unification; Finding relations between components; Ranking components; Grouping components; Tagging components
• Environment: Learning Object Repository
  - Data Types: Metadata, Structured Text, Categorical
  - Tasks: Automatic metadata extraction; LO automatic classification; LO organization through clustering; Multiple organization strategies through cluster ensembles
• Environment: e-Learning Environment
  - Data Types: Structured Text, Images, Object Relationships, Context
  - Tasks: Extracting concepts from LO; Summarizing Documents; Grouping LOs; Tagging LOs; Discovering Similar Topics; Discovering Similar Peers; Building Social Networks; Detecting Plagiarism; LO recommendation using similarity ranking; Personalization / Specialization through reinforcement learning
Publications
Theme 4
Team Leader: M. Kamel
PIs: Dr. Basir, Dr. Tizhoosh, Dr. Karray
Assoc. PIs: Wong, DiMarco
Researchers: H. Ayad, R. Kashef, A. Ghazel, Dr. Makrehchi, M. Shokri, S. Hassan, A. Farahat, Dr. R. Khoury
Funding: CRC/CFI/OIT, NSERC, PAMI Lab
Industry partners: PDS, Vestech, Desire2Learn
Graduated
• R. Khoury, PhD 07
• L. Chen, PhD 07
• M. Makrehchi, PhD 07
• K. Hammouda, PhD 07
• R. Dara, PhD 07
• Y. Sun, PhD 07
• K. Shaban, PhD 06
• Y. Sun, PhD 06
• M. Hussin, PhD 05
• Jan Bakus, PhD 05
• A. Adegorite, MASc 04
• A. Khandani, MASc 05
• S. Podder, MASc 04
Pattern Analysis and Machine Intelligence Lab
Electrical and Computer Engineering
University of Waterloo, Canada
www.pami.uwaterloo.ca
www.pami.uwaterloo.ca/projects/lornet/software/
Publications: www.pami.uwaterloo.ca/kamel.html