Pattern Analysis & Machine Intelligence Research Group, University of Waterloo
LORNET Theme 4: Data Mining and Knowledge Extraction for Learning Objects (LO)
Theme Leader: Mohamed Kamel. PIs: O. Basir, F. Karray, H. Tizhoosh. Associate PIs: A. Wong, C. DiMarco
Theme 4 Team
Leader: M. Kamel
PIs: Dr. Basir, Dr. Karray, Dr. Tizhoosh; Associate PIs: Wong, DiMarco
Researchers: H. Ayad, R. Kashef, A. Ghazel, Dr. Makrehchi, M. Shokri, S. Hassan, A. Farahat, Dr. R. Khoury
PAMI Lab funding: CRC/CFI/OIT, NSERC
Industry partners: PDS, Vestech, Desire2Learn
Graduated: R. Khoury, PhD 07; L. Chen, PhD 07; M. Makrehchi, PhD 07; K. Hammouda, PhD 07; R. Dara, PhD 07; Y. Sun, PhD 07; K. Shaban, PhD 06; Y. Sun, PhD 06; M. Hussin, PhD 05; Jan Bakus, PhD 05; A. Adegorite, MASc 04; A. Khandani, MASc 05; S. Podder, MASc 04
PAMI Research Group, University of Waterloo
Data and Knowledge Mining • Knowledge extraction and discovery of patterns from data. • Labeling and categorization, summarization, classification, prediction, association rules, clustering PAMI Research Group, University of Waterloo
Theme Overview
• LO Mining
  • From text — syntactic: keyword/keyphrase-based; semantic: concept-based
  • From images — image features, shape features
  • From text + images — describing images with text; enriching text with images
• Knowledge Extraction
  • Classification (MCS, data partitioning, imbalanced classes)
  • Clustering (parallel/distributed clustering, cluster aggregation)
  • LO similarity and ranking; association rules / social networks; reinforcement learning
• Specialized / Personalized Search
  • Tagging and organizing; matching and ranking
PAMI Research Group, University of Waterloo
Types of Data in LORNET TELOS
[Diagram: TELOS data sources and their content types]
• LCMS: Courses → Modules → Lessons → LOs; subject matter as text, images, Flash, applets, plus metadata and interaction logs
• Discussion Board: Boards → Threads → Posts; discussion text and interaction logs
• LOR (semantic layer): records describing resources, with metadata and semantic references; LO descriptors as metadata
PAMI Research Group, University of Waterloo
Abstract View of Data for Mining
• Text (plain or markup)
  • Any resource that contains text is viewed as an abstract text document (some markup can be preserved to indicate different weights), e.g. HTML page, Word document, email message, discussion post, even metadata records.
  • Suitable for text mining, information/metadata extraction, summarization, natural language processing, semantic/concept analysis, social network analysis.
• Numeric matrix (Vector Space Model)
  • Requires text mining algorithms to convert the original text to numeric form through feature extraction and statistical weighting.
  • Suitable for machine learning algorithms that expect numeric input, especially classification and clustering algorithms.
• Feature vectors
  • Suitable for mining images: description, indexing, and retrieval (CBIR). Requires image processing algorithms to extract image features.
  • Also suitable for mining and learning from interaction logs, where each vector describes an event.
• Relationships
  • Provide domain knowledge about the data, such as containment (e.g. LO within course, post within thread) and relatedness (collections of resources, cross-referenced LOs).
  • This extra knowledge can be exploited to improve accuracy, or to apply the same algorithm to different parts of the data (e.g. generating one summary for an entire course, or one summary per lesson).
PAMI Research Group, University of Waterloo
Data Representation • What level of granularity • One representation or multiple • Feature representation • Dimensionality issues PAMI Research Group, University of Waterloo
Document Modeling
• A document is represented by a set of concepts called "indexing terms"
• Document segmentation levels:
  • sub-word level (decomposition of words and their morphology)
  • word level (words and lexical information)
  • multi-word level (phrases and syntactic information)
  • semantic level (the meaning of the text)
  • pragmatic level (the meaning of the text with respect to context and situation; ontology?)
PAMI Research Group, University of Waterloo
Document Modeling
[Diagram: the segmentation spectrum from sub-word to word, multi-word, semantic, and pragmatic levels. Toward the sub-word end, noise, redundancy, and dimensionality grow (content-based); toward the pragmatic end, processing becomes context-based and requires complex algorithms and domain knowledge.]
PAMI Research Group, University of Waterloo
Document Modeling
[Diagram mapping segmentation levels to adoption: term-level (word) is the most popular; multi-word and semantic levels are emerging; sub-word is not usual; pragmatic is not explored.]
PAMI Research Group, University of Waterloo
Document Modeling
• Bag-of-words (VSM): the most popular document representation model
  • the document is treated as an unordered collection of words (no word sequence)
  • terms are weighted by their importance (based on frequency)
  • terms are assumed independent and uncorrelated
• Bag-of-words (VSM) drawbacks:
  • ignores term dependencies and correlations
  • ignores text structure
  • ignores the ordering of the words in the document (though IR research suggests word ordering matters little for retrieval)
  • ignores grammar (which also makes it language independent)
• Solutions: generalized VSM, LSI, phrase-based models, concept-based representation (a tf-idf sketch follows)
PAMI Research Group, University of Waterloo
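As an illustration of the VSM, here is a minimal sketch (my addition, not from the slides) that builds a tf-idf weighted term-document matrix with scikit-learn; the sample documents are invented for the example.

```python
# Minimal VSM sketch: documents -> tf-idf weighted term vectors.
# The sample documents are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Learning objects support reusable course content.",
    "Clustering groups similar learning objects together.",
    "Classification assigns a learning object to a known class.",
]

# Bag-of-words with tf-idf weighting; stop words are removed, and
# word order and grammar are ignored by construction.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # sparse (n_docs x n_terms) matrix

print(vectorizer.get_feature_names_out())   # the indexing terms
print(X.toarray().round(2))                 # one weighted vector per document
```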
Curse of Dimensionality
• The number of training samples required grows exponentially with the number of features
• For a fixed sample size, increasing the number of features may degrade performance (peaking phenomenon)
• A limited sample size leads to overfitting, which implies a lack of generalization and low performance
PAMI Research Group, University of Waterloo
Dimensionality Reduction
• Feature extraction
  • uses all dimensions of the measurement space to obtain a new transformed space (compacting the feature space without discarding dimensions)
  • identifies important combinations of the features (PCA, manifold learning, SVD, and factor analysis)
  • low-dimensional embeddings (random projections)
• Pros and cons (an LSI-style sketch follows)
  + promising results
  + solid mathematical background
  - high complexity (time and space)
  - lack of scalability
  - fails in high-dimensional data mining problems
  - extracted features usually have no meaning
PAMI Research Group, University of Waterloo
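To make the feature-extraction route concrete, a small sketch (my addition, reusing the invented documents from the tf-idf example above) projects the tf-idf matrix onto a low-dimensional latent space with truncated SVD, the operation behind LSI.

```python
# Feature extraction sketch: LSI via truncated SVD on a tf-idf matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "Learning objects support reusable course content.",
    "Clustering groups similar learning objects together.",
    "Classification assigns a learning object to a known class.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Project the high-dimensional term space onto 2 latent dimensions.
# The new axes are linear combinations of terms and, as the slide
# notes, usually have no direct interpretation.
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)                # dense (n_docs x 2) matrix
print(X_lsi.round(3))
print(svd.explained_variance_ratio_)        # how much structure survives
```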
Dimensionality Reduction
• Feature selection
  • reduces the dimensionality of the feature space by removing useless, redundant, irrelevant, and noisy features
  • is a problem of searching for a subset of features among the total set of features, based on one or more performance indices (objective functions); a small sketch follows
• Makrehchi and Kamel, IEEE SMC 07
PAMI Research Group, University of Waterloo
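The slides do not prescribe a particular selection criterion; as one common instance, this sketch (my addition, with invented labeled documents) keeps the k terms that score highest under a chi-square test against class labels.

```python
# Feature selection sketch: keep the k best terms by chi-square score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "goal scored in the final match",
    "stocks fell as markets closed",
    "the team won the championship game",
    "interest rates and bond markets",
]
labels = [0, 1, 0, 1]                       # 0 = sports, 1 = finance (invented)

vec = CountVectorizer()
X = vec.fit_transform(docs)

# The objective function here is the chi-square dependency between
# each term and the class label; irrelevant terms score low.
selector = SelectKBest(chi2, k=4)
X_reduced = selector.fit_transform(X, labels)
print(vec.get_feature_names_out()[selector.get_support()])
```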
New Representation Models
• Phrase-based representation: Document Index Graph (DIG) — Hammouda and Kamel, KIS 2004, IEEE TKDE 2004
• Concept-based representation — Shehata, Karray and Kamel, ICDM 2006, KDD 07, WI 07
PAMI Research Group, University of Waterloo
Concept-based Mining Model PAMI Research Group, University of Waterloo
Concept-based Statistical Analyzer
[Pipeline diagram]
• Text preprocessing: separate sentences, label terms, remove stop words, stem words (a preprocessing sketch follows)
• Concept-based term analysis: term frequency (tf) and conceptual term frequency (ctf)
• Concept-based document similarity
• Clustering techniques: single pass, HAC (Ward), HAC (complete), k-NN
• Output: clusters of text documents (Cluster 1, Cluster 2, Cluster 3)
PAMI Research Group, University of Waterloo
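The preprocessing stage is standard; a minimal sketch of it (my addition, using NLTK rather than any tool named in the slides) covers sentence separation, stop-word removal, and stemming.

```python
# Preprocessing sketch: sentence splitting, stop-word removal, stemming.
# Uses NLTK; run nltk.download('punkt') and nltk.download('stopwords') once.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = ("Learning objects are reusable. "
        "They can be clustered by their textual content.")

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

for sentence in nltk.sent_tokenize(text):           # separate sentences
    tokens = nltk.word_tokenize(sentence.lower())   # tokenize into terms
    terms = [stemmer.stem(t) for t in tokens
             if t.isalpha() and t not in stop]      # remove stop words, stem
    print(terms)
```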
Evaluation
[Charts: F-measure of HAC (Ward) — higher is better; entropy of HAC (Ward) — lower is better]
PAMI Research Group, University of Waterloo
Evaluation (cont.)
[Charts: F-measure and entropy of the k-NN clustering]
PAMI Research Group, University of Waterloo
Classification
[Diagram: a classifier assigns each object in a set to one of the known classes, e.g. sports, finance, farming]
• A function that assigns an object to a class
• Infers, e.g., that "object X is about sports"
• The function is learned automatically from a set of examples
PAMI Research Group, University of Waterloo
Classifiers
• Template matching: the user needs to supply a template and a metric
• NMC (nearest class mean): simple, no training
• k-NN: asymptotically optimal, slow in testing
• Bayes: yields a simple classifier for Gaussian distributions
• NN (neural networks): nonlinear, sensitive to parameters, slow training
• DT (decision trees): binary, transparent, sensitive to overtraining
• SVM: nonlinear, insensitive to overtraining, slow, good generalization
(a comparison sketch follows)
PAMI Research Group, University of Waterloo
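For a concrete feel of these trade-offs, a small sketch (my addition; the data is synthetic, and scikit-learn stands in for whatever implementations the group used) trains several of the listed classifiers on the same data.

```python
# Classifier comparison sketch on synthetic 2-class data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "NMC":   NearestCentroid(),            # nearest class mean, no real training
    "k-NN":  KNeighborsClassifier(5),      # all the work happens at test time
    "Bayes": GaussianNB(),                 # simple under Gaussian assumptions
    "DT":    DecisionTreeClassifier(),     # transparent, prone to overtraining
    "SVM":   SVC(),                        # good generalization, slower to train
}
for name, model in models.items():
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```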
Multiple Classifier Systems
• Multiple classifier systems consist of a set of classifiers and a combination strategy.
• Motivations:
  • Existence of many alternative classifiers, each with its own feature and representation space
  • Existence of different training sets, collected at different times and possibly with different features
  • Each classifier may perform well in its own region of the feature space
  • Classifiers may make mistakes in different patterns, even when trained on the same data
PAMI Research Group, University of Waterloo
Multiple Classifier Systems Design
• Design of an MCS can be accomplished at 4 levels [Kuncheva 04]:
  • Aggregation level
  • Classifier level
  • Feature level
  • Data level
PAMI Research Group, University of Waterloo
Combining Schemes
• Static vs. adaptive, fixed vs. trainable
• Voting methods: max, average, majority, Borda count
• Weighted average, fuzzy integrals, belief theory
• Decision templates, behavior knowledge space
• Feature-based architecture (adaptive) (Wanas and Kamel 99-02): the aggregation is trained and adapts to the data rather than being a post-processing step
• Data-level combining: a partitioning technique for training multiple classifiers (Dara, ..., and Kamel, IF 04, PR 06) that generates nearly optimal training partitions
(a voting sketch follows)
PAMI Research Group, University of Waterloo
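As one fixed combining rule among those listed, this sketch (my addition, on synthetic data as in the classifier example) combines three classifiers by majority vote and by averaging their class probabilities.

```python
# Combining sketch: majority voting vs. averaging class probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

members = [GaussianNB(), DecisionTreeClassifier(random_state=0),
           KNeighborsClassifier(5)]
for m in members:
    m.fit(X_tr, y_tr)

# Majority vote over hard decisions (binary labels, 3 members).
votes = np.stack([m.predict(X_te) for m in members])
majority = (votes.sum(axis=0) >= 2).astype(int)

# Average of posterior probabilities (the "average" fixed rule).
proba = np.mean([m.predict_proba(X_te) for m in members], axis=0)
averaged = proba.argmax(axis=1)

print("majority vote accuracy:", (majority == y_te).mean())
print("average rule accuracy: ", (averaged == y_te).mean())
```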
Imbalanced Classes (Sun and Kamel, ICDM 2006, PR 2007)
• Data set: 20-Newsgroup; class size ratio: 1/15; performance measure: F-measure; base classifier: Naïve Bayes

                NB     AdaBoost  AdaC1  AdaC2  AdaC3
  Small class   58.25  59.26     64.11  69.08  68.91
  Large class   97.13  97.98     98.28  98.31  98.42
  Accuracy      94.63  96.15     96.73  96.80  97.00

• Data set: SchoolNet; class size ratio: 1/12; performance measure: F-measure; base classifier: decision trees (C4.5)

                C4.5   AdaBoost  AdaC1  AdaC2  AdaC3
  Small class   22.78  31.58     35.16  52.73  53.85
  Large class   92.50  93.63     92.63  93.35  93.91
  Accuracy      86.32  88.34     86.77  88.34  89.24

Observations:
• Performance of the base classifier on the small class is poor
• AdaBoost is capable of improving classification accuracy
• AdaBoost does not guarantee improved performance on the small class
• AdaC2 and AdaC3 are effective in increasing the identification performance of the small class
PAMI Research Group, University of Waterloo
Dealing with Time-Dependent Data
• Time series data contains dynamic information and is difficult to model with any single representation method
• Traditional approaches to time series classification, such as Dynamic Time Warping (DTW), are not robust
• Aggregating decisions based on different representations can provide better and more reliable performance (Chen and Lei 2004-2006)
PAMI Research Group, University of Waterloo
Architecture PAMI Research Group, University of Waterloo
Experimental Results PAMI Research Group, University of Waterloo
Clustering
• Finding groups of objects such that objects in a group are similar to one another and dissimilar from objects in other groups
• Intra-cluster distances are minimized; inter-cluster distances are maximized
PAMI Research Group, University of Waterloo
Clustering Approaches
• Hierarchical: single link
• Partitional: k-means, fuzzy k-means, bisecting k-means, VQ
• Density-based: DBSCAN, Chameleon
• Agglomerative: starts from individual clusters, then merges
• Divisive: starts from one cluster and divides
• Connectionist: SOM, ART
(a short sketch follows)
PAMI Research Group, University of Waterloo
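To contrast the partitional and agglomerative families, a brief sketch (my addition, on synthetic blobs) runs k-means and single-link agglomerative clustering over the same points.

```python
# Clustering sketch: partitional (k-means) vs. agglomerative (single link).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
single = AgglomerativeClustering(n_clusters=3, linkage="single").fit(X)

print("k-means labels:    ", kmeans.labels_[:10])
print("single-link labels:", single.labels_[:10])
```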
Multi-clustering Overview of Combining Cluster Ensembles PAMI Research Group, University of Waterloo
Cluster Ensemble
• Developed a prototype for cluster ensemble methods (Ayad and Kamel 2005-2007) that includes:
  • Generation of cluster ensembles based on: (1) multiple feature subsets, (2) statistical sampling techniques, and (3) a variable number of clusters (multi-resolution ensembles)
  • Combiners of cluster ensembles based on: (1) shared nearest neighbors, (2) different representations and distance measures between clusters, and (3) voting
• Positive experimental results on text data, in addition to a variety of machine learning benchmark data sets
(a co-association sketch follows)
PAMI Research Group, University of Waterloo
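The slides list shared nearest neighbors and voting as combiners; one standard combiner in the same family, shown here as a sketch (my addition, not the group's published algorithm), accumulates a co-association matrix over many k-means runs and clusters it hierarchically.

```python
# Consensus clustering sketch: combine several k-means runs through a
# co-association matrix, then cluster that matrix hierarchically.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

X, _ = make_blobs(n_samples=100, centers=3, random_state=1)
n = len(X)

# Ensemble generation: k-means with varying k (multi-resolution).
runs = 20
coassoc = np.zeros((n, n))
for seed in range(runs):
    k = np.random.RandomState(seed).randint(2, 6)
    labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X)
    coassoc += (labels[:, None] == labels[None, :])
coassoc /= runs                              # fraction of co-assignments

# Consensus: average-linkage clustering of the co-association distances.
dist = squareform(1.0 - coassoc, checks=False)
consensus = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
print(consensus[:15])
```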
Categorization using cluster ensemble PAMI Research Group, University of Waterloo
Projects Overview
[Diagram: project areas mapped to data types (text documents, images, interaction logs)]
• Information Extraction: analyzing content to extract relevant information — keyword extraction, summarization, concept extraction, social network analysis
• Categorization: organizing LOs according to their content — classification (traditional, MCS, imbalanced) and clustering (traditional, ensembles, distributed)
• Personalization: providing user-specific results — reinforcement learning (traditional, opposition-based)
• Image Mining: describing and finding relevant images — CBIR (traditional, fusion-based)
• Integration and applications; software components; work in progress; publications; theme and industry collaboration
PAMI Research Group, University of Waterloo
Information Extraction: Summarization
LO Content Package Summarization
• Learning objects stored in IMS content packages are loaded and parsed; textual content files are extracted for analysis.
• Statistical term weighting and sentence ranking are performed on each document and on the whole collection.
• The top-ranked sentences are extracted for each document (a sentence-ranking sketch follows).
• Planned functionality: summarization of whole modules or lessons (as opposed to single documents).
• Benefits
  • Provides a summarized overview of learning objects for quick browsing and access to learning material.
• Scenarios
  • Learning management systems can call the summarization component to produce summaries for content packages.
Data is courtesy of the University of Saskatchewan
PAMI Research Group, University of Waterloo
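The slide does not give the exact weighting scheme; as a minimal stand-in (my addition), this sketch scores sentences by the mean tf-idf weight of their terms and extracts the top-ranked ones in document order.

```python
# Extractive summarization sketch: score sentences by the average
# tf-idf weight of their terms and keep the top-ranked ones.
import nltk  # requires nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= n_sentences:
        return text
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Sentence score = mean weight of its terms (row mean of tf-idf).
    scores = tfidf.mean(axis=1).A.ravel()
    top = sorted(scores.argsort()[-n_sentences:])   # keep original order
    return " ".join(sentences[i] for i in top)

doc = ("Learning objects are reusable units of instruction. "
       "They are stored in IMS content packages. "
       "Summaries let learners browse material quickly. "
       "Statistical weighting ranks the most relevant sentences.")
print(summarize(doc))
```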
Information Extraction: Social Network Analysis
Social Network Builder
• Tasks
  • Finding relationships between people based on their web pages
• Progress
  • Modeling
    • Actors are represented by their associated documents
    • Links are modeled by pair-wise similarity of the actors' documents, or by merging each actor's documents, so that relations are also modeled by documents
  • Learning
    • If some links are known, learning the social network translates into a text classification problem
    • If no links are revealed, it becomes a clustering problem, with very low performance
(a similarity-graph sketch follows)
PAMI Research Group, University of Waterloo
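A toy sketch of the pair-wise-similarity modeling (my addition; actors, documents, and the 0.2 threshold are all invented) links actors whose documents are similar under tf-idf cosine similarity.

```python
# Social-network-builder sketch: link actors whose documents are
# sufficiently similar under tf-idf cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

actors = {
    "alice": "machine learning clustering of learning objects",
    "bob":   "clustering algorithms and cluster ensembles",
    "carol": "image retrieval and content based search",
}
names = list(actors)
X = TfidfVectorizer().fit_transform(actors.values())
sim = cosine_similarity(X)

threshold = 0.2   # hypothetical cut-off for declaring a link
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if sim[i, j] >= threshold:
            print(f"link: {names[i]} -- {names[j]} ({sim[i, j]:.2f})")
```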
Information Extraction: Concept Extraction
Concept-Based Statistical Analyzer
Conceptual Ontological Graph (COG)
Ranking
PAMI Research Group, University of Waterloo
Information Extraction: Keyword Extraction
Semantic Keyword Extraction
• Tasks
  • Developing tools and techniques to extract semantic keywords, toward facilitating metadata generation
  • Developing algorithms to enrich metadata (tags), which can be applied in index-based multimedia retrieval
• Progress
  • Proposed a new information-theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo-ontology)
  • Makrehchi and Kamel, ICDM 07, WI 07
PAMI Research Group, University of Waterloo
Information Extraction: Keyword Extraction
Rule-based Keyword Extraction
• Learn rules to find keywords in English sentences
  • Rules represent sentence fragments
  • Specific enough for reliable keyword extraction
  • General enough to be applied to unseen sentences
• Rule generalization
  • Begin with an exact sentence fragment
  • Merge it with another by moving differing words to their lowest common level in the part-of-speech hierarchy
  • Keep the merged rule if it does not reduce the precision and recall of keyword extraction; keep the original rules otherwise
• Keyword extraction
  • Find the sequence of rules that best covers an unseen sentence
  • Extract keywords according to those rules
• Results
  • The rule base shows quick initial growth, followed by slow and irregular growth and rule elimination: 20 rules are learned from the first 50 training sentences, and 13 more from the next 220
  • Both precision and recall increase during training: precision (blue) increases by 10%, and recall (red) shows a slight upward trend
PAMI Research Group, University of Waterloo
Categorization: Ensemble-based Clustering
Consensus Clustering
• Categorization of learning objects using the proposed consensus clustering algorithms.
• The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.
• Consensus clustering can offer several advantages over a single data clustering: improved clustering accuracy, better scalability of clustering algorithms to large volumes of data objects, and enhanced robustness through reduced sensitivity to outlier data objects and noisy attributes.
• Tasks
  • Development of techniques for producing ensembles of multiple data clusterings in which diverse information about the structure of the data is likely to occur.
  • Development of consensus algorithms to aggregate the individual clusterings.
  • Development of solutions for the cluster symbolic-label matching problem.
  • Empirical analysis on real-world data and validation of the proposed methods.
PAMI Research Group, University of Waterloo
Categorization using cluster ensemble PAMI Research Group, University of Waterloo
Distributed Environments
• Distributed data mining: applying data mining in an environment where the data, the mining process, or both are distributed.
• Motivation
  • Natural distribution of data on the Web.
  • Scenarios that require the integration of disparate data and mining results are emerging (e.g. federation of repositories, news feed aggregation, digital libraries, business intelligence gathering).
  • Emerging technologies, such as the Semantic Web, Web Services, and Grid Computing, make it feasible to build distributed mining systems.
  • Availability of cheap low-end hardware that can be utilized in a distributed environment to achieve high-end goals (e.g. Google, SETI@Home, Folding@Home).
PAMI Research Group, University of Waterloo
Categorization: Distributed Clustering
Hierarchical P2P Document Clustering (HP2PC)
• Peer nodes are arranged into groups called "neighborhoods".
• Multiple neighborhoods are formed at each level of the hierarchy; the size of each neighborhood is determined by a network partitioning factor.
• Each neighborhood has a designated supernode; the supernodes of level h form the neighborhoods of level h+1.
• Clustering is done within neighborhood boundaries, then merged up the hierarchy through the supernodes (a merge sketch follows).
• Benefits
  • Significant speedup over centralized clustering and flat peer-to-peer clustering.
  • Multiple levels of clusters.
  • Distributed summarization of clusters using CorePhrase keyphrase extraction.
• Scenarios
  • Distributed knowledge discovery in hierarchical organizations.
[Figures: HP2PC architecture; HP2PC example, 3-level network, 16 nodes]
PAMI Research Group, University of Waterloo
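The HP2PC merge step is more involved than this, but a toy sketch (my addition, not the published algorithm) conveys the idea: each peer clusters its own documents, and a supernode re-clusters the peers' centroids weighted by cluster size.

```python
# Toy hierarchical-merge sketch: peers cluster locally, a supernode
# re-clusters their centroids weighted by cluster size.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Each "peer" holds its own partition of the data.
peers = [make_blobs(n_samples=60, centers=3, random_state=s)[0]
         for s in range(4)]

centroids, weights = [], []
for data in peers:
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    centroids.append(km.cluster_centers_)
    weights.append(np.bincount(km.labels_, minlength=3))

# Supernode step: cluster all peer centroids, weighting by cluster size.
C = np.vstack(centroids)
w = np.concatenate(weights)
merged = KMeans(n_clusters=3, n_init=10, random_state=0).fit(C, sample_weight=w)
print(merged.cluster_centers_.round(2))
```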
Categorization: Multiple Classifier Systems
• Tasks
  • To investigate various aspects of cooperation in multiple classifier systems (classifier ensembles)
  • To develop evaluation measures in order to estimate various types of cooperation in the system
  • To gain insight into the impact of changes in the cooperative components on system performance, using the proposed evaluation measures
  • To apply these findings to optimize existing ensemble methods
  • To apply these findings to develop novel ensemble methods, with the goal of improving classification accuracy and reducing computational complexity
• Progress
  • Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
  • Proposed an ensemble training algorithm called Clustering, De-clustering, and Selection (CDS).
  • Proposed and optimized a cooperative training algorithm called Cooperative Clustering, De-clustering, and Selection (CO-CDS).
  • Investigated the application of the proposed training methods (CDS and CO-CDS) to LO classification.
PAMI Research Group, University of Waterloo
Categorization: Imbalanced Class Distribution
• Objective
  • Advance the classification of multi-class imbalanced data
• Tasks
  • To develop the cost-sensitive boosting algorithm AdaC2.M1
  • To improve identification performance on the important classes
  • To balance classification performance among the several classes
PAMI Research Group, University of Waterloo
Categorization: Imbalanced Class Distribution
[Charts: class distribution; performance of the base classifier and AdaBoost; balanced performance among classes, evaluated by G-mean]
PAMI Research Group, University of Waterloo
Personalization
• Opposition-based reinforcement learning (ORL) for personalizing image search
• Developing a reliable technique to assist users and to facilitate and enhance the learning process
• The personalized ORL tool helps the user find the searched images he or she desires
• The tool gathers the images in the search results and selects a sample of them
• By interacting with the user and presenting the sample, it learns the user's preferences
PAMI Research Group, University of Waterloo
Personalization PAMI Research Group, University of Waterloo
Personalization
Opposition-based RL algorithms: OQ(λ) (International Joint Conference on Neural Networks, 2006) and NOQ(λ) (IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007) (a toy sketch follows)
PAMI Research Group, University of Waterloo
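This is not the published OQ(λ)/NOQ(λ) algorithms, only a toy sketch of the opposition idea (my addition): each update to Q(s, a) also updates the opposite action, which can speed up early learning. It assumes a known environment model so the opposite action's outcome can be evaluated without taking it.

```python
# Toy opposition-based Q-learning sketch on a 1-D walk: each update to
# Q(s, a) also updates Q(s, opposite(a)) using the modeled outcome of
# the opposite action. Not the published OQ(lambda)/NOQ(lambda).
import random

N_STATES, GOAL = 10, 9
ACTIONS = [+1, -1]                       # right, left; opposite(a) = -a
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else -0.01)

for _ in range(200):                     # episodes
    s = 0
    while s != GOAL:
        a = (random.choice(ACTIONS) if random.random() < eps
             else max(ACTIONS, key=lambda x: Q[(s, x)]))
        s2, r = step(s, a)
        target = r + gamma * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # Opposition update: evaluate the opposite action from the model.
        so, ro = step(s, -a)
        target_o = ro + gamma * max(Q[(so, x)] for x in ACTIONS)
        Q[(s, -a)] += alpha * (target_o - Q[(s, -a)])
        s = s2

# Greedy action per state: should be +1 (move right) everywhere.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```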
Image Mining: CBIR
Content-based image retrieval
• Build an IR system that can retrieve images based on textual cues, image content, and natural-language queries:
  • documents containing a query image QI
  • images containing query text QT
  • images matching a query image QI
  • natural-language descriptions of image-rich documents
  • automated image tagging
• Query types: query image QI, query text QT, query document
(a histogram-retrieval sketch follows)
Image Retrieval Tool Set
PAMI Research Group, University of Waterloo
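As one elementary instance of content-based retrieval (my addition; the slides do not specify the features used, and the file paths are placeholders), this sketch indexes images by color histograms and ranks them against a query image by histogram intersection.

```python
# CBIR sketch: index images by color histograms, retrieve by histogram
# intersection with a query image. File paths are placeholders.
import numpy as np
from PIL import Image

def histogram(path: str, bins: int = 8) -> np.ndarray:
    """Normalized joint RGB histogram as a flat feature vector."""
    rgb = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    h, _ = np.histogramdd(rgb, bins=(bins, bins, bins),
                          range=((0, 256),) * 3)
    return h.ravel() / h.sum()

def intersection(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.minimum(a, b).sum())    # 1.0 = identical histograms

index = {p: histogram(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]}
query = histogram("query.jpg")
ranked = sorted(index, key=lambda p: intersection(index[p], query),
                reverse=True)
print(ranked)                               # most similar image first
```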