Learning Object Metadata Mining

Masoud Makrehchi
Supervisor: Prof. Mohamed Kamel

Outline
• Metadata Mining
• Metadata Representation Model
• Class-Term Matrix
• Case Study
• Concluding Remarks
Makrehchi & Kamel

Metadata Mining
• Metadata Definition
  • Data about data, for example a library catalogue
• Metadata Applications
  • Cataloging (items and collections)
  • Resource discovery
  • Electronic commerce and digital signatures
  • Intelligent software agents
  • Content rating
  • Intellectual property rights
  • Semantic Web
  • Learning objects
• LOM Standards: IEEE LOM, DC, SCORM, CANCORE

Metadata Mining
• Definition
  • Extraction of implicit, previously unknown, and potentially useful information from metadata.
• Methods
  • Classification, clustering, summarization, association rule mining, ontology extraction, information integration, keyword extraction, automatic title generation.

Metadata Mining
• Why metadata mining?
  • The data itself is not accessible, or raw data is lacking,
  • The data is not convenient for mining (heterogeneous or non-text formats),
  • Metadata standards are diverse, and different metadata repositories need to be merged,
  • Ontology extraction is much easier at the metadata level.

Metadata Mining
[Figure: Conceptual data architecture]

Metadata Mining
• Applications
  • Metadata mining instead of raw data mining,
  • Metadata enrichment (keyword extraction),
  • (Semi-)automatic ontology extraction,
  • Mining RDF, OWL, and other semantically tagged scripts,
  • Information integration (LO aggregation and integration).

Metadata Mining
• Statistical methods based on word frequency analysis,
• Syntactic methods based on linguistic parsing and pattern matching,
• Structural methods studying the outline of the document,
• Conceptual (semantic) methods based on the use of a knowledge base to interpret the meaning.

Metadata Mining
• We don’t use
  • Natural Language Processing (NLP),
  • Semantic analysis and processing,
  • Graphs, trees, and other sophisticated data structures and models,
  • Dictionaries, thesauri, or any other global vocabularies (only a simple Porter stemmer).

Metadata Representation Model
• We treat metadata as a text document (semi-structured format),
• The only measures are
  • statistical measures (like frequency),
  • geometric features (like the location of a specific term, or the order of words in a term or phrase).

Metadata Representation Model
• Vector Space Model
[Figure: document d_i represented over the vocabulary T]

Metadata Representation Model
• Multi-Partition Vector Space Model
[Figure: document d_i represented over the vocabulary T, split into partitions]

Metadata Representation Model
• Multi-Partition Vector Space Model

Metadata Representation Model
• Converting to standard vector model

Metadata Representation Model
• Weight of each partition
  • To be determined by an expert, for example: W_abstract = 1.0, W_title = 1.5.
• Membership degree of each term in every partition
  • By an expert,
  • Frequency-based measures (tf-idf),
  • Geometric measures (location of each term in the partition).
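
The conversion from partitions to a single vector can be illustrated with a minimal sketch. It assumes that a term's membership in a partition is simply its raw count there and that partition contributions are added; the names `PARTITION_WEIGHTS` and `to_standard_vector` are hypothetical, and only the two example weights come from the slide.

```python
from collections import Counter

# Partition weights; the two values follow the slide's example,
# any other partition falls back to an assumed weight of 1.0.
PARTITION_WEIGHTS = {"title": 1.5, "abstract": 1.0}

def to_standard_vector(record, weights=PARTITION_WEIGHTS):
    """Collapse a multi-partition record into one weighted term vector.

    Membership of a term in a partition is taken to be its raw count
    (an assumption; tf-idf or a location-based score could be used instead).
    """
    vector = Counter()
    for partition, text in record.items():
        w = weights.get(partition, 1.0)
        for term in text.lower().split():
            vector[term] += w  # each occurrence adds the partition weight
    return dict(vector)

record = {
    "title": "fuzzy term clustering",
    "abstract": "clustering of metadata term sets",
}
vector = to_standard_vector(record)
```

A term that occurs in both partitions, such as "term" here, accumulates both weights, so title words count for more than abstract-only words.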

Class-Term Matrix
• Document-Term Matrix (Collection × Vocabulary)
  • The matrix is very large (thousands of documents in the collection and millions of terms in the vocabulary),
  • The matrix is sparse: usually only a small number of elements are nonzero (Zipf's law),
  • The matrix is dual with respect to terms and documents.

Class-Term Matrix
• Class-Term Matrix (Class × Vocabulary)
  • The matrix is large (tens of classes and millions of terms in the vocabulary),
  • The matrix is less sparse,
  • The matrix is still dual with respect to terms and classes.
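
One way to obtain class-term frequencies is to sum term counts over the documents of each class. The sketch below assumes every document carries exactly one class label; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def class_term_matrix(docs):
    """Build {class: Counter(term -> frequency)} from labeled documents.

    docs is an iterable of (class_label, list_of_terms) pairs.
    Rows are stored sparsely: absent terms have an implicit count of 0.
    """
    matrix = defaultdict(Counter)
    for label, terms in docs:
        matrix[label].update(terms)  # add this document's counts to its class row
    return matrix

docs = [
    ("ai", ["agent", "learning", "model"]),
    ("ai", ["learning", "search"]),
    ("db", ["query", "model", "index"]),
]
ctm = class_term_matrix(docs)
```

Because many documents collapse into one row, each class row covers more of the vocabulary than any single document row, which is why this matrix is less sparse.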

Class-Term Matrix
• Class-term frequency
• Term significance measure
• Normalized term significance measure

Class-Term Matrix
• Terminology
  • All terms which occur in a class (or concept)
  • A fuzzy set of all terms in the vocabulary

Class-Term Matrix
• Definition
  • All concepts (classes) which the term belongs to
  • A fuzzy set of all concepts (classes)
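
Terminology and definition can be read as a row and a column of a normalized class-term matrix. The sketch below is only illustrative: scaling each row by its maximum is an assumed normalization, not the talk's term significance measure, and all function names are hypothetical.

```python
def normalize(ctm):
    """Scale each class row by its largest count so memberships lie in [0, 1].

    This row-maximum scaling is an assumption made for illustration.
    """
    return {
        cls: {term: freq / max(row.values()) for term, freq in row.items()}
        for cls, row in ctm.items()
    }

def terminology(mu, cls):
    """Fuzzy set of terms for one class: a row of the matrix."""
    return dict(mu[cls])

def definition(mu, term):
    """Fuzzy set of classes for one term: a column of the matrix."""
    return {cls: row[term] for cls, row in mu.items() if term in row}

ctm = {"ai": {"learning": 4, "model": 2}, "db": {"query": 5, "model": 5}}
mu = normalize(ctm)
```

Terms or classes missing from a returned dictionary have membership 0, so both results are still fuzzy sets over the whole vocabulary or the whole set of classes.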

Case Study
• Data set
  • There is no available LO metadata repository
  • Citeseer computer science directory (http://citeseer.ist.psu.edu/directory.html)
    • ~400,000 terms (vocabulary size)
    • 17 classes
    • 2,912 documents
  • Instead of the data itself (in PDF or PS), we collected BibTeX data (a kind of metadata or catalogue) and abstracts of the articles.

Case Study
• Types of Frequency Measures
  • Within document: document-term frequency (like tf-idf)
  • Within class: class-term frequency (like term significance)
  • Within collection: collection-term frequency (like the mean of term significances)
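
For the within-document level, a common tf-idf variant can be sketched as follows. It assumes raw term counts and an unsmoothed natural-log inverse document frequency; the talk does not state which variant was used, and the function name is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf * idf} dict per document.

    docs is a list of term lists; tf is the raw count in the document and
    idf is log(N / df), where df is the number of documents containing the term.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["fuzzy", "term", "term"], ["term", "matrix"], ["term", "fuzzy"]]
weights = tfidf(docs)
```

A term that appears in every document, like "term" here, gets weight zero no matter how often it occurs, which is the behaviour expected of a stopword candidate.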

Case Study
• Term Clustering: categorizing all terms into three main groups
  • Features: terms that are more frequent within a class
  • Keywords: terms that are more frequent within some documents belonging to a given class
  • Stopwords: terms that are more frequent in all classes
• Introducing the Class-Collection Map
  • To visualize the location of each category
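
The three groups can be approximated with a crisp threshold rule on the class-level and collection-level measures. This is a simplified stand-in for the fuzzy clustering described in the talk: the threshold `high`, the function name, and the toy significance values are all hypothetical, and document-level frequency is not modeled.

```python
def categorize(mu, cls, high=0.6):
    """Label each term seen in class `cls` as feature, keyword, or stopword.

    mu[c][t] is a class-term significance in [0, 1]. The collection-level
    score of a term is the mean of its significances over all classes.
    """
    classes = list(mu)
    labels = {}
    for term, in_class in mu[cls].items():
        in_collection = sum(mu[c].get(term, 0.0) for c in classes) / len(classes)
        if in_collection >= high:
            labels[term] = "stopword"  # significant across all classes
        elif in_class >= high:
            labels[term] = "feature"   # significant in this class only
        else:
            labels[term] = "keyword"   # not significant at class level
    return labels

mu = {
    "ai": {"paper": 0.9, "agent": 0.8, "minimax": 0.2},
    "db": {"paper": 1.0, "query": 0.9},
}
labels = categorize(mu, "ai")
```

The two axes used here, significance within the class and mean significance over the collection, correspond to the coordinates a class-collection map would plot.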

Case Study
• Extraction of stopwords (words that do not contribute to the meaning of the document)
  • General stopwords (a, an, the, in, …)
  • Domain-specific stopwords
    • Politics: Government, State,
    • Medicine: Patient,
    • Education: Learner, Instructor,
    • Social sciences: Society,
    • Anthropology: Human.

Case Study
• Why do we need to remove domain-specific stopwords?
  • Dimensionality reduction,
  • Accurate feature selection (information gain has the drawback of selecting noise as features),
  • Based on stopwords, we can find and separate phrases (by our definition, a phrase is a set of words between two stopwords).
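
The phrase definition above can be sketched directly: split the token stream at every stopword and keep the runs in between. The sketch assumes simple lowercase alphanumeric tokenization and treats the start and end of the text as boundaries too; the function name and the stopword list are hypothetical.

```python
import re

def split_phrases(text, stopwords):
    """Return the runs of words that lie between stopwords."""
    phrases, current = [], []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in stopwords:
            if current:  # a stopword closes the phrase collected so far
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:  # text end also closes a phrase (assumption)
        phrases.append(" ".join(current))
    return phrases

stopwords = {"a", "an", "the", "of", "in", "for", "is"}
phrases = split_phrases(
    "The extraction of domain specific stopwords in a class-term matrix",
    stopwords,
)
```

A richer stopword list, including domain-specific stopwords, yields shorter and cleaner phrases, which is one reason the extraction of such stopwords matters.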

Case Study
• Dimensionality reduction process
  • ~400,000 terms
  • Using metadata: 15,971
  • Stemming: 12,044
  • Multi-partition document vector space model: 5,605
  • Fuzzy-based term clustering: 226 features, 507 stopwords, 4,872 keywords

Concluding Remarks
• Most statistics-based data mining methods do not use domain knowledge.
• Metadata (semi-structured data) mining uses domain knowledge embedded in tags and partitions.
• We introduced the multi-partition document vector space model.
• We mine the class-term matrix in addition to the document-term matrix.

Concluding Remarks
• Based on the visualization model (class-collection map) and fuzzy inference, we can cluster the vocabulary for each class and extract three essential categories:
  • Features: to classify unknown documents,
  • Keywords: for indexing and access to specific documents in IR applications,
  • Stopwords: for dimensionality reduction and noise removal.

Concluding Remarks
• Based on the class-term matrix, we defined
  • Terminologies as fuzzy sets of all terms in the vocabulary
  • Definitions as fuzzy sets of all concepts

Concluding Remarks
• Future Work
  • Collecting LO metadata and constructing an LO metadata repository,
  • A keyword recall method to test and validate extracted keywords,
  • Implementing an average classifier (KNN or fuzzy classifier) to test and validate selected features,
  • Applying a multi-classifier architecture to the metadata mining problem.
