Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel
Outline • Metadata Mining • Metadata Representation Model • Class-Term Matrix • Case Study • Concluding Remarks Makrehchi & Kamel
Metadata Mining • Metadata Definition • Data about data, for example a library catalogue • Metadata Applications: • Cataloging (items and collections) • Resource discovery • Electronic commerce and digital signatures • Intelligent software agents • Content rating • Intellectual property rights • Semantic Web • Learning objects • LOM Standards: IEEE LOM, Dublin Core (DC), SCORM, CanCore
Metadata Mining • Definition • Extraction of implicit, previously unknown, and potentially useful information from metadata. • Methods • Classification, clustering, summarization, association rule mining, ontology extraction, information integration, keyword extraction, automatic title generation.
Metadata Mining • Why metadata mining? • The raw data itself is unavailable or inaccessible, • The data is not convenient for mining (heterogeneous and non-text formats), • Metadata standards are diverse, and different metadata repositories need to be merged, • Ontology extraction is much easier at the metadata level.
Metadata Mining • Figure: Conceptual data architecture
Metadata Mining • Applications • Metadata mining instead of raw data mining, • Metadata enrichment (keyword extraction), • (Semi-)automatic ontology extraction, • Mining RDF, OWL, and other semantically tagged scripts, • Information integration (LO aggregation and integration).
Metadata Mining • Statistical methods based on word frequency analysis, • Syntactic methods based on linguistic parsing and pattern matching, • Structural methods that study the outline of the document, • Conceptual (semantic) methods based on the use of a knowledge base to interpret meaning.
Metadata Mining • We do not use • Natural Language Processing (NLP), • Semantic analysis and processing, • Graphs, trees, and other sophisticated data structures and models, • Dictionaries, thesauruses, or any other global vocabularies (only a simple Porter stemmer).
Metadata Representation Model • We treat metadata as a text document (in a semi-structured format), • The only measures used are • statistical measures (such as frequency) • geometric features (such as the location of a specific term, or the order of words in a term or phrase)
Metadata Representation Model • Vector Space Model • Figure: a document d_i represented as a vector over the vocabulary T
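As a minimal sketch of the vector space model, assuming whitespace tokenization and raw term counts as weights (the slides specify neither), a document d_i could be mapped to a vector over the vocabulary T as follows. The function names and the two sample texts are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return the sorted list of distinct terms (the vocabulary T)."""
    return sorted({term for doc in documents for term in doc.lower().split()})

def to_vector(document, vocabulary):
    """Represent one document d_i as a term-count vector over the vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Illustrative texts only; they are not taken from the case study data.
docs = ["metadata mining of learning objects", "mining learning object metadata"]
T = build_vocabulary(docs)
vectors = [to_vector(d, T) for d in docs]
```

Each position in a vector corresponds to one vocabulary term, so all documents share the same dimensionality regardless of their length.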
Metadata Representation Model • Multi-Partition Vector Space Model • Figure: a document d_i represented over the vocabulary T, divided into partitions
Metadata Representation Model • Multi-Partition Vector Space Model
Metadata Representation Model • Converting to standard vector model
Metadata Representation Model • Weight of each partition • To be determined by an expert, for example: W_abstract = 1.0, W_title = 1.5. • Membership degree of each term in every partition • Set by an expert, • Frequency-based measures (tf-idf), • Geometric measures (location of each term in the partition).
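The conversion from a multi-partition record to a standard vector can be sketched as below. This is only one plausible reading: it assumes the membership degree of a term in a partition is its raw count and that partitions are merged by a weighted sum. Only the two weights come from the example above; `combine_partitions` and the sample record are hypothetical.

```python
from collections import Counter

# Partition weights from the example on the slide; other partitions would need expert values.
PARTITION_WEIGHTS = {"title": 1.5, "abstract": 1.0}

def combine_partitions(record, vocabulary, weights=PARTITION_WEIGHTS):
    """Collapse a multi-partition record into one weighted vector over the vocabulary.

    Assumption: the membership degree of a term in a partition is its raw count,
    and partitions are merged by a weighted sum.
    """
    vector = [0.0] * len(vocabulary)
    index = {term: i for i, term in enumerate(vocabulary)}
    for partition, text in record.items():
        weight = weights.get(partition, 1.0)  # unknown partitions default to 1.0
        for term, count in Counter(text.lower().split()).items():
            if term in index:
                vector[index[term]] += weight * count
    return vector

record = {"title": "metadata mining", "abstract": "mining metadata with fuzzy sets"}
vocabulary = ["fuzzy", "metadata", "mining", "sets", "with"]
combined = combine_partitions(record, vocabulary)
```

With these weights a term that appears in the title contributes 1.5 per occurrence, so title terms outweigh terms that occur only in the abstract.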
Class-Term Matrix • Document-Term Matrix (Collection × Vocabulary) • The matrix is very large (thousands of documents in the collection and millions of terms in the vocabulary), • The matrix is sparse: usually only a small number of its elements are nonzero (Zipf's law), • The matrix is dual with respect to terms and documents.
Class-Term Matrix • Class-Term Matrix (Class × Vocabulary) • The matrix is large (tens of classes and millions of terms in the vocabulary), • The matrix is less sparse, • The matrix is still dual with respect to terms and classes.
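One way to picture the step from a document-term matrix to a class-term matrix is to sum the rows of all documents that share a class label. The sketch below assumes a single label per document and plain summation; the authors' own class-term frequency formula is not reproduced here, and the data is made up.

```python
def class_term_matrix(doc_term_rows, labels):
    """Sum document-term rows by class label, giving one row per class."""
    classes = sorted(set(labels))
    n_terms = len(doc_term_rows[0])
    matrix = {c: [0] * n_terms for c in classes}
    for row, label in zip(doc_term_rows, labels):
        for j, value in enumerate(row):
            matrix[label][j] += value
    return matrix

# Three documents over a three-term vocabulary, with hypothetical class labels.
doc_term = [[2, 0, 1], [1, 1, 0], [0, 3, 0]]
labels = ["AI", "AI", "DB"]
ct = class_term_matrix(doc_term, labels)
```

Because many document rows collapse into each class row, the result has far fewer rows and fewer zero entries, which matches the "less sparse" observation above.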
Class-Term Matrix • Class-term frequency • Term significance measure • Normalized term significance measure
Class-Term Matrix • Terminology • All terms that occur in a class (or concept) • A fuzzy set of all terms in the vocabulary
Class-Term Matrix • Definition • All concepts (classes) that the term belongs to • A fuzzy set of all concepts (classes)
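Read this way, a terminology is a row of the class-term matrix and a definition is a column. The sketch below assumes membership degrees are obtained by dividing every entry by the largest entry in the matrix, which is a stand-in normalization and not the authors' normalized term significance measure. All names and numbers are hypothetical.

```python
def normalize(matrix):
    """Scale all entries to [0, 1] by the global maximum (an assumed normalization)."""
    peak = max(max(row) for row in matrix.values()) or 1  # avoid division by zero
    return {c: [v / peak for v in row] for c, row in matrix.items()}

def terminology(membership, vocabulary, cls):
    """Fuzzy set over the vocabulary for one class: term -> membership degree."""
    return dict(zip(vocabulary, membership[cls]))

def definition(membership, vocabulary, term):
    """Fuzzy set over the classes for one term: class -> membership degree."""
    j = vocabulary.index(term)
    return {cls: row[j] for cls, row in membership.items()}

vocabulary = ["agent", "query", "index"]
counts = {"AI": [4, 1, 0], "DB": [0, 2, 4]}
mu = normalize(counts)
ai_terminology = terminology(mu, vocabulary, "AI")
query_definition = definition(mu, vocabulary, "query")
```

The two functions read the same matrix along different axes, which is the duality between terms and classes noted earlier.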
Case Study • Data set • No LO metadata repository is available • CiteSeer computer science directory (http://citeseer.ist.psu.edu/directory.html) • ~400,000 terms (vocabulary size) • 17 classes • 2,912 documents • Instead of the full documents (in PDF or PS), we collected BibTeX records (a kind of metadata or catalogue) and the abstracts of the articles.
Case Study • Types of Frequency Measures • Within document: document-term frequency (such as tf-idf) • Within class: class-term frequency (such as term significance) • Within collection: collection-term frequency (such as the mean of term significances)
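For the within-document measure, a common tf-idf variant is tf × log(N / df). The slides do not say which variant was used, so the sketch below is an assumption, and the tokenized documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["fuzzy", "metadata", "metadata"], ["metadata", "mining"]]
w = tf_idf(docs)
```

A term that occurs in every document gets weight zero under this variant, however often it appears in any single document.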
Case Study • Term Clustering: categorizing all terms into three main groups • Features: terms that are frequent within a class • Keywords: terms that are frequent within some documents belonging to a given class • Stopwords: terms that are frequent in all classes • Introducing the Class-Collection Map • To visualize the location of each category
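The three categories can be illustrated with a crisp rule on two scores per term. Note that this replaces the fuzzy inference used in the presentation with hard thresholds, purely for illustration. The threshold, the function name, and the example scores are all made up.

```python
def categorize_term(class_score, collection_score, high=0.5):
    """Assign a term to a category from two scores in [0, 1].

    class_score: how frequent the term is within the given class.
    collection_score: how frequent the term is across all classes.
    The threshold `high` is an arbitrary illustrative value.
    """
    if collection_score >= high:
        return "stopword"   # frequent in all classes
    if class_score >= high:
        return "feature"    # frequent in this class but not everywhere
    return "keyword"        # frequent only in some documents of the class

# Invented (class_score, collection_score) pairs, not results from the case study.
scores = {"learner": (0.9, 0.8), "ontology": (0.7, 0.2), "bibtex": (0.1, 0.05)}
categories = {t: categorize_term(c, g) for t, (c, g) in scores.items()}
```

A fuzzy version would replace the hard comparisons with membership functions over the same two axes, which are the axes of the class-collection map.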
Case Study • Extraction of Stopwords (words that do not contribute to the meaning of the document) • General stopwords (a, an, the, in, …) • Domain-specific stopwords • Politics: government, state, • Medicine: patient, • Education: learner, instructor, • Social sciences: society, • Anthropology: human.
Case Study • Why do we need to remove domain-specific stopwords? • Dimensionality reduction, • More accurate feature selection (information gain has the drawback of selecting noise as features), • Stopwords let us find and separate phrases (by our definition, a phrase is a set of words between two stopwords).
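The phrase definition above translates directly into a split of the token stream at stopwords. The sketch below additionally treats the start and end of the text as boundaries, which is an assumption. The stopword list and sentence are illustrative only.

```python
def split_phrases(tokens, stopwords):
    """Return the runs of consecutive non-stopword tokens as phrases."""
    phrases, current = [], []
    for token in tokens:
        if token.lower() in stopwords:
            if current:  # a stopword closes the phrase collected so far
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:  # text ended without a closing stopword
        phrases.append(" ".join(current))
    return phrases

stopwords = {"the", "of", "in", "a"}
tokens = "the vector space model of learning object metadata in a repository".split()
phrases = split_phrases(tokens, stopwords)
```

Adding domain-specific stopwords to the set changes where phrases are cut, which is why their extraction matters for this step.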
Case Study • Dimensionality reduction process • ~400,000 terms → (using metadata) → 15,971 → (stemming) → 12,044 → (multi-partition document vector space model) → 5,605 → (fuzzy-based term clustering) → 226 features, 507 stopwords, 4,872 keywords
Concluding Remarks • Most statistics-based data mining methods do not use domain knowledge. • Metadata (semi-structured data) mining uses the domain knowledge embedded in tags and partitions. • We introduced a multi-partition document vector space model. • We mine the class-term matrix in addition to the document-term matrix.
Concluding Remarks • Based on the visualization model (class-collection map) and fuzzy inference, we can cluster the vocabulary for each class and extract three essential categories: • Features: to classify unknown documents, • Keywords: for indexing and access to specific documents in IR applications, • Stopwords: for dimensionality reduction and noise removal.
Concluding Remarks • Based on the class-term matrix, we defined • Terminologies as fuzzy sets of all terms in the vocabulary • Definitions as fuzzy sets of all concepts
Concluding Remarks • Future Work • Collecting LO metadata and constructing an LO metadata repository, • A keyword recall method to test and validate the extracted keywords, • Implementing an average classifier (KNN or fuzzy classifier) to test and validate the selected features, • Applying a multi-classifier architecture to the metadata mining problem.