Information Retrieval Project
Creation of clusters of concepts that represent a domain corpus.
Background
• Vector Space Model (VSM).
• Knowledge-based Vector Space Model.
• Wikipedia as a knowledge domain.
• Bag-of-words (BOW) indexing versus knowledge-based indexing.
• Indexing Wikipedia.
• Wikipedia-based concept clustering.
Knowledge-based VSM for Text Clustering
Problem definition:
• Create clusters of related concepts, where each cluster represents a specific knowledge domain.
• Create knowledge-based vectors for the documents in a given corpus, based on term similarity measures within each document.
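To make the difference between a BOW vector and a knowledge-based vector concrete, here is a minimal sketch. It assumes one common formulation in which a concept's weight is the sum, over the document's terms, of term frequency times term-concept relatedness; the slides do not state the exact formula. The `RELATEDNESS` table, the `relatedness` helper, and the toy vocabulary are hypothetical stand-ins for the provided Wikipedia-based relatedness code.

```python
import math
from collections import Counter

# Hypothetical term-term relatedness weights in [0, 1]. In the project these
# values would come from the provided Wikipedia-based relatedness code.
RELATEDNESS = {
    ("car", "vehicle"): 0.9,
    ("engine", "vehicle"): 0.7,
    ("tumor", "cancer"): 0.9,
    ("therapy", "cancer"): 0.6,
}


def relatedness(a, b):
    """Symmetric lookup; identical terms are fully related, unknown pairs are 0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get((a, b), RELATEDNESS.get((b, a), 0.0))


def bow_vector(tokens):
    """Plain bag-of-words vector: term -> raw frequency."""
    return dict(Counter(tokens))


def knowledge_vector(tokens, vocabulary):
    """Assumed formulation: weight(concept) = sum of tf(term) * relatedness(term, concept)."""
    tf = Counter(tokens)
    vector = {}
    for concept in vocabulary:
        weight = sum(count * relatedness(term, concept) for term, count in tf.items())
        if weight > 0:
            vector[concept] = weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vocabulary = ["car", "engine", "vehicle", "tumor", "therapy", "cancer"]
doc_a = ["car", "engine"]
doc_b = ["vehicle"]

# The two documents share no words, so their BOW similarity is zero,
# while the knowledge-based vectors overlap on related concepts.
bow_similarity = cosine(bow_vector(doc_a), bow_vector(doc_b))
knowledge_similarity = cosine(
    knowledge_vector(doc_a, vocabulary), knowledge_vector(doc_b, vocabulary)
)
```

In this toy example the two documents have no terms in common, so BOW indexing treats them as unrelated, whereas the knowledge-based vectors place them close together.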
Given:
• A Wikipedia index.
• Working code for building knowledge-based corpus indexes.
• Working code for computing term-term relatedness weights.
• Working similarity code for retrieving, from Wikipedia, a document similar to an existing one.
• An algorithm for document clustering based on the Wikipedia structure.
Email me at:
• eea7236@louisiana.edu
• Elshaimaa.ali@hotmail.com
Required to implement:
• Build a knowledge-based VSM index for the documents in two different domain corpora, using the given term similarity code.
• Implement the given clustering algorithm, which is based on the Wikipedia structure.
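The slides do not describe the Wikipedia structure-based clustering algorithm itself, so the sketch below is not that algorithm. It is a generic stand-in, a greedy threshold clustering over cosine similarity, meant only to show how knowledge-based vectors could feed a clustering step. The function name `greedy_cluster`, the `threshold` value, and the sample vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_cluster(vectors, threshold=0.5):
    """Assign each vector to the first cluster whose seed is similar enough.

    Each cluster is a list of indices into `vectors`; the first index is the
    seed. A vector that matches no existing seed starts a new cluster.
    """
    clusters = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            seed = vectors[cluster[0]]
            if cosine(vector, seed) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Hypothetical knowledge-based vectors for four documents from two domains.
documents = [
    {"vehicle": 1.6, "car": 1.0},
    {"cancer": 1.5, "tumor": 1.0},
    {"vehicle": 1.0, "car": 0.9},
    {"cancer": 1.0, "therapy": 0.6},
]

clusters = greedy_cluster(documents)
print(clusters)
```

The result depends on document order and on the chosen threshold, which is one reason a structure-aware algorithm such as the one provided for the project may be preferable.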
Tools that will be used
• Wikipedia database dumps (MySQL database).
• JWPL API to access the Wikipedia database dumps.
• Lucene API to build indexes.
• Assistance and code will be provided to help with using the APIs.