Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE
Bruno Pinheiro bfp@cin.ufpe.br
Renato Correa renato.correa@ufpe.br
Outline
• Information Retrieval Systems (IRS)
• IRS + SOM
• Related Work
• Document Collection
• System Architecture
• Methodology
• Results
Information Retrieval Systems (IRS)
• Indexing, searching, and classifying textual documents
• Users' information needs
• Matching users' queries with the system's vocabulary
IRS + SOM
[Figure: Self-Organizing Maps + Information Retrieval System]
IRS + SOM
• Navigation interface built through document maps
• Document maps
  • Self-Organizing Map trained with document vectors
Related Work
• First works (1991-1995)
  • Lin / Merkl
• Large projects (1996-2000)
  • Arizona Digital Library, WEBSOM, SOMLib
• Diversification (2001-2005)
  • LiGHtSOM, GHSOM, H2SOM
• Convergence (2006)
Document Collection
• UFPE Digital Library of Theses and Dissertations (BDTD-UFPE)
  • Offers the full text of all theses and dissertations produced in the university's graduate programs
  • Approximately 6000 documents
  • Linked to the Brazilian BDTD and to the NDLTD (Networked Digital Library of Theses and Dissertations)
System Architecture
• Document Acquisition → documents' content
• Document Indexing → inverted index
• Document Representation → document vectors
• Dimensionality Reduction → reduced vectors
• Volume Reduction → prototype vectors
• Construction of Document Map → document map
• Construction of User Interface
Methodology
• Document acquisition
  • Harvesting through the OAI-PMH protocol
  • XML files containing the documents' metadata
  • Data extraction with the Java library JColtrane
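
The acquisition step can be illustrated with a minimal sketch. It assumes the harvested XML is a standard OAI-PMH `ListRecords` response in the `oai_dc` metadata format, and it uses Python's `xml.etree.ElementTree` as a stand-in for the JColtrane SAX library used in the original work. The sample response and the `extract_records` helper are hypothetical and are not taken from BDTD-UFPE.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 responses carrying oai_dc records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response fragment (not real BDTD-UFPE data).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis title</dc:title>
          <dc:description>Example abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def extract_records(xml_text):
    """Return one dict (identifier, title, description) per harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "description": record.findtext(".//dc:description", default="", namespaces=NS),
        })
    return records


records = extract_records(SAMPLE_RESPONSE)
```

In a real harvest the XML would come from HTTP requests with `verb=ListRecords&metadataPrefix=oai_dc` against the repository's OAI-PMH endpoint; which metadata fields were actually indexed is not stated in the slides.
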
Methodology
• Indexing
  • Java library Lucene
  • Stemming; elimination of digits and stopwords
  • Inverted index built according to the vector space model
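
The indexing steps above can be sketched roughly as follows. This is not Lucene's analyzer chain: the stopword list is a tiny illustrative sample, `crude_stem` is a toy suffix stripper standing in for a real stemmer, and the tokenizer only handles ASCII letters, so it would need adapting for Portuguese text. All function names are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real one would be far longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "from"}


def crude_stem(token):
    """Strip a few English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, keep runs of ASCII letters (so digits are dropped),
    remove stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)
```

For example, indexing `{"d1": "Self-organizing maps for retrieval of 6000 theses", "d2": "Indexing the theses"}` yields a posting `{"d1": 1, "d2": 1}` for the stem `thes` and no entry for `6000`.
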
Methodology
• Document representation
  • Documents are represented by vectors in which the terms are the indexes and the corresponding values are functions of each term's frequency of occurrence in the document
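
A minimal sketch of this representation, assuming an inverted index of the form `term -> {doc_id: term frequency}`. The slides do not say which function of the term frequency was used, so the `log_tf` weighting below is only one common choice, picked here for illustration; all names are hypothetical.

```python
import math


def build_vocabulary(inverted_index):
    """Assign each term a fixed vector position (sorted for reproducibility)."""
    return {term: i for i, term in enumerate(sorted(inverted_index))}


def log_tf(tf):
    """One possible weighting (assumed, not from the slides): 1 + ln(tf)."""
    return 1.0 + math.log(tf)


def document_vectors(inverted_index, weight=log_tf):
    """Return {doc_id: dense list of weights}, one position per term."""
    vocab = build_vocabulary(inverted_index)
    doc_ids = sorted({d for postings in inverted_index.values() for d in postings})
    vectors = {d: [0.0] * len(vocab) for d in doc_ids}
    for term, postings in inverted_index.items():
        for doc_id, tf in postings.items():
            vectors[doc_id][vocab[term]] = weight(tf)
    return vectors
```

Passing `weight=float` instead would give raw term-frequency vectors. With thousands of terms a sparse structure would be more practical than dense lists; dense lists are used here only to keep the sketch short.
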
Methodology
• Dimensionality reduction
  • Feature selection based on word frequency
  • Stopword elimination
  • Final dimensionality: 13095 terms
• Volume reduction
  • Not used
  • Volume: 4781 documents
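
Frequency-based feature selection could look like the sketch below. It assumes that "word frequency" means document frequency and that both very rare and very common terms are discarded; the slides state neither the criterion nor the cut-offs, so `min_df` and `max_df_ratio` are hypothetical parameters.

```python
def select_terms(inverted_index, min_df=2, max_df_ratio=0.5):
    """Keep terms whose document frequency lies within the given bounds.

    min_df and max_df_ratio are hypothetical thresholds, not values from the slides.
    """
    n_docs = len({d for postings in inverted_index.values() for d in postings})
    kept = {}
    for term, postings in inverted_index.items():
        df = len(postings)  # number of documents containing the term
        if df >= min_df and df / n_docs <= max_df_ratio:
            kept[term] = postings
    return kept
```

The number of terms that survive this filter is the dimensionality of the vectors fed to the map.
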
Methodology
• Document map construction
  • Single stage
  • somtoolbox functions for MATLAB
  • Document vectors normalized before training
  • SOM with a rectangular structure (10 x 12) and hexagonal neighborhood
Methodology
• Document map construction
  • Weights initialized linearly along the two principal eigenvectors (those with the largest eigenvalues)
  • Batch SOM algorithm with the dot-product metric
  • Gaussian neighborhood function
  • Neighborhood size decreasing linearly with the number of epochs
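
The training procedure can be sketched in plain Python as follows. This is an illustration, not the somtoolbox implementation: weights start from random unit vectors instead of the linear initialization along the principal eigenvectors, and renormalizing the model vectors after each epoch is a choice made here so that dot products stay comparable. All function names are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def hex_grid(rows, cols):
    """Unit coordinates on a hexagonal lattice: odd rows are shifted by half a unit."""
    return [(c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            for r in range(rows) for c in range(cols)]


def best_matching_unit(weights, x):
    """Dot-product metric: the winner is the unit with the largest inner product."""
    return max(range(len(weights)),
               key=lambda k: sum(w * xi for w, xi in zip(weights[k], x)))


def batch_epoch(weights, data, coords, radius):
    """One batch update: each unit becomes a neighborhood-weighted mean of the data."""
    dim = len(data[0])
    numer = [[0.0] * dim for _ in weights]
    denom = [0.0] * len(weights)
    for x in data:
        bx, by = coords[best_matching_unit(weights, x)]
        for k, (cx, cy) in enumerate(coords):
            # Gaussian neighborhood centred on the winning unit.
            h = math.exp(-((cx - bx) ** 2 + (cy - by) ** 2) / (2 * radius ** 2))
            denom[k] += h
            for j, xj in enumerate(x):
                numer[k][j] += h * xj
    # Renormalize the updated units; keep the old vector if a unit got no weight.
    return [normalize([n / denom[k] for n in numer[k]]) if denom[k] > 0 else weights[k]
            for k in range(len(weights))]


def train(data, rows, cols, radii, seed=0):
    """Run one batch epoch per radius in `radii`, starting from random unit vectors."""
    rng = random.Random(seed)
    data = [normalize(x) for x in data]  # document vectors normalized before training
    coords = hex_grid(rows, cols)
    weights = [normalize([rng.random() for _ in data[0]]) for _ in coords]
    for radius in radii:
        weights = batch_epoch(weights, data, coords, radius)
    return weights, coords
```

A call such as `train(vectors, rows=10, cols=12, radii=schedule)` would correspond to the map size given on the slides, with `schedule` holding one neighborhood radius per epoch.
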
Methodology
• Document map construction
  • Parameters
    • Number of epochs
      • Rough phase: 10 epochs
      • Fine-tuning phase: 10 epochs
    • Neighborhood size
      • Rough phase
        • Initial: [(number of units along the largest dimension) / 2] + 1
        • Final: 2
      • Fine-tuning phase
        • Initial: 2
        • Final: 0.8
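
As a worked example of these parameters for the 10 x 12 map, the sketch below computes the per-epoch neighborhood radii. It assumes the square brackets denote the integer part (which makes no difference for a largest dimension of 12) and that the final radius is reached in the last epoch of each phase; `linear_schedule` is a hypothetical helper.

```python
def linear_schedule(initial, final, epochs):
    """Radii decreasing linearly from `initial` to `final` over `epochs` epochs."""
    if epochs == 1:
        return [float(initial)]
    step = (final - initial) / (epochs - 1)
    return [initial + step * e for e in range(epochs)]


rows, cols = 10, 12
# [(largest dimension) / 2] + 1, with the brackets read as the integer part.
rough_initial = max(rows, cols) // 2 + 1

rough = linear_schedule(rough_initial, 2, 10)  # rough phase: 10 epochs
fine = linear_schedule(2, 0.8, 10)             # fine-tuning phase: 10 epochs
```

Here the rough phase starts at a radius of 12 / 2 + 1 = 7 and shrinks to 2, after which the fine-tuning phase continues from 2 down to 0.8.
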
Methodology
• User interface construction
  • Documents are mapped to the node with the closest model vector in terms of cosine distance
  • Each map node is labeled according to the category
    • Knowledge areas (CHLA, CBS, TCEN)
    • Graduate programs
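
The mapping and labeling steps can be sketched as below. The slides do not state how a node's label is derived from the categories of its documents, so majority voting is an assumption made for this example; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def closest_node(model_vectors, doc_vector):
    """Index of the node at the smallest cosine distance (largest cosine similarity)."""
    return max(range(len(model_vectors)),
               key=lambda k: cosine_similarity(model_vectors[k], doc_vector))


def label_nodes(model_vectors, doc_vectors, categories):
    """Label each occupied node with the most common category among its documents.

    Majority voting is an assumption; the slides do not state the labeling rule.
    """
    votes = defaultdict(Counter)
    for doc_id, vec in doc_vectors.items():
        votes[closest_node(model_vectors, vec)][categories[doc_id]] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}
```

Running `label_nodes` once with knowledge areas and once with graduate programs as `categories` would give the two labelings listed above; nodes that receive no documents stay unlabeled in this sketch.
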
Results
[Figure panels: Knowledge Areas; Graduate Programs]
THANK YOU! Questions?
Bruno Pinheiro bfp@cin.ufpe.br
Renato Correa renato.correa@ufpe.br