Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE

Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE Bruno Pinheiro bfp@cin.ufpe.br Renato Correa renato.correa@ufpe.br

Guide • Information Retrieval Systems (IRS) • IRS + SOM • Related Works • Document Collection • System Architecture • Methodology • Results

Information Retrieval Systems (IRS) • Indexing, searching, and classifying textual documents. • User’s information needs • Matching users’ queries and the system’s vocabulary.

IRS + SOM • Self-Organizing Maps • Information Retrieval System

IRS + SOM • Navigation interface built through document maps • Document maps • Self-Organizing Map trained with document vectors

Related Works • First works (1991 - 1995) • Lin / Merkl • Major projects (1996 - 2000) • Arizona Digital Library, WEBSOM, SOMLib • Diversification (2001 - 2005) • LiGHtSOM, GHSOM, H2SOM • Convergence (2006)

Document Collection • UFPE Digital Library of Theses and Dissertations (BDTD-UFPE) • Offers the full text of all theses and dissertations produced in the university’s graduate programs. • Approximately 6000 documents. • Linked to the Brazilian BDTD and to the NDLTD (Networked Digital Library of Theses and Dissertations)

System Architecture • Document Acquisition → documents’ content • Document Indexing → inverted index • Document Representation → document vectors • Dimensionality Reduction → reduced vectors • Volume Reduction → prototype vectors • Construction of Document Map → document map • Construction of User Interface

Methodology • Document Acquisition • Harvesting process through the OAI-PMH protocol • XML files containing the documents’ metadata • Data extraction through the Java library JColtrane
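The metadata extraction step can be sketched as follows. This is a minimal Python illustration, not the JColtrane-based Java code used in the project. The sample record, the choice of Dublin Core fields, and the helper name `extract_records` are assumptions made for the example; only the namespace URIs come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written ListRecords response (not real BDTD-UFPE data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis title</dc:title>
          <dc:description>Example abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return one dict per harvested record that carries Dublin Core metadata."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            # Records without a metadata section (e.g. deleted ones) are skipped.
            continue
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "abstract": dc.findtext("dc:description", default="", namespaces=NS),
        })
    return records


records = extract_records(SAMPLE)
```

In a real harvest the XML would come from the repository's OAI-PMH endpoint rather than from an inline string.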

Methodology • Indexing • Java library Lucene. • Stemming; elimination of digits and stopwords. • Inverted index built following the vector space model.
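The indexing steps above can be illustrated with a small sketch. It is written in plain Python and does not use Lucene. The stopword list, the suffix-stripping rule in `crude_stem`, and all function names are hypothetical stand-ins for the real analyzers.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not the list used in the project).
STOPWORDS = {"the", "of", "and", "a", "to", "in"}


def crude_stem(token):
    """Placeholder suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep letter-only tokens (digits are dropped), remove stopwords, stem."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {document id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {"d1": "Self-organizing maps of theses", "d2": "Maps and 2 indexes"}
index = build_inverted_index(docs)
```

Storing the per-document term counts in the postings is what later allows frequency-based document vectors to be read directly from the index.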

Methodology • Document representation • Documents are represented by vectors in which each index corresponds to a term and each value is a function of that term’s frequency of occurrence in the document.
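A minimal sketch of this representation, assuming tokenized documents as input. The slide does not say which function of the term frequency was used, so the example shows two common choices (raw counts and a logarithmic scaling) purely as illustrations; the function names are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Sorted list of distinct terms; position in the list is the vector index."""
    return sorted({term for doc in tokenized_docs for term in doc})


def tf_vector(tokens, vocabulary, log_scale=False):
    """Document vector whose values are a function of term frequency.

    log_scale=False gives raw counts; log_scale=True gives 1 + ln(count)
    for terms that occur and 0 otherwise. Both are assumed examples.
    """
    counts = Counter(tokens)
    if log_scale:
        return [1.0 + math.log(counts[t]) if counts[t] else 0.0 for t in vocabulary]
    return [float(counts[t]) for t in vocabulary]


tokenized = [["som", "map", "map"], ["index", "map"]]
vocabulary = build_vocabulary(tokenized)
vectors = [tf_vector(doc, vocabulary) for doc in tokenized]
```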

Methodology • Dimensionality reduction • Feature selection based on word frequency • Stopword elimination • Final dimensionality: 13095 terms • Volume reduction • Not used. • Volume: 4781 documents
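Frequency-based feature selection can be sketched as a document-frequency filter. The slide gives no thresholds, so `min_df` and `max_df_ratio` below are hypothetical parameters chosen only to make the example concrete.

```python
from collections import Counter


def select_terms(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that are neither too rare nor too common.

    A term is kept if it occurs in at least min_df documents and in at most
    max_df_ratio of all documents. Both thresholds are assumptions.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, n in doc_freq.items()
        if n >= min_df and n / n_docs <= max_df_ratio
    )


toy_docs = [
    ["som", "map"],
    ["som", "index"],
    ["som", "map", "rare"],
    ["som", "index"],
]
kept = select_terms(toy_docs)
```

In this toy collection "som" appears in every document and "rare" in only one, so both are discarded and the vector dimensionality drops from four terms to two.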

Methodology • Document map construction • Single stage • somtoolbox functions for MATLAB • Document vectors normalized before training • SOM with rectangular structure (10 x 12) and hexagonal neighborhood

Methodology • Document map construction • Weights initialized linearly along the two eigenvectors with the largest eigenvalues • Batch-type SOM algorithm with dot product metric • Gaussian neighborhood function • Neighborhood size linearly decreasing with the number of epochs
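The training procedure described above can be sketched in plain Python. This is not the somtoolbox code: it substitutes random sample initialization for the linear initialization along eigenvectors, the hexagonal unit coordinates (odd rows shifted by half a unit, rows spaced by sqrt(3)/2) are an assumed layout, and all function names are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hex_coords(rows, cols):
    """Unit positions on a hexagonal lattice (assumed layout)."""
    return [(c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            for r in range(rows) for c in range(cols)]


def train_batch_som(data, rows, cols, radii, seed=0):
    """Batch SOM with dot-product matching and a Gaussian neighborhood.

    radii holds one neighborhood radius per epoch.
    """
    rng = random.Random(seed)
    data = [normalize(x) for x in data]
    coords = hex_coords(rows, cols)
    n_units, dim = rows * cols, len(data[0])
    # Simplification: initialize from random samples, not along eigenvectors.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    for radius in radii:
        # With unit-length inputs, the best-matching unit maximizes the dot product.
        bmus = [max(range(n_units), key=lambda j: dot(weights[j], x)) for x in data]
        new_weights = []
        for j in range(n_units):
            acc = [0.0] * dim
            for x, b in zip(data, bmus):
                d2 = (coords[j][0] - coords[b][0]) ** 2 + (coords[j][1] - coords[b][1]) ** 2
                h = math.exp(-d2 / (2.0 * radius * radius))
                for k in range(dim):
                    acc[k] += h * x[k]
            # Renormalizing the neighborhood-weighted sum replaces the usual
            # division by the sum of neighborhood values.
            new_weights.append(normalize(acc))
        weights = new_weights
    return weights, coords


toy_data = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
weights, coords = train_batch_som(toy_data, rows=2, cols=3, radii=[2.0, 1.4, 0.8])
```

Because every model vector is renormalized after each epoch, comparing documents and units by dot product is equivalent to comparing them by cosine similarity.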

Methodology • Document map construction • Parameters • Number of epochs • Rough phase: 10 epochs • Fine-tuning phase: 10 epochs • Neighborhood size • Rough phase • Initial: [(number of units along the largest map dimension) / 2] + 1 • Final: 2 • Fine-tuning phase • Initial: 2 • Final: 0.8
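For the 10 x 12 map, the largest dimension has 12 units, so the initial rough-phase neighborhood size is 12 / 2 + 1 = 7. The sketch below computes the per-epoch values under two assumptions: the decrease is linear and reaches the final value exactly at the last epoch, and the bracket in the formula is read as plain division (the rounding does not matter for an even dimension such as 12). The function names are hypothetical.

```python
def linear_schedule(start, end, epochs):
    """Values going linearly from start (first epoch) to end (last epoch)."""
    if epochs == 1:
        return [float(start)]
    step = (end - start) / (epochs - 1)
    return [start + i * step for i in range(epochs)]


def som_radius_schedule(rows, cols, rough_epochs=10, fine_epochs=10):
    """Neighborhood sizes for the rough and fine-tuning phases."""
    initial = max(rows, cols) / 2 + 1
    rough = linear_schedule(initial, 2.0, rough_epochs)
    fine = linear_schedule(2.0, 0.8, fine_epochs)
    return rough, fine


rough, fine = som_radius_schedule(10, 12)
```

The two lists can be concatenated to obtain one neighborhood size per epoch over the 20 training epochs.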

Methodology • User interface construction • Documents are mapped to the node with the closest model vector in terms of cosine distance • Each map node is labeled according to category • Knowledge areas (CHLA, CBS, TCEN) • Graduate programs
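The mapping and labeling steps can be sketched as follows. Assigning each document to the node with the highest cosine similarity follows the slide; labeling a node with the most frequent category among its documents is an assumption, since the slide does not state the labeling rule. The toy vectors and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_documents(doc_vectors, model_vectors):
    """Index of the closest model vector (highest cosine similarity) per document."""
    return [
        max(range(len(model_vectors)),
            key=lambda j: cosine_similarity(v, model_vectors[j]))
        for v in doc_vectors
    ]


def label_nodes(assignments, categories):
    """Label each occupied node with its most frequent category (assumed rule)."""
    by_node = defaultdict(Counter)
    for node, category in zip(assignments, categories):
        by_node[node][category] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in by_node.items()}


model_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[2.0, 0.1], [0.9, 0.2], [0.1, 3.0]]
categories = ["CBS", "CBS", "TCEN"]  # knowledge-area codes named on the slide

assignments = assign_documents(doc_vectors, model_vectors)
labels = label_nodes(assignments, categories)
```

The same two functions can be run once with knowledge areas and once with graduate programs as the category list to produce the two labeled views of the map.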

Results

Results • Knowledge Areas • Graduate Programs

Acknowledgement

THANK YOU! Questions? Bruno Pinheiro bfp@cin.ufpe.br Renato Correa renato.correa@ufpe.br
