Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks Igor Mokriš, Lenka Skovajsová Institute of Informatics, SAS Bratislava, Slovakia mokris@aoslm.sk, skovajsova@aoslm.sk
Summary Development of a neural network model for information retrieval from text documents in the Slovak language, based on the vector space model of document representation Key words: Information Retrieval, Queries, Keywords, Text Documents, Neural Networks, Slovak Language
Text Document Analysis The most common approaches: • Statistical – analyses the words in text documents by comparing them with keywords • Linguistic – extracts linguistic units from the text – phoneme, morpheme, lexeme, ... • Knowledge-based – uses domain models of documents described by an ontology • Porter stemming algorithm for English
The Slovak language is more complicated • Inflection of Slovak – grammatical forms of nouns, adjectives, pronouns, ... • Complicated word formation – conjugation and declension, prefixes and suffixes, ... • Synonyms and homonyms • Phrases containing more than one word • And so on
System for Information Retrieval in STD (Furdík, K.: Information Retrieval in Natural Language by Hypertext Structure, 2003) [Block diagram of the system with User, Indexation, Document and Administrator components]
How to continue – Utilization of Neural Networks A well trained NN is able: • to simplify the Slovak text analysis, • to be invariant with respect to the inflection of Slovak words, • to perform linguistic analysis faster Disadvantage: • problems with learning and the static structure of the NN
System for Information Retrieval Can be simplified
It means – a 3-layer information retrieval system • The most simplified structure of the system, consisting of a query layer, a keyword layer and a document layer
Next solution – representation of the query, keyword and document layers by neural networks
Development of the 1st NN for Keyword Determination 1st NN – feed-forward NN of the back-propagation type
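A minimal MATLAB sketch of this first network, assuming the Neural Network Toolbox; the matrix name QTrS follows the Experiments slide, while KwTargets, the query encoding and the hidden-layer size are illustrative assumptions:
% Hypothetical sketch: the 1st NN maps an encoded query (12 values,
% one per character position) to activations of the 20 keywords.
% QTrS      - 12 x 164 matrix of encoded training queries (assumed encoding)
% KwTargets - 20 x 164 matrix of target keyword activations (assumed name)
net = feedforwardnet(25);           % one hidden layer with 25 neurons (assumed size)
net.trainFcn = 'traingd';           % gradient-descent back-propagation training
net = train(net, QTrS, KwTargets);  % supervised training on the training set
kwActivation = net(QTrS(:, 1));     % keyword activations for the first query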
Development of the 2nd NN for Document Determination – Vector Space Model K (k x d) – vector space matrix, k_kd – frequency of keyword k in document d, k – number of keywords, d – number of documents
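An illustrative way to assemble this matrix in MATLAB, assuming docs is a cell array whose j-th cell holds the word list of document j and keywords is a cell array of keyword strings (both variable names are assumptions):
% Build the keyword-by-document frequency matrix K (k x d).
k = numel(keywords);
d = numel(docs);
K = zeros(k, d);
for j = 1:d
    for i = 1:k
        K(i, j) = sum(strcmp(docs{j}, keywords{i}));  % occurrences of keyword i in document j
    end
end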
NN with Spreading Activation Function – Determination of Documents
NN with Spreading Activation Function • The SAF NN does not learn • Its weights are set directly by the equation W = K
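A minimal sketch of this untrained second stage, assuming kwActivation is the keyword activation vector produced by the first network above:
% The 2nd NN has no training phase: its weight matrix is the VS matrix itself.
W = K;                                      % W = K, size k x d
docScores = kwActivation' * W;              % 1 x d vector of document activations
[~, ranking] = sort(docScores, 'descend');  % documents ordered by decreasing relevance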
Experiments • Model of the cascade NN in MATLAB • Query layer – 12 characters • Keyword layer – 20 keywords • Document layer – 90 documents • Each document – approx. 50 words • QTrS – 164 queries of the training set • KwTrS – 20 keywords of the training set • 2nd NN is not trained
Experiments • 1st experiment • QTsS1 – 185 queries; queries from QTsS1 correspond to keywords from KwTrS • Precision 0.996 • 2nd experiment • QTsS2 – 100 queries; queries from QTsS2 correspond to no keywords from KwTrS • Precision 0.97
Disadvantage of the VS Model Approach • Large dimension of the VS matrix • Next approach – dimension reduction of the VS matrix – Latent Semantic Model
Latent Semantic Model Singular value decomposition of the vector space matrix K: K = U S V^T • U – row-oriented eigenvectors of K.K^T • V – column-oriented eigenvectors of K^T.K • S – diagonal matrix of singular values of K • dim(S) < dim(K)
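In MATLAB the decomposition can be computed directly (a sketch; K is the vector space matrix built above):
% Full singular value decomposition of the vector space matrix.
[U, S, V] = svd(K);          % K equals U*S*V' up to rounding error
singularValues = diag(S);    % singular values of K in descending order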
VS Matrix Dimension Reduction – Truncated SVD Sr < S, k – number of singular values s_i, r < k, r – number of singular values s_i kept after dimension reduction The number of elements of the reduced matrices is lower than the number of elements of the matrix K
Solution of VSM Dimension Reduction Document relevance D is defined by: D = Q x K, Q – matrix of queries, K – VS matrix Reduced document relevance Dr is defined by: Dr = Q x Kr, Kr = U.Sr.V^T – reduced VS matrix
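A sketch of the reduction in MATLAB, assuming Q stores the query vectors as rows (queries x keywords) and r is a user-chosen number of retained singular values:
% Truncated SVD: keep only the r largest singular values.
r = 10;                       % assumed choice of retained singular values
Sr = S;
Sr(r+1:end, r+1:end) = 0;     % zero out the remaining singular values
Kr = U * Sr * V';             % reduced (rank-r) vector space matrix
Dr = Q * Kr;                  % reduced document relevance for the query set Q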
Experiments • A collection of 90 documents with 20 keywords – vector space matrix • Dimension reduction by truncated singular value decomposition • For each chosen number of singular values, computation of the precision, recall, and the absolute and relative number of elements k_il (see the sketch below)
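An illustrative experiment loop over the tested numbers of singular values (evaluateRetrieval is a hypothetical helper; precision and recall are defined on the next slide):
% Evaluate retrieval quality on the reduced VS matrix for each tested r.
for r = [1 2 3 5 7 10 15 20]
    Sr = S;
    Sr(r+1:end, r+1:end) = 0;
    Kr = U * Sr * V';
    [P, R] = evaluateRetrieval(Q, Kr);   % hypothetical evaluation of precision and recall
    fprintf('r = %2d   precision = %.3f   recall = %.3f\n', r, P, R);
end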
Evaluation of Experiments – precision, recall, number of elements of the reduced VS matrix R – recall: R = nretrel / nrel • nretrel – number of retrieved relevant documents • nrel – number of relevant documents P – precision: P = nretrel / nret • nret – number of retrieved documents
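A minimal MATLAB sketch of the two measures, assuming retrieved and relevant are vectors of document indices (both names are illustrative):
% Precision and recall for one query.
nretrel   = numel(intersect(retrieved, relevant));  % retrieved AND relevant documents
recall    = nretrel / numel(relevant);              % R = nretrel / nrel
precision = nretrel / numel(retrieved);             % P = nretrel / nret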
Results (si – number of retained singular values; Absolute / Relative – number of elements of the reduced matrices)
si   Precision   Recall   Absolute   Relative
 1    0.7942      0.24       110      0.632
 2    0.95        0.314      121      0.695
 3    0.95        0.405      137      0.787
 5    0.975       0.512      148      0.850
 7    0.977       0.634      161      0.925
10    1.0         0.754      165      0.948
15    1.0         0.95       173      0.994
20    1.0         1.0        174      1.0
Conclusion • As the table shows, precision and recall grow with the number of retained singular values: precision reaches 1.0 already with 10 singular values, while full recall requires all 20, so a substantial dimension reduction of the VS matrix is possible with only a modest loss in recall