Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks Igor Mokriš, Lenka Skovajsová Institute of Informatics, SAS Bratislava, Slovakia mokris@aoslm.sk, skovajsova@aoslm.sk
Summary Development of a neural network model for information retrieval from text documents in the Slovak language, based on the vector space model of document representation Key words: Information Retrieval, Queries, Keywords, Text Documents, Neural Networks, Slovak Language
Text Document Analysis The most common approaches: • Statistical – analyses the words in text documents by comparing them with keywords • Linguistic – extracts linguistic units from the text – phoneme, morpheme, lexeme, ... • Knowledge-based – uses domain models of documents described by an ontology • Porter stemming algorithm for English
The Slovak language is more complicated • Inflection of Slovak – grammatical forms of nouns, adjectives, pronouns, ... • Complicated word formation – conjugation and declension, prefixes and suffixes, ... • Synonyms and homonyms • Phrases containing more than one word • And so on
System for Information Retrieval in STD (Furdík, K.: Information Retrieval in Natural Language by Hypertext Structure, 2003) [Block diagram of the system with User, Indexation, Document and Administrator components]
How to continue – Utilization of Neural Networks A well trained NN is able: • to simplify the Slovak text analysis, • to be invariant with respect to the inflection of Slovak words, • to perform linguistic analysis faster Disadvantage: • problems with learning and the static structure of the NN
System for Information Retrieval Can be simplified
It means – a 3-layer information retrieval system • The most simplified structure of the system, consisting of a query layer, a keyword layer and a document layer
Next solution – representation of the query, keyword and document layers by neural networks
Development of the 1st NN for Keyword Determination 1st NN – feed-forward NN of the back-propagation type
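A minimal MATLAB sketch of this first network, assuming the Neural Network Toolbox; the matrix name QTrS follows the Experiments slide, while KwTargets, the query encoding and the hidden-layer size are illustrative assumptions:
% Hypothetical sketch: the 1st NN maps an encoded query (12 values,
% one per character position) to activations of the 20 keywords.
% QTrS      - 12 x 164 matrix of encoded training queries (assumed encoding)
% KwTargets - 20 x 164 matrix of target keyword activations (assumed name)
net = feedforwardnet(25);           % one hidden layer with 25 neurons (assumed size)
net.trainFcn = 'traingd';           % gradient-descent back-propagation training
net = train(net, QTrS, KwTargets);  % supervised training on the training set
kwActivation = net(QTrS(:, 1));     % keyword activations for the first query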
Development of the 2nd NN for Document Determination – Vector Space Model K (k x d) – vector space matrix, k_kd – frequency of keyword k in document d, k – number of keywords, d – number of documents
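An illustrative way to assemble this matrix in MATLAB, assuming docs is a cell array whose j-th cell holds the word list of document j and keywords is a cell array of keyword strings (both variable names are assumptions):
% Build the keyword-by-document frequency matrix K (k x d).
k = numel(keywords);
d = numel(docs);
K = zeros(k, d);
for j = 1:d
    for i = 1:k
        K(i, j) = sum(strcmp(docs{j}, keywords{i}));  % occurrences of keyword i in document j
    end
end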
NN with Spreading Activation Function – Determination of Documents
NN with Spreading Activation Function • The SAF NN does not learn • Its weights are set directly by the equation W = K
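A minimal sketch of this untrained second stage, assuming kwActivation is the keyword activation vector produced by the first network above:
% The 2nd NN has no training phase: its weight matrix is the VS matrix itself.
W = K;                                      % W = K, size k x d
docScores = kwActivation' * W;              % 1 x d vector of document activations
[~, ranking] = sort(docScores, 'descend');  % documents ordered by decreasing relevance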
Experiments • Model of the cascade NN in MATLAB • Query layer – 12 characters • Keyword layer – 20 keywords • Document layer – 90 documents • Each document – approx. 50 words • QTrS – 164 queries of the training set • KwTrS – 20 keywords of the training set • 2nd NN is not trained
Experiments • 1st experiment • QTsS1 – 185 queries; queries from QTsS1 correspond to keywords from KwTrS • Precision 0.996 • 2nd experiment • QTsS2 – 100 queries; queries from QTsS2 correspond to no keywords from KwTrS • Precision 0.97
Disadvantage of the VS Model Approach • Large dimension of the VS matrix • Next approach – dimension reduction of the VS matrix – Latent Semantic Model
Latent Semantic Model Singular value decomposition of the vector space matrix K: K = U S V^T • U – row-oriented eigenvectors of K.K^T • V – column-oriented eigenvectors of K^T.K • S – diagonal matrix of singular values of K • dim(S) < dim(K)
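In MATLAB the decomposition can be computed directly (a sketch; K is the vector space matrix built above):
% Full singular value decomposition of the vector space matrix.
[U, S, V] = svd(K);          % K equals U*S*V' up to rounding error
singularValues = diag(S);    % singular values of K in descending order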
VS Matrix Dimension Reduction – Truncated SVD Sr < S, k – number of singular values s_i, r < k, r – number of singular values s_i kept after dimension reduction The number of elements of the reduced matrices is lower than the number of elements of the matrix K
Solution of VSM Dimension Reduction Document relevance D is defined by: D = Q x K, Q – matrix of queries, K – VS matrix Reduced document relevance Dr is defined by: Dr = Q x Kr, Kr = U.Sr.V^T – reduced VS matrix
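A sketch of the reduction in MATLAB, assuming Q stores the query vectors as rows (queries x keywords) and r is a user-chosen number of retained singular values:
% Truncated SVD: keep only the r largest singular values.
r = 10;                       % assumed choice of retained singular values
Sr = S;
Sr(r+1:end, r+1:end) = 0;     % zero out the remaining singular values
Kr = U * Sr * V';             % reduced (rank-r) vector space matrix
Dr = Q * Kr;                  % reduced document relevance for the query set Q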
Experiments • A collection of 90 documents with 20 keywords – vector space matrix • Dimension reduction by truncated singular value decomposition • For each chosen number of singular values, computation of the precision, recall, and the absolute and relative number of elements k_il (see the sketch below)
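An illustrative experiment loop over the tested numbers of singular values (evaluateRetrieval is a hypothetical helper; precision and recall are defined on the next slide):
% Evaluate retrieval quality on the reduced VS matrix for each tested r.
for r = [1 2 3 5 7 10 15 20]
    Sr = S;
    Sr(r+1:end, r+1:end) = 0;
    Kr = U * Sr * V';
    [P, R] = evaluateRetrieval(Q, Kr);   % hypothetical evaluation of precision and recall
    fprintf('r = %2d   precision = %.3f   recall = %.3f\n', r, P, R);
end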
Evaluation of Experiments – precision, recall, number of elements of the reduced VS matrix R – recall: R = nretrel / nrel • nretrel – number of retrieved relevant documents • nrel – number of relevant documents P – precision: P = nretrel / nret • nret – number of retrieved documents
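A minimal MATLAB sketch of the two measures, assuming retrieved and relevant are vectors of document indices (both names are illustrative):
% Precision and recall for one query.
nretrel   = numel(intersect(retrieved, relevant));  % retrieved AND relevant documents
recall    = nretrel / numel(relevant);              % R = nretrel / nrel
precision = nretrel / numel(retrieved);             % P = nretrel / nret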
Results (si – number of retained singular values; Absolute / Relative – number of elements of the reduced matrices)
si   Precision   Recall   Absolute   Relative
 1    0.7942      0.24       110      0.632
 2    0.95        0.314      121      0.695
 3    0.95        0.405      137      0.787
 5    0.975       0.512      148      0.850
 7    0.977       0.634      161      0.925
10    1.0         0.754      165      0.948
15    1.0         0.95       173      0.994
20    1.0         1.0        174      1.0
Conclusion • As the table shows, precision and recall grow with the number of retained singular values: precision reaches 1.0 already with 10 singular values, while full recall requires all 20, so a substantial dimension reduction of the VS matrix is possible with only a modest loss in recall