
Igor Mokriš, Lenka Skovajsová Institute of Informatics, SAS Bratislava, Slovakia

Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks. Igor Mokriš, Lenka Skovajsová Institute of Informatics, SAS Bratislava, Slovakia mokris@aoslm.sk, skovajsova@aoslm.sk. Summary.


Presentation Transcript


  1. Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks Igor Mokriš, Lenka Skovajsová Institute of Informatics, SAS Bratislava, Slovakia mokris@aoslm.sk, skovajsova@aoslm.sk

  2. Summary: Development of a neural network model for information retrieval from text documents in the Slovak language, using the vector space model of document representation • Key words: Information Retrieval, Queries, Keywords, Text Documents, Neural Networks, Slovak Language

  3. Text Document Analysis: the most common approaches • Statistical: analyses words in text documents by comparing them with keywords • Linguistic: extracts linguistic units from text (phoneme, morpheme, lexeme, ...) • Knowledge-based: uses domain models of documents described by an ontology • Porter algorithm for English

  4. The Slovak language is more complicated • Inflection of the Slovak language: grammatical forms of nouns, adjectives, pronouns, ... • Complicated word forms: conjugation and declension, prefixes and suffixes, ... • Synonyms and homonyms • Phrases containing more than one word • And so on

  5. System for Information Retrieval in STD (Furdík, K.: Inf. Retrieval in Nat. Language by Hypertext Structure, 2003) • [Figure: block diagram with the labels User, Indexation, Document, Administrator]

  6. How to continue: utilization of neural networks • A well-trained NN is able: • to simplify the analysis of Slovak text, • to be invariant to the inflection of Slovak words, • to perform linguistic analysis faster • Disadvantage: • problems with learning and the static structure of the NN

  7. System for Information Retrieval: can be simplified

  8. This means a 3-layer information retrieval system • The most simplified structure of the system, with the layers Queries, Keywords, Documents

  9. Next solution: representation of the query, keyword and document layers by neural networks

  10. Development of 1st NN for Keyword Determination • 1st NN: feed-forward NN of back-propagation type
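As a rough illustration of a feed-forward network trained by back-propagation that maps a query to keyword activations, the sketch below may help. It is not the authors' Matlab model: the character encoding (`encode`), the hidden-layer size, the learning rate, and the toy queries and two-keyword targets are all assumptions made for the example; only the 12-character query length comes from the slides.

```python
import numpy as np

def encode(query, length=12):
    """Hypothetical encoding: pad/cut to `length` chars, scale codes to [0, 1]."""
    padded = query.lower()[:length].ljust(length)
    return np.array([(ord(c) % 128) / 127.0 for c in padded])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data (assumed): inflected word forms mapped to one of two keywords.
queries = ["siet", "siete", "dokument", "dokumenty"]
targets = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
X = np.stack([encode(q) for q in queries])

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (12, 8))   # query layer -> hidden layer
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 2))    # hidden layer -> keyword layer
b2 = np.zeros(2)

def forward(X):
    H = sigmoid(X @ W1 + b1)
    return H, sigmoid(H @ W2 + b2)

def loss(Y):
    return float(np.mean((Y - targets) ** 2))

initial_loss = loss(forward(X)[1])
lr = 0.5
for _ in range(2000):
    H, Y = forward(X)
    dY = (Y - targets) * Y * (1 - Y)   # output-layer delta
    dH = (dY @ W2.T) * H * (1 - H)     # delta propagated back to hidden layer
    W2 -= lr * (H.T @ dY)
    b2 -= lr * dY.sum(axis=0)
    W1 -= lr * (X.T @ dH)
    b1 -= lr * dH.sum(axis=0)
final_loss = loss(forward(X)[1])
print(initial_loss, final_loss)
```

The keyword activations produced by such a network would then feed the second, untrained network of the cascade.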

  11. Development of 2nd NN for Document Determination: Vector Space Model • $K$ ($m \times n$): vector space matrix • $k_{kd}$: frequency of keywords in documents • k: number of keywords • d: number of documents
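A minimal sketch of how such a keyword-by-document frequency matrix could be built is given here. The function name `vector_space_matrix`, the whitespace tokenization, and the toy documents are assumptions for illustration; no stemming or handling of Slovak inflection is attempted.

```python
from collections import Counter

def vector_space_matrix(keywords, documents):
    """Return K as a list of rows: K[i][j] = count of keywords[i] in documents[j]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    return [[c[kw] for c in counts] for kw in keywords]

# Toy data (assumed, not from the presentation).
keywords = ["neural", "network", "retrieval"]
documents = [
    "neural network for retrieval",
    "retrieval of text documents retrieval",
    "neural neural network",
]

K = vector_space_matrix(keywords, documents)
for kw, row in zip(keywords, K):
    print(kw, row)
```

Each row corresponds to a keyword and each column to a document, matching the k-by-d layout described on the slide.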

  12. NN with Spreading Activation Function: Determination of Documents

  13. NN with Spreading Activation Function • The SAF NN is not trained • Weights are set by the equation W = K
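Setting W = K without training can be sketched as follows. The slides do not state the exact activation rule, so a plain weighted sum of keyword activations is assumed here; the matrix values and the helper name `spread_activation` are hypothetical.

```python
def spread_activation(keyword_activations, W):
    """Document activations as a weighted sum of keyword activations (no training)."""
    n_docs = len(W[0])
    return [
        sum(a * W[i][j] for i, a in enumerate(keyword_activations))
        for j in range(n_docs)
    ]

# Hypothetical vector space matrix: 3 keywords x 4 documents.
K = [
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [1, 0, 0, 0],
]
W = K  # weights are copied from K, not learned

keyword_layer = [1, 0, 1]  # keywords 0 and 2 are active
doc_activations = spread_activation(keyword_layer, W)

# Rank documents by activation, highest first.
ranking = sorted(range(len(doc_activations)), key=lambda j: -doc_activations[j])
print(doc_activations, ranking)
```

Because the weights are read directly from the frequency matrix, adding a document only adds a column to W and requires no retraining of this layer.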

  14. Experiments • Model of the cascade NN in Matlab • Query layer: 12 characters • Keyword layer: 20 keywords • Document layer: 90 documents • Each document: approx. 50 words • QTrS: training set of 164 queries • KwTrS: training set of 20 keywords • The 2nd NN is not trained

  15. Experiments • 1st experiment • QTsS1: 185 queries; the queries in QTsS1 correspond to keywords from KwTrS • Precision: 0.996 • 2nd experiment • QTsS1: 100 queries; the queries in QTsS1 correspond to no keywords from KwTrS • Precision: 0.97

  16. Disadvantage of the VS Model Approach • Large dimension of the VS matrix • Next approach: dimension reduction of the VS matrix (Latent Semantic Model)

  17. Latent Semantic Model: Singular Value Decomposition of the Vector Space Matrix K • $K = U S V^T$ • U: row-oriented eigenvectors, i.e. eigenvectors of $K K^T$ • V: column-oriented eigenvectors, i.e. eigenvectors of $K^T K$ • S: diagonal matrix of the singular values of K (square roots of the eigenvalues of $K K^T$) • dim(S) < dim(K)

  18. VS Matrix Dimension Reduction: Truncated SVD • $S_r < S$ • k: number of singular values $s_i$ • $r < k$ • r: number of $s_i$ after dimension reduction • The number of elements of the reduced matrices is lower than the number of elements in the matrix K
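A truncated SVD of this kind can be sketched with NumPy as below. The random 20 x 90 matrix only stands in for the real keyword-document matrix, and the choice r = 5 and the helper name `truncated_svd` are assumptions for the example.

```python
import numpy as np

def truncated_svd(K, r):
    """Keep only the r largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

# Random stand-in for a 20-keyword x 90-document frequency matrix.
rng = np.random.default_rng(0)
K = rng.integers(0, 4, size=(20, 90)).astype(float)

r = 5
U_r, s_r, Vt_r = truncated_svd(K, r)
K_r = U_r @ np.diag(s_r) @ Vt_r  # rank-r approximation of K

# The approximation error is determined by the discarded singular values.
s_full = np.linalg.svd(K, compute_uv=False)
error = np.linalg.norm(K - K_r)
print(K_r.shape, np.linalg.matrix_rank(K_r), error)
```

Here `K_r` has the same shape as `K` but rank r, and its Frobenius-norm distance from `K` equals the root of the sum of squares of the dropped singular values.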

  19. Solution of VSM Dimension Reduction • Document relevance D is defined by $D = Q \times K$ • Q: set of queries • K: VS matrix • Reduced document relevance $D_r$ is defined by $D_r = Q \times K_r$ • $K_r = U S_r V^T$: reduced VS matrix
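The two relevance definitions can be compared on a small example. This is a sketch under assumptions: queries are taken to be rows of keyword indicators (so Q has one column per keyword), and the matrices and the helper `reduced_matrix` are made up for illustration.

```python
import numpy as np

def reduced_matrix(K, r):
    """K_r = U S_r V^T, built from the r largest singular values of K."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Hypothetical VS matrix: 3 keywords x 4 documents.
K = np.array([[2., 0., 1., 0.],
              [0., 3., 1., 0.],
              [1., 0., 0., 2.]])

# Two hypothetical queries as keyword-indicator rows.
Q = np.array([[1., 0., 1.],
              [0., 1., 0.]])

D = Q @ K                          # full relevance, D = Q x K
D_full = Q @ reduced_matrix(K, 3)  # r equal to the rank of K reproduces D
D_2 = Q @ reduced_matrix(K, 2)     # reduced relevance with r = 2

print(D)
print(np.round(D_2, 3))
```

Keeping all singular values gives back D up to rounding, while smaller r trades some accuracy of the relevance scores for a lower-rank matrix.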

  20. Experiments • Collection of 90 documents with 20 keywords: vector space matrix • Dimension reduction by truncated singular value decomposition • For each chosen number of singular values, computation of the precision, the recall, and the absolute and relative number of elements $k_{il}$

  21. Evaluation of Experiments: precision, recall, number of elements of VSred • R: recall, $R = n_{retrel} / n_{rel}$ • $n_{retrel}$: number of retrieved relevant documents • $n_{rel}$: number of relevant documents • P: precision, $P = n_{retrel} / n_{ret}$ • $n_{ret}$: number of retrieved documents
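The two measures can be computed directly from sets of document identifiers, as in this short sketch. The function name, the example identifiers, and the convention of returning 0.0 for an empty set are assumptions, not part of the presentation.

```python
def precision_recall(retrieved, relevant):
    """P = n_retrel / n_ret and R = n_retrel / n_rel for sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    n_retrel = len(retrieved & relevant)  # retrieved AND relevant
    precision = n_retrel / len(retrieved) if retrieved else 0.0  # assumed convention
    recall = n_retrel / len(relevant) if relevant else 0.0       # assumed convention
    return precision, recall

# Hypothetical result: 4 documents retrieved, 5 relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
```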

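  22. Results

      | Number of singular values $s_i$ | Precision | Recall | Absolute number of elements | Relative number of elements |
      |---|---|---|---|---|
      | 1 | 0.7942 | 0.24 | 110 | 0.632 |
      | 2 | 0.95 | 0.314 | 121 | 0.695 |
      | 3 | 0.95 | 0.405 | 137 | 0.787 |
      | 5 | 0.975 | 0.512 | 148 | 0.850 |
      | 7 | 0.977 | 0.634 | 161 | 0.925 |
      | 10 | 1.0 | 0.754 | 165 | 0.948 |
      | 15 | 1.0 | 0.95 | 173 | 0.994 |
      | 20 | 1.0 | 1.0 | 174 | 1.0 |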

  23. Conclusion • follows from table
