
A Chinese Information Retrieval System Using SDD


caroun


Presentation Transcript


  1. A Chinese Information Retrieval System Using SDD • Introduction • SDD algorithm • Compute SDD • Term Weighting • Computing the Similarity • Term Extracting • System Implementation • Experiments and Results • Web applications • Conclusion

  2. Introduction • People today need to find information quickly and accurately. • Search engines help people find the information they need in huge data collections. • Search engines are based on information retrieval models. • Traditional search engines often cannot achieve good precision and recall.

  3. Introduction • The Vector Space Model (VSM) was proposed to improve precision and recall in searching. • VSM represents each document and query in a collection as a vector in a multidimensional space. • Latent Semantic Indexing (LSI) is an improvement on VSM. • Singular Value Decomposition (SVD) is widely used in LSI.

  4. Introduction • SVD has been used quite effectively for information retrieval. • However, SVD is expensive to compute for a large database collection. • We adopt a different matrix approximation called the Semi-Discrete Decomposition (SDD).

  5. SDD algorithm • Let $A$ be an $m \times n$ term-document matrix, where $m$ is the number of terms and $n$ is the number of documents. The number of rows is greater than or equal to the number of columns, $m \ge n$. • The SDD of $A$ of dimension $k$ is: $A_k = X_k D_k Y_k^T$

  6. SDD algorithm The equation can also be expanded as: $A_k = \sum_{i=1}^{k} d_i x_i y_i^T$, where $x_i$ is an $m$-vector and $y_i$ is an $n$-vector. The entries of $x_i$ and $y_i$ are from the set $S = \{-1, 0, 1\}$, and $D_k = \mathrm{diag}(d_1, \dots, d_k)$ is a diagonal matrix. This equation is called a k-term SDD. Because a k-term SDD needs only k floating-point numbers plus k(m+n) entries from S for storage, it is inexpensive to keep quite a large number of terms.
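
The k-term form on this slide can be illustrated with a small Python sketch. It rebuilds the approximation from the triplets and estimates storage. The function names and the choice of 2 bits per ternary entry are assumptions made for illustration; they do not come from the slides or from SDDPACK.

```python
def sdd_reconstruct(d, xs, ys):
    """Return A_k = sum_i d[i] * xs[i] * ys[i]^T as a list of rows.

    d  : list of k floats
    xs : list of k vectors of length m with entries in {-1, 0, 1}
    ys : list of k vectors of length n with entries in {-1, 0, 1}
    """
    m, n = len(xs[0]), len(ys[0])
    a = [[0.0] * n for _ in range(m)]
    for di, x, y in zip(d, xs, ys):
        for r in range(m):
            if x[r] == 0:
                continue  # a zero entry contributes nothing to this row
            for c in range(n):
                a[r][c] += di * x[r] * y[c]
    return a


def sdd_storage_bits(k, m, n, float_bits=64, ternary_bits=2):
    """Rough storage estimate: k floats plus k(m+n) entries from S.

    The bit widths are illustrative assumptions.
    """
    return k * float_bits + ternary_bits * k * (m + n)


approx = sdd_reconstruct(
    d=[2.0, 0.5],
    xs=[[1, 0, -1], [0, 1, 1]],
    ys=[[1, 1], [1, -1]],
)
print(approx)
print(sdd_storage_bits(k=200, m=1000, n=100))
```

Because the vectors hold only -1, 0 and 1, the reconstruction needs no multiplications beyond the scaling by each `d[i]`.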

  7. Compute SDD There are three steps for computing an SDD approximation: 1. Let $A_k$ be the k-term approximation (with $A_0 = 0$), and let $R_k = A - A_{k-1}$ be the residual at the $k$th step. 2. As the subproblem, find the triplet $(d, x, y)$ that minimizes $F_k(d, x, y) = \| R_k - d\,x\,y^T \|_F^2$. This is a mixed integer programming problem; it can be solved approximately as follows: (a) Fix y. (b) Solve the equation above for x and d using this y. (c) Solve the equation above for y and d using the x from step (b). (d) Repeat until the convergence criterion is satisfied. 3. Repeat step 2 until the desired number of terms $k$ is reached.
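
The alternating procedure above can be sketched in plain Python for small matrices. This is only an illustration under stated assumptions, not SDDPACK's implementation. The starting y is the unit vector of the residual column with the largest norm, and the inner loop stops when the relative improvement falls below `tol`; both are assumptions. The names `sdd` and `_best_ternary` are hypothetical.

```python
def _best_ternary(s):
    """Maximize (x . s)^2 / (x . x) over nonzero x with entries in {-1, 0, 1}.

    The optimum keeps the J largest |s_i| with matching signs, for the best J.
    """
    order = sorted(range(len(s)), key=lambda i: -abs(s[i]))
    best_j, best_val, running = 1, -1.0, 0.0
    for j, i in enumerate(order, start=1):
        running += abs(s[i])
        val = running * running / j
        if val > best_val:
            best_j, best_val = j, val
    x = [0] * len(s)
    for i in order[:best_j]:
        x[i] = 1 if s[i] >= 0 else -1
    return x


def sdd(a, k_max, inner_its=100, tol=1e-2):
    """Greedy k-term SDD of a dense matrix given as a list of rows."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # residual, starts as A
    d, xs, ys = [], [], []
    for _ in range(k_max):
        col_norms = [sum(r[i][j] ** 2 for i in range(m)) for j in range(n)]
        j0 = max(range(n), key=lambda j: col_norms[j])
        if col_norms[j0] == 0.0:
            break  # residual is exactly zero, nothing left to approximate
        y = [1 if j == j0 else 0 for j in range(n)]  # assumed starting vector
        beta_old = None
        for _ in range(inner_its):
            # (b) fix y, solve for x; (c) fix x, solve for y
            x = _best_ternary([sum(r[i][j] * y[j] for j in range(n)) for i in range(m)])
            y = _best_ternary([sum(r[i][j] * x[i] for i in range(m)) for j in range(n)])
            xry = sum(x[i] * r[i][j] * y[j] for i in range(m) for j in range(n))
            scale = sum(v * v for v in x) * sum(v * v for v in y)
            beta = xry * xry / scale  # reduction in squared residual norm
            if beta_old is not None and beta - beta_old <= tol * beta_old:
                break
            beta_old = beta
        dk = xry / scale  # optimal d for the chosen x and y
        for i in range(m):
            for j in range(n):
                r[i][j] -= dk * x[i] * y[j]
        d.append(dk)
        xs.append(x)
        ys.append(y)
    return d, xs, ys


d, xs, ys = sdd([[0.5, 0.5], [-0.5, -0.5], [0.0, 0.0]], k_max=3)
print(d, xs, ys)
```

For fixed x and y the best scalar is d = xᵀRy / (‖x‖²‖y‖²), which is why each subproblem reduces to choosing a ternary vector.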

  8. Term Weighting • In the vector space model, term weighting is very important and has a great influence on the success of the retrieval system. • Let $A = [a_{ij}]$ be the term-document matrix. We define $a_{ij}$, the weight of term $i$ in document $j$, as: $a_{ij} = g_i \, t_{ij} \, d_j$ • It consists of three components: $g_i$ is the global weight of term $i$, $t_{ij}$ is the local weight of term $i$ in document $j$, and $d_j$ is the normalization factor for document $j$.

  9. Term Weighting • The weighting scheme is usually specified by a six-letter combination that indicates the local, global, and normalization components for the term-document matrix; the first three letters apply to the documents and the last three to the queries. • We specify the weighting scheme as lxn.afx. (The weighting formulas, two piecewise definitions each ending in an "otherwise" case, are missing from this transcript.)
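
Since the slide's own formulas did not survive, the sketch below only illustrates how a weight is assembled from a local weight, a global weight, and a normalization factor. The specific choices are assumptions, not the authors' definitions: a logarithmic local weight log(1 + f), no global weight, and unit-length document columns. The function name is hypothetical.

```python
import math


def weight_documents(freq):
    """freq[i][j] is the raw count of term i in document j.

    Returns a[i][j] = g[i] * t[i][j] * d[j] under assumed component choices.
    """
    m, n = len(freq), len(freq[0])
    g = [1.0] * m  # assumed: no global weighting
    t = [[math.log(1.0 + f) for f in row] for row in freq]  # assumed local weight
    d = []
    for j in range(n):
        norm = math.sqrt(sum((g[i] * t[i][j]) ** 2 for i in range(m)))
        d.append(1.0 / norm if norm > 0 else 0.0)  # assumed: unit-length columns
    return [[g[i] * t[i][j] * d[j] for j in range(n)] for i in range(m)]


weights = weight_documents([[1, 0], [7, 2], [0, 0]])
for row in weights:
    print(row)
```

With this normalization every nonempty document column has Euclidean length 1, so long documents do not dominate the similarity scores.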

  10. Computing the Similarity • The similarity between a document vector $d_j$ and the query vector $q$ is calculated by the cosine coefficient. The formula used to compute the similarity is: $\mathrm{sim}(q, d_j) = \dfrac{q^T d_j}{\|q\|_2 \, \|d_j\|_2}$ • The documents can then be arranged in descending order of similarity, and the number of documents retrieved can be limited.
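
A minimal Python sketch of cosine ranking follows. The function names and the `limit` parameter are illustrative, and returning 0.0 for a zero vector is an assumed convention.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def rank_documents(query, docs, limit=None):
    """Return (document index, similarity) pairs in descending order of similarity."""
    scored = [(j, cosine(query, doc)) for j, doc in enumerate(docs)]
    scored.sort(key=lambda pair: -pair[1])
    return scored if limit is None else scored[:limit]


ranking = rank_documents([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], limit=2)
print(ranking)
```

In an SDD-based system the document vectors would come from the decomposition rather than from the raw term-document matrix; the ranking step stays the same.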

  11. Term Extracting • The LSI model can readily use corpora in different languages to accomplish cross-language retrieval. • In our system, we use a morphological analysis system called ChaSen (茶筅) to extract Chinese words from the documents. • A good dictionary is needed to segment Chinese words correctly.

  12. Term Extracting Figure 1: Chinese Morphological analyzer

  13. System Implementation Below is an illustration showing the working mechanism of our SDD information retrieval system: [Diagram labels: Documents Collection; Query string; Dictionary; Vectors; Document Vectors; Query Vector; SDD Computation; Rank relevant documents in descending order of similarity]

  14. System Implementation • The SDD information retrieval system is implemented as follows: • 1. Segment the terms from the document collection. • 2. Create the term-document matrix in MatrixMarket coordinate format. • 3. Use SDDPACK to compute the decomposition of the term-document matrix. The command is: $ decomp -k 200 -y -b 4 term-doc.mtx term-doc.sdd • 4. Rank the relevant documents.
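
Step 2 can be sketched in Python as follows. The sketch writes a dense matrix in Matrix Market coordinate format with 1-based indices. The function name, the column-by-column entry order, and the `%.6e` number format are assumptions made for illustration.

```python
def to_matrix_market(a):
    """Serialize a dense matrix (list of rows) as Matrix Market coordinate text.

    Only nonzero entries are written, column by column, with 1-based indices.
    """
    m, n = len(a), len(a[0])
    entries = [
        (i + 1, j + 1, a[i][j])
        for j in range(n)
        for i in range(m)
        if a[i][j] != 0.0
    ]
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        "%d %d %d" % (m, n, len(entries)),  # rows, columns, nonzero count
    ]
    lines += ["%d %d %.6e" % entry for entry in entries]
    return "\n".join(lines) + "\n"


text = to_matrix_market([[0.5, 0.0], [0.0, 1.0]])
print(text, end="")
# To produce an input file for step 3, the text could be written to term-doc.mtx:
# with open("term-doc.mtx", "w") as fh:
#     fh.write(text)
```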

  15. System Implementation
      Figure 2. 8 x 6 matrix output
      %%MatrixMarket matrix coordinate real general
      8 6 20
      1 1 4.110885e-01
      2 1 3.692579e-01
      3 1 4.464557e-01
      4 1 3.692579e-01
      5 1 4.110885e-01
      6 1 1.590307e-01
      7 1 3.180615e-01
      8 1 2.520578e-01
      3 2 1.000000e+00
      3 3 7.263057e-01
      5 3 6.873719e-01
      3 4 3.162278e-01
      5 4 9.486833e-01
      3 5 6.666667e-01
      5 5 6.666667e-01
      8 5 3.333333e-01
      1 6 5.000000e-01
      3 6 5.000000e-01
      Figure 3. SDD output
      %% Semidiscrete Decomposition (SDD)
      %% Matrix: sdddata/matrix Terms: 5 Accr: 0.00e+00 Tol: 1.00e-02 InnIts: 100 Init: 1
      5 8 6
      5.7245558500289916992187500e-01
      2.7197235822677612304687500e-01
      4.0811389684677124023437500e-01
      2.5439447164535522460937500e-01
      1.3001415133476257324218750e-01
      0 0 1 0 1 0 0 0
      1 1 0 1 0 0 1 1
      0 0 1 0 -1 0 0 0
      1 -1 0 -1 0 0 -1 1
      1 1 -1 1 -1 1 0 0
      1 1 1 1 1 1
      1 0 0 0 0 1
      0 1 0 -1 0 0
      0 0 0 0 0 1
      1 0 0 0 0 0

  16. Experiments and Results We selected a small data set of only 100 documents for a test. The data are Chinese text documents taken from the web pages of Chinese Agricultural University. To compare the performance of SDD and SVD, we computed the matrix decomposition using both. We then created the query vector, computed the similarity, and ranked the documents in descending order of similarity.

  17. Experiments and Results
      Figure 4. Top ten entries in SVD
       1  22  0.845110
       2   3  0.186189
       3  49  0.164403
       4  58  0.157444
       5  56  0.148811
       6   1  0.139891
       7   9  0.105001
       8  69  0.067863
       9  31  0.057763
      10  23  0.056919
      Figure 5. Top ten entries in SDD
       1  22  0.741998
       2   1  0.401568
       3  49  0.399059
       4   3  0.398177
       5  58  0.397571
       6  23  0.396199
       7   9  0.394032
       8  12  0.391085
       9  14  0.389590
      10  31  0.380483
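
One rough way to compare the two rankings is to count how many documents they share. The Python sketch below does this. It assumes that the second column of Figures 4 and 5 is the document number, which the slide does not state explicitly; the variable names are illustrative.

```python
# Second column of Figure 4 and Figure 5, read as document numbers (an assumption).
svd_top = [22, 3, 49, 58, 56, 1, 9, 69, 31, 23]
sdd_top = [22, 1, 49, 3, 58, 23, 9, 12, 14, 31]

sdd_set = set(sdd_top)
shared = [doc for doc in svd_top if doc in sdd_set]
only_svd = [doc for doc in svd_top if doc not in sdd_set]
only_sdd = [doc for doc in sdd_top if doc not in set(svd_top)]

print("shared:", shared)
print("only in one list:", only_svd, only_sdd)
print("overlap: %d of %d" % (len(shared), len(svd_top)))
```

Under that reading, both lists place document 22 first and have eight of their ten documents in common.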

  18. Web Applications • We developed a web-based application to present this Chinese information retrieval system. • The site is available at http://pc110.narc.affrc.go.jp/Chinese/. • We also developed a Japanese system using an SDD-based VSM. Its web interface is available at http://pc110.narc.affrc.go.jp/AgrInfo/.

  19. Web Applications

  20. Web Applications

  21. Web Applications

  22. Web Applications

  23. Web Applications

  24. Conclusion • We presented a Chinese information retrieval system using SDD. • SDD has the advantage of requiring little storage. • SDD should be easy to apply to a large data collection. • SDD should make cross-language retrieval easy to accomplish. • SDD has almost the same retrieval performance as SVD.

  25. Thank you!
