A Chinese Information Retrieval System Using SDD • Introduction • SDD algorithm • Compute SDD • Term Weighting • Computing the Similarity • Term Extracting • System Implementation • Experiments and Results • Web applications • Conclusion
Introduction • People today need to find information quickly and accurately. • Search engines help people find the information they need in huge data collections. • Search engines are based on information retrieval models. • Traditional search engines often cannot achieve good precision and recall.
Introduction • The Vector Space Model (VSM) was proposed to improve precision and recall in searching. • VSM represents each document and query in a collection as a vector in a multidimensional space. • Latent Semantic Indexing (LSI) is an improvement on VSM. • Singular Value Decomposition (SVD) is widely used in LSI.
Introduction • SVD has been used quite effectively for information retrieval. • However, SVD is expensive to compute for a large document collection. • We adopt a different matrix approximation called the Semidiscrete Decomposition (SDD).
SDD algorithm • Let $A$ be an $m \times n$ term-document matrix, where $m$ is the number of terms and $n$ is the number of documents. The number of rows is greater than or equal to the number of columns, $m \ge n$. • The SDD of $A$ of dimension $k$ is: $A \approx A_k = X_k D_k Y_k^T$
SDD algorithm The equation can also be expanded as follows: $A_k = \sum_{i=1}^{k} d_i \, x_i \, y_i^T$, where $x_i$ is an $m$-vector and $y_i$ is an $n$-vector. The entries of $x_i$ and $y_i$ are from the set $S = \{-1, 0, 1\}$, and $D_k = \mathrm{diag}(d_1, \dots, d_k)$ is a diagonal matrix. This equation is called a k-term SDD. Since a k-term SDD needs only $k$ floating point numbers plus $k(m+n)$ entries from $S$ for storage, it is inexpensive to store quite a large number of terms.
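The storage claim can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, it assumes 64-bit floating point numbers and 2 bits per entry of S, and it compares against a rank-k truncated SVD, which stores k(m+n+1) floating point numbers.

```python
def sdd_storage_bits(m: int, n: int, k: int,
                     float_bits: int = 64, entry_bits: int = 2) -> int:
    """Bits for a k-term SDD: k floats plus k(m+n) entries from {-1, 0, 1}.

    Assumption: each ternary entry is packed into `entry_bits` bits.
    """
    return k * float_bits + k * (m + n) * entry_bits


def svd_storage_bits(m: int, n: int, k: int, float_bits: int = 64) -> int:
    """Bits for a rank-k truncated SVD: k(m+n+1) floats."""
    return k * (m + n + 1) * float_bits


if __name__ == "__main__":
    m, n, k = 8, 6, 5  # sizes of the small 8 x 6 example with 5 SDD terms
    print("SDD bits:", sdd_storage_bits(m, n, k))
    print("SVD bits:", svd_storage_bits(m, n, k))
```

Under these assumptions the gap widens as m and n grow, because the SDD spends floating point storage only on the k diagonal values.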
Compute SDD There are three steps for computing an SDD approximation: 1. Let $A_k$ be the k-term approximation (with $A_0 = 0$) and let $R_k = A - A_{k-1}$ be the residual at the $k$th step. 2. As the subproblem, find the triplet $(d_k, x_k, y_k)$ that minimizes $F_k(d, x, y) = \| R_k - d \, x \, y^T \|_F^2$. This is a mixed integer programming problem; it can be solved approximately as below: (a) Fix y. (b) Solve the minimization above for x and d using this y. (c) Solve the minimization above for y and d using the x from step (b). (d) Repeat (b) and (c) until a convergence criterion is satisfied. 3. Repeat step 2 until the desired number of terms has been computed.
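The alternating steps above can be sketched in plain Python. This is a minimal illustration, not SDDPACK's implementation. It assumes the subproblem minimizes the Frobenius norm of R - d x y^T, as in the standard SDD formulation. The starting vector (a unit vector on the residual column with the largest norm) and the stopping rule (x and y stop changing) are assumptions, and the names `best_ternary`, `sdd`, and `reconstruct` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def best_ternary(s: List[float]) -> List[int]:
    """Return x with entries in {-1, 0, 1} maximizing (x . s)^2 / (x . x).

    The maximizer keeps the J largest |s_i| (with their signs) for the best J.
    """
    order = sorted(range(len(s)), key=lambda i: -abs(s[i]))
    best_count, best_value, running = 1, -1.0, 0.0
    for count, i in enumerate(order, start=1):
        running += abs(s[i])
        value = running * running / count
        if value > best_value:
            best_count, best_value = count, value
    x = [0] * len(s)
    for i in order[:best_count]:
        x[i] = 1 if s[i] >= 0 else -1
    return x


def sdd(a: Matrix, k: int, inner_its: int = 100) -> Tuple[List[float], List[List[int]], List[List[int]]]:
    """Greedy k-term SDD of a dense matrix; returns (d values, x vectors, y vectors)."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # residual R_k, starts as A
    ds, xs, ys = [], [], []
    for _ in range(k):
        # (a) Fix y. Assumed start: unit vector on the largest residual column.
        col = max(range(n), key=lambda j: sum(r[i][j] ** 2 for i in range(m)))
        y = [1 if j == col else 0 for j in range(n)]
        x = [0] * m
        for _ in range(inner_its):
            # (b) Best x for this y, then (c) best y for that x.
            new_x = best_ternary([sum(r[i][j] * y[j] for j in range(n)) for i in range(m)])
            new_y = best_ternary([sum(r[i][j] * new_x[i] for i in range(m)) for j in range(n)])
            converged = new_x == x and new_y == y  # (d) assumed stopping rule
            x, y = new_x, new_y
            if converged:
                break
        # Optimal d for fixed x and y: x^T R y / (|x|^2 |y|^2).
        xry = sum(x[i] * r[i][j] * y[j] for i in range(m) for j in range(n))
        d = xry / (sum(map(abs, x)) * sum(map(abs, y)))
        for i in range(m):
            for j in range(n):
                r[i][j] -= d * x[i] * y[j]
        ds.append(d)
        xs.append(x)
        ys.append(y)
    return ds, xs, ys


def reconstruct(ds: List[float], xs: List[List[int]], ys: List[List[int]]) -> Matrix:
    """Form A_k as the sum of d_i * x_i * y_i^T."""
    m, n = len(xs[0]), len(ys[0])
    return [[sum(d * x[i] * y[j] for d, x, y in zip(ds, xs, ys)) for j in range(n)]
            for i in range(m)]
```

Because d is chosen optimally for each pair (x, y), every added term can only reduce or keep the Frobenius norm of the residual. A dense list-of-lists matrix is used here for clarity; a real term-document matrix would be stored in sparse form.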
Term Weighting • In the vector space model, term weighting is very important and has great influence on the success of the retrieval system. • Let $A = [a_{ij}]$ be the $m \times n$ term-document matrix. We define $a_{ij}$, the weight of term $i$ in document $j$, as follows: $a_{ij} = g_i \, t_{ij} \, c_j$ • It consists of three components: $g_i$ is the global weight of term $i$, $t_{ij}$ is the local weight of term $i$ in document $j$, and $c_j$ is a normalization factor for document $j$.
Term Weighting • The weighting scheme is usually specified by a six-letter combination that indicates the local, global, and normalization components: the first three letters apply to the term-document matrix and the last three to the query. • We specify the weighting scheme as lxn.afx. [Formulas: piecewise definitions of the lxn.afx weights]
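The document half of the scheme (lxn) can be sketched as follows. The definitions used here are assumptions based on the usual SMART-style reading of the letters, since the formulas are not legible in the slide text: l is the log local weight log(1 + f_ij), x means no global weight (1 for every term), and n scales each document column to unit 2-norm. The function name `lxn_weights` is hypothetical, and the query half (afx) is not covered.

```python
import math
from typing import List


def lxn_weights(freq: List[List[float]]) -> List[List[float]]:
    """Weight a term-by-document frequency matrix (rows = terms, columns = documents).

    Assumed scheme: local log(1 + f_ij), global weight 1, unit 2-norm columns.
    """
    m, n = len(freq), len(freq[0])
    # Local weight 'l' times global weight 'x' (which is 1 for every term).
    local = [[math.log(1.0 + freq[i][j]) for j in range(n)] for i in range(m)]
    weighted = [[0.0] * n for _ in range(m)]
    for j in range(n):
        # Normalization 'n': divide the column by its Euclidean length.
        norm = math.sqrt(sum(local[i][j] ** 2 for i in range(m)))
        for i in range(m):
            weighted[i][j] = local[i][j] / norm if norm > 0 else 0.0
    return weighted


if __name__ == "__main__":
    # One document containing three terms with raw counts 3, 3 and 1.
    for row in lxn_weights([[3.0], [3.0], [1.0]]):
        print(f"{row[0]:.6e}")
```

With no global weight, the base of the logarithm cancels out in the normalization, so the choice of natural log here does not affect the result.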
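Computing the Similarity • The similarity between a document vector and the query vector is calculated by the cosine coefficient. Below is the formula used to compute the similarity: $\mathrm{sim}(q, a_j) = \frac{q^T a_j}{\|q\|_2 \, \|a_j\|_2}$, where $q$ is the query vector and $a_j$ is the vector of document $j$. • The documents can then be arranged in descending order of similarity, and the number of documents retrieved can be limited.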
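The cosine ranking can be sketched as below. This is a minimal illustration on dense vectors; the function names `cosine` and `rank_documents`, the 1-based document numbering, and the `limit` parameter are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine coefficient of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def rank_documents(query: Sequence[float],
                   docs: Sequence[Sequence[float]],
                   limit: int = 10) -> List[Tuple[int, float]]:
    """Return up to `limit` (document number, similarity) pairs, best first."""
    scores = [(j, cosine(query, doc)) for j, doc in enumerate(docs, start=1)]
    scores.sort(key=lambda pair: -pair[1])  # descending order of similarity
    return scores[:limit]


if __name__ == "__main__":
    docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    for number, score in rank_documents([1.0, 0.0], docs, limit=2):
        print(number, f"{score:.6f}")
```

In an SDD-based system the document vectors would come from the approximation rather than from the original matrix; the ranking step itself is unchanged.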
Term Extracting • The LSI model can readily use corpora in different languages to accomplish cross-language retrieval. • We chose a morphological analysis system called ChaSen (茶筅) to extract Chinese words from the documents in our system. • A good dictionary is needed to segment Chinese words correctly.
Term Extracting Figure 1: Chinese Morphological analyzer
System Implementation Below is an illustration showing the working mechanism of our SDD information retrieval system: [Diagram with the stages: Documents Collection, Query String, Dictionary, Vectors (Document Vectors, Query Vector), SDD Computation, Rank relevant documents in descending order of similarity]
System Implementation • The SDD information retrieval system is implemented as follows: • 1. Segment the terms from the document collection. • 2. Create the term-document matrix in MatrixMarket coordinate format. • 3. Use SDDPACK to compute the decomposition of the term-document matrix. The command is as below: $ decomp -k 200 -y -b 4 term-doc.mtx term-doc.sdd • 4. Rank the relevant documents.
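Step 2 can be sketched as follows. The MatrixMarket coordinate format consists of a header line, a size line with the number of rows, columns, and stored entries, and then one line per nonzero entry with 1-based row and column indices. The function name `to_matrix_market` and the dense list-of-lists input are assumptions for illustration; entries are emitted column by column.

```python
from typing import List


def to_matrix_market(a: List[List[float]]) -> str:
    """Serialize a dense matrix as MatrixMarket 'coordinate real general' text."""
    m, n = len(a), len(a[0])
    # Collect nonzeros column by column, using 1-based indices.
    entries = [(i + 1, j + 1, a[i][j])
               for j in range(n) for i in range(m) if a[i][j] != 0.0]
    lines = ["%%MatrixMarket matrix coordinate real general",
             f"{m} {n} {len(entries)}"]
    lines += [f"{i} {j} {value:.6e}" for i, j, value in entries]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_matrix_market([[0.5, 0.0], [0.0, 1.0]]), end="")
```

The returned text could then be written to a file such as term-doc.mtx, the input name used in the command above.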
System Implementation
Figure 2. 8 x 6 matrix output
%%MatrixMarket matrix coordinate real general
8 6 20
1 1 4.110885e-01
2 1 3.692579e-01
3 1 4.464557e-01
4 1 3.692579e-01
5 1 4.110885e-01
6 1 1.590307e-01
7 1 3.180615e-01
8 1 2.520578e-01
3 2 1.000000e+00
3 3 7.263057e-01
5 3 6.873719e-01
3 4 3.162278e-01
5 4 9.486833e-01
3 5 6.666667e-01
5 5 6.666667e-01
8 5 3.333333e-01
1 6 5.000000e-01
3 6 5.000000e-01
Figure 3. SDD output
%% Semidiscrete Decomposition (SDD)
%% Matrix: sdddata/matrix Terms: 5 Accr: 0.00e+00 Tol: 1.00e-02 InnIts: 100 Init: 1
5 8 6
5.7245558500289916992187500e-01
2.7197235822677612304687500e-01
4.0811389684677124023437500e-01
2.5439447164535522460937500e-01
1.3001415133476257324218750e-01
0 0 1 0 1 0 0 0 1 1
0 1 0 0 1 1 0 0 1 0
-1 0 0 0 1 -1 0 -1 0 0
-1 1 1 1 -1 1 -1 1 0 0
1 1 1 1 1 1 1 0 0 0
0 1 0 1 0 -1 0 0 0 0
0 0 0 1 1 0 0 0 0 0
Experiments and Results We selected a small data set of only 100 documents for a test. The data are Chinese text documents taken from the web pages of Chinese Agricultural University. To compare the performance of SDD and SVD, we computed the matrix decomposition using both SDD and SVD. We then created the query vector, computed the similarity, and ranked the documents in descending order of similarity.
Experiments and Results
Figure 4. Top ten entries in SVD
Rank  Document  Similarity
1     22        0.845110
2     3         0.186189
3     49        0.164403
4     58        0.157444
5     56        0.148811
6     1         0.139891
7     9         0.105001
8     69        0.067863
9     31        0.057763
10    23        0.056919
Figure 5. Top ten entries in SDD
Rank  Document  Similarity
1     22        0.741998
2     1         0.401568
3     49        0.399059
4     3         0.398177
5     58        0.397571
6     23        0.396199
7     9         0.394032
8     12        0.391085
9     14        0.389590
10    31        0.380483
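One rough way to compare the two rankings is to count how many documents appear in both top-ten lists. The sketch below copies the document numbers from Figures 4 and 5, assuming the middle column of each list is the document number and that the first list belongs to the SVD figure. The function name `top_k_overlap` is hypothetical, and this overlap count is only an informal check, not a standard effectiveness measure such as precision or recall.

```python
from typing import Sequence

# Document numbers in rank order, copied from the two top-ten lists.
svd_top10 = [22, 3, 49, 58, 56, 1, 9, 69, 31, 23]
sdd_top10 = [22, 1, 49, 3, 58, 23, 9, 12, 14, 31]


def top_k_overlap(first: Sequence[int], second: Sequence[int]) -> int:
    """Number of documents that appear in both ranked lists."""
    return len(set(first) & set(second))


shared = top_k_overlap(svd_top10, sdd_top10)

if __name__ == "__main__":
    print(f"{shared} of {len(svd_top10)} documents appear in both lists")
```

For these lists the two methods share 8 of 10 documents, and both place document 22 first.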
Web Applications • We developed a web-based application to present this Chinese information retrieval system. • The site can be visited at http://pc110.narc.affrc.go.jp/Chinese/. • We also developed a Japanese system using an SDD-based VSM. Its web interface is available at http://pc110.narc.affrc.go.jp/AgrInfo/.
Conclusion • We presented a Chinese information retrieval system using SDD. • SDD has the advantage of saving storage. • SDD is easy to apply to a large data collection. • SDD makes it easy to accomplish cross-language retrieval. • SDD has almost the same retrieval performance as SVD.