Score-based ranking of the documents Submitted By: Kriti Khanna(9910103499) F4, CSE, 4th year
OUTLINE • Introduction • Literature Survey • Objective • Flowchart • Implementation • Tools and techniques • References
INTRODUCTION • Information Retrieval. • Ranking • Weight • Score
Information Retrieval • Information retrieval (IR) is the task of obtaining information resources relevant to an information need from a collection of information resources. • It is used to reduce information overload. • Typical applications: web search engines; public libraries also use IR systems to provide access to books, journals and other documents.
Brief working of an IR system • The user enters the query in their own language. • The query development function converts the user query into a formal query that matches the system's vocabulary of retrieval commands. This is one of the important intermediary steps that take place inside the database. • The retrieved data may be complete or incomplete; it is then sorted to generate the final result set.
Ranking • The goal is to rank matching documents according to their relevance to a given search query. • This is done by assigning a numerical score to each document using a ranking function, which incorporates features of the document, the query, and the overall document collection.
Some simple ranking functions • Constant ranking function : the same score is assigned to all documents. • Term frequency ranking function : counting the number of times that each query term occurs in the document, then summing these. • The tf-idf ranking function : computing the product of the term frequency and inverse document frequency for each query term, then summing these. • Okapi BM25 : combining the idf of each query term with a saturating term-frequency component that is normalized by document length, then summing these. • Machine-learned ranking formulas : obtained automatically from training data by machine learning methods.
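As an illustration of the simplest non-trivial function in this list, the term frequency ranking function can be sketched as below. This is a minimal sketch, not the project's code: the class name `TermFrequencyRanker` and the representation of documents as lists of already-tokenized terms are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, for illustration only.
class TermFrequencyRanker {

    // Score = total number of occurrences of all query terms in the document.
    static int score(List<String> queryTerms, List<String> docTerms) {
        int total = 0;
        for (String q : queryTerms) {
            total += Collections.frequency(docTerms, q);
        }
        return total;
    }

    // Returns document indices ordered from highest to lowest score.
    // List.sort is stable, so tied documents keep their original order.
    static List<Integer> rank(List<String> queryTerms, List<List<String>> docs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            order.add(i);
        }
        order.sort((a, b) -> Integer.compare(
                score(queryTerms, docs.get(b)),
                score(queryTerms, docs.get(a))));
        return order;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("score", "ranking");
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("ranking", "of", "documents"),
                Arrays.asList("score", "based", "ranking", "score"),
                Arrays.asList("gene", "selection"));
        System.out.println(rank(query, docs));
    }
}
```

Raw counts favor long documents and common words, which is the motivation for the tf-idf and BM25 variants listed above.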
Score Calculation • The score of each document is calculated by multiplying, for every query term, the weight of the term in the document by its weight in the query, then summing these products.
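The score calculation amounts to a dot product over the terms shared by the query and the document. A minimal sketch follows, assuming the weights are already available as term-to-weight maps; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class DotProductScore {

    // Sums queryWeight * documentWeight over every query term
    // that also appears in the document; other terms contribute 0.
    static double score(Map<String, Double> queryWeights,
                        Map<String, Double> docWeights) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : queryWeights.entrySet()) {
            Double docWeight = docWeights.get(e.getKey());
            if (docWeight != null) {
                sum += e.getValue() * docWeight;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("score", 1.0);
        query.put("ranking", 2.0);

        Map<String, Double> doc = new HashMap<>();
        doc.put("score", 0.5);
        doc.put("ranking", 0.25);
        doc.put("gene", 0.9); // not in the query, so it is ignored

        System.out.println(score(query, doc)); // 1.0*0.5 + 2.0*0.25
    }
}
```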
List of sources • Paper 1 : Document similarity search based on manifold ranking of TextTiles. • Paper 2 : TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages.
List of sources • Paper 3 : Comparison of rank-based vs score based aggregation for ensemble gene selection. • Paper 4 : Several methods of ranking retrieval systems with partial relevance judgment.
Document similarity search based on manifold ranking of TextTiles • In this paper, documents are ranked using the tiling concept. • Conclusion : it improves retrieval performance across different retrieval functions. • Authors : Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. • Place : Institute of Computer Science and Technology, Peking University, Beijing 100871, China.
TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages • In this paper, TextTiling is used to divide each document into subtopic passages. • Conclusion : this technique has been useful for many text analysis tasks, including information retrieval and summarization. • Author : Marti A. Hearst
Comparison of rank-based vs score based aggregation for ensemble gene selection • This paper compares rank-based and score-based aggregation using different techniques (RF, MI, Dev, GM, ROC, PRC, S2N), applied to different datasets and subsets. • Conclusion : the two aggregation approaches behave differently for different rankers. • Authors : David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, and Amri Napolitano
Several methods of ranking retrieval systems with partial relevance judgment • This paper demonstrates that precision and recall have certain shortcomings when ranking is done with partial relevance judgment. • Conclusion : with partial relevance judgment, the evaluated results can be significantly different from the results with complete relevance judgment. • Authors : Shengli Wu and Sally McClean.
Objective • The project aims to find documents similar to a query document in a text corpus and return a ranked list of similar documents. • Ranking is done by calculating the query-document score.
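Problem statement • Documents are ranked using the standard score calculation, i.e. the tf-idf concept. • Formula for weighted tf : $w_{t,d} = 1 + \log_{10}(\mathrm{tf}_{t,d})$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise. • Formula for idf : $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$, where $N$ is the number of documents in the collection and $\mathrm{df}_t$ is the number of documents containing term $t$. • Another way of ranking the documents, TextTiling, is also being studied. A precision-recall graph will then be plotted.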
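The two formulas above translate directly into code. The following is a minimal sketch with hypothetical class and method names; returning an idf of 0 when a term occurs in no document is an assumption made here to avoid division by zero, not something the slides specify.

```java
// Hypothetical helper class, for illustration only.
class TfIdfWeights {

    // Weighted tf: 1 + log10(tf) if tf > 0, otherwise 0.
    static double weightedTf(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // idf: log10(N / df). Assumption: a term with df = 0 gets idf 0.
    static double idf(int totalDocs, int docFreq) {
        if (docFreq <= 0) {
            return 0.0;
        }
        return Math.log10((double) totalDocs / docFreq);
    }

    // tf-idf weight of one term in one document.
    static double tfIdf(int tf, int totalDocs, int docFreq) {
        return weightedTf(tf) * idf(totalDocs, docFreq);
    }

    public static void main(String[] args) {
        // A term occurring 10 times in a document and present in
        // 10 of 1000 documents: (1 + 1) * log10(100) = 2 * 2 = 4.
        System.out.println(tfIdf(10, 1000, 10));
    }
}
```

The logarithm dampens the effect of repeated occurrences: a term appearing 10 times gets weight 2 rather than 10.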
Steps involved • Collection of files • Determining term frequency • Determining document frequency • (query, document ) set • Score calculation based on 4 different techniques.
Description of functions • Main : It calls all other functions by creating objects of the subclasses. • remWord : It checks whether the program is reading the files. • deleteWords : It deletes the stop words from all the files and stores the unique words of all files in a separate file.
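The behaviour described for deleteWords can be sketched as follows. This is not the project's implementation: it works on in-memory strings rather than files, the stop-word list is a small made-up sample, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, for illustration only.
class StopWordFilter {

    // Assumption: a tiny sample list; a real system would load a full one.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "is", "of", "a", "an", "and", "to", "in"));

    // Lowercases the text, splits on non-alphanumeric characters,
    // and drops stop words.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Collects the unique remaining words across all documents (sorted).
    static Set<String> uniqueTerms(List<String> documents) {
        Set<String> vocabulary = new TreeSet<>();
        for (String doc : documents) {
            vocabulary.addAll(removeStopWords(doc));
        }
        return vocabulary;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The ranking of the documents"));
        System.out.println(uniqueTerms(Arrays.asList(
                "The ranking of documents", "Score and ranking")));
    }
}
```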
Description of flowchart functions • countWords : It reads the unique terms from the file and stores them in a map along with their frequencies. • documentFreqVector : It builds a document vector. For each term and document, it sets a 1 if the term occurs in the document and a 0 otherwise.
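A rough sketch of what these two functions compute is given below. It assumes tokenized, in-memory documents instead of files, and the method names other than `countWords` are hypothetical; the document frequency of a term is obtained by summing its row of 1s and 0s.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class TermStatistics {

    // Maps each term to the number of times it occurs in the token list.
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // incidence[i][j] is 1 if vocabulary term i occurs in document j, else 0.
    static int[][] incidenceMatrix(List<String> vocabulary,
                                   List<List<String>> docs) {
        int[][] incidence = new int[vocabulary.size()][docs.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            for (int j = 0; j < docs.size(); j++) {
                incidence[i][j] = docs.get(j).contains(vocabulary.get(i)) ? 1 : 0;
            }
        }
        return incidence;
    }

    // Document frequency of a term = number of 1s in its row.
    static int documentFrequency(int[] row) {
        int df = 0;
        for (int cell : row) {
            df += cell;
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("score", "ranking", "score"),
                Arrays.asList("ranking"));
        System.out.println(countWords(docs.get(0)));
        int[][] m = incidenceMatrix(Arrays.asList("ranking", "score"), docs);
        System.out.println(Arrays.deepToString(m));
    }
}
```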
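Weight Calculation • Weights are calculated differently for documents and queries. • We use the ddd.qqq (SMART) notation to describe this calculation: the first three letters give the document weighting and the last three the query weighting. • Example: lnc.ltn • document: logarithmic tf, no idf weighting, cosine normalization • query: logarithmic tf, idf, no normalization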
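The lnc.ltn example can be worked through in code. The sketch below is illustrative only: the class name, the map-based inputs and the treatment of unseen terms (idf of 0) are assumptions, not details taken from the project.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class LncLtnScorer {

    // "l": logarithmic tf, 1 + log10(tf) for tf > 0, else 0.
    static double logTf(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // lnc: logarithmic tf, no idf, cosine (unit-length) normalization.
    static Map<String, Double> lncDocument(Map<String, Integer> termCounts) {
        Map<String, Double> weights = new HashMap<>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            double w = logTf(e.getValue());
            weights.put(e.getKey(), w);
            sumOfSquares += w * w;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) {
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                e.setValue(e.getValue() / norm);
            }
        }
        return weights;
    }

    // ltn: logarithmic tf, idf = log10(N / df), no normalization.
    // Assumption: a term missing from docFreq gets idf 0.
    static Map<String, Double> ltnQuery(Map<String, Integer> termCounts,
                                        Map<String, Integer> docFreq,
                                        int totalDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            double idf = (df == null || df == 0)
                    ? 0.0
                    : Math.log10((double) totalDocs / df);
            weights.put(e.getKey(), logTf(e.getValue()) * idf);
        }
        return weights;
    }

    // Query-document score: dot product of the two weight vectors.
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            sum += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> docCounts = new HashMap<>();
        docCounts.put("score", 1);
        docCounts.put("ranking", 1);
        Map<String, Integer> queryCounts = new HashMap<>();
        queryCounts.put("score", 1);
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("score", 10);
        System.out.println(score(ltnQuery(queryCounts, docFreq, 1000),
                                 lncDocument(docCounts)));
    }
}
```

In this example the document vector is (1, 1) before normalization and (1/√2, 1/√2) after, the query weight for "score" is 1 × log10(1000/10) = 2, and the resulting score is 2/√2 = √2.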
Approaches • anc.btn and anc.ltn approaches
Approaches • nnc.btn and nnc.ltn approaches
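In the ddd.qqq notation, these four schemes differ only in the term-frequency letter used for the document (a or n) and for the query (b or l); all of them use no idf with cosine normalization on the document side and idf with no normalization on the query side. The standard SMART term-frequency variants are natural (n), logarithmic (l), augmented (a) and boolean (b). A minimal sketch of these variants follows; the class name is hypothetical, and returning 0 for an absent term under every scheme is an assumption of this example.

```java
// Hypothetical helper class, for illustration only.
class SmartTf {

    // Term-frequency weight for one term under a SMART tf scheme.
    //   n (natural):     tf
    //   l (logarithmic): 1 + log10(tf)
    //   a (augmented):   0.5 + 0.5 * tf / maxTf
    //   b (boolean):     1 if tf > 0
    // maxTf is the largest term count in the same document or query.
    // Assumption: absent terms (tf = 0) get weight 0 in every scheme.
    static double weight(char scheme, int tf, int maxTf) {
        if (tf <= 0) {
            return 0.0;
        }
        switch (scheme) {
            case 'n':
                return tf;
            case 'l':
                return 1.0 + Math.log10(tf);
            case 'a':
                return 0.5 + 0.5 * tf / maxTf;
            case 'b':
                return 1.0;
            default:
                throw new IllegalArgumentException("Unknown tf scheme: " + scheme);
        }
    }

    public static void main(String[] args) {
        int tf = 2;
        int maxTf = 4;
        for (char scheme : new char[] {'n', 'l', 'a', 'b'}) {
            System.out.println(scheme + " -> " + weight(scheme, tf, maxTf));
        }
    }
}
```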
Tools and techniques • NetBeans : an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an application platform framework for Java desktop applications and others. The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms supporting a compatible JVM. The NetBeans Platform allows applications to be developed from a set of modular software components called modules. • Java : a computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that code that runs on one platform does not need to be recompiled to run on another. Java applications are typically compiled to bytecode (class file) that can run on any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2014, one of the most popular programming languages in use, particularly for client-server web applications.
References • Wan, X., Yang, J., Xiao, J. (2001). Document Similarity Search Based on Manifold-Ranking of TextTiles. Institute of Computer Science and Technology, Peking University, Beijing 100871, China. • Hearst, M.A. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Xerox PARC, California, USA. • Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A. (2013). Comparison of Rank-Based vs. Score-Based Aggregation for Ensemble Gene Selection. Florida Atlantic University, Boca Raton, FL 33431. • Wu, S., McClean, S. Several Methods of Ranking Retrieval Systems with Partial Relevance Judgment. School of Computing and Mathematics, University of Ulster, UK.