Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion Shoaib Jameel, Wai Lam and Xiaojun Qian The Chinese University of Hong Kong
Outline • Introduction to Readability/Conceptual Difficulty • Motivation • Related Work • Our method (Sequential Term Transition Model (STTM)) • Empirical Evaluation • Conclusions and Future Work
Which of the two appears simpler to you? 1. http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm 2. http://en.wikipedia.org/wiki/Proton
Search for a keyword: the results are sometimes irrelevant and come back in a mixed order of readability
Our Objective • Issue a query and retrieve web pages (considering relevance); this step is already accomplished automatically • Re-rank the retrieved web pages based on readability
What has been done so far? • Heuristic Readability formulae • Unsupervised approaches • Supervised approaches
Heuristic Readability Methods • Have existed since the 1940s • Semantic component: number of syllables per word, length of the syllables, etc. • Syntactic component: sentence length, etc.
Example – Flesch Reading Ease • Syllable counts: water -> wa-ter, proton -> pro-ton, embryology -> em-bry-ol-o-gy, star -> star • Score = 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words) • Syntactic component: words per sentence • Semantic component: syllables per word • Problem: relies on manually tuned numerical parameters
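A minimal sketch of the standard Flesch Reading Ease computation (the constants are the manually tuned parameters mentioned above; the word, sentence, and syllable counts are assumed to be supplied by the caller, since a real system also needs a syllable counter):

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease score.

    words/sentences is the syntactic component, syllables/words the
    semantic component; 206.835, 1.015 and 84.6 are the manually
    tuned constants.  Higher scores mean easier text.
    """
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# e.g. a short, simple passage scores higher (easier) than a dense one
print(flesch_reading_ease(total_words=100, total_sentences=8, total_syllables=130))
```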
Supervised Learning Methods • Language models • Unigram language model based method • SVMs (Support Vector Machines) • Use of query logs and user profiles • Can address the problem on a per-user basis
Smoothed Unigram Model [1] • Recast the well-studied problem of readability in terms of text categorization and used straightforward techniques from statistical language modeling. [1] K. Collins-Thompson and J. Callan. 2005. "Predicting reading difficulty with statistical language models". Journal of the American Society for Information Science and Technology, 56(13), pp. 1448-1462.
Smoothed Unigram Model Limitation of their method: Requires training data, which sometimes may be difficult to obtain
Domain-specific Readability • Jin Zhao and Min-Yen Kan. 2010. Domain-specific iterative readability computation. In Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL '10). Based on the web link-structure algorithms HITS (Hypertext Induced Topic Search) and SALSA (Stochastic Approach for Link-Structure Analysis). • Xin Yan, Dawei Song, and Xue Li. 2006. Concept-based document readability in domain specific information retrieval. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06). Based on an ontology; tested only in the medical domain. I will focus on this work.
Overview • The authors state that document scope and document cohesion are important parameters for identifying simple texts. • The authors use a controlled-vocabulary thesaurus called Medical Subject Headings (MeSH). • The authors point out that readability formulae are not directly applicable to web pages.
MeSH Ontology (figure): concept difficulty increases as we move deeper in the hierarchy and decreases towards the root
Overall Concept-Based Readability Score (equation on slide), where: • DaCw = Dale-Chall readability measure • PWD = percentage of difficult words • AvgSL = average sentence length in di • len(ci, cj) = function computing the shortest path between concepts ci and cj in the MeSH hierarchy • N = total number of domain concepts in document di • Depth(ci) = depth of the concept ci in the concept hierarchy • D = maximum depth of the concept hierarchy • Number of associations = total number of mutual associations among concepts. Their work focused on word-level readability, hence it considered only the PWD component.
Use of Query Log Data • Such studies have been conducted by search engine companies • They require proprietary data that are not publicly available • Hence they are of limited use to the research community, because the results cannot be replicated. J. Kim, K. Collins-Thompson, P. N. Bennett, and S. Dumais. 2012. Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic. In Proceedings of WSDM 2012. (Microsoft Research) Chenhao Tan, Evgeniy Gabrilovich, and Bo Pang. 2012. To each his own: personalized content selection based on text comprehensibility. In Proceedings of WSDM 2012. (Yahoo! Research)
Our approach • Sequential Term Transition Model (STTM) • A conceptual difficulty determination model that is: • Unsupervised • Free of any knowledge base or annotated data
Methodology • We first build a term-document matrix • We then perform Singular Value Decomposition (SVD) on the matrix • SVD: W ≈ W' = U S V^T • U is a T × f matrix of left singular vectors • V is a D × f matrix of right singular vectors • S is an f × f diagonal matrix of singular values • T is the number of terms in the vocabulary • D is the number of documents in the collection • f is the number of factors
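A minimal sketch of this step using scipy's sparse SVD; the toy matrix and the number of factors f are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Toy term-document matrix W (T terms x D documents), e.g. tf or tf-idf weights.
W = csc_matrix(np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
]))

f = 2                      # number of latent factors (illustrative)
U, s, Vt = svds(W, k=f)    # W ~= U * diag(s) * Vt

term_vectors = U * s       # rows: terms in the f-dimensional latent space
doc_vectors = Vt.T * s     # rows: documents in the same latent space
```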
Observations in the SVD space • Terms that are central to a document come close to their document vectors • General terms lie far from their document vectors • Semantically related terms cluster close to each other • Unrelated terms lie far apart from each other
Computing Term Difficulties (equation on slide), built from: the normalized term vector, the normalized document vector, and the matrix of normalized document vectors that contain the term
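The slide only names the ingredients, not the exact STTM formula. Below is a minimal, assumption-laden sketch that measures how central a term is to a document via the cosine of the angle between the normalized term vector and the normalized document vector in the SVD space; treat it as one plausible reading of the geometry described above, not the paper's definition:

```python
import numpy as np

def normalize(v):
    # Unit-length vector; guard against all-zero rows.
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def term_centrality(term_vec, doc_vec):
    """Illustrative only: cosine similarity between the normalized term
    vector and the normalized document vector in the latent space.
    Terms central to the document score high, general terms score low;
    STTM derives its term difficulty score from this kind of geometry."""
    return float(np.dot(normalize(term_vec), normalize(doc_vec)))
```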
General Idea about Linear Embedding (figure): a term vector t embedded among document vectors D1-D6 with associated weights w1-w6
Cohesion • When units tend to "stick together", the property is called cohesion • We compute cohesion between terms in sequence • The more cohesive the terms in a document are, the easier it is for a person to comprehend the discourse
Computation of Cohesion • We know related terms cluster close to each other in the latent space obtained via SVD • We have to compute the cluster memberships of each of the terms as SVD does not directly give term memberships to clusters • We use k-means because of its simplicity and ability to handle large datasets
How do we compute cohesion? • Take the document's terms in sequence: w1, w2, w3, ..., w10 • Determine the cluster memberships of each pair of consecutive terms • w1 (C1) and w2 (C1): same cluster, so we conclude they are cohesive • w2 (C1) and w3 (C4): different clusters, so we compute the cosine similarity between the two cluster centroids • Continue along the sequence for the remaining consecutive pairs
Cohesion using cosine similarity • If the cluster centroids are close to each other, the cosine similarity is high • A high cosine similarity means the two clusters are closely related
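A minimal sketch of this step, assuming k-means centroids are obtained from the SVD term vectors and assuming the per-pair scores are averaged into a document-level cohesion score (the averaging step is my assumption; the slides only describe the per-pair computation):

```python
import numpy as np
from sklearn.cluster import KMeans

def document_cohesion(term_sequence, term_vectors, vocab_index, n_clusters=50):
    """Cohesion of a document from its sequence of terms.

    term_sequence : list of terms in the order they appear
    term_vectors  : T x f matrix of SVD term vectors
    vocab_index   : dict mapping a term to its row in term_vectors
    n_clusters    : illustrative; must not exceed the number of terms
    """
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(term_vectors)
    centroids = km.cluster_centers_

    scores = []
    for w1, w2 in zip(term_sequence, term_sequence[1:]):
        c1 = km.labels_[vocab_index[w1]]
        c2 = km.labels_[vocab_index[w2]]
        if c1 == c2:
            scores.append(1.0)                 # same cluster: fully cohesive
        else:                                  # different clusters: centroid cosine
            a, b = centroids[c1], centroids[c2]
            scores.append(float(np.dot(a, b) /
                                (np.linalg.norm(a) * np.linalg.norm(b))))
    return float(np.mean(scores)) if scores else 0.0
```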
Conceptual Difficulty Score (equation on slide), built from: • the conceptual difficulty score for document j • the term difficulty score for document j • the cohesion score of document j • a parameter in [0, 1] controlling the relative weights of the two components
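The combining equation did not survive extraction; below is a minimal sketch under the assumption that the score is a weighted combination of the term difficulty and cohesion components, with cohesion inverted because more cohesive documents are easier to comprehend (the exact combination is defined in the paper):

```python
def conceptual_difficulty(term_difficulty_score, cohesion_score, weight):
    """Illustrative weighted combination; weight lies in [0, 1].
    Assumption: cohesion lowers difficulty, so it enters as (1 - cohesion).
    The paper defines the actual form of the combination."""
    return weight * term_difficulty_score + (1.0 - weight) * (1.0 - cohesion_score)
```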
Empirical Evaluation - Dataset • Standard test collections do not have readability judgments • We chose the Psychology domain • Crawled web pages from Wikipedia, Psychology.com, and Simple English Wikipedia • Total web page count = 167,400 • No term stemming • Tested both with stopwords retained and with stopwords removed
Retrieval of web pages • Indexed the web pages using a small-scale search engine (Zettair) • Retrieved web pages for each query based on relevance • Followed INEX's query/topic generation guidelines • Re-ranked the web pages based on conceptual difficulty • Annotated the top-10 documents for each query
Evaluation Metric • Normalized Discounted Cumulative Gain (NDCG) • Well suited for ranking evaluation because it takes into account the position of an entity in the ranked list, unlike precision, recall, or rank-order correlation measures
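A minimal sketch of NDCG@k over graded judgments; the gain (2^rel - 1) and log2 discount used here are the common variant, assumed rather than taken from the paper:

```python
import math

def dcg(relevances, k):
    # Discounted cumulative gain of the first k graded judgments.
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k):
    """NDCG@k: DCG of the system ranking divided by the DCG of the
    ideal (descending) ordering of the same judgments."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# e.g. graded readability labels of the top-5 re-ranked documents
print(ndcg([3, 2, 3, 0, 1], k=5))
```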
Conclusions and Future Work • We proposed a conceptual difficulty ranking model • It requires no training data or ontology • Main novelty: the use of a conceptual model • Achieved significant improvements in our experiments • In the future, we will study how the link structure of the web could aid conceptual difficulty ranking