Domain-Specific Iterative Readability Computation Jin Zhao 13/05/2011
Domain-Specific Resources WING, NUS
Domain-Specific Resources
Domain-specific resources target varying audiences.
[Figures: modular arithmetic page from Wikipedia; modular arithmetic page from Interactivate.com]
Challenge for a Domain-Specific Search Engine
How to measure readability for domain-specific resources?
Literature Review
• Heuristic-based Readability Measures
  • Weighted sum of text feature values
  • Examples:
    • Flesch Kincaid Reading Ease (FKRE): [Flesch48]
    • Dale-Chall readability formula: [Dale&Chall48]
Quick and indicative, but they often oversimplify.
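To make "weighted sum of text feature values" concrete, here is a minimal Python sketch of the Flesch Reading Ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The syllable counter is a crude vowel-group heuristic assumed for illustration only, and the function names and sample sentences are hypothetical.

```python
import re

def count_syllables(word):
    """Crude estimate (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Weighted sum of two surface features: words per sentence, syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical sample texts; a higher score means easier to read.
easy = "The cat sat on the mat. It was a big cat."
hard = ("Modular arithmetic generalizes congruence relations "
        "over equivalence classes of integers.")
print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```

Both features are purely surface-level, which illustrates why such measures are quick but blind to how difficult the underlying concepts are.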
Literature Review
• Natural Language Processing and Machine Learning Approaches
  • Extract deep text features and use supervised learning methods to generate models for readability measurement
  • Text features
    • Unigram [Collins-Thompson04], parse tree height [Schwarm05], discourse relations [Pitler08]
  • Supervised learning techniques
    • Support Vector Machine (SVM) [Schwarm05], k-Nearest Neighbor (KNN) [Heilman07]
More accurate, but they require an annotated corpus and ignore domain-specific concepts.
Literature Review
• Domain-Specific Readability Measures
  • Derive information about domain-specific concepts from expert knowledge sources
  • Examples:
    • Open Access and Collaborative Consumer Health Vocabulary [Kim07]
    • Medical Subject Headings ontology [Yan06]
  • Handle domain-specific concepts, but expert knowledge sources are expensive and not always available
Key qualities of a good readability measure: effective, portable and domain-aware.
Intuitions
• A domain-specific resource is less readable if it contains more difficult concepts
• A domain-specific concept is more difficult if it appears in less readable resources
• Use an iterative computation algorithm to estimate these two scores from each other
• Example:
  • Pythagorean theorem vs. ring theory
Iterative Computation (IC) Algorithm
• Graph Construction
  • Construct a graph representing resources, concepts and occurrence information
• Score Computation
  • Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts
  • Two versions: heuristic and probabilistic
• Required Input
  • A collection of domain-specific resources
  • A list of domain-specific concepts
Graph Construction
Resource 1: "…Pythagorean theorem can be written as a² + b² = c², where c represents the length of the hypotenuse…"
Resource 2: "…The sine function (sin) can be defined as the ratio of the side opposite the angle to the hypotenuse…"
Concept List: …, right triangle, Pythagorean theorem, hypotenuse, sine function, cosine function, …
[Figure: graph linking the resource nodes Resource 1 and Resource 2 to the concept nodes right triangle, Pythagorean theorem, hypotenuse, sine function and cosine function]
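The graph construction step can be sketched in Python as follows. The slides do not say how concept occurrences are detected, so the case-insensitive substring match used here is an assumption, and `build_graph` is a hypothetical helper name. The resource excerpts and concept list are the ones shown on this slide.

```python
def build_graph(resources, concepts):
    """Map each resource id to the set of listed concepts that occur in its text.

    Occurrence test (assumption): case-insensitive substring match.
    """
    graph = {}
    for resource_id, text in resources.items():
        lowered = text.lower()
        graph[resource_id] = {c for c in concepts if c.lower() in lowered}
    return graph

resources = {
    "Resource 1": "Pythagorean theorem can be written as a^2 + b^2 = c^2, "
                  "where c represents the length of the hypotenuse",
    "Resource 2": "The sine function (sin) can be defined as the ratio of the "
                  "side opposite the angle to the hypotenuse",
}
concepts = ["right triangle", "Pythagorean theorem", "hypotenuse",
            "sine function", "cosine function"]

graph = build_graph(resources, concepts)
for resource_id, found in graph.items():
    print(resource_id, sorted(found))
```

In this sketch both resources link to "hypotenuse", which is what lets scores flow between them during the iterative computation. Plain substring matching has an obvious weakness: a text mentioning only "cosine function" would also match "sine function", so a real system would likely need token-aware matching.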
Score Computation (Heuristic)
• Initialization
  • Resource node: FKRE
  • Concept node: average score of its adjacent nodes
• Iterative Computation
  • Each node: original score + average of the original scores of its adjacent nodes
[Figure, Initialization: resource nodes w, x, y, z with scores 2.00, 4.00, 3.00, 1.00; concept nodes a, b, c with scores 2.00, 2.50, 3.00]
[Figure, Iteration 1: resource nodes w, x, y, z with scores 3.00, 5.25, 4.75, 7.00; concept nodes a, b, c with scores 4.00, 6.00, 5.00]
Score Computation (Heuristic)
• Termination Condition
  • The rank order of the resource nodes stabilizes
[Figure, Iteration 2: resource nodes w, x, y, z with scores 9.75, 10.25, 13.00, 7.00; concept nodes a, b, c with scores 8.13, 10.00, 11.88]
[Figure, Iteration 3: resource nodes w, x, y, z with scores 15.13, 18.82, 21.19, 24.88; concept nodes a, b, c with scores 23.51, 20.00, 16.51]
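The heuristic update can be traced with a short Python sketch. The edge set and the assignment of initial scores to w, x, y, z below are not stated in the slide text; they are inferred as one configuration consistent with the numbers shown in the figures. "Original score" is read here as the score from the previous iteration, and every node is assumed to have at least one edge.

```python
# Assumption: resource-concept edges inferred from the slide's numbers.
EDGES = [("w", "a"), ("x", "b"), ("x", "c"), ("y", "a"), ("y", "b"), ("z", "c")]
# Assumption: which initial score belongs to which resource node.
RESOURCES = {"w": 1.0, "x": 2.0, "y": 3.0, "z": 4.0}

def build_neighbours(edges):
    """Adjacency lists in both directions of the resource-concept graph."""
    neighbours = {}
    for resource, concept in edges:
        neighbours.setdefault(resource, []).append(concept)
        neighbours.setdefault(concept, []).append(resource)
    return neighbours

def neighbour_mean(node, scores, neighbours):
    adjacent = neighbours[node]
    return sum(scores[n] for n in adjacent) / len(adjacent)

def initialise(resource_scores, neighbours):
    """Concept nodes start at the average score of their adjacent resources."""
    scores = dict(resource_scores)
    for node in neighbours:
        if node not in resource_scores:
            scores[node] = neighbour_mean(node, resource_scores, neighbours)
    return scores

def step(scores, neighbours):
    """Synchronous update: every node reads only the previous iteration's scores."""
    return {n: scores[n] + neighbour_mean(n, scores, neighbours) for n in scores}

def resource_ranking(scores, resource_ids):
    return sorted(resource_ids, key=lambda r: scores[r])

neighbours = build_neighbours(EDGES)
history = [initialise(RESOURCES, neighbours)]
for _ in range(3):  # three iterations, as on the slides
    history.append(step(history[-1], neighbours))

rankings = [resource_ranking(s, RESOURCES) for s in history]
for i, (scores, ranking) in enumerate(zip(history, rankings)):
    print(i, {n: round(v, 2) for n, v in sorted(scores.items())}, ranking)
```

Under these assumptions the resource ranking is the same at iterations 2 and 3, which matches the termination condition. Exact arithmetic gives values that differ from the slide in the last digit for three nodes (for example 18.81 rather than 18.82), consistent with the slide rounding intermediate scores. The ranking is also unchanged between initialization and iteration 1, so a rule that stops at the first unchanged comparison would halt earlier than the slides do; how the original algorithm handles this is not stated.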
Score Computation (Heuristic)
• Single-valued score for each node
  • Unable to handle concepts of varying difficulties
• Simple averaging in score computation
  • Difficult to incorporate sophisticated computational mechanisms
Score Computation (Probabilistic)
• Initialization
  • Resource node: sentence sampling
  • Concept node: resource sampling
[Figure, Initialization: resource nodes w, x, y, z; concept nodes a, b, c]
Score Computation (Probabilistic)
• Iterative Computation
  • Modified Naïve Bayes Classification
[Equations: Original; Direct Adaptation; Modified]
[Figure: resource nodes and concept nodes]
Evaluation
• Key qualities of a good readability measure
  • Effectiveness
  • Portability
  • Domain-awareness
Effectiveness
• Corpus of math webpages
• Metrics:
  • Pairwise accuracy
  • Spearman's rho
• Baselines:
  • Heuristic
    • FKRE
  • Supervised learning
    • NB, SVM, MaxEnt
    • Binary concept features only
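Both metrics compare a predicted ordering against annotated readability levels. A minimal pure-Python sketch follows; the treatment of ties (gold-tied pairs skipped, tied values given average ranks) is an assumption, since the slides do not specify it, and the sample data is hypothetical.

```python
from itertools import combinations

def pairwise_accuracy(gold, predicted):
    """Fraction of pairs with different gold levels that are ordered the same way."""
    correct = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # assumption: pairs tied in the gold standard are skipped
        total += 1
        if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
            correct += 1
    return correct / total if total else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the two rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

gold = [1, 1, 2, 3, 4]                 # hypothetical annotated levels
predicted = [0.9, 1.4, 1.2, 3.1, 2.8]  # hypothetical scores from a measure
print(pairwise_accuracy(gold, predicted), spearman_rho(gold, predicted))
```

Pairwise accuracy only asks whether each pair is in the right order, while Spearman's rho also reflects how far apart the ranks are, so the two can disagree on which measure looks better.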
Portability
• Different selection strategies
  • Resource selection at random
  • Concept selection at random
  • Resource selection by quality
  • Concept selection by TF.IDF
• Performance measurement at 5 levels
  • 20%, 40%, 60%, 80% and 100% of the original resource collection / concept list
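The random selection strategies at the five levels could be set up along these lines. This is a sketch only: the helper name, the fixed seed, the rounding of subset sizes and the placeholder concept names are all assumptions, and the slides do not say whether the subsets at different levels were nested.

```python
import random

LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

def select_at_random(items, fraction, seed=0):
    """Return a random subset holding round(fraction * len(items)) of the items."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    k = round(fraction * len(items))
    return rng.sample(list(items), k)

# Hypothetical concept list; the same helper works for a resource collection.
concepts = [f"concept_{i}" for i in range(10)]
subsets = {level: select_at_random(concepts, level) for level in LEVELS}
for level, subset in subsets.items():
    print(f"{int(level * 100)}%: {len(subset)} concepts")
```

Each subset would then be fed to the readability computation and scored with the same metrics, so that performance can be plotted against the fraction of input retained.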
Portability
[Figures: concept selection strategies; resource selection strategies]
Domain-awareness
• Handling of domain-specific concepts
  • Simple yet effective
• Concepts of multiple difficulty levels?
  • Converge to a single value even in the probabilistic version (PIC)
  • Splitting? (K-Means, GMM, etc.)
  • Other computational mechanisms?
Conclusion
• Iterative Computation
  • Estimates the readability of domain-specific resources and the difficulty of domain-specific concepts in an iterative manner
  • Effective, portable and domain-aware
• Future Work
  • Handling of concepts of multiple difficulty levels