
Effects of overlaying ontologies to TextRank graphs



  1. Effects of overlaying ontologies to TextRank graphs Project Report By Kino Coursey

  2. Outline • Introduction & Background • Ontology based Summarization • Evaluation • Discussion • Future Work • Conclusion

  3. Motivation • An exponentially increasing volume of information requires summarization • Humans are finite • Text is being generated faster than a reader can read • Need to quickly identify the relevance of documents

  4. Central Question: Does knowing more really help? • TextRank and a number of other random-walk NLP algorithms have been applied to areas such as text summarization and keyword extraction. • How would additional information from an ontology like WordNet or Cyc affect such algorithms? Would performance get better or worse?

  5. Evaluation Criteria • The evaluation criterion is the change in TextRank's performance when it is given the extra information. • The evaluation dataset is the Document Understanding Conference 2002 (DUC-2002) summarization test set. • The ROUGE summarization evaluation tool is used to measure the change in performance.

  6. Project Plan • Implement TextRank • Construct an algorithm to import data from Cyc into TextRank • Construct evaluation dataset preprocessor • Develop a parameter tuning process • Measure performance with optimal parameters • Analyze and report results

  7. Implementation • Implemented the Intelligent Surfer model in Perl • Implemented text-to-Cyc graph extraction: a denotation map, using the relations isa, genls, conceptuallyRelated, mainDomain, definingMt • Explored graph visualization technology (it is easier to debug what you can see): Nodes3d from BrainMaps.org

  8. Ontology Based Summarization • Augment TextRank with Cyc relationships • Perform initial context free mapping into Cyc Terms • Perform Ranking process • Select the highest ranked sentences as extractive summary

  9. Intelligent Surfer Model • The Standard Model: for all nodes use [equation not preserved in this transcript] • Intelligent Surfer Model: for all nodes use [equation not preserved in this transcript], with a constraint on Si • Si is apportioned as a function of query relevancy. Here, words in the input text have Si = 1/N while all other nodes have Si = 0. • When you get tired you jump back to the "problem statement", the input.
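The formulas on this slide did not survive in the transcript. As a hedged reconstruction, the two models can be written in the form commonly used for PageRank and for the intelligent surfer of Richardson and Domingos. The symbols R_i (rank of node i), d (damping factor), V (node set), In and Out are assumptions for illustration; only S_i and N come from the slide, and reading N as the number of input-text nodes is an interpretation that makes the S_i sum to one.

```latex
% Assumed notation: R_i = rank of node i, d = damping factor, V = set of nodes.
\[
  R_i \;=\; (1-d)\,S_i \;+\; d \sum_{j \in \mathrm{In}(i)} \frac{R_j}{\lvert \mathrm{Out}(j) \rvert},
  \qquad \sum_{i \in V} S_i = 1 .
\]
% Standard model: the jump target is uniform over all nodes.
\[
  S_i = \frac{1}{\lvert V \rvert} \quad \text{for every } i \in V .
\]
% Intelligent surfer, as described on the slide: jumps return to the input text.
\[
  S_i =
  \begin{cases}
    1/N & \text{if node } i \text{ is one of the } N \text{ nodes from the input text},\\
    0   & \text{otherwise.}
  \end{cases}
\]
```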

  10. Weighted Version • Sum of the outputs • Weighted updates: summation of the weighted outputs of the currently ranked nodes
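The weighted update can be sketched as follows. This is a minimal illustration, not the project's Perl implementation: it assumes that each edge weight is normalized by the total outgoing weight of its source node (as in weighted TextRank), and that the random jump returns only to nodes from the input text. The function name, the `damping` and `iterations` parameters, and the example node names are hypothetical.

```python
from collections import defaultdict


def intelligent_surfer_rank(edges, seeds, damping=0.85, iterations=50):
    """Rank nodes of a weighted directed graph.

    edges: iterable of (source, target, weight) triples.
    seeds: set of nodes that come from the input text (jump targets).
    """
    out_weight = defaultdict(float)   # total outgoing weight per node
    incoming = defaultdict(list)      # target -> [(source, weight), ...]
    nodes = set(seeds)
    for src, dst, w in edges:
        out_weight[src] += w
        incoming[dst].append((src, w))
        nodes.update((src, dst))

    # Jump prior: 1/N for the N input-text nodes, 0 for all other nodes.
    prior = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(prior)

    for _ in range(iterations):
        # Each node receives the weighted share of its in-neighbours' ranks.
        # Nodes without outgoing edges leak rank mass in this simple sketch.
        rank = {
            n: (1 - damping) * prior[n]
            + damping * sum(rank[s] * w / out_weight[s] for s, w in incoming[n])
            for n in nodes
        }
    return rank
```

With a seed set such as `{"hurricane"}`, rank mass flows outward from the input word along the ontology edges and is repeatedly pulled back to it by the jump term, so concepts close to the text score higher than distant ones.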

  11. From text to Cyc graph • Text-to-Cyc graph extraction • Denotation map • Using: isa, genls, conceptuallyRelated, mainDomain, definingMt • Each edge has its own weight associated with it • Finding the right weight is its own process

  12. Finding the right terms

      (denotation-mapper "Hurricane Gilbert swept toward the Dominican Republic Sunday")

      Results:
      (("Hurricane" . HurricaneAsObject)
       ("Hurricane" . HurricaneAsEvent)
       ("Gilbert" . JohnGilbert)
       ("Gilbert" . JodyGilbert)
       ("Gilbert" . MelissaGilbert)
       ("Gilbert" . GilbertStuart-TheArtist)
       ("Gilbert" . GilbertGottfried)
       ("swept" . SweepingAnArea)
       ("swept" . (ThingDescribableAsFnSweep-TheWordAdjective))
       ("toward" . (HypothesizedPrepositionSenseFnToward-TheWordPreposition))
       ("the Dominican Republic" . DominicanRepublic)
       ("Sunday" . wikip-Sunday)
       ("Sunday" . (ThingDescribableAsFnSunday-TheWordAdjective)))

  13. The Big View

  14. Tuning the system with Genetic Algorithms • A steady-state genetic algorithm was used to search for a weighting that maximizes the ROUGE-S score on a subset of documents.

  15. Genetic Algorithm & Evaluation Function • Step 1: Select k members for a tournament (here k=4). • Step 2: For all members in the tournament, evaluate performance on the task and compute fitness. • Step 3: Perform tournament selection by sorting on fitness and creating a parent set and a replacement set. • Step 4: Copy the parents over the replacement set to make children. • Step 5: Apply mutation and crossover operations to the children. • Step 6: Go to step 1.
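The loop on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: genomes are taken to be lists of real-valued weights, crossover is single-point, mutation adds Gaussian noise to one gene, and the `fitness` callable stands in for a ROUGE-S evaluation. The function name and all parameters other than k are hypothetical.

```python
import random


def steady_state_ga(population, fitness, steps, k=4, mutation_scale=0.1, rng=None):
    """Evolve `population` (a list of weight lists) in place; return the best genome."""
    rng = rng or random.Random()
    for _ in range(steps):
        # Step 1: pick k distinct members for the tournament.
        picked = rng.sample(range(len(population)), k)
        # Steps 2-3: evaluate fitness and sort, best first.
        picked.sort(key=lambda i: fitness(population[i]), reverse=True)
        parents, replaced = picked[: k // 2], picked[k // 2:]
        # Step 4: children start as copies of the parents.
        children = [list(population[p]) for p in parents]
        # Step 5a: single-point crossover between the first two children.
        if len(children) >= 2 and len(children[0]) > 1:
            cut = rng.randrange(1, len(children[0]))
            children[0][cut:], children[1][cut:] = children[1][cut:], children[0][cut:]
        # Step 5b: mutate one randomly chosen gene in each child.
        for child in children:
            gene = rng.randrange(len(child))
            child[gene] += rng.gauss(0.0, mutation_scale)
        # The children overwrite the tournament losers; the parents are kept.
        for slot, child in zip(replaced, children):
            population[slot] = child
        # Step 6: loop back to step 1.
    return max(population, key=fitness)
```

Because only the tournament losers are overwritten, the best genome seen so far is never lost, which is the usual reason for choosing a steady-state scheme over full generational replacement.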

  16. Initial GA Evaluation

      Document   TextRank   OntoRank   Ratio
      1          0.0918     0.0952     1.0370
      2          0.4095     0.3937     0.9612
      3          0.2035     0.1991     0.9787
      4          0.2687     0.2823     1.0506
      5          0.0546     0.0588     1.0769
      6          0.1778     0.2222     1.2500
      7          0.3025     0.4034     1.3333
      8          0.2507     0.2507     1.0000
      9          0.1000     0.0952     0.9524
      10         0.1685     0.1575     0.9348
      AVG                              1.0575

      The GA was run on a random subset of documents that scored below average with default settings, and was run until it provided a +5.75% gain over TextRank on the ROUGE-S scores.
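The reported average can be checked directly from the Ratio column. The short sketch below only re-does the arithmetic on the ten ratios listed on this slide; the variable names are illustrative.

```python
# OntoRank / TextRank ratios for documents 1-10, copied from the slide.
ratios = [1.0370, 0.9612, 0.9787, 1.0506, 1.0769,
          1.2500, 1.3333, 1.0000, 0.9524, 0.9348]

average = sum(ratios) / len(ratios)
gain_percent = (average - 1.0) * 100

improved = sum(r > 1.0 for r in ratios)   # documents where OntoRank scored higher
worsened = sum(r < 1.0 for r in ratios)   # documents where OntoRank scored lower

print(f"average ratio {average:.4f}, gain {gain_percent:+.2f}%")
print(f"{improved} improved, {worsened} worsened, "
      f"{len(ratios) - improved - worsened} unchanged")
```

The average is a mean of per-document ratios, so the two large gains on documents 6 and 7 account for most of the +5.75%; five documents improved, four got worse, and one was unchanged.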

  17. Combined Ranking: HurricaneAsObject vs. HurricaneAsEvent • Cyc makes commonsense distinctions that differ from those in an ontology like WordNet. • HurricaneAsObject: "Hurricane Gilbert moved to the north …" • HurricaneAsEvent: "During Hurricane Gilbert many trees were …"

  18. Combined Ranking: Many Gilberts but one hurricane topic • Gilbert is an ambiguous word for Cyc • Yet the word's primary connections are topic related • Similar to human name association in context

  19. EVALUATIONS • Initial GA scores showed a +5% improvement • Evaluation on the whole dataset • Shocking Revelation • Re-Evaluation

  20. First Full evaluation • Performed full per-document evaluation on DUC-2002 • Carried out detailed per-document review of relative performance using ROUGE-S
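ROUGE-S, the measure used for the per-document review, scores a candidate summary by the skip-bigrams (ordered word pairs with any gap between them) that it shares with a reference. The sketch below is a simplified illustration and not the official ROUGE toolkit: it assumes whitespace tokenization, lowercasing, no limit on skip distance, and an equally weighted F-measure. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every ordered pair of tokens, allowing arbitrary gaps."""
    return Counter(combinations(tokens, 2))


def rouge_s(candidate, reference):
    """Skip-bigram F-measure between a candidate and a reference text."""
    cand = skip_bigrams(candidate.lower().split())
    ref = skip_bigrams(reference.lower().split())
    # Multiset intersection: each shared pair counts at most as often
    # as it occurs in both texts.
    matches = sum((cand & ref).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For the reference "police killed the gunman", the candidate "police kill the gunman" shares three of six skip-bigrams, while "the gunman kill police" shares only one, so word order affects the score even without requiring adjacent matches.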

  21. Disappointing full dataset performance

  22. Debugging via Histogramming • Sorted the relative performance on a per-document basis • High variance, with average positive effect +15% and average negative effect -14% • Unfortunately more often negative than positive, so a net negative skew

  23. Revelation • While working on a distributed version of TextRank, discovered that DUC-2002 contains two datasets: the per-document generative summary and the multi-document extractive summary • The system had been using the generative summaries to evaluate an extractive system! • Convert and re-test on the multi-document dataset • No time to re-evolve using the GA for the multi-document data

  24. Multi-document Re-Evaluation

  25. Evaluation Conclusions • Much more encouraging when comparing same data types • Initial weakness prompted analysis of negative result leading to theory covered in discussion • No breakthrough

  26. Discussion • Adding the commonsense graph produces wide variation in TextRank performance, both positive and negative. • TextRank tries to preserve the total information present in a graph. • Adding commonsense to the graph can identify what a reader should be interested in as well as what they probably already know. • In the first case there is an improvement: disambiguation and context are selected. • In the second case the summary transmits redundant information (common sense), which reduces its effective bandwidth.

  27. Discussion • Identification of stopconcepts • The ontology version of stopwords • Nodes that have so much connectivity that they contain little information • Created a stopconcepts list
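One simple way to find candidate stopconcepts is to flag nodes whose connectivity exceeds a threshold. The slides do not say how the stopconcepts list was actually built, so the degree cutoff below is purely an assumption for illustration, as are the function name, the `max_degree` parameter, and the example concept names in the usage note.

```python
from collections import Counter


def find_stopconcepts(edges, max_degree):
    """Return nodes whose degree (in plus out) exceeds `max_degree`.

    edges: iterable of (source, target) pairs from the ontology graph.
    """
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    # Highly connected nodes link to almost everything, so they say
    # little about any particular document.
    return {node for node, count in degree.items() if count > max_degree}
```

For example, a very general concept such as `Thing` that is linked from nearly every term would be flagged and could then be dropped from the graph before ranking, much as stopwords are dropped before keyword extraction.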

  28. Future Work • Run the GA on the multi-document data set • Develop the ability to detect novel information from redundant information • The Ontology ranking process itself is useful • Ontological debugging • Familiarization with the language of the ontology via a form of parallel text

  29. Conclusions • Adding commonsense graphs to TextRank can affect the performance both positively and negatively • Need to identify how to modulate the effects of commonsense information • Having the right data helps! • Spin-offs for the text-to-ontology graph can be useful

  30. References • [Richardson and Domingos 2002] Richardson and Domingos, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, NIPS 2002 • [Mihalcea and Tarau 2004] Mihalcea, R. and Tarau, P. TextRank: Bringing Order Into Texts, EMNLP 2004 • [Mihalcea, et al 2004] Mihalcea, R., Tarau, P. and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation, COLING 2004 • [Mihalcea, et al 2005] Tarau, P., Mihalcea, R. and Figa, E. Semantic Document Engineering with WordNet and PageRank, in Proceedings of the ACM Conference on Applied Computing (ACM-SAC 2005), New Mexico, March 2005 • [Mihalcea and Tarau Patent] Mihalcea, R. and Tarau, P. Graph-based ranking algorithms for text processing, Patent application #20050278325 • [Mihalcea and Tarau 2005] Mihalcea, R. and Tarau, P. Multi-Document Summarization with Iterative Graph-based Algorithms, Proceedings of the First International Conference on Intelligent Analysis Methods and Tools (IA 2005), McLean, VA, May 2005

  31. References • [Conyon and Muldoon 2006] M. J. Conyon and M. R. Muldoon (2006) Ranking the Importance of Boards of Directors. • [Lin and Hovy 2003] Lin, Chin-Yew and E.H. Hovy. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003. • [Nordin and Banzhaf 1997] P. Nordin and W. Banzhaf, "Real time control of a Khepera robot using genetic programming," Cybernetics and Control, Vol. 26, No. 3, pp. 533-561, 1997. • [de Jager 2004] de Jager, D., "PageRank: Three distributed algorithms," M.Sc. thesis, Department of Computing, Imperial College London, London SW7 2BZ, UK, September 2004. • [Brin and Page 1998] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Seventh International World Wide Web Conference, Brisbane, Australia, 1998. http://citeseer.nj.nec.com/brin98anatomy.html • [Ding, et al 2004] L. Ding, T. Finin, A. Joshi, R. Pan, R.S. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs. Swoogle: a search and metadata engine for the semantic web. In Proc. of the 13th ACM Conference on Information and Knowledge Management, pages 652-659, 2004.
