1 / 21

CONNECTING THE DOTS BETWEEN NEWS ARTICLES

CONNECTING THE DOTS BETWEEN NEWS ARTICLES . GUIDE : Prof. Amitabha Mukerjee Ankit Modi (10104) Chirag Gupta (10212). Problem Statement. SOURCE. TARGET. ?. S 1 , S 2 , S 3 ,... S n. Problem Statement. SOURCE. S i. S j. TARGET. S 1 , S 2 , S 3 ,... S n. Motivation.

carr
Download Presentation

CONNECTING THE DOTS BETWEEN NEWS ARTICLES

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CONNECTING THE DOTS BETWEEN NEWS ARTICLES GUIDE : Prof. AmitabhaMukerjeeAnkitModi (10104)Chirag Gupta (10212)

  2. Problem Statement SOURCE TARGET ? S1, S2, S3,...Sn

  3. Problem Statement SOURCE Si Sj TARGET S1, S2, S3,...Sn

  4. Motivation Problem ?Tackling information overload

  5. Motivation Problem ?Tackling information overloadSeeing bigger picture

  6. Motivation Problem ?Tackling information overloadSeeing bigger pictureNavigate between topics

  7. Motivation Domain ?News browsing : One of primary uses of InternetPolitics, Sports, Entertainment etcSearching for relevant news is difficult

  8. Corpus of news articles from The Hindu a delhi court on wednesday convicted sukhdevpehalwan, the third accused in the 2002 nitishkatara murder case, saying that at the time of the incident he too was “present with convicts vikasyadav and vishalyadav,” currently serving life term in tihar jail. Framework

  9. Corpus of news articles from The Hindu Split into words 45['a', 'delhi', 'court', 'on', 'wednesday', 'convicted', 'sukhdev', 'pehalwan,', 'the', 'third', 'accused', 'in', 'the', '2002', 'nitish', 'katara', 'murder', 'case,', 'saying', 'that', 'at', 'the', 'time', 'of', 'the', 'incident', 'he', 'too', 'was', '\x93present', 'with', 'convicts', 'vikas', 'yadav', 'and', 'vishal', 'yadav,\x94', 'currently', 'serving', 'life', 'term', 'in', 'tihar', 'jail', ''] Framework

  10. Corpus of news articles from The Hindu Split into words Stemming 45['a', 'delhi', 'court', 'on', 'wednesdai', 'convict', 'sukhdev', 'pehalwan,', 'the', 'third', 'accus', 'in', 'the', '2002', 'nitish', 'katara', 'murder', 'case,', 'sai', 'that', 'at', 'the', 'time', 'of', 'the', 'incid', 'he', 'too', 'wa', '\x93present', 'with', 'convict', 'vika', 'yadav', 'and', 'vishal', 'yadav,\x94', 'current', 'serv', 'life', 'term', 'in', 'tihar', 'jail', '']0 Framework

  11. Corpus of news articles from The Hindu Split into words Stemming 29['delhi', 'court', 'wednesdai', 'convict', 'sukhdev', 'pehalwan,', 'third', 'accus', '2002', 'nitish', 'katara', 'murder', 'case,', 'sai', 'time', 'incid', 'wa', '\x93present', 'convict', 'vika', 'yadav', 'vishal', 'yadav,\x94', 'current', 'serv', 'life', 'term', 'tihar', 'jail'] Framework Remove Stop words

  12. Corpus of news articles from The Hindu Split into words Stemming [['delhi', 1], ['court', 1], ['wednesdai', 1], ['sukhdev', 1], ['pehalwan,', 1], ['third', 1], ['accus', 1], ['2002', 1], ['nitish', 1], ['katara', 1], ['murder', 1], ['case,', 1], ['sai', 1], ['time', 1], ['incid', 1], ['wa', 1], ['\x93present', 1], ['vika', 1], ['yadav', 1], ['vishal', 1], ['yadav,\x94', 1], ['current', 1], ['serv', 1], ['life', 1], ['term', 1], ['tihar', 1], ['jail', 1], ['convict', 2]] Framework Frequency of 1-grams.Stored in Histograms Remove Stop words

  13. Corpus of news articles from The Hindu Split into words Stemming Bhattacharyya’s Distance DB = - ln (BC(p,q) ):whereBC(p,q) = x € X Σ (p(x).q(x))1/2 is the Bhattacharyya coefficient Framework Bhattacharyya’s Distance Frequency of 1-grams.Stored in Histograms Remove Stop words Reference: [8]

  14. Corpus of news articles from The Hindu Split into words Stemming Framework Bhattacharya’s Distance. Frequency of 1-grams.Stored in Histograms Remove Stop words Dijkstra’s Algorithm Reference: [7]

  15. Sample Results Warrants issued in Jessica case Notice to VikasYadav Charges framed in Katara case Katara attackers declared absconding Katara case: Sukhdev gets lifer

  16. Sample Results US Forces kill osama Inconceivable that no support in Pak : US Laden buried at sea Osama’s pakistan home is no more Death will break Al-Qaeda

  17. Code Snapshot

  18. Evaluation Reference: [1] Coherence (d1, …,dn) = n-1Σi=1Σw 1(w € di ∩ di+1) Every time a word appears in two consecutive articles, we score a point Drawback : Weak links Coherence (d1, …,dn) = i=1…n-1min Σw 1(w € di ∩ di+1)Minimal transition score

  19. References • [1] DafnaShahaf and Prof. Carlos Guestrin : Connecting the dots between news articles. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2010. • [2] DafnaShahaf , Prof. Carlos Guestrin and Eric Horvitz : Trains of thought-Generating information maps. International World Wide Web Conference (WWW), 2012. • [3] Michael D. Lee, Brandon Pincombe and Matthew Welsh : An Empirical Evaluation of Models of Text Document Similarity. In Proceedings of the 27th Annual Conference of the Cognitive Science Society (2005). • [4] Deept Kumar, NarenRamakrishnan, Richard F. Helm, and Malcolm Potts : Algorithms for Storytelling. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008 • [5] M. ShahriarHossain, Joseph Gresock, Yvette Edmonds, Richard Helm, Malcolm Potts and NarenRamakrishnan. Connecting the Dots between PubMed Abstracts. 2012 • [6]Sergey Brin and Lawrence Page : The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN systems 30 (1998) • [7]http://networkx.github.com/documentation/latest/reference/generated/networkx.algorithms.shortest_paths.weighted.dijkstra_path.html#networkx.algorithms.shortest_paths.weighted.dijkstra_path • [8] http://en.wikipedia.org/wiki/Bhattacharyya_distance

  20. Thank you Questions ?

  21. Other Approaches [5] used Soergel distance to calculate distance between documents and then A* algorithm to find the chain [1] used bipartite graph and the notion of influence to find the chain [2] used notion of m-coherence for evaluation of results Page rank method from [6] to can be used to find a chain

More Related