Discover Emerging and Novel Research Topics. TopicTrend. By: Jovian Lin. Introduction. Formulating a research idea is the 1st step for success in academia. A worthy research idea must be original and innovative.
Discover Emerging and Novel Research Topics TopicTrend By: Jovian Lin
Introduction
• Formulating a research idea is the first step toward success in academia.
• A worthy research idea must be original and innovative.
• To come up with innovative research ideas, researchers have to read many published articles…
• …which is time-consuming.
“Is there any shortcut to success?”
“No.”
“There are efficient ways to achieve success.”
Search Engines in Digital Libraries:
Introduction
• Search engines support information seeking and retrieval.
[Diagram: “Search Query” → Search Engine → List of titles (of articles)]
Search Results
Introduction
How useful is this result to the junior researcher?
• Search engines support information seeking and retrieval.
• However, is this enough for the junior researcher?
Junior researchers (FYP students, 1st-year PhD students) need to:
• Define a research topic (from zero knowledge)
• Get help with their survey
• Identify emerging/new research areas to explore
• Determine related topics
Problem Definition
• Junior researchers want to:
  • Understand research topics and trends.
  • Recognize HOT topics.
  • Understand how topics interact and influence research activity.
Problem Definition
• Junior researchers want to:
  • Understand research topics and trends.
  • Recognize HOT topics.
  • Understand how topics interact and influence research activity.
Current Inefficient Method: Enter a search query → View results → Select a few articles to read → Extract new terms from selected article
Problem Definition
• Junior researchers want to:
  • Understand research topics and trends.
  • Recognize HOT topics.
  • Understand how topics interact and influence research activity.
Desired Efficient Method: Enter a search query → View results → TopicTrend, which produces:
  • A list of HOT research topics (related to the search query)
  • A visualization of the research topics
Do it quick!
Evaluation
• Recruited 4 participants:
  • Chemistry / PhD
  • Engineering (Transportation) / PhD
  • Comp Science (AI) / PhD
  • Engineering / FYP
• Participants:
  • Tested TopicTrend using queries from their respective domains.
  • Rated TopicTrend’s output (w.r.t. their query). [Quantitative]
  • Filled out a questionnaire. [Qualitative]
Evaluation
Example query: “machine learning”
• Rating given to each topic in TopicTrend’s output:
  Topic A: 1, Topic B: 0, Topic C: 1, Topic D: 1, Topic E: 1, Topic F: 1, Topic G: 1, Topic H: 1, Topic I: 1, Topic J: 1
• Score: 9/10
[Figure: visualization of Topics A to J]
Evaluation Quantitative Average score = 68.125%
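As a rough illustration of the quantitative scoring, the sketch below assumes that each returned topic receives a rating of 1 or 0 and that a query's score is the fraction of topics rated 1, which is consistent with the 9/10 in the “machine learning” example. The function name and the ratings dictionary are hypothetical, not taken from the project.

```python
def query_score(ratings):
    """Fraction of returned topics that the participant rated 1."""
    if not ratings:
        raise ValueError("no ratings given")
    return sum(ratings.values()) / len(ratings)


# Ratings from the "machine learning" example: every topic rated 1 except Topic B.
ml_ratings = {f"Topic {letter}": 1 for letter in "ABCDEFGHIJ"}
ml_ratings["Topic B"] = 0

print(f"{query_score(ml_ratings):.0%}")  # 90%
```

Averaging such per-query scores over all participants would give an overall figure like the one reported above; the individual per-participant scores are not listed on the slides.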
Evaluation Qualitative
• Questionnaire using a five-point Likert scale (1 = Disagree, 5 = Agree).
• Some examples:
  • “The system was easy to use.”
  • “The system gave interesting results.”
  • “I was able to get a better understanding of the topics.”
  • “I was able to discover trends.”
  • “I was able to discover relationships between topics.”
  • “I was able to discover potential, novel topics.”
• Scores shown on the slide: 4.75 / 5, 4 / 5, 4 / 5, 4 / 5, 4 / 5, 4 / 5
• Details in Project Report.
Conclusion
• TopicTrend is a visualization tool that helps junior researchers:
  • Understand research topics and trends.
  • Recognize HOT topics.
  • Understand how topics interact and influence research activity.
• However, results were mediocre:
  • Due to the presence of stop phrases (e.g., “problem set”, “proposed model”).
• Solutions and Future Work:
  • TF-IDF weighting: no need to enter stop words manually.
    • A statistical measure of how important a word is.
    • The importance increases in proportion to the number of times a word appears in the document...
    • ...but is offset by the frequency of the word in the corpus.
  • Latent Dirichlet Allocation (LDA): views each abstract as a mixture of topics. (David Blei)
  • Online LDA: finds topics faster than normal LDA; analyzes documents in a stream.
  • Dynamic Topic Models (DTM): capture the evolution of each topic’s words over time.
  • Search by exemplar (instead of search by keyword):
    • Benefits users who have difficulty expressing their query.
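To make the TF-IDF idea from the future work concrete, here is a minimal sketch of one common TF-IDF variant applied to phrases rather than single words. It is not part of TopicTrend; the function name and the toy abstracts are made up for illustration. A phrase that occurs in every abstract, such as “proposed model”, gets a weight of zero without being listed as a stop phrase.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {phrase: weight} dict per document.

    docs: list of documents, each a list of phrases (e.g. extracted NPs).
    tf  = occurrences of the phrase in the document / document length
    idf = log(number of documents / number of documents containing the phrase)
    """
    n_docs = len(docs)
    # Count each phrase once per document it appears in.
    doc_freq = Counter(phrase for doc in docs for phrase in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            phrase: (count / len(doc)) * math.log(n_docs / doc_freq[phrase])
            for phrase, count in counts.items()
        })
    return weights


# Toy abstracts (made up): "proposed model" occurs in every one of them.
abstracts = [
    ["proposed model", "topic model", "topic model"],
    ["proposed model", "neural network"],
    ["proposed model", "support vector machine"],
]
for doc_weights in tf_idf(abstracts):
    print(doc_weights)
```

Here “proposed model” scores 0 in every abstract because its inverse document frequency is log(3/3) = 0, while “topic model” keeps a positive weight in the first abstract.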
Implementation
• OpenNLP: a machine-learning-based toolkit for processing natural language text.
• Used OpenNLP to retrieve a list of noun phrases (NPs).
[Diagram: An article → OpenNLP Tools (Sentence Detection → Tokenization → Part-of-Speech (POS) Tagging → Chunking and Retrieving NPs) → NP A, NP B, NP C, NP D, NP E, NP F]
Implementation
• Sentence Detection
Input:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. Those contraction-less sentences don't have boundary/odd cases...this one does.
Output:
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
• Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
• Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
• Those contraction-less sentences don't have boundary/odd cases...this one does.
Implementation
• Tokenization
Input:
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
• Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Output:
• [Pierre] [Vinken] [,] [61] [years] [old] [,] [will] [join] [the] [board] [as] [a] [nonexecutive] [director] [Nov.] [29] [.]
• [Mr.] [Vinken] [is] [chairman] [of] [Elsevier] [N.V.] [,] [the] [Dutch] [publishing] [group] [.]
Implementation
• Part-of-Speech Tagging
Input:
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
• Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Output:
• [NNP] [NNP] [,] [CD] [NNS] [JJ] [,] [MD] [VB] [DT] [NN] [IN] [DT] [JJ] [NN] [NNP] [CD] [.]
• [NNP] [NNP] [VBZ] [NN] [IN] [NNP] [NNP] [,] [DT] [JJ] [NN] [NN] [.]
Implementation
• Text Chunking and Extracting NPs
• Text chunking divides a text into syntactically correlated groups of words.
• Uses the Tokenization and POS Tagging data.
• For example:
  He reckons the current account deficit will narrow to only # 1.8 billion in September.
  Becomes:
  [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
Implementation
• Text Chunking and Extracting NPs
• Text chunking divides a text into syntactically correlated groups of words.
• Uses the Tokenization and POS Tagging data.
• Note the chunk tags: B-Chunk marks the first token of a chunk, and I-Chunk marks a token inside the chunk.
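The B-/I- tags are what make NP retrieval straightforward once chunking is done. The sketch below is a minimal, toolkit-independent illustration in Python (the project itself used OpenNLP); the function name is hypothetical, and the per-token tags for the example sentence are written out by hand to match the bracketed chunks shown earlier.

```python
def extract_noun_phrases(tokens, chunk_tags):
    """Collect the NP chunks from parallel lists of tokens and B-/I-/O tags."""
    phrases, current = [], []
    for token, tag in zip(tokens, chunk_tags):
        if tag == "B-NP":                # a new NP starts here
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-NP" and current:  # continue the NP that is open
            current.append(token)
        else:                            # any other tag closes an open NP
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:                          # NP running to the end of the sentence
        phrases.append(" ".join(current))
    return phrases


tokens = ("He reckons the current account deficit will narrow "
          "to only # 1.8 billion in September .").split()
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP",
        "B-PP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "O"]

print(extract_noun_phrases(tokens, tags))
# ['He', 'the current account deficit', 'only # 1.8 billion', 'September']
```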
Implementation
• An algorithm to calculate the score of an NP from its counts in three age bands.
[List: NP A, NP B, NP C, NP D, NP E, NP F]
Example 1: # (0 ~ 2 years) = 10, # (2 ~ 4 years) = 2, # (4 yrs & beyond) = 1
  Score = (10 + 1) / (10 + 2 + 1 + 20) = 11/33 = 0.333
Example 2: # (0 ~ 2 years) = 1, # (2 ~ 4 years) = 2, # (4 yrs & beyond) = 10
  Score = (1 + 1) / (1 + 2 + 10 + 20), given on the slide as 3/33 = 0.090
Implementation
• An algorithm to calculate the score of an NP.
[List: NP A, NP B, NP C, NP D, NP E, NP F]
Implementation
• Re-rank the list of NPs based on the score.
Original order: NP A, NP B, NP C, NP D, NP E, NP F
After re-ranking: NP B, NP D, NP E, NP C, NP A, NP F
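The scoring and re-ranking steps can be sketched as follows. The formula is an assumption read off the slide's worked example (10, 2 and 1 occurrences giving 11/33): the count for the most recent band plus 1, divided by the total count plus 20. The function name, the parameter names, and the third entry in the toy data are hypothetical.

```python
def np_score(recent, middle, old, recent_bonus=1, smoothing=20):
    """Score an NP from its counts in three age bands.

    recent: count in the 0-2 year band
    middle: count in the 2-4 year band
    old:    count in the band of 4 years and beyond
    The constants 1 and 20 are assumed from the slide's worked example.
    """
    return (recent + recent_bonus) / (recent + middle + old + smoothing)


# (recent, middle, old) counts; the first two tuples come from the slide,
# the third is made up for illustration.
counts = {
    "mostly old": (1, 2, 10),
    "mostly recent": (10, 2, 1),
    "evenly spread": (4, 4, 4),
}

# Re-rank: highest score first, so recently active NPs rise to the top.
ranked = sorted(counts, key=lambda name: np_score(*counts[name]), reverse=True)
print(ranked)  # ['mostly recent', 'evenly spread', 'mostly old']
```

Under this assumed formula, two NPs with the same total count are separated purely by how much of that count falls in the most recent band.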
Implementation
Calculate the relationship strength between NPs by considering the common articles (PIIs) that they have. The more articles they have in common, the thicker the edge.
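One simple way to realize this, assuming the edge weight is just the number of shared article identifiers, is to intersect the sets of PIIs for every pair of NPs. The function name and the placeholder identifiers below are hypothetical.

```python
from itertools import combinations


def edge_weights(np_articles):
    """Map each pair of NPs to the number of article IDs they share.

    np_articles: {noun phrase: set of article IDs (PIIs) containing it}
    Pairs with no article in common get no edge.
    """
    weights = {}
    for (np_a, ids_a), (np_b, ids_b) in combinations(sorted(np_articles.items()), 2):
        shared = len(set(ids_a) & set(ids_b))
        if shared:
            weights[(np_a, np_b)] = shared
    return weights


# Placeholder article IDs, not real PIIs.
np_articles = {
    "topic model": {"pii-1", "pii-2", "pii-3"},
    "latent dirichlet allocation": {"pii-2", "pii-3"},
    "support vector machine": {"pii-4"},
}

print(edge_weights(np_articles))
# {('latent dirichlet allocation', 'topic model'): 2}
```

A visualization layer could then draw each edge with a thickness proportional to its weight.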