Intro of Dataset Used in Dissertation Research
Xiangyu Fan
Research Topic and Used Data
• Focused on recommendation of medical information
• Use medical topic overlap to help improve recommendation
• Use simulation as the research method
• Use the MedicalNewsDaily website as the data source
  • One of the most popular medical news websites
  • Well categorized by professionals
  • 123 unique topics (i.e., categories)
  • Each news article has one main topic and 0-4 sub topics
Building Dataset
Three steps to build the dataset:
• Select 123 unique topics (i.e., the 123 categories in the data source)
• Crawl 100 recent medical news articles for each topic and store them in the DB
• Retrieve the main and sub topics for each article and store them in the DB
Tables in DB
• Article table: 12,300 records (12,300 articles)
  • Article title
  • Article content
  • Publish date
  • Source
• Topic table: 34,290 records (frequency of topic occurrence)
  • Article ID
  • Topic name
  • Topic type (main vs. sub)
• Each article has 2.8 topics on average (1 main topic and ~2 sub topics)
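The two tables above could be sketched as follows. This is a minimal illustration using SQLite, not the schema actually used in the research: the table names (`article`, `topic`), the column names, the `article_id` primary key on the article table, and the helper `add_article` are all assumptions made for the example.

```python
import sqlite3


def create_schema(conn):
    """Create the two tables described on the slide (names are assumed)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS article (
            article_id   INTEGER PRIMARY KEY,  -- assumed key, not listed on the slide
            title        TEXT NOT NULL,
            content      TEXT NOT NULL,
            publish_date TEXT,
            source       TEXT
        );
        CREATE TABLE IF NOT EXISTS topic (
            article_id INTEGER NOT NULL REFERENCES article(article_id),
            topic_name TEXT NOT NULL,
            topic_type TEXT NOT NULL CHECK (topic_type IN ('main', 'sub'))
        );
        """
    )


def add_article(conn, title, content, publish_date, source, main_topic, sub_topics):
    """Insert one article plus one topic row per main/sub topic (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO article (title, content, publish_date, source) VALUES (?, ?, ?, ?)",
        (title, content, publish_date, source),
    )
    article_id = cur.lastrowid
    rows = [(article_id, main_topic, "main")]
    rows += [(article_id, name, "sub") for name in sub_topics]
    conn.executemany(
        "INSERT INTO topic (article_id, topic_name, topic_type) VALUES (?, ?, ?)", rows
    )
    return article_id
```

With this layout, each article contributes one row to `article` and one row per topic to `topic`, which matches the ratio on the slide: 34,290 topic rows over 12,300 articles is about 2.8 topics per article.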
Sample Question 1
• What are the frequently occurring sub topics in articles whose main topic is headache?
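One way this question might be answered is with a self-join on the topic table. The sketch below assumes a SQLite table named `topic` with columns `article_id`, `topic_name`, and `topic_type` (values `'main'` or `'sub'`); these names and the function `top_sub_topics` are illustrative assumptions, not the original implementation.

```python
import sqlite3


def top_sub_topics(conn, main_topic, limit=10):
    """Return (sub_topic, count) pairs for articles with the given main topic.

    Assumes a table topic(article_id, topic_name, topic_type).
    """
    query = """
        SELECT s.topic_name, COUNT(*) AS n
        FROM topic AS m
        JOIN topic AS s ON s.article_id = m.article_id
        WHERE m.topic_type = 'main'
          AND m.topic_name = ?
          AND s.topic_type = 'sub'
        GROUP BY s.topic_name
        ORDER BY n DESC, s.topic_name
        LIMIT ?
    """
    return conn.execute(query, (main_topic, limit)).fetchall()
```

Calling `top_sub_topics(conn, "Headache")` would list the sub topics attached to headache articles, most frequent first.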
Sample Question 2
• Which topic pairs have the strongest correlation?
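The slides do not say which correlation measure is meant. As a hedged sketch, the example below counts how often two topics appear on the same article (raw co-occurrence) and also computes a Jaccard score as one possible normalization. The input format (a dict from article ID to topic names) and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_cooccurrence(article_topics):
    """Count topic pairs and single topics across articles.

    article_topics: dict mapping article_id -> iterable of topic names
    (main and sub topics together). Pairs are stored in alphabetical order.
    """
    pair_counts = Counter()
    topic_counts = Counter()
    for topics in article_topics.values():
        unique = sorted(set(topics))
        topic_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, topic_counts


def jaccard(pair, pair_counts, topic_counts):
    """Share of articles carrying either topic that carry both (assumed measure)."""
    a, b = pair
    both = pair_counts[pair]
    return both / (topic_counts[a] + topic_counts[b] - both)
```

`pair_counts.most_common(10)` then gives the ten pairs that co-occur most often; ranking by `jaccard` instead favors pairs that rarely appear apart.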
Building Simulation Dataset
• Topics with Strong Overlap
  • Select 30 topics with the highest frequency of co-occurrence
  • Average co-occurrence frequency: 63
• Topics with Weak Overlap
  • Select 30 topics with the lowest frequency of co-occurrence
  • Average co-occurrence frequency: 1
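The selection step above might look like the sketch below. It makes two assumptions that the slides do not confirm: that the ranking is over topic pairs, and that only pairs observed together at least once are candidates (which would be consistent with the weak-overlap average of 1). The function names and the `pair_counts` input (a mapping from topic pair to co-occurrence count) are hypothetical.

```python
from collections import Counter
from statistics import mean


def split_by_overlap(pair_counts, k=30):
    """Return the k most and k least frequently co-occurring pairs.

    pair_counts: mapping (topic_a, topic_b) -> co-occurrence count.
    Ties are broken alphabetically so the result is deterministic.
    """
    if k <= 0 or len(pair_counts) < 2 * k:
        raise ValueError("need at least 2 * k pairs and k > 0")
    ranked = sorted(pair_counts.items(), key=lambda item: (-item[1], item[0]))
    strong = ranked[:k]
    weak = ranked[-k:]
    return strong, weak


def average_count(pairs):
    """Mean co-occurrence count of a list of (pair, count) items."""
    return mean(count for _, count in pairs)
```

Applying `average_count` to each group gives the kind of summary figure reported on the slide for the strong and weak sets.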