SMC’13

SMC’13. “Big Data to Knowledge” in the Health Sciences: The Application and Value of Cancer Infodemiology. Georgia Tourassi, PhD. 2013 Smoky Mountains CSE Conference, Gatlinburg, TN, September 5, 2013. Environmental Cancer Risk and Migration Pattern. PIs: Georgia Tourassi / Songhua Xu.

Presentation Transcript


  1. SMC’13 “Big Data to Knowledge” in the Health Sciences: The Application and Value of Cancer Infodemiology. Georgia Tourassi, PhD. 2013 Smoky Mountains CSE Conference, Gatlinburg, TN, September 5, 2013

  2. Environmental Cancer Risk and Migration Pattern PIs: Georgia Tourassi / Songhua Xu

  3. Environmental Cancer Risk and Migration Pattern

  4. Infodemiology • “The epidemiology of digital (mis)information” • “The Internet has made measurable what was previously immeasurable: The distribution of health information in a population, tracking (in real time) health information trends over time, and identifying gaps between information supply and demand.” • G Eysenbach, Am J Med 2002

  5. Infodemiology in Action http://www.google.org/flutrends/video/GoogleFluTrends_USFluActivity.mov

  6. Application Areas • Detecting and quantifying disparities in information availability • Monitoring public-health-relevant publications on the Internet • Tracking effectiveness of health marketing campaigns • Monitoring health-related behaviors • Syndromic surveillance • Unknown drug side effects and complications • …

  7. Social Media Use among Internet Users Chou, WS et al. 2009. Social Media Use in the United States: Implications for Health Communication, J Med Internet Res, 11(4): e48.

  8. Social Media Use among Internet Users Chou, WS et al. 2009. Social Media Use in the United States: Implications for Health Communication, J Med Internet Res, 11(4): e48.

  9. Cancer Community • One in five internet users with cancer • A growing number of cancer patients share online their personal stories regarding their symptoms, treatments, emotional and physical concerns, and many other issues arising throughout the cancer diagnosis, treatment, and recovery phases. • Promising potential for knowledge discovery by analyzing user-generated content in online cancer communities.

  10. CASE STUDY 1 Parity and Breast Cancer Risk

  11. Case-Control Study (diagram labels: Knowledge Discovery • Cases with Breast Cancer: Childbirth / No Childbirth • Population Controls without Breast Cancer: Childbirth / No Childbirth)

  12. Conventional Data Collection: Hospitals, Organizations, Institutes

  13. Proposed Data Collection: Obituaries, On-Line Obituaries

  14. Web Crawling and Text Parsing: Web Crawler, Local Newspaper Websites

  15. Information Retrieval - Age

  16. Information Retrieval - Gender

  17. Information Retrieval - Childbirth

  18. Information Retrieval - Cause of Death
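The four information-retrieval steps above (age, gender, childbirth, cause of death) can be sketched as a small rule-based parser. This is only an illustration: the regular expressions, pronoun lists, and the `parse_obituary` function are hypothetical, since the slides do not list the actual rules or dictionaries that were used. The output field names follow the labels that appear on the architecture slide (`age_at_death`, `gender`, `breast_cancer`, `lung_cancer`).

```python
import re

# Hypothetical rules for illustration; the slides do not list the actual patterns.
AGE_RE = re.compile(r"\baged?\s+(\d{1,3})\b|\b(\d{1,3})\s+years\s+old\b", re.IGNORECASE)
CHILD_RE = re.compile(
    r"\b(?:mother of|survived by (?:her|his) (?:\w+ ){0,3}?(?:sons?|daughters?|children))\b",
    re.IGNORECASE,
)
FEMALE_PRONOUNS = {"she", "her", "hers"}
MALE_PRONOUNS = {"he", "his", "him"}
CAUSE_KEYWORDS = {"breast_cancer": "breast cancer", "lung_cancer": "lung cancer"}


def parse_obituary(text):
    """Turn one obituary text into a structured record using simple rules."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)

    # Age: first match of "age 62", "aged 62", or "62 years old".
    age_match = AGE_RE.search(text)
    age = int(age_match.group(1) or age_match.group(2)) if age_match else None

    # Gender: majority vote over gendered pronouns; None when tied.
    female = sum(token in FEMALE_PRONOUNS for token in tokens)
    male = sum(token in MALE_PRONOUNS for token in tokens)
    if female > male:
        gender = "female"
    elif male > female:
        gender = "male"
    else:
        gender = None

    record = {
        "age_at_death": age,
        "gender": gender,
        "has_child": bool(CHILD_RE.search(text)),
    }
    # Cause of death: exact keyword lookup.
    for field, phrase in CAUSE_KEYWORDS.items():
        record[field] = phrase in lowered
    return record


sample = (
    "Jane Doe, age 62, passed away after a long battle with breast cancer. "
    "She is survived by her two daughters."
)
record = parse_obituary(sample)
print(record)
```

A keyword rule like this cannot tell biological children from stepchildren, and it treats any mention of a cancer type as the cause of death, so a real system would need the kind of cleaning step the next slide describes.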

  19. Data Collection • Obituaries published online 2000-2012 • 59,002 w/ “breast cancer” • 50,927 w/out “breast cancer” • After “cleaning” • 20,332 case group • 15,946 w/ at least one biological child • 15,954 control group • 13,548 w/ at least one biological child

  20. Case and Control Groups • Case group: 20,332 women, 15,946 with children (78.4%) • Control group: 15,954 women, 13,548 with children (84.9%)

  21. Childbirth Incidence

  22. Odds Ratio • Ages 30-69 years • Age-adjusted by the 2010 US Standard Population • Odds of childbirth incidence in the case group: 13,643 / 4,284 = 3.2 • Odds of childbirth incidence in the control group: 6,556 / 1,545 = 4.2 • Odds ratio (of childbirth incidence) = 0.74, CI: (0.69, 0.79)
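As a rough check on the odds-ratio slide, the sketch below computes a crude odds ratio from the four counts printed there, with a 95% confidence interval on the log scale (Woolf's method). The choice of interval method and the function name `odds_ratio_ci` are assumptions; the slides do not say how the interval was obtained. Because the slide's result is age-adjusted, this crude calculation comes out close to the reported 0.74 (0.69, 0.79) but does not match it exactly.

```python
import math


def odds_ratio_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Crude odds ratio with a Woolf (log-scale) confidence interval."""
    odds_ratio = (case_exposed / case_unexposed) / (ctrl_exposed / ctrl_unexposed)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log = math.sqrt(
        1 / case_exposed + 1 / case_unexposed + 1 / ctrl_exposed + 1 / ctrl_unexposed
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Counts as printed on the slide ("exposed" here means at least one childbirth).
odds_ratio, lower, upper = odds_ratio_ci(13643, 4284, 6556, 1545)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An odds ratio below 1 with an interval that excludes 1 means childbirth is less common among the breast cancer cases than among the controls in this dataset.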

  23. Reliability?

  24. Number of Children & Breast Cancer Risk

  25. Sample Size

  26. Discussion • Limitation of obituaries • Cannot derive effect of additional factors (e.g., age at first pregnancy, breastfeeding, lifestyle choices) • Other types of online patients’ personal life stories can overcome these limitations

  27. CASE STUDY 2 Geospatial Cancer Mortality Trends in the US

  28. Web Mining for Deriving Geospatial Cancer Mortality Trends in the US • Collecting, compiling, and reporting the related surveillance statistics is a time consuming process introducing substantial delays in the monitoring process. • We propose to study whether general cancer mortality trends can be adequately captured by automated analysis of text content found in online obituaries published in US newspapers.

  29. Method Overview • We implemented an obituary crawler to collect a large number of obituaries from online local newspapers. • We implemented a rule-based natural language processing system to transform the collected obituary documents into a structured format. • We applied two correction factors to account for anticipated biases in the statistics derived from the collected dataset. • We compared statistical reports derived from the collected obituary dataset with the cancer mortality statistics published by SEER to show that the correction factors yield more accurate cancer mortality reports from the collected dataset.

  30. System Architecture (diagram; labels recovered from the slide) • Stages: Web Crawling, Pre-processing, Natural Language Processing, Database Processing, Data Analyzing • Modules: Sequential Crawler, Random Crawler, Content Extraction Module, Context Extraction Module, Rule-based Context Inference Module, Context Enriching Module, Exact Dictionary-based Chunking Module, Context Integration Module, Rule-processing Module, Database Module, ID Assignment Module, Data-cleansing Module, Statistical Analysis Module • Dictionaries and references: dic_age, dic_year, dic_gender_male, dic_gender_female, newspaper_reference, Metadata Reference, Rule • Other labels: Web, Dictionary, rdm_sm, rdm_lg, kwd_bc, kwd_lc • Data: Html Documents, Obituary Content, Extracted Context, Enriched Context, Integrated Context, RDBMS, Raw Database, Cleansed Database, Statistics Report • Inferred Metadata: age_at_death, year_of_death, gender, breast_cancer/lung_cancer

  31. Data Collection • Obituary Crawler • Based on an online obituary search engine, ObitFinder • Serviced by Legacy.com, one of the largest online obituary providers for the US newspaper industry • 1,100 newspapers, 2005-2009 (200+ GB) • Covering 46 US states (AR, ND, WV, HI, WY excluded) • Data • Random selection • 3,572,122 online obituary articles

  32. Data Analysis • Anticipated Biases • The number of cancer-related obituaries could be biased due to different prevalence of obituaries for different age groups or states • The proportion of obituaries including cause of death could bias the number of cancer-related obituaries • Correction Factors • Referencing the statistics from the CDC Deaths Final Report (2005-2009) • Incorporating cultural “openness” factor of a particular age group or a state

  33. Correction Factors • Adjustment Ratio 1 (Age-based Obituary Distribution across States) • The age-based obituary distribution across states may differ from the age-based death distribution across states • E.g., for Tennessee, [#Obituary(TN)/#Obituary(US)] is 0.86%, but [#Death(TN)/#Death(US)] is 2.43% • Adjustment Ratio 1 for TN is 2.43/0.86 = 2.84 • We can compute adjustment ratio 1 for each state • Adjustment Ratio 2 (Obituary Content Richness) • The proportion of obituaries that include a cause of death may differ by state • http://en.wikipedia.org/wiki/List_of_causes_of_death_by_rate • E.g., 20.7% of California obituaries include cause-of-death related terms, whereas only 5.2% of Alabama obituaries do • Adjustment Ratio 2 for CA is 5.2/20.7 = 0.13 • We can compute adjustment ratio 2 for each state
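The two adjustment ratios on this slide are simple quotients, sketched below with the percentages printed there. The function names are hypothetical, and the slides do not say how the two ratios are combined, so the sketch stops at the ratios themselves. Evaluating the printed figures gives about 2.83 for Tennessee, which is consistent with the slide's 2.84 if the authors used unrounded shares. For California the printed quotient 5.2/20.7 evaluates to about 0.25 rather than the printed 0.13, so one of those figures may have been transcribed incorrectly; the original values cannot be recovered from the transcript.

```python
def adjustment_ratio_1(obituary_share_pct, death_share_pct):
    """Ratio of a state's share of US deaths to its share of collected obituaries."""
    return death_share_pct / obituary_share_pct


def adjustment_ratio_2(state_cause_rate_pct, reference_cause_rate_pct):
    """Ratio of a reference cause-of-death mention rate to the state's own rate."""
    return reference_cause_rate_pct / state_cause_rate_pct


# Percentages as printed on the slide.
ratio_1_tn = adjustment_ratio_1(obituary_share_pct=0.86, death_share_pct=2.43)
ratio_2_ca = adjustment_ratio_2(state_cause_rate_pct=20.7, reference_cause_rate_pct=5.2)

print(f"Adjustment ratio 1 (TN): {ratio_1_tn:.2f}")
print(f"Adjustment ratio 2 (CA): {ratio_2_ca:.2f}")
```

A ratio above 1 scales up a state that is under-represented in the obituary collection, and a ratio below 1 scales down a state whose obituaries mention causes of death unusually often.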

  34. Case Study 1: Breast Cancer (6,935 female subjects)

  35. Case Study 1: Breast Cancer (6,935 female subjects)

  36. Reliability?

  37. Case Study 1: Lung Cancer (5,312 subjects)

  38. Case Study 1: Lung Cancer (5,312 subjects)

  39. Reliability?

  40. Conclusions • Cancer mortality trends can be captured reliably in a time-efficient, cost-effective, and fully automated way by mining content that is openly available on the Internet. • Using breast and lung cancer as case studies, we observed that the trends discovered via web mining were very similar to those reported by NCI. • Proposed correction factors are useful to account for anticipated biases of statistics from obituary datasets.

  41. Summary

  42. Conclusion

  43. Thank you • Georgia Tourassi, PhD (tourassig@ornl.gov) • Songhua Xu, PhD (xus1@ornl.gov)