
Text Metadata Mining: Exploring its potential*

Padmini Srinivasan, School of Library & Information Science, The University of Iowa, Iowa City, IA. padmini-srinivasan@uiowa.edu *Students: Aditya Sehgal, Xin Ying Qiu.


Presentation Transcript


  1. Text Metadata Mining: Exploring its potential* Padmini Srinivasan, School of Library & Information Science, The University of Iowa, Iowa City, IA. padmini-srinivasan@uiowa.edu *Students: Aditya Sehgal, Xin Ying Qiu

  2. Outline 1. Text Mining 2. Metadata-based Topic Profiles 3. Function: exploring topic characteristics via profiles. Problem: study disease research prevalence 4. Conclusions

  3. 1. Text Mining: Novelty and Usefulness • Assist researchers with hypothesis generation, exploration, and testing. • Discover knowledge that is ‘novel’, at least relative to the text collection. • Discover knowledge that is potentially ‘useful’. • Extract patterns, explore relationships. • Propositions/hypotheses need follow-up verification.

  4. Examples • Of all 45 studies in Medline on chemical X, 80% have been done in the context of disease L, 10% in the context of disease M, and the remainder in the context of disease N. • Gene A is known to be associated with disease X. The literature suggests that gene B shows some key ‘similarities’ to A, and therefore B may also be associated with X.

  5. Metadata in Digital Libraries • Supports content organization and management • Provides access to content • Dublin Core Metadata Initiative • RDF: Resource Description Framework • Library of Congress Subject Headings (LCSH) • Medical Subject Headings (MeSH) Question: Can we use metadata for text mining and knowledge discovery? Given a topic, e.g. ‘Toxic waste’, and a collection of texts such as Medline...

  6. Metadata for Text Mining • Describe topics: topic profiles built from the text collection being mined ~ metadata profiles • Compare topics via their profiles: a. topic similarity b. trends over specific features/characteristics • Look for indirect links between topics • Given a topic, look for related topics.

  7. Example MEDLINE Record [Figure: a MEDLINE record with a MeSH phrase and a MeSH qualifier labeled]

  8. [Figure: Semantic Types (134), e.g. Chemical, Genetic Function, Organic Chemical; MeSH Metadata (22,000), e.g. Protein Isoprenylation, Aldehydes, Formaldehyde]

  9. 2. Topic Profiles A set of terms that characterize the topic, with weights assigned to represent their relative importance. {Medline: a vector of MeSH term vectors, one for each of the 134 semantic types.}

  10. Topic: “hip fractures in the elderly” • Search against PubMed: (geriatrics or elderly) AND hip fractures • Extract MeSH metadata terms from retrieved documents • Build weighted profile: a vector of vectors; can be limited to MeSH terms of particular semantic types
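The profile-building steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how weights are computed, so the sketch assumes each term's weight is its share of record-level occurrences within its semantic type. The function name `build_profile`, the `semantic_type_of` lookup, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def build_profile(records, semantic_type_of):
    """Build a 'vector of vectors': semantic type -> {MeSH term: weight}.

    records: one list of MeSH terms per retrieved document.
    semantic_type_of: hypothetical lookup from MeSH term to semantic type.
    Weighting is an assumption: document frequency, normalized so the
    weights within each semantic type sum to 1.
    """
    counts = defaultdict(Counter)
    for terms in records:
        for term in set(terms):  # count a term at most once per record
            sem_type = semantic_type_of.get(term)
            if sem_type is not None:  # skip terms with no known type
                counts[sem_type][term] += 1

    profile = {}
    for sem_type, counter in counts.items():
        total = sum(counter.values())
        profile[sem_type] = {t: c / total for t, c in counter.items()}
    return profile


# Toy data standing in for the MeSH terms of three retrieved records.
records = [
    ["Hip Fractures", "Aged", "Osteoporosis"],
    ["Hip Fractures", "Aged", "Femoral Neck Fractures"],
    ["Hip Fractures", "Accidental Falls"],
]
# Illustrative mapping only; a real one would come from a MeSH/UMLS resource.
semantic_type_of = {
    "Hip Fractures": "Injury or Poisoning",
    "Femoral Neck Fractures": "Injury or Poisoning",
    "Osteoporosis": "Disease or Syndrome",
    "Aged": "Age Group",
}

profile = build_profile(records, semantic_type_of)
print(profile["Injury or Poisoning"])
```

Limiting the profile to particular semantic types then amounts to keeping only the corresponding keys of the returned dictionary.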

  11. Example Profile: Raynaud's disease

  12. Comparing topics via their profiles [Diagram: Topic 1 → PubMed search → documents → MeSH profile; Topic 2 → PubMed search → documents → MeSH profile; the two profiles are compared by cosine similarity. Label: 13,000 genes]
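As a rough sketch of the comparison named on this slide, cosine similarity between two weighted term vectors can be computed as below. The dictionary representation and the function name `cosine_similarity` are assumptions for illustration, and the example weights are made up.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_p * norm_q)


# Hypothetical profiles over the same semantic type.
topic_1 = {"Aldehydes": 0.5, "Formaldehyde": 0.5}
topic_2 = {"Formaldehyde": 1.0}

print(cosine_similarity(topic_1, topic_2))
```

For a full vector-of-vectors profile, one option is to apply this per semantic type and then combine the per-type scores; the slides do not state which combination was used.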

  13. Comparing topics: studying particular characteristics in their profiles. Problem: to study the prevalence of disease research. Characteristic studied: ‘geographical context’.

  14. Topic: “cholera” • Search against PubMed • Extract MeSH metadata terms from retrieved documents • Build weighted profile; vectors can be limited to MeSH terms in ‘Geographical Area’ • Rank nations. Cholera: {0.6 Nigeria, 0.1 Malaysia, ……} Breast Cancer: {0.1 Poland, 0.8 Italy, ……}
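Ranking nations from a profile restricted to ‘Geographical Area’ terms can be sketched as a simple sort by weight. The function name `rank_terms` is hypothetical, and the weights are the two illustrative values shown on the slide, not a complete profile.

```python
def rank_terms(vector):
    """Return terms ordered by descending weight (ties broken alphabetically)."""
    return sorted(vector, key=lambda term: (-vector[term], term))


# Partial 'Geographical Area' vector for cholera, using the slide's example weights.
cholera_geo = {"Nigeria": 0.6, "Malaysia": 0.1}

print(rank_terms(cholera_geo))
```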

  15. Research Prevalence: Mental Disorders (1961-2000) Ranking nations.

  16. Research Prevalence: Cholera (middle & low income; 1991 - 2000) Ranking nations

  17. Research prevalence versus disease prevalence. Question: how does the prevalence of research compare with the prevalence of the disease? For each disease: (a) Rank nations by disease prevalence, estimated by the number of cases reported or the number of deaths (WHO epidemiological data: Statistical Information System, weekly epidemiological records). (b) Rank nations by research prevalence. Compare the two rankings using Spearman’s rank correlation coefficient. Analysis limited to the 1990s.
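The rank comparison can be sketched with the textbook Spearman formula, rho = 1 - 6·Σd² / (n(n² - 1)), which holds when there are no tied ranks. This is an illustration under that no-ties assumption, not the authors' code; the function names and the placeholder nations A, B, C with their scores are hypothetical, and significance testing is not included.

```python
def to_ranks(scores):
    """Map each key to its rank (1 = highest score). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {key: position + 1 for position, key in enumerate(ordered)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation over the keys shared by both score tables."""
    shared = sorted(set(scores_a) & set(scores_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared items to compare rankings")
    ranks_a = to_ranks({k: scores_a[k] for k in shared})
    ranks_b = to_ranks({k: scores_b[k] for k in shared})
    d_squared = sum((ranks_a[k] - ranks_b[k]) ** 2 for k in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder values: disease prevalence (e.g. reported cases) and
# research prevalence (profile weights) for three unnamed nations.
disease_prevalence = {"A": 900, "B": 300, "C": 50}
research_prevalence = {"A": 0.6, "B": 0.1, "C": 0.3}

print(spearman_rho(disease_prevalence, research_prevalence))
```

With real data containing ties, a library routine that assigns average ranks would be the safer choice.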

  18. 19 diseases: Breast cancer, Colorectal cancer, Hodgkin's disease, Meningitis, Dengue, Tuberculosis, Liver neoplasms, Prostate cancer, Ovarian cancer, Esophagus cancer, Cholera, AIDS, Stomach cancer, Melanoma, Leprosy, Malaria, Yellow fever, Trypanosomiasis, Dracunculiasis

  19. [Results table not preserved in the transcript] *Significant at the 0.05 level

  20. Observations: • Diseases most prevalent in the high- or middle-income group show a significant positive correlation (9/10 diseases). • Diseases most prevalent in the low-income group are less likely to show a significant positive correlation (4/9, 44%).

  21. Temporal analysis of disease research • Extract the top 3 ranked diseases studied in the context of each nation • Pool these together • How often does a disease rank in the top 3 positions?

  22. Topic: each nation. Sweden: {0.6 Breast Cancer, 0.1 Malaria, ……} Nigeria: {0.1 Breast Cancer, 0.8 Malaria, ……} Rank diseases

  23. Pooling: (for each decade & each income group)
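The top-3 pooling described in slides 21 to 23 might look like the sketch below. The function names are hypothetical, and the two nation profiles reuse only the partial example weights from slide 22. In the study's setup the pooling would be run once per decade and income group, which here means calling `pool_top_diseases` on each group's nation profiles separately.

```python
from collections import Counter


def top_k(profile, k=3):
    """Return the k highest-weighted diseases in one nation's profile."""
    return sorted(profile, key=lambda d: (-profile[d], d))[:k]


def pool_top_diseases(nation_profiles, k=3):
    """Count how often each disease appears in the top k across nations."""
    pooled = Counter()
    for profile in nation_profiles.values():
        pooled.update(top_k(profile, k))
    return pooled


# Partial per-nation disease profiles, using the example weights from slide 22.
nation_profiles = {
    "Sweden": {"Breast Cancer": 0.6, "Malaria": 0.1},
    "Nigeria": {"Breast Cancer": 0.1, "Malaria": 0.8},
}

print(pool_top_diseases(nation_profiles, k=1))
```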

  24. Observations from the study: • Collecting epidemiological data is extremely complicated. • Collect it at a fine-grained level of analysis, e.g. different forms of Leishmaniasis; Plague. • Complement existing efforts at collecting epidemiological data. • Consider more complex phenomena such as the prevalence of Leishmania and HIV as co-infections. • Research-based evidence to explore policy issues.

  25. Conclusions: • Metadata can be exploited for text mining • MeSH ~ rich metadata scheme • Importance of metadata for digital libraries • Other text mining applications built on DL? • Domain independent ~ accounting! • Thank you!
