
Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001. James E. Ries, Kuichun Su, Gabriel Peterson, MaryEllen C. Sievert, Timothy B. Patrick, David E. Moxley, Lawrence D. Ries



  1. Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study • September 4, 2001 • James E. Ries, Kuichun Su, Gabriel Peterson, MaryEllen C. Sievert, Timothy B. Patrick, David E. Moxley, Lawrence D. Ries • CECS, HMI, Statistics, and SISLT

  2. Abstract • Retrieval tests have assumed that the abstract is a true surrogate of the entire text. However, the frequency of terms in abstracts has never been compared to that of the articles they represent. Even though many sources are now available in full-text, many still rely on the abstract for retrieval … • … In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent

  3. Background • Many retrieval systems still use abstracts as surrogates for full text. • Abstracts are often indexed with respect to word occurrence by employing Zipf’s Law. • The product of a word’s occurrence frequency and the rank of that frequency is approximately constant. • The most frequently and least frequently occurring words contribute little to article content.
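
The rank-frequency relationship mentioned on this slide can be illustrated with a minimal Python sketch. The function name and the toy word list are assumptions made for illustration; real text only approximates the constant product.

```python
from collections import Counter

def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency), most frequent first."""
    ranked = Counter(words).most_common()
    return [(w, r, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Toy input (hypothetical) built so that frequency is exactly 12 / rank.
toy = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
rows = rank_frequency_products(toy)
for word, rank, freq, product in rows:
    print(word, rank, freq, product)
```

On a real corpus the products drift rather than staying fixed, which is why the relationship is usually stated as approximate.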

  4. Background (cont.) • Previous studies have shown that abstracts are sometimes inconsistent with their corresponding articles. However, no study has previously shown that abstracts and articles are inconsistent in a statistical sense.

  5. Methods • 4 medical journals (BMJ, JAMA, Lancet, and NEJM) • Two different countries • Many medical subdisciplines • Regarded as top journals • Available in electronic format • Studied all articles published during 1999 that contained an abstract and were 2 pages or longer. • 1,138 articles – 35 parsing problems = 1,103 articles

  6. Methods (cont.) • The text of the articles and abstracts was downloaded and stored in HTML. • The HTML was parsed into separate abstract and article files via a custom C++ parsing program. • References and figures were removed.

  7. Methods (cont.) • “Content-bearing words” extracted from abstracts and articles • Numerical values, special characters, and captions excluded and used as word delimiters • Removed words contained in a home-grown “stop word list” (words with little or no medical meaning)

  8. Methods (cont.) • Remaining words were conflated using NLM’s LVG tools. • E.g., “reading” -> “read” • Frequencies of all conflated words were calculated for abstracts and articles.
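
The extraction, stop-word removal, conflation, and counting steps described in the Methods slides can be sketched roughly as follows. This is an illustration, not the authors' C++ program: the stop word list is hypothetical (the study's home-grown list is not given in the slides), and `conflate` is a crude stand-in that is not equivalent to NLM's LVG tools.

```python
import re
from collections import Counter

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "in", "a", "was", "were", "with"}

def conflate(word):
    """Crude placeholder for conflation: strips a trailing "ing".
    The study used NLM's LVG tools, which this does NOT reproduce."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word

def content_word_counts(text):
    """Count conflated content-bearing words in a block of text."""
    # Anything that is not an ASCII letter (digits, punctuation,
    # whitespace) is treated as a word delimiter.
    tokens = re.split(r"[^a-z]+", text.lower())
    kept = [conflate(t) for t in tokens if t and t not in STOP_WORDS]
    return Counter(kept)

counts = content_word_counts(
    "Reading of 3 contraceptive trials; the contraceptive was read."
)
print(counts)
```

The same function would be applied separately to each abstract and to the corresponding article text to obtain the two frequency tables that are compared.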

  9. Analysis • Used chi-squared test to determine whether discrepancies between observed occurrences in abstract and occurrences in articles were due to sampling or were truly indicative of a difference in content.

  10. Analysis (cont.) • Example: Rosing (Lancet) • Abstract contained 140 content-bearing words • “contraceptive” appeared 6 times in the abstract and 35 times in the text of the article. • Since text contained 1081 content-bearing words, we would expect 140/1081 * 35 = 3.35 occurrences of this term in the abstract.

  11. Analysis (cont.) • Example: Rosing (Lancet) • Since the actual number of occurrences was 6, the square of the error divided by the expected count was added to the chi-squared statistic for this particular word (i.e., ((6-3.35)^2)/3.35 = 2.10). • Every other content-bearing word in the article was compared to the abstract in this way, and the sum of all of these terms was the total chi-squared statistic for the given article.
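
The per-word calculation from the Rosing example can be written as a short Python sketch. The function names are assumptions made for illustration, and the code assumes a non-empty abstract so that every expected count is positive. Note that evaluating 140/1081 * 35 directly gives about 4.53 rather than 3.35, so one of the transcribed inputs may be off; the snippet below takes the slide's stated expected value of 3.35 as given when reproducing the 2.10 term.

```python
def expected_in_abstract(abstract_total, text_total, text_count):
    """Expected occurrences in the abstract if it mirrors the text's proportions."""
    return abstract_total / text_total * text_count

def chi_squared_term(observed, expected):
    """Contribution of one word: squared error divided by the expected count."""
    return (observed - expected) ** 2 / expected

def chi_squared_statistic(abstract_counts, text_counts):
    """Sum the per-word terms over every content-bearing word in the text."""
    abstract_total = sum(abstract_counts.values())
    text_total = sum(text_counts.values())
    total = 0.0
    for word, text_count in text_counts.items():
        expected = expected_in_abstract(abstract_total, text_total, text_count)
        total += chi_squared_term(abstract_counts.get(word, 0), expected)
    return total

# Term for "contraceptive", using the slide's observed and expected values.
term = chi_squared_term(6, 3.35)
print(round(term, 2))
```

A word that appears in the text but not in the abstract contributes its full expected count to the statistic, since its observed count is 0.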

  12. Analysis (cont.) • We reran our analysis using the Bonferroni Inequality measure to ensure that we would not obtain incorrect results simply by virtue of our large sample size.
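
The slides do not say how the Bonferroni adjustment was applied. In its common form, the correction divides the overall significance level by the number of tests performed. The sketch below assumes one test per abstract/article pair and an overall level of 0.05; both are assumptions, as are the function names and the example p-values.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / num_tests

def flag_disagreements(p_values, alpha=0.05):
    """Mark each test whose p-value falls below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

# Hypothetical p-values for four abstract/article pairs.
flags = flag_disagreements([0.001, 0.02, 0.04, 0.3])
print(flags)
```

With four tests the per-test threshold is 0.0125, so only the first pair is flagged here, although three of the four p-values fall below the uncorrected 0.05 level.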

  13. Cumulative Results w/o Bonferroni

  14. Cumulative Results w/o Bonferroni

  15. Cumulative Results w/ Bonferroni

  16. Cumulative Results w/ Bonferroni

  17. Future Work • Utilize a smaller, more standard stop word list (see Su K, et al., “Comparing Frequency of Word Occurrences in Abstracts and Texts Using Two Stop Word Lists” in Fall 2001 AMIA Proceedings). • Explore “over agreement”.

  18. Future Work (cont.) • Compare phrases (terms) rather than words. • Utilize the UMLS to compare Concept Unique Identifiers (CUIs) via MetaMap rather than words or phrases. • Changes in agreement/disagreement may indicate the use of synonyms, which might still negatively affect retrieval.

  19. Conclusion • In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent. • Our test was “conservative” in the sense that we can only strongly state that a small number of abstract/article pairs do “disagree”. However, the remaining articles can only be said to not conclusively disagree.

  20. Acknowledgements • This research was supported in part by grant T15-089 LM0708-09 from the National Library of Medicine, United States of America.

  21. Questions • http://riesj.hmi.missouri.edu • JimR@acm.org
