
What is Semantic Publishing? And Why Should I Care?

What is Semantic Publishing? And Why Should I Care?. Jabin White Director of Strategic Content Wolters Kluwer Health – P&E May 13, 2010 PSP Presents – Semantic Publishing: An Introduction. Agenda. Introductions Some definitions Vocabularies, Taxonomies, and Ontologies , Oh My!


Presentation Transcript


  1. What is Semantic Publishing? And Why Should I Care? Jabin White Director of Strategic Content Wolters Kluwer Health – P&E May 13, 2010 PSP Presents – Semantic Publishing: An Introduction

  2. Agenda • Introductions • Some definitions • Vocabularies, Taxonomies, and Ontologies, Oh My! • What is metadata, and why should publishers care? • What is semantic tagging, and why should publishers care? • Impact of all this on publishers’… • Workflows/processes • Business cases • The Semantic Web • Final Thoughts, Recommendations

  3. Introductions: My Company • Director of Strategic Content for Wolters Kluwer Health – Professional & Education • Wolters Kluwer Health includes: • Lippincott Williams & Wilkins titles • Ovid • UpToDate • Provation Order Sets • Drug Facts & Comparisons • Medi-Span • Clin-eguide

  4. Introductions: Me • Started as Editorial Assistant • Dove into SGML in the mid-90s working on drug reference • Six years at Elsevier in Electronic Production • Don’t typecast me! • Joined WK Health in May 2009 • Responsible for making sure content flows through company more efficiently (DTDs, Content Management, Authoring Tools, Semantic Enrichment, Product Information Management, etc.)

  5. The Web - Stop the Insanity! • A few humble web stats: • There are 2 billion (billion!) Google searches daily • There are 1 trillion (1,000,000,000,000) unique URLs in Google’s index • There are 2,695,205 articles in English on Wikipedia • It would take 412.3 years to view all the content on YouTube (3/08), but don’t try, because there are 13 hours of video uploaded every minute ** Source: Adam Singer’s “Social Media, Web 2.0 and Internet Stats” site: http://thefuturebuzz.com/2009/01/12/social-media-web-20-internet-numbers-stats/

  6. So What? • Clay Shirky’s concept of “Filter Failure” • When the capacity of people to “keep up with” information is exceeded, curation becomes the value differentiator

  7. Definitions • Controlled vocabulary: a bunch of words, no relationships • But there is an advantage if all users use the same terms to describe things • Taxonomy: a controlled vocabulary with hierarchy • Thesaurus: often used interchangeably with controlled vocabulary, also sometimes referred to as an ontology • Ontology: all of the above; think neural network with a bunch of relationships • Metadata: data about data (we’ll get to that)
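One way to see how these definitions build on each other is to model them as data structures. The sketch below is illustrative only: the terms and the relation names (`is_a`, `treats`) are assumptions picked for the example, not taken from any real vocabulary.

```python
# Controlled vocabulary: an agreed list of terms, no relationships.
controlled_vocabulary = {
    "disease", "cardiovascular disease", "hypertension", "drug", "beta blocker",
}

# Taxonomy: the same terms arranged in a hierarchy (child -> parent).
taxonomy = {
    "cardiovascular disease": "disease",
    "hypertension": "cardiovascular disease",
    "beta blocker": "drug",
}

# Ontology: terms plus typed relationships, stored as (subject, relation, object).
ontology = {
    ("hypertension", "is_a", "cardiovascular disease"),
    ("cardiovascular disease", "is_a", "disease"),
    ("beta blocker", "is_a", "drug"),
    ("beta blocker", "treats", "hypertension"),
}


def ancestors(term, hierarchy):
    """Walk up the taxonomy and return every broader term, nearest first."""
    chain = []
    while term in hierarchy:
        term = hierarchy[term]
        chain.append(term)
    return chain


def related(term, relation, triples):
    """Return the objects linked to `term` by `relation` in the ontology."""
    return sorted(obj for subj, rel, obj in triples if subj == term and rel == relation)
```

The taxonomy can only answer "what is broader than X?", while the ontology can also answer questions such as "what does X treat?".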

  8. Some Level-Setting • Unfortunately, these definitions have been diluted to the point of uselessness by their misuse • Think “Content Management” around the year 2000 • MetaThesaurus – a collection of all of these things • EXAMPLE: UMLS

  9. Information Classification • Pretty Wonky, Pretty Fast • Hypernym: broader term, more general • (car is a hypernym of pinto) • Hyponym: narrower term • Baseball is a hyponym of sports • Meronym: part term • Kansas is a meronym of United States • Holonym: whole term • European Union is a holonym of France
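The four relationship terms come in two inverse pairs, which a small lookup can make concrete. This is a minimal sketch using only the slide's own examples; the dictionary and function names are hypothetical.

```python
# Narrower -> broader ("is a kind of") pairs from the slide.
BROADER = {"pinto": "car", "baseball": "sports"}

# Part -> whole ("is a part of") pairs from the slide.
WHOLE = {"Kansas": "United States", "France": "European Union"}


def hypernym(term):
    """Broader term for `term`, or None if it has none recorded."""
    return BROADER.get(term)


def hyponyms(term):
    """All narrower terms recorded under `term`."""
    return sorted(narrow for narrow, broad in BROADER.items() if broad == term)


def holonym(term):
    """The whole that `term` is a part of, or None."""
    return WHOLE.get(term)


def meronyms(term):
    """All parts recorded for the whole `term`."""
    return sorted(part for part, whole in WHOLE.items() if whole == term)
```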

  10. Taxonomies in STM

  11. Some Heavy Hitters • UMLS • MeSH • SNOMED-CT • ICD-9 and ICD-10 • RxNorm • LOINC, ICPC-93, and VA/KP Subset of SNOMED

  12. UMLS – Unified Medical Language System • More than 5 million terms or named entities • Divided into concepts, and each term has unique identifier • Not a vocabulary, but a mapping BETWEEN vocabularies
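The idea of a mapping between vocabularies can be sketched as a crosswalk keyed on a shared concept identifier. Everything below is a placeholder: `VOCAB_A`, `VOCAB_B`, and the `C-0001` style identifiers are invented for illustration and are not real UMLS source names or concept IDs.

```python
from collections import defaultdict

# (source vocabulary, term in that vocabulary, shared concept ID); all placeholders.
ROWS = [
    ("VOCAB_A", "Myocardial Infarction", "C-0001"),
    ("VOCAB_B", "Heart attack", "C-0001"),
    ("VOCAB_A", "Hypertension", "C-0002"),
    ("VOCAB_B", "High blood pressure", "C-0002"),
]

# Index 1: (vocabulary, lowercased term) -> concept ID.
term_to_concept = {(src, term.lower()): cid for src, term, cid in ROWS}

# Index 2: concept ID -> {vocabulary: preferred term in that vocabulary}.
concept_to_terms = defaultdict(dict)
for src, term, cid in ROWS:
    concept_to_terms[cid][src] = term


def translate(term, from_vocab, to_vocab):
    """Map a term from one vocabulary to another through the shared concept ID."""
    cid = term_to_concept.get((from_vocab, term.lower()))
    if cid is None:
        return None
    return concept_to_terms[cid].get(to_vocab)
```

Because every term resolves to a concept first, adding a third vocabulary means adding rows, not writing a new pairwise mapping.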

  13. UMLS • Vocabularies included in the UMLS: • MeSH Headings in 8 languages • ICPC-93 in 14 languages • WHO Adverse Drug Reaction Terminology in 5 languages • SNOMED-2, SNOMED-3, and UK Clinical Terms (former Read Codes) • ICD-10 in English and German • ICD-10-AM (Australian Modification) • ICD-9 (US Modification)

  14. The Semantic Network (UMLS) • Semantic types are big things like Disease, Syndrome, or Clinical Drug • Semantic relationships are useful links between semantic types (i.e., Clinical Drug treats Disease or Symptom)
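A semantic network of this kind can be used to check whether a statement about two concepts is even plausible. The sketch below assumes a tiny hand-made network built from the slide's example; the concept-to-type assignments are illustrative, not actual UMLS data.

```python
# Relationships permitted between semantic types: (subject type, relation, object type).
SEMANTIC_NETWORK = {
    ("Clinical Drug", "treats", "Disease"),
    ("Clinical Drug", "treats", "Symptom"),
}

# Concept -> semantic type (illustrative assignments).
SEMANTIC_TYPE = {
    "beta blocker": "Clinical Drug",
    "hypertension": "Disease",
    "headache": "Symptom",
}


def is_valid_statement(subject, relation, obj):
    """True if the network allows `relation` between the two concepts' types."""
    subject_type = SEMANTIC_TYPE.get(subject)
    object_type = SEMANTIC_TYPE.get(obj)
    return (subject_type, relation, object_type) in SEMANTIC_NETWORK
```

A check like this rejects "hypertension treats beta blocker" because no relationship runs from Disease to Clinical Drug.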

  15. One Concept, Many Names

  16. MeSH – Medical Subject Headings • An 11-level hierarchy developed and maintained by the National Library of Medicine, part of the US Department of Health and Human Services • The indexing method for MEDLINE/PubMed • Contains more than 16 million references to journal articles in the life sciences, with concentration in biomedicine • 5,200 journals worldwide in 37 languages • Since 2005, 2,000-4,000 references are added daily, Tuesday-Saturday, all indexed to MeSH • Loading suspended for two weeks every November/December while MeSH is updated

  17. The MeSH Staff

  18. SNOMED-CT • Systematized Nomenclature of Medicine (Clinical Terms) • 344,000 concepts, arguably the most complete clinical taxonomy in the world • Developed and maintained by the College of American Pathologists • Licensed by NLM, freely available to license as part of UMLS • US Standard for electronic health information exchange by Health IT standards panel • Adopted for use by US government through the Consolidated Health Informatics (CHI) initiative

  19. ICD-9 and ICD-10 • International Classification of Diseases • Version 9 moving to Version 10 (US is slower than rest of the world on this) • Codes that define diseases: • Example: 411.0 = Postmyocardial infarction syndrome (aka, Dressler’s Syndrome) • Used to drive insurance reimbursements, billing, and other classifications of diseases • Used by US government to compile morbidity and mortality statistics

  20. RxNorm • Standardized names for drugs, collections of drugs, and delivery devices • Like MeSH, developed and maintained by National Library of Medicine • Also includes standard way of expressing generic and trade names, ingredients, strengths, and dose forms

  21. LOINC Mapping Files • Logical Observation Identifiers Names and Codes • A set of universal names and ID codes for identifying laboratory and clinical test results • Used to better communicate with HIT (Health Information Technology) systems • Not much of an impact on publishers, but we should know about them

  22. 1/3

  23. What is Metadata, and Why Should Publishers Care?

  24. What is Metadata? • Reading most definitions of metadata and related standards is like trying to resolve disputes with my kids • Metadata is “data about data” • But what does that mean? • Its use may be increasing, but metadata is NOT new

  25. Why Should Publishers Care • In the move from print publishing to digital, metadata is a powerful tool to help publishers get content in the right place, in the right format, and known to the right systems and people, at the right time • Print books were easy • Everyone knew what they were • You could really only use them one way • They had a beginning, an end, a physical presence, and a set price (mostly)

  26. Why Should Publishers Care • Today, computers are often communicating with one another as much as they are with users (people) • Metadata becomes critical in: • B2B relationships • Enhancing B2C relationships • B2-_________ relationships • The quality of the metadata gives publishers a more powerful voice in what happens to their content

  27. Why Should Publishers Care? • For example: • A digital asset (an image) • What file format is it? • How big is the image? • Who took the picture? • Who owns the picture? • Can you use it on your web site? If you do, what credit do you have to give to the owner? • What date was it created? • Is it part of a collection? • Is it related to another piece of content? • Does it stand alone or is it part of a group of images?
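Each question on this slide maps naturally to a field in a metadata record. The following is a minimal sketch of such a record; the class name, field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ImageMetadata:
    file_format: str                      # What file format is it?
    width_px: int                         # How big is the image?
    height_px: int
    photographer: str                     # Who took the picture?
    rights_holder: str                    # Who owns the picture?
    web_use_allowed: bool                 # Can you use it on your web site?
    credit_line: str                      # What credit do you have to give?
    created: date                         # What date was it created?
    collection: Optional[str] = None      # Is it part of a collection?
    related_content_ids: List[str] = field(default_factory=list)

    def can_publish_on_web(self) -> bool:
        """Web use requires both permission and a non-empty credit line."""
        return self.web_use_allowed and bool(self.credit_line.strip())

    def is_standalone(self) -> bool:
        """True when the image belongs to no collection and links to nothing."""
        return self.collection is None and not self.related_content_ids


# Hypothetical record with placeholder values.
example = ImageMetadata(
    file_format="TIFF",
    width_px=2400,
    height_px=1800,
    photographer="A. Photographer",
    rights_holder="Example Press",
    web_use_allowed=True,
    credit_line="Courtesy of Example Press",
    created=date(2010, 5, 13),
    collection="Pediatric image bank",
)
```

Once the answers live in structured fields, a downstream system can decide on its own whether an image may be reused, without a person reading a permissions file.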

  28. Publishers Should Care • If a publisher’s goal is to disseminate content to the widest possible audience, metadata is critical

  29. Publisher Relationships • Again, in books you had one use model • Metadata allows publishers to have diverse relationships with content consumers and other information providers • Customers (duh) • Aggregators • The Open Web (not Google, but other search engines) • But don’t try to “game” the search engines with adult keywords; that’s just wrong • There have been lawsuits over use of meta keywords, including Playboy suing two adult web sites • Technology partners/developers • Systems wherein content is a “value add” • Multiple output formats

  30. Types of Metadata • HTML Metadata • <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> • <meta name="verify-v1" content="kBoFGUuwppiWVWGx4Ypzkw1Cs1GgMYEMMbfNr7FY65w=" /> • <meta name="description" content="International publisher of professional health information for physicians, nurses, specialized clinicians & students. Medical & nursing charts, journals, and pda software."> • <meta name="keywords" content="springhouse, medical book, nursing journal, medical pda software, lippincott medical reference, lww, lippincott, lww com, medical publisher"> • <link rel="stylesheet" href="/css/style.css" type="text/css"> For people For search engines
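To show how a machine reads this kind of HTML metadata, here is a short sketch using Python's standard `html.parser` module. The `SAMPLE` markup is an abbreviated version of the tags on the slide, and the `MetaCollector` class name is hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.named = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Skip tags such as http-equiv that carry no name attribute.
        if "name" in attributes and "content" in attributes:
            self.named[attributes["name"].lower()] = attributes["content"]


# Abbreviated from the slide's example.
SAMPLE = """
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="description" content="International publisher of professional health information.">
<meta name="keywords" content="medical book, nursing journal, medical publisher">
"""

collector = MetaCollector()
collector.feed(SAMPLE)

# The keywords tag is a comma-separated list meant for search engines.
keywords = [k.strip() for k in collector.named.get("keywords", "").split(",")]
```

The description is what a person sees in a result listing, while the keyword list is read only by software.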

  31. Types of Metadata • Classifying Metadata • ISBN (I told you this wasn’t new) • Dewey Decimal System • Books in Print/CIP/Library of Congress data • MARC records • DOI (Digital Object Identifier) • Descriptive Metadata (sorry, my examples are from STM) • ICD-9 and ICD-10 Codes • MeSH • SNOMED-CT • NANDA, NIC, NOC for Nursing • NDC, HCPCS for drugs OLD NEW

  32. Types of Metadata • Classifying Metadata • ISBN (I told you this wasn’t new) • Dewey Decimal System • Books in Print/CIP/Library of Congress data • MARC records • DOI (Digital Object Identifier) • Descriptive Metadata (sorry, my examples are from STM) • ICD-9 and ICD-10 Codes • MeSH • SNOMED-CT • NANDA, NIC, NOC for Nursing • NDC, HCPCS for drugs OLD NEW • DOI (Digital Object Identifier)

  33. Semantic Metadata • Using controlled vocabularies, extra power can be added to content via semantic tagging to drive: • More precise searching • Contextually-based connections • Lowering of “two terms meaning the same thing” syndrome (hypertension vs. high blood pressure; heart attack vs. myocardial infarction) • Filling in of content gaps • Semantic tagging *is* metadata, but it deserves its own section (coming up)

  34. What is Semantic Tagging?

  35. Semantic Basics • Semantics is tagging that describes what content *is* and not how it should *look* on the page or screen • Contrast to structural tagging, which is made of elements such as <para>, <list>, and <title> • Both are XML, but semantics is like XML on steroids! • Doing semantic tagging without a controlled vocabulary is madness for scholarly publishing • Think “folksonomies”
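The contrast between structural and semantic tagging is easiest to see on a single sentence marked up both ways. In this sketch the `<disease>` and `<drug>` elements, the `concept` attribute, and the `C-0002` style identifiers are all invented for illustration; a real project would take them from its DTD and controlled vocabulary.

```python
import xml.etree.ElementTree as ET

# Structural tagging: says only that this is a paragraph.
STRUCTURAL = "<para>Patients with hypertension were given a beta blocker.</para>"

# Semantic tagging: same text, but the concepts are identified.
SEMANTIC = (
    "<para>Patients with <disease concept='C-0002'>hypertension</disease> "
    "were given a <drug concept='C-0003'>beta blocker</drug>.</para>"
)


def tagged_concepts(xml_text):
    """Return (element name, concept ID, text) for every concept-tagged element."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, el.get("concept"), el.text)
        for el in root.iter()
        if el.get("concept") is not None
    ]


def plain_text(xml_text):
    """Return the readable text with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())
```

Both versions render the same sentence to a reader; only the semantic version tells software what the sentence is about.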

  36. Manual Tagging • DESCRIPTION: A subject matter expert (SME) reads chapter/article, indexes or tags based on content, resulting in enriched content • POSITIVES – If precision needed, and clinical understanding of concepts (ie, judgment) required, probably still the best option • NEGATIVES - Cost prohibitive on large volumes of information; not scalable; inconsistency if controlled vocabulary not followed, or different taggers used

  37. Manual Tagging – Other Factors • Offshore resources have improved in recent years as “knowledge work” has gone global, resulting in cost reductions • Some processes considered “too expensive” to be done manually before could be revisited • Great dependence on *type* of content, which means use cases should drive workflow decisions

  38. Automated Approaches • DESCRIPTION: Software crawls content, adds tags/unique identifiers or finds concepts & patterns to drive more intelligent search or entity extraction • POSITIVES – Very effective in finding “trends” or concepts over a large repository of data; growing industry because of information overload (aka Data Mining, Text Analysis) • NEGATIVES – Sometimes leads to false positives, lack of precision or judgment by machines processing data
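A very simple automated approach is dictionary matching against a controlled vocabulary, and it shows the false-positive problem directly. The sketch below is an assumption-laden toy, not how any commercial engine works; the vocabulary entries and concept IDs are placeholders.

```python
import re

# Term -> concept ID (placeholder identifiers). "cold" is meant as the illness.
VOCABULARY = {"hypertension": "C-0002", "cold": "C-0004"}


def auto_tag(text):
    """Return (offset, matched text, concept ID) for every vocabulary hit."""
    hits = []
    for term, concept_id in VOCABULARY.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), concept_id))
    return sorted(hits)


SENTENCE = "Apply a cold compress; the patient has hypertension and a cold."
hits = auto_tag(SENTENCE)
```

The matcher tags both occurrences of "cold" as the illness, although the first one describes a compress. That is the lack of judgment the slide refers to, and it is the kind of error an SME review would catch.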

  39. Automated Approaches – Other Factors • If used effectively, quick wins on large repositories • Can be used to accomplish projects that would never be attempted (or approved) manually

  40. Combination Approaches • DESCRIPTION: Automated process followed by SME checking (deeper level than straight QA) and addition of specific conceptual information • POSITIVES – best of both worlds for projects that deserve it; can drive precision but can also cover large repositories • NEGATIVES – costs; every time software or people act on your content, there are costs – you don’t get a discount from either because you are doing both 

  41. FUD Around Semantic Search • Semantic Search engines • TEMIS, Collexis, NetBase, Vivisimo, OpenCalais • Finding semantic concepts based on entities and search algorithms • Finding a needle in a haystack • Semantic Tagging • People (SMEs) identify concepts and tag accordingly • Drives precision in search and other things • Finding the right needle in a stack of 10 needles

  42. A Note About “Folksonomies” • Having users “tag” or classify data is increasing in popularity • Not much use in clinical areas of health sciences • If you are sick, do you want to know what 100 people think, or the one expert?

  43. 2/3

  44. Impact on Publishers

  45. Impact on Publishers • Impact depends on how deep you want to go • i.e., what am I going to get in return for investing in metadata, and is it worth it? • More and more, this is not an “if” proposition, it’s “how much” • Publishers who buy in have two basic choices on approach:

  46. Option 1: Metadata in the Workflow • Requires deeper commitment, but has bigger potential upside • Positive impact on product creation and development • Requires thinking about tools, workflows, and enterprise-level systems to allow for creation and MAINTENANCE of metadata • Combination of good metadata in the workflow and creativity in product development team can pay big benefits • Allows participation of authors (or subject matter experts in lieu of) at the beginning of the workflow

  47. Option 2: Outside the Workflow • Requires lesser commitment, but potentially fewer rewards • Can be done with zero impact on current systems • Has benefit of content being in “final form” (whatever that means anymore) when intelligence is added in metadata • Can keep SMEs as a separate offshoot of the workflow – easily outsourced • Can attack this problem with brute force semantic search engines, but this is a different thing

  48. Impact on Publishers • Active vs. Passive Metadata • Active metadata • Publisher intentionally associates markup with certain pieces of content • Often using controlled vocabulary • Includes semantic indexing • Can also be machine-based, using scripts, etc. • Passive metadata • Metadata created based on use of content • Image X was used as part of an image bank on pediatric • Inheritance of properties from parent objects
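Passive metadata can be sketched in a few lines: properties flow down from parent objects, and usage is recorded as it happens. This is a minimal illustration using the standard library's `ChainMap`; the book, chapter, and image records and the `record_use` helper are hypothetical.

```python
from collections import ChainMap

# Active metadata: values someone deliberately assigned at each level.
book = {"publisher": "Example Press", "subject": "pediatrics", "rights": "all rights reserved"}
chapter = {"title": "Chapter 3", "subject": "pediatric cardiology"}
image = {"title": "Figure 3-1", "file_format": "TIFF"}

# Passive metadata by inheritance: look in the image first, then its parents.
effective = ChainMap(image, chapter, book)


def record_use(asset, product):
    """Passive metadata by use: note each product the asset appears in."""
    asset.setdefault("used_in", []).append(product)


record_use(image, "pediatric image bank")
```

The image never had a publisher or subject assigned to it, yet `effective` reports both, with the most specific level winning when values conflict.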

  49. Implications for Search • Machines don’t know the difference between hypertension and high blood pressure • More accurately, machines don’t know they are the SAME • How this is handled is a matter of User Experience (did you mean? … give them the result … etc.), but the content must be tagged first
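The point that content must be tagged first can be shown by comparing a plain keyword search with a search over concept tags. The documents, the synonym table, and the concept ID below are made up for the example.

```python
# Query term -> concept ID (placeholder identifier).
SYNONYMS = {"hypertension": "C-0002", "high blood pressure": "C-0002"}

# Each document carries its text plus the concept tags added during enrichment.
DOCUMENTS = {
    "doc1": {"text": "Managing hypertension in adults", "concepts": {"C-0002"}},
    "doc2": {"text": "Diet and high blood pressure", "concepts": {"C-0002"}},
    "doc3": {"text": "Pediatric asthma", "concepts": set()},
}


def keyword_search(query):
    """Match the literal query string against document text."""
    q = query.lower()
    return sorted(doc_id for doc_id, doc in DOCUMENTS.items() if q in doc["text"].lower())


def concept_search(query):
    """Resolve the query to a concept, then match on tags; fall back to keywords."""
    concept_id = SYNONYMS.get(query.lower())
    if concept_id is None:
        return keyword_search(query)
    return sorted(doc_id for doc_id, doc in DOCUMENTS.items() if concept_id in doc["concepts"])
```

A keyword search for "hypertension" finds only the first document, while the concept search returns both documents for either phrasing. How those results are presented ("did you mean?" or direct results) remains a user-experience decision.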
