
Brian A. Carlsen Apelon, Inc.

Networked Knowledge Organization Systems/Services Workshop, June 28, 2001. Tools For Classification Integration. Brian A. Carlsen, Apelon, Inc. Presentation outline: State of the UMLS Metathesaurus; Life-cycle of a Source; Tools and Processes; Challenges; Further Approaches.





Presentation Transcript


  1. Networked Knowledge Organization Systems/Services Workshop, June 28, 2001 • Tools For Classification Integration • Brian A. Carlsen, Apelon, Inc.

  2. Presentation Outline • State of the UMLS Metathesaurus • Life-cycle of a Source • Tools and Processes • Challenges • Further Approaches

  3. State of the UMLS Metathesaurus • Concept orientation, concept persistence • Growth to over 800,000 concepts and over 60 vocabulary families • Over 1000 users worldwide • Uses of the Metathesaurus • Natural Language Processing • Knowledge Representation • Patient Record Systems • Linking Patient Data to Knowledge Sources • Automated Indexing/Retrieval

  4. Concept and Name Counts By Release Year

  5. English Word, String Counts by Release Year

  6. Outline • State of the UMLS Metathesaurus • Life-cycle of a Source • Tools and Processes • Challenges • Further Approaches

  7. Life-cycle of a Source: Inversion • Source arrives in “machine readable” format* • Many formats are used, including PDF, Clipper dump files, WordPerfect files, unit-record formats, and relational flat files. • Source undergoes “inversion” • Requires a human • Input is this machine readable file • Process is source-specific • Output is a common relational flat-file format used internally.

  8. Life-cycle of a Source: Insertion • A “Recipe” is created • Test insertion to validate recipe • Insertion and matching. • Load common format into database • Match to existing content algorithmically • Use string normalization • Determine SAFE vs. UNSAFE matches • Prepare data for editing • Process is fully undoable
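
The matching step on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Metathesaurus implementation: `normalize` is a crude stand-in for a real normalizer (the slides name LVG for that job), the concept ids are invented, and the rule "exactly one candidate concept is SAFE, more than one is UNSAFE" is an assumed reading of the SAFE vs. UNSAFE distinction.

```python
import re
from collections import defaultdict


def normalize(s: str) -> str:
    """Crude stand-in normalizer: lowercase, drop punctuation, sort the words."""
    words = re.findall(r"[a-z0-9]+", s.lower())
    return " ".join(sorted(words))


def build_index(existing):
    """Map each normalized string to the set of concept ids that carry it."""
    index = defaultdict(set)
    for concept_id, name in existing:
        index[normalize(name)].add(concept_id)
    return index


def match(new_terms, index):
    """Classify incoming terms as NEW, SAFE or UNSAFE (assumed rule, see lead-in)."""
    results = {}
    for term in new_terms:
        candidates = set(index.get(normalize(term), set()))
        if not candidates:
            results[term] = ("NEW", candidates)
        elif len(candidates) == 1:
            results[term] = ("SAFE", candidates)
        else:
            results[term] = ("UNSAFE", candidates)
    return results


# Hypothetical existing content: two distinct concepts share the string "Cold".
existing = [("C001", "Common cold"), ("C002", "Cold"), ("C003", "cold")]
index = build_index(existing)
results = match(["Cold, Common", "COLD", "Pancreas"], index)
```

Here "Cold, Common" normalizes to the same key as "Common cold" and matches one concept, while "COLD" matches two concepts and would be queued for human review.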

  9. Life-cycle of a Source: Editing • Predicate-based partitioning • Workflow management • Review ALL content for new sources • Review UNSAFE content for updates • Human Review • QA Driven Editing • Source-specific QA • Feedback QA • Conservation of Mass QA
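
One plausible reading of "Conservation of Mass QA" is that editing may move a source's terms between concepts but must not create or lose any of them. The sketch below checks that by comparing per-source counts before and after editing. The interpretation, the function names and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def count_by_source(atoms):
    """Count terms per source from (source, name) pairs."""
    return Counter(source for source, _name in atoms)


def conservation_of_mass(before, after):
    """Return {source: (count_before, count_after)} for every source whose count changed."""
    b, a = count_by_source(before), count_by_source(after)
    return {s: (b[s], a[s]) for s in sorted(set(b) | set(a)) if b[s] != a[s]}


# Hypothetical data: one SRC_A term went missing during editing.
before = [("SRC_A", "Cold"), ("SRC_A", "Common cold"), ("SRC_B", "Pancreas")]
after = [("SRC_A", "Cold"), ("SRC_B", "Pancreas")]
violations = conservation_of_mass(before, after)
```

An empty result means the counts balance; any entry points an editor at the source that needs attention.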

  10. Life-cycle of a Source: Release • Synchronize editing changes • State-based model • Release data in desired format • Full release/partial release • Transform base release • “MetamorphoSys” • Remove unlicensed data • Create “Content Views”

  11. Outline • State of the UMLS Metathesaurus • Life-cycle of a Source • Tools and Processes • Challenges • Further Approaches

  12. Tools and Processes: Overview • Humans vs. Computers • Humans are good at making content decisions • Computers are good at automating tasks • Tools vs. Processes • Tools enable computers to automate tasks • Processes keep humans productive.

  13. Tools and Processes: Pre-Editing • No common data representation • Source-by-source conversion to common format • Perl, Unix tools • What would a common format need? • Represent terms and attributes • Represent within-source relationships • Represent hierarchies • Represent external-source relationships • Represent classifications (e.g. Concept)
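
The requirements listed on this slide map naturally onto a small set of record types. The following is only a sketch of what such a common format might hold. Every class and field name is hypothetical, and hierarchies are modeled here simply as "isa" relationships, which is one choice among several.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    term_id: str
    source: str
    name: str
    attributes: dict = field(default_factory=dict)  # term-level attributes


@dataclass
class Relationship:
    from_id: str
    rel: str                          # e.g. "isa" for hierarchy links
    to_id: str
    to_source: Optional[str] = None   # set only for external-source relationships


@dataclass
class InvertedSource:
    source: str
    terms: list = field(default_factory=list)
    relationships: list = field(default_factory=list)
    classifications: dict = field(default_factory=dict)  # e.g. concept id -> term ids

    def parents(self, term_id):
        """Within-source hierarchy parents of a term."""
        return [r.to_id for r in self.relationships
                if r.from_id == term_id and r.rel == "isa" and r.to_source is None]


src = InvertedSource("SRC_A")
src.terms += [Term("T1", "SRC_A", "Pancreas"), Term("T2", "SRC_A", "Head of Pancreas")]
src.relationships += [
    Relationship("T2", "isa", "T1"),
    Relationship("T1", "mapped_to", "X9", to_source="SRC_B"),
]
src.classifications["CONCEPT_1"] = ["T1"]
```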

  14. Tools and Processes: Editing • Workflow Management • Report Generation • State Model vs. Action Model • Actions represented as new states vs. • Single state + actions as data • Human Editing • Interface enabling “high level cognitive editing” • LVG: String Normalization • Automated Editing • Safe vs. Unsafe, Integrities
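
The "single state + actions as data" option can be illustrated with a small action log: there is one current state, each edit is stored as a record, and undo replays a record in reverse. This is a sketch under assumptions. The `Merge` action, the `ActionLog` class and the dict-of-lists state are invented for the example, and undo is assumed to happen in last-in, first-out order.

```python
from dataclasses import dataclass, field


@dataclass
class Merge:
    """Action record: move every term of concept `source` into concept `target`."""
    source: str
    target: str
    moved: list = field(default_factory=list)

    def apply(self, concepts):
        self.moved = concepts.pop(self.source)
        concepts[self.target].extend(self.moved)

    def undo(self, concepts):
        # Valid only when undone in reverse order, so the moved terms sit at the tail.
        kept = len(concepts[self.target]) - len(self.moved)
        concepts[self.source] = concepts[self.target][kept:]
        del concepts[self.target][kept:]


class ActionLog:
    """Single current state plus an ordered list of applied actions."""

    def __init__(self, concepts):
        self.concepts = concepts
        self.log = []

    def do(self, action):
        action.apply(self.concepts)
        self.log.append(action)

    def undo_last(self):
        self.log.pop().undo(self.concepts)


concepts = {"C1": ["Cold"], "C2": ["Common cold"]}
editor = ActionLog(concepts)
editor.do(Merge("C2", "C1"))
after_merge = {k: list(v) for k, v in concepts.items()}
editor.undo_last()
```

Because the log is data, it can also drive report generation or be replayed against a fresh load, which is one reason to prefer it over storing each edit as a full new state.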

  15. Tools and Processes: Release • License Agreements • Content Views • e.g. Indexing View • Filter by Semantic Type • Filter by Language • Alternative Release Formats • Updates • MetamorphoSys
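
A content view, as described here, amounts to filtering a base release by language, semantic type and license. The sketch below shows that idea only. The `Row` layout, the source names and the sample rows are hypothetical and do not reflect MetamorphoSys internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    concept_id: str
    name: str
    language: str
    semantic_type: str
    source: str


def content_view(rows, languages=None, semantic_types=None, licensed_sources=None):
    """Keep rows that pass every filter that was supplied; None means 'no filter'."""
    def keep(row):
        return ((languages is None or row.language in languages)
                and (semantic_types is None or row.semantic_type in semantic_types)
                and (licensed_sources is None or row.source in licensed_sources))
    return [row for row in rows if keep(row)]


base_release = [
    Row("C001", "Common cold", "ENG", "Disease or Syndrome", "SRC_A"),
    Row("C001", "Resfriado común", "SPA", "Disease or Syndrome", "SRC_B"),
    Row("C002", "Pancreas", "ENG", "Body Part, Organ, or Organ Component", "SRC_C"),
]

# An English-only view for a user who has not licensed SRC_C.
view = content_view(base_release, languages={"ENG"}, licensed_sources={"SRC_A", "SRC_B"})
```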

  16. Outline • State of the UMLS Metathesaurus • Life-cycle of a Source • Tools and Processes • Challenges • Further Approaches

  17. Challenges: Ambiguity • Ambiguous Strings • e.g. “Cold” • Solution: Disambiguating strings, Preferred Names with “face validity”, Integrity checks when merging. • Not fully specified Strings • e.g. “Head of Pancreas” within “Malignant Neoplasm of Pancreas” • Solution: Fully specified preferred name.
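
Detecting ambiguous strings is mechanical: group names by string and flag any string that appears in more than one concept. The sketch below does that and then appends a numeric qualifier to produce distinct display names. The qualifier style and the concept ids are illustrative assumptions, and choosing a preferred name with real face validity remains an editorial decision.

```python
from collections import defaultdict


def ambiguous_strings(atoms):
    """Map each case-folded string to its concept ids, keeping only shared strings."""
    by_string = defaultdict(set)
    for concept_id, name in atoms:
        by_string[name.casefold()].add(concept_id)
    return {s: ids for s, ids in by_string.items() if len(ids) > 1}


def disambiguate(atoms):
    """Append ' <n>' to every ambiguous name, numbering concepts in id order."""
    ambiguous = ambiguous_strings(atoms)
    result = []
    for concept_id, name in atoms:
        ids = ambiguous.get(name.casefold())
        if ids:
            n = sorted(ids).index(concept_id) + 1
            result.append((concept_id, f"{name} <{n}>"))
        else:
            result.append((concept_id, name))
    return result


# Hypothetical data: "Cold" names two different concepts.
atoms = [("C001", "Cold"), ("C002", "Cold"), ("C003", "Pancreas")]
flagged = ambiguous_strings(atoms)
display = disambiguate(atoms)
```

The same `ambiguous_strings` check can serve as an integrity check before a merge: two concepts that share only an ambiguous string should not be merged automatically.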

  18. Challenges: What is a Classification? • A classification is any grouping of terms with a consistent semantics. • Thesauri typically group terms by meaning into concepts (synonymy). • Alternatives • Neighborhoods (e.g. Descriptors in MeSH). • Near-synonymy • No classification (identity or term classification). • Lexical • Connecting relationships/attributes to classifiers

  19. Challenges: Precedence • Concepts (or other classifications) generally have a preferred name • A thesaurus will have terms from different sources competing for precedence • Source precedence should be a user-level choice • Preferred name should not be used as a proxy for concept-ness • Every level of classification should have a preferred term • Preferred name exists primarily for “face validity”
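
Treating source precedence as a user-level choice means the preferred name is computed from a ranking the user supplies rather than stored once for everyone. A minimal sketch, assuming each term is a `(source, name)` pair and that unranked sources sort last with ties broken alphabetically (both assumptions made for this example):

```python
def preferred_name(atoms, precedence):
    """Pick the name from the highest-ranked source in the user's precedence list."""
    rank = {source: i for i, source in enumerate(precedence)}
    unranked = len(rank)  # sources missing from the list sort after all ranked ones
    best = min(atoms, key=lambda atom: (rank.get(atom[0], unranked), atom[1]))
    return best[1]


# Hypothetical concept with names from three sources.
concept = [
    ("SRC_A", "Cold"),
    ("SRC_B", "Common cold"),
    ("SRC_C", "Acute nasopharyngitis"),
]

clinician_choice = preferred_name(concept, ["SRC_C", "SRC_A"])
indexer_choice = preferred_name(concept, ["SRC_B", "SRC_A"])
```

The same concept yields different preferred names for different users, which is why the preferred name should not be used as a proxy for concept identity.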

  20. Challenges: Update Model • Constituent sources of a thesaurus will be updated • Editing cycle • Updated sources will require editing • Typically overlap is > 90% • Overlap can safely replace the old version’s content • Safe replacements should not be edited • Ideally, source providers would indicate replacement; otherwise it must be computed • Release • Release changes
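
When a source provider does not indicate replacements, the overlap has to be computed by comparing versions. The sketch below assumes each version is available as a mapping from a stable source code to a name, which is an assumption (some sources have no stable codes). Unchanged entries are the safe replacements, and everything else goes to editing.

```python
def diff_versions(old, new):
    """Split codes into safe (unchanged), changed, added and retired sets."""
    shared = old.keys() & new.keys()
    safe = {code for code in shared if old[code] == new[code]}
    changed = shared - safe
    added = new.keys() - old.keys()
    retired = old.keys() - new.keys()
    return safe, changed, added, retired


# Hypothetical old and new versions of one source.
old = {"A1": "Cold", "A2": "Pancreas", "A3": "Head of pancreas"}
new = {"A1": "Cold", "A2": "Pancreas", "A3": "Head of the pancreas", "A4": "Tail of pancreas"}

safe, changed, added, retired = diff_versions(old, new)
overlap = len(safe) / len(new)  # fraction of the new version needing no review
```

In this toy data only half of the new version overlaps. With the overlap above 90% that the slide describes as typical, the review queue shrinks to a small fraction of the source.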

  21. Outline • State of the UMLS Metathesaurus • Life-cycle of a Source • Tools and Processes • Challenges • Further Approaches

  22. Further Approaches: Description Logic • What is it? • Concepts (or other classifications) are axioms • Relationships (roles) are theorems • The transitive closure of the roles across the concepts is computed to ensure no violations. • e.g. A isa B, B isa C, C isa A (!violation) • When is it useful? • In formalized, static domains like Anatomy • When is it not useful? • Performance > formalism • In dynamic, loosely coupled domains like Genomics
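
The violation in the example (A isa B, B isa C, C isa A) can be found by computing the transitive closure of the isa pairs and looking for any concept that ends up as its own ancestor. The sketch below is a plain graph check with a naive fixed-point loop. It is not a description logic classifier, and it is suitable only for small inputs.

```python
def transitive_closure(isa):
    """Return every (descendant, ancestor) pair implied by the (child, parent) pairs."""
    closure = set(isa)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def cyclic_concepts(isa):
    """Concepts that are their own ancestor, i.e. participants in an isa cycle."""
    return sorted({a for a, b in transitive_closure(isa) if a == b})


cycle = {("A", "B"), ("B", "C"), ("C", "A")}
chain = {("A", "B"), ("B", "C")}

violations = cyclic_concepts(cycle)
no_violations = cyclic_concepts(chain)
```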

  23. Further Approaches: Standards XML • Standardized Terminology/Ontology Representation • XML is the most likely candidate • Ideally would support • Links to external sources • Relationships between different levels of classification • Update model • Description Logic Metadata • Standardized Thesaurus Representation • XML Repository • Standard Object Representations

  24. Conclusion: Lessons Learned • Use the Web • Use current technology • Use Description Logic where appropriate • Make editing intuitive • Automate tasks • “A well-understood, reproducible, automated process that succeeds 95% of the time is a vast improvement over a poorly-understood, labor-intensive process that is believed to succeed 100% of the time.” • Review UNSAFE automated tasks. • Stop automating when marginal utility falls below a threshold.
