Networked Knowledge Organization Systems/Services Workshop June 28, 2001. Tools For Classification Integration. Brian A. Carlsen Apelon, Inc. Presentation Outline. State of the UMLS Metathesaurus Life-cycle of a Source Tools and Processes Challenges Further Approaches.
Networked Knowledge Organization Systems/Services Workshop June 28, 2001 Tools For Classification Integration Brian A. Carlsen Apelon, Inc.
Presentation Outline • State of the UMLS Metathesaurus • Life-cycle of a Source • Tools and Processes • Challenges • Further Approaches
State of the UMLS Metathesaurus • Concept orientation, concept persistence • Growth to over 800,000 concepts and over 60 vocabulary families • Over 1000 users worldwide • Uses of the Metathesaurus • Natural Language Processing • Knowledge Representation • Patient Record Systems • Linking Patient Data to Knowledge Sources • Automated Indexing/Retrieval
Life-cycle of a Source: Inversion • Source arrives in “machine readable” format • Many formats are used, including PDF, Clipper dump files, WordPerfect files, unit-record formats, and relational flat files. • Source undergoes “inversion” • Requires a human • Input is the machine-readable file • Process is source-specific • Output is a common relational flat-file format used internally.
Life-cycle of a Source: Insertion • A “Recipe” is created • Test insertion to validate recipe • Insertion and matching. • Load common format into database • Match to existing content algorithmically • Use string normalization • Determine SAFE vs. UNSAFE matches • Prepare data for editing • Process is fully undoable
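The matching step above can be sketched roughly as follows. This is a minimal illustration, not the Metathesaurus algorithm: `normalize` is a simplified stand-in for real string normalization (lowercase, strip punctuation, sort words), and the rule used here for SAFE vs. UNSAFE (exactly one candidate concept vs. several) is an assumption, since the slides do not define the criteria. All names are hypothetical.

```python
import re
from collections import defaultdict

def normalize(term: str) -> str:
    """Simplified stand-in for string normalization:
    lowercase, drop punctuation, sort the words."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(words))

def match_terms(new_terms, existing):
    """existing maps concept_id -> list of term strings.
    Returns new_term -> (status, sorted candidate concept ids)."""
    index = defaultdict(set)
    for concept_id, terms in existing.items():
        for t in terms:
            index[normalize(t)].add(concept_id)

    results = {}
    for term in new_terms:
        candidates = sorted(index.get(normalize(term), ()))
        if not candidates:
            status = "NEW"
        elif len(candidates) == 1:
            status = "SAFE"    # assumed rule: one unambiguous candidate
        else:
            status = "UNSAFE"  # assumed rule: several candidates, needs review
        results[term] = (status, candidates)
    return results
```

Under these assumptions, a term such as "Cold" that normalizes to the same string in two concepts is flagged UNSAFE and routed to a human editor rather than merged automatically.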
Life-cycle of a Source: Editing • Predicate-based partitioning • Workflow management • Review ALL content for new sources • Review UNSAFE content for updates • Human Review • QA Driven Editing • Source-specific QA • Feedback QA • Conservation of Mass QA
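One way to picture predicate-based partitioning is as an ordered list of named predicates, where each concept lands in the first worklist whose predicate it satisfies. The sketch below is hypothetical (the predicates, field names, and the "other" bin are assumptions), and the `conserves_mass` check is only one plausible reading of "Conservation of Mass QA": nothing that went in may disappear or be duplicated.

```python
def partition(items, predicates):
    """Assign each item to the first predicate it satisfies.
    predicates is an ordered list of (name, function) pairs;
    items matching no predicate go to the 'other' bin."""
    bins = {name: [] for name, _ in predicates}
    bins["other"] = []
    for item in items:
        for name, pred in predicates:
            if pred(item):
                bins[name].append(item)
                break
        else:
            bins["other"].append(item)
    return bins

def conserves_mass(items, bins):
    """Assumed QA check: every input item is in exactly one bin,
    so the bin sizes must add up to the input size."""
    return len(items) == sum(len(b) for b in bins.values())
```

For example, a worklist for new-source content followed by one for UNSAFE matches would mirror the "review ALL content for new sources, review UNSAFE content for updates" policy on the slide.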
Life-cycle of a Source: Release • Synchronize editing changes • State-based model • Release data in desired format • Full release/partial release • Transform base release • “MetamorphoSys” • Remove unlicensed data • Create “Content Views”
Tools and Processes: Overview • Humans vs. Computers • Humans are good at making content decisions • Computers are good at automating tasks • Tools vs. Processes • Tools enable computers to automate tasks • Processes keep humans productive.
Tools and Processes: Pre-Editing • No common data representation • Source-by-source conversion to common format • Perl, Unix tools • What would a common format need? • Represent terms and attributes • Represent within-source relationships • Represent hierarchies • Represent external-source relationships • Represent classifications (e.g. Concept)
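The requirements listed above could be met by a very small data model. The following is a sketch under assumptions, not the internal format the slides refer to: the class names, field names, and the convention that `to_source=None` means "within-source" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Term:
    source: str                       # vocabulary the term came from
    term_id: str                      # identifier within that source
    string: str
    attributes: dict = field(default_factory=dict)
    concept_id: Optional[str] = None  # classification, e.g. a concept

@dataclass(frozen=True)
class Relationship:
    from_id: str
    rel: str                          # e.g. "isa", "mapped_to"
    to_id: str
    to_source: Optional[str] = None   # None means within-source (assumption)

def ancestors(term_id, relationships):
    """Hierarchies expressed as within-source 'isa' relationships:
    collect every ancestor reachable from term_id."""
    parents = {}
    for r in relationships:
        if r.rel == "isa" and r.to_source is None:
            parents.setdefault(r.from_id, []).append(r.to_id)
    seen, stack = [], list(parents.get(term_id, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(parents.get(node, []))
    return seen
```

Treating the hierarchy as ordinary relationships keeps the format flat, which suits the relational flat-file output described for inversion.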
Tools and Processes: Editing • Workflow Management • Report Generation • State Model vs. Action Model • Actions represented as new states vs. • Single state + actions as data • Human Editing • Interface enabling “high level cognitive editing” • LVG: String Normalization • Automated Editing • Safe vs. Unsafe, Integrities
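The contrast between a state model and an action model can be illustrated with a toy concept-merge operation. This is a hypothetical sketch: concepts are plain dictionaries mapping a concept id to a list of strings, and the class and method names are assumptions, not the editing system's design.

```python
import copy

class StateModel:
    """Each action is recorded as a new full state (snapshot)."""
    def __init__(self, initial):
        self.states = [copy.deepcopy(initial)]

    @property
    def current(self):
        return self.states[-1]

    def merge(self, src, dst):
        new = copy.deepcopy(self.current)
        new[dst] = new[dst] + new.pop(src)
        self.states.append(new)

    def undo(self):
        if len(self.states) > 1:
            self.states.pop()

class ActionModel:
    """One current state; actions are stored as data and inverted on undo."""
    def __init__(self, initial):
        self.current = copy.deepcopy(initial)
        self.log = []

    def merge(self, src, dst):
        moved = self.current.pop(src)
        self.log.append(("merge", src, dst, len(moved)))
        self.current[dst] = self.current[dst] + moved

    def undo(self):
        if not self.log:
            return
        _, src, dst, n = self.log.pop()
        terms = self.current[dst]
        split = len(terms) - n
        self.current[src] = terms[split:]   # restore the merged-away concept
        self.current[dst] = terms[:split]
```

The snapshot version makes undo trivial at the cost of storage; the action log is compact and auditable but needs an inverse for every action type.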
Tools and Processes: Release • License Agreements • Content Views • e.g. Indexing View • Filter by Semantic Type • Filter by Language • Alternative Release Formats • Updates • MetamorphoSys
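A content view can be thought of as a filter over the base release. The sketch below assumes each record is a dictionary with hypothetical `semantic_type` and `language` keys; it is an illustration of the idea, not how MetamorphoSys is implemented.

```python
def content_view(records, semantic_types=None, languages=None):
    """Keep records whose semantic type and language are in the allowed sets.
    A filter left as None is not applied."""
    return [
        r for r in records
        if (semantic_types is None or r["semantic_type"] in semantic_types)
        and (languages is None or r["language"] in languages)
    ]
```

Removing unlicensed data could be expressed the same way, with a filter on the record's source.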
Challenges: Ambiguity • Ambiguous Strings • e.g. “Cold” • Solution: Disambiguating strings, Preferred Names with “face validity”, Integrity checks when merging. • Not fully specified Strings • e.g. “Head of Pancreas” within “Malignant Neoplasm of Pancreas” • Solution: Fully specified preferred name.
Challenges: What is a Classification? • A classification is any grouping of terms with a consistent semantics. • Thesauri typically group terms by meaning into concepts (synonymy). • Alternatives • Neighborhoods (e.g. Descriptors in MeSH). • Near-synonymy • No classification (identity or term classification). • Lexical • Connecting relationships/attributes to classifiers
Challenges: Precedence • Concepts (or other classifications) generally have a preferred name • A thesaurus will have terms from different sources competing for precedence • Source precedence should be a user-level choice • Preferred name should not be used as a proxy for concept-ness • Every level of classification should have a preferred term • Preferred name exists primarily for “face validity”
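Making source precedence a user-level choice might look like the following sketch, where the caller supplies an ordered list of sources and the preferred name is taken from the highest-ranked source present. The function name and the source labels in the usage example are hypothetical.

```python
def preferred_name(terms, source_precedence):
    """terms is a list of (source, string) pairs for one concept.
    Sources earlier in source_precedence win; unlisted sources rank last,
    and ties keep the original order of terms."""
    rank = {source: i for i, source in enumerate(source_precedence)}
    best = min(terms, key=lambda t: rank.get(t[0], len(rank)))
    return best[1]
```

Because the precedence list is a parameter rather than stored data, two users can see different preferred names for the same concept without the concept itself changing.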
Challenges: Update Model • Constituent sources of a thesaurus will be updated • Editing cycle • Updated sources will require editing • Typically overlap is > 90% • Overlap can safely replace the old version’s content • Safe replacements should not be edited • Ideally, source providers would indicate replacements; otherwise they must be computed • Release • Release changes
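When a source provider does not indicate replacements, they have to be computed. A minimal sketch, assuming each version is a mapping from a source code to its string and that "overlap" means entries identical in both versions (both are assumptions for illustration):

```python
def diff_versions(old, new):
    """old and new map source code -> string for two versions of a source.
    Unchanged entries are treated as safe replacements; the rest need editing."""
    unchanged = {c for c in new if c in old and old[c] == new[c]}
    changed = {c for c in new if c in old and old[c] != new[c]}
    added = set(new) - set(old)
    retired = set(old) - set(new)
    overlap = len(unchanged) / len(new) if new else 0.0
    return {
        "safe_replacements": unchanged,
        "needs_editing": changed | added,
        "retired": retired,
        "overlap": overlap,
    }
```

With overlap typically above 90%, only the small `needs_editing` set would reach human editors.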
Further Approaches: Description Logic • What is it? • Concepts (or other classifications) are axioms • Relationships (roles) are theorems • The transitive closure of the roles across the concepts is computed to ensure no violations. • e.g. A isa B, B isa C, C isa A (violation!) • When is it useful? • In formalized, static domains like Anatomy • When is it not useful? • When performance matters more than formalism • In dynamic, loosely coupled domains like Genomics
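The "A isa B, B isa C, C isa A" violation above is a cycle in the isa graph. The sketch below detects such cycles with a depth-first search; it covers only this one kind of violation and is not a description-logic classifier. The function name is hypothetical.

```python
def find_isa_cycle(isa_pairs):
    """isa_pairs is a list of (child, parent) pairs.
    Returns one cycle as a list of nodes (first == last), or None."""
    graph = {}
    for child, parent in isa_pairs:
        graph.setdefault(child, []).append(parent)
        graph.setdefault(parent, [])

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node, path):
        color[node] = GREY
        path.append(node)
        for parent in graph[node]:
            if color[parent] == GREY:          # back edge: cycle found
                return path[path.index(parent):] + [parent]
            if color[parent] == WHITE:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

The recursive search is fine for small illustrations; a hierarchy of hundreds of thousands of concepts would need an iterative version to stay within Python's recursion limit.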
Further Approaches: Standards XML • Standardized Terminology/Ontology Representation • XML is the most likely candidate • Ideally would support • Links to external sources • Relationships between different levels of classification • Update model • Description Logic Metadata • Standardized Thesaurus Representation • XML Repository • Standard Object Representations
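As a rough illustration of what an XML representation of a concept might carry, the sketch below serializes a concept with its preferred name, source terms, and relationships using Python's standard `xml.etree.ElementTree`. The element and attribute names are invented for this example and do not correspond to any published schema.

```python
import xml.etree.ElementTree as ET

def concept_to_xml(concept_id, preferred, terms, relationships):
    """terms: list of (source, string); relationships: list of (type, target id).
    Returns the concept as an XML string (hypothetical element names)."""
    root = ET.Element("concept", id=concept_id)
    ET.SubElement(root, "preferredName").text = preferred
    for source, string in terms:
        ET.SubElement(root, "term", source=source).text = string
    for rel_type, target in relationships:
        ET.SubElement(root, "relationship", type=rel_type, target=target)
    return ET.tostring(root, encoding="unicode")
```

Links to external sources and update metadata, both listed above as desirable, could be added as further attributes or child elements in the same style.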
Conclusion: Lessons Learned • Use the Web • Use current technology • Use Description Logic where appropriate • Make editing intuitive • Automate tasks • “A well-understood, reproducible, automated process that succeeds 95% of the time is a vast improvement over a poorly-understood, labor-intensive process that is believed to succeed 100% of the time.” • Review UNSAFE automated tasks. • Stop automating when marginal utility falls below a threshold.