
Supporting Creativity in Science: Cooperative Knowledge Acquisition & Knowledge Refinement Systems
Derek Sleeman
Department of Computing Science, The University, ABERDEEN AB24 3FX
Tel: +44 (0)1224 272296
Email: d.sleeman@abdn.ac.uk
WWW: http://www.csd.abdn.ac.uk
Acknowledgements: EPSRC support for the AKT Consortium
Students: Eugenio Alberdi, David Corsar, Andy Aiken, Mark Winter

OVERVIEW of TALK
I: Context: Advanced Knowledge Technologies (AKT) Consortium
II: Co-operative Knowledge Acquisition & Knowledge Refinement Systems
III: ReTAX system
IV: The REFINER++ System
Questions / Discussion

I: AKT’s CHALLENGES
• Knowledge Acquisition
• Knowledge Maintenance
• Knowledge Modelling
• Life Cycle, Integration Issues & Testbeds
• Knowledge Reuse
• Knowledge Publishing
• Knowledge Retrieval

II: Co-operative KA & Knowledge Refinement Systems
• Knowledge-based systems inevitably require a sizeable amount of domain knowledge. This can be acquired from:
  • domain experts (KA)
  • detailed examples (using ML techniques), etc.
• However, for complex tasks these KBs are inevitably:
  • incomplete, when further knowledge acquisition is needed;
  • inconsistent, when the KB needs to be refined.
• It is also likely that background knowledge will be incomplete, thus requiring an expert to act as an oracle.
• Hence the need for Co-operative (Problem Solving) Knowledge Acquisition & Knowledge Refinement Systems

II: Co-operative KA & Knowledge Refinement Systems
• KRUST (Classical KB; Classification) (Susan Craw)
• STALKER (Efficient Truth Maintenance based system; Classification) (Leo Carbonara)
• REFINER / Refiner++ / R5 (Case-base; Classification) (Sunil Sharma; Mark Winter; Andy Aiken)
• RETAX (Revision of Taxonomies) (Eugenio Alberdi; David Corsar)
• CRIMSON (Refinement of Constraints) (Mark Winter)
• TIGON (Time Series Data / Causal Model; Diagnosis) (Fraser Mitchell)
• SALT+ (Rules & Constraints; Propose & Revise) (Piero Leo)
References: see WWW: http://www.csd.abdn.ac.uk

II: Co-operative KA & Knowledge Refinement Systems
System: application domain
• KRUST & STALKER: Wine Adviser
• REFINER+: Attendance at Medical Clinics & Stock control
• CRIMSON/ConRef: Stock control
• RETAX: Botanical Taxonomies
• TIGON: Turbines (Fault Detection & Diagnosis)
• SALT+: Elevators/Lifts
References: see WWW: http://www.csd.abdn.ac.uk

III: RETAX+
• The heuristics in RETAX are based on a study of how botanists reacted to one or more rogue items.
• There are 2 (principal) rules which determine whether a taxonomy is well formed:
  • each child node must be more specialized than its parent
  • each of a node’s siblings must be unique.
• Retax was used to replicate the revision of a major botanical taxonomy done “manually” in Aberdeen’s Botany dept in the 90s.
• References:
  • Middleton & Wilcox (1990) Edinburgh Journal of Botany {revision of taxonomy for Pernettya / Gaultheria}
  • Alberdi & Sleeman (1997) AI Journal, pp. 257-279.
  • Alberdi, Sleeman & Korpi (1999) Cognitive Science Journal

[Figure: example taxonomy. Root: Vehicle. Second level: Train, Car, Cycle, Lorry. Lower-level nodes: Sports Car, Saloon Car, Bicycle, Motorbike, Large Lorry, Small Van, Smaller Van.]

RETAX+
Let’s refer to a new object/node as N, the existing hierarchy/tree as T, and the potential parent node as P. Then possible operations are:
• Is T well formed? (If not, report the nodes which violate the rules.)
  • E.g., if sibling nodes N1 & N2 are equal, then merge the 2 nodes.
• Is N already in T?
• Assuming T is well formed, to which parent node, P, can N be attached without causing T to be rearranged or N modified? (The answer could be none.)
• What changes have to be made to N to make it a “legal” child of node P?
• What changes have to be made to T so that N can be a child of P?
• Combinations of the last 2 operations
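
The two well-formedness rules and the question "to which parent P can N be attached?" can be sketched in Python. This is a minimal illustration and not the ReTAX implementation: it assumes each node is described by a set of (attribute, value) pairs and reads "more specialized" as "has every feature of the parent plus at least one more". The `Node` class, the function names and the vehicle features are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A taxonomy node; `features` is an assumed set of (attribute, value) pairs."""
    name: str
    features: frozenset
    children: list = field(default_factory=list)


def violations(node):
    """Return (rule, node names) pairs for every rule violation at or below `node`."""
    found = []
    for child in node.children:
        # Rule 1 (assumed encoding): the child's features strictly extend the parent's.
        if not child.features > node.features:
            found.append(("not-more-specialised", (node.name, child.name)))
    for i, a in enumerate(node.children):
        for b in node.children[i + 1:]:
            # Rule 2: no two siblings may have identical descriptions.
            if a.features == b.features:
                found.append(("duplicate-siblings", (a.name, b.name)))
    for child in node.children:
        found.extend(violations(child))
    return found


def legal_parents(root, new):
    """Names of nodes P under which `new` fits without changing the tree or `new`."""
    result = []
    stack = [root]
    while stack:
        p = stack.pop()
        more_specialised = new.features > p.features
        unique = all(new.features != c.features for c in p.children)
        if more_specialised and unique:
            result.append(p.name)
        stack.extend(p.children)
    return sorted(result)


# Hypothetical feature sets for part of the Vehicle example.
vehicle = Node("Vehicle", frozenset({("moves", "yes")}))
car = Node("Car", frozenset({("moves", "yes"), ("wheels", "4")}))
cycle = Node("Cycle", frozenset({("moves", "yes"), ("wheels", "2")}))
bicycle = Node("Bicycle", cycle.features | {("engine", "no")})
motorbike = Node("Motorbike", cycle.features | {("engine", "yes")})
vehicle.children = [car, cycle]
cycle.children = [bicycle, motorbike]

tandem = Node("Tandem", bicycle.features | {("seats", "2")})
print(violations(vehicle))
print(legal_parents(vehicle, tandem))
```

In this example Tandem is accepted under Bicycle, Cycle and Vehicle; a real tool would probably prefer the deepest candidate, which is an assumption here rather than something stated in the talk.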

ReTAX
• Family: Ericaceae
• Genera: Arctostaphylos, Arbutus, Pernettya, Leucothoe, Gaultheria, Agauria, Andromeda
• Species: A. uva-ursi, A. unedo, P. tasmanica, G. oppositifolia, G. rupestris, G. antipoda, A. polifolia

ReTAX
• Historical: In Bentham & Hooker’s (1876*) classification the main differences detected between the Pernettya & Gaultheria genera were the type of fruit and the succulence of the calyx.
  • *G Bentham & JD Hooker (1876). Genera Plantarum, Vol II, Part 2. (Publ: Reeves & Co, London)
• Subsequent botanical investigations in the 20th century challenged this analysis but did not suggest any further distinguishing features for the 2 genera; hence the 2 genera were combined (Middleton & Wilcox, 1990).

ReTAX
• Simulation (Simplified)
  • The descriptions of several species of the Pernettya & Gaultheria genera were replaced by others with revised features (descriptors), which affect the definitions of the parent nodes (P + G).
  • When the parent nodes (Pernettya & Gaultheria) are found to be the same, the system checks a set of other features (a further facility of ReTAX) to see if they are distinctive; when no differences are found, the 2 nodes (P + G) are collapsed.

RETAX+: Current / Future activities
• Use with other experts to help them formulate / refine taxonomies (eg other aspects of botany, microbiology)
• Use RETAX+, or a variant, to formulate / refine ontologies (eg medical terminologies). This has resulted in the Protégé RepairTAB, which detects inconsistencies in OWL ontologies & gives advice about removing them. (Lam, Sleeman, Pan, & Vasconcelos (2008) Journal of Data Semantics)

IV: REFINER++ System
• The Refiner++ algorithm
• Sample dataset
• Interaction with experts
• Current / future work

The Sample Dataset

The Refiner++ Algorithm
• Each case is assigned to a category
• Category descriptions are inferred from the case values
• When a case matches a category to which the expert did not assign it, this is an inconsistency
• While inconsistencies exist…
  • A selection of disambiguation strategies is suggested
  • The user chooses a strategy to be performed
  • The list of inconsistencies is re-evaluated
• The refined dataset is now consistent

Generating Descriptions
• Generalise each field
  • Numeric: range from lowest to highest
  • String: set of all unique items
  • Taxon: nearest common parent
  • Boolean: set of all unique items from the set {‘true’, ‘false’, ‘any’}
• Combine to get category description
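
The per-field generalisation rules can be illustrated with a short Python sketch. It is not the Refiner++ code: the function names, the `schema` and `parent_of` structures, and the example cases are hypothetical, and the taxon rule assumes a single-rooted taxonomy stored as a child-to-parent mapping.

```python
def ancestors(node, parent_of):
    """Path from `node` up to the root, starting with `node` itself."""
    path = [node]
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path


def nearest_common_parent(nodes, parent_of):
    """Most specific taxon that covers every node (assumes one shared root)."""
    nodes = list(nodes)
    common = ancestors(nodes[0], parent_of)
    for other in nodes[1:]:
        reachable = set(ancestors(other, parent_of))
        common = [a for a in common if a in reachable]
    return common[0]


def generalise(values, kind, parent_of=None):
    """Generalise the values of one field across the cases of a category."""
    if kind == "numeric":
        return (min(values), max(values))      # range from lowest to highest
    if kind in ("string", "boolean"):
        return set(values)                     # set of all unique items
    if kind == "taxon":
        return nearest_common_parent(values, parent_of)
    raise ValueError(f"unknown field kind: {kind}")


def describe_category(cases, schema, parent_of=None):
    """Combine the generalised fields into one category description."""
    return {
        name: generalise([case[name] for case in cases], kind, parent_of)
        for name, kind in schema.items()
    }


# Hypothetical schema, taxonomy and cases (not the dataset shown in the talk).
schema = {"DBP": "numeric", "Sex": "string", "Smoker": "boolean", "Disease": "taxon"}
parent_of = {"D1": "Cardiac", "D2": "Cardiac", "D3": "Renal",
             "Cardiac": "AnyDisease", "Renal": "AnyDisease"}
cases = [
    {"DBP": 80, "Sex": "F", "Smoker": "false", "Disease": "D1"},
    {"DBP": 95, "Sex": "M", "Smoker": "any", "Disease": "D2"},
    {"DBP": 88, "Sex": "F", "Smoker": "false", "Disease": "D1"},
]
description = describe_category(cases, schema, parent_of)
print(description)
```

For these three cases the description has the DBP range 80 to 95, both values of Sex, the Smoker values ‘false’ and ‘any’, and the taxon Cardiac.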

Category Descriptions
• There are inconsistencies:
  • Cases 4 and 5 match A
  • Case 7 matches B
• We need to remove the overlap
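
As a rough illustration of how such inconsistencies can be detected, the Python sketch below infers a description for each category and then reports every case that matches a category it was not assigned to. The dataset is invented so that it reproduces the same pattern (cases 4 and 5 match A, case 7 matches B); it is not the sample dataset from the talk, and the function names and case layout are hypothetical. Only numeric and string fields are handled.

```python
def describe(cases):
    """Generalise numeric fields to (low, high) ranges and other fields to value sets."""
    description = {}
    for name in cases[0]["fields"]:
        values = [c["fields"][name] for c in cases]
        if all(isinstance(v, (int, float)) for v in values):
            description[name] = (min(values), max(values))
        else:
            description[name] = set(values)
    return description


def matches(case, description):
    """True when every field value of `case` falls inside the description."""
    for name, allowed in description.items():
        value = case["fields"][name]
        if isinstance(allowed, tuple):
            low, high = allowed
            if not low <= value <= high:
                return False
        elif value not in allowed:
            return False
    return True


def inconsistencies(cases):
    """Sorted (case id, category) pairs where a case matches a category it was not assigned."""
    by_category = {}
    for c in cases:
        by_category.setdefault(c["category"], []).append(c)
    descriptions = {cat: describe(members) for cat, members in by_category.items()}
    return sorted(
        (c["id"], cat)
        for c in cases
        for cat, desc in descriptions.items()
        if cat != c["category"] and matches(c, desc)
    )


# Invented data: category A covers DBP 70-95, category B covers DBP 90-110.
cases = [
    {"id": 1, "category": "A", "fields": {"DBP": 70, "Disease": "D1"}},
    {"id": 2, "category": "A", "fields": {"DBP": 80, "Disease": "D1"}},
    {"id": 3, "category": "A", "fields": {"DBP": 85, "Disease": "D2"}},
    {"id": 4, "category": "B", "fields": {"DBP": 90, "Disease": "D2"}},
    {"id": 5, "category": "B", "fields": {"DBP": 92, "Disease": "D2"}},
    {"id": 6, "category": "B", "fields": {"DBP": 110, "Disease": "D3"}},
    {"id": 7, "category": "A", "fields": {"DBP": 95, "Disease": "D2"}},
]
print(inconsistencies(cases))
```

The overlap arises because the two DBP ranges intersect between 90 and 95 and both categories admit D2, which is why changing a value or reclassifying a case can remove it.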

Disambiguation Strategies
• Change values for certain cases
• Remove values from a category (eg, create a disjunction)
• Reclassify a case
• Make a case match an additional category
• Shelve a problem case
• Add a new field

[Figure: Refiner++, showing categories C1, C2 and C3]

Strategies for this problem
• Change value of DBP in case 7 to 90
• Change value of DBP in case 5 to 95
• Reclassify case 7 to category B
• Add case 7 to category B
• Shelve case 7
• Change value of Disease in cases 3 and 7 to D3
• Reclassify cases 4 and 5 to category A
• Add cases 4 and 5 to category A
• Shelve cases 4 and 5
• Add a new field

Strategy Ordering
• Typically, many strategies are suggested
• We need heuristics to order them:
  • Ordered by number of times suggested; prefer strategies which are suggested many times
  • Ordered by number of cases affected; prefer strategies which affect fewer cases
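
One way the two ordering heuristics could be combined is a single sort key. The sketch below is an assumption, not the documented Refiner++ behaviour: it ranks first by how often a strategy was suggested (descending), then by how many cases it affects (ascending), and finally by label so that the order is deterministic. The function name, the tuple layout and the suggestion counts are invented for illustration.

```python
from collections import Counter


def order_strategies(suggestions):
    """Rank strategy labels.

    `suggestions` holds one (label, frozenset of affected case ids) entry
    for each time a strategy was suggested.
    """
    counts = Counter(suggestions)
    ranked = sorted(
        counts,
        # More suggestions first, then fewer affected cases, then label.
        key=lambda s: (-counts[s], len(s[1]), s[0]),
    )
    return [label for label, _cases in ranked]


# Hypothetical suggestion log; the repeat counts are invented.
suggestions = [
    ("reclassify cases 4 and 5 to A", frozenset({4, 5})),
    ("change DBP in case 7 to 90", frozenset({7})),
    ("shelve case 7", frozenset({7})),
    ("reclassify cases 4 and 5 to A", frozenset({4, 5})),
    ("shelve cases 4 and 5", frozenset({4, 5})),
    ("change DBP in case 7 to 90", frozenset({7})),
]
for label in order_strategies(suggestions):
    print(label)
```

Treating the suggestion count as the primary key is a design choice; swapping the first two key components would favour low-impact strategies instead.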

The Refiner++ Main Screen

Scalability
• Measured the time taken to perform validation on randomly-generated datasets with varying numbers of cases, fields and categories
• For most datasets, the time taken is under 1 second

Use of REFINER++ by Experts*
• Refiner++ has been used with various experts including:
  • Pain Control Expert (Anaesthesiology)
  • Child psychologist
  • High Dependency Unit (HDU) Physician
* KCAP-2003 paper (Aiken & Sleeman)

Pain Control
• Pre-existing Access dataset on epidural patients
• Many cases, lots of fields / descriptors
• Refiner++ imported the data (almost) perfectly
• Expert categorised cases based on the length of the epidural (in days)
• REFINER++ took only a few seconds to create category descriptions and validate
• But…

Pain Control
• Hundreds of inconsistencies found
• Hundreds of strategies suggested
• Almost all of which were ‘change value’
• Why did it not work better?
  • Subjective nature of the subject domain
  • Categories were contiguous

Child Psychology
• The session was a series of anecdotes and outlines of specific cases
• Three types of cases were identified:
  • Severely autistic
  • Mildly autistic
  • Difficulties with language development

Child Psychology
• The expert stated that autistic children usually had the following characteristics:
  • Problems with language and verbal communication
  • Problems with social interaction
  • Obsessive behaviour
• These characteristics were abstracted by the knowledge engineers and subsequently confirmed with the expert
• The expert showed no inclination to use REFINER++, but a case set was created by the knowledge engineers

HDU
• Task posed by domain expert: when to move high dependency unit (HDU) patients to a general ward or to the intensive care unit (ICU), or leave them in the HDU.
• Used Refiner++ with three datasets, one for each condition (cardiac, neuro & respiratory)
• Expert did not use the system but did dictate the descriptors & the sets of cases to the knowledge engineers, who typed this information into REFINER.
• Refiner++ found 2 categories were consistent & identified inconsistencies in the third

Inconsistent Dataset

Category Descriptions
• There are inconsistencies:
  • Case 1 matches Category SAME
  • Case 4 matches Category HIGHER
• We need to remove the overlap
• Refiner++ suggested lower and upper ‘danger zones’ for each field

Future Work: Use with Domain Experts
• Make the system’s GUI more intuitive (some changes already made)
• Ask the expert to come to the session with a document which summarizes the main features of the dataset they wish to discuss. (In the session, ask them to highlight the principal concepts.)
• For each domain expert contacted, record an AVI session of a simple but related domain (eg simple childhood diseases before approaching a paediatrician) (demo)

Current Work (ICU domain)
• Developed a system which is statistically based, so given a case description it returns the likelihood of that case belonging to one of the predefined categories (R5: Andy Aiken)
• Acquired a data set of patients’ physiological parameters from an ICU DB, and have clinicians assign patients, on a day-by-day & hour-by-hour basis, to a 5-point severity score. (Developed in conjunction with Glasgow Royal Infirmary)
• Using R5 with the above data set to assign new patient reports to a severity class. (Practically important as the descriptors include clinical interventions, which “standard” scales don’t.)
• Identify & analyse (explain) anomalous / unusual cases (segments of cases)

VI: Dimensional Analysis ??
• Outline issue
• Pointer to TR
• Pointer to WWW systems / sources

Questions/Comments

V: (Causal) Explanations for Anomalous Medical Cases
• Discuss ICU context
• Experiment to detect anomalous cases / sections of cases
• Outline a typical investigation

V: Seeking to Explain an Anomalous Observation
• EXPECTED: An injection of X will cause the heart (Organ, O) to increase its contraction rate within T seconds.
• SUPPOSE that does not happen; then here are some of the investigations which might be performed:
  • Is the injection being given effectively?
  • IF so, then check whether the drug X is being transported to Organ O:
    • Is the transport path physically / bio-chemically blocked?
    • Is the transport mechanism inhibited or slowed down?
  • IF the drug is actually arriving at Organ O & the concentration is OK, then investigate:
    • Is the drug mechanism within the organ being blocked?
    • Is the organ for some reason unable to respond in the usual way (eg weakened heart muscle)?
