Searching for Credible Relations in Machine Learning
Doctoral Dissertation
Vedrana Vidulin
Supervisor: prof. dr. Matjaž Gams
Co-supervisor: prof. dr. Bogdan Filipič
Ljubljana, 3 February 2012
Introduction
• Task: domain analysis of complex domains
• Problem:
  • When data mining (DM) methods construct models on complex domains, the models often contain parts (relations) that are less credible from the perspective of a human analyst.
  • Less credible parts can:
    • Lead to wrong conclusions about the most important relations in the domain
    • Undermine the user’s trust in DM methods (Stumpf et al., 2009)
• Proposed solution: a new method that algorithmically combines human understanding and raw computer power to extract credible relations, that is, relations supported by data and meaningful to the human.
An Example
• A decision-tree model is constructed:
  • With the J48 algorithm in Weka
  • From a data set that represents the impact of the R&D sector on the economic welfare of a country
• Data set: 167 examples (countries); 37 attributes (R&D sector); class: economic welfare
Outline
• Definition of credible relation
• Human-Machine Data Mining (HMDM) method
• Experimental evaluation
• Conclusions and contributions
Credible Relation
• Relation: a pattern that connects a set of attributes, which describe the properties of a concept underlying the data, with a class/target attribute that represents the concept.
• Credible relation: a relation of great meaning and high quality:
  • Meaning: a subjective criterion assigned by the human based on common sense, informal knowledge about the domain, and the observed frequency and stability of the relation.
  • Quality: an objective criterion that indicates support by the selected quality measures.
• Credible model: a model composed only of credible relations.
How to Establish Credible Relations?
• The relation is composed of attributes A1 and A2.
• Re-examine the relation’s credibility by:
  • Removing attributes A1 and A2 from the data set
  • Adding attributes A1 and A2
• If the relation is supported by evidence, add it to the list of candidates for credible relations.
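The re-examination step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `evaluate` is a hypothetical callback that would train a model (for example a J48 tree) on a given attribute subset and return a quality score, `start_attrs` is an assumed starting subset for the adding test, and requiring both tests to show a positive change is an assumption about what "supported by evidence" means. The toy scores are invented for the demonstration.

```python
from typing import Callable, FrozenSet, Iterable

# Hypothetical callback: trains a model on the given attribute subset and
# returns a quality score such as accuracy in percent.
Evaluate = Callable[[FrozenSet[str]], float]


def remove_effect(evaluate: Evaluate, all_attrs: Iterable[str],
                  relation: Iterable[str]) -> float:
    """Quality lost when the relation's attributes are removed from the data set."""
    full = frozenset(all_attrs)
    return evaluate(full) - evaluate(full - frozenset(relation))


def add_effect(evaluate: Evaluate, start_attrs: Iterable[str],
               relation: Iterable[str]) -> float:
    """Quality gained when the relation's attributes are added to a starting subset."""
    start = frozenset(start_attrs) - frozenset(relation)
    return evaluate(start | frozenset(relation)) - evaluate(start)


def is_candidate(evaluate: Evaluate, all_attrs: Iterable[str],
                 start_attrs: Iterable[str], relation: Iterable[str],
                 min_change: float = 0.0) -> bool:
    """Assumption: a relation is a candidate if both tests change quality by more than min_change."""
    return (remove_effect(evaluate, all_attrs, relation) > min_change
            and add_effect(evaluate, start_attrs, relation) > min_change)


def toy_evaluate(attrs: FrozenSet[str]) -> float:
    """Invented scores: A1 and A2 help only together, A3 helps a little, A4 not at all."""
    score = 50.0
    if {"A1", "A2"} <= attrs:
        score += 20.0
    if "A3" in attrs:
        score += 5.0
    return score


ALL_ATTRS = ["A1", "A2", "A3", "A4"]
print(is_candidate(toy_evaluate, ALL_ATTRS, ["A3"], ["A1", "A2"]))
print(is_candidate(toy_evaluate, ALL_ATTRS, ["A3"], ["A4"]))
```

In practice the human analyst would still judge the meaning of each candidate; the sketch covers only the data-driven part of the check.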
The HMDM Algorithm
Repeat until no new interesting relations are found:
• Create several models (e.g., trees)
• Choose the most interesting models
• For each interesting model, examine the credibility of the relations in the model by adding and removing attributes from the data set
• Merge candidate relations with the output list of credible relations
The HMDM Algorithm (2)
HMDM (data set)
  REPEAT
    Select DM method
    Select parameters and their ranges, define constraints
    Perform INITIAL_DM, creating a list of models LM
    FOR each interesting model M from LM, reexamine M:
      REPEAT
        Perform any of the following: { ADD_ATTRIBUTES, REMOVE_ATTRIBUTES, Expand credibility indicator }
        Evaluate the results with several quality measures and for meaning
      UNTIL no more interesting relations are found in the search space near the initial model
      Store credible relations and integrate conclusions
    END FOR
  UNTIL no more new interesting relations are found anywhere in the data set
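The control flow of the pseudocode can be sketched as a small Python skeleton. It is a simplified illustration, not the dissertation's implementation: the three callbacks (`build_models`, `choose_interesting`, `reexamine`) are hypothetical placeholders for INITIAL_DM, the human's choice of interesting models, and the inner re-examination loop. A model is reduced here to the list of relations it contains, and `max_rounds` is an added safety limit.

```python
from typing import Callable, FrozenSet, Iterable, List, Set

Relation = FrozenSet[str]
Model = List[Relation]  # assumption: a model is reduced to the relations it contains


def hmdm(build_models: Callable[[int], List[Model]],
         choose_interesting: Callable[[List[Model]], Iterable[Model]],
         reexamine: Callable[[Model], Iterable[Relation]],
         max_rounds: int = 10) -> Set[Relation]:
    """Outer HMDM loop: stop when a round yields no new credible relation."""
    credible: Set[Relation] = set()
    for round_index in range(max_rounds):
        models = build_models(round_index)          # stands in for INITIAL_DM
        found_new = False
        for model in choose_interesting(models):    # human picks interesting models
            for relation in reexamine(model):       # add/remove attributes, evaluate
                if relation not in credible:
                    credible.add(relation)          # merge with the output list
                    found_new = True
        if not found_new:
            break
    return credible


# Toy stand-ins, invented for the demonstration.
SUPPORTED = {frozenset({"A1", "A2"}), frozenset({"A3"})}
ROUNDS = [
    [[frozenset({"A1", "A2"}), frozenset({"A7"})]],
    [[frozenset({"A3"})], [frozenset({"A1", "A2"})]],
]


def toy_build(round_index: int) -> List[Model]:
    return ROUNDS[round_index] if round_index < len(ROUNDS) else []


def toy_choose(models: List[Model]) -> Iterable[Model]:
    return models  # treat every model as interesting


def toy_reexamine(model: Model) -> Iterable[Relation]:
    return [relation for relation in model if relation in SUPPORTED]


result = hmdm(toy_build, toy_choose, toy_reexamine)
print(sorted(sorted(relation) for relation in result))
```

Keeping the human decisions behind callbacks makes the interactive steps explicit: the loop itself only manages the list of credible relations and the stopping condition.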
HMDM: ADD_ATTRIBUTES
[Figure: J48 tree models, their quality (accuracy, %), and the resulting candidates for credible relations, e.g. A1 & A2 (combination).]
HMDM: REMOVE_ATTRIBUTES
[Figure: J48 tree models, their quality (accuracy, %), and the resulting candidates for credible relations, e.g. A1 || A3 (redundancy).]
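One plausible reading of the two relation types named on these slides can be sketched as follows. The definitions are assumptions, not taken from the dissertation: a combination (A1 & A2) is read as two attributes that improve quality more together than either does alone, and a redundancy (A1 || A3) as two attributes where removing either one leaves quality roughly unchanged while removing both reduces it. The `evaluate` callback, the `tolerance` parameter, and the toy accuracies are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable

# Hypothetical callback: quality (e.g. accuracy in %) of a model built on a subset.
Evaluate = Callable[[FrozenSet[str]], float]


def looks_like_combination(evaluate: Evaluate, base: Iterable[str],
                           a: str, b: str) -> bool:
    """Assumed test: adding both attributes beats adding either one alone."""
    start = frozenset(base) - {a, b}
    both = evaluate(start | {a, b})
    return both > max(evaluate(start | {a}), evaluate(start | {b}))


def looks_redundant(evaluate: Evaluate, all_attrs: Iterable[str],
                    a: str, b: str, tolerance: float = 0.5) -> bool:
    """Assumed test: either attribute can be removed alone, but not both."""
    full = frozenset(all_attrs) | {a, b}
    q_full = evaluate(full)
    drop_a = q_full - evaluate(full - {a})
    drop_b = q_full - evaluate(full - {b})
    drop_both = q_full - evaluate(full - {a, b})
    return drop_a <= tolerance and drop_b <= tolerance and drop_both > tolerance


def toy_accuracy(attrs: FrozenSet[str]) -> float:
    """Invented scores: A1 and A3 are interchangeable; A2 helps only with one of them."""
    accuracy = 50.0
    has_a1_or_a3 = "A1" in attrs or "A3" in attrs
    if has_a1_or_a3:
        accuracy += 10.0
    if has_a1_or_a3 and "A2" in attrs:
        accuracy += 20.0
    return accuracy


print(looks_like_combination(toy_accuracy, [], "A1", "A2"))
print(looks_redundant(toy_accuracy, ["A1", "A2", "A3"], "A1", "A3"))
```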
Type-Credibility Scheme
• Three levels of credibility:
  • Frequent and stable relations
    • Often appear in models
    • When added, improve quality
    • When removed, reduce quality
  • Frequent and less stable relations
    • Often appear in models
    • When added, sometimes improve quality and sometimes not
    • When removed, sometimes reduce quality and sometimes not
  • Not supported by evidence
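The three levels can be expressed as a small decision function. This is a sketch under assumptions: the slide does not say how often "often" is, so the `min_frequency` threshold is hypothetical, as is the convention that positive values in `add_changes` mean improvement and negative values in `remove_changes` mean reduction.

```python
from typing import Sequence


def credibility_level(appearances: int, n_models: int,
                      add_changes: Sequence[float],
                      remove_changes: Sequence[float],
                      min_frequency: float = 0.5) -> int:
    """Return 1 (frequent and stable), 2 (frequent and less stable) or 3 (not supported).

    add_changes: quality change observed each time the relation was added.
    remove_changes: quality change observed each time the relation was removed.
    min_frequency: assumed share of models a relation must appear in to count as frequent.
    """
    if n_models <= 0 or not add_changes or not remove_changes:
        return 3  # no evidence collected
    frequent = appearances / n_models >= min_frequency
    improved = [change > 0 for change in add_changes]
    reduced = [change < 0 for change in remove_changes]
    if not frequent:
        return 3
    if all(improved) and all(reduced):
        return 1  # always helps when added, always hurts when removed
    if any(improved) and any(reduced):
        return 2  # helps or hurts only some of the time
    return 3


print(credibility_level(8, 10, [2.0, 1.5], [-3.0, -1.0]))
print(credibility_level(8, 10, [2.0, -0.5], [-3.0, 0.0]))
print(credibility_level(1, 10, [2.0], [-3.0]))
```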
Quality Measures
• The decision trees are evaluated according to:
  • Accuracy
  • Corrected class probability estimate (CCPE)
  • Kappa
• The regression trees are evaluated according to:
  • Correlation coefficient
  • Relative absolute accuracy (RAA)
• In addition, trees are evaluated according to the total change in quality caused by adding and removing attributes.
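Two of the listed measures, accuracy and kappa, have standard definitions and can be computed from actual and predicted class labels as sketched below. The sketch assumes that "Kappa" refers to Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. CCPE and RAA are left out because the slides do not give their definitions. The labels in the example are invented.

```python
from collections import Counter
from typing import Sequence


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of examples whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def cohen_kappa(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Agreement between actual and predicted labels, corrected for chance."""
    n = len(actual)
    p_observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    p_expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # undefined when only one class occurs in both lists
    return (p_observed - p_expected) / (1.0 - p_expected)


actual = ["high", "high", "low", "low"]
predicted = ["high", "low", "low", "low"]
print(accuracy(actual, predicted))     # 0.75
print(cohen_kappa(actual, predicted))  # 0.5
```

Kappa is lower than accuracy here because half of the agreement would be expected by chance given the class frequencies.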
Experimental Evaluation
• Performed on three domains:
  • Research and development (R&D)
  • Higher education
  • Automatic web genre identification
R&D Domain: Remove Attributes Graph
[Figure: removed-attributes graph for the R&D domain; labels: GERD-PC || GERD-GDP, RES-HC || RES-FTE, APP-NON-RES.]
Domains
• Higher education
  • Goal: an analysis of the impact of the higher education sector on the economic welfare of a country
  • DM methods: J48 and M5P trees
  • Data: 60 attributes; 167 examples (countries); class: GNI per capita
• Automatic web genre identification
  • Goal: improve predictive performance by eliminating less credible relations from J48 decision-tree models
  • Data: 500 attributes (words); 1,539 examples (web pages); class: 20 genres
R&D and Higher Education Domains: Credible Relations
R&D
• First level: increase the level of investment in the R&D sector
• Second level:
  • Increase the number of patents
  • Increase the number of researchers
  • Develop the business enterprise sector as the key leader in R&D activities
Higher education
• First level: stimulate participation in higher education and improve student exchange programs
• Second level:
  • Increase the level of investment in all levels of education (“low”)
  • Increase the number of graduates in science programs (“middle”)
  • Attract more foreign students (“middle”)
Evaluation
• User study with 22 participants:
  • 64% of participants did not recognize less credible relations in a single model
  • When presented with credible models, all participants accepted them as better
Conclusions
• A novel method, Human-Machine Data Mining (HMDM), was designed that combines human understanding and raw computer power to extract credible relations from data.
• The HMDM method was evaluated on three complex domains, showing that:
  • The method is able to find important relations in data
  • Credible models are better in quality than the models constructed by automatic DM methods
  • Humans accept credible models
Contributions
• The main contributions:
  • A new method, Human-Machine Data Mining (HMDM), was designed for extracting credible relations from data
  • The CCPE statistical measure, originally conceived for classification rules, was extended to decision trees
  • Interactive explanation structures, in the form of added-attributes and removed-attributes graphs, were designed to facilitate the extraction of credible relations
• Additional contributions:
  • A computer program was developed to support the HMDM method
  • Three real-life domains were analyzed