The lack of transparency of many state-of-the-art data mining techniques, such as support vector machines, artificial neural networks or random forests, renders them useless in any domain where comprehensibility is of importance. Rule extraction is a technique designed to remedy this: it extracts comprehensible rules that mimic the decisions made by the black-box model.


Abstract: Advances in data mining have led to algorithms that produce accurate regression models for large data sets that are difficult to approximate. Most of these algorithms use non-linear models to handle complex relationships in the input data. Their drawback is a lack of transparency, which is a requirement in many application domains where comprehensibility is of key importance. Rule-extraction algorithms have been proposed to solve this problem for classification by extracting comprehensible rulesets from the better-performing, complex models. We present a new rule extraction algorithm for regression, based on active learning and the pedagogical approach to rule extraction. Empirical results show that the proposed algorithm improves on classical rule induction techniques in both fidelity and accuracy.
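
The two evaluation notions named in this abstract can be made concrete with a small example. The following is a minimal sketch, assuming root mean squared error as the error measure (the abstract does not say which measure the paper uses). The function names are hypothetical. Accuracy compares the rule predictions with the true targets, while fidelity compares them with the predictions of the black-box model.

```python
import math

def rmse(predicted, reference):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

def accuracy_error(rule_preds, y_true):
    # Accuracy: how closely the extracted rules match the true targets.
    return rmse(rule_preds, y_true)

def fidelity_error(rule_preds, black_box_preds):
    # Fidelity: how closely the extracted rules mimic the black-box model.
    return rmse(rule_preds, black_box_preds)

# Toy values (assumed for illustration only).
y_true = [1.0, 2.0, 3.0]
black_box = [1.0, 2.0, 4.0]
rules = [1.0, 2.0, 4.0]

print("accuracy error:", accuracy_error(rules, y_true))
print("fidelity error:", fidelity_error(rules, black_box))
```

In this toy case the rules reproduce the black box exactly, so the fidelity error is zero even though both models miss one true target. The two measures can therefore diverge.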

Publication: available here.


Abstract: Many state-of-the-art data mining techniques introduce non-linearities in their models to cope effectively with complex relationships in the data. Although such techniques are consistently ranked among the top classification techniques in terms of predictive performance, their lack of transparency renders them useless in any domain where comprehensibility is of importance. Rule-extraction algorithms remedy this by distilling comprehensible rulesets from complex models, explaining how the classifications are made. The present article considers a new pedagogical rule extraction technique based on active learning (ALPA). The technique generates artificial data points around training data with low confidence in the output score, after which these points are labelled by the black-box model. The main novelty of the proposed method is that it uses a pedagogical approach without making any architectural assumptions about the underlying model. It can therefore be applied to any black-box technique. Furthermore, ALPA can generate any rule format, depending on the chosen underlying rule induction technique. We validated these claims in an empirical study using combinations of popular data mining algorithms (SVM, ANN, Random Forest, C4.5, Ripper). Our results show that not only do the ALPA-generated rules explain the black-box models well, but the proposed algorithm also performs substantially better than traditional rule induction techniques in terms of accuracy.
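
The loop described in this abstract (find low-confidence training points, generate artificial points around them, let the black box label them, then induce rules) can be sketched as follows. This is an illustrative sketch and not the published ALPA implementation. The black box, the uniform perturbation, the `fraction` and `radius` parameters, and the one-condition rule learner are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def black_box_score(x):
    """Stand-in black box (assumption): returns a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-6.0 * (x[0] - 0.5)))

def black_box_label(x):
    return 1 if black_box_score(x) >= 0.5 else 0

def generate_near_uncertain(points, n_new, radius, fraction, rng):
    """Create artificial points around the least confident training points."""
    # Confidence is taken as the distance of the score from 0.5.
    ranked = sorted(points, key=lambda x: abs(black_box_score(x) - 0.5))
    seeds = ranked[:max(1, int(fraction * len(ranked)))]
    new_points = []
    for _ in range(n_new):
        seed = rng.choice(seeds)
        new_points.append([v + rng.uniform(-radius, radius) for v in seed])
    return new_points

def learn_stump(points, labels):
    """Tiny rule learner: 'if x[f] > t then 1 else 0', chosen by agreement."""
    best = (-1, 0, 0.0)  # (agreement count, feature, threshold)
    for f in range(len(points[0])):
        for t in sorted({p[f] for p in points}):
            agree = sum((1 if p[f] > t else 0) == y
                        for p, y in zip(points, labels))
            if agree > best[0]:
                best = (agree, f, t)
    return best[1], best[2]

def fidelity(rule, points):
    """Fraction of points on which the rule agrees with the black box."""
    f, t = rule
    hits = sum((1 if p[f] > t else 0) == black_box_label(p) for p in points)
    return hits / len(points)

rng = random.Random(0)
train = [[rng.random(), rng.random()] for _ in range(50)]
extra = generate_near_uncertain(train, n_new=200, radius=0.1,
                                fraction=0.2, rng=rng)
data = train + extra
# Pedagogical approach: the black box is used purely as a labelling oracle.
labels = [black_box_label(p) for p in data]
rule = learn_stump(data, labels)
test_points = [[rng.random(), rng.random()] for _ in range(500)]
print("rule: if x[%d] > %.3f then 1 else 0" % rule)
print("fidelity on fresh points:", fidelity(rule, test_points))
```

Because the black box is only queried for scores and labels, any model could be substituted for `black_box_score`, and any rule learner could replace `learn_stump`.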

Publication: Under review.

Rule Evaluation for Metaheuristic-based Sequential Covering Algorithms

Abstract: While many papers propose innovative methods for constructing individual rules in separate-and-conquer rule learning algorithms, comparatively few study the heuristic rule evaluation functions used in these algorithms to ensure that the selected rules combine into a good rule set. Underestimating the impact of this component has led to suboptimal design choices in many algorithms. The main goal of this paper is to demonstrate the importance of heuristic rule evaluation functions by improving existing rule induction techniques and to provide guidelines for algorithm designers. We first select optimal heuristic rule evaluation functions for several metaheuristic-based algorithms and empirically compare the resulting heuristics across algorithms. This results in large and significant improvements in predictive accuracy for two techniques. We find that, despite the absence of a globally optimal choice for all algorithms, good default choices seem to exist for families of algorithms. A near-optimal selection can thus be found for new algorithms with minor experimental tuning. A major contribution is made towards balancing a model’s predictive accuracy with its comprehensibility, as the parametrized heuristics offer unmatched flexibility in setting the trade-off between accuracy and comprehensibility.
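
To illustrate how a parametrized heuristic can shift which rule a covering algorithm selects, here is a minimal sketch using the m-estimate, a well-known parametrized rule evaluation function. The abstract does not state which heuristics the paper studies, so the choice of the m-estimate, the candidate rules, and their coverage counts are assumptions made for illustration.

```python
def m_estimate(p, n, P, N, m):
    """m-estimate of a rule covering p positives and n negatives,
    on data containing P positives and N negatives in total."""
    if p + n + m == 0:
        return 0.0
    return (p + m * P / (P + N)) / (p + n + m)

# Two hypothetical candidate rules as (covered positives, covered negatives).
candidates = {"specific": (5, 0), "general": (40, 5)}
P, N = 50, 50

def best_rule(m):
    """Name of the candidate rule with the highest m-estimate."""
    return max(candidates,
               key=lambda name: m_estimate(*candidates[name], P, N, m))

for m in (0, 10):
    print("m =", m, "-> selected rule:", best_rule(m))
```

With `m = 0` the heuristic reduces to precision and prefers the pure but small rule. A larger `m` pulls each estimate towards the class prior and so favours rules with broad coverage, which tends to yield fewer rules overall. Tuning the parameter is one way to trade accuracy against rule set size.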

Publication: available here.


Abstract: Support vector machines (SVMs) are currently state-of-the-art for the classification task and, generally speaking, exhibit good predictive performance due to their ability to model nonlinearities. However, their strength is also their main weakness, as the generated nonlinear models are typically regarded as incomprehensible black-box models. In this research, we propose a new Active Learning-Based Approach (ALBA) to extract comprehensible rules from opaque SVM models. Through rule extraction, some insight is provided into the logic of the SVM model. ALBA extracts rules from the trained SVM model by explicitly making use of key concepts of the SVM: the support vectors, and the observation that these are typically close to the decision boundary. Active learning implies a focus on apparent problem areas, which for rule induction techniques are the regions close to the SVM decision boundary, where most of the noise is found. By generating extra data close to these support vectors and having the trained SVM model provide the class labels, rule induction techniques are better able to discover suitable discrimination rules. This performance increase, in terms of both predictive accuracy and comprehensibility, is confirmed in our experiments, where we apply ALBA to several publicly available data sets.

Publication: available here.