# LLpowershap: logistic loss-based automated Shapley values feature selection method | BMC Medical Research Methodology

Like powershap, *LLpowershap* builds on the idea that a known random feature should, on average, have a lower impact on predictions than an informative feature, but extends the idea in two ways. Firstly, we utilise loss-based Shapley (LogisticLossSHAP) values, as they consider the loss instead of the model prediction, which can make a difference when a model does not predict the output well. Moreover, we calculate LogisticLossSHAP values on unseen data instead of on training/validation sets. Below we demonstrate how LogisticLossSHAP values calculated on the test set help avoid reporting attributions that reflect the noise present in the training set rather than the true underlying patterns, especially when using highly flexible (low bias) machine learning algorithms such as gradient boosting.

Let \(\mathbf{x}_i\) be the feature vector for a specific training example *i*, with label \(y_i\). Assume that \(y_i\) is erroneously recorded. Let \(\mathbf{x}_k\) be the feature vector for test example *k*, which is identical to \(\mathbf{x}_i\) (that is, \(\mathbf{x}_k = \mathbf{x}_i\)) but has the true label \(y_k\) (that is, \(y_i \ne y_k\)). Let \(f(\mathbf{X})\) be the trained model and assume that it greatly overfits the training data, as shown in Fig. 1, illustrating that the generalisation performance of a model is related to its predictive capability on independent test data [18]. Further, assume that \(f(\mathbf{X})\) uses noise feature *j* to overfit training example *i*. The predictive Shapley value for feature *j*, which measures the contribution of feature *j* to the model's prediction for training example *i*, is

$$\begin{aligned} \phi_{j,\mathrm{predictive}}(\mathbf{x}_i) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left( f_i(S \cup \{j\}) - f_i(S) \right), \end{aligned}$$

(3)

where *N* is the set of all features and \(f_i\) denotes \(f(\mathbf{X})\) evaluated at example *i*. The logistic loss Shapley value, which incorporates the label \(y_i\) and measures the contribution of feature *j* in reducing the logistic loss for training example *i*, is

$$\begin{aligned} \phi_{j,\mathrm{logistic\_loss}}(\mathbf{x}_i) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left( L(f_i(S \cup \{j\}), y_i) - L(f_i(S), y_i) \right), \end{aligned}$$

(4)

where the logistic loss *L* is defined as \(L(f(\mathbf{x}), y) = -[y \log (f(\mathbf{x})) + (1 - y) \log (1 - f(\mathbf{x}))]\).
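As a concrete illustration of this definition, a minimal sketch of the loss (the clipping constant `eps` is our own implementation detail, not part of the definition):

```python
import numpy as np

def logistic_loss(p, y, eps=1e-15):
    """L(f(x), y) for predicted positive-class probability p and label y.
    Clipping keeps the logarithms finite at p = 0 or p = 1."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction incurs a small loss; the same prediction
# against the opposite label incurs a large one.
print(logistic_loss(0.9, 1), logistic_loss(0.9, 0))
```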

For the training example *i*, \(\phi_{j,\mathrm{predictive}}(\mathbf{x}_i)\) will be high because feature *j* contributes strongly to the model's prediction, and \(\phi_{j,\mathrm{logistic\_loss}}(\mathbf{x}_i)\) will also be high because feature *j* reduces the logistic loss for that example. For the identical test example *k*, the predictive Shapley value for feature *j* is

$$\begin{aligned} \phi_{j,\mathrm{predictive}}(\mathbf{x}_k) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left( f_k(S \cup \{j\}) - f_k(S) \right). \end{aligned}$$

(5)

Since \(\mathbf{x}_k = \mathbf{x}_i\), the predictive Shapley value \(\phi_{j,\mathrm{predictive}}(\mathbf{x}_k)\) will be high and equal to \(\phi_{j,\mathrm{predictive}}(\mathbf{x}_i)\), irrespective of the true label. Predictive Shapley values suffer in such overfitting scenarios, and hence so do BorutaShap, shapicant and powershap, which rely on them. For the test example *k* with the true label \(y_k\), the logistic loss Shapley value is

$$\begin{aligned} \phi_{j,\mathrm{logistic\_loss}}(\mathbf{x}_k) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left( L(f_k(S \cup \{j\}), y_k) - L(f_k(S), y_k) \right). \end{aligned}$$

(6)

Since \(y_k \ne y_i\) and the model overfits training example *i* using feature *j*, the logistic loss for test example *k* will reflect this discrepancy. Consequently, \(\phi_{j,\mathrm{logistic\_loss}}(\mathbf{x}_k)\) will be low, indicating either no contribution or a contribution in the opposite direction compared to the training example. Our observations show that LogisticLossSHAP values calculated on test sets are lower for noise features than those calculated on the training sets; that is, the strength of the noise features is further attenuated by calculating Shapley values on unseen data. On our simulation data, this attenuation over 100 iterations ranged roughly from 1.5 to 4 times, whereas we did not observe such attenuation for informative features.
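To make the contrast between Eqs. (3) and (4) concrete, the following sketch computes exact Shapley values by brute force for a hypothetical two-feature restricted model; the subset predictions in `pred` are invented for illustration:

```python
from itertools import combinations
from math import factorial, log

def shapley(v, n_features, j):
    """Exact Shapley value of feature j (the weighting of Eqs. 3-6) for a
    value function v mapping a frozenset of feature indices to a scalar."""
    others = [f for f in range(n_features) if f != j]
    phi = 0.0
    for size in range(n_features):
        w = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
        for subset in combinations(others, size):
            S = frozenset(subset)
            phi += w * (v(S | {j}) - v(S))
    return phi

def logistic_loss(p, y):
    return -(y * log(p) + (1 - y) * log(1 - p))

# Hypothetical restricted-model outputs f_i(S) for one example with two
# features; feature 0 pushes the predicted probability up, feature 1 is inert.
pred = {frozenset(): 0.5, frozenset({0}): 0.9,
        frozenset({1}): 0.5, frozenset({0, 1}): 0.9}

phi_pred = shapley(lambda S: pred[S], 2, 0)                       # Eq. 3
phi_loss_y1 = shapley(lambda S: logistic_loss(pred[S], 1), 2, 0)  # Eq. 4, label 1
phi_loss_y0 = shapley(lambda S: logistic_loss(pred[S], 0), 2, 0)  # Eq. 4, label 0

# The predictive value (0.4) is the same whatever the label, while the
# loss-based value is negative (loss-reducing) for y=1 and positive for y=0.
```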

Secondly, we observed that using a single noise feature results in a greater number of noise features in the output when LogisticLossSHAP values are used (for example, from 0 noise features to 15 noise features in the output). Substantiated by this, a new, more powerful set of LogisticLossSHAP values is created by taking, in each iteration, the maximum over noise features drawn from different standard distributions. As expected, statistical analyses on the simulation datasets show that this newly created set of LogisticLossSHAP values has higher influence than the individual noises. We insert five different noise features, drawn from the standard uniform, normal, exponential, logistic and Cauchy distributions. Using this set of values is somewhat similar to the strategy followed in BorutaShap; however, in BorutaShap a shadow feature is created for every input feature by randomly shuffling it, and the best performing shadow feature is selected. Creating shadow features for all input features can lead to time and space complexity issues when the training data are very large. In principle, it may also be possible to employ noise features of a single distribution type, such as the uniform distribution, in *LLpowershap*, but the current strategy is more aligned with the shadow feature strategy of BorutaShap, in that real features often exhibit specific distributions. For example, in large datasets, the income of participants often exhibits a log-normal characteristic, with frequencies decreasing as income levels go up, whereas measures such as height, weight and blood pressure usually follow a normal distribution.
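A minimal sketch of this step under our reading of the text: five noise columns are drawn from the named standard distributions, and per iteration only the noise feature with the highest mean influence is retained (the `shap_noise` values here are placeholders for the LogisticLossSHAP values the algorithm would compute):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of training examples (illustrative)

# Five noise columns, one per standard distribution named in the text.
noise = np.column_stack([
    rng.uniform(size=n),          # standard uniform
    rng.normal(size=n),           # standard normal
    rng.exponential(size=n),      # standard exponential
    rng.logistic(size=n),         # standard logistic
    rng.standard_cauchy(size=n),  # standard Cauchy
])

# Placeholder per-sample LogisticLossSHAP values for the five noise features
# in one iteration; the algorithm would obtain these from the trained model.
shap_noise = rng.normal(scale=1e-3, size=(n, 5))

# Keep only the noise feature with the highest mean influence this iteration.
best = np.argmax(shap_noise.mean(axis=0))
max_noise_values = shap_noise[:, best]
```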

Positive and negative Shapley values of a feature for prediction indicate the direction and magnitude of the feature's impact in reaching the predicted value from a global baseline value. Both high positive and high negative Shapley values are valuable for prediction. Unfortunately, tree-based models sometimes use noise features in creating decision trees. Also, tree explainers sometimes assign non-zero Shapley values to noise features [19], making it less attractive to use a zero-based threshold on mean absolute local Shapley values to select informative features. In the case of Shapley values calculated for explaining logistic loss, positive and negative values have different meanings: positive Shapley values increase the loss and negative values reduce it. We negate the Shapley values for easier understanding. Again, due to the issues mentioned above, a zero-based threshold is not ideal. We do not take the absolute value of LogisticLossSHAP values, as that would conflate loss-increasing and loss-reducing contributions.

In Algorithm 1, we show how Shapley values are calculated in *LLpowershap*, which is a modification of the algorithm proposed by Verhaeghe et al. [4]. The key differences are (a) using LogisticLossSHAP instead of SHAP values, (b) splitting the available training samples in the proportions 0.7, 0.1 and 0.2 for the training, validation and test sets respectively (so that LogisticLossSHAP values are calculated on unseen data) and (c) employing different noise features and taking the maximum in each iteration.
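The 0.7/0.1/0.2 split in step (b) can be sketched as follows (`split_indices` is a hypothetical helper, not the package's own code):

```python
import numpy as np

def split_indices(n, rng):
    """Shuffle n sample indices and split them 0.7 / 0.1 / 0.2 into
    train, validation and test index arrays."""
    idx = rng.permutation(n)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000, np.random.default_rng(0))
print(len(train_idx), len(val_idx), len(test_idx))  # 700 100 200
```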

Powershap uses a percentile formula to empirically calculate the p-value, ranking the mean Shapley value of a feature within the Shapley values of the random uniform feature. In contrast, for each iteration we consider only the noise feature with the highest influence. To empirically calculate the p-value, we treat the distributions of Shapley values of both the highest noise and the feature as non-parametric and employ the Mann-Whitney U test. The null hypothesis is that there is no difference between the two distributions; the alternative hypothesis is that the Shapley value distribution of the feature is greater than that of the noise. This yields a p-value for each feature. Then, using a user-supplied p-value cut-off \(\alpha\), we can find the set of informative features. We use scipy.stats.mannwhitneyu [20] to calculate the p-values. Algorithm 2 shows the modifications to calculate p-values.
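A sketch of the per-feature test as described, using invented Shapley value arrays in place of the values the algorithm would collect across iterations:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# Invented per-iteration Shapley values: 30 iterations for one candidate
# feature and for the highest-influence noise feature of each iteration.
feature_shap = rng.normal(loc=0.05, scale=0.01, size=30)
noise_shap = rng.normal(loc=0.001, scale=0.005, size=30)

# One-sided test: is the feature's Shapley distribution greater than the noise's?
_, p_value = mannwhitneyu(feature_shap, noise_shap, alternative="greater")

alpha = 0.01  # user-supplied cut-off
informative = p_value < alpha
```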

### Automatic mode

We utilise the powerful automatic mode designed in [4] with the following changes. We set the initial number of iterations to 20 instead of 10, which helps estimate more accurate p-values. The other change is in the calculation of the effect size. We test the equality of variances using Levene's test and, if the variances do not differ at the user-provided \(\alpha\) level, we use the pooled standard deviation (Eqs. 7 and 8). Where the variances are not equal, we use Glass's delta (Eq. 9, using the standard deviations of the Shapley values of the features) in estimating the effect size. We find that, generally, for LogisticLossSHAP values, the augmented noise features have much smaller means (roughly 50 to 1,000 times smaller) and standard deviations (5 to 100 times smaller) than the informative features in our simulation data. We assume t-distributions for both the noise and informative features in the empirical power calculations (our observations on the simulation data show this is a reasonable assumption). Algorithm 3 shows the power analysis. We do not change the following aspects of the automatic mode of powershap: the default \(\alpha\) is set to 0.01 and the required power to 0.99; if the number of required iterations estimated using the power calculation exceeds 10, an additional 10 iterations are added and the p-values and required iterations are calculated again; by default, this is repeated three times to avoid a possible infinite calculation. We provide source files that can replace certain existing files in powershap [4] to make use of *LLpowershap*.

$$\begin{aligned} s_p = \sqrt{\frac{s_1^2 + s_2^2}{2}} \end{aligned}$$

(7)

$$\begin{aligned} \mathrm{effect\_size} = \frac{\mu (s_1) - \mu (s_2)}{s_p} \end{aligned}$$

(8)

$$\begin{aligned} \mathrm{effect\_size} = \frac{\mu (s_1) - \mu (s_2)}{s_2} \end{aligned}$$

(9)
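The logic around Eqs. 7–9 can be sketched as follows (`effect_size` is a hypothetical helper; the choice of \(s_2\) as the Glass's delta denominator follows Eq. 9 as written):

```python
import numpy as np
from scipy.stats import levene

def effect_size(s1, s2, alpha=0.01):
    """Eqs. 7-9: pooled-SD effect size when Levene's test finds no variance
    difference at level alpha, otherwise Glass's delta using s2's SD."""
    s1, s2 = np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)
    _, p = levene(s1, s2)
    if p >= alpha:  # variances not significantly different: Eqs. 7 and 8
        denom = np.sqrt((s1.var(ddof=1) + s2.var(ddof=1)) / 2)
    else:           # unequal variances: Glass's delta, Eq. 9
        denom = s2.std(ddof=1)
    return (s1.mean() - s2.mean()) / denom
```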
