Identifying Critical Neurons in ANN Architectures using Mixed Integer Programming

Download the paper

Check the code

In this research, we attempt to understand the neural model architecture by computing an importance score to neurons. The computed importance score can be used to prune the model or to understand which features are more meaningful to the trained ANN (artificial neural network). The importance score is computed using Mixed Integer Programming (proposed method infographic).

What’s Mixed Integer Programming (MIP) ?

Mixed Integer Programming is a combinatorial optimization problem restricted to discrete decision variables, linear constraints and linear objective function. The MIP problems are NP-complete, even when variables are restricted to only binary values. The difficulty comes from ensuring integer solutions, and thus, the impossibility of using gradient methods. When continuous variables are included, they are designated by mixed integer programs.

The solver tries to search the space of possible solutions based on the set of input constraints. In order to narrow down the set of possible solutions, the solver uses an objective function to select the optimal solution.

Proposed Framework diagram

Diagram of proposed framework

In the following sections, we will explain in depth the constraints used, the objective and the generalization technique (to use masks computed by a dataset on another dataset).

MIP Constraints

Bounds propagation

We will propagate input images through each layer in the ANN, each image having 2 perturbed values and denoting initial upper and lower bound.

In our work, we use tight value to approximate the input trained model and to narrow the search space used by the solver.

ReLU Constraints

We define the following decision variables :

The following parameters are used :

Now we define the constraints used :

Equality between input batch to the MIP and the decision variable holding input to the first layer. For ReLU activated layers, the output logit is always positive.

Defining gating decision variable used for ReLU

Upper bound of current layer’s logit output using gating variable.

The above constraints contain the decision gating variable that is choosing whether to enable the logit output or not along with the neuron importance score .

The neuron importance score is defined in the range of [0-1].

MIP Objective


Our first objective is to maximize number of neurons sparsified from the trained model. The more neurons having zero importance score the better for this objective. Let’s define a set of notation to make the equations easier :

Marginal Softmax

Our second objective is to select which neurons are non-critical to the current model’s predictive task. For that objective, we use a simple marginal softmax computed on the true labels of the input batch of images.

In marginal softmax, the loss focus more on the predicted labels values without relying on the value of the logit.

The solver’s logit value will be weighted by the importance score of each neuron and will be different from the true one.

Where index stands for the class label.

Multi-Objective (MIP)

We define used to give more weight to the predictive capacity of the model. The is multiplied by the marginal softmax.

The larger the value of the lambda the less pruned parameters.

Plot of lambda vs test accuracy on Fashion-MNIST with LeNet model



Diagram of generalization experiment

In this experiment, we show that the computed neuron importance score on dataset with specific initialization can be applied on another unrelated dataset achieving good results.

Steps :


We sparsify the model and compute its predictive accuracy on the equivalent dataset’s test set. Steps:

All the above pruning ways are having same number of removed parameters as removing non-critical neurons.


MODEL FC-3 FC-4 LeNet FC-3 FC-4 LeNet  
Ref 98% ± 0.13 97.6% ± 0.11 98.8% ± 0.09 87.7% ± 0.6 87.7% ± 0.4 89.5% ± 0.3 72% ± 0.6
RP. 98% ± 0.13 83.2% ± 4.5 89.9% ± 3.6 37.6% ± 4.3 28.15% ± 4.5 82.8% ± 3.4 62.4% ± 0.8
CP. 75.7% ± 5.3 59.3% ± 17.4 75.7% ± 5.3 13% ± 1.6 12.8% ± 2.7 75.2% ± 3.6 62.2% ± 1.4
Ours 97.2% ± 0.6 97% ± 0.4 97.7% ± 0.7 86.9% ± 0.8 86.9% ± 0.7 84.8% ± 2.3 66.5% ± 1.8
Ours + FT 97.8% ± 0.2 97.8% ± 0.12 98.9% ± 0.12 88.2% ± 0.18 87.7% ± 0.5 89.1% ± 0.8 68.90%
PRUNE (%) 35.5% ± 4.5 30.9% ± 10.2 23.1% ± 5.8 58.1% ± 6.8 43.8% ± 14.2 14.2% ± 3.5 11.3% ± 1.4

The solving time is in terms of second, but if we relax to be continous instead of binary the solver will take less than a second (relaxation of teh constraints).

Generalization Results

Dataset Ref. Masked Pruning (%)
Fashion-MNIST 89.50% 88.82% 20.04%
    89.19 15.50%
CIFAR-10 72.30% 66.50% 19.80%
    70.86% 15.15%

The masked version of the model was computed on MNIST dataset, these results shows that computed sub-network on a dataset can be applied and generalized to another dataset (with no statistical relation between both datasets).


We have discussed the constraints used in the MIP formulation and verified the score produced by the MIP by a set of experiments on different types of architectures and datasets. Also our approach was able to generalize on different datasets, when we used same sparsified model produced by MNIST dataset and retrained it using Fashion-Mnist it gave same result to the original model.

Check the paper for more theoretical details.