Identifying Critical Neurons in ANN Architectures using Mixed Integer ProgrammingPublications ·
Download the paper
Check the code
In this research, we attempt to understand the neural model architecture by computing an importance score to neurons. The computed importance score can be used to prune the model or to understand which features are more meaningful to the trained ANN (artificial neural network). The importance score is computed using Mixed Integer Programming (proposed method infographic).
What’s Mixed Integer Programming (MIP) ?
Mixed Integer Programming is a combinatorial optimization problem restricted to discrete decision variables, linear constraints and linear objective function. The MIP problems are NP-complete, even when variables are restricted to only binary values. The difficulty comes from ensuring integer solutions, and thus, the impossibility of using gradient methods. When continuous variables are included, they are designated by mixed integer programs.
The solver tries to search the space of possible solutions based on the set of input constraints. In order to narrow down the set of possible solutions, the solver uses an objective function to select the optimal solution.
Proposed Framework diagram
- First we start by training the neural model to be sparsified on the dataset .
- After training, we feed a batch of images (an image per class) to the MIP, and we compute a lower and upper bound at each layer for each input image. and
- We create a set of liner constraints representing the neural model with integer variable used for ReLU activation constraint.
- We define the objective used by the MIP to sparsify more neurons with marginal loss in predictive accuracy.
- After running the solver and computing the neuron importance score, any neuron below a threshold (0.1) is pruned and set to zero without fine-tuning.
In the following sections, we will explain in depth the constraints used, the objective and the generalization technique (to use masks computed by a dataset on another dataset).
We will propagate input images through each layer in the ANN, each image having 2 perturbed values and denoting initial upper and lower bound.
In our work, we use tight value to approximate the input trained model and to narrow the search space used by the solver.
We define the following decision variables :
- -> decision variable defining value of input logit to layer
- -> binary decision variable as a gate for ReLU activation function (this variable can be relaxed to a continuous value for faster execution)
- -> decision variable holding the neuron importance score at layer with neuron index
The following parameters are used :
- -> fixed weights of layer
- -> bias of layer
- -> upper bound of output logit of layer
- -> lower bound of output logit of layer
Now we define the constraints used :
Equality between input batch to the MIP and the decision variable holding input to the first layer. For ReLU activated layers, the output logit is always positive.
Defining gating decision variable used for ReLU
Upper bound of current layer’s logit output using gating variable.
The above constraints contain the decision gating variable that is choosing whether to enable the logit output or not along with the neuron importance score .
The neuron importance score is defined in the range of [0-1].
Our first objective is to maximize number of neurons sparsified from the trained model. The more neurons having zero importance score the better for this objective. Let’s define a set of notation to make the equations easier :
- set of neurons at layer .
- be the sum of neuron scores at layer having neurons scaled down to range .
- Let In order to create a relation between neuron scores in different layers, our objective becomes the maximization of the amount of neurons sparsified from layers having higher score .
Our second objective is to select which neurons are non-critical to the current model’s predictive task. For that objective, we use a simple marginal softmax computed on the true labels of the input batch of images.
In marginal softmax, the loss focus more on the predicted labels values without relying on the value of the logit.
The solver’s logit value will be weighted by the importance score of each neuron and will be different from the true one.
Where index stands for the class label.
We define used to give more weight to the predictive capacity of the model. The is multiplied by the marginal softmax.
The larger the value of the lambda the less pruned parameters.
In this experiment, we show that the computed neuron importance score on dataset with specific initialization can be applied on another unrelated dataset achieving good results.
- We train the model on dataset .
- We compute its neuron importance score using MIP
- After computing the neuron importance score, we create a masked neural model with neurons below certain threshold removed.
- We restart the model to its initialization while using its masked version.
- We train the masked version on .
We sparsify the model and compute its predictive accuracy on the equivalent dataset’s test set. Steps:
- We start by removing non-critical neurons having a score below certain threshold (0.1 in fully connected models and 0.2 in convolutional models) and then we evaluate the masked model (Ours).
- We remove random neurons from the model and evaluate for comparison purposes with our method (RP).
- We remove top scoring neurons denoted by the MIP from the model (CP).
- Another experiment, we try to fine tune our technique for 1 epoch to recover the lost predictive accuracy and in some cases it surpasses the original non-masked model (Ours + fine-tuning).
All the above pruning ways are having same number of removed parameters as removing non-critical neurons.
|Ref||98% ± 0.13||97.6% ± 0.11||98.8% ± 0.09||87.7% ± 0.6||87.7% ± 0.4||89.5% ± 0.3||72% ± 0.6|
|RP.||98% ± 0.13||83.2% ± 4.5||89.9% ± 3.6||37.6% ± 4.3||28.15% ± 4.5||82.8% ± 3.4||62.4% ± 0.8|
|CP.||75.7% ± 5.3||59.3% ± 17.4||75.7% ± 5.3||13% ± 1.6||12.8% ± 2.7||75.2% ± 3.6||62.2% ± 1.4|
|Ours||97.2% ± 0.6||97% ± 0.4||97.7% ± 0.7||86.9% ± 0.8||86.9% ± 0.7||84.8% ± 2.3||66.5% ± 1.8|
|Ours + FT||97.8% ± 0.2||97.8% ± 0.12||98.9% ± 0.12||88.2% ± 0.18||87.7% ± 0.5||89.1% ± 0.8||68.90%|
|PRUNE (%)||35.5% ± 4.5||30.9% ± 10.2||23.1% ± 5.8||58.1% ± 6.8||43.8% ± 14.2||14.2% ± 3.5||11.3% ± 1.4|
The solving time is in terms of second, but if we relax to be continous instead of binary the solver will take less than a second (relaxation of teh constraints).
The masked version of the model was computed on MNIST dataset, these results shows that computed sub-network on a dataset can be applied and generalized to another dataset (with no statistical relation between both datasets).
We have discussed the constraints used in the MIP formulation and verified the score produced by the MIP by a set of experiments on different types of architectures and datasets. Also our approach was able to generalize on different datasets, when we used same sparsified model produced by MNIST dataset and retrained it using Fashion-Mnist it gave same result to the original model.
Check the paper for more theoretical details.
- Hartmanis, Juris. “Computers and intractability: a guide to the theory of NP-completeness (michael r. garey and david s. johnson).” Siam Review 24.1 (1982): 90.
- Gimpel, Kevin, and Noah A. Smith. “Softmax-margin CRFs: Training log-linear models with cost functions.” Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.