A distinction between aleatoric and epistemic uncertainty in the domain of medical decision-making is proposed in (Senge et al., 2014). The paper points out that standard Bayesian inference does not distinguish the two: the prediction is obtained as the expectation of the model with respect to the posterior, so the epistemic uncertainty is averaged out. To address this limitation, they propose a framework that distinguishes aleatoric from epistemic uncertainty by predicting the plausibility of a prediction for each class; both kinds of uncertainty can then be derived from these per-class plausibility measures.
The importance of safety in deep learning is vividly illustrated in (Varshney & Alemzadeh, 2016) with the example of a self-driving car accident that led to the driver's death. The authors analyze the car's failure under rare circumstances and emphasize the importance of predicting epistemic uncertainty in AI-assisted systems. (Kendall & Gal, 2017) elaborate the distinction between aleatoric and epistemic uncertainty and discuss neural networks' limited awareness of their own competence: for example, image-classification experiments have shown that a trained model fails with high confidence on specifically designed adversarial attacks.
Consider a hypothesis \(h\) that delivers probabilistic predictions of outcomes \(y\) given an input \(x\): \(p_h(y|x) = p(y|x,h)\).
In Bayesian inference, \(h\) is equipped with a prior distribution \(p(\cdot)\), and learning consists of replacing that prior with the posterior distribution, where \(p(D|h)\) is the likelihood of \(h\). The posterior \(p(\cdot|D)\) captures the knowledge gained by the learner, and hence its epistemic uncertainty.
\[p(h|D) = \frac{p(h)\, p(D|h)}{p(D)} \propto p(h)\, p(D|h)\]The peakedness of this distribution can be used to measure uncertainty: the more peaked it is, the more the probability mass is concentrated in a small region of \(\mathcal{H}\). Similarly, the sharpness of the loss landscape around the training data is used in deep learning to quantify uncertainty (Foret et al., 2020), as discussed in the related work.
The representation of uncertainty about a prediction is given by viewing the posterior p(h|D) under the mapping from hypothesis to probabilities of outcomes, yielding the following predictive posterior distribution
\[p(y|x) = \int_{\mathcal{H}} p(y|x,h) d p(h|D)\]In this type of Bayesian inference, a final prediction is produced by model averaging: the expectation over the hypotheses is taken with respect to the posterior distribution on \(\mathcal{H}\), with each hypothesis \(h\) weighted by its posterior probability, which makes the integral challenging to compute.
In \(p(y|x)\), aleatoric and epistemic uncertainty are no longer distinguished, because the lack of knowledge about the hypothesis (i.e., the epistemic uncertainty) has been averaged out.
Consider a data-generating process specified by a parameterized family of probability measures \(P_{\theta}\), where \(\theta\) is a parameter vector. Let \(f_{\theta}\) denote the density function of \(P_{\theta}\); then \(f_{\theta}(x)\) is the probability (density) of observing \(x\) when sampling from \(P_{\theta}\).
In such a framework, the main problem is to estimate the parameter \(\theta\) using a set of observations \(D = \{X_1, \cdots, X_{N}\}\). Maximum likelihood estimation is a general principle used to tackle that problem. It prescribes estimating \(\theta\) by the maximizer of the likelihood function or, equivalently, the log-likelihood function. Assuming observations in \(D\) are independent, the log-likelihood function is given by:
\[l_{N}(\theta) = \sum_{n=1}^N \log f_{\theta} (X_n)\]An important result of mathematical statistics states that the maximum likelihood estimate \(\hat{\theta}\) is asymptotically normal as \(N \rightarrow \infty\). More specifically, \(\sqrt{N} (\hat{\theta} - \theta)\) converges to a normal distribution with mean 0 and covariance matrix \(\mathcal{I}^{-1}(\theta)\), where
\[\mathcal{I}(\theta) = - \left[ \mathbf{E}_{\theta} \left( \frac{\partial^2 \log f_{\theta}(X)}{\partial \theta_i \partial \theta_j} \right) \right]\]is the Fisher information matrix, and \(\mathbf{E}_{\theta}\) denotes expectation under the distribution of the random variable when \(\theta\) is fixed.
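To make this concrete, here is a small simulation (a toy Bernoulli model, my own illustration rather than anything from the text) checking the asymptotic normality above: for Bernoulli(\(\theta\)), the Fisher information of one observation is \(1/(\theta(1-\theta))\), so \(\sqrt{N}(\hat{\theta}-\theta)\) should have variance \(\theta(1-\theta)\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N, reps = 0.3, 1000, 4000

# MLE for Bernoulli(theta) is the sample mean (the maximizer of the log-likelihood)
theta_hat = rng.binomial(1, theta, size=(reps, N)).mean(axis=1)

# Fisher information of one Bernoulli observation: I(theta) = 1 / (theta (1 - theta))
fisher = 1.0 / (theta * (1 - theta))

# sqrt(N) (theta_hat - theta) should be approximately N(0, I(theta)^{-1})
z = np.sqrt(N) * (theta_hat - theta)
print(np.var(z), 1.0 / fisher)  # both close to 0.21
```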
Gaussian processes (Seeger, 2004) can be seen as a generalization of the Bayesian approach from an inference about multivariate random variables to an inference about functions. Thus, they can be seen as a distribution over random functions.
More specifically, a stochastic process in the form of a collection of random variables is said to be drawn from a Gaussian process with mean function \(m\) and covariance function \(k\), denoted \(f \sim \mathcal{GP}(m, k)\), if for any finite set of elements \(\vec{x}_1, \ldots , \vec{x}_m \in \mathcal{X}\), the associated finite set of random variables \(f(\vec{x}_1), \ldots , f(\vec{x}_m)\) has the following multivariate normal distribution: \(\left[ \begin{matrix} f(\vec{x}_1) \\ f(\vec{x}_2) \\ \vdots \\ f(\vec{x}_m) \end{matrix} \right] \sim \mathcal{N} \left( \left[ \begin{matrix} m(\vec{x}_1) \\ m(\vec{x}_2) \\ \vdots \\ m(\vec{x}_m) \end{matrix} \right] , \left[ \begin{matrix} k(\vec{x}_1, \vec{x}_1) & \cdots & k(\vec{x}_1, \vec{x}_m) \\ \vdots & \ddots & \vdots \\ k(\vec{x}_m, \vec{x}_1) & \cdots & k(\vec{x}_m, \vec{x}_m)\end{matrix} \right] \right)\)
Intuitively, a function \(f\) drawn from a Gaussian process prior can be thought of as a (very) high-dimensional vector drawn from a (very) high-dimensional multivariate Gaussian. Here, each dimension of the Gaussian corresponds to an element \(\vec{x}\) from the index set \(\mathcal{X}\), and the corresponding component of the random vector represents the value of \(f(\vec{x})\).
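This finite-dimensional view can be sampled directly: pick a finite index set, build the kernel matrix, and draw from the corresponding multivariate normal. Below is a minimal numpy sketch with an assumed squared exponential kernel and zero mean function (my own illustrative choices).

```python
import numpy as np

def rbf_kernel(xs, length_scale=0.5):
    # squared exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2))
    d = xs[:, None] - xs[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(1)
xs = np.linspace(0, 5, 100)         # finite index set x_1, ..., x_m
K = rbf_kernel(xs)
mean = np.zeros(len(xs))            # prior mean m(x) = 0

# one draw from the GP prior = one draw from the multivariate normal above
# (a small jitter on the diagonal keeps the covariance numerically PSD)
f = rng.multivariate_normal(mean, K + 1e-8 * np.eye(len(xs)))
print(f.shape)  # (100,)
```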
Gaussian processes allow for performing proper Bayesian inference in a non-parametric way: Starting with a prior distribution on functions \(h \in \mathcal{H}\), specified by a mean function \(m\) and kernel \(k\), this distribution can be replaced by a posterior in light of observed data \(\mathcal{D} = \{ (\vec{x}_i , y_i ) \}_{i=1}^N\), where an observation \(y_i = f(\vec{x}_i) + \epsilon_i\) may be corrupted by an additive noise component \(\epsilon_i\). Likewise, a posterior predictive distribution can be obtained on outcomes \(y \in \mathcal{Y}\) for a new query \(\vec{x}_{q} \in \mathcal{X}\).
Problems with discrete outcomes \(y\), such as binary classification with \(\mathcal{Y} = \{ -1, +1 \}\), are made amenable to Gaussian processes by suitable link functions, linking these outcomes with the real values \(h = h(\vec{x})\) as underlying (latent) variables. For example, using the logistic link function \(P( y \,|\, h) = \sigma(h) = \frac{1}{1 + \exp( - y \, h)} \, ,\) the following posterior predictive distribution is obtained: \(P( y = +1 | X , \vec{y} , \vec{x}_{q}) = \int \sigma(h') \, P( h' | X , \vec{y} , \vec{x}_{q}) \, d \, h' \, ,\) where \(P( h' | X , \vec{y} , \vec{x}_{q}) = \int P( h' | X , \vec{x}_{q}, \vec{h})\, P( \vec{h} | X, \vec{y}) \, d \, \vec{h} \, .\) However, since the likelihood is no longer Gaussian, approximate inference techniques (e.g., Laplace, expectation propagation, MCMC) are needed.
In the case of regression, the variance \(\sigma^2\) of the posterior predictive distribution for a query \(\vec{x}_{q}\) is a meaningful indicator of (total) uncertainty. The variance of the error term (additive noise) corresponds to the aleatoric uncertainty, so the difference could be considered epistemic uncertainty. The latter is primarily determined by the (parameters of the) kernel function, e.g., the characteristic length-scale of a squared exponential covariance function.
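The variance decomposition can be sketched in closed form for GP regression. The toy numpy example below (with an assumed RBF kernel, data, and noise level of my own choosing, not an experiment from the text) shows that the total predictive variance is small near the training inputs and reverts to the prior variance plus noise far away.

```python
import numpy as np

def k(a, b, l=0.7):
    # squared exponential covariance with characteristic length-scale l
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / l) ** 2)

rng = np.random.default_rng(2)
X = np.array([0.0, 0.5, 1.0, 1.5])            # training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(4)  # noisy observations
noise = 0.1 ** 2                              # aleatoric variance sigma^2

Kinv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
Xq = np.array([0.75, 4.0])                    # one query near, one far from the data
kq = k(Xq, X)

mu = kq @ Kinv @ y
# epistemic part: prior variance minus what the data explains
var_epi = 1.0 - np.sum(kq * (kq @ Kinv), axis=1)  # diag of k(q,q) - kq Kinv kq^T
total_var = var_epi + noise                        # total = epistemic + aleatoric

print(total_var)  # much smaller at x=0.75 (inside the data) than at x=4.0
```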
The penultimate-layer output of a deep neural network can be treated as a Gaussian process (Kimin Lee, 2018). The authors propose a loss function that clusters the representations of different classes into k-means-friendly representations, with data points evenly clustered around the class centers. Uncertainty can then be computed as the Mahalanobis distance between an input data point and the cluster centers of the different classes.
A standard neural network can be seen as a probabilistic classifier \(h\): in the case of classification, given a query \(\vec{x} \in \mathcal{X}\), the final layer typically outputs a probability distribution on the set of classes \(\mathcal{Y}\) (using transformations such as softmax); in the case of regression, it outputs a distribution (treated as Gaussian in (Kimin Lee, 2018)). Training a neural network can essentially be seen as maximum likelihood inference. As such, it yields probabilistic predictions but no information about the confidence in these probabilities; in other words, it captures aleatoric but not epistemic uncertainty.
In the context of neural networks, epistemic uncertainty is commonly understood as uncertainty about the model parameters, that is, the weights \(\vec{w}\) of the neural network. Bayesian neural networks (BNNs) have been proposed as a Bayesian extension of deep neural networks (missing reference) to capture this type of epistemic uncertainty. In BNNs, each weight is represented by a probability distribution (again, typically a Gaussian) instead of a real number, and learning comes down to Bayesian inference, i.e., computing the posterior \(P( \vec{w} | \mathcal{D})\). The predictive distribution of an outcome given a query instance \(\vec{x}_q\) is then given by \(P(y | \vec{x}_q , \mathcal{D} ) = \int P( y | \vec{x}_q , \vec{w}) \, P(\vec{w} | \mathcal{D}) \, d \vec{w} \enspace .\) Since the posteriors on the weights cannot be obtained analytically, approximate variational techniques are used (missing reference), seeking a variational distribution \(q_{\vec{\theta}}\) on the weights that minimizes the Kullback-Leibler divergence \(\operatorname{KL}(q_{\vec{\theta}} \| P(\vec{w} | \mathcal{D}))\) between \(q_{\vec{\theta}}\) and the true posterior.
Dropout variational inference, proposed by (Gal & Ghahramani, 2016), establishes a connection between Bayesian inference and the use of dropout at inference time. Dropout is a regularization technique that randomly disables some connections during deep neural network training. (Gal & Ghahramani, 2016) suggest keeping dropout enabled during inference: by performing multiple forward passes, each with connections randomly disabled before each layer, we can estimate an empirical Gaussian distribution over the outputs. From this empirical distribution (or its parameter estimates), we can obtain a mean value and a confidence measure in terms of the distributional variance. We expect the empirical variance to be low where training data were abundant, since all network subsets had the opportunity to learn in these areas. In areas with no training data, however, the network's behavior is not controlled, so we expect high variance among the different network subsets.
One of the main drawbacks of this technique is the need for numerous forward passes to estimate a meaningful Gaussian distribution for each input data point.
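A minimal numpy sketch of the idea (with hypothetical, untrained random weights standing in for a real model): dropout stays active at prediction time, and the spread of repeated stochastic forward passes serves as the uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.standard_normal((20, 1)), np.zeros(20)   # hypothetical trained weights
W2, b2 = rng.standard_normal((1, 20)), np.zeros(1)

def forward(x, p_drop=0.5):
    # one stochastic pass: dropout stays ON at inference time
    h = np.maximum(W1 @ x + b1[:, None], 0.0)
    mask = rng.random(h.shape) > p_drop
    h = h * mask / (1.0 - p_drop)                     # inverted dropout scaling
    return (W2 @ h + b2[:, None]).ravel()

x = np.array([[0.5]])
samples = np.array([forward(x) for _ in range(200)])  # T = 200 forward passes

# empirical Gaussian over the predictions: mean and variance
print(samples.mean(), samples.var())
```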
The total uncertainty estimated this way contains both the (homoscedastic) aleatoric and the epistemic uncertainty for the training data points. However, aleatoric uncertainty can be heteroscedastic, with a different noise level for every input point. For that reason, (Kendall & Gal, 2017) proposed loss attenuation, which predicts a mean and a variance for every input data point.
Assume that the model's single output \(y\) is not deterministic but normally distributed with parameters \(\mu(x)\) and \(\sigma^2(x)\), both depending on the input \(x\). Instead of the mean squared error, the training loss must account for the distributional variance. Taking the negative log-likelihood of the normal distribution as the loss function and ignoring constants, \(\mathcal{L}(x,y) = -\log \mathcal{N}(y \,|\, \hat{\mu}(x), \hat{\sigma}^2(x)) = \frac{1}{2} \log \hat{\sigma}^2(x) + \frac{(y - \hat{\mu}(x))^2}{2 \hat{\sigma}^2(x)}\)
Intuitively, the numerator of the second term in the negative log-likelihood encourages the mean prediction \(\mu(x)\) to be close to the observed data, while the denominator makes the variance \(\sigma^2(x)\) large when the deviation from the mean \((y - \hat{\mu})^2\) is large. The first term is a counterweight that keeps the variance from growing indefinitely. This strategy can capture the aleatoric uncertainty of our data-generating process in areas with sufficient training data. However, training a deep neural network with loss attenuation is often tricky and quickly runs into NaNs.
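The behavior of this loss can be checked numerically. The sketch below uses toy numbers and the common log-variance parameterization (an assumption of mine, often used to sidestep the NaN issue) to verify that, for a fixed residual, the loss is minimized at \(\sigma^2 = (y - \hat{\mu})^2\).

```python
import numpy as np

def attenuated_loss(y, mu, log_var):
    # negative log-likelihood of N(mu, sigma^2), parameterizing log sigma^2
    # for numerical stability (a raw sigma^2 output can go negative or to zero)
    return 0.5 * log_var + 0.5 * (y - mu) ** 2 * np.exp(-log_var)

# for a fixed residual, the loss is minimized at sigma^2 = (y - mu)^2
y, mu = 2.0, 0.5
log_vars = np.linspace(-4, 4, 2001)
best = log_vars[np.argmin(attenuated_loss(y, mu, log_vars))]
print(np.exp(best), (y - mu) ** 2)  # both close to 2.25
```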
Bayesian model averaging establishes a natural connection between Bayesian inference and ensemble learning. Indeed, the variance of the predictions produced by an ensemble is a good indicator of epistemic uncertainty: it is inversely related to the peakedness of the posterior distribution \(P(h | \mathcal{D})\) and can be seen as a measure of the discrepancy between the (most probable) candidate hypotheses in the hypothesis space. Based on this idea, (Lakshminarayanan et al., 2017) propose a simple ensemble approach as an alternative to Bayesian DNNs, which is easy to implement, readily parallelizable, and requires little hyperparameter tuning.
Instead of training a single network, we will train an ensemble of \(M\) networks with different random initializations. While we expect all networks to behave similarly in areas with sufficient training data, the results will be completely different where no data is available. For a final prediction, we now take all networks and combine their results into a Gaussian mixture distribution from which we can, again, extract single mean and variance estimations.
\[\begin{aligned} \hat{\mu}_c(x) &= \frac{1}{M} \sum_{i=1}^M \hat{\mu}_i(x) \\ \hat{\sigma}_c^2(x) &= \frac{1}{M} \sum_{i=1}^M \hat{\sigma}_i^2(x) + \left[ \frac{1}{M} \sum_{i=1}^M (\hat{\mu}_i^2(x) - \hat{\mu}_c^2(x)) \right] \end{aligned}\]This variance prediction even clearly distinguishes between the two types of uncertainty. The first term, an average of all variance estimates, can be interpreted as aleatoric uncertainty. The remaining term can be considered epistemic uncertainty: it is low if all mean estimates agree on a similar value and grows when the mean estimates differ widely.
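A minimal numpy sketch of this combination rule, using made-up per-network means and variances for a single input:

```python
import numpy as np

# per-network predictions for one input x: means mu_i and variances sigma_i^2
mus = np.array([1.0, 1.2, 0.8])
vars_ = np.array([0.10, 0.20, 0.15])
M = len(mus)

mu_c = mus.mean()
# total variance of the Gaussian mixture = aleatoric + epistemic parts
aleatoric = vars_.mean()                       # average of the variance estimates
epistemic = (mus ** 2).mean() - mu_c ** 2      # spread of the mean estimates
sigma2_c = aleatoric + epistemic

print(mu_c, sigma2_c)
```

Note that the epistemic term is exactly the variance of the mean estimates, so it vanishes when all ensemble members agree.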
In summary, uncertainty estimation in deep learning is crucial for ensuring the safety and reliability of AI systems, particularly in scenarios where human lives are at stake, such as autonomous driving. By distinguishing between aleatoric and epistemic uncertainties and employing Bayesian inference techniques, researchers can develop models that not only make accurate predictions but also quantify the level of uncertainty associated with each prediction. Additionally, methods such as maximum likelihood estimation and Fisher information provide valuable tools for estimating model parameters and understanding the underlying data-generating process. Overall, integrating uncertainty estimation techniques into deep learning models is essential for building robust and trustworthy AI systems.
In the ever-evolving landscape of artificial intelligence, there’s a pressing need to cultivate diverse talent and foster inclusivity in STEM fields. One initiative that has personally enriched my journey and allowed me to contribute meaningfully to this cause is MISE (Machine Learning in Science and Engineering), a transformative program dedicated to equipping high school students from underprivileged backgrounds with the skills and knowledge needed to thrive in the world of AI.
In 2022, I embarked on a journey with MISE along with Prudencio that would leave an indelible mark on both my career and the lives of the students I had the privilege to mentor. Over the span of two immersive weeks in Ghana, I had the honor of guiding a group of eager learners through the intricacies of machine learning and research methodologies.
From introducing foundational concepts like statistics to exploring cutting-edge techniques in deep learning, each session was designed to ignite curiosity and spark creativity. Witnessing the students’ eyes light up as they grappled with complex ideas and applied their newfound knowledge to real-world problems was nothing short of inspiring.
Among the remarkable individuals I had the pleasure of mentoring, one student’s journey stands out as a testament to the transformative power of education and mentorship. Joel Manu, a determined and driven participant in the MISE program, embraced every challenge with enthusiasm and determination.
Under my guidance, Joel Manu embarked on a research project that not only showcased his intellectual curiosity but also propelled him to new heights of achievement. His project, focusing on out of distribution detection in deep learning, not only earned him accolades at the International Science Expo for Young Scientists but also caught the attention of admissions officers at MIT.
I am thrilled to share that Joel Manu’s dedication and hard work paid off, as he secured admission to MIT this year—a testament to his resilience, tenacity, and unwavering commitment to excellence.
While Joel Manu’s success story is undeniably inspiring, the impact of the MISE program extends far beyond individual achievements. It represents a beacon of hope and opportunity for countless students who dare to dream of a future in STEM.
As I reflect on my journey with MISE, I am reminded of the profound impact that mentorship and education can have on shaping the future of AI and beyond. It is a reminder of our collective responsibility to nurture talent, cultivate diversity, and champion inclusivity in every facet of our work.
The journey with MISE has reinforced my belief in the transformative power of education and mentorship. It has ignited a passion within me to continue advocating for diversity and inclusion in STEM and to empower the next generation of leaders and innovators.
I invite you to join me in this noble pursuit. Whether through mentorship initiatives, educational programs, or advocacy efforts, each of us has the power to make a difference in the lives of aspiring scientists and engineers around the world.
Together, let’s create a future where opportunity knows no bounds and where every voice is heard, valued, and celebrated.
I wanted to take a moment to express my deepest gratitude for the warm welcome and exceptional accommodation provided during my recent participation in the MISE program with a special thanks to Joel Dogoe (Director of MISE program).
In this paper, (Unterthiner et al., 2020) showed empirically that the generalization gap of a neural network can be predicted by looking only at its weights. As part of this work, they released a dataset of 120k convolutional neural networks trained on different datasets.
The authors extracted a set of statistical features from the network's weights and fed them to an estimator \(\hat{F}\) to predict the generalization gap (see Predict the generalization gap using marginal distribution).
The proposed framework relies on the training dataset's features being encoded into the network's weights \(W_1 \cdots W_L\). From the weights alone, we can extract information about how certain the network is about its encoded features, and use it as a feature vector for the estimator \(\hat{F}\). The authors also investigated whether the hyper-parameters alone could serve as the feature vector for the estimator.
They found that a mapping from the training hyper-parameters to the generalization gap exists if the random seed and the training set are fixed. Unterthiner et al. (2020) also trained the proposed estimator on a set of CNNs trained on one dataset and tested it on a set of CNNs trained on a different dataset, a setting they call domain shift.
Now let's look at the set of extracted features. A first problem is how to combine features from DNNs of variable depth. The authors propose extracting statistics from only the input layer and the top three hidden layers, which gave better results than statistics computed from flattening the whole set of weights.
The statistics \(\tilde{W}\) per layer include the mean, the variance, and the qth percentiles for \(q \in \{0, 25, 50, 75, 100\}\). Extracting these statistics for both kernels and biases of the first four layers results in a feature vector of \(4 \times 2 \times 7 = 56\) real values. Finally, the authors fed the extracted features into a gradient boosting model and showed promising results at predicting the generalization gap in the domain-shift setting and with different architectures in the train and test sets.
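A sketch of this feature extraction with hypothetical random weight tensors (the layer shapes here are my own stand-ins, not the paper's architectures):

```python
import numpy as np

def layer_stats(w):
    # mean, variance, and the 0/25/50/75/100th percentiles: 7 numbers per tensor
    w = w.ravel()
    return [w.mean(), w.var()] + list(np.percentile(w, [0, 25, 50, 75, 100]))

rng = np.random.default_rng(4)
# hypothetical kernels and biases for the input + top three hidden layers
layers = [(rng.standard_normal((3, 3, 8)), rng.standard_normal(8)) for _ in range(4)]

features = []
for kernel, bias in layers:
    features += layer_stats(kernel) + layer_stats(bias)

print(len(features))  # 4 layers x 2 tensors x 7 statistics = 56
```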
The main insight of this paper is that meaningful information about the training set, such as its uncertainty and the generalization gap, can be extracted simply from the neural network's trained weights.
I would recommend reading the paper itself and checking the related work, this is just a summary to give you a rough idea of what is going on.
In this paper, (Jiang et al., 2018) discuss a method to predict the generalization gap of trained deep neural networks. The authors use margin information from the input training set as a feature vector for an estimator of the generalization gap.
The authors represent the margin distribution by margin bounds extracted from each data point in the training set; this distribution is then used as the input feature to train an estimator of the generalization gap. Next, we explain the generalization gap and the margin distribution used to predict it.
The generalization gap can be defined as the difference in a DNN's performance between the training set and the test set. This difference comes from the distributional shift between the collected training data and the real test data.
We start by explaining the concept of the margin bound.
The margin bound is the distance between a data point and the decision boundary (Elsayed et al., 2018): the smallest displacement of the data point that results in a score tie between the top two classes \(i\) and \(j\). In this work, only positive margins are considered (i.e., only correctly predicted training data points). The decision boundary and the margin are defined as: \(D_{(f, i, j)} \triangleq \{x \,|\, f_i(x) = f_j(x)\}\) and \(d_{f, (i,j)}(x) \triangleq \min_{\delta} \|\delta\|_p \;\; s.t. \;\; f_i(x + \delta) = f_j(x + \delta)\)
In an SVM, the distance of a data point to the decision boundary can be computed exactly. In DNNs, however, computing the exact distance is intractable, so a first-order Taylor approximation is used. The approximate distance of a data point to the decision boundary at layer \(l\) is: \(d_{f, (i,j)}(x^l) = \frac{f_i(x^l) - f_j(x^l)}{||\nabla_{x^l}f_i(x^l) - \nabla_{x^l} f_j(x^l)||_2}\)
Here \(f_i(x^l)\) is the output logit for class \(i\) given the input representation \(x^l\) at layer \(l\). The distance is signed: it is negative when the corresponding data point is on the wrong side of the decision boundary.
If we use statistics of the margin distribution to represent the trained DNN, the computed distances will depend on the scaling of each layer (they change when the representation is multiplied or divided by a constant). Hence, we need to normalize the margins of each network before feeding them to the estimator \(\hat{f}\).
To normalize the margins, let \(x_k^l\) be the representation vector of data point \(x_k\) at layer \(l\). We compute the variance of each coordinate of \(\{x_k^l\}\) and then sum these individual variances. More concretely, the total variation is computed as:
\[\begin{equation} \nu (\boldsymbol{x}^{l}) = \text{tr} \Big(\frac{1}{n} \sum_{k=1}^n (\boldsymbol{x}_k^l - \bar{\boldsymbol{x}}^l) (\boldsymbol{x}_k^l - \bar{\boldsymbol{x}}^l)^T \Big) \quad , \quad \bar{\boldsymbol{x}}^l = \frac{1}{n} \sum_{k=1}^n \boldsymbol{x}_k^l \,, \end{equation}\]Using the total variation, the normalized margin is specified by: \(\begin{equation} \hat{d}_{f, (i,j)}(\boldsymbol{x}_k^l) = \frac{d_{f, (i,j)}(\boldsymbol{x}_k^l)}{\sqrt{\nu (\boldsymbol{x}^{l})}} \label{eq:tv_norm_margin} \end{equation}\)
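A minimal numpy sketch of this normalization (with random stand-in representations and margins of my own, not the paper's data), checking that dividing by the square root of the total variation cancels a layer-wise rescaling:

```python
import numpy as np

def total_variation(xs):
    # trace of the empirical covariance = sum of per-coordinate variances
    centered = xs - xs.mean(axis=0)
    return np.trace(centered.T @ centered) / len(xs)

def normalize_margins(distances, xs):
    # rescale margins so the signature is invariant to layer-wise scaling
    return distances / np.sqrt(total_variation(xs))

rng = np.random.default_rng(5)
xs = rng.standard_normal((100, 16))   # representations x_k^l at one layer
d = rng.standard_normal(100)          # unnormalized margin distances

# scaling the layer by c scales both d and sqrt(tv) by c: the result is unchanged
print(np.allclose(normalize_margins(d, xs), normalize_margins(3.0 * d, 3.0 * xs)))
```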
Finally, to create a signature for the margin distribution, the authors extract statistical properties from the margins of correctly predicted training data points. Given a set of distances \(\mathcal{D} = \{\hat{d}_m\}_{m=1}^n\) constituting the margin distribution, they use the median \(Q_2\), the first quartile \(Q_1\), and the third quartile \(Q_3\), along with two fences that indicate variability outside the upper and lower quartiles. There are many variants of fences; in this work, with \(IQR = Q_3 - Q_1\), the upper fence is \(\max(\{\hat{d}_m : \hat{d}_m \in \mathcal{D} \wedge \hat{d}_m \leq Q_3 + 1.5IQR\})\) and the lower fence is \(\min(\{\hat{d}_m : \hat{d}_m \in \mathcal{D} \wedge \hat{d}_m \geq Q_1 - 1.5IQR\})\).
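These five statistics can be computed as follows (a sketch on synthetic margins, not the paper's data):

```python
import numpy as np

def margin_signature(d):
    # median, quartiles, and Tukey-style fences of the margin distribution
    q1, q2, q3 = np.percentile(d, [25, 50, 75])
    iqr = q3 - q1
    upper = d[d <= q3 + 1.5 * iqr].max()   # largest point below the upper fence
    lower = d[d >= q1 - 1.5 * iqr].min()   # smallest point above the lower fence
    return np.array([lower, q1, q2, q3, upper])

rng = np.random.default_rng(6)
d = np.abs(rng.standard_normal(1000))      # positive margins only
sig = margin_signature(d)
print(sig)  # five non-decreasing summary statistics
```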
These five statistics, extracted at the first four layers and fed to a linear regression estimator \(\hat{f}\), result in a decent generalization gap predictor.
The dataset used in that work is provided at Demogen, and my re-implementation at Generalization gap features
Here, the authors obtained a feature that represents the separability of the learned representations; this feature was used to predict the generalization gap based only on a subset of the training set.
I would recommend reading the paper itself and checking the related work, this is just a summary to give you a rough idea of what is going on.
Mixed-integer programs (MIPs) are used in several disciplines, including:
A MIP consists of an objective function and a set of constraints on its decision variables. Solving a MIP is based on the branch-and-bound algorithm combined with linear programming relaxations.
The basic branch-and-bound steps are:
After branching on a variable, the resulting sub-MIPs form a search tree whose nodes are the MIPs generated by the search procedure. The leaves of the search tree are the MIPs that have not yet been branched. Once we obtain an optimal solution at any node, we designate it as a leaf node, building up a space of feasible solutions from all the leaf nodes.
The optimization process consists of exploring the feasible space generated by solving the LP relaxations at the search tree nodes and narrowing down the set of feasible solutions based on the objective function.
During the search, the leaf node with the best objective value is called the best bound. The difference between any node's objective value and the best bound's objective value is called the gap; a gap of zero proves that a solution is optimal.
Before applying the branch-and-bound algorithm, a MIP solver usually performs a presolve step. These reductions aim to shrink the problem and tighten the formulation for faster solving; presolve is a critical step that reduces the LP feasible search space.
Ralph E. Gomory introduced cutting planes, which tighten the formulation by removing undesired fractional solutions during the solution process.
Now for the practical part: we will use a library called CVXPY, a wrapper on top of multiple commercial and free solvers that makes it easy to switch between them.
|         | LP | QP | SOCP | SDP | EXP | MIP |
|---------|----|----|------|-----|-----|-----|
| CBC     | X  |    |      |     |     | X   |
| GLPK    | X  |    |      |     |     |     |
| GLPK_MI | X  |    |      |     |     | X   |
| OSQP    | X  | X  |      |     |     |     |
| CPLEX   | X  | X  | X    |     |     | X   |
| ECOS    | X  | X  | X    |     | X   |     |
| ECOS_BB | X  | X  | X    |     | X   | X   |
| GUROBI  | X  | X  | X    |     |     | X   |
| MOSEK   | X  | X  | X    | X   | X   | X   |
| CVXOPT  | X  | X  | X    | X   |     |     |
| SCS     | X  | X  | X    | X   | X   |     |
In the upcoming example, we create a simple mixed-integer quadratic problem, where \(x\) is the optimization variable and \(A\), \(b\) are problem parameters:
\[\begin{split}\begin{array}{ll} \mbox{minimize} & \|Ax-b\|_2^2 \\ \mbox{subject to} & x \in \mathbf{Z}^n, \end{array}\end{split}\]We start by importing CVXPY and Numpy libraries.
import cvxpy as cp
import numpy as np
Now we generate a random problem instance using NumPy:
np.random.seed(0)
m, n = 40, 25
A = np.random.rand(m, n)
b = np.random.randn(m)
Then we create a CVXPY decision variable with the integer flag set to true (denoting that it takes integer values):
x = cp.Variable(n, integer=True)
Now we define our objective
objective = cp.Minimize(cp.sum_squares(A @ x - b))
After that, we define the problem, which takes the objective (and optionally a list of constraints) as input:
prob = cp.Problem(objective)
The `solve` method takes optional parameters, such as the solver to use (e.g., `cp.ECOS_BB`):
prob.solve()
We can inspect the solution using the following commands:
print("Status: ", prob.status)
print("The optimal value is", prob.value)
print("A solution x is")
print(x.value)
I recommend checking (Diamond & Boyd, 2016; Gomory, 1969; Garey & Johnson, 1979; Agrawal et al., 2017) for more details about MIPs and CVXPY.
In this research, we attempt to understand a neural model's architecture by computing an importance score for its neurons. The computed importance score can be used to prune the model or to understand which features are most meaningful to the trained ANN (artificial neural network). The importance score is computed using Mixed Integer Programming (see the proposed method infographic) (Garey & Johnson, 1979).
Mixed Integer Programming is a combinatorial optimization problem restricted to discrete decision variables, linear constraints, and a linear objective function. MIP problems are NP-complete, even when the variables are restricted to binary values; the difficulty comes from ensuring integer solutions, which rules out gradient methods. When continuous variables are also included, the problems are called mixed integer programs.
The solver tries to search the space of possible solutions based on the set of input constraints. In order to narrow down the set of possible solutions, the solver uses an objective function to select the optimal solution. For a tutorial about mixed integer programs using python check introduction to mixed integer programming.
In the following sections, we will explain in depth the constraints used, the objective and the generalization technique (to use masks computed by a dataset on another dataset).
We propagate input images through each layer of the ANN, with each image carrying two perturbed values, \(x-\epsilon\) and \(x+\epsilon\), denoting the initial lower and upper bounds.
In our work, we use a tight \(\epsilon\) value to approximate the trained input model and to narrow the search space explored by the solver.
We define the following decision variables:
The following parameters are used:
Now we define the constraints:
\(z_0 = x\) Equality between the input batch to the MIP and the decision variable holding the input to the first layer. \(z_{i+1} \geq 0\) For ReLU-activated layers, the output logits are always non-negative.
\(v_{i} \in \{0, 1\}\) The gating decision variable used for ReLU.
\(z_{i+1} \leq v_i u_i\) Upper bound on the current layer's logit output using the gating variable.
\(z_{i+1} + (1- v_i) l_i \leq W_i z_i + b_i - (1- s_i)u_i\) \(z_{i+1} \geq W_i z_{i} + b_i - (1- s_i)u_i\)
The above constraints contain the gating decision variable \(v_i\), which chooses whether to enable the logit output, along with the neuron importance score \(s_i\).
\(0 \leq s_{i} \leq 1\) The neuron importance score lies in the range \([0, 1]\). The same constraints are used for convolutional layers, with one importance score computed per feature map. We convert each convolutional filter to a Toeplitz matrix and the input image to a vector, allowing us to use a simple matrix multiplication that can be computed efficiently, albeit at a high memory cost.
A Toeplitz matrix is a matrix in which the main diagonal and all sub-diagonals are constant. Given a sequence \(a_n\), we can create a Toeplitz matrix by putting the sequence in the first column of the matrix and shifting it by one entry in each following column. The following figure shows the steps of creating a doubly blocked Toeplitz matrix from an input filter (Brosch & Tam, 2015).
\[\begin{pmatrix} a_0 & a_{-1} & a_{-2} & \cdots & \cdots & \cdots & \cdots & a_{-(N-1)} \\ a_1 & a_0 & a_{-1} & a_{-2} & & & & \vdots \\ a_2 & a_1 & a_0 & a_{-1} & \ddots & & & \vdots \\ \vdots & a_2 & \ddots & \ddots & \ddots & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \ddots & \ddots & a_{-2} & \vdots \\ \vdots & & & \ddots & a_1 & a_0 & a_{-1} & a_{-2} \\ \vdots & & & & a_2 & a_1 & a_0 & a_{-1} \\ a_{(N-1)} & \cdots & \cdots & \cdots & \cdots & a_2 & a_1 & a_0 \\ \end{pmatrix}.\]Feature maps or kernels at each input channel are flipped and then converted to a matrix. When multiplied by the vectorized input image, the computed matrix yields the full convolution output. For padded convolution, we use only parts of the output of the full convolution, and for strided convolutions we use a sum of 1-strided convolutions, as proposed by Brosch & Tam. First, we pad zeros to the top and right of the input feature map so it has the same size as the output of the full convolution. Then, we create a Toeplitz matrix for each row of the zero-padded feature map. Finally, we arrange these small Toeplitz matrices in a large doubly blocked Toeplitz matrix, in the same way a Toeplitz matrix is created from an input sequence, with each small matrix as an element of the sequence.
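As a minimal sketch of this idea in one dimension (the 2D case stacks such matrices into the doubly blocked form above), a full convolution can be written as multiplication by a Toeplitz matrix. The helper below is only illustrative, not the code used in our MIP formulation:

```python
import numpy as np

def conv_as_toeplitz(kernel, n):
    """Full 1D convolution with `kernel` as an (m + n - 1) x n Toeplitz matrix."""
    m = len(kernel)
    T = np.zeros((m + n - 1, n))
    for j in range(n):
        # each column is the kernel shifted down by one entry
        T[j:j + m, j] = kernel
    return T

x = np.array([1.0, 2.0, 3.0])
k = np.array([1.0, -1.0])
T = conv_as_toeplitz(k, len(x))
print(T @ x)  # identical to np.convolve(k, x)
```

Multiplying `T` by the vectorized input reproduces `np.convolve(k, x)`, which is the 1D analogue of multiplying the doubly blocked Toeplitz matrix by the vectorized image.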
The formulation at each feature map becomes \(\sum_{d=1}^{N^l}W^{(l)}_{d} h^{l-1}+b^{(l)}\). We then incorporate the same constraints used for the fully connected layer, repeated for each feature map.
Pooling layers are used to reduce the spatial representation of an input image by applying an arithmetic operation on each feature map of the previous layer. We model both average and max pooling on multi-input units as constraints of a MIP formulation with kernel dimensions \(ph\) and \(pw\).
The average pooling layer applies the average operation on each feature map of the previous layer. This operation is linear and can easily be incorporated into our MIP formulation: \(\begin{alignat}{3} x & = \text{AvgPool}(y_1,\ldots,y_{ph * pw} ) & = \frac{1}{ph * pw} \sum_{i=1}^{ph * pw} y_i . \end{alignat}\)
Max Pooling takes the maximum of each feature map of the previous layer. \(\begin{alignat}{3} x & = \text{MaxPool}(y_1,\ldots,y_{ph * pw} ) & = \text{max}\{y_1,\ldots,y_{ph * pw}\}. \end{alignat}\) This operation can be expressed by introducing a set of binary variables \(m_1, \ldots ,m_{ph * pw}\).
\[\begin{align} \sum_{i=1}^{ph * pw} m_i = 1 \\ x \ge y_i, \\ x \le y_i m_i+U_i(1-m_i) \\ m_i \in \{0,1\} \\ i=1, \ldots, ph * pw. \end{align}\](Fischetti & Jo, 2018) devised the max and average pooling representations in a MIP. The max pooling representation contains a set of binary gating variables, enabled only for the maximum value \(y_i\) in the set \(\{y_1, \ldots, y_{ph * pw}\}\).
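To make the max-pool encoding concrete, the pure-Python checker below (a hypothetical helper, not part of our solver) verifies that the constraints above pin \(x\) to the window maximum: any candidate \(x\) above the gated input violates the upper-bound constraint.

```python
def maxpool_constraints_hold(y, x, m, U, eps=1e-9):
    """Check the Fischetti & Jo max-pool constraints for a candidate (x, m).
    y: window inputs, m: binary gating variables, U: a valid upper bound on y."""
    if sum(m) != 1:                          # exactly one gate is open
        return False
    if any(x < yi - eps for yi in y):        # x >= y_i for every input
        return False
    # x <= y_i * m_i + U * (1 - m_i): binds x to the gated input only
    if any(x > yi * mi + U * (1 - mi) + eps for yi, mi in zip(y, m)):
        return False
    return True

y = [0.3, 1.7, -0.5, 0.9]
m = [0, 1, 0, 0]                             # gate open at the arg-max
print(maxpool_constraints_hold(y, max(y), m, U=2.0))   # True: x = max is feasible
print(maxpool_constraints_hold(y, 2.0, m, U=2.0))      # False: x exceeds the gated y_i
```

Only `x = max(y)` with the gate at the arg-max satisfies all constraints simultaneously, which is exactly what the MIP enforces.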
Our first objective is to maximize the number of neurons sparsified from the trained model: the more neurons with zero importance score, the better for this objective.
Let us define some notation to make the equations easier:
\(I_i = \sum_{j \in L_i} (s_{i,j} -2)\)
In order to create a relation between neuron scores in different layers, our objective becomes maximizing the number of neurons sparsified from layers having a higher score \(I_i\).
\[\text{sparsity} = \frac{\max_{A^{'} \subset A, |A^{'}| = (n-1)} \sum_{I \in A^{'}} I}{\sum_i^{n} \vert L_i \vert}\]Our second objective is to select the neurons that are non-critical to the model's current predictive task. For that objective, we use a simple marginal softmax computed on the true labels of the input batch of images.
In marginal softmax (Gimpel & Smith, 2010), the loss focuses on the predicted label values without relying on the exact logit values.
The solver's logit values are weighted by the importance score of each neuron and will therefore differ from the true ones.
\(\text{softmax} = \sum_{j \in L_n} \log[\sum_c \exp(z_{j, n, c})] - \sum_{j \in L_n} \sum_c Y_{j,c} z_{j,n,c}\) Where index \(c\) stands for the class label.
We define \(\lambda\), used to give more weight to the predictive capacity of the model; \(\lambda\) multiplies the marginal softmax.
\[\text{loss} = \text{sparsity} + \lambda \cdot \text{softmax}\]The larger the value of \(\lambda\), the fewer parameters are pruned.
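A small numeric sketch of the combined objective (the function names and toy numbers here are ours, not from the paper):

```python
import math

def marginal_softmax(logits, labels):
    """Marginal softmax loss (Gimpel & Smith, 2010): log-sum-exp over classes
    minus the true-class logit, summed over the batch."""
    return sum(math.log(sum(math.exp(v) for v in z)) - z[y]
               for z, y in zip(logits, labels))

def total_loss(sparsity, logits, labels, lam):
    # loss = sparsity + lambda * softmax; a larger lambda favours
    # predictive capacity over the amount of pruning
    return sparsity + lam * marginal_softmax(logits, labels)

logits = [[2.0, 0.1], [0.3, 1.5]]   # toy solver logits for two images
labels = [0, 1]
print(total_loss(0.4, logits, labels, lam=0.5))
```

Since the marginal softmax term is always non-negative, increasing \(\lambda\) increases the penalty on mispredictions relative to the sparsity reward, which is why fewer parameters get pruned.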
In this experiment, we show that the neuron importance scores computed on dataset \(d1\) with a specific initialization can be applied to another, unrelated dataset \(d2\), achieving good results.
Steps:
We sparsify the model and compute its predictive accuracy on the corresponding dataset's test set. Steps:
All the above pruning baselines remove the same number of parameters as removing the non-critical neurons.
| Dataset | MNIST | Fashion-MNIST | CIFAR-10 | CIFAR-10 |
|---|---|---|---|---|
| Architecture | LeNet-5 | LeNet-5 | LeNet-5 | VGG-16 |
| Ref. | \(98.9\% \pm 0.1\) | \(89.7\% \pm 0.2\) | \(72.2\% \pm 0.2\) | \(83.9\% \pm 0.4\) |
| RP. | \(56.9\% \pm 36.2\) | \(33\% \pm 24.3\) | \(50.1\% \pm 5.6\) | \(85\% \pm 0.4\) |
| CP. | \(38.6\% \pm 40.8\) | \(28.6\% \pm 26.3\) | \(27.5\% \pm 1.7\) | \(83.3\% \pm 0.3\) |
| Ours | \(\mathbf{98.7\% \pm 0.1}\) | \(\mathbf{87.7\% \pm 2.2}\) | \(\mathbf{67.7\% \pm 2.2}\) | N/A |
| Ours + ft | \(\mathbf{98.9\% \pm 0.04}\) | \(\mathbf{89.8\% \pm 0.4}\) | \(\mathbf{68.6\% \pm 1.4}\) | \(\mathbf{85.3\% \pm 0.2}\) |
| Prune (%) | \(17.2\% \pm 2.4\) | \(17.8\% \pm 2.1\) | \(9.9\% \pm 1.4\) | \(36\% \pm 1.1\) |
| Threshold | \(0.2\) | \(0.2\) | \(0.3\) | \(0.3\) |
The solving time is on the order of seconds, but if we relax \(v_i\) to be continuous instead of binary, the solver takes less than a second (relaxation of the constraints).
| Model | Source | Target | Ref. Acc. | Masked Acc. | Pruning |
|---|---|---|---|---|---|
| LeNet-5 | MNIST | Fashion-MNIST | \(89.7\% \pm 0.3\) | \(89.2\% \pm 0.5\) | \(16.2\% \pm 0.2\) |
| LeNet-5 | MNIST | CIFAR-10 | \(72.2\% \pm 0.2\) | \(68.1\% \pm 2.5\) | |
| VGG-16 | CIFAR-10 | MNIST | \(99.1\% \pm 0.1\) | \(99.4\% \pm 0.1\) | \(36\% \pm 1.1\) |
| VGG-16 | CIFAR-10 | Fashion-MNIST | \(92.3\% \pm 0.4\) | \(92.1\% \pm 0.6\) | |
The masked versions of the models were computed on the source datasets; these results show that a sub-network computed on one dataset can be applied and generalized to another dataset (with no statistical relation between the two datasets).
We have discussed the constraints used in the MIP formulation and verified the scores produced by the MIP through a set of experiments on different types of architectures and datasets. Furthermore, our approach was able to generalize across datasets.
Check the paper for more theoretical details (ElAraby et al., 2020).
In this paper (Mallya & Lazebnik, 2017), the authors discuss a method for adding and supporting multiple tasks in a single architecture without having to worry about catastrophic forgetting. They show that three fine-grained classification tasks can be added to a single ImageNet-trained VGG-16 network with accuracies comparable to training a separate network for each task.
Lifelong learning, also known as continual learning, is a domain where we try to create an agent able to acquire expertise on different sets of tasks without forgetting previously learned tasks.
This can be considered a step toward general artificial intelligence, as we try to create agents with the human ability to learn new tasks (e.g., walking, running, swimming, etc.) without forgetting previously acquired ones (futuristic).
In this setup, data from previous tasks is not seen in later tasks, which causes catastrophic forgetting when a new task arrives at the model (meaning it will get very low accuracies on previous tasks).
This paper resides in the family of parameter-isolation methods, which fix a set of parameters for each task. When you train a task A, the method freezes a set of weights and trains the others, which ensures that your model won't forget what it learned for a previous, arbitrary task B.
PackNet is a framework that fits a set of tasks into a single architecture by iteratively masking a set of weights, with marginal loss in accuracy on the first task, and using them for the newly added tasks.
For example, in the above illustration, we train a dense filter using data from the first task, Task I.
After training the dense filter, we prune 60% of the model's weights and set them to zero. The kept weights are selected by magnitude (weights with higher magnitude are important to the current task). After fixing the important weights for Task I, we start re-training the pruned weights on the newly added Task II.
Now, when a new Task III arrives, we repeat the same steps: we prune the weights non-critical to Task II (Task I's weights are not considered for pruning) and use the freed, non-critical weights to fit the new Task III.
This work was motivated by the compression techniques proposed by (Han et al., 2016), who showed that neural models are over-parameterized and that there exists a sparse sub-network that, when retrained, achieves the same or better performance than the non-sparsified model.
In this line of work, the model is pruned based on weight magnitude and then re-trained.
The approach consists of iteratively training on a task then pruning some of the parameters to use them for the new task without dramatically forgetting the first task.
Iterative steps applied when a new task B arrives while the model already holds task A:
Training phase
: They start with a pre-trained ImageNet VGG-16 model; in that case, ImageNet is task A.

Pruning step
: After training on task A, they remove a fixed percentage of parameters based on absolute magnitude; weights with higher magnitude are considered critical to the model. They prune, for example, the lowest 50% of the weights. Pruning these weights results in a loss in the model's performance, so to regain the lost accuracy, the model is retrained on task A.

Fitting new task B
: We freeze the critical weights selected for task A in the previous step. We then retrain the model using only the pruned weights from the previous steps, while using the previous tasks' parameters for the forward pass.

Inference step
: When performing inference for a selected task, we use the parameters trained for that task along with the parameters of previous tasks, so the model's state matches the one learned during training.

The previous steps are applied as new tasks arrive, but pruning only considers the weights selected for the most recent task. For example, if a task C arrives, we will prune only from the weights used by task B; task A's weights remain fixed.
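The mask bookkeeping behind these steps can be sketched in a few lines of numpy (a toy flattened weight vector; the real method prunes per layer and retrains between steps, and all names here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)              # toy flattened weight tensor
free = np.ones(100, dtype=bool)       # weights not yet claimed by any task

def claim_for_task(w, free, prune_frac):
    """Keep the highest-magnitude (1 - prune_frac) of the free weights for the
    current task; prune the rest (zero them, in place) so later tasks can
    re-train them. Returns the task's mask and the updated free set."""
    idx = np.flatnonzero(free)
    order = np.argsort(np.abs(w[idx]))           # ascending magnitude
    n_prune = int(len(idx) * prune_frac)
    pruned, kept = idx[order[:n_prune]], idx[order[n_prune:]]
    w[pruned] = 0.0                              # zeroed now, re-trained on the next task
    mask = np.zeros_like(free)
    mask[kept] = True
    new_free = free.copy()
    new_free[kept] = False
    return mask, new_free

# Task A claims 50% of all weights; task B claims 75% of the remainder
mask_a, free = claim_for_task(w, free, prune_frac=0.5)
mask_b, free = claim_for_task(w, free, prune_frac=0.25)
```

The per-task masks are disjoint by construction, which is what prevents later tasks from overwriting earlier ones.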
Top-1 Error:

| Tasks | Individual Networks | PackNet pruning (0.5, 0.75, 0.75) |
|---|---|---|
| ImageNet | 28.42 | 29.33 |
| CUBS | 22.57 | 25.72 |
| Stanford Cars | 13.97 | 18.08 |
| Flowers | 8.65 | 10.05 |
In their experiments, they start with VGG-16 pre-trained on ImageNet (1000 classes); the subsequent tasks are the CUBS, Stanford Cars, and Flowers datasets.
The experimental setup from the paper:
In the case of the Stanford Cars and CUBS datasets, we crop object bounding boxes out of the input images and resize them to 224 × 224. For the other datasets, we resize the input image to 256 × 256 and take a random crop of size 224 × 224 as input. For all datasets, we perform left-right flips for data augmentation.
In all experiments, we begin with an ImageNet pre-trained network, as it is essential to have a good starting set of parameters. The only change we make to the network is the addition of a new output layer per each new task. After pruning the initial ImageNet pre-trained network, we fine-tune it on the ImageNet dataset for 10 epochs with a learning rate of 1e-3 decayed by a factor of 10 after 5 epochs.
The idea of using one-shot compression techniques to choose which parameters to keep for which task is a good strategy, but the drawbacks of the proposed method are:

Training time
: For each new task, you have to prune and re-train on the previous task, which is expensive in terms of computational time.

Finite number of tasks
: What if all the remaining weights are critical for the last task? Then we won't be able to prune weights for a new task without dramatically losing the model's predictive capacity. Also, we keep the weights of all previous tasks, which limits the number of parameters we can prune.

I would recommend reading the paper itself and checking the related work; this is just a summary to give you a rough idea of what is going on. Also, if you are interested in continual learning, check the continual learning course from Université de Montréal.
In this paper (Neyshabur et al., 2017), they introduced a framework to stabilize GAN training by feeding multiple projections of each input image, computed with fixed filters, to different discriminators. Training GAN models is unstable in high-dimensional spaces, and one problem that might arise during training is saturation of the discriminator; in that case, the discriminator wins the game (the diminished-gradients problem).
Generative models in general provide a way to model structure in complex distributions. They have been useful for generating data points (images, music, etc.).
Generative Adversarial Networks are generative models that set up a minimax game between two models: a discriminator and a generator. The discriminator is a simple classifier trying to distinguish real data coming from the training distribution from fake data produced by the generator.
The generator model takes simple random noise \(z\) sampled from a Gaussian/uniform distribution (Gaussian works better; check ganhacks). The general objective of the GAN minimax game is \(\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1- D(G(z)))]\)
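As a quick numeric illustration of this objective (the helper below is our sketch, not from the paper): at the classic equilibrium where the discriminator outputs 0.5 everywhere, the value of the game is \(-\log 4\).

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of V(D, G) from discriminator outputs
    on real samples (d_real) and generated samples (d_fake)."""
    return (sum(math.log(p) for p in d_real) / len(d_real)
            + sum(math.log(1.0 - p) for p in d_fake) / len(d_fake))

print(gan_value([0.5] * 4, [0.5] * 4))   # -log(4), about -1.386
```

When the discriminator saturates (outputs near 0 on fakes), the second term's gradient with respect to the generator vanishes, which is exactly the diminished-gradients problem mentioned above.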
Before explaining the proposed approach, one important thing to understand is random projections.
Random projections are simply a set of random filters generated before training and applied to the input images during training, creating multiple projections of the data into a lower-dimensional space.
These random filters are fixed during training, so each discriminator looks at a different low-dimensional view of the input dataset. The random filters are drawn i.i.d. from a Gaussian distribution and scaled to have unit L2 norm.
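A numpy sketch of such fixed projections (the sizes here are hypothetical, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_proj, n_disc = 784, 64, 8      # flattened image dim, projection dim, #discriminators

# One fixed projection matrix W_k per discriminator, drawn i.i.d. from a
# Gaussian and scaled so every filter has unit L2 norm.
W = rng.normal(size=(n_disc, d_in, d_proj))
W /= np.linalg.norm(W, axis=1, keepdims=True)

x = rng.normal(size=(16, d_in))        # a toy batch of flattened images
views = np.einsum('bi,kip->kbp', x, W) # K low-dimensional views, one per D_k
print(views.shape)                     # (8, 16, 64)
```

Each discriminator \(D_k\) then only ever sees its own `views[k]`, which is what keeps the sub-games low-dimensional.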
In that case, the generator gets meaningful gradient signals from the different discriminators, each looking at a low-dimensional set of features. The more discriminators you have over different projections, the better the diversity and quality of the generator.
In this game setup, the generator is trying to fool an array of discriminators; each discriminator, on its projection of the input training image, tries to maximize its classification accuracy of real vs. fake.
The generator gets gradient signals from the array of discriminators and tries to create samples that fool all of them. \(\min_{G}\max_{D_{k}} \sum_{k=1}^{K} V(D_k,G) = \sum_{k=1}^{K} \mathbb{E}_{x \sim p_{data}(x)}[\log D_k(\mathbb{W_k^T}x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1- D_k(\mathbb{W_k^T}G(z)))]\)
In their experiments, a simple DCGAN architecture was used on the CelebFaces dataset. The details of the architecture along with the experiments are explained in a GitHub notebook.
The idea of using an ensemble of discriminators to stabilize GAN training is interesting and showed promising results at the time. Stabilizing GAN training in this setup comes at the expense of the following:
I would recommend reading the paper itself and checking the related work, this is just a summary to give you a rough idea of what is going on.
The scope of work was new to me: using artificial intelligence to predict the production of wells was an exciting kind of work, and based on the type of work I expected, I accepted their offer.
When I joined Raisa, my first task was to work on production forecasting models.
At first, I was excited to update and tune these models and to read papers in the field of oil and gas.
The deployment cycle of the models was not finished yet, and toward the end of the first month I got another task: Data Collection.
This task was to update their legacy data-collection tool to run in a reasonable time (the legacy version used to take more than a week) and to add more features to it.
I was able to finish this tool during my second month, but the requested features took more time because they depended on information from others.
During my second month, the company planned a hackathon, and I participated with an idea along with my colleagues Magdoub and Amal.
The idea, Who's working on what, was to integrate a wiki with Microsoft Team Foundation; a machine learning model would associate related wiki articles together along with the VSTS tasks.
You could then use the web interface to open others' tasks and learn more about their technical aspects from the articles recommended from the wiki.
The Raisa team was nice and I liked working with them, but all my future tasks as a data scientist were related only to data collection and controller development, and I was eager to expand my skills in deep learning and to join more challenging research and development work.
Avito launched a competition on Kaggle challenging users to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (where it was posted geographically, similar ads already posted), and historical demand for similar ads in similar contexts.
With ad demand being a promising topic, I was attracted to this competition by the opportunity to combine several feature types.
As a starting point, I read the published kernels and some papers; Dimitri's ad-click prediction paper was a detailed, attractive paper on predicting the probability that a user will click on an ad, or be attracted to it, based on the ad's thumbnail.
In this paper, several image features were introduced; these features were then implemented using OpenCV and added to a baseline LightGBM model.
- calculate image simplicity => calculates the simplicity of the input image.
- image basic segment stats => extracts basic image segmentation statistics (a tuple of 10 features).
- image face feats => extracts the number of faces in the input image using a pretrained Haar cascade from OpenCV.
- image sift feats => the number of SIFT keypoints extracted from the input image.
- image rgb simplicity => image simplicity features from the RGB image.
- image hsv simplicity => image simplicity features from the HSV image.
- image hue histogram => image features from the hue histogram of the HSV image.
- image grayscale simplicity => simplicity features on the grayscale image.
- image sharpness => calculates an image sharpness score.
- image contrast => calculates an image contrast score.
- image saturation => calculates an image saturation score.
- image brightness => calculates an image brightness score.
- image colorfulness => calculates a colorfulness score based on the paper.
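For example, the colorfulness feature can be computed with the Hasler & Süsstrunk (2003) metric; below is a numpy re-implementation of that metric as a sketch, not the actual competition code:

```python
import numpy as np

def image_colorfulness(img):
    """Hasler & Suesstrunk colorfulness score on an (H, W, 3) RGB array."""
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg = r - g                      # red-green opponent channel
    yb = 0.5 * (r + g) - b          # yellow-blue opponent channel
    std_root = np.hypot(rg.std(), yb.std())
    mean_root = np.hypot(rg.mean(), yb.mean())
    return std_root + 0.3 * mean_root

gray = np.full((8, 8, 3), 128, dtype=np.uint8)
print(image_colorfulness(gray))     # 0.0 for a pure grayscale image
```

A grayscale image scores 0 because both opponent channels vanish, while saturated images score high, which is what makes the feature useful for predicting ad attractiveness.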
Count vectorizers for the title and the description were used, along with word counts for both (users are attracted to short, straight-to-the-point descriptions).
A readability-index feature, Flesch reading ease, was also extracted. This readability index, based on the Pyphen dictionary package, is calculated from the average sentence length, i.e., the lexicon count over the number of sentences found in the input description based on punctuation.
```python
def flesch_reading_ease(text):
    ASL = avg_sentence_length(text)      # lexicon_count / number of sentences
    ASW = avg_syllables_per_word(text)   # syllable counts verified by Pyphen
    FRE = 206.835 - float(1.015 * ASL) - float(84.6 * ASW)
    return legacy_round(FRE, 2)
```
This feature, along with the count-vectorizer features for the input title and description, improved the model. The difficult part of the input text was the language barrier (Russian): I was unable to verify the correctness of the readability index or to inspect text patterns to derive more features.
I used the BorutaPy package to select from all the previously extracted features (image and text features).
Check the code for the features extracted from input images: Image Extraction for ad clicking.