# Survey on Uncertainty Estimation in Deep Learning

Senge et al. (2014) propose a distinction between aleatoric and epistemic uncertainty in the domain of medical decision-making. They explain that standard Bayesian inference does not distinguish between the two: the prediction is obtained as an expectation over models with respect to the posterior, so the epistemic uncertainty is averaged out. To address this limitation, they propose a framework that predicts the plausibility of each class. From these per-class plausibility measures, aleatoric and epistemic uncertainty can be derived.

Varshney & Alemzadeh (2016) illustrate the importance of safety in deep learning with the example of a self-driving car accident that led to the driver’s death. They explain the car’s failure under rare circumstances and emphasize the importance of predicting epistemic uncertainty in AI-assisted systems. Kendall & Gal (2017) distinguish between aleatoric and epistemic uncertainty and discuss neural networks’ limited awareness of their own competence. For example, experiments on image classification have shown that a trained model fails with high confidence on specifically designed adversarial attacks.

### Bayesian Inference

Consider a hypothesis \(h\) that delivers probabilistic predictions of outcomes \(y\) given an input \(x\): \(p_h(y|x) = p(y|x,h)\).

In Bayesian inference, \(h\) is equipped with a prior distribution \(p(\cdot)\), and learning consists of replacing that prior with the posterior distribution

\[p(h|D) = \frac{p(h) \, p(D|h)}{p(D)} \propto p(h) \, p(D|h)\]

where \(p(D|h)\) is the likelihood of \(h\), and \(p(\cdot|D)\) captures the knowledge gained by the learner and hence its epistemic uncertainty.

The peakedness of this distribution can be used to measure uncertainty: the more peaked the posterior, the more the probability mass is concentrated in a small region of \(\mathcal{H}\). Similarly, the sharpness of the loss landscape is used in deep learning as an indication of uncertainty around the training data (Foret et al., 2020).

The representation of uncertainty about a prediction is given by viewing the posterior \(p(h|D)\) under the mapping from hypotheses to probabilities of outcomes, yielding the following posterior predictive distribution:

\[p(y|x) = \int_{\mathcal{H}} p(y|x,h) \, d p(h|D)\]

In this type of Bayesian inference, the final prediction is produced by model averaging: the expectation over the hypothesis space \(\mathcal{H}\) is taken with respect to the posterior distribution, so each hypothesis \(h\) is weighted by its posterior probability. This integral is challenging to compute.

In \(p(y|x)\), aleatoric and epistemic uncertainties are no longer distinguished, because the epistemic uncertainty, i.e., the lack of knowledge about which hypothesis is correct, is averaged out.
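To make the averaging concrete, here is a minimal sketch over a hypothetical finite hypothesis space: three candidate coin biases with a uniform prior. The names `hypotheses`, `prior`, and the observations in `data` are illustrative assumptions, not taken from the cited papers. The sketch computes the posterior via Bayes' rule and the predictive probability via model averaging, and uses the entropy of the posterior as one possible measure of its peakedness.

```python
import math

# Hypothetical finite hypothesis space: each h is a coin bias p(y=1 | h).
hypotheses = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]

def posterior(data, hypotheses, prior):
    """Return p(h | D) for i.i.d. binary observations in `data`."""
    heads = sum(data)
    tails = len(data) - heads
    # Unnormalized posterior: p(h) * p(D | h)
    unnorm = [p_h * (h ** heads) * ((1 - h) ** tails)
              for h, p_h in zip(hypotheses, prior)]
    evidence = sum(unnorm)  # p(D), the normalizing constant
    return [u / evidence for u in unnorm]

def predictive(weights, hypotheses):
    """Predictive p(y=1) obtained by averaging hypotheses with `weights`."""
    return sum(w * h for w, h in zip(weights, hypotheses))

def entropy(dist):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

data = [1, 1, 0, 1, 1, 1, 0, 1]  # made-up observations: 6 heads, 2 tails
post = posterior(data, hypotheses, prior)
print("posterior:", [round(p, 3) for p in post])
print("predictive p(y=1):", round(predictive(post, hypotheses), 3))
print("posterior entropy:", round(entropy(post), 3))
```

The predictive probability is a single number, so two very different posteriors (one peaked, one spread out) can yield the same prediction. This is the sense in which the epistemic part is averaged out.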

### Fisher Information

Consider a data-generating process specified by a parameterized family of probability measures \(P_{\theta}\), where \(\theta\) is a parameter vector. Let \(f_{\theta}\) denote the density function of \(P_{\theta}\), so that \(f_{\theta}(x)\) is the probability (or density) of observing \(x\) when sampling from \(P_{\theta}\).

In such a framework, the main problem is to estimate the parameter \(\theta\) using a set of observations \(D = \{X_1, \cdots, X_{N}\}\). Maximum likelihood estimation is a general principle used to tackle that problem. It prescribes estimating \(\theta\) by the maximizer of the likelihood function or, equivalently, the log-likelihood function. Assuming observations in \(D\) are independent, the log-likelihood function is given by:

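\[l_{N}(\theta) = \sum_{n=1}^N \log f_{\theta} (X_n)\]

An important result of mathematical statistics states that the distribution of the maximum likelihood estimate \(\hat{\theta}\) converges to a normal distribution as \(N \rightarrow \infty\). More specifically, \(\sqrt{N} (\hat{\theta} - \theta)\) converges to a normal distribution with mean 0 and covariance matrix \(\mathcal{I}^{-1}(\theta)\), where \(\mathcal{I}(\theta) = \mathcal{I}_1(\theta)\) is the Fisher information matrix of a single observation and

\[\mathcal{I}_{N}(\theta) = - \left[ \mathbf{E}_{\theta} \left( \frac{\partial^2 l_{N}}{\partial \theta_i \, \partial \theta_j} \right) \right]_{i,j}\]

is the Fisher information matrix of \(N\) observations. For independent, identically distributed observations, \(\mathcal{I}_N(\theta) = N \, \mathcal{I}(\theta)\). Here, \(\mathbf{E}_{\theta}\) denotes expectation under the distribution of the random variable when \(\theta\) is fixed.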
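As a small numerical check of the asymptotic result, consider a Bernoulli model with parameter \(\theta\), chosen here purely for illustration. Its maximum likelihood estimate is the sample mean, and the Fisher information of a single observation is \(1 / (\theta (1 - \theta))\). The sketch below simulates many datasets and compares the empirical variance of \(\sqrt{N} (\hat{\theta} - \theta)\) with the inverse Fisher information. The function names and simulation settings are assumptions made for this example.

```python
import random
import statistics

def bernoulli_mle(sample):
    """Maximum likelihood estimate of theta: the sample mean."""
    return sum(sample) / len(sample)

def fisher_information(theta, n=1):
    """Fisher information of n i.i.d. Bernoulli(theta) observations."""
    return n / (theta * (1.0 - theta))

def simulate_scaled_errors(theta, n, repeats, seed=0):
    """Draw `repeats` datasets of size n; return sqrt(n) * (theta_hat - theta)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        sample = [1 if rng.random() < theta else 0 for _ in range(n)]
        errors.append((n ** 0.5) * (bernoulli_mle(sample) - theta))
    return errors

theta_true = 0.3
errors = simulate_scaled_errors(theta_true, n=500, repeats=2000)
empirical_var = statistics.pvariance(errors)
asymptotic_var = 1.0 / fisher_information(theta_true)  # theta * (1 - theta)
print("empirical variance :", round(empirical_var, 4))
print("inverse Fisher info:", round(asymptotic_var, 4))
```

The two printed values should be close. A large Fisher information corresponds to a small variance of the estimator, i.e., to a parameter that the data pin down precisely.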

### Gaussian Processes

Gaussian processes (Seeger, 2004) can be seen as a generalization of the Bayesian approach from an inference about multivariate random variables to an inference about functions. Thus, they can be seen as a distribution over random functions.

More specifically, a stochastic process in the form of a collection of random variables is said to be drawn from a Gaussian process with mean function \(m\) and covariance function \(k\), denoted \(f \sim \mathcal{GP}(m, k)\), if for any finite set of elements \(\vec{x}_1, \ldots , \vec{x}_m \in \mathcal{X}\), the associated finite set of random variables \(f(\vec{x}_1), \ldots , f(\vec{x}_m)\) has the following multivariate normal distribution: \(\left[ \begin{matrix} f(\vec{x}_1) \\ f(\vec{x}_2) \\ \vdots \\ f(\vec{x}_m) \end{matrix} \right] \sim \mathcal{N} \left( \left[ \begin{matrix} m(\vec{x}_1) \\ m(\vec{x}_2) \\ \vdots \\ m(\vec{x}_m) \end{matrix} \right] , \left[ \begin{matrix} k(\vec{x}_1, \vec{x}_1) & \cdots & k(\vec{x}_1, \vec{x}_m) \\ \vdots & \ddots & \vdots \\ k(\vec{x}_m, \vec{x}_1) & \cdots & k(\vec{x}_m, \vec{x}_m)\end{matrix} \right] \right)\)

Intuitively, a function \(f\) drawn from a Gaussian process prior can be thought of as a (very) high-dimensional vector drawn from a (very) high-dimensional multivariate Gaussian. Here, each dimension of the Gaussian corresponds to an element \(\vec{x}\) from the index set \(\mathcal{X}\), and the corresponding component of the random vector represents the value of \(f(\vec{x})\).

Gaussian processes allow for making proper Bayesian inference in a non-parametric way: Starting with a prior distribution on functions \(h \in \mathcal{H}\), specified by a mean function \(m\) and kernel \(k\), this distribution can be replaced by a posterior in light of observed data \(\mathcal{D} = \{ (\vec{x}_i , y_i ) \}_{i=1}^N\), where an observation \(y_i = f(\vec{x}_i) + \epsilon_i\) could be corrupted by an additional noise component \(\epsilon_i\). Likewise, a posterior predictive distribution can be obtained on outcomes \(y \in \mathcal{Y}\) for a new query \(\vec{x}_{q} \in \mathcal{X}\).

Problems with discrete outcomes \(y\), such as binary classification with \(\mathcal{Y} = \{ -1, +1 \}\), are made amenable to Gaussian processes by suitable link functions, linking these outcomes with the real values \(h = h(\vec{x})\) as underlying (latent) variables. For example, using the logistic link function \(P( y | h) = \sigma(y \, h) = \frac{1}{1 + \exp( - y \, h)} \, ,\) the following posterior predictive distribution is obtained: \(P( y = +1 | X , \vec{y} , \vec{x}_{q}) = \int \sigma(h') \, P( h' | X , \vec{y} , \vec{x}_{q}) \, d \, h' \, ,\) where \(P( h' | X , \vec{y} , \vec{x}_{q}) = \int P( h' | X , \vec{x}_{q}, \vec{h})\, P( \vec{h} | X, \vec{y}) \, d \, \vec{h} \, .\) However, since the likelihood is no longer Gaussian, approximate inference techniques (e.g., Laplace approximation, expectation propagation, MCMC) are needed.

In the case of regression, the variance \(\sigma^2\) of the posterior predictive distribution for a query \(\vec{x}_{q}\) is a meaningful indicator of (total) uncertainty. The variance of the error term (additive noise) corresponds to the aleatoric uncertainty, so the difference could be considered epistemic uncertainty. The latter is primarily determined by the (parameters of the) kernel function, e.g., the characteristic length-scale of a squared exponential covariance function.
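The decomposition for regression can be sketched with a few lines of NumPy. The example assumes a zero-mean Gaussian process with a squared exponential kernel of unit signal variance and a known noise variance. The function names, the toy data, and the hyperparameter values are illustrative choices, not part of the cited work.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance (unit signal variance) for 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_predict(x_train, y_train, x_query, noise_var=0.1, length_scale=1.0):
    """Zero-mean GP regression; returns mean, epistemic and total variance."""
    k_tt = rbf_kernel(x_train, x_train, length_scale)
    k_qt = rbf_kernel(x_query, x_train, length_scale)
    # Cholesky factor of the noisy training covariance K + noise_var * I.
    chol = np.linalg.cholesky(k_tt + noise_var * np.eye(len(x_train)))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    # Variance of the latent function f at the query points; k(x, x) = 1.
    epistemic_var = 1.0 - np.sum(v ** 2, axis=0)
    # Adding the noise variance gives the variance of the outcome y.
    total_var = epistemic_var + noise_var
    return mean, epistemic_var, total_var

rng = np.random.default_rng(0)
x_train = np.linspace(-2.0, 2.0, 15)
y_train = np.sin(x_train) + rng.normal(0.0, np.sqrt(0.1), size=x_train.shape)
x_query = np.array([0.0, 6.0])  # inside vs. far outside the training range
mean, epistemic, total = gp_predict(x_train, y_train, x_query)
print("epistemic variance:", epistemic)
print("total variance    :", total)
```

The query inside the training range gets a small epistemic variance, while the query far away falls back to the prior variance. The aleatoric part, `noise_var`, is the same at both points because the noise is assumed homoscedastic here.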

In (Kimin Lee, 2018), the penultimate-layer output of a deep neural network is treated as Gaussian. The authors proposed a loss function that shapes the representations of different classes into k-means-friendly clusters, with data points evenly distributed around the cluster centers. Uncertainty can then be computed as the Mahalanobis distance between the representation of an input data point and the cluster centers of the different classes.
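The distance computation itself can be sketched as follows. This is only an illustration of a Mahalanobis-based score on synthetic two-dimensional "features". It assumes one mean per class and a single covariance matrix shared by all classes, and it does not reproduce the training procedure of the cited paper. All names and data are hypothetical.

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Estimate per-class means and one covariance shared by all classes."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Center every sample by the mean of its own class.
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    return classes, means, cov

def mahalanobis_score(x, means, cov):
    """Smallest squared Mahalanobis distance from x to any class mean."""
    precision = np.linalg.inv(cov)
    diffs = means - x  # shape: (num_classes, dim)
    dists = np.einsum("cd,de,ce->c", diffs, precision, diffs)
    return dists.min()

rng = np.random.default_rng(0)
features = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
                      rng.normal([4.0, 4.0], 1.0, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)
classes, means, cov = fit_class_gaussians(features, labels)

score_near = mahalanobis_score(np.array([0.1, -0.2]), means, cov)
score_far = mahalanobis_score(np.array([10.0, -10.0]), means, cov)
print("near a class center:", round(float(score_near), 2))
print("far from all centers:", round(float(score_far), 2))
```

A large score means the point is far from every class center relative to the spread of the data, which can be read as high uncertainty.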

### Bayesian Deep Learning

A standard neural network can be seen as a probabilistic classifier \(h\): in the case of classification, given a query \(\vec{x} \in \mathcal{X}\), the final layer of the network typically outputs a probability distribution (using transformations such as softmax) on the set of classes \(\mathcal{Y}\), and in the case of regression, a distribution over outcomes (typically treated as Gaussian, as in (Kimin Lee, 2018)). Training a neural network can essentially be seen as maximum likelihood inference. As such, it yields probabilistic predictions but no information about the confidence in these probabilities. In other words, it captures aleatoric but not epistemic uncertainty.

In the context of neural networks, epistemic uncertainty is commonly understood as uncertainty about the model parameters, that is, the weights \(\vec{w}\) of the neural network. Bayesian neural networks (BNNs) have been proposed as a Bayesian extension of deep neural networks to capture this type of epistemic uncertainty. In BNNs, each weight is represented by a probability distribution (again, typically a Gaussian) instead of a real number, and learning comes down to Bayesian inference, i.e., computing the posterior \(P( \vec{w} | \mathcal{D})\). The predictive distribution of an outcome given a query instance \(\vec{x}_q\) is then given by \(P(y | \vec{x}_q , \mathcal{D} ) = \int P( y | \vec{x}_q , \vec{w}) \, P(\vec{w} | \mathcal{D}) \, d \vec{w} \enspace .\) Since the posteriors on the weights cannot be obtained analytically, approximate variational techniques are used, seeking a variational distribution \(q_{\vec{\theta}}\) on the weights that minimizes the Kullback-Leibler divergence \(\operatorname{KL}(q_{\vec{\theta}} \| P(\vec{w} | \mathcal{D}))\) between \(q_{\vec{\theta}}\) and the true posterior.

#### Monte-Carlo Dropout

Dropout variational inference, proposed by Gal & Ghahramani (2016), establishes a connection between Bayesian inference and the use of dropout at inference time. Dropout is normally used as a regularization technique during training, where it randomly disables some connections. Gal & Ghahramani (2016) suggest keeping dropout enabled during inference: by performing multiple forward passes, each with a different random set of disabled connections, we obtain an empirical distribution over predictions that can be approximated by a Gaussian. This empirical distribution can then be used to obtain, for example, a mean value and a confidence measure in terms of the variance. We expect the empirical variance to be low where training data were abundant, since all network subsets had the opportunity to learn in these areas. In areas without training data, however, the network behavior is not constrained, so we expect a high variance among the different network subsets.

One of the main drawbacks of this technique is the need for numerous forward passes to obtain a meaningful empirical distribution for each input data point.
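The mechanics of the repeated stochastic forward passes can be sketched in NumPy. The one-hidden-layer network, its randomly initialized (untrained) weights, and the function name `mc_dropout_predict` are assumptions made for illustration, so the numbers only show how the mean and variance are formed, not how a trained model behaves.

```python
import numpy as np

def mc_dropout_predict(x, weights, drop_prob=0.5, passes=100, seed=0):
    """Run `passes` stochastic forward passes with dropout left enabled."""
    rng = np.random.default_rng(seed)
    w1, b1, w2, b2 = weights
    outputs = []
    for _ in range(passes):
        hidden = np.maximum(0.0, x @ w1 + b1)          # ReLU hidden layer
        mask = rng.random(hidden.shape) >= drop_prob    # keep with prob 1 - p
        hidden = hidden * mask / (1.0 - drop_prob)      # inverted dropout scaling
        outputs.append(hidden @ w2 + b2)
    outputs = np.stack(outputs)                         # (passes, batch, out)
    # Empirical mean and variance over the stochastic passes.
    return outputs.mean(axis=0), outputs.var(axis=0)

# Hypothetical, untrained weights for a 1 -> 32 -> 1 network.
rng = np.random.default_rng(1)
weights = (rng.normal(size=(1, 32)), np.zeros(32),
           rng.normal(size=(32, 1)) / np.sqrt(32), np.zeros(1))
x = np.array([[0.5], [5.0]])
mean, var = mc_dropout_predict(x, weights)
print("predictive mean:", mean.ravel())
print("predictive variance:", var.ravel())
```

The cost noted above is visible in the loop: each prediction requires `passes` forward evaluations instead of one.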

#### Distributional Parameter Estimation

The total uncertainty comprises both aleatoric and epistemic uncertainty. Aleatoric uncertainty can be homoscedastic, i.e., constant across inputs, or heteroscedastic, in which case every input point has its own uncertainty estimate. To capture the heteroscedastic case, Kendall & Gal (2017) proposed loss attenuation, in which the model predicts a mean and a variance for every input data point.

Assume that the model output for an input \(x\) is a single prediction \(y\) that is not deterministic but normally distributed with parameters \(\mu(x)\) and \(\sigma^2(x)\). Instead of using the mean squared error, the training process must then take the distributional variance into account. We take the negative log-likelihood of the normal distribution as the loss function, ignoring additive constants:

\[\mathcal{L}(x,y) = -\log \mathcal{N}\left(y; \hat{\mu}(x), \hat{\sigma}^2(x)\right) = \frac{1}{2} \log \hat{\sigma}^2(x) + \frac{(y - \hat{\mu}(x))^2}{2 \hat{\sigma}^2(x)}\]

Intuitively, the numerator of the second term in the negative log-likelihood encourages the mean prediction \(\hat{\mu}(x)\) to be close to the observed data, while the denominator ensures that the variance \(\hat{\sigma}^2(x)\) is large when the deviation \((y - \hat{\mu}(x))^2\) is large. The first term acts as a counterweight that keeps the variance from growing indefinitely. This strategy can capture the aleatoric uncertainty of the data-generating process in areas with sufficient training data. However, training a deep neural network with loss attenuation is often tricky: the loss is numerically unstable and can quickly produce NaN values.
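One common way to reduce the numerical problems, also suggested by Kendall & Gal (2017), is to let the network predict the log-variance \(s = \log \hat{\sigma}^2(x)\) instead of the variance itself, so that the variance \(\exp(s)\) is always positive and no division is needed. The following minimal sketch evaluates the loss for scalar inputs. The function names are illustrative, and a real training loop would apply the same formula to batched tensors.

```python
import math

def gaussian_nll(y, mu, log_var):
    """Per-sample Gaussian negative log-likelihood, additive constants dropped.

    `log_var` is assumed to be the predicted log(sigma^2); exp(-log_var)
    replaces the division by the variance.
    """
    return 0.5 * log_var + 0.5 * (y - mu) ** 2 * math.exp(-log_var)

def optimal_log_var(y, mu):
    """Log-variance minimizing the loss for a fixed nonzero residual.

    Setting the derivative with respect to log_var to zero gives
    sigma^2 = (y - mu)^2.
    """
    return math.log((y - mu) ** 2)

y, mu = 3.0, 1.0  # made-up target and mean prediction, residual 2
best = optimal_log_var(y, mu)
for log_var in (0.0, best, 5.0):
    print(f"log_var={log_var:.3f}  loss={gaussian_nll(y, mu, log_var):.3f}")
```

The loop shows the trade-off described above: a variance that is too small is penalized by the second term, and a variance that is too large is penalized by the first.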

#### Ensemble Averaging

Bayesian model averaging establishes a natural connection between Bayesian inference and ensemble learning. Indeed, the variance of the predictions produced by an ensemble is a good indicator of epistemic uncertainty. The variance is inversely related to the “peakedness” of a posterior distribution \(P(h | \mathcal{D})\). It can be seen as a measure of the discrepancy between the (most probable) candidate hypotheses in the hypothesis space. Based on this idea, Lakshminarayanan et al. (2017) propose a simple ensemble approach as an alternative to Bayesian DNNs, which is easy to implement, readily parallelizable, and requires little hyperparameter tuning.

Instead of training a single network, we train an ensemble of \(M\) networks with different random initializations. While we expect all networks to behave similarly in areas with sufficient training data, their results can differ widely where no data is available. For a final prediction, we combine the results of all networks into a Gaussian mixture distribution, from which we can again extract a single mean and variance estimate:

\[\begin{aligned} \hat{\mu}_c(x) &= \frac{1}{M} \sum_{i=1}^M \hat{\mu}_i(x) \\ \hat{\sigma}_c^2(x) &= \frac{1}{M} \sum_{i=1}^M \hat{\sigma}_i^2(x) + \left[ \frac{1}{M} \sum_{i=1}^M (\hat{\mu}_i^2(x) - \hat{\mu}_c^2(x)) \right] \end{aligned}\]

This variance prediction separates the two types of uncertainty. The first term, the average of all variance estimates, can be interpreted as aleatoric uncertainty. The remaining term can be considered epistemic uncertainty: it is low if all mean estimates agree on a similar value and grows if the mean estimates differ widely.
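The combination rule can be written down directly. The sketch below takes the per-member means and variances for a single input as plain lists and returns the combined mean together with the two variance terms. The function name and the example numbers are made up for illustration.

```python
def combine_ensemble(means, variances):
    """Combine M Gaussian predictions for one input.

    Returns (mean, aleatoric, epistemic); the combined variance is
    aleatoric + epistemic.
    """
    m = len(means)
    mu_c = sum(means) / m
    aleatoric = sum(variances) / m                            # average predicted variance
    epistemic = sum(mu ** 2 for mu in means) / m - mu_c ** 2  # spread of the means
    return mu_c, aleatoric, epistemic

# Members that agree (as expected near training data) ...
agree = combine_ensemble([1.0, 1.1, 0.9], [0.2, 0.25, 0.15])
# ... and members that disagree (as expected far from training data).
disagree = combine_ensemble([-2.0, 1.0, 4.0], [0.2, 0.25, 0.15])
print("agreeing members   :", agree)
print("disagreeing members:", disagree)
```

Both cases have the same aleatoric term because the predicted variances are identical. Only the epistemic term changes with the disagreement between the mean predictions.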

### Conclusion

In summary, uncertainty estimation in deep learning is crucial for ensuring the safety and reliability of AI systems, particularly in scenarios where human lives are at stake, such as autonomous driving. By distinguishing between aleatoric and epistemic uncertainties and employing Bayesian inference techniques, researchers can develop models that not only make accurate predictions but also quantify the level of uncertainty associated with each prediction. Additionally, methods such as maximum likelihood estimation and Fisher information provide valuable tools for estimating model parameters and understanding the underlying data-generating process. Overall, integrating uncertainty estimation techniques into deep learning models is essential for building robust and trustworthy AI systems.

### References

- Senge, R., Bösner, S., Dembczynski, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., & Hüllermeier, E. (2014). Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. *Inf. Sci.*, *255*, 16–29. https://doi.org/10.1016/J.INS.2013.07.030
- Varshney, K. R., & Alemzadeh, H. (2016). On the Safety of Machine Learning: Cyber-Physical Systems, Decision Sciences, and Data Products. *CoRR*, *abs/1610.01256*. http://arxiv.org/abs/1610.01256
- Kendall, A., & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? *CoRR*, *abs/1703.04977*. http://arxiv.org/abs/1703.04977
- Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2020). Sharpness-Aware Minimization for Efficiently Improving Generalization. *CoRR*, *abs/2010.01412*. https://arxiv.org/abs/2010.01412
- Seeger, M. (2004). Gaussian processes for machine learning. *International Journal of Neural Systems*, *14*(02), 69–106.
- Kimin Lee, P. (2018). A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. *Neural Information Processing Systems*, 184–191. https://proceedings.neurips.cc/paper/1987/file/c81e728d9d4c2f636f067f89cc14862c-Paper.pdf
- Gal, Y., & Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. *International Conference on Machine Learning*, 1050–1059.
- Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. *Advances in Neural Information Processing Systems*, *30*.