By: Sukrit Jindal (DSG, IIT Roorkee)
Machine Learning (ML), an ever-evolving field, has found widespread success with the advent of deep neural networks. These architectures, collectively grouped under the term Deep Learning (DL) in reference to the depth of the networks, have performed exceptionally well in fields like Computer Vision (CV), which deals with visual data such as images and videos, and Natural Language Processing (NLP), which deals with textual data. Despite this success, these state-of-the-art methods remain vulnerable to adversarial attacks, which limits their applicability, particularly in safety-critical domains like finance and healthcare. Developing attacks and defences in order to strengthen ML systems, much like white-hat hacking, forms the core of the field of study known as adversarial ML, a subset of the broader domain of ML Security.
Now you might be wondering: what even is an adversarial attack? Well, an adversarial attack, particularly in the context of this blog, refers to the case where a classifier that correctly classifies an input is fooled into an incorrect prediction by adding small, imperceptible perturbations to that input. In mathematical notation, an adversarial attack on a classifier $f\left(\boldsymbol\cdot\right)$ involves adding a small perturbation $\delta$ to a correctly-classified input $x$ (for which $f\left(x\right) = c_{true}$), such that $x+\delta$ is virtually indistinguishable from $x$, yet the classifier labels $x+\delta$ incorrectly (i.e. $f\left(x+\delta\right) \neq c_{true}$). This was demonstrated in 2014-15 by the authors of the paper “Explaining and Harnessing Adversarial Examples”, in which they study this issue, present a method to mount such attacks termed the fast gradient sign method (FGSM), and later use the resulting adversarial examples to retrain the network, a technique known as adversarial training.
A demonstration from the “Explaining and Harnessing Adversarial Examples” paper, which shows that adding a small perturbation to an image of a panda, one that leaves it virtually unchanged to human eyes, causes the GoogLeNet model to misclassify it as a gibbon, with even more confidence (99.3%) than it gave to the original image when classifying it as a panda (57.7%).
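To make the attack concrete, here is a minimal sketch of the fast gradient sign method in PyTorch. The `model`, the $[0,1]$ input range and the `epsilon` budget are illustrative assumptions, not the paper's exact experimental setup:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step fast gradient sign method (FGSM) -- illustrative sketch.

    model   -- any differentiable classifier returning logits
    x       -- input batch, values assumed to lie in [0, 1]
    y       -- true labels for the batch
    epsilon -- l_inf perturbation budget
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step each pixel in the direction that increases the loss the most.
    x_adv = x + epsilon * x.grad.sign()
    # Clamp so the perturbed input remains a valid image.
    return x_adv.clamp(0.0, 1.0).detach()
```

In words, FGSM takes a single step of size `epsilon` along the per-pixel sign of the loss gradient, which (to first order) is the perturbation that increases the loss the most under an $l_\infty$ budget.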
This technique of generating adversarial examples and using them in training kicked off what many would call a cat-and-mouse chase: researchers proposed adversarial defences, those defences were broken by progressively stronger attacks, and new defences came up in response. The adversarial robustness community, i.e. the field of research dedicated to making models more robust to adversarial attacks, eventually recognised the unsustainability of these heuristic approaches and developed techniques for certified robustness. What certified robustness aims to do is to “certify” a region around an input $x$ such that the classifier’s prediction remains verifiably constant within this region. While certified robustness is much more attractive from a robustness standpoint, most certified methods fail to scale to the deep neural networks that perform well on datasets like ImageNet.
Formalising the notion of adversarial attacks, we can now define the following:
<aside> 💡
Adversarial examples are subtle input perturbations that cause the classifier to give incorrect outputs. Consider a classifier $f:\mathbb{R}^D \to C$ that assigns a label to an input $x\in X \subseteq \mathbb{R}^D$. We define an adversarial example $x'$ as one where $D\left(x', x\right) < \epsilon$ for some scalar $\epsilon > 0$ and $f\left(x'\right)\neq c_{true}$, where $c_{true}\in C$ is the true class for $x$. Here, $D\left(\boldsymbol\cdot,\boldsymbol\cdot\right)$ represents a distance metric, often expressed using an $l_p$ norm, such as the $l_0$, $l_1$, $l_2$ and $l_\infty$ norms (a small code sketch of these distances follows the aside).
We then define $B_{D, r} \left(x\right)$ as the region of all points, for some distance metric $D\left(\boldsymbol\cdot,\boldsymbol\cdot\right)$ and perturbation distance $r>0$, such that $x' \in B_{D,r}\left(x\right) \iff D\left(x',x\right) \leq r$ . When $D\left(\boldsymbol\cdot,\boldsymbol\cdot\right)$ is an $l_p$ norm, we simply write it as $B_{p,r}\left(x\right)$. For most practical purposes, certified robustness is defined relative to $l_p$ norms, thus, we use $B_{p,r}\left(x\right)$ from here onwards. From here, an $(l_p, r)$-adversary is defined as a function which generates $x' \in B_{p,r}\left(x\right)$ such that $f\left(x'\right)\neq c_{true}$.
</aside>
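As a quick illustration of these definitions, the NumPy sketch below (the image shape, perturbation and radius are hypothetical) checks whether a perturbed input $x'$ lies inside $B_{p,r}\left(x\right)$ for a few common $l_p$ norms:

```python
import numpy as np

def in_ball(x, x_prime, r, p):
    """Return True if x' lies in B_{p,r}(x), i.e. ||x' - x||_p <= r."""
    delta = (x_prime - x).ravel()
    if p == 0:
        dist = np.count_nonzero(delta)       # number of changed coordinates
    elif p == np.inf:
        dist = np.abs(delta).max()           # largest per-coordinate change
    else:
        dist = np.linalg.norm(delta, ord=p)  # usual l_p norm (p = 1, 2, ...)
    return dist <= r

# Hypothetical example: a tiny l_inf perturbation of a 28x28 image.
x = np.random.rand(28, 28)
x_adv = x + np.random.uniform(-0.05, 0.05, size=x.shape)
print(in_ball(x, x_adv, r=0.05, p=np.inf))   # True: inside the l_inf ball of radius 0.05
```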
Based on this, we distinguish between empirical robustness and certified robustness as follows,
<aside> 💡
In empirical robustness (particularly using adversarial training), the goal is to use an $(l_p, r)$-adversary to generate adversarial examples and then train the model to be robust against this adversary (a rough sketch of such a training loop follows this aside). In certified robustness, the goal is to guarantee that $f\left(x'\right) = c_{true}$ for all $x' \in B_{p,r}\left(x\right)$, with this guarantee holding with probability at least $1-\alpha$ for some small value $\alpha \geq 0$. When $\alpha=0$, the certified defence is said to be exact; otherwise it is inexact and provides only probabilistic guarantees. The value of $r$ certified for each point $x$ is termed the certified bound (or certified radius) at that point. While most empirical robustness techniques rely on heuristics and end up being dependent on the type of $(l_p, r)$-adversary used, certified robustness aims to be agnostic to this.
</aside>
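For reference, here is a rough sketch of one epoch of adversarial training against an $(l_\infty, \epsilon)$-adversary. The model, optimiser, data loader and the choice of FGSM as the adversary are all illustrative assumptions (multi-step attacks like PGD are the more common choice in practice):

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon):
    """One epoch of adversarial training against an (l_inf, epsilon)-adversary (sketch)."""
    model.train()
    for x, y in loader:
        # Generate adversarial examples with a one-step FGSM adversary;
        # any other (l_p, r)-adversary could be substituted here.
        x_req = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_req), y).backward()
        x_adv = (x_req + epsilon * x_req.grad.sign()).clamp(0.0, 1.0).detach()

        # Train on the adversarial examples instead of the clean ones.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```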
Amongst the vast sea of techniques for certified robustness, including both exact and inexact ones, we turn our focus to an inexact technique known as Randomized Smoothing, which has achieved considerable success in the domain of certified robustness and inspired a ton of follow-up research into improving it further.
In 2018-19, two papers appeared, both of which utilised a technique termed randomized smoothing to certify robustness bounds. However, the bounds shown in these papers were very loose, i.e. they were not optimal and underestimated the certifiable radius. In this regard, Cohen et al. published a paper titled “Certified Adversarial Robustness via Randomized Smoothing” in 2019, which re-evaluated the robustness bounds for randomized smoothing and established tight bounds for the same. This led to increasing interest in randomized smoothing and its variants, since it is a model-agnostic approach that achieves significant robust radii compared to other certified techniques, at a fraction of their computational cost.
Now you might be thinking: we’ve been talking about Randomized Smoothing, but what exactly is it and how does it even work? Well, for that we need to look into some of the mathematics behind it, which, for the purposes of this blog, skips much of the detail and focuses on the broad concept instead. If you wish to understand the mathematical concepts underlying Randomized Smoothing in depth, you will find the requisite details in Cohen et al.’s paper.
<aside> 💡
Consider a classification problem where the goal is to construct a classifier from the input domain $\left(\mathbb{R}^d\right)$ to a set of classes $\left(C\right)$. Let us consider a classifier $f:\mathbb{R}^d\to C$ which we have trained for this problem. Let us call it the base classifier and note that it lacks adversarial certification.
Essentially, the goal of Randomized Smoothing is to derive a new, smoothed classifier $g:\mathbb{R}^d\to C$ from the base classifier, which, when queried at any input $x$, returns the class the base classifier is most likely to predict when $x$ is perturbed by isotropic Gaussian noise,
$$ g\left(x\right) = \argmax_{c\in C} \Pr\left(f\left(x+\epsilon\right)=c\right) \text{ where } \epsilon \sim \mathcal N \left(0, \sigma^2 I\right) $$
One can also view the smoothed classifier as the convolution of the base classifier with a Gaussian density, known in mathematical terms as a Weierstrass transform, i.e. $g = f*\mathcal N \left(0, \sigma^2 I\right)$.
</aside>
Now you may wonder: given this definition of Randomized Smoothing, is it not possible to compute this most probable class exactly, i.e. why is Randomized Smoothing not an exact certified defence? The simple answer is that the probability cannot be evaluated exactly, since doing so would require evaluating the base classifier over the infinitely many possible noise perturbations. Thus, Randomized Smoothing makes use of Monte Carlo sampling to establish bounds which hold with a sufficiently high probability. It is also worth noting that the choice of the hyperparameter $\sigma$ has implications for the robustness-accuracy tradeoff: increasing it leads to larger certified radii but diminishes the smoothed classifier's accuracy.
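A simplified sketch of this sampling-based prediction step is shown below, assuming a PyTorch base classifier `base_model` that returns logits (Cohen et al.'s actual PREDICT procedure additionally runs a hypothesis test and may abstain when the top class is not sufficiently dominant):

```python
import torch

def smoothed_predict(base_model, x, sigma, n_samples=1000, batch_size=100):
    """Monte Carlo approximation of the smoothed classifier g at a single input x.

    base_model -- the base classifier f, returning logits
    x          -- a single input of shape (C, H, W)
    sigma      -- standard deviation of the isotropic Gaussian noise
    """
    counts = None
    remaining = n_samples
    with torch.no_grad():
        while remaining > 0:
            n = min(batch_size, remaining)
            remaining -= n
            # Draw n noisy copies x + eps with eps ~ N(0, sigma^2 I).
            noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape, device=x.device)
            logits = base_model(noisy)
            preds = logits.argmax(dim=1)
            batch_counts = torch.bincount(preds, minlength=logits.shape[1])
            counts = batch_counts if counts is None else counts + batch_counts
    # The class predicted most often under noise approximates
    # argmax_c Pr(f(x + eps) = c), i.e. g(x).
    return counts.argmax().item()
```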
<aside> 💡
In brief, the probabilistic guarantees provided by Randomized Smoothing rely on Monte Carlo sampling, together with the Neyman-Pearson lemma, to establish upper and lower bounds on the class probabilities estimated from these samples. The core idea is that, using a confidence level of $1-\alpha$ for some $\alpha \in (0,1)$, we can establish an upper and a lower bound on these estimates. Further, for a fixed $\alpha$, these confidence bounds become tighter, i.e. the gap between the upper and lower bounds decreases, as the number of samples used to generate them increases (a sketch of how these bounds yield a certified radius follows this aside).
</aside>
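To tie these pieces together, here is a rough sketch of how the class counts obtained from sampling are turned into a certified $l_2$ radius, using a one-sided Clopper-Pearson lower confidence bound and the simplified radius $\sigma\,\Phi^{-1}\left(\underline{p_A}\right)$ from Cohen et al., where $\Phi^{-1}$ is the inverse standard Gaussian CDF and $\underline{p_A}$ is the lower confidence bound on the probability of the top class. This is not their exact CERTIFY procedure (which also uses a separate selection set of samples), and the counts, $\sigma$ and $\alpha$ in the usage line are hypothetical:

```python
import numpy as np
from scipy.stats import beta, norm

def certify_radius(class_counts, sigma, alpha=0.001):
    """Turn Monte Carlo class counts into a certified l2 radius (sketch).

    class_counts -- per-class counts of f(x + eps) over the noise samples
    sigma        -- noise level used for smoothing
    alpha        -- failure probability of the certificate (confidence 1 - alpha)
    Returns (predicted_class, radius), or (None, 0.0) to abstain.
    """
    counts = np.asarray(class_counts)
    n = counts.sum()
    c_hat = int(counts.argmax())                  # most frequent class under noise
    k = int(counts[c_hat])
    # One-sided Clopper-Pearson lower confidence bound on p_A = Pr(f(x+eps) = c_hat).
    p_a_lower = beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0
    if p_a_lower <= 0.5:
        return None, 0.0                          # top class not dominant enough: abstain
    radius = sigma * norm.ppf(p_a_lower)          # certified l2 radius
    return c_hat, float(radius)

# Hypothetical example: 990 of 1000 noisy predictions fall on class 3 with sigma = 0.25.
print(certify_radius([2, 3, 5, 990], sigma=0.25))
```

As the aside above notes, drawing more samples tightens the lower bound $\underline{p_A}$, which in turn yields a larger certified radius, at the cost of more forward passes through the base classifier.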