Binarized Neural Networks (BNNs) are neural networks whose weights and activations are binary (+1 or −1) at runtime. During training, these binarized weights and activations are used to compute the parameter gradients, and in the forward pass they are used for inference. The power of BNNs comes from the fact that during the forward pass the expensive arithmetic is replaced by bit-wise operations, which substantially improves energy efficiency, with the added benefit of reduced memory size and fewer memory accesses. BNNs achieved nearly state-of-the-art results on the MNIST, CIFAR-10 and SVHN datasets.
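To make the bit-wise replacement concrete, here is a minimal NumPy sketch (my illustration, not the paper's GPU kernel) of a dot product between two ±1 vectors computed with XNOR and popcount on their bit-packed representations:

```python
import numpy as np

def binary_dot(a_pm1, b_pm1):
    """Dot product of two {-1, +1} vectors via XNOR + popcount.

    Encoding +1 as bit 1 and -1 as bit 0, a_i * b_i == +1 exactly when the
    two bits agree, i.e. when their XNOR is 1.  With n positions and p
    agreements, the dot product equals p - (n - p) = 2p - n.
    """
    n = len(a_pm1)
    a_bits = np.packbits(a_pm1 > 0)
    b_bits = np.packbits(b_pm1 > 0)
    # XNOR = NOT(XOR); keep only the first n bits to ignore packing padding.
    xnor = np.unpackbits(~(a_bits ^ b_bits))[:n]
    p = int(xnor.sum())
    return 2 * p - n

a = np.random.choice([-1, 1], size=1024)
b = np.random.choice([-1, 1], size=1024)
assert binary_dot(a, b) == int(a @ b)
```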
Converting Neural Network weights to binary (Binarization function)
Deterministic Binarization:
$$x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{if } x < 0, \end{cases}$$
where $x^b$ is the binarized variable (weight or activation) and $x$ is the real-valued variable.
Stochastic Binarization Function:
$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases}$$
where $\sigma$ is the hard sigmoid function:
$$\sigma(x) = \mathrm{clip}\!\left(\frac{x+1}{2}, 0, 1\right) = \max\!\left(0, \min\!\left(1, \frac{x+1}{2}\right)\right)$$
Stochastic binarization is harder to implement in hardware, as it requires generating random bits when quantizing. Therefore, deterministic binarization is mostly used (with some exceptions at train time, depending on the dataset).
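As a minimal sketch in plain NumPy (an illustration, not the paper's code), the two binarization functions can be written as:

```python
import numpy as np

def hard_sigmoid(x):
    # sigma(x) = clip((x + 1) / 2, 0, 1)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_deterministic(x):
    # x^b = +1 if x >= 0, -1 otherwise
    return np.where(x >= 0, 1.0, -1.0)

def binarize_stochastic(x, rng=None):
    # x^b = +1 with probability p = sigma(x), -1 with probability 1 - p
    rng = np.random.default_rng() if rng is None else rng
    p = hard_sigmoid(np.asarray(x, dtype=float))
    return np.where(rng.random(p.shape) < p, 1.0, -1.0)
```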
Gradient Computation and Accumulation
Even though binary weights and activations are used during training, the gradients are still computed and accumulated in high-precision (full or half-precision floating point) real-valued variables. Without this precision, stochastic gradient descent (SGD) would not work at all: SGD explores the parameter space with small, noisy steps, and that noise is only averaged out by accumulating many updates in high-precision accumulators.
Binarization also introduces noise into the weights and activations when computing the parameter gradients, which acts as a regularizer and helps generalization. It can be seen as a variant of Dropout: instead of randomly setting half of the activations to zero, the activations and weights are binarized. Also see: Variational Weight Noise, Dropout, DropConnect.
Propagating Gradients through Discretization
The derivative of the Sign function is zero almost everywhere, which makes it incompatible with backpropagation. So a variant of the "straight-through estimator" is used that takes the saturation effect into account and uses deterministic rather than stochastic sampling of the bit. See Estimating or Propagating Gradients Through Stochastic Neurons.
Given the Sign function activation:
$$q = \mathrm{Sign}(r)$$
and assuming that an estimator $g_q$ of the gradient $\partial C / \partial q$ has been obtained (with the straight-through estimator when needed), the straight-through estimator of $\partial C / \partial r$ is simply:
$$g_r = g_q \, 1_{|r| \le 1}$$
This preserves the gradient information and cancels the gradient when $r$ is too large; not cancelling it worsens performance. The derivative $1_{|r| \le 1}$ can also be seen as propagating the gradient through hard tanh, which is the following piece-wise linear activation function:
$$\mathrm{Htanh}(x) = \mathrm{Clip}(x, -1, 1) = \max(-1, \min(1, x)).$$
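Put together, the binarized activation and its straight-through backward pass can be sketched in plain NumPy as follows (an illustration of the estimator above, not the paper's implementation):

```python
import numpy as np

def sign_forward(r):
    # q = Sign(r), with Sign(0) mapped to +1 as in the deterministic binarization
    return np.where(r >= 0, 1.0, -1.0)

def sign_backward(r, g_q):
    # Straight-through estimator with saturation: g_r = g_q * 1_{|r| <= 1}.
    # The incoming gradient g_q passes through unchanged where |r| <= 1 and is
    # cancelled where the pre-activation has saturated (|r| > 1), which is
    # exactly the derivative of Htanh(r) = Clip(r, -1, 1).
    return g_q * (np.abs(r) <= 1.0)
```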
For the hidden units, the Sign function non-linearity is used to obtain binary activations, and for weights the following is done:
- Constrain each real-valued weight to lie between −1 and +1, by projecting $w^r$ to −1 or +1 whenever a weight update brings $w^r$ outside of $[-1, +1]$, i.e. clipping the weights during training. This keeps the real-valued weights from growing too large without any effect on the binary weights.
- When using a weight $w^r$, quantize it as $w^b = \mathrm{Sign}(w^r)$, as sketched below.
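The following NumPy sketch illustrates these two rules for one simplified SGD step; the gradient `grad_wb` and the learning rate are placeholders, and the paper's Shift-based AdaMax and batch normalization are omitted:

```python
import numpy as np

def binarize_weights(w_real):
    # Whenever the weights are used (forward or backward pass), quantize the
    # real-valued accumulator: w_b = Sign(w_real).
    return np.where(w_real >= 0, 1.0, -1.0)

def weight_update(w_real, grad_wb, lr=0.01):
    """One simplified weight update for a binarized layer.

    w_real  : real-valued weight accumulator, kept in full precision
    grad_wb : gradient of the cost w.r.t. the binary weights, obtained by
              backpropagation with the straight-through estimator
    """
    # Update the real-valued weights with the gradient computed on the binary
    # weights, then clip them to [-1, +1] so they cannot drift far away from
    # the values the binarization can represent.
    return np.clip(w_real - lr * grad_wb, -1.0, 1.0)
```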
The complete training procedure is illustrated in Algorithms 1 and 2 of the BNN paper.
In addition to the steps above, Shift-based Batch Normalization and Shift-based AdaMax are used instead of their vanilla variants to reduce the number of multiplications. See Algorithms 3 and 4, respectively, in the BNN paper.
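The common idea behind both shift-based variants is to replace a multiplication by the nearest power of two, which in hardware becomes a bit shift. A rough NumPy illustration of that approximation (a sketch of the idea only, not the full Algorithms 3 and 4):

```python
import numpy as np

def ap2(x):
    # Approximate power-of-two: sign(x) * 2**round(log2|x|).  Multiplying by
    # ap2(x) then amounts to a left or right bit shift in fixed-point hardware.
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

# Example: a normalization scale of 0.3 is approximated by 0.25,
# i.e. a right shift by 2 instead of a real multiplication.
print(ap2(0.3))  # 0.25
```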
In the BNN, since all weights and activations are binary, the inputs of all layers are also binary, with the exception of the first layer. The first layer's inputs are instead handled as 8-bit fixed-point values:
$$s = x \cdot w^b = \sum_{n=1}^{8} 2^{n-1} (x^n \cdot w^b)$$
where $x$ is a vector of 1024 8-bit inputs, $x^8_1$ is the most significant bit of the first input, $w^b$ is a vector of 1024 1-bit weights, and $s$ is the resulting weighted sum. See Algorithm 5 in the BNN paper.
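A small NumPy sketch of that bit-plane decomposition, assuming unsigned 8-bit inputs (an assumption made here for simplicity; Algorithm 5 in the paper covers the general fixed-point case):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=1024).astype(np.uint8)    # 8-bit inputs
w_b = rng.choice([-1, 1], size=1024).astype(np.int32)   # 1-bit (+/-1) weights

# s = sum_{n=1}^{8} 2^(n-1) * (x^n . w_b), where x^n is the n-th bit plane.
s = 0
for n in range(1, 9):
    x_n = (x >> (n - 1)) & 1                             # n-th least significant bits
    s += (2 ** (n - 1)) * int(x_n.astype(np.int32) @ w_b)

# The bit-plane sum matches the direct 8-bit weighted sum.
assert s == int(x.astype(np.int32) @ w_b)
```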
The main contribution of this work is that it successfully binarizes the weights and activations in both the inference phase and the training phase of a deep neural network. A good discussion of previously implemented binary neural networks is presented in Section 5 of the BNN paper.