Amitabh Yadav

Tue, Dec 2 2025
2:06 PM · Daily Chores

A streaming configuration schema for ASIC interfacing with Microcontrollers

Today I have to work on implementing a sleek SPI subordinate module for one of our ASICs. Neural-interface ASICs work great when tested in the lab, where high-speed FPGAs handle programming, configurability, and high-bandwidth data recording, but real-life ASICs need a more adaptable interface such as SPI. SPI may not have the bandwidth to record everything at once, but it is fast enough to stream multiple parameters of interest. I am aiming to test this interface with an ARM microcontroller from STMicroelectronics talking to the SPI-S RTL implemented on the FPGA. The main challenge points will be figuring out the delays between the chip-select (CS) and SCLK lines caused by processing on the MCU (the final MCU may differ from the one I am testing with), and I want to push the clock rate as high as 20.48 MHz. At that rate I can safely log data on the MCU end.

[Figure: SPI concept diagram (single SPI), cropped]
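
As a quick sanity check of the "safe to log" claim, a rough throughput estimate in plain Python; the word width, header overhead, and inter-frame gap below are placeholder assumptions for illustration, not the actual register map or framing:

```python
# Rough throughput estimate for the planned SPI-subordinate link.
# All framing numbers below are assumptions: the real word width,
# command overhead and CS gap are not defined yet.

SCLK_HZ = 20_480_000      # target SPI clock (20.48 MHz)
PAYLOAD_BITS = 32         # assumed payload word width
HEADER_BITS = 8           # assumed command/address bits per word
CS_GAP_S = 2e-6           # assumed CS de-assert gap the MCU needs between frames

bits_per_frame = HEADER_BITS + PAYLOAD_BITS
frame_time_s = bits_per_frame / SCLK_HZ + CS_GAP_S
words_per_second = 1.0 / frame_time_s
payload_mbps = words_per_second * PAYLOAD_BITS / 1e6

print(f"frame time      : {frame_time_s * 1e6:.2f} us")
print(f"words per second: {words_per_second:,.0f}")
print(f"payload rate    : {payload_mbps:.2f} Mbit/s")
```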

Thu, Nov 27 2025
10:29 AM · Machine Learning

Hardware-specific tidbit for Binarized Neural Networks (BNN)

Binarized Neural Networks (BNNs) are neural networks whose weights and activations are binary ($+1$ or $-1$) at runtime. During training, these binarized weights and activations are used to compute the parameter gradients, and in the forward pass the same are used to generate the inference. The power of BNNs comes from the fact that during the forward pass the complex arithmetic is replaced by bit-wise operations, which substantially improves energy efficiency, along with the added benefit of reduced memory size and fewer memory accesses. The BNN achieved nearly state-of-the-art results on the MNIST, CIFAR-10 and SVHN datasets.
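
The "bit-wise operations" point is worth making concrete: if the $\pm 1$ values are packed as bits (1 for $+1$, 0 for $-1$), a binary dot product reduces to an XNOR followed by a popcount. A minimal NumPy sketch of that identity (my own illustration, not code from the BNN paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Random +/-1 vectors and their packed-bit representation (1 -> +1, 0 -> -1).
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)
a_bits = np.packbits(a > 0)
b_bits = np.packbits(b > 0)

# XNOR-popcount dot product: matching bits contribute +1, differing bits -1.
differing = np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum()
matching = n - int(differing)
dot_xnor = 2 * matching - n

assert dot_xnor == int(a @ b)   # agrees with the ordinary dot product
print(dot_xnor)
```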

Converting Neural Network weights to binary (Binarization function)

Deterministic Binarization:

x^b = \operatorname{Sign}(x) = \begin{cases} -1, & x < 0, \\ +1, & x \geq 0. \end{cases}

where $x^b$ is the binarized value (weight or activation) and $x$ is the real-valued variable.

Stochastic Binarization Function:

x^b = \begin{cases} +1, & \text{with probability } p = \sigma(x), \\ -1, & \text{with probability } 1 - p. \end{cases}

where $\sigma$ is the hard sigmoid function:

\sigma(x) = \operatorname{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right)

Stochastic binarization is harder to implement in hardware because it requires generating random bits when quantizing. Therefore deterministic binarization is mostly used (with some exceptions at train time, depending on the dataset).
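
Both binarization functions are essentially one-liners; a small NumPy sketch of the two (my own illustration), which also shows why the stochastic version is costlier in hardware: it needs one random number per quantized value.

```python
import numpy as np

def binarize_det(x):
    """Deterministic binarization: Sign(x), with Sign(0) mapped to +1."""
    return np.where(x >= 0, 1.0, -1.0)

def hard_sigmoid(x):
    """sigma(x) = clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stoch(x, rng):
    """Stochastic binarization: +1 with probability sigma(x), else -1."""
    return np.where(rng.random(x.shape) < hard_sigmoid(x), 1.0, -1.0)

x = np.array([-1.7, -0.3, 0.0, 0.4, 2.1])
rng = np.random.default_rng(0)
print(binarize_det(x))         # [-1. -1.  1.  1.  1.]
print(binarize_stoch(x, rng))  # random in the middle; -1.7 always -> -1, 2.1 always -> +1
```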

Gradient Computation and Accumulation

Even though binary weights and activations are used to compute the parameter gradients during training, the gradients themselves are stored and accumulated in high-precision floating point. Without these full/half-precision floating-point (FP) gradients, stochastic gradient descent (SGD) would not work at all.

Binarization also introduces some noise into the weights and activations when computing the parameter gradients, which acts as a regularization technique and helps the network generalise better - like a variation of dropout where, instead of randomly setting half of the activations to zero, the activations and weights are binarized. Also see: Variational Weight Noise, DropOut, DropConnect

Propagating Gradients through Discretization

The derivative of the $\operatorname{Sign}$ function is zero almost everywhere, which makes it incompatible with backpropagation. So a variant of the "straight-through estimator" is used, one that takes the saturation effect into account and uses deterministic rather than stochastic sampling of the bit. See Estimating or Propagating Gradients Through Stochastic Neurons

Given the $\operatorname{Sign}$ function activation:

q = \operatorname{Sign}(r)

and assuming that an estimator $g_q$ of the gradient $\frac{\partial C}{\partial q}$ has been obtained (with the straight-through estimator when needed), the straight-through estimator of $\frac{\partial C}{\partial r}$ is simply:

g_r = g_q \, 1_{|r| \leq 1}

This preserves the gradient information and cancels the gradient when $|r|$ is too large - which, if not cancelled, worsens performance. The derivative $1_{|r| \leq 1}$ can also be seen as propagating the gradient through a hard tanh, which can be written as the following piece-wise linear activation function:

\operatorname{Htanh}(x) = \operatorname{Clip}(x, -1, 1) = \max(-1, \min(1, x)).
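
A minimal NumPy sketch of this forward/backward pair (my own illustration): the forward pass applies $\operatorname{Sign}$, and the backward pass lets the upstream gradient through wherever $|r| \leq 1$ and cancels it elsewhere, i.e. it uses the derivative of $\operatorname{Htanh}$.

```python
import numpy as np

def sign_forward(r):
    # q = Sign(r); Sign(0) taken as +1, as in the deterministic binarization.
    return np.where(r >= 0, 1.0, -1.0)

def sign_backward_ste(r, g_q):
    # Straight-through estimator: g_r = g_q * 1_{|r| <= 1}
    # (i.e. the derivative of Htanh(r) = Clip(r, -1, 1)).
    return g_q * (np.abs(r) <= 1.0)

r = np.array([-2.5, -0.8, 0.1, 1.7])
g_q = np.ones_like(r)             # pretend upstream gradient dC/dq = 1
print(sign_forward(r))            # [-1. -1.  1.  1.]
print(sign_backward_ste(r, g_q))  # [ 0.  1.  1.  0.] -- cancelled where |r| > 1
```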

For the hidden units, the $\operatorname{Sign}$ function non-linearity is used to obtain binary activations, and for the weights the following is done:

  1. Constrain each real-valued weight to lie between $-1$ and $+1$, by projecting $w^r$ to $-1$ or $+1$ whenever a weight update brings $w^r$ outside of $[-1, +1]$, i.e. clipping the weights during training. Otherwise the real-valued weights would grow very large without any further impact on the binary weights.
  2. When using a weight $w^r$, quantize it as $w^b = \operatorname{Sign}(w^r)$ (both rules are sketched in the code after this list).
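
Putting the two weight rules together, a single update of one layer's weights looks roughly like the sketch below (plain SGD is assumed here for simplicity; the paper itself uses shift-based AdaMax, see Algorithm 4):

```python
import numpy as np

def binarize(w):
    # w^b = Sign(w^r), with Sign(0) taken as +1
    return np.where(w >= 0, 1.0, -1.0)

def sgd_step(w_real, grad, lr=0.01):
    """One update of the real-valued weights, kept inside [-1, +1]."""
    w_real = w_real - lr * grad          # accumulate the high-precision gradient
    return np.clip(w_real, -1.0, 1.0)    # rule 1: project back into [-1, +1]

rng = np.random.default_rng(0)
w_real = rng.uniform(-1.0, 1.0, size=(4, 3))   # latent real-valued weights
x = rng.standard_normal((1, 4))

w_bin = binarize(w_real)     # rule 2: the forward pass only sees the binary copy
y = x @ w_bin

grad = rng.standard_normal(w_real.shape)       # stand-in for the backpropagated gradient
w_real = sgd_step(w_real, grad)
```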

The complete training procedure is illustrated in Algorithms 1 and 2 of the BNN paper.

In addition to the above steps, Shift-based Batch Normalization and a Shift-based AdaMax algorithm are used in place of their vanilla variants in order to reduce the number of multiplications. See Algorithms 3 and 4, respectively, in the BNN paper.
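
The common trick in both shift-based variants is to replace a multiplication by a power-of-two scaling, which can be realised as an arithmetic shift. A hedged NumPy sketch of that approximate-power-of-two (AP2) idea as I read it from the paper; the exact way it is embedded in BN and AdaMax follows Algorithms 3 and 4:

```python
import numpy as np

def ap2(x):
    """Approximate power-of-two: sign(x) * 2**round(log2|x|).
    Scaling by ap2(x) can then be implemented as an arithmetic shift."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

scale = 3.1
vals = np.array([0.3, 0.7, 1.6, -5.0])
print(ap2(vals))           # [ 0.25  0.5   2.   -4.  ]
print(vals * scale)        # exact multiplication
print(vals * ap2(scale))   # shift-friendly approximation (x4 here, i.e. << 2)
```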

In the BNN, since all weights and activations are binary, every layer's input is also binary, with the exception of the first layer. The first layer's input is instead handled as 8-bit fixed point:

s = x \cdot w^b
s = \sum_{n=1}^{8} 2^{n-1} (x^n \cdot w^b)

where $x$ is a vector of 1024 8-bit inputs, $x_1^8$ is the most significant bit of the first input, $w^b$ is a vector of 1024 1-bit weights, and $s$ is the resulting weighted sum. See Algorithm 5 in the BNN paper.
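
The bit-plane decomposition in the last equation is easy to verify numerically: split each 8-bit input into its bits, compute one binary dot product per bit plane, and recombine with powers of two. A NumPy sketch (my own check, with the vector length shrunk from 1024 for readability and the inputs treated as unsigned):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs = 16                                  # paper uses 1024; shrunk for readability

x = rng.integers(0, 256, size=n_inputs)        # 8-bit fixed-point inputs (unsigned here)
w_b = rng.choice([-1, 1], size=n_inputs)       # 1-bit (+/-1) weights

# Direct weighted sum.
s_direct = int(x @ w_b)

# Bit-plane decomposition: x^n is the vector of n-th bits (n = 1 is the LSB).
s_planes = 0
for n in range(1, 9):
    x_n = (x >> (n - 1)) & 1                   # n-th bit of every input
    s_planes += 2 ** (n - 1) * int(x_n @ w_b)  # one binary dot product per bit plane

assert s_planes == s_direct
print(s_direct)
```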

This work's main contribution is that it successfully binarizes the weights and activations in both the inference phase and the training phase of a deep neural network. A good discussion of previously implemented binary neural networks is presented in Section 5 of the BNN paper.