Converting a Network for the Fastest Inference Chip on the Market

Published on September 19, 2022

Recogni Core-Tech Team


*Conversion: Making a full-precision network compatible with our chip

Introduction

Deep Neural Networks (DNNs), in particular convolutional ones, are the go-to approach for visual perception tasks that enable autonomous driving. To achieve quick iteration cycles during the initial development phase, one typically compromises on the computational efficiency of the network being developed. Networks are over-parameterized, rely on unnecessarily precise computations, and contain components such as batch normalization that improve trainability but are not required for deployment.

By removing these inefficiencies through a mathematical conversion process, we at Recogni achieve undegraded task performance on our specialized accelerator while unlocking substantial power, cost, and latency savings.

This blog article describes how conversion* is done at Recogni and how we achieve improvements of over 10x compared to our competition. Here and in the following articles of this series, we will dive deeper into how we measure these improvements and how our conversion solution differentiates us from the competition. The first section of this article presents some examples that visualize the output of the conversion process.

Examples of converted networks

We have already successfully converted numerous architectures trained on tasks ranging from simple Image Classification to more sophisticated ones such as Semantic Segmentation and 2D and 3D Object Detection.

Semantic Segmentation

For example, one of our Semantic Segmentation networks trained on the A2D2 dataset achieves 83.78% mean IoU (Intersection over Union); for comparison, the original network achieved 86.25% mean IoU. For this metric, the IoU is first calculated per class and then averaged across all classes, as sketched in the snippet below. Since a single scalar cannot always capture all the details required to judge a network’s performance, we also present a video clip that compares the predictions of the original float network (top) with those of the converted SemSeg network (bottom).
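As a minimal sketch of the metric (illustrative NumPy code, not our actual evaluation pipeline), mean IoU over integer class-label maps can be computed as follows:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Per-class IoU averaged over all classes present in either map."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:                 # class absent everywhere: skip it
            continue
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))
```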


Looking at the differences in more detail, the following image compares the predictions of the original network (left) with the predictions of the converted network (right) for a single sample of the dataset. It can be seen that the converted network reproduces the original predictions almost perfectly.

Subtracting the two predictions reveals only tiny differences at the borders between classes. Black pixels indicate that the difference is minuscule; colorful pixels indicate a discrepancy, and the brighter the pixel, the bigger the difference.

3D Object Detection

Furthermore, we decided to address the problem of 3D Object Detection (3DOD) from stereo-image inputs. On the one hand, this task allows us to demonstrate the real-world usability of our system. On the other hand, 3DOD is a complex task that poses challenges for training, conversion, and execution on our inference SoC.

To learn more about our approach to 3DOD, check out our previous blog post: Recogni’s Stereo 3D Object Detection Pipeline. The network has a traditional DNN backbone, multiple detection heads, specialized instructions to estimate depth from the left-right image pairs and, finally, operations to decode the network’s predictions into actual 3D boxes in space. Compare this to, for example, image classification tasks, which are commonly solved using a simple feed-forward network.

Our 3D bounding box predictions are evaluated against ground-truth values based on their position, dimensions, rotation, and classification. It is essential to estimate all these parameters precisely, and this estimation often requires specialized strategies when running with low-precision math. For example, when converting image disparity to object depth values, we leverage a hybrid approach between regression and classification, as explained in our previous blog post. Solving the challenge of converting a 3DOD network and deploying it on our chip thus allowed us to fine-tune our approach to various aspects of the conversion problem. We show that our conversion strategies can handle complex problems and are well suited to the diverse aspects of the full 3DOD prediction pipeline.

The following video showcases our latest converted 3DOD model. It compares the predictions of the converted network (bottom half of the video) with the predictions of the float network (top half of the video). Next to the camera stream, you can see the bird’s eye view of the scene for comparison.


Reducing the overparameterization of the network by compressing its weights is one of the pillars of deployment on our SoC. For example, the weights of the 3DOD network presented above are compressed from 240 MB to ≈ 8.2 MB. But how do we achieve a 29x compression rate while still maintaining accuracy similar to the floating-point model? The following sections, in particular the one about Cluster Compression, will demonstrate how we at Recogni strike this balance between compression rate and high accuracy.

Conversion

Overview

As already mentioned, most state-of-the-art network architectures contain inefficiencies that can be optimized away, including over-parameterization, overly precise computation, and calculations that are unnecessary for deployment. Our chip is designed from the ground up to exploit all of these inefficiencies. Through hardware-algorithm co-design, we are able to find the best tradeoff between accuracy, compute, memory, and power consumption. To achieve this tradeoff, we combine our custom Logarithmic Number System (LNS) with neural network Cluster Compression (CC) and standard network optimizations such as Batch Norm Folding (BNF). The following chart illustrates our conversion process, and the subsequent sections explain each conversion step in more detail.

Logarithmic Number System (LNS)

Most operations used in modern DNNs, including convolutions, can be decomposed into mathematical primitives such as additions and multiplications. Multiplications are known to be expensive in hardware, particularly for high-precision number formats, because the required chip area grows roughly quadratically with the bit width of the inputs.

Our Logarithmic Number System (LNS) is an optimized FP8 variant with approximate math specifically tailored to NNs. It enables us to replace costly multiplications in the linear domain with additions in the logarithmic domain. The following example illustrates the fundamental idea of LNS.
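As an illustration of the general principle (a sketch, not the exact on-chip FP8 format), a multiplication in the linear domain becomes an addition of exponents in the logarithmic domain:

$$ a \cdot b = 2^{\log_2 a} \cdot 2^{\log_2 b} = 2^{\log_2 a + \log_2 b}, \qquad \text{e.g.} \quad 8 \cdot 32 = 2^{3+5} = 2^8 = 256 $$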

The same principle can be applied to convolutions. Take the following 2D convolution as an example; the formula is simplified to a single pixel of a single output channel.
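A sketch of this simplified formula, reconstructed from the description below (x denotes the input activations, w the weights, i and j index the kernel window, and ch_in the input channels; the notation of the original figure may differ):

$$ y = \sum_{ch_{in}} \sum_{i} \sum_{j} x_{ch_{in},i,j} \cdot w_{ch_{in},i,j} = \sum_{ch_{in}} \sum_{i} \sum_{j} 2^{\log_2 x_{ch_{in},i,j} \, + \, \log_2 w_{ch_{in},i,j}} $$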

Note that because the weights w are stored in the logarithmic domain, the respective log2(w) does not need to be computed at runtime. For the summations over the kernel window spanned by i and j and across the channel dimension ch_in, it is advantageous to go back to the linear domain. Because the conversion between the linear and logarithmic domains is not trivial, we use highly optimized proprietary logic on our chip.
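The following minimal NumPy simulation illustrates this flow for a single dot product (purely illustrative; the on-chip number format, rounding behavior, and domain-conversion logic are proprietary and differ from float64 math, and real LNS formats also carry a sign bit, so positive values are assumed here):

```python
import numpy as np

x = np.array([1.5, 2.0, 4.0])       # activations (linear domain)
log_w = np.array([1.0, 2.0, -1.0])  # weights stored as log2(w), precomputed offline

# Multiply by adding in the log domain, then accumulate in the linear domain.
products = 2.0 ** (np.log2(x) + log_w)
y = products.sum()

assert np.isclose(y, (x * 2.0 ** log_w).sum())  # matches the linear-domain result
```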

Cluster Compression (CC)

Inference of state-of-the-art DNN architectures requires not only huge amounts of compute but also memory. Specialized NN accelerators must not only provide sufficient compute resources but also be able to feed them at high bandwidths. Therefore, in many systems, memory rather than compute capability becomes the main bottleneck for speed and energy consumption. In particular, accessing data in off-chip memory such as DRAM is extremely costly from a power consumption point of view.

Our logarithmic number format already addresses this problem by shrinking the size of every single parameter and activation used during inference. But because DNNs are generally overparameterized, there is a lot of room for improvement beyond the fundamental data type. It is well known that the number of parameters of a network can be reduced heavily with only a minor loss in model accuracy, for example by pruning channels or applying compression to the weights.

To exploit this fact, our chip uses a proprietary compression scheme that allows us to store compressed weights and hence reduce the memory requirements of DNNs even further. Our compression scheme builds on top of cluster compression as introduced in [1] and is optimized to integrate well into our chip design. Cluster compression uses the k-means algorithm to find a compressed representation of each weight tensor, as illustrated for two-dimensional weights in the following graphic.

After clustering, the weights are converted into separate pointer and lookup tables (LUTs). Every centroid is reused by multiple input and output channels. After the initial clustering, the centroids can be further adjusted by standard backpropagation. With this additional layer of compression, we are able to fit all of a network’s weights into on-chip memory, resulting in significantly less data movement and vastly reduced energy consumption.
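As a minimal sketch of the underlying idea, the following snippet clusters individual scalar weights with scikit-learn’s k-means and stores them as an 8-bit index table plus a small LUT (note that [1] clusters whole convolutional kernels, and our on-chip scheme is proprietary and differs in detail):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_compress(weights: np.ndarray, n_clusters: int = 256):
    """Replace each weight with an index into a small lookup table (LUT)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(weights.reshape(-1, 1))
    lut = km.cluster_centers_.ravel()       # n_clusters centroid values
    idx = km.labels_.astype(np.uint8)       # one 8-bit pointer per weight
    return lut, idx.reshape(weights.shape)

def decompress(lut: np.ndarray, idx: np.ndarray) -> np.ndarray:
    return lut[idx]                         # reconstruct the weight tensor
```

With 256 clusters, each 32-bit float weight shrinks to an 8-bit pointer, roughly a 4x saving before the LUT overhead, and after the initial clustering only the LUT entries (the centroids) need to be fine-tuned by backpropagation.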

Batch Norm folding (BNF)

DNNs often contain operations, such as batch normalization layers, that are not required for deployment. Implementing these in hardware is expensive when using floating-point precision. Fortunately, we can get rid of batch normalization layers during the conversion process by folding their scale and shift parameters into the preceding layer’s weights. This procedure is called Batch Norm Folding (BNF) and has become a standard step when deploying neural networks.
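As a minimal sketch, assuming a convolution with weights w and bias b followed by batch normalization with the usual parameters (gamma, beta, running mean, running variance), folding reduces to scaling each output filter and adjusting the bias:

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding conv layer.

    w: conv weights of shape (out_ch, in_ch, kh, kw); b: conv bias of shape (out_ch,).
    BN computes gamma * (x - mean) / sqrt(var + eps) + beta per output channel,
    so BN(conv(x)) == conv'(x) with the folded parameters returned here.
    """
    scale = gamma / np.sqrt(var + eps)           # per-output-channel factor
    w_folded = w * scale[:, None, None, None]    # scale each output filter
    b_folded = (b - mean) * scale + beta         # absorb shift into the bias
    return w_folded, b_folded
```

After folding, the batch normalization layer can simply be dropped from the deployed graph.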

Conclusion

As you have seen in the video clips above, our SoC performs demanding tasks such as running 3D Object Detection on 8-megapixel images at high frame rates with low latency. With this high compute density at such low power consumption, we enable the next step in perception technology for autonomous driving.

We hope this article shed some light on our approach at Recogni to converting such complex networks for deployment on our hardware. If any aspect is unclear, leave us a question in the comments below.

If you are as excited about this as we are, make sure to follow our blog series and feel free to contact us to learn more. Also consider applying if you want to shape this industry together with us!

References

[1] S. Son, S. Nah, and K. M. Lee, “Clustering Convolutional Kernels to Compress Deep Neural Networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 216–232.
