Recogni Perception Team
At Recogni, we develop an industry-leading chip for accelerating deep neural network inference for highly automated driving. This chip runs neural network inference extremely efficiently: you may have seen references to 1000 TOPS at 25 Watts in our previous write-ups. That’s not a pipe dream; the chip exists and is customer-deployed. The inputs to the system are pairs of stereo camera images. Cameras see farther and are more cost-effective to integrate into production vehicles than other sensor types such as LIDAR or RADAR. That’s why it’s natural to combine them with our cost-efficient inference chip for a high-performance, low-wattage perception solution.
The system would not be complete without a perception stack to actually run on it. This article presents an overview of how we build and deploy our stereo 3D Object Detection (3DOD) model and continue to improve it over time.
We concentrate on 3DOD because, even though it is a very difficult real-world problem, it’s one where our large compute capacity and the ability to process high-resolution imagery very efficiently offer huge benefits. Additionally, it poses challenges beyond pure neural network execution, such as decoding network outputs into 3D boxes in space or running Non-Maximum Suppression (NMS).
We develop perception architectures for 3DOD and other applications both to deploy with customers and to gather knowledge not readily available – especially how to convert models for this new hardware.
We evaluated several approaches for stereo 3D Object Detection and settled on YoloStereo3D [1] as a reference architecture. Our own Stereo 3D Object Detection network is shown in Fig. 1. The following list summarizes the network specifications:
- 58.9M float parameters, compressed to only ~8MiB on chip (≈29x compression from float32)
- 144 FPS @ 1920x1080x4 (theoretical max value computed by in-house compiler)
- < 7 Watts @ 24 FPS
The inputs to our model are left and right image pairs from a stereo camera. Neither rectification nor undistortion is applied (more on why later). The network uses a shared ResNet-34 backbone to extract features from the images. These feature maps are passed into the “Multi-Scale Fusion” module, which combines left and right feature maps into multiple Recogni-specific cost volumes (for more information about cost volumes, refer to [2]). Think of a cost volume as overlaying the left and right camera images, then shifting the right image to the left column by column. The network is trained to identify similarities between image regions from the left and the right image. The number of shifts translates to disparity, which is used to decode depth information. These cost volumes are calculated at different levels of resolution, which we call scales, and extract features from multiple feature map scales. The outputs are combined to regress values for 3D bounding box reconstruction and object classification. We estimate surface depth as an auxiliary task to guide the cost volumes towards truly learning stereo matching.
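The shifting intuition behind a cost volume can be sketched in a few lines. This is a minimal pure-Python illustration on a single row of scalar features, assuming rectified rows and a negative absolute difference as the similarity score. The names `row_cost_volume` and `best_disparity` are hypothetical, and the sketch is not the Recogni-specific cost volume, which is learned and operates on multi-channel feature maps from non-rectified images.

```python
def row_cost_volume(left, right, max_disparity):
    """Build a (max_disparity + 1) x width cost volume for one image row.

    cost[d][x] compares left[x] with right[x - d]; higher means more similar.
    Positions where the shifted right row has no data are set to None.
    """
    width = len(left)
    volume = []
    for d in range(max_disparity + 1):
        row = []
        for x in range(width):
            if x - d < 0:
                row.append(None)  # no candidate pixel in the right row
            else:
                row.append(-abs(left[x] - right[x - d]))
        volume.append(row)
    return volume


def best_disparity(volume, x):
    """Pick the disparity with the highest similarity at column x."""
    candidates = [(row[x], d) for d, row in enumerate(volume) if row[x] is not None]
    return max(candidates)[1]


# The feature "9" sits at column 5 in the left row and column 3 in the right row.
left_row = [0, 0, 0, 0, 0, 9, 0, 0]
right_row = [0, 0, 0, 9, 0, 0, 0, 0]
volume = row_cost_volume(left_row, right_row, max_disparity=3)
disparity = best_disparity(volume, x=5)  # the shift that aligns both features
```

In a real network the hand-written similarity is replaced by learned layers that consume the stacked, shifted features, but the indexing by candidate shift is the same idea.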
The network is trained on 16-bit HDR RAW images with a resolution of 3840×2160×1, which are reshaped to 1920×1080×4. After float training, the network is prepared for chip deployment by quantizing and clustering its weights. Last but not least, our compiler takes care of creating the binaries that are ultimately flashed onto the chip.
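The reshape from 3840×2160×1 to 1920×1080×4 keeps every pixel and trades spatial resolution for channels. The article does not say how the reshape is done; a common choice for RAW sensor data is to pack each 2×2 block into four channels (often called space-to-depth), and the sketch below assumes exactly that. The function name `space_to_depth_2x2` and the channel order are assumptions for illustration.

```python
def space_to_depth_2x2(raw):
    """Pack each 2x2 block of a single-channel image into 4 channels.

    raw is a list of rows (height x width); both dimensions must be even.
    Returns a (height / 2) x (width / 2) grid of 4-element lists ordered
    top-left, top-right, bottom-left, bottom-right (assumed order).
    """
    height, width = len(raw), len(raw[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return [
        [
            [raw[y][x], raw[y][x + 1], raw[y + 1][x], raw[y + 1][x + 1]]
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


# A tiny 2 x 4 "image" becomes a 1 x 2 grid with 4 channels per position.
raw = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
packed = space_to_depth_2x2(raw)
```

Applied to a 2160-row by 3840-column frame, the same function would yield 1080 rows by 1920 columns with 4 channels each.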
Designing networks for specialized, low-power hardware is different from designing them for GPUs, since not all operators may be supported on the target platform. Our system-on-chip (SoC) is versatile, but some operations are executed less efficiently than others. Furthermore, most researchers do not have to deal with a memory-constrained environment, especially not for inference. Memory constraints, in combination with quantization effects and a limited set of mathematical operations, make designing neural networks even more challenging. How we approach designing networks specifically for our chip will be described in another blog post.
Distorted and Non-Rectified Image Processing
Convolutional networks that learn to solve stereo matching problems usually rely on undistorted and rectified image pairs. Rectification is a transformation applied to both images that constrains the matching problem to one dimension. After rectification, corresponding pixels lie in the same row of the left and right camera images.
Our system is designed to operate on distorted and non-rectified (DNR) images. This has the advantage of drastically reducing latency by removing the need for a separate hardware warping unit. By working on DNR data, we shift the computational load to our main workhorse, the convolutional accelerator, which has plenty of resources.
Hybrid Heads for Quantization
Quantization (using smaller datatypes than the standard float32 values) is a major topic for edge devices. Usually, weights are reduced to 8-bit integers, saving 75% of weight memory and significantly reducing computational needs. We use a custom number system for even more computational efficiency, but the same principle applies.
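As a reference point, the usual 8-bit case can be sketched as follows. This is plain symmetric per-tensor int8 quantization, chosen here as an assumption for illustration; it is not Recogni’s custom number system, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now needs 1 byte instead of 4, which is the 75% saving.
saving = 1 - 1 / 4
```

The reconstruction error per weight is bounded by half a quantization step, which is harmless for most weights but matters for outputs that need fine resolution.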
Quantization, however, poses a problem for accurate regression outputs.
Imagine you want to predict the distance of a car at up to 250 meters. Encoding the distance prediction in an 8-bit integer gives you only 256 distinct values and therefore a best possible resolution of roughly one meter – that’s clearly not acceptable. To solve this issue, we developed an approach called hybrid regression. In hybrid regression, we first define a number of bins, for example one every five meters, to get a first coarse localization of a detected object. For the network, this is a classification problem. To then achieve the necessary accuracy for automated driving, the network also predicts an offset (a regression output) from the bin’s center. Through this approach, we achieve a possible accuracy of 2 cm. Predicting multiple classification and regression outputs to solve a regression problem comes at an increased computational cost, but this extra cost is easily absorbed by our system’s computational resources.
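The bin-plus-offset idea can be illustrated with a small encode/decode pair. This is a sketch under stated assumptions: five-meter bins as in the example above, an offset from the bin center quantized to 256 levels (an assumed 8-bit output), and hypothetical names throughout. In the real network the bin index comes from a classification head and the offset from a regression head.

```python
BIN_WIDTH_M = 5.0     # one bin every five meters, as in the example above
MAX_RANGE_M = 250.0
OFFSET_LEVELS = 256   # assumption: the offset is encoded with 8 bits
STEP_M = BIN_WIDTH_M / OFFSET_LEVELS


def encode(distance_m):
    """Split a distance into a coarse bin index and a quantized offset level."""
    if not 0.0 <= distance_m < MAX_RANGE_M:
        raise ValueError("distance out of range")
    bin_index = int(distance_m // BIN_WIDTH_M)
    bin_center = (bin_index + 0.5) * BIN_WIDTH_M
    offset = distance_m - bin_center              # in [-2.5, 2.5)
    level = int((offset + BIN_WIDTH_M / 2) / STEP_M)
    return bin_index, min(OFFSET_LEVELS - 1, max(0, level))


def decode(bin_index, level):
    """Reconstruct the distance from the bin center plus the quantized offset."""
    bin_center = (bin_index + 0.5) * BIN_WIDTH_M
    offset = (level + 0.5) * STEP_M - BIN_WIDTH_M / 2
    return bin_center + offset


bin_index, level = encode(123.456)
reconstructed = decode(bin_index, level)
error_m = abs(reconstructed - 123.456)
```

With these assumed numbers, one offset step is 5 m / 256 ≈ 1.95 cm, which is consistent with the roughly 2 cm figure quoted above, while a single 8-bit output over the full 250 m would only resolve about one meter.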
The Seefar Dataset
The enormous computational resources of our SoC let us perform real-time inference on stereo pairs of 8MP images at high frame rates. None of the existing industry or open-source datasets meets our requirements, which is why we record our own dataset. Our data collection rig carries a stereo camera pair consisting of two OnSemi 8.3MP AR0820AT cameras and a Hesai Pandar 128 LIDAR. While we don’t use the LIDAR sensor during inference, the LIDAR makes labeling far easier.
For data annotation, we work with a partner that provides us with 3D bounding boxes for a large number of classes. A later post will go into detail about the process of creating that dataset and what we learned along the way.
Lastly, one of the interesting things about working with this dataset is that it reveals the performance limitations of the popular open-source libraries that we use for data processing and neural network training. Working with HDR images at a resolution of 1920×1080×4 is a different beast compared to the “standard” 256×256×3 image used with the well-known ImageNet dataset. At about 85x the size per tensor, data reading speed becomes really important. This means that, for example, the standard PyTorch data loading paradigm of loading single samples in a worker and batching them in the main process no longer works well when stacking the samples takes significant time. We’ll also expand on this in a future post.
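The size ratio can be checked with a few lines of arithmetic. The sketch assumes 2 bytes per value for the 16-bit HDR input and 1 byte per value for an 8-bit 256×256×3 image; the variable names are illustrative.

```python
# Bytes per input tensor under the stated assumptions.
hdr_values = 1920 * 1080 * 4
small_values = 256 * 256 * 3

hdr_bytes = hdr_values * 2      # 16-bit HDR values
small_bytes = small_values * 1  # 8-bit values

value_ratio = hdr_values / small_values  # about 42x as many values
byte_ratio = hdr_bytes / small_bytes     # about 84x as many bytes
hdr_mib = hdr_bytes / 2**20              # roughly 15.8 MiB per image
```

Under these assumptions the tensor holds about 42 times as many values and about 84 times as many bytes, in line with the rough figure above. A stereo pair doubles the per-sample size again.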
How to Measure 3DOD Performance
Detailed system performance measurement is extremely important. It guides development decisions towards architecture improvements and informs the training schedule, including when to stop. It is even more important for us because it allows us to assess conversion performance, i.e., the accuracy penalty we pay for converting a model to run on our chip.
There are several metrics that are often used for 3DOD, for example, the one used in the KITTI challenge or the NuScenes detection metrics. These match ground-truth objects with predicted objects (either using 3D overlap or center distance on the ground) and then compute metrics based on these matches.
While we have used both the KITTI and the NuScenes metrics in the past, we have discovered that their usability for our use case is limited – because they have been built for object detection challenges but we are looking for object detection insights that also reflect real-world performance.
Let’s illustrate this with an example: Say we have a single car and a single prediction, and we are using the KITTI metrics, which require a 70% 3D-IoU score to associate a ground-truth box with a predicted box. Assuming we predict the object’s dimensions of 4.5 x 2 meters perfectly, a lateral offset error of only 80 centimeters prevents the boxes from being matched according to the KITTI matching criteria. Thus, the metric returns both a false negative (an actual car that has not been detected) and a false positive (an additional detection not based on a true object), when the information we are actually interested in is that we have a lateral offset. Another issue is that the hard matching threshold of 70% 3D-IoU results in the exact same outcome whether we’re off by 80 centimeters or off by 80 meters. As such, this metric is clearly not suited to guide model improvements.
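The numbers in this example are easy to verify. The sketch assumes two identical, axis-aligned boxes with the same orientation, shifted only along the 2 meter width, so length and height overlap completely and cancel out of the IoU. The function names are hypothetical.

```python
def lateral_offset_iou(width_m, offset_m):
    """3D IoU of two identical, aligned boxes shifted sideways by offset_m.

    Length and height overlap fully, so only the width enters the ratio.
    """
    overlap = max(0.0, width_m - abs(offset_m))
    union = 2 * width_m - overlap
    return overlap / union


def max_matching_offset(width_m, iou_threshold):
    """Largest sideways shift that still reaches the IoU threshold."""
    return width_m * (1 - iou_threshold) / (1 + iou_threshold)


iou_80cm = lateral_offset_iou(width_m=2.0, offset_m=0.8)   # about 0.43
iou_80m = lateral_offset_iou(width_m=2.0, offset_m=80.0)   # no overlap at all
limit_m = max_matching_offset(width_m=2.0, iou_threshold=0.7)
```

Under these assumptions an 80 centimeter shift gives an IoU of roughly 0.43, well below 0.7, and any lateral shift above about 35 centimeters already fails the match. Both the 80 centimeter and the 80 meter case end up as one false negative plus one false positive.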
We base our custom metrics on two principles:
- Matching boxes from the model’s perspective – this means based on 2D image data. 2D IoU turns out to work well for this – it is fairly invariant to distance (i.e. we match both close predictions and distant predictions similarly).
- Building a set of human-interpretable metrics – this means a human should be able to look at a metric or a metric change and directly see what the change means. Decreasing rotation error from 4° to 2° is clearly understandable, while a decrease in the area under the rotation error curve over precision from 0.2 to 0.15 is not.
The latter part also means that explaining the metrics to people who haven’t worked with them daily is easy – this will be yet another topic for a future blog post.
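The first principle, matching in image space by 2D IoU, can be sketched as a greedy matcher. The box format `(x_min, y_min, x_max, y_max)`, the greedy strategy, and the `min_iou=0.5` threshold are assumptions for illustration; the article does not specify the matching procedure.

```python
def iou_2d(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max) in pixels."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(ground_truth, predictions, min_iou=0.5):
    """Greedily pair boxes by descending IoU; each box is used at most once."""
    pairs = sorted(
        (
            (iou_2d(g, p), gi, pi)
            for gi, g in enumerate(ground_truth)
            for pi, p in enumerate(predictions)
        ),
        reverse=True,
    )
    matches, used_gt, used_pred = [], set(), set()
    for score, gi, pi in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if gi in used_gt or pi in used_pred:
            continue
        matches.append((gi, pi))
        used_gt.add(gi)
        used_pred.add(pi)
    return matches


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(21, 20, 31, 30), (0, 1, 10, 11), (50, 50, 60, 60)]
matches = sorted(match_boxes(ground_truth, predictions))
unmatched_predictions = set(range(len(predictions))) - {pi for _, pi in matches}
```

Once pairs are fixed this way, interpretable errors such as lateral offset, depth error, or rotation error can be computed per matched pair instead of being hidden behind a hard 3D overlap threshold.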
We also compute relevant scores, which we calibrate to be between 0 and 1, with 0.5 being the “minimum acceptable error”. That means we can, at a glance, see how well we are doing.
Lastly, through a combination of all of these scores, we compute a weighted score that serves as a proxy value for general quality. While it is impossible to accurately represent a dozen metrics in a single scalar, it is still important to quickly compare runs – and our experience has shown that when lacking a single predefined metric everyone converges to one metric anyway and uses that to compare experiments.
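One way to picture the calibrated scores and the combined value is sketched below. The exponential mapping, the metric names, and the weights are assumptions for illustration only; the article does not state which calibration function or weighting is used. The only property taken from the text is that a score of 0.5 corresponds to the acceptable error level.

```python
def calibrated_score(error, acceptable_error):
    """Map a non-negative error to (0, 1]; acceptable_error maps to exactly 0.5."""
    if error < 0 or acceptable_error <= 0:
        raise ValueError("error must be >= 0 and acceptable_error > 0")
    return 0.5 ** (error / acceptable_error)


def weighted_score(scores, weights):
    """Weighted average of per-metric scores, normalized by the total weight."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical metrics: a rotation error right at its limit and a perfect depth.
scores = {
    "rotation": calibrated_score(error=2.0, acceptable_error=2.0),
    "depth": calibrated_score(error=0.0, acceptable_error=1.0),
}
weights = {"rotation": 1.0, "depth": 3.0}
overall = weighted_score(scores, weights)
```

Because every per-metric score shares the same 0 to 1 scale and the same meaning of 0.5, the weighted average stays readable even though it compresses several metrics into one number.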
Of course, these metrics can be computed not only over our entire dataset – we took care to keep them meaningful for any arbitrary group of samples. Interested in which single frame we perform worst on? Interested in the recorded sequence we perform best on? Or maybe the recorded day with the worst depth error? All of these are available through metric computation.
In this blog post, we outlined Recogni’s 3D object detection pipeline. The motivation for developing such a large and complex network is to showcase what can be done with our immense compute capacity. With our hardware solution, it is possible to run state-of-the-art networks with low power consumption, low latency and real-time performance. Alongside 3DOD, we also focus on other vision applications that are needed for the AD/ADAS industry. The future plan is to integrate all these applications into a vision reference stack.
[1] Liu Y, Wang L, Liu M. YOLOStereo3D: A Step Back to 2D for Efficient Stereo 3D Detection. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021 May 30 (pp. 13018-13024)
[2] Kendall A, Martirosyan H, Dasgupta S, Henry P. End-to-End Learning of Geometry and Context for Deep Stereo Regression. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017 (pp. 66-75)