Deep learning models promise to deliver higher precision at much lower computational cost. They have outperformed conventional computer vision models, which rely heavily on manually engineered feature extraction, and deep learning-based predictors now deliver perception at near-human levels. Advances in GPU and TPU technologies have made it possible to train deep learning models and deploy them with high accuracy in advanced driver assistance systems at the edge or on embedded processors. However, the memory footprint and computational load of a deep learning-based system must be kept within the limits of the edge hardware for successful deployment.
Deep learning models, based loosely on the brain's processing of information, have been deployed for supervised, semi-supervised, unsupervised and reinforcement learning, and on several perception tasks these methods have surpassed human performance. Multistage deep learning models can find the proverbial needle in a haystack with high precision, reduce false positives and learn robust representations, which allows the autonomous ecosystem to trust their predictions.
Fundamentals of deep learning
Deep learning and machine learning are computational areas that demand constant attention, and their complexity means that success stories are still hard-won. The cost of training large networks has historically been high, since these classifiers usually need a lot of memory and computational resources to be implemented and executed. However, falling hardware prices and the availability of large datasets have made it possible to train models at scale with promising results. Training and tuning convolutional neural networks, for example, can consume many months of experimentation, which is not practical for every practitioner. Hyperparameter tuning and the constant search for better architectures and faster training have always been the path to improving performance. In pursuit of these goals, exhaustive parameter searches and handcrafted adjustments to network architectures, followed by long hours of training, have largely determined how model complexity is set. Large networks remain the most common solution to this problem.
Neural networks
Neural networks in ADAS help identify and track objects in real time, and the architecture of CNNs, particularly for object detection, is crucial for safety features in autonomous vehicles. This section outlines and classifies the neural network architectures used in a variety of systems designed for visual perception problems. 'Localization' and 'detection' architectures are detailed in the context of object and scene perception for automotive active safety and semi-autonomous driving; deep learning and deep neural networks encompass problems of interest to very large industrial sectors.
Convolutional neural networks
Convolutional neural networks (CNNs) are a class of deep neural network that has proved especially useful for perception tasks such as object classification, detection and segmentation in image and video data. A CNN exploits the spatial structure of the data and makes use of characteristics such as local correlation, shared weights and hierarchy to effectively reduce the number of weights and learn important features implicitly. A CNN is made up of a hierarchy of layers, with every layer either learning the weights of a set of convolutional filters that pass features on to the next layer, or applying an activation function to these features. The first layer of a given CNN takes an input image, and the final layer, known as the classification layer, provides the output of the network by deciding – based on the learned features – into which category the input falls.
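As a concrete illustration of this layered structure, the following minimal sketch stacks convolutional filters, activations and a final classification layer. It is written in PyTorch purely for illustration; the framework, layer sizes and class count are assumptions rather than anything prescribed above.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A minimal CNN: stacked convolution + activation layers feeding a classifier."""
    def __init__(self, num_classes: int = 10):  # class count is an illustrative assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learnable convolutional filters
            nn.ReLU(),                                     # activation applied to the features
            nn.MaxPool2d(2),                               # spatial down-sampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # collapse the remaining spatial grid
        )
        self.classifier = nn.Linear(32, num_classes)       # final classification layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                   # hierarchy of learned features
        return self.classifier(x.flatten(1))   # decide into which category the input falls

# One 3-channel 64x64 input image produces one score per category.
logits = TinyCNN()(torch.randn(1, 3, 64, 64))
```

The shared convolutional weights are what keep the parameter count low relative to a fully connected network of similar depth.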
Perception in ADAS and AV systems
Perception relies on multisensor integration (cameras, lidar, radar) to model the vehicle’s surroundings, capturing the shape, location and movement of surrounding objects. This data acquisition informs key features such as collision avoidance in ADAS. The information gathered can be quite complex; it includes details related to the shape, size, position, distribution, movement, type and category of the objects in the environment. Higher-level details derived from it include the path of the vehicle, the free space available to travel and the conditions surrounding the vehicle. Driving environments can range from clear weather with good lighting and road markings to poor lighting, rain, fog, snow and different levels of reflection.
Current approaches use range sensors to gather information about the external environment. By capturing, processing and analyzing the acquired data, these sensors can identify objects such as pedestrians, cars, traffic signs, trees, buildings and roads. The role of the camera sensor is to capture color or gray images of the environment.
The role of the lidar sensor is to capture and analyze depth data by scanning the environment, and the role of the radar sensor is to capture and analyze radio-wave returns that provide information about the objects. The GPS receiver captures the real-time location of the vehicle. The main challenge in perception is to localize each object in a 3D world coordinate frame, estimate its 3D pose and determine its shape or identity from the captured input data.
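One small building block behind that 3D localization challenge is the geometric link between a point in the world and its observation in the camera image. The sketch below is a minimal pinhole-camera projection, assuming the calibration matrices K (intrinsics) and R, t (extrinsics) are already known; it is an illustrative helper, not part of any specific pipeline described here.

```python
import numpy as np

def project_to_image(point_world: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Project a 3D world point into pixel coordinates with a pinhole model: p ~ K (R X + t)."""
    p_cam = R @ point_world + t        # world frame -> camera frame
    if p_cam[2] <= 0:
        return None                    # point lies behind the camera
    u, v, w = K @ p_cam                # perspective projection onto the image plane
    return u / w, v / w                # pixel coordinates (x, y)
```

In practice the same calibration chain is used in reverse to lift image detections toward 3D, with lidar or radar supplying the missing depth.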
Data acquisition
The environment perception task aims to understand the surrounding road scene from raw sensor data. Here, understanding includes detecting objects in the scene, such as vehicles, pedestrians or cyclists, recognizing other co-participating dynamic objects and tracking their motion over time. For each point in time, the perception model takes a set of sensor measurements and returns an object list in which each entry contains at least the object's position, speed and acceleration in 2D, together with any other available information that further describes it. Data acquisition uses sensors that provide information on the location and movement of objects around the ego vehicle and on the drivability of the road ahead of the ego vehicle.
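An object list of this kind can be represented as a simple record per tracked object. The sketch below is a hypothetical Python schema; the field names and 2D state layout follow the description above but are otherwise assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class PerceivedObject:
    """One entry of the object list returned by the perception model (hypothetical schema)."""
    track_id: int                       # identity maintained while the object is tracked over time
    category: str                       # e.g. "vehicle", "pedestrian", "cyclist"
    position: tuple[float, float]       # 2D position relative to the ego vehicle, metres
    velocity: tuple[float, float]       # 2D speed, metres per second
    acceleration: tuple[float, float]   # 2D acceleration, metres per second squared
    extra: dict = field(default_factory=dict)  # any other available information about the object
```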
Object detection and tracking
Detecting and tracking objects in driving environments is essential for higher-level perception in advanced driver assistance systems and autonomous vehicles, including but not limited to automatic emergency braking, collision warning and autonomous driving maneuvers at intersections and roundabouts. Pedestrians, bicyclists and motorcyclists, in particular, require advanced perception and prediction methods for the efficient operation of these systems, as they behave with significant variability in time and space compared with other road users.
Diverse and standardized datasets have contributed to the rapid development and deployment of deep learning methods for object detection and tracking. Detection approaches based on deep learning can be broadly divided into two categories: region proposal-based methods, which first generate region proposals across an image and then classify them; and proposal-free methods, which directly regress or predict bounding boxes consisting of class and location information. The first category contains the seminal Fast and Faster region-based convolutional neural network (R-CNN) methods, their residual network and feature pyramid network-based improvements, and their predecessors such as the original R-CNN. These methods were designed for higher accuracy at the cost of higher computational requirements. In contrast, single-shot methods for object detection were designed for real-time detection and tracking while still achieving reasonable detection accuracy.
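As one readily available example of the region proposal-based family, the sketch below runs a pretrained Faster R-CNN from torchvision on a single image. The weight selection, input size and confidence threshold are illustrative assumptions, not recommendations from the text.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Region proposal-based detector: backbone + region proposal network + box classifier.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # torchvision >= 0.13; older versions use pretrained=True
model.eval()

image = torch.rand(3, 480, 640)          # stand-in for a camera frame, values in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]       # dict with 'boxes', 'labels' and 'scores'

keep = prediction["scores"] > 0.5        # illustrative confidence threshold
boxes = prediction["boxes"][keep]        # location information per detection
labels = prediction["labels"][keep]      # class information per detection
```

A single-shot detector would be called in much the same way but trades some accuracy for substantially lower latency.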
Challenges and limitations
The current generation of deep learning systems can solve perception problems with superhuman performance, but there are limitations and challenges. First, deep learning relies heavily on extensive and representative data to learn accurate models and struggles when the deployment conditions do not match the training conditions. The data required for supervised training can be difficult and expensive to collect for autonomous driving. Deep learning systems are data hungry, yet labeling and maintaining the quality of pixel-level or point-level annotations of driving sensor data is an arduous task, often performed by humans. Furthermore, it is unrealistic to train models to recognize every object or event that the vehicle might encounter or that might be useful to know about. Second, many safety-critical applications, such as autonomous driving, often require transparency, interpretability or causality. There is a trade-off between the excellent performance of deep learning models and our human ability to comprehend how they function.
Next, deep learning cannot inherently accommodate real-world physics, meaning that predictions can be inconsistent with the rules governing the physical world: for example, everything outside the window of a stopped vehicle can sometimes disappear from a model's predictions, except for the vehicle itself and the sky. Deep learning systems are also vulnerable to manipulation by adversarial attacks – malicious tampering with the input images designed to make the model behave unpredictably. Prior research has shown that all machine learning systems, including deep learning, suffer from such vulnerabilities; however, in contrast to other machine learning models, deep learning models tend to be particularly vulnerable. Finally, deep learning is computationally and energy intensive, which makes its deployment in resource-limited systems difficult. Small deep learning models have significantly lower representational power and do not capture the intricacies of the data as well as large models do, so they trade accuracy for computational efficiency. Furthermore, because of their high computational demands, it is costly to keep large deep learning models up to date with changes in the environment.
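To make the adversarial vulnerability concrete, the sketch below implements the fast gradient sign method (FGSM), a widely known attack that is not specifically named in the text. The perturbation budget and the generic classifier interface are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    """Fast gradient sign method: nudge the input in the direction that increases
    the loss, keeping the change bounded by epsilon so it stays visually subtle."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)      # loss with respect to the true label
    loss.backward()
    perturbed = image + epsilon * image.grad.sign()  # adversarial perturbation
    return perturbed.clamp(0.0, 1.0).detach()        # keep a valid image range
```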
Data annotation and labeling
Real-world ADAS and autonomous driving perception tasks generally require vast amounts of labeled training data for effective deep learning. This data typically comprises sensor observations, such as image and point cloud frames, together with object labels. Common sources of sensor observations used to generate annotated training data include cameras, radar and lidar systems. Common object labels for annotating this data include cars, pedestrians, bicyclists and lane boundaries, to name a few. The accuracy and consistency of the annotations are paramount: misleading or inconsistent annotations can degrade a model’s ability to solve a given data-processing task.
Annotating ADAS object perception data can be labor-intensive and error-prone. In practice, the annotation process often involves humans using software tools to manually place annotations and attach textual labels to designated objects. For driving tasks, the annotation process must capture objects whose visual appearance varies across a wide range of image and point cloud backgrounds. Additionally, individual labeled objects must often be further delineated to resolve occlusions and truncations or to improve the labeling consistency of adjacent objects. Furthermore, point cloud annotation should ensure that labels are placed in 3D space at the sensor’s measured distance, and different representations may be chosen depending on the target perception task.
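In practice, each labeled object is often stored as a structured record. The fields below form a hypothetical schema that reflects the attributes discussed above; the names and conventions are assumptions, not a specific dataset format.

```python
from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    """One labeled object in a sensor frame (hypothetical schema)."""
    category: str          # e.g. "car", "pedestrian", "bicyclist"
    bbox_2d: tuple         # image-plane box (x_min, y_min, x_max, y_max) in pixels
    bbox_3d_center: tuple  # (x, y, z) in the sensor frame, at the measured distance
    bbox_3d_size: tuple    # (length, width, height) in metres
    yaw: float             # heading of the 3D box, radians
    occluded: bool         # flags used to resolve occlusions...
    truncated: bool        # ...and truncations during annotation
    annotator_id: str      # who placed the label, which helps audit consistency
```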
Future directions
We have provided an overview of the current technology used and the challenges faced in applying deep learning for perception in advanced driver assistance systems and autonomous vehicles. We have also identified some future directions of research for enhancing the performance of such systems, including more efficient fusion of multiple perception submodels to increase their reliability and efficiency, and testing the robustness of perception models in their interaction with other systems in the vehicle. Progress along these directions would make the development and customization of new perception models less challenging for infrastructure and OEM engineers.
The development of deep learning solutions for perception in advanced driver assistance systems and autonomous vehicles is facing a myriad of challenges due to the models’ need to balance several qualities, such as correctness, efficiency, cost and reliability of perception decisions. Although deep learning models have done a good job of mitigating many of the common perception problems, the landscape of challenges related to perception in both environments will present interesting research opportunities for many years to come.