Advancements in Neural Network Architecture for Tesla's Full Self-Driving (FSD) System



Hi all. Today we take a deep dive into the architecture that makes Tesla's Full Self-Driving (FSD) Beta possible. To my knowledge, this description is up to date as of April 2023. There are many parallels between the architecture discussed here and the Mixture of Experts (MoE) architecture that the recently hyped GPT-4 reportedly uses. Tesla's AI team has envisioned this for the last decade, and only now are we beginning to see the true power of the massive fleet Tesla has on the roads.

The Tesla team is building a synthetic visual cortex for the car, designed to process the information carried by the light hitting an artificial retina. The biological visual cortex has an intricate structure, with distinct areas that organize the flow of information. The neural network architecture is designed with a similarly deliberate layout for how information flows through it.

The neural networks were initially developed four years ago, when the car was mostly driving in a single lane on the highway. At that time, processing happened only at the level of individual images, with each image analyzed by a neural net. The input is 1280 by 960 pixels, with 12-bit integers streaming in at around 36 hertz. The backbone is built from residual neural networks, specifically RegNets, which offer a nice design space for neural networks.

RegNets output features at several resolutions and scales: high spatial resolution with low channel counts at one end, and low spatial resolution with high channel counts at the other. Neurons in the high-resolution layers scrutinize the detail of the image, while neurons in the low-resolution layers see most of the image and carry a lot of scene context. Feature pyramid networks, here BiFPNs, process these features across scales so that the scales effectively share information.

After the BiFPN fuses features across scales, the network branches into task-specific heads, for example object detection and traffic light recognition. Because all heads share one backbone, the cost of the backbone's forward pass is amortized across tasks at test time in the car, which is more efficient than running a separate backbone for each task.

Feature sharing is the first benefit of this layout. Tasks such as traffic light recognition and lane prediction all reuse the same backbone features, so the forward-pass cost at test time is paid once rather than once per task.

In short, the team has developed a neural network architecture for the car's visual cortex built around a shared backbone with many task-specific heads, a layout the team calls HydraNets. Its main benefits are lower compute cost and efficiency at test time.
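To make the amortization concrete, here is a minimal sketch of a shared backbone feeding several heads. It is a toy under stated assumptions, not Tesla's implementation: `CountingBackbone`, `detection_head`, `traffic_light_head`, and `hydranet_forward` are hypothetical names, and the "features" are just per-row means of a tiny image.

```python
class CountingBackbone:
    """Stand-in for the shared feature extractor; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        # Hypothetical "features": the mean of each row of a 2D list of pixels.
        return [sum(row) / len(row) for row in image]


def detection_head(features):
    # Toy head: indices of rows whose mean exceeds a threshold.
    return [i for i, f in enumerate(features) if f > 0.5]


def traffic_light_head(features):
    # Toy head: a single score derived from the same shared features.
    return max(features)


def hydranet_forward(backbone, heads, image):
    features = backbone(image)  # computed once, reused by every head
    return {name: head(features) for name, head in heads.items()}


backbone = CountingBackbone()
heads = {"objects": detection_head, "traffic_lights": traffic_light_head}
image = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.7]]
outputs = hydranet_forward(backbone, heads, image)
```

However many heads are added to the dictionary, `backbone.calls` stays at one per image, which is the whole point of sharing the backbone.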

A second benefit is that the tasks are decoupled, so each one can be worked on in isolation without impacting the others. For example, you can up-rev a data set or change the architecture of one head without having to revalidate all the other tasks, which can be expensive. A third benefit comes from the bottleneck in features: the multi-scale features can be cached to disk, and fine-tuning workflows then train only from the cached features up. In a typical training workflow, all tasks are first trained jointly, then features are cached at the multi-scale feature level, and fine-tuning proceeds from there.
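The caching workflow can be sketched as follows. This is only an illustration of the idea of training "from the cached features up": the frozen `backbone` is a toy function, the head is a two-weight linear model fitted by plain gradient descent, and every name here is an assumption rather than anything from Tesla's tooling.

```python
import os
import pickle
import tempfile


def backbone(x):
    # Frozen stand-in for the expensive shared backbone.
    return [x, x * x]


def cache_features(inputs, path):
    # Run the backbone once and store its outputs on disk.
    with open(path, "wb") as f:
        pickle.dump([backbone(x) for x in inputs], f)


def finetune_head(path, targets, steps=500, lr=0.05):
    # Fine-tune only the head, reading features from the cache
    # instead of re-running the backbone.
    with open(path, "rb") as f:
        feats = pickle.load(f)
    w = [0.0, 0.0]
    n = len(feats)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for phi, y in zip(feats, targets):
            err = w[0] * phi[0] + w[1] * phi[1] - y
            grad[0] += 2 * err * phi[0] / n
            grad[1] += 2 * err * phi[1] / n
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return w


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.pkl")
    cache_features([-1.0, 0.0, 1.0, 2.0], path)
    # Targets follow y = 2*x + 3*x^2, so the head should recover [2, 3].
    weights = finetune_head(path, targets=[1.0, 0.0, 5.0, 16.0])
```

The backbone runs once per input during caching, and every later fine-tuning run touches only the small head.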

The first problem with this per-image approach showed up while working on Smart Summon, where predictions were processed individually for each camera. You cannot drive directly on image-space predictions. They need to be cast out to form a vector space around the car. This was done in C++ with a component called the occupancy tracker. However, there were two major problems with this setup: tuning the occupancy tracker and its hyperparameters was extremely complicated, and image space was not the correct output space.

Predictions that looked fine in image space were not accurate once cast into vector space, because the projection requires an extremely accurate depth for every single pixel of the image. Additionally, occluded areas could not be predicted, because occlusion is not an image-space concept in that case.

The other problem with this approach was object detection. With per-camera predictions, a vehicle that spans several cameras is not seen in full by any one of them, so it was difficult to predict the entire car and difficult to fuse the per-camera measurements. The proposed approach is to feed all images simultaneously into a single neural net and output directly in vector space. This is a more challenging task: every image is processed by a backbone, the features are re-represented from image space to vector space, and the heads then decode from there.

However, there are two problems with this approach. The first is building the neural network components that perform this transformation in a differentiable way, so that the whole network can be trained end to end. The second is obtaining vector-space data sets to train the predictions on. The following paragraphs deal with the first problem, and the data sets are discussed further below.

The first challenge is predicting in bird's-eye view. The labels live in vector space, but the support for each prediction comes from image space, and where a given output location shows up in the images depends on the camera positioning, extrinsics, and intrinsics. To address this, a transformer is used to re-represent the space, using multi-headed self-attention and blocks of it. The process starts by initializing a raster of the size of the desired output space, tiling it with positional encodings made of sines and cosines, and encoding these with an MLP into a set of query vectors.

All images and their features emit their own keys and values, and the queries, keys, and values feed into the multi-headed self-attention. In effect, each image piece broadcasts in its key what it is a part of, while every query says something like: I am a pixel in the output space at this position, and I am looking for features of this type. The keys and queries interact multiplicatively, and the values are pooled accordingly.
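The query, key, and value interaction can be sketched with a single attention head in plain Python. This is a simplified illustration, assuming standard scaled dot-product attention: the MLP that encodes the positional encodings into queries is omitted (the encoding is used directly as the query), and the keys and values are hand-picked numbers rather than real image features.

```python
import math


def positional_encoding(x, y):
    # Sines and cosines of an output-raster coordinate.
    return [math.sin(x), math.cos(x), math.sin(y), math.cos(y)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    # Keys and queries interact multiplicatively (dot product),
    # then the values are pooled with the resulting weights.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


# One output-space query and two image patches (hypothetical numbers).
query = positional_encoding(0.0, 0.0)            # [0, 1, 0, 1]
keys = [[0.0, 1.0, 0.0, 1.0], [0.0, -1.0, 0.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attend(query, keys, values)
```

The first patch's key matches the query, so it receives most of the attention weight and dominates the pooled feature for that output location.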

There is one more engineering problem: the camera calibration differs from car to car, and it matters for this transformation. One option is to concatenate the camera calibrations of all the images and feed them into the neural net through an MLP. However, performance improves when all images are first transformed into a synthetic virtual camera using a special rectification transform. This is done by inserting a new layer right above the image, the rectification layer, which is a function of camera calibration and translates all images into a common virtual camera.

The results show that predicting directly in vector space significantly improves the neural net's output, and the difference is night and day. This took time and engineering, with incredible work from the AI team. The multi-camera network also improves object detection, particularly for cars that are only partially visible in any one camera.

In more nominal situations, the per-camera setup still struggles with large vehicles that cross camera boundaries, while multi-camera networks struggle significantly less with these predictions. Overall, the multi-camera architecture is essential for accurate and efficient car detection.

Multi-camera networks can make predictions in vector space, but they still operate independently at every instant in time. Many predictions require video context, so the team incorporated video modules into the neural network architecture. The multi-scale features are fed into a feature queue module that caches features over time, followed by a video module that fuses this information temporally. The decoding heads then operate on the output of these two blocks.

The feature queue stores a concatenation of the features over time, the kinematics of the car, and positional encodings. This information is concatenated, encoded, and stored in the feature queue, which is then consumed by the video module. The pop and push mechanisms are crucial here, and the main challenge is deciding when to push.

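A time-based queue is necessary for handling occlusions. The latest features are pushed at a fixed time interval, so the neural network can look back in time and learn the association between a temporarily occluded car and its earlier features. A space-based queue is also necessary, and it pushes every time the car travels a fixed distance. This matters for predictions about road geometry, for example knowing that the car is in a turning lane while the lane next to it is going straight. The relevant markings and signs may have been seen a long time ago, and a purely time-based queue can forget them while the car waits at a red light.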

So both a time-based queue and a space-based queue are used to cache features, and both feed into the video module. These details matter for predictions about road surfaces, road geometry, and anything else that depends on what the car saw earlier.
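The difference between the two push rules is easiest to see in a small simulation. The sketch below is an assumption-laden toy: `FeatureQueue` and its parameters are hypothetical names, the "features" are just labels, and the car drives 3 metres and then stands still, as it would at a red light.

```python
from collections import deque


class FeatureQueue:
    """Fixed-size queue that pushes on a time period or a travelled distance."""

    def __init__(self, maxlen, period_s=None, spacing_m=None):
        self.items = deque(maxlen=maxlen)  # oldest entries pop automatically
        self.period_s = period_s
        self.spacing_m = spacing_m
        self.last_t = None
        self.last_s = None

    def maybe_push(self, t, odometer_m, features):
        if self.period_s is not None:
            due = self.last_t is None or t - self.last_t >= self.period_s
        else:
            due = self.last_s is None or odometer_m - self.last_s >= self.spacing_m
        if due:
            self.items.append((t, odometer_m, features))
            self.last_t, self.last_s = t, odometer_m
        return due


time_queue = FeatureQueue(maxlen=4, period_s=1)
space_queue = FeatureQueue(maxlen=4, spacing_m=1)

# Drive at 1 m/s for 3 seconds, then stand still until t = 9.
for t in range(10):
    odometer = min(t, 3)
    features = "features@%dm" % odometer
    time_queue.maybe_push(t, odometer, features)
    space_queue.maybe_push(t, odometer, features)

time_positions = [s for _, s, _ in time_queue.items]
space_positions = [s for _, s, _ in space_queue.items]
```

After the stop, the time-based queue holds four copies of the view from the stop line, while the space-based queue still remembers the road at 0, 1, 2, and 3 metres.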

For the video module itself, the team uses a spatial recurrent neural network. The hidden state is organized into a two-dimensional lattice, and as the car drives around, only the parts of the lattice that are near the car and where the car has visibility are updated. The kinematics are used to integrate the position of the car in the hidden features grid, so the RNN is updated only at nearby points.

The hidden state of the RNN is organized into different channels, which track various aspects of the road, such as the centers, edges, lines, and road surface. The recurrent neural network can selectively read from and write to this memory at specific locations. For example, if a car next to us is blocking part of the road, the network can choose not to write to those locations. When the car goes away and the view is clear, it can write information about what is in that part of space.
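A rough sketch of the selective write, with one channel and a fixed blending gain standing in for the learned recurrent update (the real module learns when and what to write; `update_lattice`, `gain`, and the grid sizes here are illustrative assumptions):

```python
def update_lattice(hidden, observation, visible, ego_cell, radius, gain=0.5):
    """Blend observations into hidden-state cells that are near the ego car
    and currently visible; leave every other cell untouched."""
    rows, cols = len(hidden), len(hidden[0])
    er, ec = ego_cell
    new_hidden = [row[:] for row in hidden]
    for r in range(rows):
        for c in range(cols):
            near = abs(r - er) <= radius and abs(c - ec) <= radius
            if near and visible[r][c]:
                new_hidden[r][c] = (1 - gain) * hidden[r][c] + gain * observation[r][c]
    return new_hidden


hidden = [[0.0] * 5 for _ in range(3)]
observation = [[1.0] * 5 for _ in range(3)]
visible = [[True] * 5 for _ in range(3)]
visible[1][2] = False  # for example, blocked by another car
updated = update_lattice(hidden, observation, visible, ego_cell=(1, 1), radius=1)
```

Cells near the ego position are written, the occluded cell keeps its old state, and cells far from the car are not touched at all.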

The video networks also improve object detection. In a clip where two cars pass in front of each other, the single-frame and video predictions are roughly equivalent at first, with multiple orange boxes coming from the different cameras. When the cars become partially occluded, the single-frame networks drop the detection, but the video module remembers it and persists the cars.

The video module also brings significant improvements in estimating depth and velocity. In a clip from the remove-the-radar push, the radar depth and velocity are shown in green and orange, respectively. The video module's predictions lie right on top of the radar signal, which indicates that its depth and velocity predictions are of high quality.

In short, the spatial recurrent neural network video module improves object detection through occlusions and improves depth and velocity estimation. Its ability to integrate information over time lets its predictions track the radar signal closely.

Putting it all together, raw images are rectified into a common virtual camera, processed through RegNet residual networks, and fused across scales with a BiFPN. The features are then re-represented into the vector space of the output, and the result feeds into a feature queue in time or space that is processed by a video module such as the spatial RNN. Finally, the branching HydraNet structure handles the various tasks.

The architecture has evolved from a simple image-based single network about three or four years ago into something quite impressive. There are still opportunities for improvement. The fusion of time and space could happen earlier in the network, for example with cost volumes or optical-flow-like networks. The outputs are also dense rasters, which are expensive to post-process in the car given the strict latency requirements.

The team is therefore working on predicting just the sparse structure of the road directly, point by point or in some other way that does not require expensive post-processing. The end result of all this is a nice vector space.

The planning and control teams work on maximizing safety, comfort, and efficiency. The key problems in planning are that the action space is non-convex and that it is high-dimensional. A non-convex problem can have multiple solutions that are each good on their own, but getting a globally consistent solution is tricky. Discrete search methods are great at solving non-convex problems: because they are discrete, they do not get stuck in local minima, whereas continuous function optimization can easily get stuck in local minima and produce poor solutions. For high-dimensional problems the situation reverses, since discrete search becomes intractable as the number of dimensions grows while gradient-based continuous optimization copes well.

To handle both problems, the team breaks the planning task down hierarchically. A coarse search method first crunches down the non-convexity and comes up with a convex corridor, and continuous optimization techniques then produce the final smooth trajectory. The search evaluates thousands of candidates in a very short time span, and the car then chooses a path based on the optimality conditions of safety, comfort, and easily making the turn.
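The coarse-then-smooth idea can be illustrated on a one-dimensional toy cost with two valleys. This is an analogy only, not the planner: the cost function, the grid step, and the learning rate are all made up for the example.

```python
def cost(x):
    # Non-convex toy cost: two valleys near x = -1 and x = +1,
    # with the left valley slightly deeper.
    return (x * x - 1.0) ** 2 + 0.3 * x


def cost_grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


def coarse_search(lo, hi, step):
    # Discrete search: evaluate a coarse grid and keep the best candidate.
    n = int(round((hi - lo) / step)) + 1
    candidates = [lo + i * step for i in range(n)]
    return min(candidates, key=cost)


def refine(x, steps=1000, lr=0.01):
    # Continuous optimization: plain gradient descent from a starting point.
    for _ in range(steps):
        x -= lr * cost_grad(x)
    return x


local_only = refine(0.9)               # gets stuck in the shallower valley
seed = coarse_search(-2.0, 2.0, 0.5)   # the coarse search picks the deeper valley
hierarchical = refine(seed)            # smooth refinement inside that valley
```

Gradient descent on its own settles in whichever valley it starts in, while the coarse search first selects the right valley and leaves only a convex problem for the continuous step.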

As the car executes the trajectory, its motion matches the plan. The cyan plot on the right side shows the actual velocity of the car, and the white line underneath it is the plan. The car is able to plan 10 seconds ahead here, and in hindsight the plan turns out to be well made.

When driving alongside other agents, it is crucial to plan for everyone jointly and optimize for the overall traffic flow. This is achieved by running the Autopilot planner on every relevant object in the scene. For example, in a narrow corridor, Autopilot slows down for an oncoming car but then realizes that it cannot yield because of the limited space. The other car can yield to Autopilot instead, so Autopilot reasons about that car's low velocity, asserts itself, and makes progress.

A second oncoming car arrives with a higher velocity, and Autopilot runs the planner for that car too. One predicted plan has the car going around the parked cars on its side and then returning to the right side of the road. Autopilot keeps multiple possible futures for this car, and in the green one the car yields to us. As Autopilot starts to pull over, it notices from the car's yaw rate and acceleration that the car has chosen to yield. Autopilot then changes its mind and continues to make progress. Without this, Autopilot would be too timid to be a practical self-driving car.

The search and the planning for other agents set up a convex valley, and continuous optimization then produces the final trajectory. The optimization is initialized from the search result as a spline in heading and acceleration, parameterized over the arc length of the plan. It then continuously makes fine-grained changes to reduce costs such as distance from obstacles, traversal time, and comfort.

In summary, searching for both the ego car and everyone else in the scene, setting up a convex corridor, and then optimizing for a smooth path together achieve some neat things. However, driving in unstructured environments like my hometown is more complex, with cars and pedestrians cutting each other off, harsh braking, and honking. To solve these problems efficiently at runtime, learning-based methods are used, and they prove more effective than hand-designed approaches.

Consider a parking problem: the ego car has to navigate a parking lot, working its way around curbs, parked cars, and cones. The standard A* algorithm uses a lattice-based search with the distance to the goal as its heuristic. This heuristic works, but better heuristics are tedious and hard to design globally. To improve the search, neural networks are used to generate state and action distributions that can be plugged into Monte Carlo Tree Search (MCTS) with various cost functions.
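For reference, the baseline described here looks roughly like the following: A* on a 4-connected lattice with the Euclidean distance to the goal as the heuristic, counting how many nodes get expanded. The grid, start, and goal are a made-up miniature parking lot, not data from the original demo.

```python
import heapq
import math


def astar(grid, start, goal):
    """4-connected A* on a grid of 0 (free) and 1 (blocked) cells.
    Returns (path_cost, nodes_expanded), or (None, nodes_expanded)."""

    def heuristic(cell):
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    best = {start: 0}
    expanded = 0
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if g > best.get(cell, float("inf")):
            continue  # stale heap entry, a cheaper route was found later
        expanded += 1
        if cell == goal:
            return g, expanded
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + heuristic((nr, nc)), ng, (nr, nc)))
    return None, expanded


# 1 marks curbs, cones, or parked cars in this toy lot.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
path_cost, nodes_expanded = astar(grid, start=(0, 0), goal=(3, 4))
```

The number of expansions is the quantity a better heuristic, or a learned policy guiding the search, is meant to drive down.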

The vision networks produce a vector space and the car moves around in it, so the problem looks much like an Atari game, and techniques such as MuZero and AlphaZero can be applied. The neural networks produce state and action distributions that are plugged into Monte Carlo Tree Search with cost functions such as collisions, comfort, traversal time, and more.

In the final architecture, the vision system crushes the dense video data down into a vector space. This vector space is consumed by both an explicit planner and a neural network planner, and the neural network planner can also consume intermediate features of the vision network. Together these produce a trajectory distribution that can be optimized end to end, using explicit cost functions together with human intervention data. The final steering and acceleration commands for the car come out of this process.

In other words, the final planner combines explicit and neural network planning, and the whole pipeline from vision to steering and acceleration commands can be optimized end to end.

In summary, searching for a path through a parking lot with a hand-designed heuristic is inefficient. By using neural networks to guide the search and incorporating various cost functions, MCTS can navigate the parking lot more efficiently and achieve better results.

Training these neural networks requires large data sets. The architecture only establishes an upper bound on performance, and massive data sets are needed to train the correct algorithms inside the networks. To accumulate clean and diverse vector-space examples, Tesla has evolved its data sets over time. Initially, Tesla worked with a third party to obtain data sets, but this proved unworkable because of high latency and poor quality. Tesla therefore brought all of the labeling in-house, forming a data labeling organization of more than 1,000 people, with professional labelers working closely with the engineers. The team also built the data labeling infrastructure from scratch and maintains statistics on latency, throughput, and quality.

Initially, most labeling was done in image space, which meant spending time annotating individual images. This does not scale to the millions of vector-space labels that are needed. Tesla quickly transitioned to three-dimensional or four-dimensional labeling, where labels are placed directly in vector space rather than in individual images. This led to a massive increase in throughput, because a label is placed once in 3D and then reprojected into the camera images. Even this was not enough, however, because humans and computers have different strengths: humans are good at semantics, while computers are good at geometry, reconstruction, triangulation, and tracking.
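The "label once in 3D, reproject everywhere" step comes down to projecting one vector-space point into each camera. The sketch below assumes an ideal pinhole camera and, as a strong simplification, cameras that differ from the vehicle frame only by a translation; the focal length, principal point, and camera offsets are invented numbers.

```python
def to_camera(point_vehicle, cam_offset):
    # Simplifying assumption: the camera axes are aligned with the vehicle
    # frame (x right, y down, z forward), so only a translation is needed.
    return tuple(p - o for p, o in zip(point_vehicle, cam_offset))


def project(point_cam, focal_px, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        return None  # the point is behind the camera
    return (focal_px * x / z + cx, focal_px * y / z + cy)


# One label placed in vector space, in metres.
label_3d = (1.0, 0.5, 10.0)
# Two hypothetical forward-facing cameras, 0.5 m apart laterally.
cameras = {"left": (0.0, 0.0, 0.0), "right": (0.5, 0.0, 0.0)}

pixels = {
    name: project(to_camera(label_3d, offset), focal_px=1000.0, cx=640.0, cy=480.0)
    for name, offset in cameras.items()
}
```

A single 3D annotation yields a consistent 2D label in every camera that sees the point, which is where the throughput gain comes from.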

The story of these data sets is therefore one of collaboration between humans and computers, and that collaboration is what makes it possible to build vector-space data sets at this scale.

Auto-labeling is the infrastructure Tesla developed for labeling clips at scale. The amount of labeled data needed to train the networks is massive, which is why the team invested in a large auto-labeling pipeline. A single clip is an entity with dense sensor data, such as videos, IMU data, GPS, and odometry, and can be uploaded by engineering cars or customer cars. Clips are sent to servers where neural networks are run offline to produce intermediate results such as segmentation maps, depth, and point matching. These results then go through robotics and AI algorithms to produce the final set of labels used to train the networks.

One of the first tasks is labeling the road surface. Splines or meshes could represent the road surface, but because of topology restrictions they are not differentiable and so not amenable to this kind of optimization. Instead, in the style of neural radiance fields, the road surface is represented implicitly. The network is queried at xy points on the ground and asked to predict the height of the ground surface, along with various semantics such as curbs, lane boundaries, road surface, drivable space, and more. Each query yields a 3D point that can be reprojected into all camera views.

A very large number of such queries are made, yielding millions of points that are reprojected into all camera views. Each reprojected point is compared with the image-space segmentation predictions, and jointly optimizing this over all camera views, across both space and time, produces an excellent reconstruction.
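A heavily reduced sketch of this optimization loop: a parametric height function stands in for the implicit network, a single forward-looking pinhole camera stands in for all the camera views, and observed image rows stand in for the segmentation predictions. Every name and number below is an assumption made for illustration.

```python
def surface_height(params, distance):
    # Stand-in for the implicit network: ground height as a linear
    # function of forward distance.
    w0, w1 = params
    return w0 + w1 * distance


def project_row(height, distance, cam_height=1.5):
    # Image row (normalized units, down is positive) of a ground point seen
    # by a camera mounted cam_height above the origin and looking forward.
    return (cam_height - height) / distance


def fit_surface(distances, observed_rows, steps=1000, lr=0.3):
    # Gradient descent on the mean squared reprojection error.
    w0, w1 = 0.0, 0.0
    n = len(distances)
    for _ in range(steps):
        g0 = g1 = 0.0
        for d, row_obs in zip(distances, observed_rows):
            err = project_row(surface_height((w0, w1), d), d) - row_obs
            # d(row)/d(w0) = -1/d and d(row)/d(w1) = -1
            g0 += 2 * err * (-1.0 / d) / n
            g1 += 2 * err * (-1.0) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return w0, w1


distances = [1.0, 2.0, 4.0]
true_params = (0.1, 0.02)  # hypothetical ground truth used to make observations
observed = [project_row(surface_height(true_params, d), d) for d in distances]
fitted = fit_surface(distances, observed)
```

The surface parameters are never observed directly. They are recovered only by making the reprojected points agree with what the camera saw, which is the same principle applied jointly over many cameras and frames.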

The goal is not to build HD maps but only to label the clips through these intersections, so the labels only need to be consistent with the videos that were collected. Humans can then clean up any noise or add additional metadata to make the labels even richer.

In addition to road surfaces, auto-labeling can also reconstruct arbitrary 3D static obstacles as a point cloud. The notable property is the density of the point cloud: it contains points even on textureless surfaces like road surfaces or walls, which makes it useful for annotating arbitrary obstacles seen in the world.

One advantage of doing this offline is the benefit of hindsight. In the car, the network has to guess quantities such as velocity from historical information only. Offline, both the past and the future are available, which makes it possible to produce accurate kinematics for all kinds of scenarios and makes it easier to match and associate tracks.
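The value of hindsight shows up even in the simplest velocity estimate. In the sketch below, a backward difference plays the role of the in-car estimate (past samples only) and a central difference plays the role of the offline estimate (past and future samples). The trajectory is a made-up constant-acceleration example.

```python
def causal_velocity(positions, dt):
    # Backward difference: uses only the current and previous sample.
    return [None] + [
        (positions[i] - positions[i - 1]) / dt for i in range(1, len(positions))
    ]


def hindsight_velocity(positions, dt):
    # Central difference: uses the sample before and the sample after,
    # falling back to a one-sided difference at the ends of the track.
    n = len(positions)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((positions[hi] - positions[lo]) / ((hi - lo) * dt))
    return out


# Position p(t) = t^2 sampled once per second, so the true velocity is 2*t.
positions = [0.0, 1.0, 4.0, 9.0, 16.0]
causal = causal_velocity(positions, dt=1.0)
offline = hindsight_velocity(positions, dt=1.0)
```

At t = 2 the true velocity is 4: the causal estimate lags at 3, while the offline estimate, which can see the next sample, gets it exactly.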

Another advantage is that different tracks can be stitched together even through occlusions, because both the past and the future are known. This matters for the planner, which needs to account for pedestrians even while they are occluded. Combining all of this yields data sets that annotate the road texture, static objects, and moving objects, even through occlusions, with excellent kinematic labels. The labels for cars, pedestrians, and parked cars are smooth, and tracking is consistent.
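As a minimal illustration of labeling through an occlusion, the sketch below fills the missing part of a one-dimensional track by linear interpolation between the last observation before the gap and the first one after it. Linear interpolation is an assumption chosen for brevity, and the real pipeline is certainly more sophisticated.

```python
def fill_occlusions(track):
    """Fill None entries that lie between two known positions by linear
    interpolation. Gaps at the very start or end are left as None."""
    filled = list(track)
    known = [i for i, p in enumerate(track) if p is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            filled[i] = (1 - w) * track[a] + w * track[b]
    return filled


# A pedestrian observed at 0 m and 1 m, occluded for two frames, seen again at 4 m.
track = [0.0, 1.0, None, None, 4.0]
stitched = fill_occlusions(track)
```

This only works offline, because filling the gap requires knowing where the pedestrian reappears.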

The goal is to produce a million labeled clips like this and train multi-camera video networks on such large data sets, so that the network in the car reproduces the same view. The first exploration of this was the remove-the-radar project, which removed the radar in three months. Early on, the network struggled in low-visibility conditions because of a lack of data. To address this, the team asked the fleet for similar clips, and the fleet responded with lots of videos in which snow fell off other vehicles. These clips went through the auto-labeling pipeline, which labeled 10k clips within a week, a job that would have taken several months with humans labeling every single clip.

Finally, the team wanted to get a Cybertruck into a data set for the remove-the-radar project, which is where simulation comes in. Simulation was used to produce the labels: vehicle cuboids with kinematics, depth, surface normals, and segmentation. This lets Andrej ask for a new task and have accurate labels for it produced quickly.

Simulation helps when data is difficult to source, such as a complex scene with a couple and their dog running on the highway while other high-speed cars pass by. It also helps when closed-loop behavior is needed, where the car has to be put in a specific situation or the data depends on the actions taken.

In summary, offline processing on servers offers numerous benefits, including better kinematic labels, better data sources, and faster labeling for training multi-camera video networks. However, further development and testing are needed to ensure the reliability and accuracy of the system's performance in real-world scenarios.

The main objective of sensor simulation is to accurately match what the real cameras see. This involves modeling camera properties such as noise, motion blur, optical distortions, and diffraction patterns. The simulation is used not only for the Autopilot software but also for making hardware decisions such as lens design, camera design, sensor placement, and headlight transmission properties.

To render the visuals realistically, the team has worked on spatiotemporal anti-aliasing, neural rendering techniques, and ray tracing to create realistic lighting and global illumination. The anti-aliasing work keeps the simulation free of jaggies, the aliasing artifacts that would otherwise detract from it.

The simulation team uses real-world assets and locations to create realistic environments. The library holds thousands of assets, along with 2,000 miles of road built across the US, and efficient tooling lets a single artist build several more miles per day. Most of the data used for training is created procedurally using algorithms rather than by artists hand-building scenarios. To improve network performance, machine learning techniques are used to identify the network's failure points and create more data around them.

The team also aims to recreate in simulation any failure that happens to Autopilot, so that Autopilot can be held to the same bar from then on. This involves taking a real clip collected from a car and running it through the auto-labeling pipeline to produce a 3D reconstruction of the scene, along with all the moving objects. Combined with the original visual information, this creates a simulation scenario entirely out of the clip. When Autopilot is replayed on it, Autopilot can do entirely new things, creating new worlds and new outcomes from the original failure.

Neural rendering techniques are also used to make the simulation look even more realistic. The original video clip is recreated as a synthetic simulation, and neural rendering is applied on top of it. The result looks very realistic, almost as if it had been captured by the real cameras.

To date, the team has already used 300 million simulated images and nearly half a billion labels. Milan, who is responsible for the integration of neural networks in the car, then explains how the team scales these operations and builds a label factory. For neural networks running in the car, the main constraints are latency and frame rate, which are needed to get proper estimates of the acceleration and velocity of the surroundings.

At the heart of this is the AI compiler, which maps compute operations from the PyTorch model to dedicated, accelerated pieces of hardware. The team optimizes schedules for throughput while working under severe SRAM constraints. The neural networks run on two engines on the Autopilot computer, one of which outputs control commands to the vehicle while the other serves as an extension of compute. These roles are interchangeable at both the hardware and software level.

To iterate quickly through AI development cycles, the team has been scaling their capacity to evaluate software and neural networks dramatically over the past few years. Today, they run over a million evaluations per week on any code change produced by the team, running on over 3,000 full self-driving computers connected together in a dedicated cluster. Additionally, they have developed debugging tools that help developers iterate through the development of neural networks and compare live outputs from different revisions of the same neural network model.