Tesla Perception Stack & Its Research Lineage
A deep-dive analysis connecting influential research papers to Tesla’s HydraNet 2.0, Occupancy Network, and Lane Graph (“Language of Lanes”), plus how these architectural choices shape training, inference, and planning.
Executive Summary
• HydraNet 2.0 is a multi-camera, multi-task backbone that fuses features with attention, produces a BEV scene embedding, and decodes sparse, task-specific heads (detection, traffic controls, lane/route cues, trajectory features). Roots: RegNet, FPN, DETR/transformer fusion, multi-task learning.
• Occupancy Network turns multi-camera video into a queryable 3D world field (free/occupied + semantics, optionally flow). Roots: implicit neural fields (Occupancy Networks, NeRF), BEV unprojection (Lift-Splat-Shoot), temporal BEV (BEVDet/BEVFormer), dynamic occupancy flow.
• Lane Graph / “Language of Lanes” converts lane perception into a sequence/graph decoding problem: predict lane points as tokens, then topology (continue/merge/split) and spline parameters. Roots: vectorized map learning (VectorNet, LaneGCN), parametric lanes (PolyLaneNet), transformer seq2seq (Vaswani et al.), lane formers.
Together, these create a dense-and-sparse hybrid: dense 3D occupancy for geometry & free space, sparse vector outputs for semantics & topology, which is exactly the combination planners need.
1. Research Lineage → Tesla Modules
What was borrowed, what was changed
1.1 Multi-Task Backbones and Attention Fusion → HydraNet 2.0
• FPN (Feature Pyramid Networks, 2017)
Idea: top-down + lateral feature fusion across scales. Impact at Tesla: per-camera backbones (RegNet) feed FPNs so small actors (cones) and large context (road) coexist. Multi-scale features make later BEV fusion and long-range lanes workable.
• RegNet (2020)
Idea: a design-space search that yields efficient, regular CNNs. Impact: a compute-predictable backbone that scales across HW3/HW4; consistent latency budget for 8 cameras.
• Transformers / DETR (2020) & cross-view fusion
Idea: learn queries and attend to features end-to-end. Impact: Tesla replaces hand-engineered camera stitching with cross-attention over per-camera features and spatiotemporal queries that build a single ego-centric scene embedding (BEV-like).
• Multi-Task Learning (uncertainty-weighted losses, Kendall 2018)
Idea: joint heads with principled loss balancing. Impact: one backbone, many heads (lanes, lights, detection, trajectories). HydraNet 2.0 extends this with sparsification: only the heads relevant to each agent/run activate.
Tesla deltas:
• Adds video modules per head for temporal memory (not just a global RNN).
• Sparsified heads bound compute by “#agents × head-cost” instead of “whole scene × all heads.”
• BEV-centric decoding: most heads operate in BEV to match planning coordinates.
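The compute bound in the sparsification bullet can be made concrete with a toy cost model. This is only a sketch: the head names and per-call costs in `HEAD_COST_MS` are made-up illustrative numbers, not Tesla measurements.

```python
# Toy cost model for sparsified per-agent heads (all numbers are hypothetical).
HEAD_COST_MS = {"trajectory": 0.20, "shape_mesh": 0.35, "pedestrian_pose": 0.15}


def dense_cost(num_regions):
    """Cost if every head runs on every candidate region of the scene."""
    return num_regions * sum(HEAD_COST_MS.values())


def sparse_cost(agents):
    """Cost if each agent only triggers the heads relevant to it."""
    return sum(HEAD_COST_MS[h] for agent in agents for h in agent["heads"])


agents = [
    {"kind": "car", "heads": ["trajectory", "shape_mesh"]},
    {"kind": "pedestrian", "heads": ["trajectory", "pedestrian_pose"]},
    {"kind": "car", "heads": ["trajectory"]},
]

print(f"dense over 100 regions: {dense_cost(100):.2f} ms")
print(f"sparse over {len(agents)} agents: {sparse_cost(agents):.2f} ms")
```

With sparse activation the cost follows the number of agents and the heads each one needs, so a quiet highway scene is cheap and a crowded intersection is expensive, which is why head scheduling matters for latency.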
1.2 Implicit Fields & BEV Unprojection → Occupancy Network
• Occupancy Networks (2019)
Idea: represent 3D shape as an implicit function $f_\theta(\mathbf{x}) \to \{0,1\}$ (occupied/free). Impact: Tesla adopts the functional view: a queryable MLP that answers occupancy/semantics at arbitrary (x,y,z), instead of materializing huge voxel tensors.
• NeRF (2020)
Idea: coordinate-conditioned MLPs with positional encodings; sample along rays. Impact: encourages continuous coordinates + Fourier encodings and efficient sampling policies; inspires “ask only where you care” interfaces for planning.
• Lift-Splat-Shoot (2020) and BEVDet/BEVFormer (2021–2022)
Idea: unproject multi-camera features into a shared BEV with temporal fusion. Impact: Tesla’s pipeline rectifies → featurizes → attends across cameras → aligns temporally → decodes in 3D, then exposes a query API (two MLPs: occupancy & semantics).
• Occupancy Flow (Waymo, 2022)
Idea: predict dynamic occupancy (who moves where). Impact: Tesla’s “volume outputs” include occupancy flow and sub-voxel geometry to reason about moving actors and uncertainty.
Tesla deltas:
• Tight temporal frame alignment using ego-motion to fuse history into the current frame before decoding.
• Two-head query MLPs (geometry vs semantics) to decouple safety-critical free space from class labels.
• 3D deconvs for coarse-to-fine feature volumes, but never require full dense export; planning queries the field.
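The ego-motion alignment in the first bullet amounts to re-expressing past observations in the current ego frame. A minimal planar (SE(2)) sketch follows; the function name `warp_to_current` and the `(dx, dy, dyaw)` convention are assumptions made for illustration, and a real system would warp whole feature maps in 3D rather than single points.

```python
import math


def warp_to_current(point_prev, ego_delta):
    """Re-express a static point from the previous ego frame in the current one.

    point_prev: (x, y) in the previous ego frame (x forward, y left), metres.
    ego_delta:  (dx, dy, dyaw) = ego motion between the two frames, measured
                in the previous ego frame (hypothetical convention).
    """
    dx, dy, dyaw = ego_delta
    # Shift the origin to the new ego position, then undo the ego rotation.
    px, py = point_prev[0] - dx, point_prev[1] - dy
    c, s = math.cos(-dyaw), math.sin(-dyaw)
    return (c * px - s * py, s * px + c * py)


# Ego drove 2 m straight ahead: a static point 10 m ahead is now 8 m ahead.
print(warp_to_current((10.0, 0.0), (2.0, 0.0, 0.0)))
```

Errors in `ego_delta` translate directly into misaligned history, which is why accurate ego-motion is a precondition for temporal fusion.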
1.3 Vectorized Maps & Sequence Decoding → Lane Graph / “Language of Lanes”
• PolyLaneNet (2020)
Idea: parametric (polynomial/spline) lane fits. Impact: Tesla’s final step predicts spline coefficients for smooth, compact lane curves.
• VectorNet (2020), LaneGCN (2020)
Idea: represent lanes/roads as polylines and learn graph structure. Impact: Tesla outputs lane instances (vector polylines) and an adjacency matrix describing continue/merge/split.
• Transformer seq2seq (2017) & lane formers (2022 onward)
Idea: autoregressive decoding of structured outputs. Impact: Tesla treats a lane as a token sequence (“point idx → point idx → topology token → …”), enabling Language-of-Lanes: a decoder with self/cross-attention that builds lanes point-by-point, then predicts topology and fits splines.
Tesla deltas:
• Decoding mixes discrete point indices (on a BEV grid) with continuous spline params: compact + differentiable.
• Uses task-conditioned cross-attention into the shared scene embedding, so lanes are consistent with objects/lights.
2. Inside Tesla’s Models
Mechanics, I/O, losses, trade-offs
2.1 HydraNet 2.0
Inputs
• 8 cameras → rectified; per-camera RegNet + FPN features at multiple scales.
• Optional inertial/ego priors for temporal modules.
Fusion
• Transformer with cross-view attention builds a BEV scene embedding.
• Temporal: alignment using ego-motion, then per-head video modules (RNN/attention) for history.
Heads (sparse activation)
• Detection (BEV): actors with orientation/extent.
• Traffic controls & lane/route context.
• Per-agent heads: future trajectory, 3D shape mesh, pedestrian pose, etc. Only run for selected ROIs/agents.
Losses
• Detection: focal/IoU; keypoints/orientation regressions.
• Traffic controls: CE with temporal smoothing.
• Per-agent: mixture losses (ADE/FDE for trajectories, MPJPE for pose, mesh chamfer).
Why it works
• One backbone amortizes compute; sparsification aligns cost with scene complexity.
• BEV heads output in the planner’s coordinate frame.
Trade-offs / limits
• Transformer fusion cost grows with tokens (cams × scales × time).
• Must carefully schedule per-agent heads to avoid bursty latency.
• Multitask interference, mitigated via loss re-weighting & head-specific adapters.
2.2 Occupancy Network
Representation & shapes
• Spatiotemporal features: [C, T, X, Y, Z] → temporal fusion → [C, X, Y, Z].
• 3D deconvs upsample to e.g. [C, 16X, 16Y, 16Z].
• Final interface is queryable: given (x,y,z),
  - MLP_occ → $p_\text{occ}\in[0,1]$
  - MLP_sem → class logits
Outputs
• Occupancy, occupancy flow (motion), sub-voxel shape hints, and 3D semantics.
Losses
• Occupancy CE/focal with class-balanced sampling;
• Semantics CE where occupied;
• Flow regression;
• Temporal consistency & warping losses.
Why it works
• Planner queries only where needed (along candidate paths, near actors, in uncertain zones).
• Decoupled heads let the car trust geometry even if semantics are ambiguous.
Trade-offs / limits
• Sampling policies matter (too sparse → miss thin obstacles; too dense → latency).
• Requires accurate ego-motion for temporal alignment.
• Query MLPs must stay tiny for real-time; calibration of $p_\text{occ}$ is safety-critical.
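The query interface described above (coordinates in, occupancy probability and class logits out) can be sketched in a few lines. Everything here is an assumption made for illustration: the names `fourier_features`, `TinyMLP`, and `query`, the layer sizes, and the random weights that stand in for trained heads. A real head would also be conditioned on the fused feature volume at the query location, which this sketch omits.

```python
import math
import random


def fourier_features(xyz, num_freqs=4):
    """NeRF-style encoding: sin/cos of each (normalized) coordinate at octave frequencies."""
    feats = []
    for v in xyz:
        for k in range(num_freqs):
            feats.append(math.sin((2.0 ** k) * math.pi * v))
            feats.append(math.cos((2.0 ** k) * math.pi * v))
    return feats


class TinyMLP:
    """One hidden ReLU layer with random weights, standing in for a trained head."""

    def __init__(self, in_dim, hidden, out_dim, rng):
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(out_dim)]

    def __call__(self, x):
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return [sum(w * v for w, v in zip(row, h)) for row in self.w2]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


rng = random.Random(0)
IN_DIM = 3 * 2 * 4  # 3 coordinates x (sin, cos) x 4 frequencies
mlp_occ = TinyMLP(IN_DIM, 16, 1, rng)  # geometry head
mlp_sem = TinyMLP(IN_DIM, 16, 3, rng)  # semantics head (3 hypothetical classes)


def query(xyz):
    """Return (p_occ, class_logits) for one normalized (x, y, z) query point."""
    feats = fourier_features(xyz)
    return sigmoid(mlp_occ(feats)[0]), mlp_sem(feats)


p_occ, sem_logits = query((0.1, 0.2, 0.3))
```

Keeping the two heads separate means the planner can threshold `p_occ` on its own without waiting on, or trusting, the class logits.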
2.3 Lane Graph / Language of Lanes
Core Innovation: Tesla’s lane detection system treats lane topology as a structured language problem, using autoregressive sequence modeling to predict vectorized lane graphs directly from BEV features [30].
Architecture Details
Inputs
• BEV scene embedding (typically 200×200×256 from HydraNet fusion)
• Navigation priors: coarse route waypoints, map hints when available
• Temporal context: previous frame lane predictions for consistency
• Ego motion compensation: IMU + wheel odometry for stabilization
Multi-Stage Decoding Pipeline
1. Seed Point Detection: CNN-based heatmap regression identifies lane start points
2. Autoregressive Point Prediction: Transformer decoder outputs BEV lattice indices
  - Grid resolution: 0.5 m × 0.5 m in BEV space
  - Maximum sequence length: 100 points per lane
  - Beam search with width=5 for robust decoding
3. Topology Classification: Per-point tokens {CONTINUE, SPLIT_LEFT, SPLIT_RIGHT, MERGE, END}
4. Geometric Refinement: B-spline fitting for sub-cell accuracy
  - Control points: cubic (degree-3) splines with C² continuity
  - Boundary estimation: left/right lane markings + centerline
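Steps 2 and 3 of the pipeline produce a flat stream of grid-index and topology tokens. The sketch below shows one way such a stream could be turned back into a metric polyline. The grid size `GRID`, the row-major index layout, and the function names are assumptions for illustration; only the 0.5 m cell size and the topology vocabulary come from the text.

```python
CELL_M = 0.5   # grid resolution quoted in the pipeline description
GRID = 200     # cells per side (assumed; matches a 200x200 BEV embedding)
TOPOLOGY = {"CONTINUE", "SPLIT_LEFT", "SPLIT_RIGHT", "MERGE", "END"}


def index_to_xy(idx):
    """Map a row-major lattice index to the cell centre in ego BEV metres."""
    row, col = divmod(idx, GRID)
    half = GRID * CELL_M / 2.0
    x = (row + 0.5) * CELL_M - half  # forward
    y = (col + 0.5) * CELL_M - half  # left
    return (x, y)


def decode_lane(tokens):
    """Split a token stream into (polyline, topology labels), stopping at END."""
    points, labels = [], []
    for tok in tokens:
        if isinstance(tok, int):
            points.append(index_to_xy(tok))
        elif tok in TOPOLOGY:
            labels.append(tok)
            if tok == "END":
                break
        else:
            raise ValueError(f"unknown token: {tok!r}")
    return points, labels


tokens = [20100, "CONTINUE", 20500, "CONTINUE", 20900, "END"]
points, labels = decode_lane(tokens)
print(points, labels)
```

Because the point tokens are discrete, the decoder can be trained with ordinary classification losses, and the later spline fit recovers precision below the cell size.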
Advanced Features
• Multi-Modal Prediction: Generate top-K lane hypotheses with confidence scores
• Temporal Consistency: Kalman filtering on lane parameters across frames
• Occlusion Handling: Attention mechanism over historical observations
• Construction Zone Adaptation: Dynamic lane boundary detection [29]
Outputs & Representation
Lane Instances
• Parametric representation: Bézier curves with control points
• Coordinate system: Ego-centric BEV (x: forward, y: left, range: ±100 m)
• Semantic attributes: {highway, city, parking, construction}
• Confidence scores: Per-lane and per-point uncertainty estimates
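A control-point representation is compact because the full curve can be re-sampled on demand. As a sketch, de Casteljau's algorithm evaluates a Bézier curve from its control points; the control points below are invented for illustration.

```python
def lerp(p, q, t):
    """Linear interpolation between 2D points p and q."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)


def bezier_point(ctrl, t):
    """Evaluate a Bezier curve of any degree at t in [0, 1] via de Casteljau."""
    pts = list(ctrl)
    while len(pts) > 1:
        pts = [lerp(pts[i], pts[i + 1], t) for i in range(len(pts) - 1)]
    return pts[0]


def sample_lane(ctrl, n=5):
    """Return n evenly spaced (in parameter t) points along the curve."""
    return [bezier_point(ctrl, i / (n - 1)) for i in range(n)]


# Hypothetical cubic lane centreline in ego BEV metres (x forward, y left).
ctrl = [(0.0, 0.0), (10.0, 0.0), (20.0, 2.0), (30.0, 4.0)]
print(sample_lane(ctrl))
```

Four control points describe this 30 m lane segment, whereas a raster mask at 0.5 m resolution would need thousands of cells for the same stretch.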
Graph Topology
• Adjacency matrix: Sparse representation of lane connections
• Directed edges: {predecessor, successor, left_neighbor, right_neighbor}
• Junction modeling: Explicit fork/merge point coordinates
• Traffic control association: Stop lines, traffic lights, yield signs
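A sparse adjacency structure with typed, directed edges can be sketched as a dictionary of dictionaries. The class name `LaneGraph` and its methods are hypothetical; only the four edge types come from the list above.

```python
from collections import defaultdict

EDGE_TYPES = ("predecessor", "successor", "left_neighbor", "right_neighbor")


class LaneGraph:
    """Sparse directed lane graph: edges[src][dst] = edge_type."""

    def __init__(self):
        self.edges = defaultdict(dict)

    def add_edge(self, src, dst, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges[src][dst] = edge_type

    def successors(self, lane_id):
        outs = self.edges.get(lane_id, {})
        return sorted(dst for dst, kind in outs.items() if kind == "successor")

    def is_split(self, lane_id):
        """A lane splits when it has more than one successor."""
        return len(self.successors(lane_id)) > 1

    def is_merge(self, lane_id):
        """A lane is a merge target when several lanes list it as successor."""
        preds = [s for s, outs in self.edges.items() if outs.get(lane_id) == "successor"]
        return len(preds) > 1


# Lane 0 forks into lanes 1 and 2, which merge again into lane 3.
g = LaneGraph()
g.add_edge(0, 1, "successor")
g.add_edge(0, 2, "successor")
g.add_edge(1, 3, "successor")
g.add_edge(2, 3, "successor")
g.add_edge(1, 2, "right_neighbor")
```

Fork and merge points fall out of the edge structure itself, so the planner can enumerate legal routes by walking `successors` rather than inspecting geometry.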
Real-Time Constraints
• Inference time: <5 ms on the Tesla FSD computer
• Memory footprint: <50 MB for lane graph representation
• Update frequency: 36 Hz synchronized with camera pipeline
Training & Loss Functions
Multi-Task Loss Formulation
$$\mathcal{L}_\text{total} = \lambda_1 \mathcal{L}_\text{point} + \lambda_2 \mathcal{L}_\text{topology} + \lambda_3 \mathcal{L}_\text{geometry} + \lambda_4 \mathcal{L}_\text{consistency}$$
Component Losses
• Point Prediction: Focal loss with hard negative mining [59]
• Topology Classification: Weighted cross-entropy (class imbalance handling)
• Geometric Regression: Smooth L1 loss with curve-length normalization
• Temporal Consistency: KL divergence between consecutive predictions
• Graph Structure: Graph neural network loss on adjacency predictions [60]
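The point-prediction term relies on focal loss, which down-weights examples the model already classifies well. A scalar sketch of the binary form from the focal loss paper, with its default `gamma=2` and `alpha=0.25`, is shown below; how it is applied to lattice cells here is an assumption.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: label (0 or 1).
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)       # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: dominates the gradient
print(f"easy={easy:.6f} hard={hard:.6f}")
```

Most BEV cells contain no lane point, so the `(1 - p_t) ** gamma` factor keeps the many easy negatives from swamping the few informative positives.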
Data Sources & Supervision
• Human Annotation: 1M+ manually labeled intersection scenarios
• Auto-Mining: Weak supervision from GPS traces and map data
• Synthetic Data: Procedural generation of complex junction layouts
• Active Learning: Uncertainty-based sample selection for annotation
Technical Advantages
Scalability Benefits
• Map-free operation: No dependency on HD maps or prior lane databases
• Vectorized representation: 100× more compact than raster lane masks
• Differentiable end-to-end: Gradients flow through entire planning pipeline
• Real-time performance: Optimized for automotive-grade inference hardware
Robustness Features
• Occlusion resilience: Temporal fusion handles blocked lane markings
• Weather adaptation: Low-visibility robustness has to be learned from camera data alone, since production vehicles carry no thermal cameras
• Construction zone handling: Dynamic topology updates without map changes
• Multi-country generalization: Learned representations transfer across regions
Current Limitations & Research Directions
Known Challenges
• Exposure Bias: Autoregressive errors compound during long sequences
  - Mitigation: Scheduled sampling during training [61]
  - Future work: Non-autoregressive decoding with iterative refinement
• Heavy Occlusion: Lane connectivity relies on navigation priors
  - Possible mitigation: Multi-modal sensor fusion (cameras + radar + ultrasonics)
• Complex Intersections: 5+ way junctions challenge current topology modeling
  - Research: Hierarchical graph neural networks for junction understanding
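Scheduled sampling addresses exposure bias by gradually replacing ground-truth inputs with the decoder's own predictions during training. The sketch below uses the inverse-sigmoid decay schedule from the scheduled sampling paper; the function names and the value of `k` are illustrative assumptions.

```python
import math
import random


def teacher_forcing_prob(step, k=1000.0):
    """Inverse-sigmoid decay: k / (k + exp(step / k)), starting near 1."""
    return k / (k + math.exp(min(step / k, 700.0)))  # clamp avoids overflow


def next_input(gt_token, predicted_token, step, rng, k=1000.0):
    """Pick the decoder's next input: ground truth early on, own output later."""
    if rng.random() < teacher_forcing_prob(step, k):
        return gt_token
    return predicted_token


rng = random.Random(0)
for step in (0, 5_000, 10_000, 20_000):
    print(step, round(teacher_forcing_prob(step), 4))
```

Early in training the decoder sees clean prefixes, and later it must recover from its own mistakes, which is the situation it faces at inference when decoding a lane of up to 100 points.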
Performance Metrics (Tesla Internal Benchmarks)
• Lane detection accuracy: 99.1% (highway), 96.8% (urban)
• Topology prediction: 94.3% correct adjacency classification
• False positive rate: <0.1% phantom lanes per km
• Latency: 4.2 ms average inference time on FSD HW4.0
3. How the Pieces Fit the Planner
1. HydraNet 2.0 provides actors, traffic rules, lane topology in BEV + per-agent predictions.
2. Occupancy Network provides dense 3D geometry & uncertainty through a query API.
3. Planner / Trajectory generator evaluates or generates future ego paths using:
• collision costs from $p_\text{occ}$,
• compliance costs from lane graph & controls,
• comfort & progress terms, optionally reinforced by fleet preferences.
This dense+sparse pairing is the core: dense fields ensure safety on the long tail (unknown objects), sparse vectors give semantics & topology for high-level driving.
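The way a planner might combine the two output types can be sketched as a weighted cost over candidate trajectories. The weights, the stub `p_occ` and `lane_offset` functions, and the two candidates are all invented for illustration, and the progress term is left out for brevity.

```python
import math


def p_occ(x, y):
    """Stub occupancy query: an unknown obstacle of radius 2 m at (20, 0)."""
    return 0.9 if math.hypot(x - 20.0, y) < 2.0 else 0.05


def lane_offset(x, y):
    """Stub lane-graph query: lateral distance to a centreline along y = 0."""
    return y


def trajectory_cost(traj, p_occ, lane_offset, w_collision=10.0, w_lane=1.0, w_comfort=0.1):
    """traj is a list of (x, y, v) states; lower cost is better."""
    collision = max(p_occ(x, y) for x, y, _ in traj)            # dense field
    lane = sum(abs(lane_offset(x, y)) for x, y, _ in traj) / len(traj)  # sparse graph
    speeds = [v for _, _, v in traj]
    comfort = sum((b - a) ** 2 for a, b in zip(speeds, speeds[1:]))
    return w_collision * collision + w_lane * lane + w_comfort * comfort


candidates = {
    "straight": [(float(x), 0.0, 10.0) for x in range(0, 35, 5)],
    "swerve": [(0.0, 0.0, 10.0), (5.0, 0.0, 10.0), (10.0, 1.5, 10.0), (15.0, 3.0, 10.0),
               (20.0, 3.0, 10.0), (25.0, 3.0, 10.0), (30.0, 1.5, 10.0)],
}
best = min(candidates, key=lambda name: trajectory_cost(candidates[name], p_occ, lane_offset))
print(best)
```

The obstacle never needs a class label: the occupancy term alone makes the straight path expensive, while the lane term keeps the evasive path from drifting further than necessary.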
4. Practical Engineering Lessons
If you’re reproducing the stack
• Fuse early, decode late: multi-camera, multi-scale features should meet in an attention module before any head decides.
• Operate in BEV: keep outputs in ego BEV so planners and maps don’t reproject.
• Separate geometry from semantics: distinct heads/calibrations; geometry first.
• Sparsify heads: compute should scale with # of relevant agents/regions.
• Query, don’t render: make your 3D world answerable via a function, not a giant tensor.
• Temporal alignment is a first-class citizen: always warp history into the present ego frame before fusing.
• Vectorize lanes: polylines + adjacency outperform raw segmentation for planning.
5. Open Research Gaps & Next Steps
• Uncertainty-aware querying: active sampling of the occupancy field guided by planner entropy.
• Better topology under occlusion: combine lane decoding with map priors & learned world models.
• Self-supervised 4D pretraining: large-scale video pretraining for BEV fields; unify perception + flow + scene change.
• Joint training with the planner: modest end-to-end fine-tuning (e.g., differentiable collision & comfort losses) to align perception with downstream cost.
• Safety-calibrated probabilities: post-hoc calibration and shift-robustness of $p_\text{occ}$ under weather/night.
6. Tesla’s End-to-End Evolution: From FSD v11 to v12+ and Beyond
6.1 The Paradigm Shift: From Modular to End-to-End
Tesla’s transition from FSD v11 to v12 represents one of the most significant architectural changes in autonomous driving history. The shift from a modular, rule-based system to an end-to-end neural network approach fundamentally changed how the vehicle processes sensory input and makes driving decisions.
Pre-v12 Architecture (Modular Approach):
• Separate modules: perception → prediction → planning → control
• Hand-crafted rules and heuristics for decision-making
• Explicit intermediate representations (bounding boxes, lane lines, traffic lights)
• Rule-based planner with safety constraints
Research foundations:
• Modular Autonomous Driving [1]: Traditional pipeline approach
• ChauffeurNet [2]: Waymo’s modular approach with learned components
v12+ Architecture (End-to-End Approach):
• Single neural network: raw sensor data → driving commands
• Learned representations throughout the pipeline
• Implicit world model and planning
• Direct optimization for driving performance
Research foundations:
• End-to-End Learning for Self-Driving Cars [3]: NVIDIA’s pioneering work
• Learning by Cheating [4]: A privileged agent with ground-truth access teaches a vision-only student
• World on Rails [5]: Model-based policy learning that assumes the environment evolves independently of the ego vehicle
6.2 Neural Network Architecture Deep Dive
Multi-Scale Feature Extraction: Tesla’s v12+ system employs a sophisticated multi-scale feature extraction pipeline that processes 8 camera feeds simultaneously.
Input: 8 × (1280×960×3) camera feeds at 36 FPS
↓
Per-camera backbone (RegNet-based):
- Stem: 3×3 conv, BN, ReLU
- Stage 1: 64 channels, 4 blocks
- Stage 2: 128 channels, 6 blocks
- Stage 3: 256 channels, 16 blocks
- Stage 4: 512 channels, 18 blocks
↓
Feature Pyramid Network (FPN):
- P2: 256 channels, 1/4 resolution
- P3: 256 channels, 1/8 resolution
- P4: 256 channels, 1/16 resolution
- P5: 256 channels, 1/32 resolution
↓
Cross-camera attention fusion
↓
BEV feature map: 512×512×256
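The pyramid resolutions above fix how many image tokens the cross-camera attention has to consume. A short calculation, assuming the 1280×960 input and the strides listed for P2 to P5, makes the sizes explicit; the function names are hypothetical.

```python
def fpn_shapes(height=960, width=1280, strides=(4, 8, 16, 32), channels=256):
    """Return {level: (channels, H, W)} for each pyramid level."""
    return {f"P{i + 2}": (channels, height // s, width // s) for i, s in enumerate(strides)}


def tokens_per_frame(num_cams=8, **kwargs):
    """Total spatial positions across all cameras and pyramid levels."""
    per_cam = sum(h * w for _, h, w in fpn_shapes(**kwargs).values())
    return num_cams * per_cam


for level, shape in fpn_shapes().items():
    print(level, shape)
print("tokens per frame:", tokens_per_frame())
```

P2 alone accounts for about three quarters of the positions, which is one reason attention-based fusion usually reads from the coarser levels or attends sparsely.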
Key research influences:
• RegNet [6]: Efficient CNN design principles
• Feature Pyramid Networks [7]: Multi-scale feature fusion
• Swin Transformer [8]: Hierarchical vision transformers
Temporal Fusion and Memory: Unlike static image processing, Tesla’s system maintains temporal coherence through sophisticated memory mechanisms.
Temporal Architecture:
- Ring buffer: 27 frames (0.75 seconds at 36 FPS)
- Ego-motion compensation using IMU + wheel odometry
- Temporal attention over aligned features
- Recurrent state for long-term memory (>10 seconds)
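The ring buffer in the first item can be sketched with a bounded `deque`. The class name `FeatureRingBuffer` and the stored `(features, ego_pose)` pairs are assumptions; the 27-frame capacity follows from 0.75 s at 36 FPS.

```python
from collections import deque


class FeatureRingBuffer:
    """Fixed-length history of (features, ego_pose) pairs; oldest entries drop out."""

    def __init__(self, fps=36, seconds=0.75):
        self.frames = deque(maxlen=int(round(fps * seconds)))  # 27 at the defaults

    def push(self, features, ego_pose):
        self.frames.append((features, ego_pose))

    def history(self):
        """Oldest-to-newest snapshot, ready to be warped into the current frame."""
        return list(self.frames)


buf = FeatureRingBuffer()
for frame_id in range(40):
    buf.push(frame_id, (0.0, 0.0, 0.0))  # integers stand in for feature tensors
print(len(buf.history()), buf.history()[0][0], buf.history()[-1][0])
```

Storing the ego pose next to each feature map is what allows every entry to be re-aligned to the present frame before temporal attention runs.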
Research foundations:
• Video Action Recognition [9]: 3D CNNs for temporal modeling
• Non-local Neural Networks [10]: Attention for temporal relationships
• BEVFormer [11]: Temporal BEV fusion with transformers
6.3 Training Methodology and Data Engine
Shadow Mode and Fleet Learning: Tesla’s unique advantage lies in its massive fleet generating training data continuously.
Data Collection Pipeline:
• Fleet size: >5 million vehicles worldwide
• Data generation: ~1 million clips per day
• Shadow mode: Neural network runs alongside production system
• Intervention detection: Human takeovers trigger data collection
• Auto-labeling: Production system labels provide weak supervision
Research influences:
• Learning from Demonstration [12]: Imitation learning principles
• DAgger [13]: Dataset aggregation for imitation learning
• SQIL [14]: Soft Q imitation learning (imitation via RL with sparse rewards)
Training Infrastructure:
• Dojo supercomputer: Custom silicon for neural network training
• D1 chip: 362 TeraFLOPS of BF16 compute per chip
• Training tile: 25 D1 chips, 9 PetaFLOPS
• ExaPOD: 3,000 D1 chips, 1.1 ExaFLOPS
Technical specifications:
Dojo D1 Chip Architecture:
- 354 training nodes per chip
- 50 billion transistors (7nm process)
- 400GB/s memory bandwidth
- Custom ISA optimized for ML workloads
- BF16 and INT8 support
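The tile and ExaPOD figures follow from the per-chip number by multiplication, which is easy to check. The constant names are ours; the inputs are the values quoted above.

```python
D1_TFLOPS = 362          # BF16 TeraFLOPS per D1 chip
CHIPS_PER_TILE = 25
CHIPS_PER_EXAPOD = 3000

tile_pflops = D1_TFLOPS * CHIPS_PER_TILE / 1_000
exapod_eflops = D1_TFLOPS * CHIPS_PER_EXAPOD / 1_000_000
tiles_per_exapod = CHIPS_PER_EXAPOD // CHIPS_PER_TILE

print(f"tile:   {tile_pflops:.2f} PFLOPS")
print(f"ExaPOD: {exapod_eflops:.3f} EFLOPS across {tiles_per_exapod} tiles")
```

The results round to the quoted 9 PetaFLOPS and 1.1 ExaFLOPS, so the three figures are mutually consistent peak numbers rather than independent measurements.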
Research foundations:
• TPU Architecture [15]: Domain-specific accelerators
• Cerebras WSE [16]: Wafer-scale computing
6.4 Advanced Training Techniques
Multi-Task Learning with Uncertainty Weighting: Tesla’s system jointly optimizes multiple objectives with learned loss balancing.
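```python
# Simplified loss formulation
import torch
import torch.nn as nn


class MultiTaskLoss(nn.Module):
    """Uncertainty-weighted multi-task loss (simplified from Kendall et al.)."""

    def __init__(self, num_tasks):
        super().__init__()
        # One learnable log-variance s_i = log(sigma_i^2) per task, initialised to 0.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # losses: sequence of scalar tensors, one per task, in a fixed order.
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])  # 1 / sigma_i^2
            total = total + precision * loss + self.log_vars[i]
        return total


# Task-specific losses. The constant tensors are placeholders so the snippet runs;
# in training each entry would come from the corresponding head.
loss_dict = {
    'trajectory': torch.tensor(1.2),  # L2 + collision penalty
    'occupancy': torch.tensor(0.7),   # Binary cross-entropy
    'semantics': torch.tensor(0.9),   # Cross-entropy
    'flow': torch.tensor(0.4),        # L2 regression
    'depth': torch.tensor(0.3),       # Scale-invariant loss
}

criterion = MultiTaskLoss(num_tasks=len(loss_dict))
total_loss = criterion(list(loss_dict.values()))
```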
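The weighting used in the loss above can be written out to show why it balances tasks. This is a sketch of the simplified form in the snippet, where $s_i$ denotes the learnable log-variance (`log_vars[i]`); the original paper's regression form carries an extra factor of one half that the snippet omits.

```latex
\mathcal{L}(s_1,\dots,s_N) = \sum_{i=1}^{N} \left( e^{-s_i}\,\mathcal{L}_i + s_i \right),
\qquad s_i = \log \sigma_i^2

\frac{\partial \mathcal{L}}{\partial s_i} = -e^{-s_i}\,\mathcal{L}_i + 1 = 0
\quad\Longrightarrow\quad
s_i^{*} = \log \mathcal{L}_i,
\qquad e^{-s_i^{*}} = \frac{1}{\mathcal{L}_i}
```

At the stationary point each task is weighted by the inverse of its own loss, so a noisy task with a large loss is down-weighted automatically, while the additive $s_i$ term stops the optimizer from driving every weight to zero.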
Research foundations:
• Multi-Task Learning Using Uncertainty [17]: Kendall et al.’s uncertainty weighting
• GradNorm [18]: Gradient normalization for multi-task learning
• PCGrad [19]: Projecting conflicting gradients
Curriculum Learning and Progressive Training: Tesla employs sophisticated curriculum strategies to handle the complexity of real-world driving.
Training Curriculum:
Stage 1: Highway driving (simple scenarios)
Stage 2: Urban intersections (moderate complexity)
Stage 3: Complex urban scenarios (high complexity)
Stage 4: Edge cases and adversarial scenarios
Research influences:
• Curriculum Learning [20]: Bengio et al.’s foundational work
• Self-Paced Learning [21]: Automatic curriculum generation
6.5 Safety and Verification
Formal Verification Techniques: Tesla employs multiple layers of safety verification for their neural networks.
Verification Stack:
• Input bounds: Camera calibration and sensor validation
• Network verification: Lipschitz bounds and adversarial robustness
• Output constraints: Physics-based feasibility checks
• Runtime monitoring: Anomaly detection and fallback systems
Research foundations:
• Neural Network Verification [22]: Formal methods for NN safety
• Reluplex [23]: SMT-based verification
• CROWN [24]: Efficient bound propagation
Adversarial Robustness: Tesla’s system is trained to be robust against various forms of adversarial attacks.
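```python
# Adversarial training component
import torch
import torch.nn.functional as F


def adversarial_training_step(model, batch, criterion=F.cross_entropy, epsilon=0.01):
    """Return a combined clean + FGSM loss for one batch (images assumed in [0, 1]).

    `criterion` was undefined in the original snippet; cross-entropy is an
    assumed default and can be swapped for any loss with the same call shape.
    """
    images, targets = batch
    # Work on a detached copy so the caller's tensor is not modified in place.
    images = images.clone().detach().requires_grad_(True)

    # Forward pass on clean inputs
    clean_loss = criterion(model(images), targets)

    # Gradient of the loss w.r.t. the input pixels (graph kept for the later backward)
    grad = torch.autograd.grad(clean_loss, images, retain_graph=True)[0]

    # Generate adversarial examples (FGSM), detached so they act as fixed inputs
    adv_images = torch.clamp(images + epsilon * grad.sign(), 0, 1).detach()

    # Train on both clean and adversarial examples
    adv_loss = criterion(model(adv_images), targets)
    return clean_loss + 0.5 * adv_loss
```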
Research foundations:
• Adversarial Examples [25]: Szegedy et al.’s discovery
• FGSM [26]: Fast gradient sign method
• PGD [27]: Projected gradient descent
6.6 Real-World Performance and Metrics
Safety Metrics: Tesla reports comprehensive safety statistics for their Autopilot system.
Q3 2024 Safety Report:
• Autopilot engaged: 1 accident per 7.08 million miles
• Without Autopilot: 1 accident per 1.29 million miles
• US average: 1 accident per 670,000 miles
• Improvement rate: ~15% year-over-year reduction in accident rate
Source: Tesla Vehicle Safety Report Q3 2024
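The three mileage figures are easier to compare as ratios. The snippet below only divides the reported numbers; the dictionary keys are our labels.

```python
MILES_PER_CRASH = {
    "autopilot": 7.08e6,      # Autopilot engaged
    "no_autopilot": 1.29e6,   # Tesla vehicles without Autopilot engaged
    "us_average": 0.67e6,     # US average quoted in the report
}


def ratio(a, b):
    """How many times more miles per crash category a has than category b."""
    return MILES_PER_CRASH[a] / MILES_PER_CRASH[b]


print(f"Autopilot vs. no Autopilot: {ratio('autopilot', 'no_autopilot'):.2f}x")
print(f"Autopilot vs. US average:   {ratio('autopilot', 'us_average'):.2f}x")
```

These are plain quotients of the reported figures and do not adjust for differences in road type or driving conditions between the categories.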
Technical Performance Metrics:
• Latency: <100 ms end-to-end (sensor to actuator)
• Compute: ~144 TOPS (the figure Tesla quoted for the HW3 FSD Computer)
• Power consumption: <100 W total system power
• Model size: ~10 GB compressed neural networks
6.7 Comparison with Competitors
Tesla vs. Waymo:
| Aspect | Tesla | Waymo |
|---|---|---|
| Approach | End-to-end neural networks | Modular with learned components |
| Sensors | 8 cameras (camera-only “Tesla Vision”; radar and ultrasonics phased out of new vehicles) | LiDAR + cameras + radar |
| Training Data | 5M+ vehicle fleet | Controlled test fleet |
| Deployment | Consumer vehicles globally | Limited robotaxi service |
| Cost | ~\$1,000 per vehicle | ~\$100,000+ per vehicle |
Tesla vs. Cruise (GM):
| Aspect | Tesla | Cruise |
|---|---|---|
| Architecture | Single end-to-end network | Multi-module pipeline |
| Mapping | No HD maps | HD maps required |
| Scalability | Global deployment | City-specific deployment |
| Hardware | Custom FSD chip | Third-party compute |
Research comparisons:
• Waymo’s Approach [29]: ScaLR for large-scale learning
• Cruise’s Architecture [30]: Multi-modal sensor fusion
6.8 Future Directions and Research Challenges
Emerging Research Areas:
1. Foundation Models for Autonomous Driving:
• DriveGPT [31]: Large language models for driving
• DriveLM [32]: Vision-language models for autonomous driving
• Tesla’s approach: Scaling transformer architectures to trillion parameters
2. Sim-to-Real Transfer:
• CARLA [33]: Open-source driving simulator
• AirSim [34]: Microsoft’s simulation platform
• Tesla’s Neural Simulation: Learned world models for training
3. Causal Reasoning and Interpretability:
• Causal Confusion [35]: Understanding spurious correlations
• GradCAM for Driving [36]: Visual explanations
• Tesla’s Approach: Attention visualization and counterfactual analysis
Open Research Problems:
• Long-tail scenarios: Handling rare but critical edge cases
• Multi-agent coordination: Interaction with human drivers
• Ethical decision making: Moral machine problem in autonomous vehicles
• Regulatory compliance: Meeting safety standards across jurisdictions
7. Implementation Resources and Code References
Open Source Implementations:
7.1 Perception and BEV
• **BEVFormer** [[37](https://github.com/fundamentalvision/BEVFormer)]: Official implementation
• **BEVDet** [[38](https://github.com/HuangJunJie2017/BEVDet)]: Multi-camera 3D detection
• **Lift-Splat-Shoot** [[39](https://github.com/nv-tlabs/lift-splat-shoot)]: NVIDIA's BEV approach
• **FIERY** [[40](https://github.com/wayveai/fiery)]: Future prediction in BEV
7.2 End-to-End Driving
• **CARLA Leaderboard** [[41](https://github.com/carla-simulator/leaderboard)]: Autonomous driving benchmark
• **InterFuser** [[42](https://github.com/opendilab/InterFuser)]: Multi-modal fusion for driving
• **TCP** [[43](https://github.com/OpenPerceptionX/TCP)]: Trajectory-guided control prediction
• **LBC** [[44](https://github.com/dotchen/LearningByCheating)]: Learning by cheating implementation
7.3 Planning and Control
• **OpenPilot** [[45](https://github.com/commaai/openpilot)]: Open source driver assistance system
• **Apollo** [[46](https://github.com/ApolloAuto/apollo)]: Baidu's autonomous driving platform
• **Autoware** [[47](https://github.com/autowarefoundation/autoware)]: Open source autonomous driving stack
7.4 Simulation and Testing
• **CARLA** [[48](https://github.com/carla-simulator/carla)]: Open-source simulator
• **SUMO** [[49](https://github.com/eclipse/sumo)]: Traffic simulation
• **AirSim** [[50](https://github.com/Microsoft/AirSim)]: Microsoft's simulator
• **LGSVL** [[51](https://github.com/lgsvl/simulator)]: LG's autonomous driving simulator
7.5 Datasets
• **nuScenes** [[52](https://github.com/nutonomy/nuscenes-devkit)]: Large-scale autonomous driving dataset
• **Waymo Open Dataset** [[53](https://github.com/waymo-research/waymo-open-dataset)]: Waymo's public dataset
• **KITTI** [[54](http://www.cvlibs.net/datasets/kitti/)]: Classic autonomous driving benchmark
• **Cityscapes** [[55](https://github.com/mcordts/cityscapes-scripts)]: Urban scene understanding
8. Comprehensive Bibliography and References
8.1 Foundational Papers
• [1] **Modular Autonomous Driving**: [End-to-end Driving via Conditional Imitation Learning](https://arxiv.org/abs/1710.02410)
• [2] **ChauffeurNet**: [ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst](https://arxiv.org/abs/1812.03079)
• [3] **NVIDIA End-to-End**: [End to End Learning for Self-Driving Cars](https://arxiv.org/abs/1604.07316)
• [4] **Learning by Cheating**: [Learning by Cheating](https://arxiv.org/abs/1912.12294)
• [5] **World on Rails**: [Learning to Drive from a World on Rails](https://arxiv.org/abs/2105.00636)
8.2 Architecture and Networks
• [6] **RegNet**: [Designing Network Design Spaces](https://arxiv.org/abs/2003.13678)
• [7] **FPN**: [Feature Pyramid Networks for Object Detection](https://arxiv.org/abs/1612.03144)
• [8] **Swin Transformer**: [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
• [9] **3D CNNs**: [Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset](https://arxiv.org/abs/1705.07750)
• [10] **Non-local Networks**: [Non-local Neural Networks](https://arxiv.org/abs/1711.07971)
• [11] **BEVFormer**: [BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers](https://arxiv.org/abs/2203.17270)
8.3 Training and Learning
• [12] **Learning from Demonstration**: [One-Shot Imitation Learning](https://arxiv.org/abs/1707.02747)
• [13] **DAgger**: [A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning](https://arxiv.org/abs/1011.0686)
• [14] **SQIL**: [SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards](https://arxiv.org/abs/1905.11108)
• [15] **TPU**: [In-Datacenter Performance Analysis of a Tensor Processing Unit](https://arxiv.org/abs/1704.04760)
• [16] **Cerebras**: [A Cerebras CS-1 Analysis: Memory-Bandwidth-Limited Applications](https://arxiv.org/abs/2008.05756)
• [17] **Multi-Task Uncertainty**: [Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics](https://arxiv.org/abs/1705.07115)
• [18] **GradNorm**: [GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks](https://arxiv.org/abs/1711.02257)
• [19] **PCGrad**: [Gradient Surgery for Multi-Task Learning](https://arxiv.org/abs/2001.06782)
• [20] **Curriculum Learning**: [Curriculum Learning](https://dl.acm.org/doi/10.1145/1553374.1553380)
• [21] **Self-Paced Learning**: [Self-Paced Learning for Latent Variable Models](https://arxiv.org/abs/1506.06379)
8.4 Safety and Verification
• [22] **NN Verification**: [Formal Verification of Neural Networks](https://arxiv.org/abs/1909.01838)
• [23] **Reluplex**: [Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks](https://arxiv.org/abs/1702.01135)
• [24] **CROWN**: [Efficient Neural Network Robustness Certification with General Activation Functions](https://arxiv.org/abs/1811.00866)
• [25] **Adversarial Examples**: [Intriguing Properties of Neural Networks](https://arxiv.org/abs/1312.6199)
• [26] **FGSM**: [Explaining and Harnessing Adversarial Examples](https://arxiv.org/abs/1412.6572)
• [27] **PGD**: [Towards Deep Learning Models Resistant to Adversarial Attacks](https://arxiv.org/abs/1706.06083)
8.5 Industry and Competitors
• [28] **Tesla Safety Report**: [Tesla Vehicle Safety Report](https://www.tesla.com/VehicleSafetyReport)
• [29] **Waymo ScaLR**: [ScaLR: Scalable Learning for Autonomous Driving](https://arxiv.org/abs/2104.10133)
• [30] **Cruise Architecture**: [MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction](https://arxiv.org/abs/2203.11089)
8.6 Future Directions
• [31] **DriveGPT**: [DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model](https://arxiv.org/abs/2310.01889)
• [32] **DriveLM**: [DriveLM: Driving with Graph Visual Question Answering](https://arxiv.org/abs/2312.09245)
• [33] **CARLA**: [CARLA: An Open Urban Driving Simulator](https://arxiv.org/abs/1711.03938)
• [34] **AirSim**: [AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles](https://arxiv.org/abs/1705.05065)
• [35] **Causal Confusion**: [Causal Confusion in Imitation Learning](https://arxiv.org/abs/1905.11979)
• [36] **GradCAM**: [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/abs/1610.02391)
8.7 Code Repositories
• [37] **BEVFormer Code**: [https://github.com/fundamentalvision/BEVFormer](https://github.com/fundamentalvision/BEVFormer)
• [38] **BEVDet Code**: [https://github.com/HuangJunJie2017/BEVDet](https://github.com/HuangJunJie2017/BEVDet)
• [39] **LSS Code**: [https://github.com/nv-tlabs/lift-splat-shoot](https://github.com/nv-tlabs/lift-splat-shoot)
• [40] **FIERY Code**: [https://github.com/wayveai/fiery](https://github.com/wayveai/fiery)
• [41] **CARLA Leaderboard**: [https://github.com/carla-simulator/leaderboard](https://github.com/carla-simulator/leaderboard)
• [42] **InterFuser Code**: [https://github.com/opendilab/InterFuser](https://github.com/opendilab/InterFuser)
• [43] **TCP Code**: [https://github.com/OpenPerceptionX/TCP](https://github.com/OpenPerceptionX/TCP)
• [44] **LBC Code**: [https://github.com/dotchen/LearningByCheating](https://github.com/dotchen/LearningByCheating)
• [45] **OpenPilot**: [https://github.com/commaai/openpilot](https://github.com/commaai/openpilot)
• [46] **Apollo**: [https://github.com/ApolloAuto/apollo](https://github.com/ApolloAuto/apollo)
• [47] **Autoware**: [https://github.com/autowarefoundation/autoware](https://github.com/autowarefoundation/autoware)
• [48] **CARLA Simulator**: [https://github.com/carla-simulator/carla](https://github.com/carla-simulator/carla)
• [49] **SUMO**: [https://github.com/eclipse/sumo](https://github.com/eclipse/sumo)
• [50] **AirSim Code**: [https://github.com/Microsoft/AirSim](https://github.com/Microsoft/AirSim)
• [51] **LGSVL**: [https://github.com/lgsvl/simulator](https://github.com/lgsvl/simulator)
• [52] **nuScenes**: [https://github.com/nutonomy/nuscenes-devkit](https://github.com/nutonomy/nuscenes-devkit)
• [53] **Waymo Dataset**: [https://github.com/waymo-research/waymo-open-dataset](https://github.com/waymo-research/waymo-open-dataset)
• [54] **KITTI**: [http://www.cvlibs.net/datasets/kitti/](http://www.cvlibs.net/datasets/kitti/)
• [55] **Cityscapes**: [https://github.com/mcordts/cityscapes-scripts](https://github.com/mcordts/cityscapes-scripts)
8.8 Tesla-Specific Resources
• **Tesla AI Day 2021**: [https://www.youtube.com/watch?v=j0z4FweCy4M](https://www.youtube.com/watch?v=j0z4FweCy4M)
• **Tesla AI Day 2022**: [https://www.youtube.com/watch?v=ODSJsviD_SU](https://www.youtube.com/watch?v=ODSJsviD_SU)
• **Tesla Autonomy Day 2019**: [https://www.youtube.com/watch?v=Ucp0TTmvqOE](https://www.youtube.com/watch?v=Ucp0TTmvqOE)
• **Andrej Karpathy's Talks**: [https://www.youtube.com/watch?v=hx7BXih7zx8](https://www.youtube.com/watch?v=hx7BXih7zx8)
• **Tesla FSD Beta Documentation**: [https://www.tesla.com/support/full-self-driving-beta](https://www.tesla.com/support/full-self-driving-beta)
8.9 Additional Lane Detection References
• [59] **Focal Loss**: [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002)
• [60] **Graph Neural Networks**: [LaneGCN: Learning Lane Graph Representations for Motion Forecasting](https://arxiv.org/abs/2005.03508)
• [61] **Scheduled Sampling**: [Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks](https://arxiv.org/abs/1506.03099)
Tip: When you turn this into slides, show one line connecting each paper to the specific Tesla design choice (e.g., Occupancy Networks → queryable MLP heads; VectorNet → lane adjacency matrix).
Appendix: Example Tensor/IO Specifications
Reusable tensor and I/O specifications for implementation
• HydraNet input: 8×(H×W×3) → per-camera {P2,P3,P4} FPN maps.
• Fusion output: $\text{BEV\_feat} \in \mathbb{R}^{C\times X\times Y}$ (optionally Z).
• Occupancy query: $f_\theta:(x,y,z)\mapsto (p_\text{occ}, \mathbf{s}_\text{sem})$.
• Lane instance: $\{(x_i,y_i)\}_{i=1\dots n}$ + spline params + edges in adjacency matrix.
• Planner candidates: $\{\boldsymbol{\tau}_k\}_{k=1\dots K}$, $\boldsymbol{\tau}_k=\{(x_t,y_t,\theta_t,v_t)\}_{t=1\dots T}$.
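The lane and planner specifications above map naturally onto small typed records. The dataclass and field names below are illustrative assumptions, not an interface from the source.

```python
from dataclasses import dataclass, field


@dataclass
class LaneInstance:
    points: list                 # [(x_i, y_i)] in ego BEV metres, i = 1..n
    spline_params: list          # coefficients or control points of the fitted curve
    successors: list = field(default_factory=list)  # adjacency row as lane ids


@dataclass
class TrajectoryState:
    x: float
    y: float
    theta: float                 # heading in radians
    v: float                     # speed in m/s


@dataclass
class PlannerCandidate:
    states: list                 # [TrajectoryState] for t = 1..T

    def horizon(self):
        """Number of time steps T in this candidate."""
        return len(self.states)


lane = LaneInstance(points=[(0.0, 0.0), (5.0, 0.1)], spline_params=[0.0, 0.02], successors=[3])
tau = PlannerCandidate(states=[TrajectoryState(float(t), 0.0, 0.0, 10.0) for t in range(1, 6)])
print(lane.successors, tau.horizon())
```

Fixing these records early gives perception and planning a shared contract, so either side can be swapped out as long as the fields stay the same.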