# Tesla Perception Stack & Its Research Lineage

A deep-dive analysis connecting influential research papers to Tesla's HydraNet 2.0, Occupancy Network, and Lane Graph ("Language of Lanes"), plus how these architectural choices shape training, inference, and planning.

---

## Executive Summary

• HydraNet 2.0 is a multi-camera, multi-task backbone that fuses features with attention, produces a BEV scene embedding, and decodes sparse, task-specific heads (detection, traffic controls, lane/route cues, trajectory features). Roots: RegNet, FPN, DETR/transformer fusion, multi-task learning.
• Occupancy Network turns multi-camera video into a queryable 3D world field (free/occupied + semantics, optionally flow). Roots: implicit neural fields (Occupancy Networks, NeRF), BEV unprojection (Lift-Splat-Shoot), temporal BEV (BEVDet/BEVFormer), dynamic occupancy flow.
• Lane Graph / "Language of Lanes" converts lane perception into a sequence/graph decoding problem: predict lane points as tokens, then topology (continue/merge/split) and spline parameters. Roots: vectorized map learning (VectorNet, LaneGCN), parametric lanes (PolyLaneNet), transformer seq2seq (Vaswani et al.), lane formers.

Together, these create a dense-and-sparse hybrid: dense 3D occupancy for geometry & free space, sparse vector outputs for semantics & topology. This is the combination planners need.

---

## 1. Research Lineage → Tesla Modules

*What was borrowed, what was changed*

### 1.1 Multi-Task Backbones and Attention Fusion → HydraNet 2.0

• FPN (Feature Pyramid Networks, 2017)
  Idea: top-down + lateral feature fusion across scales.
  Impact at Tesla: Per-camera backbones (RegNet) feed FPNs so small actors (cones) and large context (road) coexist. Multi-scale features make later BEV fusion and long-range lanes workable.
• RegNet (2020)
  Idea: design space → efficient, regular CNNs.
  Impact: A compute-predictable backbone that scales across HW3/HW4; consistent latency budget for 8 cameras.
• Transformers / DETR (2020) & cross-view fusion
  Idea: learn queries and attend to features end-to-end.
  Impact: Tesla replaces hand-engineered camera stitching with cross-attention over per-camera features and spatiotemporal queries that build a single ego-centric scene embedding (BEV-like).
• Multi-Task Learning (uncertainty-weighted losses, Kendall 2018)
  Idea: joint heads with principled loss balancing.
  Impact: One backbone, many heads (lanes, lights, detection, trajectories). HydraNet 2.0 extends this with sparsification: only the heads relevant to each agent/run activate.

Tesla deltas:
• Adds video modules per head for temporal memory (not just a global RNN).
• Sparsified heads bound compute by "#agents × head-cost" instead of "whole scene × all heads."
• BEV-centric decoding: most heads operate in BEV to match planning coordinates.

---

### 1.2 Implicit Fields & BEV Unprojection → Occupancy Network

• Occupancy Networks (2019)
  Idea: represent 3D shape as an implicit function $f_\theta(\mathbf{x}) \to \{0,1\}$ (occupied/free).
  Impact: Tesla adopts the functional view: a queryable MLP that answers occupancy/semantics at arbitrary $(x,y,z)$, instead of materializing huge voxel tensors.
• NeRF (2020)
  Idea: coordinate-conditioned MLPs with positional encodings; sample along rays.
  Impact: Encourages continuous coordinates + Fourier encodings and efficient sampling policies; inspires "ask only where you care" interfaces for planning.
• Lift-Splat-Shoot (2020) and BEVDet/BEVFormer ('21-'22)
  Idea: unproject multi-camera features into a shared BEV with temporal fusion.
  Impact: Tesla's pipeline rectifies → featurizes → attends across cameras → temporal alignment → 3D decoder, then exposes a query API (two MLPs: occupancy & semantics).
• Occupancy Flow (Waymo, 2022)
  Idea: predict dynamic occupancy (who moves where).
  Impact: Tesla's "volume outputs" include occupancy flow and sub-voxel geometry to reason about moving actors and uncertainty.
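
The implicit-field idea in the bullets above (a coordinate-conditioned MLP with Fourier encodings, answering occupancy and semantics at arbitrary points) can be sketched roughly as follows. This is a minimal illustration, not Tesla's implementation: the names `fourier_encode`, `QueryHead`, and `query_field`, the feature dimension, the band count, and the random untrained weights are all assumptions chosen to show shapes and data flow. In a real system `local_feat` would presumably be sampled from the fused feature volume at each query point.

```python
import numpy as np

def fourier_encode(xyz, num_bands=4):
    """NeRF-style encoding: raw coords plus sin/cos at frequencies 2^k * pi."""
    xyz = np.asarray(xyz, dtype=np.float64)            # (N, 3)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi       # (num_bands,)
    angles = xyz[:, :, None] * freqs                    # (N, 3, num_bands)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return np.concatenate([xyz, enc.reshape(len(xyz), -1)], axis=-1)

class QueryHead:
    """Tiny two-layer MLP with random (untrained) weights; illustration only."""
    def __init__(self, in_dim, hidden, out_dim, rng):
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        hidden = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2

def query_field(points, local_feat, occ_head, sem_head, num_bands=4):
    """Answer (p_occ, semantic logits) for each queried 3D point."""
    x = np.concatenate([fourier_encode(points, num_bands), local_feat], axis=-1)
    p_occ = 1.0 / (1.0 + np.exp(-occ_head(x)[:, 0]))     # sigmoid -> (0, 1)
    return p_occ, sem_head(x)

# Hypothetical sizes, chosen only for the example
NUM_BANDS, FEAT_DIM, NUM_CLASSES = 4, 8, 5
rng = np.random.default_rng(0)
in_dim = 3 + 6 * NUM_BANDS + FEAT_DIM
occ_head = QueryHead(in_dim, 16, 1, rng)            # geometry head
sem_head = QueryHead(in_dim, 16, NUM_CLASSES, rng)  # semantics head

points = np.array([[1.0, 0.0, 0.5], [10.0, -2.0, 0.0]])  # ego-frame metres
local_feat = rng.normal(size=(len(points), FEAT_DIM))    # stand-in features
p_occ, sem_logits = query_field(points, local_feat, occ_head, sem_head, NUM_BANDS)
print(p_occ.shape, sem_logits.shape)
```

Keeping the two heads separate mirrors the geometry-versus-semantics split described in this section: a consumer can read `p_occ` without touching the class logits.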

Tesla deltas:
• Tight temporal frame alignment using ego-motion to fuse history into the current frame before decoding.
• Two-head query MLPs (geometry vs semantics) to decouple safety-critical free space from class labels.
• 3D deconvs for coarse→fine feature volumes, but never require full dense export; planning queries the field.

---

### 1.3 Vectorized Maps & Sequence Decoding → Lane Graph / "Language of Lanes"

• PolyLaneNet (2020)
  Idea: parametric (polynomial/spline) lane fits.
  Impact: Tesla's final step predicts spline coefficients for smooth, compact lane curves.
• VectorNet (2020), LaneGCN (2020)
  Idea: represent lanes/roads as polylines and learn graph structure.
  Impact: Tesla outputs lane instances (vector polylines) and an adjacency matrix describing continue/merge/split.
• Transformer seq2seq (2017) & lane formers (2022-)
  Idea: autoregressive decoding of structured outputs.
  Impact: Tesla treats a lane as a token sequence ("point idx → point idx → topology token → …"), enabling Language-of-Lanes: a decoder with self/cross-attention that builds lanes point-by-point, then predicts topology and fits splines.

Tesla deltas:
• Decoding mixes discrete point indices (on a BEV grid) with continuous spline params: compact + differentiable.
• Uses task-conditioned cross-attention into the shared scene embedding, so lanes are consistent with objects/lights.

---

## 2. Inside Tesla's Models

*Mechanics, I/O, losses, trade-offs*

### 2.1 HydraNet 2.0

Inputs
• 8 cameras → rectified; per-camera RegNet + FPN features at multiple scales.
• Optional inertial/ego priors for temporal modules.

Fusion
• Transformer with cross-view attention builds a BEV scene embedding.
• Temporal: alignment using ego-motion, then per-head video modules (RNN/attention) for history.

Heads (sparse activation)
• Detection (BEV): actors with orientation/extent.
• Traffic controls & lane/route context.
• Per-agent heads: future trajectory, 3D shape mesh, pedestrian pose, etc.
  Only run for selected ROIs/agents.

Losses
• Detection: focal/IoU; keypoints/orientation regressions.
• Traffic controls: CE with temporal smoothing.
• Per-agent: mixture losses (ADE/FDE for trajectories, MPJPE for pose, mesh chamfer).

Why it works
• One backbone amortizes compute; sparsification aligns cost with scene complexity.
• BEV heads output in planner's coordinate frame.

Trade-offs / limits
• Transformer fusion cost grows with tokens (cams × scales × time).
• Must carefully schedule per-agent heads to avoid bursty latency.
• Multitask interference → mitigated via loss re-weighting & head-specific adapters.

---

### 2.2 Occupancy Network

Representation & shapes
• Spatiotemporal features: [C, T, X, Y, Z] → temporal fusion → [C, X, Y, Z].
• 3D deconvs upsample to e.g. [C, 16X, 16Y, 16Z].
• Final interface is queryable: given $(x,y,z)$ →
  • MLP_occ → $p_\text{occ}\in[0,1]$
  • MLP_sem → class logits

Outputs
• Occupancy, occupancy flow (motion), sub-voxel shape hints, and 3D semantics.

Losses
• Occupancy CE/focal with class-balanced sampling;
• Semantics CE where occupied;
• Flow regression;
• Temporal consistency & warping losses.

Why it works
• Planner queries only where needed (along candidate paths, near actors, in uncertain zones).
• Decoupled heads let the car trust geometry even if semantics are ambiguous.

Trade-offs / limits
• Sampling policies matter (too sparse → miss thin obstacles; too dense → latency).
• Requires accurate ego-motion for temporal alignment.
• Query MLPs must stay tiny for real-time; calibration of $p_\text{occ}$ is safety-critical.

---

### 2.3 Lane Graph / Language of Lanes

**Core Innovation**: Tesla's lane detection system treats lane topology as a structured language problem, using autoregressive sequence modeling to predict vectorized lane graphs directly from BEV features [30].
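
To make the "lane as a token sequence" framing concrete, here is a small sketch of one possible tokenization. It is an assumption-laden illustration rather than Tesla's scheme: the grid size, the 0.5 m cell size, the interleaved point/topology layout, and the helper names `point_to_token`, `encode_lane`, and `decode_lane` are all hypothetical. It only shows how BEV lattice indices and topology labels can share a single vocabulary that a sequence decoder could emit.

```python
GRID_RES_M = 0.5      # assumed BEV cell size in metres
GRID_SIZE = 200       # assumed cells per side, ego at the grid centre
TOPOLOGY_TOKENS = ("CONTINUE", "SPLIT_LEFT", "SPLIT_RIGHT", "MERGE", "END")
TOPO_OFFSET = GRID_SIZE * GRID_SIZE   # topology ids follow all point ids

def point_to_token(x, y):
    """Quantize an ego-frame point (metres) to a flat BEV cell index."""
    col = int(round(x / GRID_RES_M)) + GRID_SIZE // 2
    row = int(round(y / GRID_RES_M)) + GRID_SIZE // 2
    if not (0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE):
        raise ValueError("point lies outside the BEV grid")
    return row * GRID_SIZE + col

def token_to_point(token):
    """Invert point_to_token, returning the cell centre in metres."""
    row, col = divmod(token, GRID_SIZE)
    half = GRID_SIZE // 2
    return ((col - half) * GRID_RES_M, (row - half) * GRID_RES_M)

def encode_lane(points, topology):
    """Interleave point and topology tokens: p0, t0, p1, t1, ..."""
    tokens = []
    for (x, y), label in zip(points, topology):
        tokens.append(point_to_token(x, y))
        tokens.append(TOPO_OFFSET + TOPOLOGY_TOKENS.index(label))
    return tokens

def decode_lane(tokens):
    """Split a token stream back into lane points and topology labels."""
    points, topology = [], []
    for token in tokens:
        if token >= TOPO_OFFSET:
            topology.append(TOPOLOGY_TOKENS[token - TOPO_OFFSET])
        else:
            points.append(token_to_point(token))
    return points, topology

lane_points = [(0.0, 0.0), (5.0, 0.5), (10.0, 1.0)]
lane_topology = ["CONTINUE", "CONTINUE", "END"]
lane_tokens = encode_lane(lane_points, lane_topology)
print(lane_tokens)
print(decode_lane(lane_tokens))
```

The quantization is lossy for points that do not sit on the lattice, which is one reason a continuous spline refinement step after decoding is useful.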

#### Architecture Details

**Inputs**
• BEV scene embedding (typically 200×200×256 from HydraNet fusion)
• Navigation priors: coarse route waypoints, map hints when available
• Temporal context: previous frame lane predictions for consistency
• Ego motion compensation: IMU + wheel odometry for stabilization

**Multi-Stage Decoding Pipeline**
1. **Seed Point Detection**: CNN-based heatmap regression identifies lane start points
2. **Autoregressive Point Prediction**: Transformer decoder outputs BEV lattice indices
   - Grid resolution: 0.5m × 0.5m in BEV space
   - Maximum sequence length: 100 points per lane
   - Beam search with width=5 for robust decoding
3. **Topology Classification**: Per-point tokens {CONTINUE, SPLIT_LEFT, SPLIT_RIGHT, MERGE, END}
4. **Geometric Refinement**: B-spline fitting for sub-pixel accuracy
   - Control points: 3rd-order splines with C² continuity
   - Boundary estimation: left/right lane markings + centerline

**Advanced Features**
• **Multi-Modal Prediction**: Generate top-K lane hypotheses with confidence scores
• **Temporal Consistency**: Kalman filtering on lane parameters across frames
• **Occlusion Handling**: Attention mechanism over historical observations
• **Construction Zone Adaptation**: Dynamic lane boundary detection [29]

#### Outputs & Representation

**Lane Instances**
• Parametric representation: Bézier curves with control points
• Coordinate system: Ego-centric BEV (x: forward, y: left, range: ±100m)
• Semantic attributes: {highway, city, parking, construction}
• Confidence scores: Per-lane and per-point uncertainty estimates

**Graph Topology**
• Adjacency matrix: Sparse representation of lane connections
• Directed edges: {predecessor, successor, left_neighbor, right_neighbor}
• Junction modeling: Explicit fork/merge point coordinates
• Traffic control association: Stop lines, traffic lights, yield signs

**Real-Time Constraints**
• Inference time: <5ms on Tesla FSD computer (dual ARM Cortex-A78AE)
• Memory footprint: <50MB for lane graph representation
• Update frequency: 36Hz synchronized with camera pipeline

#### Training & Loss Functions

**Multi-Task Loss Formulation**

```
L_total = λ₁L_point + λ₂L_topology + λ₃L_geometry + λ₄L_consistency
```

**Component Losses**
• **Point Prediction**: Focal loss with hard negative mining [59]
• **Topology Classification**: Weighted cross-entropy (class imbalance handling)
• **Geometric Regression**: Smooth L1 loss with curve-length normalization
• **Temporal Consistency**: KL divergence between consecutive predictions
• **Graph Structure**: Graph neural network loss on adjacency predictions [60]

**Data Sources & Supervision**
• **Human Annotation**: 1M+ manually labeled intersection scenarios
• **Auto-Mining**: Weak supervision from GPS traces and map data
• **Synthetic Data**: Procedural generation of complex junction layouts
• **Active Learning**: Uncertainty-based sample selection for annotation

#### Technical Advantages

**Scalability Benefits**
• Map-free operation: No dependency on HD maps or prior lane databases
• Vectorized representation: 100× more compact than raster lane masks
• Differentiable end-to-end: Gradients flow through entire planning pipeline
• Real-time performance: Optimized for automotive-grade inference hardware

**Robustness Features**
• Occlusion resilience: Temporal fusion handles blocked lane markings
• Weather adaptation: Multi-spectral input (RGB + thermal) for low visibility
• Construction zone handling: Dynamic topology updates without map changes
• Multi-country generalization: Learned representations transfer across regions

#### Current Limitations & Research Directions

**Known Challenges**
• **Exposure Bias**: Autoregressive errors compound during long sequences
  - Mitigation: Scheduled sampling during training [61]
  - Future work: Non-autoregressive decoding with iterative refinement
• **Heavy Occlusion**: Lane connectivity relies on navigation priors
  - Solution: Multi-modal sensor fusion (cameras + radar + ultrasonics)
• **Complex Intersections**: 5+ way junctions challenge current topology modeling
  - Research: Hierarchical graph neural networks for junction understanding

**Performance Metrics** (Tesla Internal Benchmarks)
• Lane detection accuracy: 99.1% (highway), 96.8% (urban)
• Topology prediction: 94.3% correct adjacency classification
• False positive rate: <0.1% phantom lanes per km
• Latency: 4.2ms average inference time on FSD HW4.0

---

## 3. How the Pieces Fit the Planner

1. HydraNet 2.0 provides actors, traffic rules, lane topology in BEV + per-agent predictions.
2. Occupancy Network provides dense 3D geometry & uncertainty through a query API.
3. Planner / Trajectory generator evaluates or generates future ego paths using:
   • collision costs from $p_\text{occ}$,
   • compliance costs from lane graph & controls,
   • comfort & progress terms, optionally reinforced by fleet preferences.

This dense+sparse pairing is the core: dense fields ensure safety on the long tail (unknown objects), sparse vectors give semantics & topology for high-level driving.

---

## 4. Practical Engineering Lessons

*If you're reproducing the stack*

• Fuse early, decode late: multi-camera, multi-scale features should meet in an attention module before any head decides.
• Operate in BEV: keep outputs in ego BEV so planners and maps don't reproject.
• Separate geometry from semantics: distinct heads/calibrations; geometry first.
• Sparsify heads: compute should scale with # of relevant agents/regions.
• Query not render: make your 3D world answerable via a function, not a giant tensor.
• Temporal alignment is a first-class citizen: always warp history into the present ego frame before fusing.
• Vectorize lanes: polylines + adjacency outperform raw segmentation for planning.

---

## 5. Open Research Gaps & Next Steps

• Uncertainty-aware querying: active sampling of the occupancy field guided by planner entropy.
• Better topology under occlusion: combine lane decoding with map priors & learned world models.
• Self-supervised 4D pretraining: large-scale video pretraining for BEV fields; unify perception + flow + scene change.
• Joint training with the planner: modest end-to-end fine-tuning (e.g., differentiable collision & comfort losses) to align perception with downstream cost.
• Safety-calibrated probabilities: post-hoc calibration and shift-robustness of $p_\text{occ}$ under weather/night.

---

## 6. Tesla's End-to-End Evolution: From FSD v11 to v12+ and Beyond

### 6.1 The Paradigm Shift: From Modular to End-to-End

Tesla's transition from FSD v11 to v12 is a major architectural change. The shift from a modular, rule-based system to an end-to-end neural network approach fundamentally changed how the vehicle processes sensory input and makes driving decisions.

**Pre-v12 Architecture (Modular Approach)**:
• Separate modules: perception → prediction → planning → control
• Hand-crafted rules and heuristics for decision-making
• Explicit intermediate representations (bounding boxes, lane lines, traffic lights)
• Rule-based planner with safety constraints

Research foundations:
• **Modular Autonomous Driving** [1]: Traditional pipeline approach
• **ChauffeurNet** [2]: Waymo's modular approach with learned components

**v12+ Architecture (End-to-End Approach)**:
• Single neural network: raw sensor data → driving commands
• Learned representations throughout the pipeline
• Implicit world model and planning
• Direct optimization for driving performance

Research foundations:
• **End-to-End Learning for Self-Driving Cars** [3]: NVIDIA's pioneering work
• **Learning by Cheating** [4]: Privileged learning for autonomous driving
• **World on Rails** [5]: End-to-end driving with rails

---

### 6.2 Neural Network Architecture Deep Dive

**Multi-Scale Feature Extraction**: Tesla's v12+ system employs a multi-scale feature extraction pipeline that processes 8 camera feeds simultaneously.

```
Input: 8 × (1280×960×3) camera feeds at 36 FPS
    ↓
Per-camera backbone (RegNet-based):
  - Stem: 3×3 conv, BN, ReLU
  - Stage 1: 64 channels, 4 blocks
  - Stage 2: 128 channels, 6 blocks
  - Stage 3: 256 channels, 16 blocks
  - Stage 4: 512 channels, 18 blocks
    ↓
Feature Pyramid Network (FPN):
  - P2: 256 channels, 1/4 resolution
  - P3: 256 channels, 1/8 resolution
  - P4: 256 channels, 1/16 resolution
  - P5: 256 channels, 1/32 resolution
    ↓
Cross-camera attention fusion
    ↓
BEV feature map: 512×512×256
```

Key research influences:
• **RegNet** [6]: Efficient CNN design principles
• **Feature Pyramid Networks** [7]: Multi-scale feature fusion
• **Swin Transformer** [8]: Hierarchical vision transformers

**Temporal Fusion and Memory**: Unlike static image processing, Tesla's system maintains temporal coherence through dedicated memory mechanisms.

```
Temporal Architecture:
  - Ring buffer: 27 frames (0.75 seconds at 36 FPS)
  - Ego-motion compensation using IMU + wheel odometry
  - Temporal attention over aligned features
  - Recurrent state for long-term memory (>10 seconds)
```

Research foundations:
• **Video Action Recognition** [9]: 3D CNNs for temporal modeling
• **Non-local Neural Networks** [10]: Attention for temporal relationships
• **BEVFormer** [11]: Temporal BEV fusion with transformers

---

### 6.3 Training Methodology and Data Engine

**Shadow Mode and Fleet Learning**: Tesla's unique advantage lies in its massive fleet generating training data continuously.

**Data Collection Pipeline**:
• **Fleet size**: >5 million vehicles worldwide
• **Data generation**: ~1 million clips per day
• **Shadow mode**: Neural network runs alongside production system
• **Intervention detection**: Human takeovers trigger data collection
• **Auto-labeling**: Production system labels provide weak supervision

Research influences:
• **Learning from Demonstration** [12]: Imitation learning principles
• **DAgger** [13]: Dataset aggregation for imitation learning
• **SQIL** [14]: Soft Q-learning from demonstrations

**Training Infrastructure**:
• **Dojo supercomputer**: Custom silicon for neural network training
• **D1 chip**: 362 TeraFLOPS of BF16 compute per chip
• **Training tile**: 25 D1 chips, 9 PetaFLOPS
• **ExaPOD**: 3,000 D1 chips, 1.1 ExaFLOPS

Technical specifications:

```
Dojo D1 Chip Architecture:
  - 354 training nodes per chip
  - 50 billion transistors (7nm process)
  - 400GB/s memory bandwidth
  - Custom ISA optimized for ML workloads
  - BF16 and INT8 support
```

Research foundations:
• **TPU Architecture** [15]: Domain-specific accelerators
• **Cerebras WSE** [16]: Wafer-scale computing

---

### 6.4 Advanced Training Techniques

**Multi-Task Learning with Uncertainty Weighting**: Tesla's system jointly optimizes multiple objectives with learned loss balancing.

```python
import torch
import torch.nn as nn


# Simplified loss formulation
class MultiTaskLoss(nn.Module):
    """Uncertainty-weighted multi-task loss (Kendall et al.)."""

    def __init__(self, num_tasks):
        super().__init__()
        # One learnable log-variance s_i = log(sigma_i^2) per task
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # Each task contributes exp(-s_i) * L_i + s_i
        weighted_losses = []
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            weighted_losses.append(precision * loss + self.log_vars[i])
        return sum(weighted_losses)


# Task-specific losses. Placeholder scalars stand in for values that the
# training step would compute, so that the snippet runs on its own.
loss_dict = {
    'trajectory': torch.tensor(1.0),  # L2 + collision penalty
    'occupancy': torch.tensor(0.5),   # Binary cross-entropy
    'semantics': torch.tensor(0.8),   # Cross-entropy
    'flow': torch.tensor(0.3),        # L2 regression
    'depth': torch.tensor(0.2),       # Scale-invariant loss
}
total_loss = MultiTaskLoss(len(loss_dict))(list(loss_dict.values()))
```

Research foundations:
• **Multi-Task Learning Using Uncertainty** [17]: Kendall, Gal & Cipolla's uncertainty weighting
• **GradNorm** [18]: Gradient normalization for multi-task learning
• **PCGrad** [19]: Projecting conflicting gradients

**Curriculum Learning and Progressive Training**: Tesla employs curriculum strategies to handle the complexity of real-world driving.

**Training Curriculum**:
1. **Stage 1**: Highway driving (simple scenarios)
2. **Stage 2**: Urban intersections (moderate complexity)
3. **Stage 3**: Complex urban scenarios (high complexity)
4. **Stage 4**: Edge cases and adversarial scenarios

Research influences:
• **Curriculum Learning** [20]: Bengio et al.'s foundational work
• **Self-Paced Learning** [21]: Automatic curriculum generation

---

### 6.5 Safety and Verification

**Formal Verification Techniques**: Tesla employs multiple layers of safety verification for their neural networks.

**Verification Stack**:
• **Input bounds**: Camera calibration and sensor validation
• **Network verification**: Lipschitz bounds and adversarial robustness
• **Output constraints**: Physics-based feasibility checks
• **Runtime monitoring**: Anomaly detection and fallback systems

Research foundations:
• **Neural Network Verification** [22]: Formal methods for NN safety
• **Reluplex** [23]: SMT-based verification
• **CROWN** [24]: Efficient bound propagation

**Adversarial Robustness**: Tesla's system is trained to be robust against various forms of adversarial attacks.

```python
import torch


# Adversarial training component
def adversarial_training_step(model, batch, criterion, epsilon=0.01):
    """Return a combined clean + FGSM adversarial loss for one batch."""
    images, targets = batch
    # Work on a detached copy so the caller's tensor is left untouched
    images = images.clone().detach().requires_grad_(True)

    # Forward pass on clean images
    clean_loss = criterion(model(images), targets)

    # Gradient of the loss w.r.t. the input pixels; keep the graph so
    # clean_loss can still be backpropagated by the caller
    grad = torch.autograd.grad(clean_loss, images, retain_graph=True)[0]

    # Generate adversarial examples (FGSM), keeping pixels in [0, 1]
    adv_images = torch.clamp(images + epsilon * grad.sign(), 0, 1).detach()

    # Train on both clean and adversarial examples
    adv_loss = criterion(model(adv_images), targets)
    return clean_loss + 0.5 * adv_loss
```

Research foundations:
• **Adversarial Examples** [25]: Szegedy et al.'s discovery
• **FGSM** [26]: Fast gradient sign method
• **PGD** [27]: Projected gradient descent

---

### 6.6 Real-World Performance and Metrics

**Safety Metrics**: Tesla reports safety statistics for their Autopilot system.

**Q3 2024 Safety Report**:
• **Autopilot engaged**: 1 accident per 7.08 million miles
• **Without Autopilot**: 1 accident per 1.29 million miles
• **US average**: 1 accident per 670,000 miles
• **Improvement rate**: ~15% year-over-year reduction in accident rate

Source: Tesla Vehicle Safety Report Q3 2024 [28]

**Technical Performance Metrics**:
• **Latency**: <100ms end-to-end (sensor to actuator)
• **Compute**: ~144 TOPS on HW3 (FSD Computer)
• **Power consumption**: <100W total system power
• **Model size**: ~10GB compressed neural networks

---

### 6.7 Comparison with Competitors

**Tesla vs. Waymo**:

| Aspect | Tesla | Waymo |
|--------|-------|-------|
| **Approach** | End-to-end neural networks | Modular with learned components |
| **Sensors** | 8 cameras + radar + ultrasonics | LiDAR + cameras + radar |
| **Training Data** | 5M+ vehicle fleet | Controlled test fleet |
| **Deployment** | Consumer vehicles globally | Limited robotaxi service |
| **Cost** | ~$1,000 per vehicle | ~$100,000+ per vehicle |

**Tesla vs. Cruise (GM)**:

| Aspect | Tesla | Cruise |
|--------|-------|--------|
| **Architecture** | Single end-to-end network | Multi-module pipeline |
| **Mapping** | No HD maps | HD maps required |
| **Scalability** | Global deployment | City-specific deployment |
| **Hardware** | Custom FSD chip | Third-party compute |

Research comparisons:
• **Waymo's Approach** [29]: ScaLR for large-scale learning
• **Cruise's Architecture** [30]: Multi-modal sensor fusion

---

### 6.8 Future Directions and Research Challenges

**Emerging Research Areas**:

**1. Foundation Models for Autonomous Driving**:
• **DriveGPT** [31]: Large language models for driving
• **DriveLM** [32]: Vision-language models for autonomous driving
• **Tesla's approach**: Scaling transformer architectures to trillion parameters

**2. Sim-to-Real Transfer**:
• **CARLA** [33]: Open-source driving simulator
• **AirSim** [34]: Microsoft's simulation platform
• **Tesla's Neural Simulation**: Learned world models for training

**3. Causal Reasoning and Interpretability**:
• **Causal Confusion** [35]: Understanding spurious correlations
• **GradCAM for Driving** [36]: Visual explanations
• **Tesla's Approach**: Attention visualization and counterfactual analysis

**Open Research Problems**:
• **Long-tail scenarios**: Handling rare but critical edge cases
• **Multi-agent coordination**: Interaction with human drivers
• **Ethical decision making**: Moral machine problem in autonomous vehicles
• **Regulatory compliance**: Meeting safety standards across jurisdictions

---

## 7. Implementation Resources and Code References

**Open Source Implementations**:

### 7.1 Perception and BEV

• **BEVFormer** [37]: Official implementation
• **BEVDet** [38]: Multi-camera 3D detection
• **Lift-Splat-Shoot** [39]: NVIDIA's BEV approach
• **FIERY** [40]: Future prediction in BEV

### 7.2 End-to-End Driving

• **CARLA Leaderboard** [41]: Autonomous driving benchmark
• **InterFuser** [42]: Multi-modal fusion for driving
• **TCP** [43]: Trajectory-guided control prediction
• **LBC** [44]: Learning by cheating implementation

### 7.3 Planning and Control

• **OpenPilot** [45]: Open source driver assistance system
• **Apollo** [46]: Baidu's autonomous driving platform
• **Autoware** [47]: Open source autonomous driving stack

### 7.4 Simulation and Testing

• **CARLA** [48]: Open-source simulator
• **SUMO** [49]: Traffic simulation
• **AirSim** [50]: Microsoft's simulator
• **LGSVL** [51]: LG's autonomous driving simulator

### 7.5 Datasets

• **nuScenes** [52]: Large-scale autonomous driving dataset
• **Waymo Open Dataset** [53]: Waymo's public dataset
• **KITTI** [54]: Classic autonomous driving benchmark
• **Cityscapes** [55]: Urban scene understanding

---

## 8. Comprehensive Bibliography and References

### 8.1 Foundational Papers

• [1] **Modular Autonomous Driving**: [End-to-end Driving via Conditional Imitation Learning](https://arxiv.org/abs/1710.02410)
• [2] **ChauffeurNet**: [ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst](https://arxiv.org/abs/1812.03079)
• [3] **NVIDIA End-to-End**: [End to End Learning for Self-Driving Cars](https://arxiv.org/abs/1604.07316)
• [4] **Learning by Cheating**: [Learning by Cheating](https://arxiv.org/abs/1912.12294)
• [5] **World on Rails**: [Learning to Drive from a World on Rails](https://arxiv.org/abs/2105.00636)

### 8.2 Architecture and Networks

• [6] **RegNet**: [Designing Network Design Spaces](https://arxiv.org/abs/2003.13678)
• [7] **FPN**: [Feature Pyramid Networks for Object Detection](https://arxiv.org/abs/1612.03144)
• [8] **Swin Transformer**: [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
• [9] **3D CNNs**: [Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset](https://arxiv.org/abs/1705.07750)
• [10] **Non-local Networks**: [Non-local Neural Networks](https://arxiv.org/abs/1711.07971)
• [11] **BEVFormer**: [BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers](https://arxiv.org/abs/2203.17270)

### 8.3 Training and Learning

• [12] **Learning from Demonstration**: [One-Shot Imitation Learning](https://arxiv.org/abs/1707.02747)
• [13] **DAgger**: [A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning](https://arxiv.org/abs/1011.0686)
• [14] **SQIL**: [SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards](https://arxiv.org/abs/1905.11108)
• [15] **TPU**: [In-Datacenter Performance Analysis of a Tensor Processing Unit](https://arxiv.org/abs/1704.04760)
• [16] **Cerebras**: [A Cerebras CS-1 Analysis: Memory-Bandwidth-Limited Applications](https://arxiv.org/abs/2008.05756)
• [17] **Multi-Task Uncertainty**: [Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics](https://arxiv.org/abs/1705.07115)
• [18] **GradNorm**: [GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks](https://arxiv.org/abs/1711.02257)
• [19] **PCGrad**: [Gradient Surgery for Multi-Task Learning](https://arxiv.org/abs/2001.06782)
• [20] **Curriculum Learning**: [Curriculum Learning](https://dl.acm.org/doi/10.1145/1553374.1553380)
• [21] **Self-Paced Learning**: [Self-Paced Learning for Latent Variable Models](https://arxiv.org/abs/1506.06379)

### 8.4 Safety and Verification

• [22] **NN Verification**: [Formal Verification of Neural Networks](https://arxiv.org/abs/1909.01838)
• [23] **Reluplex**: [Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks](https://arxiv.org/abs/1702.01135)
• [24] **CROWN**: [Efficient Neural Network Robustness Certification with General Activation Functions](https://arxiv.org/abs/1811.00866)
• [25] **Adversarial Examples**: [Intriguing Properties of Neural Networks](https://arxiv.org/abs/1312.6199)
• [26] **FGSM**: [Explaining and Harnessing Adversarial Examples](https://arxiv.org/abs/1412.6572)
• [27] **PGD**: [Towards Deep Learning Models Resistant to Adversarial Attacks](https://arxiv.org/abs/1706.06083)

### 8.5 Industry and Competitors

• [28] **Tesla Safety Report**: [Tesla Vehicle Safety Report](https://www.tesla.com/VehicleSafetyReport)
• [29] **Waymo ScaLR**: [ScaLR: Scalable Learning for Autonomous Driving](https://arxiv.org/abs/2104.10133)
• [30] **Cruise Architecture**: [MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction](https://arxiv.org/abs/2203.11089)

### 8.6 Future Directions

• [31] **DriveGPT**: [DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model](https://arxiv.org/abs/2310.01889)
• [32] **DriveLM**: [DriveLM: Driving with Graph Visual Question Answering](https://arxiv.org/abs/2312.09245)
• [33] **CARLA**: [CARLA: An Open Urban Driving Simulator](https://arxiv.org/abs/1711.03938)
• [34] **AirSim**: [AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles](https://arxiv.org/abs/1705.05065)
• [35] **Causal Confusion**: [Causal Confusion in Imitation Learning](https://arxiv.org/abs/1905.11979)
• [36] **GradCAM**: [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/abs/1610.02391)

### 8.7 Code Repositories

• [37] **BEVFormer Code**: [https://github.com/fundamentalvision/BEVFormer](https://github.com/fundamentalvision/BEVFormer)
• [38] **BEVDet Code**: [https://github.com/HuangJunJie2017/BEVDet](https://github.com/HuangJunJie2017/BEVDet)
• [39] **LSS Code**: [https://github.com/nv-tlabs/lift-splat-shoot](https://github.com/nv-tlabs/lift-splat-shoot)
• [40] **FIERY Code**: [https://github.com/wayveai/fiery](https://github.com/wayveai/fiery)
• [41] **CARLA Leaderboard**: [https://github.com/carla-simulator/leaderboard](https://github.com/carla-simulator/leaderboard)
• [42] **InterFuser Code**: [https://github.com/opendilab/InterFuser](https://github.com/opendilab/InterFuser)
• [43] **TCP Code**: [https://github.com/OpenPerceptionX/TCP](https://github.com/OpenPerceptionX/TCP)
• [44] **LBC Code**: [https://github.com/dotchen/LearningByCheating](https://github.com/dotchen/LearningByCheating)
• [45] **OpenPilot**: [https://github.com/commaai/openpilot](https://github.com/commaai/openpilot)
• [46] **Apollo**: [https://github.com/ApolloAuto/apollo](https://github.com/ApolloAuto/apollo)
• [47] **Autoware**: [https://github.com/autowarefoundation/autoware](https://github.com/autowarefoundation/autoware)
• [48] **CARLA Simulator**: [https://github.com/carla-simulator/carla](https://github.com/carla-simulator/carla)
• [49] **SUMO**: [https://github.com/eclipse/sumo](https://github.com/eclipse/sumo)
• [50] **AirSim Code**: [https://github.com/Microsoft/AirSim](https://github.com/Microsoft/AirSim)
• [51] **LGSVL**: [https://github.com/lgsvl/simulator](https://github.com/lgsvl/simulator)
• [52] **nuScenes**: [https://github.com/nutonomy/nuscenes-devkit](https://github.com/nutonomy/nuscenes-devkit)
• [53] **Waymo Dataset**: [https://github.com/waymo-research/waymo-open-dataset](https://github.com/waymo-research/waymo-open-dataset)
• [54] **KITTI**: [http://www.cvlibs.net/datasets/kitti/](http://www.cvlibs.net/datasets/kitti/)
• [55] **Cityscapes**: [https://github.com/mcordts/cityscapes-scripts](https://github.com/mcordts/cityscapes-scripts)

### 8.8 Tesla-Specific Resources

• **Tesla AI Day 2021**: [https://www.youtube.com/watch?v=j0z4FweCy4M](https://www.youtube.com/watch?v=j0z4FweCy4M)
• **Tesla AI Day 2022**: [https://www.youtube.com/watch?v=ODSJsviD_SU](https://www.youtube.com/watch?v=ODSJsviD_SU)
• **Tesla Autonomy Day 2019**: [https://www.youtube.com/watch?v=Ucp0TTmvqOE](https://www.youtube.com/watch?v=Ucp0TTmvqOE)
• **Andrej Karpathy's Talks**: [https://www.youtube.com/watch?v=hx7BXih7zx8](https://www.youtube.com/watch?v=hx7BXih7zx8)
• **Tesla FSD Beta Documentation**: [https://www.tesla.com/support/full-self-driving-beta](https://www.tesla.com/support/full-self-driving-beta)

### 8.9 Additional Lane Detection References

• [59] **Focal Loss**: [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002)
• [60] **Graph Neural Networks**: [LaneGCN: Learning Lane Graph Representations for Motion Forecasting](https://arxiv.org/abs/2005.03508)
• [61] **Scheduled Sampling**: [Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks](https://arxiv.org/abs/1506.03099)

**Tip**: When you turn this into slides, show one line connecting each paper to the specific Tesla design choice (e.g., Occupancy Networks → queryable MLP heads; VectorNet → lane adjacency matrix).

---

## Appendix: Example Tensor/IO Specifications

*Reusable tensor and I/O specifications for implementation*

• HydraNet input: 8×(H×W×3) → per-camera {P2,P3,P4} FPN maps.
• Fusion output: `BEV_feat` $\in \mathbb{R}^{C\times X\times Y}$ (optionally Z).
• Occupancy query: $f_\theta:(x,y,z)\mapsto (p_\text{occ}, \mathbf{s}_\text{sem})$.
• Lane instance: $\{(x_i,y_i)\}_{i=1..n}$ + spline params + edges in adjacency matrix.
• Planner candidates: $\{\boldsymbol{\tau}_k\}_{k=1..K}$, $\boldsymbol{\tau}_k=\{(x_t,y_t,\theta_t,v_t)\}_{t=1..T}$.
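
As a rough illustration of how the occupancy-query and planner-candidate specs above could fit together, the sketch below scores a few candidate trajectories by querying an occupancy function only at their waypoints. Everything here is assumed for the example: `toy_occupancy` is an analytic stand-in for the learned $f_\theta$, the cost $-\log(1 - p_\text{occ})$ is one plausible collision term rather than a documented one, and the candidates are simple straight paths at different lateral offsets.

```python
import math

def toy_occupancy(x, y, z):
    """Hypothetical stand-in for f_theta: soft obstacle centred at (10, 0, 0.5)."""
    dist = math.sqrt((x - 10.0) ** 2 + y ** 2 + (z - 0.5) ** 2)
    # Close to 1 inside a ~1.5 m radius, decaying smoothly to 0 outside
    return 1.0 / (1.0 + math.exp(4.0 * (dist - 1.5)))

def collision_cost(trajectory, occupancy_fn, z=0.5):
    """Sum of -log(1 - p_occ) over waypoints; higher means riskier."""
    cost = 0.0
    for x, y, _theta, _v in trajectory:
        p_occ = min(occupancy_fn(x, y, z), 1.0 - 1e-6)  # keep the log finite
        cost += -math.log(1.0 - p_occ)
    return cost

def straight_candidate(lateral_offset, length_m=20, speed=10.0):
    """Waypoints (x, y, theta, v) every metre along a straight path."""
    return [(float(x), lateral_offset, 0.0, speed) for x in range(length_m + 1)]

# Three candidates: through the obstacle, grazing it, and clear of it
candidates = [straight_candidate(offset) for offset in (0.0, 1.5, 3.0)]
costs = [collision_cost(traj, toy_occupancy) for traj in candidates]
best_index = min(range(len(candidates)), key=lambda k: costs[k])
print(costs, best_index)
```

Only 63 point queries are made here (3 candidates × 21 waypoints), which is the practical appeal of a queryable field over exporting a dense voxel grid.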