It's the Adaptation, Not the Architecture: Pretrained Vision Transformers Are Competitive for End-to-End Steering on Small Driving Data

Asaria, Salomone, Gandhi · June 18, 2026 · VISION

A DINO-pretrained ViT-S is competitive with a pretrained ResNet-50 at end-to-end steering-angle prediction on a small slice (~5–16k frames) of the comma2k19 driving dataset — turn-slice Pearson 0.964 vs 0.967. Competitiveness is conditional on adaptation (low-LR full fine-tuning, a cost-sensitive loss, dropping flip augmentation); the practitioner defaults of a frozen linear probe and horizontal flips reproduce the usual "ViTs don't work at small scale" conclusion.
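The headline metric, turn-slice Pearson, can be illustrated with a minimal sketch. The summary above does not define the turn slice, so this example assumes it means the frames whose ground-truth steering angle exceeds a magnitude threshold. The function names and the 5.0 threshold are hypothetical and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def turn_slice_pearson(pred, target, min_abs_angle=5.0):
    """Pearson correlation restricted to frames where the car is turning.

    ASSUMPTION: a frame counts as a turn when |ground-truth angle| is at
    least min_abs_angle. The threshold value is illustrative only.
    """
    pairs = [(p, t) for p, t in zip(pred, target) if abs(t) >= min_abs_angle]
    if len(pairs) < 2:
        raise ValueError("fewer than 2 frames fall in the turn slice")
    preds, targets = zip(*pairs)
    return pearson(preds, targets)


if __name__ == "__main__":
    # Toy values: straight-road frames (small angles) are filtered out.
    target = [0.0, 1.0, -10.0, 20.0, -30.0, 0.5]
    pred = [5.0, -3.0, -5.0, 10.0, -15.0, 2.0]
    print(turn_slice_pearson(pred, target))
```

Restricting the metric to turns matters because most driving frames are near-straight, so a model that always predicts a near-zero angle can look good on error averaged over all frames while failing on the frames that matter.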
