Feature-based Cross-architecture Self-supervised Knowledge Distillation
Abstract
In open-world scenarios for autonomous vehicles (AVs), previously unseen classes may arise. Addressing this requires extracting features that generalize well to AV downstream tasks, especially zero-shot learning. Transformers are well suited to this, and the Swin Transformer in particular, as it serves as the vision backbone of many Vision-Language Models. However, to enable on-board applications, knowledge distillation must be used to obtain a lightweight model capable of real-time processing. We explore self-supervised knowledge distillation, since AV models must generalize to previously unseen classes. Our contributions are twofold: we adapt existing CNN-to-CNN output-based self-supervised knowledge distillation algorithms to the Transformer-to-CNN setting for benchmarking, and we enhance them with a cross-architecture loss function. Leveraging DisCo, the best-performing output-based self-supervised knowledge distillation method, with EfficientNetB0 as the student model, we achieve a 3.9% relative improvement in top-1 accuracy over the supervised Swin-T teacher on our modified ImageNet for open-world classification, rising to 5.0% with our loss.
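The core idea of output-based Transformer-to-CNN distillation can be illustrated with a minimal sketch: the student's embedding is projected into the teacher's embedding space and trained to align with the teacher's output. This is a hedged, generic illustration using NumPy; the function names, the cosine-based loss, and the feature dimensions (1280 for EfficientNetB0, 768 for Swin-T) are assumptions for demonstration and do not reproduce the paper's actual DisCo-based loss or the proposed cross-architecture loss.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize each row to unit length (stabilized with eps)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def output_distillation_loss(student_emb, teacher_emb, proj):
    """Hypothetical output-based distillation loss: project the
    student embedding into the teacher's dimension, then penalize
    cosine dissimilarity between the two (0 = perfectly aligned)."""
    projected = student_emb @ proj            # (batch, d_teacher)
    s = l2_normalize(projected)
    t = l2_normalize(teacher_emb)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy batch: assumed feature dims for an EfficientNetB0 student (1280)
# and a Swin-T teacher (768); the projection bridges the two spaces.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 1280))
teacher = rng.normal(size=(4, 768))
proj = rng.normal(size=(1280, 768))
loss = output_distillation_loss(student, teacher, proj)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the projection head would be learned jointly with the student, and the teacher's weights would stay frozen, so that only the lightweight student is deployed on-board.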
Files
File under embargo until 28-10-2026