In temporal action localization, the goal is to predict, for a given input video, which actions are present and their temporal boundaries. Many powerful models have been proposed over the years, with transformer-based models achieving state-of-the-art performance in recent months. Although novel models are becoming increasingly accurate, authors rarely study how their models perform when training data or computational resources are limited. This study examines TriDet, a transformer-based temporal action localization model that achieves state-of-the-art results on two different benchmarks, and evaluates its behavior under limited training data and limited computational power. TriDet is found to achieve close to state-of-the-art performance when trained on only 60% of the training data, or roughly 90 action instances per class. Notably, inference time, memory usage, multiply-accumulate operations, and GPU utilization all scale linearly with the length of the tensor passed to the model. These findings, combined with TriDet's mean training time of 11 minutes on the THUMOS'14 dataset, can be used to estimate the model's behavior when run in lower-compute environments.
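The linear-scaling observation can be illustrated with a back-of-the-envelope compute estimate. The sketch below counts multiply-accumulate operations (MACs) for a generic 1-D convolutional feature pyramid as a function of input sequence length; the channel counts, kernel size, and number of pyramid levels are illustrative assumptions and do not reflect TriDet's actual architecture.

```python
def conv1d_macs(seq_len: int, in_ch: int, out_ch: int, kernel: int) -> int:
    """MACs for one 1-D convolution with 'same' padding.

    Each of the seq_len output positions needs in_ch * kernel
    multiply-accumulates per output channel, so the count is
    linear in seq_len.
    """
    return seq_len * out_ch * in_ch * kernel


def pyramid_macs(seq_len: int, channels: int = 256,
                 kernel: int = 3, levels: int = 4) -> int:
    """Total MACs for a hypothetical feature pyramid.

    Each level halves the temporal resolution, so the total is a
    geometric sum that remains linear in the input length.
    """
    total = 0
    length = seq_len
    for _ in range(levels):
        total += conv1d_macs(length, channels, channels, kernel)
        length //= 2
    return total


if __name__ == "__main__":
    # Doubling the input length doubles the estimated compute.
    for n in (256, 512, 1024):
        print(n, pyramid_macs(n))
```

Because every term in the sum is proportional to the sequence length, doubling the input tensor length exactly doubles the MAC count in this sketch, which is consistent with the linear scaling of inference cost reported above.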