Location information is essential for the ViT model. Image data carries three types of location information: absolute location, relative direction, and relative distance. Various position embedding methods have been used to introduce location information into the ViT model, including absolute position embeddings, relative position embeddings, fixed sinusoidal position embeddings, and learnable Fourier position embeddings. However, it is unclear what type of location information each of these methods can encode. This paper investigates this question through fully controlled experiments and feature-level analysis on synthetic datasets. The results suggest that relative position embeddings cannot encode absolute location information, which leads to inferior performance. All the position embedding approaches we test can encode relative location information, but they carry different levels of relative-location bias. Learnable absolute position embeddings contain no relative-location bias and therefore need more data to learn. The fixed sinusoidal and learnable Fourier position embeddings perform relatively better, but each has a minor drawback: the fixed sinusoidal embeddings are not trainable, while the Fourier method carries little bias toward relative location information. We propose making the fixed sinusoidal position embeddings learnable and using pretraining tasks to improve the Fourier method. Both new approaches show promising results on the test datasets and are competitive with a similar existing approach.
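To make the first proposed approach concrete, the minimal sketch below initializes a position-embedding table with standard fixed sinusoidal values and registers it as a trainable parameter, which is one way to realize learnable sinusoidal position embeddings. The module name, patch count, and embedding width are illustrative assumptions, not details taken from the paper.

```python
# Sketch (assumption): sinusoidal initialization with a trainable table.
import math
import torch
import torch.nn as nn


def sinusoidal_table(num_positions: int, dim: int) -> torch.Tensor:
    """Standard fixed sinusoidal position embeddings over flattened patch indices."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    table = torch.zeros(num_positions, dim)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table


class LearnableSinusoidalPosEmbed(nn.Module):
    """Sinusoidal initialization, but the embeddings stay trainable (hypothetical module)."""

    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        # Using nn.Parameter (instead of a fixed buffer) makes the table learnable.
        self.pos_embed = nn.Parameter(sinusoidal_table(num_patches, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        return patch_tokens + self.pos_embed.unsqueeze(0)


# Usage: add to patch embeddings before the transformer encoder.
tokens = torch.randn(2, 196, 768)            # e.g. 14x14 patches, ViT-Base width
pos = LearnableSinusoidalPosEmbed(196, 768)
out = pos(tokens)                             # same shape: (2, 196, 768)
```

The only change relative to fixed sinusoidal embeddings is that the table is a trainable parameter, so it starts with the sinusoidal structure but can adapt during training.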