Outlier Tokens Drive Attention Patterns in Vision Transformers
DOI:
https://doi.org/10.5269/bspm.82398
Abstract
Vision Transformers (ViTs) are widely used in computer vision because they can capture global image information. However, ViT models share a persistent problem: a small number of tokens in their feature representations take abnormally high values in background regions. These tokens accumulate global information while losing essential local detail, which produces attention maps with sharp peaks that do not correspond to meaningful parts of the image. The problem degrades the model's spatial understanding and hurts performance on tasks that require accurate region-based reasoning. It appears across supervised, self-supervised, and text-supervised settings, and also in small ViT models, which points to a clear gap in current systems. This paper identifies this gap and proposes improved transformer-based models that reduce the effect of these outlier tokens and help preserve spatial information. The enhancements focus on stabilizing token behavior and supporting a better distribution of attention across the image without relying heavily on background regions. Experimental results show reduced abnormal token activity and smoother, more meaningful attention maps. The improved models also perform better on tasks that depend on spatial accuracy. The aim of this work is to make ViTs more reliable, interpretable, and consistent. By improving spatial reasoning and avoiding misleading attention patterns, the proposed models support stronger and more trustworthy visual understanding in real-world applications.
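The outlier tokens described in the abstract can be illustrated with a small sketch. This is not the authors' method: it assumes that "abnormally high values" can be approximated by an unusually large L2 norm per token, and it flags such tokens with a robust z-score. The function name `find_outlier_tokens`, the threshold of 5.0, and the synthetic 14x14 token grid are all hypothetical choices made for illustration.

```python
import numpy as np

def find_outlier_tokens(tokens, z_thresh=5.0):
    """Return indices of tokens whose L2 norm is unusually large.

    tokens: array of shape (num_tokens, dim).
    The robust z-score and the threshold are illustrative assumptions,
    not a criterion taken from the paper.
    """
    norms = np.linalg.norm(tokens, axis=1)
    median = np.median(norms)
    # Median absolute deviation: a spread estimate that the outliers
    # themselves cannot inflate much.
    mad = np.median(np.abs(norms - median))
    # 1.4826 rescales the MAD to match a standard deviation for Gaussian data.
    robust_z = (norms - median) / (1.4826 * mad + 1e-12)
    return np.flatnonzero(robust_z > z_thresh)

# Synthetic stand-in for ViT patch features: 196 tokens (14x14 grid), 64 dims.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))
tokens[[5, 120]] *= 10.0  # inject two artificially high-norm tokens

outliers = find_outlier_tokens(tokens)
print(outliers)
```

On real models, the same check would be applied to the patch tokens of an intermediate or final layer, and the flagged positions could then be compared against background regions of the input image.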
License
Copyright 2026 Boletim da Sociedade Paranaense de Matemática

This work is licensed under a Creative Commons Attribution 4.0 International License.
When the manuscript is accepted for publication, the authors automatically agree to transfer the copyright to the Sociedade Paranaense de Matemática (SPM).
The journal uses the Creative Commons Attribution license (CC BY 4.0).



