Outlier Tokens Drive Attention Patterns in Vision Transformers

Authors

  • Nagini S, VNR Vignana Jyothi Institute of Engineering and Technology
  • Karnam Akhil
  • Mallupeddi Vamsi Krishna
  • Pathi Sairoop Teja
  • Swapnika Chowdary Thanikonda

DOI:

https://doi.org/10.5269/bspm.82398

Abstract

Vision Transformers (ViTs) are widely used in computer vision because they can capture global image information. However, these models share a persistent problem: a number of tokens in their feature representations take abnormally high values in background regions. These tokens capture global information while losing essential local details, which produces attention maps with sharp peaks that do not correspond to meaningful parts of the image. The problem substantially degrades the model's spatial understanding and hurts performance on tasks that require accurate region-based reasoning. It appears across supervised, self-supervised, and text-supervised settings, and also in small ViT models, revealing a clear gap in current systems. This paper identifies this gap and proposes improved transformer-based models that reduce the effect of these outlier tokens and help preserve spatial information. The enhancements focus on stabilizing token behavior and supporting a better attention distribution across the image without relying heavily on background regions. Experimental results show a reduction in abnormal token activity, with smoother and more meaningful attention maps. The improved models also perform better on tasks that depend on spatial accuracy. The aim of this work is to make ViTs more reliable, interpretable, and consistent. By improving spatial reasoning and avoiding misleading attention patterns, the proposed models support stronger and more trustworthy visual understanding in real-world applications.
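
The "abnormally high values" described in the abstract can be illustrated with a small sketch. This is not the authors' method: it assumes, purely for illustration, that an outlier token is one whose L2 norm exceeds a fixed multiple of the median token norm. The function name `find_outlier_tokens`, the `ratio` parameter, and the synthetic 14x14 patch grid are all hypothetical.

```python
import numpy as np


def find_outlier_tokens(tokens, ratio=3.0):
    """Return indices of tokens whose L2 norm exceeds ratio * median norm.

    tokens: array of shape (num_tokens, feature_dim).
    The ratio-of-median rule is an illustrative assumption, not the paper's criterion.
    """
    norms = np.linalg.norm(tokens, axis=1)      # one norm per token
    threshold = ratio * np.median(norms)        # robust reference scale
    return np.flatnonzero(norms > threshold), norms


# Synthetic stand-in for ViT patch features: 196 tokens (14x14 grid), 64 dims.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))
tokens[[5, 100]] *= 10.0                        # inject two artificial high-norm tokens

outliers, norms = find_outlier_tokens(tokens)
print(outliers)                                 # indices of the flagged tokens
```

On real models, `tokens` would be the patch features taken from a transformer layer, and the flagged indices could be compared against the peaks in the attention map.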

Published

2026-06-19

Section

Conf. Issue: Recent Trends in Mathematical Sciences and Technological Applic.

How to Cite

S, N., Karnam Akhil, Mallupeddi Vamsi Krishna, Pathi Sairoop Teja, & Swapnika Chowdary Thanikonda. (2026). Outlier Tokens Drive Attention Patterns in Vision Transformers. Boletim Da Sociedade Paranaense De Matemática, 44(17), 1-32. https://doi.org/10.5269/bspm.82398