HyenaPixel: Global Image Context with Convolutions

In vision tasks, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, convolution requires multiple stacked layers and a hierarchical structure for large context. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to the non-causal two-dimensional image space. We scale the Hyena convolution kernels beyond the feature map size up to 191$\times$191 to maximize the ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel, and bidirectional Hyena into the MetaFormer framework. For image categorization, HyenaPixel and bidirectional Hyena achieve a competitive ImageNet-1k top-1 accuracy of 83.0% and 83.5%, respectively, while outperforming other large-kernel networks. Combining HyenaPixel with attention further increases accuracy to 83.6%. We attribute the success of attention to the lack of spatial bias in later stages and support this finding with bidirectional Hyena.
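The abstract does not spell out how a kernel larger than the feature map stays sub-quadratic in the number of pixels. A common route, and the one the original Hyena operator uses for long sequences, is to compute the convolution in the frequency domain. The following is a minimal NumPy sketch of that idea for a single channel, assuming zero padding and an odd-sized kernel. The function name `fft_conv2d_same` and the sizes used are illustrative and are not taken from the paper or its code.

```python
import numpy as np

def fft_conv2d_same(x, k):
    """Non-causal 2D convolution of x (H, W) with an odd-sized kernel k (Kh, Kw).

    Zero padding, output has the same shape as x. The kernel may be larger
    than the input. Illustrative sketch only, not the HyenaPixel implementation.
    """
    H, W = x.shape
    Kh, Kw = k.shape
    # Pad both operands to the full linear-convolution size so that the
    # circular convolution implied by the FFT does not wrap around.
    Fh, Fw = H + Kh - 1, W + Kw - 1
    X = np.fft.rfft2(x, s=(Fh, Fw))
    K = np.fft.rfft2(k, s=(Fh, Fw))
    full = np.fft.irfft2(X * K, s=(Fh, Fw))
    # Crop the region aligned with the kernel center ("same" output).
    top, left = (Kh - 1) // 2, (Kw - 1) // 2
    return full[top:top + H, left:left + W]

# Usage: a kernel roughly twice the size of the feature map, so every
# output pixel sees every input pixel.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((15, 15))
y = fft_conv2d_same(x, k)  # y has the same shape as x
```

With N pixels and a kernel of comparable size, the FFT route costs on the order of N log N operations, whereas evaluating the same convolution directly costs on the order of N times the kernel area.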

Metadata
Document Type: Preprint
Language: English
Author: Julian Spravil, Sebastian Houben, Sven Behnke
Number of pages: 13
ArXiv Id: http://arxiv.org/abs/2402.19305
Publisher: arXiv
Date of first publication: 2024/02/29
Funding: This research has been funded by the Federal Ministry of Education and Research of Germany under grant no. 01IS22094E WEST-AI.
Departments, institutes and facilities: Fachbereich Informatik; Institut für Technik, Ressourcenschonung und Energieeffizienz (TREE)
Dewey Decimal Classification (DDC): 0 Computer science, information and general works / 00 Computer science, knowledge and systems / 004 Data processing; computer science
Entry in this database: 2024/03/05