HyenaPixel: Global Image Context with Convolutions
In vision tasks, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, convolution requires multiple stacked layers and a hierarchical structure for large context. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to the non-causal two-dimensional image space. We scale the Hyena convolution kernels beyond the feature map size up to 191$\times$191 to maximize the ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel, and bidirectional Hyena into the MetaFormer framework. For image categorization, HyenaPixel and bidirectional Hyena achieve a competitive ImageNet-1k top-1 accuracy of 83.0% and 83.5%, respectively, while outperforming other large-kernel networks. Combining HyenaPixel with attention further increases accuracy to 83.6%. We attribute the success of attention to the lack of spatial bias in later stages and support this finding with bidirectional Hyena.
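The abstract's claim of sub-quadratic complexity despite kernels larger than the feature map rests on evaluating long convolutions in the frequency domain, where the cost is O(HW log HW) in the number of pixels HW regardless of kernel size. The paper's actual implementation is not shown here; the following is a minimal NumPy sketch of the general FFT-convolution idea, with function name and cropping convention chosen for illustration.

```python
import numpy as np

def fft_conv2d(x, k):
    """2D linear convolution of feature map x with kernel k via FFT.

    Cost is dominated by the FFTs, O(HW log HW) in the number of
    pixels, so a kernel larger than the feature map (e.g. 191x191
    on a 56x56 map) adds no asymptotic overhead.
    """
    H, W = x.shape
    kh, kw = k.shape
    # Pad both inputs to the full linear-convolution size so the
    # circular FFT convolution equals the linear one.
    sh, sw = H + kh - 1, W + kw - 1
    X = np.fft.rfft2(x, s=(sh, sw))
    K = np.fft.rfft2(k, s=(sh, sw))
    y = np.fft.irfft2(X * K, s=(sh, sw))
    # Crop to a "same"-sized output centered on the input.
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return y[top:top + H, left:left + W]
```

With an identity kernel (a single 1 at the kernel center), the output reproduces the input even when the kernel is larger than the feature map, which is the regime the paper exploits.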
| Document Type: | Preprint |
|---|---|
| Language: | English |
| Author: | Julian Spravil, Sebastian Houben, Sven Behnke |
| Number of pages: | 13 |
| ArXiv Id: | http://arxiv.org/abs/2402.19305 |
| Publisher: | arXiv |
| Date of first publication: | 2024/02/29 |
| Funding: | This research has been funded by the Federal Ministry of Education and Research of Germany under grant no. 01IS22094E WEST-AI. |
| Departments, institutes and facilities: | Fachbereich Informatik; Institut für Technik, Ressourcenschonung und Energieeffizienz (TREE) |
| Dewey Decimal Classification (DDC): | 000 Computer science, information & general works / 000 Computer science, knowledge & systems / 004 Data processing; computer science |
| Entry in this database: | 2024/03/05 |