Image-to-Text Transduction with Spatial Self-Attention
Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 43--48, Apr 2018.
Attention mechanisms have been shown to improve recurrent encoder-decoder architectures in sequence-to-sequence learning scenarios. Recently, the Transformer model was proposed, which relies solely on dot-product attention and omits recurrent operations to obtain a source-target mapping [5]. In this paper we show that the concepts of self- and inter-attention can be applied effectively in an image-to-text task. The encoder applies pre-trained convolution and pooling operations followed by self-attention to obtain an image feature representation. Self-attention combines image features of regions based on their similarity before they are made accessible to the decoder through inter-attention.
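
As a rough illustration of the self-attention step described in the abstract, below is a minimal NumPy sketch, not the authors' implementation: it treats a hypothetical 7x7 grid of 512-dimensional CNN features as 49 region vectors and applies scaled dot-product attention [5], so each output vector is a similarity-weighted combination of all region features. The grid size, feature dimension, and function names are illustrative assumptions.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Scaled dot-product attention as in the Transformer [5]."""
        d_k = Q.shape[-1]
        # Pairwise similarity scores between regions, scaled by sqrt(d_k).
        scores = Q @ K.T / np.sqrt(d_k)
        # Numerically stable softmax over the region axis.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # Each output is a similarity-weighted mix of the value vectors.
        return weights @ V

    # Hypothetical shapes: a 7x7 spatial grid of 512-d features from a
    # pre-trained convolutional encoder, flattened into 49 region vectors.
    H, W, d = 7, 7, 512
    features = np.random.randn(H * W, d)

    # Self-attention: queries, keys, and values all come from the same
    # image regions; inter-attention would instead take queries from the
    # decoder states while keys and values remain the image regions.
    attended = scaled_dot_product_attention(features, features, features)
    print(attended.shape)  # (49, 512)
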
@InProceedings{SLWW18,
  author    = "Springenberg, Sebastian and Lakomkin, Egor and Weber, Cornelius and Wermter, Stefan",
  title     = "Image-to-Text Transduction with Spatial Self-Attention",
  booktitle = "Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN)",
  pages     = "43--48",
  month     = "Apr",
  year      = "2018",
  publisher = "i6doc",
  url       = "https://www2.informatik.uni-hamburg.de/wtm/publications/2018/SLWW18/ESANN_spatial_self_attention_final_.pdf"
}