Image-to-Text Transduction with Spatial Self-Attention

Sebastian Springenberg, Egor Lakomkin, Cornelius Weber, Stefan Wermter
Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 43--48, April 2018.
Associated documents: ESANN_spatial_self_attention_final_.pdf [2 MB]
Attention mechanisms have been shown to improve recurrent encoder-decoder architectures in sequence-to-sequence learning scenarios. Recently, the Transformer model has been proposed, which applies only dot-product attention and omits recurrent operations to obtain a source-target mapping [5]. In this paper we show that the concepts of self- and inter-attention can effectively be applied in an image-to-text task. The encoder applies pre-trained convolution and pooling operations followed by self-attention to obtain an image feature representation. Self-attention combines image features of regions based on their similarity before they are made accessible to the decoder through inter-attention.
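The self-attention the abstract refers to is the scaled dot-product attention of the Transformer [5], applied here over the spatial positions of a convolutional feature map. A minimal NumPy sketch is given below; the region count and feature dimension are illustrative assumptions, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as in the Transformer [5]."""
    d_k = Q.shape[-1]
    # Pairwise similarity scores between all positions.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the last axis turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a similarity-weighted combination of the values.
    return weights @ V

# Self-attention: queries, keys, and values all come from the same
# image region features, e.g. the 49 positions of a 7x7 feature map
# (dimensions are assumed for illustration).
rng = np.random.default_rng(0)
regions = rng.standard_normal((49, 512))  # (num_regions, feature_dim)
attended = scaled_dot_product_attention(regions, regions, regions)
print(attended.shape)  # (49, 512)
```

In inter-attention the queries would instead come from the decoder states while keys and values come from these attended image features, which is what makes the encoder representation accessible to the decoder.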

 

@InProceedings{SLWW18,
  author       = "Springenberg, Sebastian and Lakomkin, Egor and Weber, Cornelius and Wermter, Stefan",
  title        = "Image-to-Text Transduction with Spatial Self-Attention",
  booktitle    = "Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN)",
  pages        = "43--48",
  month        = "Apr",
  year         = "2018",
  publisher    = "i6doc",
  url          = "https://www2.informatik.uni-hamburg.de/wtm/publications/2018/SLWW18/ESANN_spatial_self_attention_final_.pdf"
}
