A Survey on Recent Advances in Image Captioning

Image captioning, an interdisciplinary research field at the intersection of computer vision and natural language processing, has attracted extensive attention. It aims to produce reasonable and accurate natural language sentences that describe images, which requires the captioning model to recognize objects and describe their relationships accurately. Intuitively, it is difficult for a machine to attain the general image understanding ability of human beings; however, deep learning provides a basis for this exploration. In this review, we focus on recent deep learning methods for image captioning. We classify existing methods into categories and discuss each in turn; we also cover the related datasets and evaluation metrics, and outline promising future research directions.


Introduction
With the development of intelligent terminals and social network technology, multi-modal data (e.g., images, text, video) has increased dramatically, such as posts and articles on microblogs, online news, and advertisements.
Massive data is the basis of modern intelligent technology, especially data covering multiple modalities. If the information in different modalities can be aligned, it will undoubtedly power machine intelligence and help solve practical social problems. For human beings, translating between modalities is easy, because life experience and continuous practice give us the ability of multi-modal understanding. For machines, however, multi-modal translation remains a challenging task.
This paper focuses on the transformation from image to natural language sentence: image captioning. Compared with traditional computer vision tasks such as object tracking [1] and multi-modal retrieval [2], image captioning is more challenging: the model must further understand the scene, the objects in it, and the relationships between those objects in order to generate natural language descriptions that are both syntactically and semantically correct.
As an essential step toward machine intelligence, image captioning has wide application scenarios. For example, for people with visual impairments, such as blindness and color blindness, image captioning can help achieve visual understanding; in image retrieval, it enables more accurate content-based retrieval; and in intelligent robotics, it can be embedded into intelligent systems to give machines better visual understanding.
Thanks to the successful application and rapid development of deep learning, image captioning has achieved remarkable success in the past few years. Existing reviews summarize earlier work in this field well, but the most recent advances are not fully covered. This paper reviews the latest progress to fill this gap, focusing on recent advances and dividing the methods into different categories.

Attention-enhanced methods
To capture more fine-grained visual information during generation, attentive methods have been proposed to ground generated words on salient image parts. Xu et al. [6] were the first to introduce attention mechanisms for image captioning, proposing both a hard and a soft attention mechanism. Their method focuses on the most relevant image regions during generation, but one drawback stands out: it uses image features from a lower CNN layer, which may fail to capture high-level information. Jin et al. [7] extracted visual information about particular objects and predicted topic vectors of images, which were then used as contexts during generation. Rather than associating image features only with the decoder state, Pedersoli et al. [8] proposed an area-based attention mechanism that models associations between the generated words and the image regions. Lu et al. [9] observed that not all caption words have corresponding visual parts, so they introduced a novel adaptive attention mechanism that decides when to rely on image regions and when to rely only on the previously generated words of the language model. Anderson et al. [10] combined bottom-up and top-down attention for caption generation: the bottom-up module proposes a set of image regions, each represented by a pooled convolutional feature vector, and the top-down module calculates attention weights over these regions during generation. A traditional attention-based model aligns the attended image feature with one caption word at each time step; Huang et al. [11] therefore proposed to align image features and target words adaptively: a base attention model takes a single attention step per time step, and an adaptive attention module then learns how many attention steps to take.
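The soft attention mechanism underlying most of these methods can be sketched in a few lines. The following is a minimal numpy illustration of additive (Bahdanau-style) soft attention over region features; the weight matrices, dimensions, and variable names are illustrative, not taken from any particular paper.

```python
import numpy as np

def soft_attention(regions, hidden, W_r, W_h, w_a):
    """Soft attention over CNN region features: score each region against
    the decoder state, softmax the scores, and return the weighted sum."""
    # Additive score for region i: w_a . tanh(W_r r_i + W_h h)
    scores = np.tanh(regions @ W_r + hidden @ W_h) @ w_a   # shape (k,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over k regions
    context = weights @ regions                            # expected region feature
    return context, weights

# Toy example: 3 regions with 4-dim features, a 5-dim decoder hidden state
rng = np.random.default_rng(0)
regions = rng.normal(size=(3, 4))
hidden = rng.normal(size=5)
W_r = rng.normal(size=(4, 8))
W_h = rng.normal(size=(5, 8))
w_a = rng.normal(size=8)
context, weights = soft_attention(regions, hidden, W_r, W_h, w_a)
```

At each decoding step the context vector is recomputed from the current hidden state, so the decoder can ground each generated word on different image regions.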
The output of a conventional attention mechanism depends directly on the attention result, without modeling the relationship between the attended feature and the attention query. To tackle this problem, Huang et al. [12] further presented an attention-on-attention (AoA) module that models the relevance between the attention result and the attention query. From the attention result and the query, it first generates an information vector and an attention gate, then applies another attention (gating) operation to them to obtain the finally expected attended information.
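A minimal sketch of the AoA idea follows; the weight shapes and variable names are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_on_attention(query, att_result, W_i, b_i, W_g, b_g):
    """AoA sketch: concatenate the query and the attention result, derive
    an information vector and a sigmoid gate from the pair, and return the
    element-wise gated information as the attended output."""
    qa = np.concatenate([query, att_result])
    info = W_i @ qa + b_i            # information vector
    gate = sigmoid(W_g @ qa + b_g)   # attention gate, each element in (0, 1)
    return gate * info

rng = np.random.default_rng(1)
d = 4                                # feature size (illustrative)
query, att_result = rng.normal(size=d), rng.normal(size=d)
W_i, b_i = rng.normal(size=(d, 2 * d)), rng.normal(size=d)
W_g, b_g = rng.normal(size=(d, 2 * d)), rng.normal(size=d)
out = attention_on_attention(query, att_result, W_i, b_i, W_g, b_g)
```

Because the gate is conditioned on both the query and the attention result, irrelevant attention results can be suppressed before they reach the decoder.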

Semantic-enhanced methods
Semantic-enhanced methods exploit high-level semantic features of images for captioning, e.g., attributes, concepts, and structural semantics. Wu et al. [13] first introduced high-level semantics into the encoder-decoder framework and verified the effectiveness of explicit attribute representations for image captioning. Yao et al. [14] further presented a novel LSTM with attributes, injecting attribute information into the captioning model; they explore inter-attribute relationships via multiple instance learning and use attributes to complement image representations. Zhang et al. [15] proposed part-of-speech guidance: they use attention-based methods to exploit the generated part-of-speech information and adopt a multi-task learning strategy to facilitate training. Huang et al. [16] modeled the similarity between an attribute concept and an object and selected subsequent attributes to attend to during generation; in this method, each generated word is relevant to a sub-sequence of attributes, which enhances semantic explainability. Most recently, structural semantics have been explored. Chen et al. [17] used an Abstract Scene Graph (ASG) to decide what to say, and in how much detail, in the final description; an ASG is a directed graph of objects, attributes, and relations grounded in visual features, but it carries no explicit semantic labels. Yang et al. [18] proposed a scene graph auto-encoder to exploit the inductive bias of language as a prior: they represent the structural semantics of an image with a scene graph and learn a dictionary to reconstruct the caption.
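The common thread of attribute-based methods is injecting detected attribute probabilities into the decoder alongside the visual feature. A minimal sketch in that spirit, with fully illustrative weight names and dimensions:

```python
import numpy as np

def init_with_attributes(image_feat, attr_probs, W_v, W_a):
    """Fuse the global image feature with detected attribute probabilities
    to initialize the caption decoder state (a simplified sketch, not any
    paper's exact formulation)."""
    return np.tanh(W_v @ image_feat + W_a @ attr_probs)

rng = np.random.default_rng(2)
image_feat = rng.normal(size=6)         # global CNN feature
attr_probs = np.array([0.9, 0.1, 0.8])  # e.g. detector scores for 3 attributes
W_v, W_a = rng.normal(size=(5, 6)), rng.normal(size=(5, 3))
h0 = init_with_attributes(image_feat, attr_probs, W_v, W_a)
```

Conditioning the initial state on explicit attributes biases the decoder toward words supported by the detected semantics.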

Transformer-based methods
Image captioning models based on the transformer [19] have achieved great success. Li et al. [20] investigated a transformer-based model with only attention and feed-forward layers that exploits both visual and semantic information; technically, they apply a bilateral gating mechanism to control the propagation of the visual and semantic information. To capture higher-order interactions, Pan et al. [21] designed X-Linear attention blocks based on bilinear pooling, in which spatial- and channel-wise bilinear attention captures higher-order interactions among the input features. Cornia et al. [22] proposed a meshed-memory transformer for image captioning that models relationships among regions, learns a priori knowledge, and uses a meshed connection for decoding. Since previous transformer-based methods focus only on region-level information and neglect the global representation of an image, Ji et al. [23] leveraged intra- and inter-layer global representations in the transformer: their model abstracts a comprehensive global representation of the input image as guidance to help the decoder generate better captions.
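One simple way to obtain a global representation alongside region features, loosely in the spirit of the last approach above, is to prepend an extra token before self-attention. The sketch below uses a single projection-free scaled dot-product attention layer and initializes the global token as the mean region feature; both choices are simplifying assumptions for illustration.

```python
import numpy as np

def self_attention(X):
    """One projection-free scaled dot-product self-attention layer."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ X

def encode_with_global_token(region_feats):
    """Prepend a global token so that, after self-attention, its output
    row aggregates information from every region and can serve as a
    global guidance vector for the decoder."""
    global_tok = region_feats.mean(axis=0, keepdims=True)
    x = np.concatenate([global_tok, region_feats], axis=0)
    return self_attention(x)

rng = np.random.default_rng(3)
regions = rng.normal(size=(4, 8))        # 4 region features of dim 8
encoded = encode_with_global_token(regions)
```

Row 0 of `encoded` is the updated global representation; the remaining rows are the context-refined region features.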

Post-editing based methods
Traditional image captioning methods predict the target sentence word by word in a single pass. However, such single-pass decoding can neither correct previously generated words nor exploit a global semantic view of the sentence. Inspired by post-editing [24], some works adopt a two-pass decoding framework for caption generation, producing a refined caption from the first-pass caption together with the image features. Guo et al. [25] proposed a Ruminant Network, in which a ruminant decoder refines the captions first generated by a base decoder to produce new, more comprehensive captions. Sammani et al. [26] further employed a copy mechanism to make the best of the existing caption; their method can copy or fix details of existing captions. Most recently, Song et al. [27] presented a Context-Aware Auxiliary Guidance (CAAG) mechanism that provides global semantic guidance for the second-pass generation: the pre-generated caption is encoded as semantic features, and CAAG attends to these features to select useful information for generating the refined caption.
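The shared pipeline of these two-pass methods can be expressed abstractly. In the sketch below, the three callables are stand-ins for neural components (base decoder, caption encoder, refining decoder); the toy lambdas and their outputs are purely hypothetical.

```python
def two_pass_caption(image_feats, base_decoder, refine_decoder, encode_caption):
    """Generic two-pass (post-editing) captioning pipeline: generate a
    draft, encode it into global semantic features, then let a second
    decoder refine the draft conditioned on both image and draft."""
    draft = base_decoder(image_feats)              # first pass
    draft_sem = encode_caption(draft)              # global semantics of draft
    return refine_decoder(image_feats, draft_sem)  # second pass

# Toy stand-in components (real systems would use neural decoders)
feats = [0.1, 0.9]
base = lambda f: ["a", "dog", "running"]
enc = lambda words: {"length": len(words)}
refine = lambda f, sem: (["a", "dog", "running", "outdoors"]
                         if sem["length"] < 5 else ["a", "dog"])
caption = two_pass_caption(feats, base, refine, enc)
```

The key design point is that the second decoder sees a global encoding of the whole draft, which single-pass decoding never has access to.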

Vision-language pre-training
Inspired by the success of pre-trained models, Vision-Language Pre-training (VLP) has been proposed. Zhou et al. [28] presented a unified VLP model, pre-trained on a large number of image-text pairs, that can be fine-tuned for both vision-language generation and understanding tasks. They verified that the pre-training stage significantly speeds up learning and improves image captioning performance. However, their model simply concatenates object features and image region features and uses self-attention to learn alignments between them. Li et al. [29] therefore proposed to use the object tags of images as anchor points to ease alignment learning, achieving new state-of-the-art performance on image captioning. Methods that pre-train on image-caption pairs fail to generate captions for novel objects, so Hu et al. [30] broke this dependency and pre-trained vision-language alignments on image-tag pairs, boosting performance on both novel object captioning and the general captioning task.

Datasets
Flickr8k [31], Flickr30k [32], and MS COCO [33] are widely used datasets for image captioning; we compare them in Table 1.
Flickr8k. A dataset of 8,000 images, each annotated with five natural language descriptions collected via the Amazon Mechanical Turk crowd-sourcing service. The images mainly depict humans and animals, and each sentence describes the salient entities and events of the corresponding image.
Flickr30k. An extension of Flickr8k containing 31,783 images in total. It further provides comprehensive ground-truth correspondences between image regions and caption phrases.
MS COCO. The most popular dataset for image captioning, containing 123,287 images, each annotated with at least five caption sentences. It also provides an online test service over an online test split of 40,775 images, whose captions are not publicly available.

Evaluation metrics
The widely used automatic evaluation metrics include BLEU [34], METEOR [35], CIDEr [36], ROUGE [37], and SPICE [38]; for all of them, higher scores indicate better captions. BLEU measures the degree of n-gram overlap between the generated sentence and the corresponding ground-truth sentences (annotated by humans). Compared with BLEU, METEOR captures semantic correlation better and agrees more closely with human judgments; specifically, it uses WordNet to take stems and synonyms into consideration. CIDEr measures human consensus: it uses term frequency-inverse document frequency (TF-IDF) weights of the generated and ground-truth sentences to compute a cosine-similarity score. ROUGE measures overlapping units such as word sequences, word pairs, and n-grams between the generated and ground-truth sentences. SPICE is based on semantic concepts: it transforms both the generated and the ground-truth sentences into an intermediate representation and measures how well the generated caption recovers objects, attributes, and the relations between them.
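The core of BLEU is modified (clipped) n-gram precision, which is easy to compute directly. The sketch below handles a single reference for clarity; real BLEU additionally combines precisions over n = 1..4 and applies a brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: the count of each candidate n-gram is
    clipped by its count in the reference before computing precision."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = max(sum(cand.values()), 1)
    return clipped / total

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
p1 = ngram_precision(cand, ref, 1)  # 5 of 6 unigrams matched -> 5/6
p2 = ngram_precision(cand, ref, 2)  # 3 of 5 bigrams matched  -> 3/5
```

Clipping prevents a degenerate candidate such as "the the the ..." from scoring highly by repeating a common reference word.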

Future research directions
In recent years, image captioning has achieved great success. In our view, however, several directions are still worth exploring. First, existing methods rely heavily on caption-annotated images and make little use of unannotated ones; unsupervised learning is therefore a promising way to address this problem. Second, most current approaches generate captions whose phrases are not well aligned with the image elements, so more grounded image captioning should be further explored. Last but not least, incorporating external knowledge is a promising avenue, since it can help the model generate more informative sentences and enhance knowledge reasoning in the generated captions.

Conclusion
In this paper, we have reviewed advanced image captioning methods and classified them into different categories, discussing representative methods of each. We have also briefly introduced the evaluation metrics and widely used datasets, and discussed future research directions for image captioning.