Towards personalized image captioning via multimodal memory networks

We address personalized image captioning, which generates a descriptive sentence for a user's image, accounting for prior knowledge such as her active vocabulary or writing style in her previous documents. As applications of personalized image captioning, we solve two post automation tasks in social networks: hashtag prediction and post generation. Hashtag prediction produces a list of hashtags for an image, while post generation creates natural text consisting of normal words, emojis, and even hashtags. We propose a novel personalized captioning model named Context Sequence Memory Network (CSMN). Its unique updates over existing memory networks include: (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words into memory to capture long-term information, and (iii) adopting a CNN memory structure to jointly represent nearby ordered memory slots for better context understanding. For evaluation, we collect a new dataset, InstaPIC-1.1M, comprising 1.1M Instagram posts from 6.3K users. We further use the benchmark YFCC100M dataset to validate the generality of our approach. With quantitative evaluation and user studies via Amazon Mechanical Turk, we show that the three novel features of the CSMN help enhance the performance of personalized image captioning over state-of-the-art captioning models.
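
The three memory updates named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, only a minimal illustration under stated assumptions: the class `ContextMemory`, the function `conv_over_slots`, the two-dimensional vectors, the dot-product attention, and the fixed averaging filter (standing in for learned CNN filters) are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ContextMemory:
    """Toy ordered memory holding several context types (hypothetical)."""

    def __init__(self, image_slots, user_context_slots):
        # (i) a single repository for multiple types of context information
        self.slots = [list(s) for s in image_slots]
        self.slots += [list(s) for s in user_context_slots]

    def append_word(self, word_vec):
        # (ii) each previously generated word is written back into memory
        self.slots.append(list(word_vec))

    def attend(self, query):
        # Assumption: dot-product similarity followed by softmax weighting.
        weights = softmax([dot(query, s) for s in self.slots])
        return [[w * x for x in s] for w, s in zip(weights, self.slots)]


def conv_over_slots(weighted_slots, window=3):
    # (iii) represent nearby ordered slots jointly: average every window of
    # adjacent slots (a fixed, unlearned filter), then max-pool over positions.
    window = min(window, len(weighted_slots))
    dim = len(weighted_slots[0])
    positions = range(len(weighted_slots) - window + 1)
    feats = [
        [sum(s[d] for s in weighted_slots[i:i + window]) / window
         for d in range(dim)]
        for i in positions
    ]
    return [max(f[d] for f in feats) for d in range(dim)]


memory = ContextMemory(
    image_slots=[[1.0, 0.0]],
    user_context_slots=[[0.0, 1.0], [0.5, 0.5]],
)
memory.append_word([1.0, 1.0])            # word generated at the previous step
weighted = memory.attend(query=[1.0, 0.0])
summary = conv_over_slots(weighted, window=3)
```

In this sketch, `summary` would feed the prediction of the next word, which would then be appended to memory in turn. A real model would learn the embeddings and convolution filters rather than fixing them as done here.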