Weak supervision data regarding a target image is obtained and used to provide detailed information that supplements the global image concepts derived for image captioning. Weak supervision data refers to noisy data that is not closely curated and may include errors. Given a target image, weak supervision data for visually similar images may be collected from sources of weakly annotated images, such as online social networks; images posted online generally carry "weak" annotations in the form of tags, titles, labels, and short descriptions added by users. Weak supervision data for the target image is generated by extracting keywords from the visually similar images discovered across these sources. Separate independent claims cover feature extraction, the use of convolutional neural networks (CNNs), and a semantic attention model using weighted keywords. The methods may also employ a language processing model.
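The claimed semantic attention model could be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes dot-product relevance scoring and count-based keyword weights (e.g., how often a keyword appears among the visually similar images), neither of which is specified in the abstract. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def semantic_attention(image_feat, keyword_embs, keyword_weights):
    """Blend keyword embeddings into a context vector, attending more
    strongly to keywords that are both relevant to the global image
    feature and heavily weighted by the weak supervision data."""
    # Relevance of each keyword to the global image feature (dot product).
    scores = keyword_embs @ image_feat
    # Scale relevance by the keyword's weight (assumed here to be its
    # occurrence count among visually similar images), then normalize.
    attn = softmax(scores * keyword_weights)
    # Context vector: attention-weighted sum of keyword embeddings.
    return attn @ keyword_embs

# Toy example: a 4-dim image feature and three extracted keywords.
image_feat = np.array([0.9, 0.1, 0.0, 0.2])
keyword_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g., "beach"
    [0.0, 1.0, 0.0, 0.0],   # e.g., "sunset"
    [0.0, 0.0, 1.0, 0.0],   # e.g., "dog"
])
keyword_weights = np.array([3.0, 1.0, 1.0])  # "beach" seen in 3 similar images
context = semantic_attention(image_feat, keyword_embs, keyword_weights)
```

In a full captioning pipeline, such a context vector would be combined with the CNN image feature at each decoding step of a language model; here it simply demonstrates how keyword weights bias attention toward the best-supported concepts.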