Techniques are generally described for automatic scoring of alt-text for image data. In various examples, first image data and first text data describing the first image data may be received. A feature representation of the first image data may be determined using an encoder machine learning model. A hidden state representation may be determined using a decoder machine learning model based on the feature representation and a first word of the first text data. In some examples, a first score may be determined using the hidden state representation. The first score may include an indication of a descriptive capability of the first text data with respect to the first image data.
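As an illustrative, non-limiting sketch, the flow described above (encoder, then decoder, then score) may be expressed as follows. Everything specific in this example is an assumption made for illustration and is not taken from the source: the dimensions, the randomly initialized weights standing in for trained parameters, the single-step tanh decoder cell, the sigmoid scoring head, and the helper names `encode_image`, `embed_word`, `decode_step`, and `score_alt_text`.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

FEATURE_DIM = 4  # size of the image feature representation (assumed)
HIDDEN_DIM = 3   # size of the decoder hidden state (assumed)
EMBED_DIM = 2    # size of the toy word embedding (assumed)


def random_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights (stand-in for trained weights)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical parameters; a real system would learn these from data.
W_ENC = random_matrix(FEATURE_DIM, 2)            # encoder projection
W_FEAT = random_matrix(HIDDEN_DIM, FEATURE_DIM)  # decoder weights on image features
W_WORD = random_matrix(HIDDEN_DIM, EMBED_DIM)    # decoder weights on the word embedding
W_SCORE = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]  # scoring head


def encode_image(pixels):
    """Toy encoder: summarize pixel intensities (mean, max) and project to a feature vector."""
    flat = [p for row in pixels for p in row]
    summary = [sum(flat) / len(flat), max(flat)]
    return [math.tanh(v) for v in matvec(W_ENC, summary)]


def embed_word(word):
    """Toy deterministic word embedding derived from character codes."""
    total = sum(ord(c) for c in word.lower())
    return [(total % 7) / 7.0, (total % 11) / 11.0]


def decode_step(features, word):
    """Toy decoder step: hidden state from the image features and one word."""
    from_features = matvec(W_FEAT, features)
    from_word = matvec(W_WORD, embed_word(word))
    return [math.tanh(a + b) for a, b in zip(from_features, from_word)]


def score_alt_text(pixels, alt_text):
    """Return a score in (0, 1) computed from the decoder hidden state."""
    words = alt_text.split()
    if not words:
        raise ValueError("alt_text must contain at least one word")
    features = encode_image(pixels)            # feature representation of the image
    hidden = decode_step(features, words[0])   # hidden state from features + first word
    logit = sum(w * h for w, h in zip(W_SCORE, hidden))
    return 1.0 / (1.0 + math.exp(-logit))      # squash to (0, 1)


if __name__ == "__main__":
    image = [[0.1, 0.5], [0.9, 0.3]]  # tiny grayscale image, values in [0, 1]
    print(score_alt_text(image, "A dog running on grass"))
```

Because the weights here are random rather than trained, the printed value carries no meaning as a quality judgment; the sketch shows only how the feature representation, the hidden state representation, and the first score may relate to one another.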