Techniques are described herein for training and evaluating machine learning (ML) models for document processing computing applications using generalized vocabulary tokens. In some embodiments, an ML system determines a set of tokens for non-textual content in a plurality of documents. The ML system generates a fixed-length vocabulary that includes the set of tokens for the non-textual content. The ML system further generates for each respective document in a training dataset of documents, a respective feature vector based at least in part on which tokens in the fixed-length vocabulary occur in the respective document. The ML system trains a ML model based at least in part on the respective feature vector for each respective document in the training dataset.
展开▼