This application is directed to performing optical character recognition (OCR) using deep learning techniques. An electronic device receives an image and a language indicator that indicates that the textual content in the image corresponds to a first language. The electronic device processes the image using a multilingual text recognition model applicable to a plurality of languages. The electronic device generates a feature sequence including a plurality of probability values corresponding to the textual content of the image. The feature sequence includes a plurality of feature subsets that correspond to the plurality of languages. For each feature subset, each probability value indicates a probability that a respective textual content corresponds to a respective character in a dictionary of the corresponding language. The electronic device constructs a sparse mask based on the first language and combines the feature sequence and the sparse mask to determine the textual content.
展开▼