
TEACHING LANGUAGE MODELS USING TEXT CORPUSES CONTAINING REALISTIC ERRORS OF OPTICAL CHARACTER RECOGNITION (OCR)


Abstract

FIELD: data processing.

SUBSTANCE: the invention relates to forming a text corpus that contains realistic optical character recognition (OCR) errors, and to training language models on such corpora. In one example implementation, the method includes: generating, by a computer system, an initial set of images from an input text corpus; applying, by the computer system, one or more simulated defects to images of the initial set to produce an augmented set of images; generating an output text corpus from the augmented set of images; and training a language model for optical character recognition using the resulting text corpus.

EFFECT: the technical result is improved image recognition quality.

20 claims, 8 drawings
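The steps named in the abstract (render text to images, apply simulated defects, recognize the degraded images, collect the output as a corpus) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the patent: the 3x5 bitmap font `FONT`, the pixel-flip noise standing in for "simulated defects", the nearest-template matcher standing in for an OCR engine, and the function names `render`, `add_defects`, `recognize`, and `build_ocr_corpus`.

```python
import random

# Hypothetical 3x5 bitmap font (illustration only); '#' marks an inked pixel.
FONT = {
    "c": ["###", "#..", "#..", "#..", "###"],
    "e": ["###", "#..", "###", "#..", "###"],
    "o": ["###", "#.#", "#.#", "#.#", "###"],
    "l": ["#..", "#..", "#..", "#..", "###"],
    "i": [".#.", ".#.", ".#.", ".#.", ".#."],
    " ": ["...", "...", "...", "...", "..."],
}


def render(text):
    """Step 1: turn a line of text into an 'image' (one boolean bitmap per glyph)."""
    return [[[ch == "#" for ch in row] for row in FONT[c]] for c in text]


def add_defects(image, flip_prob, rng):
    """Step 2: simulate a defect by flipping each pixel with probability flip_prob."""
    return [
        [[px ^ (rng.random() < flip_prob) for px in row] for row in glyph]
        for glyph in image
    ]


def recognize(image):
    """Step 3: toy OCR that picks the font glyph with the smallest Hamming distance."""
    templates = {c: render(c)[0] for c in FONT}

    def dist(a, b):
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    return "".join(
        min(templates, key=lambda c: dist(glyph, templates[c])) for glyph in image
    )


def build_ocr_corpus(lines, flip_prob=0.1, seed=0):
    """Steps 1-3 over a whole input corpus; returns the error-bearing output corpus."""
    rng = random.Random(seed)
    return [recognize(add_defects(render(line), flip_prob, rng)) for line in lines]


if __name__ == "__main__":
    clean = ["cool eel", "lie"]
    for src, ocr in zip(clean, build_ocr_corpus(clean, flip_prob=0.2, seed=1)):
        print(repr(src), "->", repr(ocr))
```

In this sketch, visually similar glyphs such as `c`, `e`, and `o` differ by only a few pixels, so the noise tends to produce the kind of confusions a real OCR engine makes. The final step described in the abstract, training a language model on the output corpus, is not shown here.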
