COLLECTING TRAINING DATA FROM TeX FILES

Abstract

A method of collecting training data of a document component may be provided. The documents have a structure and are coded in the typesetting language TeX. The method comprises receiving a TeX source file, compiling it into a PDF file and a related sync file, and analyzing the PDF file, thereby determining a non-text-only document component. The method also comprises determining first coordinates of the non-text-only document component and a corresponding page number; determining a typesetting command relating to the non-text-only document component; determining second coordinates of a bounding box and a corresponding page number from the sync file; determining text elements in the non-text-only document component of the PDF file for which the first coordinates and the second coordinates overlap; and combining the determined text elements and linking them to a type of non-text document component determined for the non-text-only document component in the TeX source file.
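The overlap step in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the claimed implementation: the `Box` type, the `overlaps` and `collect_labeled_text` helpers, the dictionary layout of the result, and the assumption that both sets of coordinates are already expressed in the same page coordinate system are all hypothetical. Parsing the PDF file and the sync file is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle on a page (hypothetical helper type)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True when both boxes are on the same page and their rectangles intersect."""
    if a.page != b.page:
        return False
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def collect_labeled_text(
    component_box: Box,                       # "first coordinates", from the PDF analysis
    sync_box: Box,                            # "second coordinates", from the sync file
    component_type: str,                      # type taken from the typesetting command
    text_elements: Iterable[Tuple[str, Box]], # text pieces extracted from the PDF
) -> Optional[dict]:
    """Combine overlapping text elements and link them to the component type."""
    # Assumption: a sample is only produced when the PDF component and the
    # sync-file bounding box actually overlap on the same page.
    if not overlaps(component_box, sync_box):
        return None
    matched = [text for text, box in text_elements if overlaps(box, sync_box)]
    return {
        "type": component_type,
        "page": component_box.page,
        "text": " ".join(matched),
    }
```

Each returned dictionary pairs the combined text with a component type, so it could serve as one labeled training sample; text elements on other pages or outside the sync-file bounding box are ignored.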

Bibliographic data

Publication number: US2020257755A1
Publication date: 2020-08-13
Original format: PDF
Applicant/assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Application number: US201916270798
Inventors: PETER WILLEM JAN STAAR; MICHELE DOLFI; CHRISTOPH AUER; ALEKSANDROS SOBCZYK; KONSTANTINOS BEKAS
Filing date: 2019-02-08
Classification: G06F17/21; G06F17/27; G06F17/22; G06N20
Country: US
Database entry time: 2022-08-21 11:26:16