Conference: Asia-Pacific Web Conference

Table Detection from Plain Text Using Machine Learning and Document Structure

Abstract

This paper addresses table extraction from plain text. Tables are one of the most common ways of presenting information, and table extraction has applications in information retrieval, knowledge acquisition, and text mining. Automatically extracting information from tables remains a challenge. Existing methods have focused mainly on table extraction from web pages (formatted table extraction); to the best of our knowledge, table extraction from plain text has not received sufficient attention. In this paper, unformatted table extraction is formalized as two subtasks: unformatted table block detection and unformatted table row identification. We concentrate particularly on table extraction from Chinese documents. We propose to perform table extraction by combining machine learning methods with document structure. We first view the task as classification and propose a statistical approach to it based on Naive Bayes, defining features for the classification model. Next, we use document structure to improve detection performance. Experimental results indicate that the proposed methods significantly outperform the baseline methods for unformatted table extraction.
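
As a rough illustration of the classification step described in the abstract, the following Python sketch labels each plain-text line as a table row or ordinary text with a small Bernoulli Naive Bayes model. The features (`multi_cell`, `has_wide_gap`, `digit_heavy`, `ends_with_punct`), the helper names, and the toy training lines are assumptions made for illustration only; the abstract does not list the paper's actual features, and the document-structure step is not modelled here.

```python
import math
import re

# Assumption: two or more whitespace characters, or a tab, mark a column gap.
GAP = re.compile(r"\s{2,}|\t")


def line_features(line):
    """Return hypothetical binary layout features for one plain-text line."""
    stripped = line.strip()
    cells = [c for c in GAP.split(stripped) if c]
    chars = [ch for ch in stripped if not ch.isspace()]
    digit_ratio = sum(ch.isdigit() for ch in chars) / len(chars) if chars else 0.0
    return {
        "multi_cell": len(cells) >= 3,
        "has_wide_gap": bool(GAP.search(stripped)),
        "digit_heavy": digit_ratio >= 0.3,
        "ends_with_punct": stripped.endswith((".", "。", "!", "?", ";")),
    }


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.names = sorted(rows[0])
        self.log_prior, self.p_on = {}, {}
        for c in sorted(set(labels)):
            members = [r for r, y in zip(rows, labels) if y == c]
            self.log_prior[c] = math.log(len(members) / len(rows))
            # Smoothed estimate of P(feature is on | class c).
            self.p_on[c] = {
                f: (sum(r[f] for r in members) + 1) / (len(members) + 2)
                for f in self.names
            }
        return self

    def log_scores(self, row):
        """Unnormalized log posterior for each class."""
        scores = {}
        for c, prior in self.log_prior.items():
            s = prior
            for f in self.names:
                p = self.p_on[c][f]
                s += math.log(p if row[f] else 1.0 - p)
            scores[c] = s
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
TABLE_LINES = [
    "Name      Age    City",
    "Li Wei    34     Beijing",
    "Zhang     28     Shanghai",
    "Item  Qty  Price",
    "Pen   10   2.50",
]
TEXT_LINES = [
    "Tables are a common way to present information.",
    "We propose a method based on Naive Bayes.",
    "The results are shown below.",
    "This section describes the data.",
]

rows = [line_features(s) for s in TABLE_LINES + TEXT_LINES]
labels = ["table"] * len(TABLE_LINES) + ["text"] * len(TEXT_LINES)
model = BernoulliNaiveBayes().fit(rows, labels)
```

Calling `model.predict(line_features(some_line))` returns either `"table"` or `"text"` for a new line. Consecutive lines labelled `"table"` could then be grouped into candidate table blocks, which is one plausible reading of the block-detection subtask rather than the paper's stated procedure.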
