Table Detection from Plain Text Using Machine Learning and Document Structure

Abstract

This paper addresses the problem of table extraction from plain text. Tables are one of the most common ways of presenting information, and table extraction has applications in information retrieval, knowledge acquisition, and text mining. Automatic information extraction from tables remains a challenge. Existing methods have focused mainly on table extraction from web pages (formatted table extraction). To the best of our knowledge, table extraction from plain text has not yet received sufficient attention. In this paper, unformatted table extraction is formalized as two subtasks: unformatted table block detection and unformatted table row identification. We concentrate in particular on table extraction from Chinese documents. We propose to perform table extraction by combining machine learning methods with document structure. We first view the task as classification and propose a statistical approach based on Naive Bayes, defining features for the classification model. Next, we use document structure to improve detection performance. Experimental results indicate that the proposed methods significantly outperform the baseline methods for unformatted table extraction.
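The classification view described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not list the paper's features, so the four layout features in `line_features`, the toy training lines, and the `table_blocks` grouping helper are hypothetical stand-ins, and the hand-written Bernoulli Naive Bayes with add-one smoothing is just one common variant of Naive Bayes. The document-structure step is not modeled.

```python
import math
import re
from collections import defaultdict


def line_features(line):
    """Hypothetical binary layout features for one plain-text line."""
    tokens = line.split()
    digit_tokens = sum(1 for t in tokens if any(c.isdigit() for c in t))
    return {
        # A run of 2+ spaces or a tab between two non-space characters.
        "has_wide_gap": bool(re.search(r"\S( {2,}|\t)\S", line)),
        # At least three cells when splitting on wide gaps or tabs.
        "many_columns": len(re.split(r" {2,}|\t", line.strip())) >= 3,
        # At least half of the tokens contain a digit.
        "mostly_numeric": bool(tokens) and digit_tokens * 2 >= len(tokens),
        # Prose lines usually end with sentence punctuation.
        "ends_with_punct": line.rstrip().endswith((".", "。", "!", "?", ";", "；")),
    }


class BernoulliNB:
    """Tiny Bernoulli Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.features = sorted(rows[0])
        self.total = len(labels)
        self.class_count = {c: labels.count(c) for c in self.classes}
        self.true_count = {c: defaultdict(int) for c in self.classes}
        for feats, label in zip(rows, labels):
            for name, value in feats.items():
                if value:
                    self.true_count[label][name] += 1
        return self

    def log_posterior(self, feats, c):
        # log P(c) + sum over features of log P(feature value | c)
        score = math.log(self.class_count[c] / self.total)
        for name in self.features:
            p_true = (self.true_count[c][name] + 1) / (self.class_count[c] + 2)
            score += math.log(p_true if feats[name] else 1 - p_true)
        return score

    def predict(self, feats):
        return max(self.classes, key=lambda c: self.log_posterior(feats, c))


# Toy training data, invented for this sketch (True = table row).
TRAIN = [
    ("Name      Qty    Price", True),
    ("Apple     3      1.20", True),
    ("Pear      10     0.80", True),
    ("Year\t2005\t2006", True),
    ("This paper addresses table extraction from plain text.", False),
    ("We first view the task as classification.", False),
    ("Results are reported in the next section.", False),
    ("Tables are a common way to present information.", False),
]


def train(examples):
    rows = [line_features(text) for text, _ in examples]
    labels = [label for _, label in examples]
    return BernoulliNB().fit(rows, labels)


def detect_rows(model, text):
    """Row identification: one True/False flag per line of `text`."""
    return [model.predict(line_features(line)) for line in text.splitlines()]


def table_blocks(flags):
    """Block detection stand-in: group consecutive row flags into
    (start, end) line-index spans, with `end` inclusive."""
    blocks, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(flags) - 1))
    return blocks
```

Calling `train(TRAIN)` and then `detect_rows(model, text)` yields one flag per line, and `table_blocks` turns those flags into candidate table spans. A realistic system would need far more labeled lines and features suited to Chinese text, where whitespace tokenization alone is not sufficient.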

Bibliographic details

Source: Asia-Pacific Web Conference, 2006, 6 pages
Authors: Juanzi Li; Jie Tang; Qiang Song; Peng Xu
Original format: PDF
Chinese Library Classification: Computer networks
