A clustering approach to extract data from HTML tables

Patricia Jimenez; Juan C. Roldan; Rafael Corchuelo


Abstract

HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because the many different layouts, encodings, and formats available make it far from trivial to find the relationships between their cells. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3,000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, while being 99.14% better regarding efficiency.

Bibliographic record

Source: Information Processing & Management, 2021, Issue 6, pp. 102683.1-102683.13 (13 pages)
Authors: Patricia Jimenez; Juan C. Roldan; Rafael Corchuelo
Author affiliation (listed once per author): University of Seville, ETSI Informatica, Avda. Reina Mercedes s/n., Sevilla E-41012, Spain
Original format: PDF
Language: English
Keywords: HTML tables; Data extraction; Clustering; Genetic algorithms
Date added to database: 2022-08-19 03:06:55

