Journal: Information Processing & Management

A clustering approach to extract data from HTML tables


Abstract

HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because the many different layouts, encodings, and formats make it non-trivial to find the relationships between their cells. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3,000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency.
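To make the general idea of clustering cells into labels and values concrete, here is a minimal sketch. It is not Melva's algorithm, which the abstract does not detail. The features (a `th` markup flag and the share of digit characters), the 2-means procedure, the rule for naming the label cluster, and all function names are assumptions chosen purely for illustration, and the table contents are made up.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect (row, col, tag, text) for every <td>/<th> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.row = -1
        self.col = -1
        self.tag = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = -1
        elif tag in ("td", "th"):
            self.col += 1
            self.tag = tag
            self.text = []

    def handle_data(self, data):
        if self.tag is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.tag is not None:
            text = "".join(self.text).strip()
            self.cells.append((self.row, self.col, self.tag, text))
            self.tag = None


def features(cell):
    """Hypothetical features: header-markup flag and share of digit characters."""
    _, _, tag, text = cell
    digits = sum(ch.isdigit() for ch in text)
    return (1.0 if tag == "th" else 0.0, digits / max(len(text), 1))


def dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=20):
    """Plain 2-means, seeded with the first point and the point farthest from it."""
    centroids = [points[0], max(points, key=lambda p: dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min((0, 1), key=lambda k: dist(p, centroids[k])) for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


def split_labels_and_values(html):
    """Return (label_cells, value_cells) as lists of cell texts."""
    parser = CellCollector()
    parser.feed(html)
    if not parser.cells:
        return [], []
    labels, centroids = two_means([features(c) for c in parser.cells])
    # Assumption: the cluster with the lower mean digit share holds the labels.
    label_cluster = min((0, 1), key=lambda k: centroids[k][1])
    label_cells = [c[3] for c, lab in zip(parser.cells, labels) if lab == label_cluster]
    value_cells = [c[3] for c, lab in zip(parser.cells, labels) if lab != label_cluster]
    return label_cells, value_cells


# Toy table with made-up contents.
EXAMPLE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>120</td></tr>
  <tr><td>Pear</td><td>95</td></tr>
</table>
"""

label_cells, value_cells = split_labels_and_values(EXAMPLE)
print("labels:", label_cells)
print("values:", value_cells)
```

On this toy table the text-only cells fall into one cluster and the numeric cells into the other. Real tables would need richer features (style, position, spans) and a further step to link each value to its labels, which this sketch does not attempt.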
