Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction

Abstract

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold on real websites. Finally, we have tested our techniques on a large number of real web sources and have found them to be very effective.
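
To make the general idea concrete, the following is a minimal illustrative sketch and not the algorithm from the paper, whose details are not given in this abstract. It assumes that each candidate page fragment has already been reduced to a sequence of HTML tag names, compares fragments with a normalized Levenshtein edit distance, and groups them with a simple greedy threshold clustering. The function names, the 0.7 threshold, and the sample fragments are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance normalized into a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster(sequences: List[Sequence[str]], threshold: float = 0.7) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster.
    The threshold is an arbitrary illustrative value."""
    clusters: List[List[int]] = []
    for idx, seq in enumerate(sequences):
        for members in clusters:
            if similarity(seq, sequences[members[0]]) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical tag sequences for four fragments of a single page.
fragments = [
    ["tr", "td", "a", "td", "span"],       # a data record
    ["tr", "td", "a", "td", "span", "b"],  # a record with an optional field
    ["tr", "td", "a", "td"],               # a record missing a field
    ["div", "ul", "li", "li", "li"],       # a navigation menu
]
print(cluster(fragments))
```

In this toy input the three record-like fragments fall into one cluster and the navigation menu into another, which is the kind of template regularity the abstract refers to. A real extractor would also have to locate candidate fragments in the page and align fields within each record, which this sketch does not attempt.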

Bibliographic record

Source
International Conference on Web Information Systems Engineering (WISE 2007); December 3-7, 2007; Nancy (FR) | 2007 | pp. 212-224 | 2 pages
Conference location: Nancy (FR)
Authors
Manuel Alvarez; Alberto Pan; Juan Raposo; Fernando Bellas; Fidel Cacheda
Author affiliation

Department of Information and Communications Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain

Original format: PDF
Language: English
Chinese Library Classification: Computer networks