Employing Clustering Techniques for Automatic Information Extraction From HTML Documents

Ashraf F.; Özyer T.; Alhajj R.


Abstract

In the past few years, the amount of information available on the World Wide Web has increased exponentially. This wealth of information can be extremely beneficial for users, but extracting it currently requires an inconvenient amount of human intervention. Information extraction (IE) systems try to solve this problem by making the task as automatic as possible. Most existing approaches, however, require user feedback in one form or another during extraction. This paper proposes a system that employs clustering techniques for automatic IE from HTML documents containing semistructured data. Using domain-specific information provided by the user, the proposed system parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine the clusters, and finally, the output is reported. The process employs a multiobjective genetic-algorithm-based clustering approach that is capable of finding the number of clusters and the most natural clustering. The proposed approach is tested by conducting experiments on a number of Web sites from different domains. To demonstrate its effectiveness, the experimental results are compared with those reported in the literature and prove comparable.
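The pipeline in the abstract (tokenize, group similar elements, estimate a rule from the pattern of token occurrence) can be illustrated with a minimal Python sketch. This is not the authors' system: the multiobjective genetic-algorithm clustering is replaced here by simple grouping on identical tag paths, and the names `PathTokenizer`, `group_by_path`, `repeating_pattern`, and the sample markup are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so they do not pollute the path.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathTokenizer(HTMLParser):
    """Collect (tag_path, text) tokens from an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.tokens = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def group_by_path(tokens):
    """Stand-in for clustering: tokens sharing a tag path form one group."""
    groups = defaultdict(list)
    for path, text in tokens:
        groups[path].append(text)
    return dict(groups)


def repeating_pattern(paths):
    """Return the shortest prefix whose repetition reproduces `paths`."""
    n = len(paths)
    for size in range(1, n + 1):
        if n % size == 0 and paths[:size] * (n // size) == paths:
            return paths[:size]
    return paths


# Hypothetical semistructured page: a list of (name, price) records.
SAMPLE = """
<ul>
  <li><b>Widget</b><i>$4.00</i></li>
  <li><b>Gadget</b><i>$7.50</i></li>
</ul>
"""

tokenizer = PathTokenizer()
tokenizer.feed(SAMPLE)
clusters = group_by_path(tokenizer.tokens)
pattern = repeating_pattern([path for path, _ in tokenizer.tokens])
```

In this toy case the repeating sequence of tag paths plays the role of an extraction rule: each repetition corresponds to one record. Real pages are noisier, which is presumably why the paper uses a clustering method that determines the number of clusters itself rather than exact path matching.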


Bibliographic details

Source
IEEE transactions on systems, man and cybernetics. Part C, Applications and reviews | 2008, Issue 5 | pp. 660-673 | 14 pages
Authors
Ashraf F.; Özyer T.; Alhajj R.
Original format: PDF
Language: English
Chinese Library Classification: Radio electronics; telecommunications technology
Keywords
Clustering; Hypertext Markup Language (HTML) documents; Web pages; information extraction (IE)


Similar literature

1. Automatic Machine Learning of Keyphrase Extraction from Short HTML Documents Written in Hebrew [J]. Yaakov HaCohen-Kerner, Ittay Stern, David Korkus. Cybernetics and Systems. 2007, Issue 1
2. Re-structuring Html Documents Structure Automatically through Clustering [J]. Sarwar Hadi, Dr S Qamar Abbas, Sheenu Rizvi. Journal of Theoretical and Applied Information Technology. 2009, Issue 3
3. Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques [J]. Peng Li, Bin Wang, Wei Jin. Journal of Computer Science and Technology (English edition). 2012, Issue 3
4. Automatic extraction of text regions from document images by multilevel thresholding and k-means clustering [C]. Hoai Nam Vu, Tuan Anh Tran, In Seop Na. IEEE/ACIS International Conference on Computer and Information Science. 2014
5. ClusTex: Using clustering techniques for information extraction from HTML pages containing semi-structured data [D]. Ashraf, Fatima. 2006
6. Using XML Metadata to Enable the Automatic Generation and Processing of HTML Forms from XML Documents [O]. Anil K. Dubey, Henry C. Chueh. 2001
7. A Case-Based Semi-automatic Transformation from HTML Documents to XML Ones: Using the Similarity between HTML Documents Constituting a Series [O]. Masayuki Umehara, Koji Iwanuma, Hirokazu Nagai. 2001
