Journal: IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews

Employing Clustering Techniques for Automatic Information Extraction From HTML Documents



Abstract

In the past few years, the amount of information available on the World Wide Web has increased exponentially. This wealth of information can be extremely beneficial for users, but making use of it currently requires an inconvenient amount of human intervention. Information extraction (IE) systems try to solve this problem by making the task as automatic as possible. Most existing approaches, however, require user feedback in one form or another during extraction. This paper proposes a system that employs clustering techniques for automatic IE from HTML documents containing semistructured data. Using domain-specific information provided by the user, the proposed system parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine the clusters, and finally the output is reported. The process employs a multiobjective genetic-algorithm-based clustering approach that is capable of finding the number of clusters and the most natural clustering. The proposed approach is tested by conducting experiments on a number of Web sites from different domains. To demonstrate its effectiveness, the experimental results are compared against those reported in the literature and prove comparable.
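
The parse, tokenize, cluster, and rule-estimation steps described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: grouping tokens by their tag path stands in for the multiobjective genetic-algorithm-based clustering, the "rule" is simply the shortest repeating sequence of tag paths, and every name here (`TextTokenizer`, `tokenize`, `cluster_by_path`, `estimate_rule`, `SAMPLE`) is hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextTokenizer(HTMLParser):
    """Collects a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def tokenize(html):
    """Parse an HTML string into (tag_path, text) tokens."""
    parser = TextTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def cluster_by_path(tokens):
    """Simplified stand-in for clustering: group texts sharing a tag path."""
    clusters = defaultdict(list)
    for path, text in tokens:
        clusters[path].append(text)
    return dict(clusters)


def estimate_rule(tokens):
    """Return the shortest tag-path sequence whose repetition yields the page."""
    paths = [path for path, _ in tokens]
    n = len(paths)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and paths == paths[:size] * (n // size):
            return paths[:size]
    return paths  # no repetition found


SAMPLE = """
<ul>
  <li><b>Widget</b><i>$5</i></li>
  <li><b>Gadget</b><i>$7</i></li>
</ul>
"""

if __name__ == "__main__":
    tokens = tokenize(SAMPLE)
    print(cluster_by_path(tokens))
    print(estimate_rule(tokens))
```

On the sample list, the names and the prices land in separate groups, and the estimated rule is the name path followed by the price path. A real system along the lines of the abstract would replace the exact-path grouping with a similarity-based clustering and use the estimated rule to refine the clusters.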
