Journal: IEEE Transactions on Knowledge and Data Engineering

TEXT: Automatic Template Extraction from Heterogeneous Web Pages


Abstract

The World Wide Web is the most useful source of information. To achieve high publishing productivity, many websites populate their pages automatically by filling common templates with content. Templates give readers easy access to the content through consistent structures. For machines, however, templates are considered harmful: the irrelevant terms they contain degrade the accuracy and performance of web applications. Template detection techniques have therefore received a lot of attention recently as a way to improve the performance of search engines and of the clustering and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents that are generated from heterogeneous templates. We cluster the web documents based on the similarity of the underlying template structures in the documents, so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure, together with a fast approximation of it, for clustering, and we provide a comprehensive analysis of our algorithm. Our experimental results on real-life data sets confirm the effectiveness and robustness of our algorithm compared with state-of-the-art template detection algorithms.
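The idea of grouping pages by template structure and reading off one template per cluster can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify the goodness measure or its approximation, so the sketch substitutes plain Jaccard similarity over sets of tag paths and a greedy single-pass clustering. The names `tag_paths`, `jaccard`, `cluster_pages`, and the `threshold` value are all hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint of a page as a frozenset of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the paper's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering: a page joins the first cluster whose current
    template (paths shared by all members so far) is similar enough."""
    clusters = []
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["members"].append(index)
                cluster["template"] &= paths  # keep only paths common to all members
                break
        else:
            clusters.append({"members": [index], "template": set(paths)})
    return clusters


if __name__ == "__main__":
    pages = [
        "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
        "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
        "<html><body><table><tr><td>1</td></tr></table></body></html>",
    ]
    for cluster in cluster_pages(pages):
        print(cluster["members"], sorted(cluster["template"]))
```

In this sketch the "template" of a cluster is only the set of tag paths shared by every member; it ignores the template terms that the abstract identifies as the source of noise, and the greedy pass depends on page order. A real implementation would need the paper's goodness measure to decide when merging clusters is worthwhile.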
