Conference: 8th IEEE International Conference on e-Business Engineering

Automatic Web Content Extraction for Generating Tag Clouds from Thai Web Sites

Abstract

This paper proposes a novel Web content extraction approach based on heuristic rules and the XPath utility in XML. The main objective is to address the problem of Web visualization by generating tag clouds from Thai Web sites, giving an overview of the key words in their Web pages. The paper also proposes a detailed method for assessing the Web content extraction technique on a single Web page using the length of the extracted content. The proposed technique has three main steps: Web page element and feature extraction, block detection, and content extraction selection. Empirical results show that the technique achieves high accuracy.
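The abstract does not give the paper's actual rules, so the following is only a minimal sketch of how the three steps could fit together, not the authors' method. Everything in it is an assumption made for illustration: the sample page, the link-density heuristic and its 0.5 threshold, the choice of `div` elements as blocks, and the helper names `block_features`, `extract_content`, `tag_cloud`, and `length_ratio`. The length-based score is one plausible reading of "using the length of the extracted content".

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical well-formed sample page (an assumption, not data from the paper).
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div id="main"><p>Tag clouds give an overview of key words.</p>
<p>Key words with higher counts are drawn larger in tag clouds.</p></div>
<div id="footer"><a href="/contact">Contact</a> Copyright</div>
</body></html>"""


def text_of(el):
    """Return all text under an element with whitespace normalized."""
    return " ".join("".join(el.itertext()).split())


def block_features(block):
    """Step 1: extract features for one block (text length, link density)."""
    text = text_of(block)
    link_chars = sum(len(text_of(a)) for a in block.findall(".//a"))
    return {
        "id": block.get("id"),
        "text": text,
        "length": len(text),
        "link_density": link_chars / len(text) if text else 1.0,
    }


def extract_content(page, max_link_density=0.5):
    """Steps 2 and 3: detect div blocks via XPath, then select one block.

    Assumed heuristic: drop link-heavy blocks, keep the longest remaining one.
    """
    root = ET.fromstring(page)
    blocks = [block_features(b) for b in root.findall(".//div")]
    candidates = [b for b in blocks if b["link_density"] <= max_link_density]
    return max(candidates, key=lambda b: b["length"]) if candidates else None


def tag_cloud(text, top_n=5):
    """Count words longer than three characters; counts would set tag size."""
    words = [w.strip(".,").lower() for w in text.split()]
    return Counter(w for w in words if len(w) > 3).most_common(top_n)


def length_ratio(extracted, reference):
    """Assumed score: extracted length relative to a hand-labelled reference."""
    return len(extracted) / len(reference) if reference else 0.0


best = extract_content(PAGE)
print(best["id"], tag_cloud(best["text"]))
```

Two caveats apply. Thai text is written without spaces between words, so the whitespace split in `tag_cloud` is a placeholder; a real pipeline would need a Thai word segmenter at that point. Real Web pages are also rarely well-formed XML, so an HTML-tolerant parser would have to replace `ET.fromstring`.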
