
Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites


Abstract

This paper proposes a web content extraction technique that uses heuristic rules and works with both single and multiple pages. For multiple page extraction, an Extracted Content Matching (ECM) technique is proposed to identify noise among the extracted results. The technique also introduces several features to reduce processing time, such as the use of XPath, file compression, and parallel processing. Performance is assessed by precision, recall, and F-measure, computed from the length of the extracted content. Initial results, obtained by comparing the output of the proposed approach with manually extracted content, are good.
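The length-based evaluation mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the lengths are measured, so everything below is an assumption: lengths are counted in characters, the overlap between extracted and manually extracted (reference) text is taken from Python's `difflib.SequenceMatcher`, and the function name `length_based_scores` is hypothetical rather than taken from the paper.

```python
from difflib import SequenceMatcher


def length_based_scores(extracted: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) computed from character lengths.

    Assumption: the overlap is the total length of the matching blocks that
    SequenceMatcher finds between the two strings. The paper may measure
    overlap differently.
    """
    # autojunk=False disables the popularity heuristic that would otherwise
    # ignore frequent characters in long strings.
    matcher = SequenceMatcher(None, extracted, reference, autojunk=False)
    overlap = sum(block.size for block in matcher.get_matching_blocks())

    # Precision: share of the extracted text that is real content.
    precision = overlap / len(extracted) if extracted else 0.0
    # Recall: share of the reference content that was extracted.
    recall = overlap / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # "Menu " stands in for navigation noise left in the extracted text.
    p, r, f = length_based_scores("Menu Hello world", "Hello world!")
    print(f"precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```

Under these assumptions, leftover noise in the extracted text lowers precision, while content missing from it lowers recall.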

Bibliographic record

  • Year: 2012
  • Original format: PDF
  • Language: English
