Conference: 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
qRead: A fast and accurate article extraction method from web pages using partition features optimizations

Abstract

We present qRead, a new method for real-time content extraction from web pages with high accuracy. Earlier approaches to content extraction include empirical filtering rules, Document Object Model (DOM) trees, and machine learning models. Although these methods have had some success, they may not meet the requirements of real-time extraction with high accuracy. For example, constructing a DOM tree for a complex web page is time-consuming, and machine learning models can add unnecessary complexity. Unlike previous approaches, qRead uses segment densities and similarities to identify the main content. Specifically, qRead first filters out obvious junk content, removes HTML tags, and partitions the remaining text into natural segments. It then identifies the main content by combining two signals: the ratio of the number of words to the number of lines in a segment, where the highest ratio is preferred, and the similarity between the segment and the title. Extensive experiments show that qRead achieves 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, substantially improving accuracy over previous approaches while meeting the real-time extraction requirement.
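The pipeline described in the abstract (filter junk, remove tags, partition into segments, then score by density and title similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the segment boundary rule (blank lines), the similarity measure (fraction of title words found in the segment), the way the two signals are combined (`density * (1 + similarity)`), and all function names here are assumptions made for illustration. The regex tokenizer is also a simplification that would not segment Chinese text into words.

```python
import re
from html import unescape

# Patterns for the "obvious junk" step (assumed: scripts, styles, comments).
JUNK_BLOCKS = re.compile(r"<(script|style|noscript)\b.*?</\1\s*>",
                         re.IGNORECASE | re.DOTALL)
COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
TITLE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def tokenize(text):
    """Lowercased word tokens (a simplification; not suitable for Chinese)."""
    return re.findall(r"\w+", text.lower())


def partition(text):
    """Assumed rule: a segment is a maximal run of non-empty lines."""
    segments, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def density(segment):
    """Ratio of the number of words to the number of lines in a segment."""
    words = sum(len(tokenize(line)) for line in segment)
    return words / len(segment)


def title_similarity(segment, title):
    """Assumed measure: fraction of title words that occur in the segment."""
    title_words = set(tokenize(title))
    if not title_words:
        return 0.0
    segment_words = set(tokenize(" ".join(segment)))
    return len(title_words & segment_words) / len(title_words)


def extract_main_text(html):
    """Return the text of the highest-scoring segment, or "" if none exist."""
    match = TITLE.search(html)
    title = unescape(ANY_TAG.sub("", match.group(1))).strip() if match else ""

    text = JUNK_BLOCKS.sub("", html)        # filter obvious junk
    text = COMMENTS.sub("", text)
    text = TITLE.sub("", text)              # the title itself is not a candidate
    text = unescape(ANY_TAG.sub("", text))  # remove the remaining HTML tags

    segments = partition(text)
    if not segments:
        return ""
    # Assumed combination of the two signals named in the abstract.
    best = max(segments,
               key=lambda s: density(s) * (1.0 + title_similarity(s, title)))
    return "\n".join(best)


if __name__ == "__main__":
    sample = (
        "<html>\n<head>\n<title>Fast article extraction</title>\n</head>\n"
        "<body>\n<div>Home</div>\n<div>About</div>\n\n"
        "<p>Fast extraction of an article needs dense text on every line.</p>\n"
        "<p>The article continues here with many more words per line.</p>\n\n"
        "<div>Copyright</div>\n</body>\n</html>"
    )
    print(extract_main_text(sample))
```

Because the sketch only runs regular expressions over the raw text and never builds a DOM tree, it reflects the design motivation stated in the abstract; the accuracy and timing figures reported there apply to the authors' method, not to this illustration.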