Published in: 18th ACM Conference on Information and Knowledge Management (CIKM), 2009

A Fast and Simple Method for Extracting Relevant Content from News Webpages


Abstract

We propose NCE, an efficient algorithm to identify and extract relevant content from news webpages. We define as relevant the textual sections that more objectively describe the main event in the article. This includes the title and the main body section, and excludes comments about the story and presentation elements. Our experiments suggest that NCE is competitive, in terms of extraction quality, with the best methods available in the literature. It achieves F1 = 90.7% on our test corpus of 324 news webpages from 22 sites. The main advantages of our method are its simplicity and its computational performance. It is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
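
The abstract reports extraction quality as an F1 score but does not say how the score is computed. As a rough illustration only, the sketch below scores an extraction against a hand-labelled reference using bag-of-token overlap, which is one common convention for content extraction. The function names `tokenize` and `extraction_f1`, the whitespace tokenizer, and the sample strings are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def extraction_f1(extracted: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over bags of tokens.

    This is a hypothetical scoring scheme for illustration; the paper's own
    evaluation protocol may differ.
    """
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(reference))
    # Multiset intersection: each token counts at most as often as it
    # appears in both the extraction and the reference.
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: the extractor kept the story text but also two
# tokens of page chrome, and missed one word of the reference.
precision, recall, f1 = extraction_f1(
    "storm hits coast share this",
    "storm hits coast overnight",
)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this scheme, leftover navigation or comment text lowers precision, while truncated article text lowers recall; F1 is the harmonic mean of the two.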
