Published in: ACM Symposium on Document Engineering

An Efficient Language-Independent Method to Extract Content from News Webpages

Abstract

We tackle the task of news webpage segmentation, specifically identifying the news title, publication date, and story body. Although the literature reports very good results, most approaches rely on webpage rendering, which is a very time-consuming step. We focus on high-volume scenarios, where processing speed is essential. Our approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a method faster than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to aspects that are often overlooked in the literature, such as processing time and how well the extraction results generalize to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full-rendering alternative while retaining good extraction quality.
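
To make the idea of "structural properties with hints of visual presentation styles" concrete, here is a minimal sketch that collects per-text-block features from raw HTML without rendering the page. It is not the authors' method: the feature set, the class name `BlockFeatureExtractor`, the helper `extract_blocks`, and the use of inline `font-size` declarations as a style hint are all assumptions chosen for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
import re

# Tags whose text is never visible content.
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
# Hypothetical style hint: an inline font size given in pixels.
FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px")


class BlockFeatureExtractor(HTMLParser):
    """Collects one feature dict per non-empty text node, without rendering."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as (tag, inline font size or None)
        self.blocks = []  # one feature dict per visible text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = FONT_SIZE.search(style)
        self.stack.append((tag, float(match.group(1)) if match else None))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        tags = [t for t, _ in self.stack]
        if not text or SKIP_TAGS.intersection(tags):
            return
        sizes = [s for _, s in self.stack if s is not None]
        self.blocks.append({
            "text": text,
            "depth": len(tags),                      # structural: DOM depth
            "tag": tags[-1] if tags else "",         # structural: enclosing tag
            "in_link": "a" in tags,                  # structural: inside a link
            "in_heading": any(re.fullmatch(r"h[1-6]", t) for t in tags),
            "font_px": sizes[-1] if sizes else None, # style hint, if declared
            "n_words": len(text.split()),
        })


def extract_blocks(html):
    """Return the feature dicts for every visible text node in `html`."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

In a pipeline of the kind the abstract describes, feature vectors like these would be passed to a trained classifier that labels each block as title, date, body, or other. The classifier and the training data are outside the scope of this sketch.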
