An Efficient Language-Independent Method to Extract Content from News Webpages

Abstract

We tackle the task of news webpage segmentation, specifically identifying the news title, publication date and story body. While there are very good results in the literature, most of them rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where performance is a must. The chosen approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a quicker method than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to some aspects that are often overlooked in the literature, such as processing time and the generalization of the extraction results to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full rendering alternative while retaining good extraction quality.

Bibliographic details

Source: ACM symposium on document engineering | 2011 | 7 pages
Authors: Eduardo Cardoso; Iam Jabour; Eduardo Laber; Rogerio Rodrigues; Pedro Cardoso
Original format: PDF
Chinese Library Classification: Computer software
Keywords: news segmentation; webpage rendering

Similar literature

1. A Layout Based Detachment Approach for Extracting Content from Webpages [J]. Anna Saro Vijendran, Deepa Chandran. American journal of applied sciences. 2015, no. 6
2. A Layout Based Detachment Approach for Extracting Content from Webpages [J]. Deepa Chandran, Anna Saro Vijendran. American journal of applied sciences. 2015, no. 6
3. An efficient method to reduce grain angle influence on NIR spectra for predicting extractives content from heartwood stem cores of Toona sinensis [J]. Yanjie Li, Xin Dong, Yang Sun. Plant methods. 2020, no. 1
4. An Efficient Language-Independent Method to Extract Content from News Webpages [C]. Eduardo Cardoso, Iam Jabour, Eduardo Laber. Proceedings of the 2011 ACM symposium on document engineering. 2011
5. Computational methods applied to mass communication research: The case of press release content in news media [D]. Golitsynskiy, Sergey. 2013
6. An efficient method to reduce grain angle influence on NIR spectra for predicting extractives content from heartwood stem cores of Toona sinensis [O]. Yanjie Li, Xin Dong, Yang Sun. 2020
7. Learning to Extract Content from News Webpages [O]. Alex Spengler, Patrick Gallinari. 2015
