Journal of Intelligent Information Systems

Multilingual news extraction via stopword language model scoring

Abstract

Web news provides a quick and convenient means of building large document collections. Creating a web news corpus has typically required constructing a set of HTML parsing rules to identify content text. In general, these parsing rules are written manually and differ from one web page to another. We address this issue and propose a news content recognition algorithm that is independent of language and layout. Our method first scans a given HTML document and roughly localizes a set of candidate news areas. Next, we apply a purpose-built scoring function to rank the candidates and select the best content. To validate this approach, we evaluate the system's performance on 1092 multilingual web news items covering 17 global regions and 11 distinct languages. We compare our method with nine published content extraction systems on these data using standard settings. The results of this empirical study show that our method outperforms the second-best approach (Boilerpipe) by 6.04% and 10.79% in relative micro and macro F-measure, respectively. We also apply our system to monitor online RSS news distribution: it collected 0.4 million news articles from 200 RSS channels in 20 days. A sample quality test shows that our method achieved 93% extraction accuracy on large news streams.
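
The two-stage idea in the abstract (localize candidate areas, then score and rank them) can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so the code below assumes a plain stopword count as a stand-in for the stopword language model; it is not the authors' method. The names `STOPWORDS`, `BLOCK_TAGS`, `CandidateCollector`, `stopword_score`, and `extract_best_block` are hypothetical, and the stopword list is a tiny English-only sample, whereas the paper targets 11 languages.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny English stopword list (assumption).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that",
             "for", "on", "with", "as", "by", "it", "was"}

# Assumption: these tags delimit candidate text areas.
BLOCK_TAGS = {"div", "p", "td", "article", "section", "li"}
SKIP_TAGS = {"script", "style"}


class CandidateCollector(HTMLParser):
    """Splits an HTML document into text blocks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # candidate text areas found so far
        self._buf = []     # text fragments of the current block
        self._skip = 0     # nesting depth inside script/style

    def _flush(self):
        # Normalize whitespace and store the block if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def stopword_score(text):
    """Counts stopword tokens; running prose tends to score high,
    while menus and copyright lines tend to score near zero."""
    tokens = text.lower().split()
    return sum(1 for t in tokens if t.strip(".,;:!?\"'()") in STOPWORDS)


def extract_best_block(html):
    """Returns the candidate block with the highest stopword score."""
    parser = CandidateCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=stopword_score)
```

Calling `extract_best_block(page_html)` on a page with a link-only navigation bar and a paragraph of running text would be expected to return the paragraph, since link labels rarely contain stopwords. A raw count favors long blocks; a real system would likely normalize or combine it with other evidence.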