Predicting categories of news articles using meta-data from the Web

Abstract

Text mining, a field of machine learning that deals with the discovery of knowledge from text, is evolving rapidly. The Artificial Intelligence Laboratory of the Jožef Stefan Institute has recognized this and is developing a system called Event Registry, which collects news articles from the Web in real time, detects events in them, and extracts relevant information. The component of the system that classifies articles into categories has not yet been fully developed. In response, in our diploma thesis we set out to improve a reference model. The results were positive: we improved the predictive accuracy of classifying arbitrary news articles into one of the categories of our predefined taxonomy. During the learning phase, we examined the impact of various forms of meta-data on the predictive accuracy of the model, focusing mainly on meta-data obtained from the Never-Ending Language Learner developed at Carnegie Mellon University. We found that this meta-data has a positive effect on the performance of the model when it is used in combination with other meta-data. For learning, we used several algorithms: logistic regression, support vector machines, random forests, and k-nearest neighbors. The first two turned out to be the most appropriate for building the optimal predictive model. We also tested several approaches to active learning, which can simplify and speed up the manual labeling of new articles. All of them produced a positive result, and the approach that combines uncertainty of prediction with correlation between learning instances proved to be the best.
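The abstract does not give the formula behind the best-performing active learning approach. As a rough illustration only, the sketch below shows one common way to combine prediction uncertainty with similarity between instances: score each unlabeled article by the entropy of its predicted class distribution multiplied by its mean cosine similarity to the rest of the pool. The function names, the `beta` exponent, and the choice of entropy and cosine similarity are all assumptions made for this example, not details taken from the thesis.

```python
import math

def entropy(probs):
    """Shannon entropy of one predicted class distribution (uncertainty)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def density(i, pool):
    """Mean cosine similarity of pool[i] to every other unlabeled instance."""
    sims = [cosine(pool[i], pool[j]) for j in range(len(pool)) if j != i]
    return sum(sims) / len(sims) if sims else 0.0

def pick_query(pool, pool_probs, beta=1.0):
    """Index of the instance maximizing entropy * density**beta.

    pool       -- feature vectors of unlabeled articles (hypothetical input)
    pool_probs -- the current model's class probabilities for each article
    beta       -- assumed knob weighting density against uncertainty
    """
    scores = [
        entropy(p) * max(density(i, pool), 0.0) ** beta
        for i, p in enumerate(pool_probs)
    ]
    return max(range(len(pool)), key=scores.__getitem__)
```

With equally uncertain predictions, this score favors the article that is most similar to the rest of the pool, so a label for it is likely to inform many other articles; when one article is much more uncertain than the others, uncertainty can outweigh density.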

Bibliographic details

  • Author

    Vučko Žiga;

  • Year 2015
  • Original format PDF
