Published in: International Journal on Digital Libraries

Towards robust tags for scientific publications from natural language processing tools and Wikipedia


Abstract

In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. The first source of labels is Wikipedia. The second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We present a comparison of the two approaches, which shows that the discussed methods are to a large extent complementary. Moreover, the results give interesting insights into the completeness of Wikipedia knowledge in various scientific domains. As a next step, we examine the statistical properties of the obtained tags. It turns out that both methods show a qualitatively similar rank-frequency dependence, which is best approximated by a stretched exponential curve. The number of distinct tags per document also follows the same distribution for both methods and is well described by the negative binomial distribution. The developed tags are meant for use as features in various text mining tasks. Therefore, as a final step, we show preliminary results on their application to topic modeling.
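
For orientation, the two distributions named in the abstract can be written in their standard textbook forms. This is only a sketch: the parameterization and the symbols ($C$, $r_0$, $\beta$, $n$, $p$) are assumptions chosen for illustration, and the abstract does not state which form or which fitted values the authors use.

```latex
% Stretched exponential rank-frequency dependence (assumed form):
% f(r) is the frequency of the tag with rank r; C, r_0 > 0 and 0 < \beta < 1.
f(r) = C \exp\!\left[-\left(\frac{r}{r_0}\right)^{\beta}\right]

% Negative binomial distribution for the number K of distinct tags per document
% (assumed parameterization with n > 0 and 0 < p < 1):
P(K = k) = \frac{\Gamma(k + n)}{k!\,\Gamma(n)}\, p^{n} (1 - p)^{k},
\qquad k = 0, 1, 2, \dots

% Mean and variance under this parameterization:
\mathbb{E}[K] = \frac{n(1 - p)}{p},
\qquad
\operatorname{Var}[K] = \frac{n(1 - p)}{p^{2}}
```

With $\beta = 1$ the first expression reduces to a plain exponential. In the second, the variance exceeds the mean for every $0 < p < 1$, which is why the negative binomial is a common choice for overdispersed count data.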
