
Automated Non-Content Word List Generation Using hLDA

Abstract

In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example, we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of word lists extracted with this method against expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperform the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative.
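To make the downstream task concrete, the sketch below shows one common way a fixed non-content word list can serve as an authorship-attribution feature set: restrict a bag-of-words vectorizer to the list's vocabulary and train a simple classifier on the resulting counts. This is an illustration under stated assumptions, not the authors' pipeline; the word list, toy documents, and labels are invented, and the list here merely stands in for an hLDA-derived one.

```python
# Illustrative sketch: authorship attribution using only a fixed
# "non-content word" list as features. The word list and corpus below
# are invented stand-ins, not data from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for an hLDA-extracted non-content word list.
word_list = ["the", "of", "and", "to", "in", "that", "is", "it"]

# Toy documents with known author labels (0 and 1).
docs = [
    "the cat sat on the mat and it was the best of times",
    "it is the end of the day and the sun is in the sky",
    "to be or not to be that is the question",
    "that which we call a rose is sweet to the nose",
]
labels = [0, 0, 1, 1]

# Restricting the vocabulary means the classifier sees only the
# non-content words, mimicking function-word-based attribution.
vectorizer = CountVectorizer(vocabulary=word_list)
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)
pred = clf.predict(vectorizer.transform(["it is the best of the day"]))
```

Swapping in a different word list (expert, frequency-based, or hLDA-derived) changes only the `vocabulary` argument, which is what makes the comparison in the abstract straightforward to run.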
