Automated Non-Content Word List Generation Using hLDA

Abstract

In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperformed the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative.
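The abstract mentions frequent word lists as a common alternative to expert-created function word lists. While hLDA itself requires a hierarchical topic-model implementation, the frequent-word baseline rests on a simple observation: the highest-frequency words in any sizable corpus are dominated by non-content (function) words. A minimal sketch of that baseline, using a hypothetical toy corpus, could look like this:

```python
from collections import Counter

def frequent_word_list(documents, top_n=10):
    """Rank words by raw corpus frequency.

    In a large corpus the top of this ranking is dominated by
    non-content words, which is why frequent word lists serve as
    a language-independent stand-in for function word lists.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical toy corpus; a real application would use whole documents.
docs = [
    "the cat sat on the mat",
    "the dog lay on the rug",
    "a bird flew over the house",
]
print(frequent_word_list(docs, top_n=3))
```

Even on three short sentences, determiners and prepositions ("the", "on") rise to the top of the ranking, while content words remain in the long tail. The corpus-derived hLDA lists described in the paper aim to improve on this baseline by separating topic-neutral vocabulary from merely common content words.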