
Not All Character N-grams Are Created Equal: A Study in Authorship Attribution


Abstract

Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morpho-syntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single-domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.
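To make the idea of n-gram subgroups concrete, the following is a minimal sketch of how character n-grams might be tagged by where they occur. The category names (`prefix`, `suffix`, `mid_word`, `whole_word`, `punct`, `other`) and the helper functions are illustrative assumptions, not the scheme defined in the paper; in particular, "prefix" and "suffix" here are purely positional (word-initial and word-final n-grams) and only approximate true morphological affixes.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)

def char_ngrams(text, n=3):
    """Yield (start_index, ngram) for every character n-gram in text."""
    for i in range(len(text) - n + 1):
        yield i, text[i:i + n]

def categorize(text, i, n):
    """Assign a coarse, hypothetical category to the n-gram text[i:i+n]."""
    gram = text[i:i + n]
    if any(ch in PUNCT for ch in gram):
        return "punct"            # n-gram touches punctuation
    if not gram.isalpha():
        return "other"            # spans whitespace or digits
    # Look at the characters on either side to find word boundaries.
    before = text[i - 1] if i > 0 else " "
    after = text[i + n] if i + n < len(text) else " "
    starts_word = not before.isalpha()
    ends_word = not after.isalpha()
    if starts_word and ends_word:
        return "whole_word"
    if starts_word:
        return "prefix"           # positional proxy for a prefix
    if ends_word:
        return "suffix"           # positional proxy for a suffix
    return "mid_word"

def typed_ngram_counts(text, n=3):
    """Count n-grams keyed by (category, ngram) so groups can be compared."""
    counts = Counter()
    for i, _gram in char_ngrams(text, n):
        counts[(categorize(text, i, n), text[i:i + n])] += 1
    return counts

counts = typed_ngram_counts("walked, talked")
```

Keying the counts by `(category, ngram)` means the same string (for example `ked`) is counted separately when it appears in different positions, which is what allows each subgroup to be used as its own feature set and evaluated in isolation.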
