Conference: International Conference of the CLEF Association

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

Abstract

We compare the performance of character n-gram features (n = 3-8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol "NE" to avoid topic-dependent features.
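
To make the feature sets concrete, the following is a minimal sketch of how character n-grams (n = 3-8), word unigrams and bigrams, and named-entity masking could be computed. It is not the authors' implementation: the function names, the `c:`/`w:` key prefixes, the regex tokenizer, and the assumption that the named entities are already available as a list of strings are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count character n-grams of every order from n_min to n_max (inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts["c:" + text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams (unigrams and bigrams by default).

    Tokenization with a simple regex is an assumption; the abstract does not
    say how words were tokenized.
    """
    tokens = re.findall(r"\w+", text)
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def mask_entities(text, entities):
    """Replace each given entity string with the single symbol "NE".

    The entity list is assumed to come from some external recognizer.
    Longer entities are replaced first so that "Real Madrid" is masked as a
    whole before "Madrid" is considered.
    """
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text):
    """Merge character and word n-gram counts into one feature dictionary."""
    return char_ngrams(text) + word_ngrams(text)
```

A feature dictionary like the one returned by `combined_features` could then be vectorized and passed to a linear SVM; scikit-learn's `LinearSVC`, for instance, is built on LIBLINEAR, though whether the authors used that interface is not stated here.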