Conference: International conference of the CLEF Association

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

Abstract

We compare the performance of character n-gram features (n = 3-8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol "NE" to avoid topic-dependent features.
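The feature sets named in the abstract can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names, the tokenization rule, and the assumption that named entities arrive as a ready-made list of strings are all hypothetical, and the classifier itself (Liblinear SVM in the paper) is left out.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            # Prefix the key so character and word features never collide.
            counts[("char", text[i:i + n])] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word unigrams and bigrams (assumed tokenization: runs of \\w)."""
    tokens = re.findall(r"\w+", text)
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[("word", " ".join(tokens[i:i + n]))] += 1
    return counts


def mask_entities(text, entities):
    """Replace each given named entity with the single symbol NE.

    Assumption: the entity list comes from some external tagger; the
    abstract does not say how entities were detected.
    """
    # Longest first, so a multi-word entity is masked before its parts.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text, entities=()):
    """Word n-grams (n = 1-2) plus character n-grams (n = 3-8) after masking."""
    masked = mask_entities(text, entities)
    return char_ngrams(masked) + word_ngrams(masked)


if __name__ == "__main__":
    sample = "El gol de Messi en Buenos Aires"
    features = combined_features(sample, entities=["Messi", "Buenos Aires"])
    print(mask_entities(sample, ["Messi", "Buenos Aires"]))
    print(features[("word", "NE")])
```

The resulting counts would typically be turned into sparse vectors and passed to a linear classifier. Masking entities before feature extraction means that both the word and the character n-grams see only the `NE` symbol, which is the topic-independence setting the abstract describes.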
