

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus



Abstract

We compare the performance of character n-gram features (n = 3-8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol "NE" to avoid topic-dependent features.
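The feature sets named in the abstract can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names, the whitespace tokenizer, the hand-supplied entity list, and the sample sentence are all assumptions made for illustration, and the abstract does not say how named entities were detected.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count character n-grams for every order n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[("char", text[i:i + n])] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams (unigrams and bigrams by default).

    Tokens are split on whitespace, which is an assumption of this sketch.
    """
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[("word", " ".join(tokens[i:i + n]))] += 1
    return counts


def mask_named_entities(text, entities, symbol="NE"):
    """Replace each listed entity string with a single symbol.

    The entity list is supplied by the caller (hypothetical); a real
    pipeline would obtain it from a named-entity recognizer.
    """
    # Replace longer entities first so a short one cannot split a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, symbol)
    return text


def combined_features(text):
    """Union of word (n = 1-2) and character (n = 3-8) n-gram counts."""
    # Keys are tagged "word"/"char", so the two feature sets never collide.
    return word_ngrams(text) + char_ngrams(text)


# Hypothetical sample sentence, not taken from the corpus.
sample = "El presidente de México viajó a Madrid"
masked = mask_named_entities(sample, ["México", "Madrid"])
features = combined_features(masked)
```

Count dictionaries of this kind could then be vectorized and passed to a linear SVM; that step is omitted here because it depends on the chosen library.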


Bibliographic details

Source
International Conference of the CLEF Association | 2017 | 378p | 7 pages
Authors
Miguel A. Sanchez-Perez; Ilia Markov; Helena Gomez-Adorno; Grigori Sidorov;
Original format: PDF
Chinese Library Classification: TP312-53
Keywords
Feature selection; Authorship attribution; Author profiling; Discriminating between similar languages; Lexical features; Character n-grams;


Similar literature


1. Authorship attribution of Spanish poems using n-grams and the Web as Corpus [J]. Guzman-Cabrera Rafael. Journal of intelligent & fuzzy systems: Applications in Engineering and Technology, 2020, Issue 2Pta2

2. Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features [J]. Zachary Miller, Brian Dickinson, Wei Hu. International Journal of Intelligence Science, 2012, Issue 4

3. Combining n-grams and deep convolutional features for language variety classification [J]. Martinc Matej, Pollak Senja. Natural language engineering, 2019, Issue 5

4. Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus [C]. Miguel A. Sanchez-Perez, Ilia Markov, Helena Gomez-Adorno, International conference of the CLEF Association, 2017

5. A content analytic comparison of news frames in English- and Spanish-language newspapers [D]. Dulcan, Emily. 2006

6. Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets [O]. Dario Borrelli, Gabriela Gongora Svartzman, Carlo Lipizzi. 2020

7. Author Profiling at PAN: from Age and Gender Identification to Language Variety Identification (invited talk) [O]. Paolo Rosso. 2017

