Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

Abstract

We compare the performance of character n-gram features (n = 3-8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol "NE" to avoid topic-dependent features.
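
The feature sets named in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw counts, and a caller-supplied list of entity strings; the abstract does not state the paper's actual preprocessing, weighting, or named-entity recognizer, so the function names (`char_ngrams`, `word_ngrams`, `mask_entities`, `combined_features`) and the `c:`/`w:` key prefixes are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram of length n_min..n_max (inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts["c:" + text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word unigrams and bigrams (assumption: whitespace tokenization)."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def mask_entities(text, entities):
    """Replace each given entity string with the single symbol 'NE'.

    Assumption: the entity strings come from an external recognizer;
    longer strings are replaced first so nested names are handled once.
    """
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text, entities=()):
    """Combine character (n = 3-8) and word (n = 1-2) n-gram counts."""
    masked = mask_entities(text, entities)
    return char_ngrams(masked) + word_ngrams(masked)


features = combined_features("Madrid gana en casa", entities=["Madrid"])
```

Count dictionaries of this kind would then be vectorized and passed to a linear classifier such as a Liblinear SVM; that step is left out here because the abstract gives no training details.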

Bibliographic details

Source: International Conference of the CLEF Association, 2017, pp. 145-151 (7 pages)
Authors: Miguel A. Sanchez-Perez; Ilia Markov; Helena Gomez-Adorno; Grigori Sidorov
Original format: PDF
Keywords: Feature selection; Authorship attribution; Author profiling; Discriminating between similar languages; Lexical features; Character n-grams

Similar literature

1. Authorship attribution of Spanish poems using n-grams and the Web as Corpus [J]. Guzman-Cabrera Rafael. Journal of intelligent & fuzzy systems: Applications in Engineering and Technology. 2020, No. 2Pta2
2. Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features [J]. Zachary Miller, Brian Dickinson, Wei Hu. International Journal of Intelligence Science. 2012, No. 4
3. Combining n-grams and deep convolutional features for language variety classification [J]. Martinc Matej, Pollak Senja. Natural language engineering. 2019, No. 5
4. Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus [C]. Miguel A. Sanchez-Perez, Ilia Markov, Helena Gomez-Adorno. International Conference of the CLEF Association. 2017
5. A content analytic comparison of news frames in English- and Spanish-language newspapers [D]. Dulcan, Emily. 2006
6. Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets [O]. Dario Borrelli, Gabriela Gongora Svartzman, Carlo Lipizzi. 2020
7. Author Profiling at PAN: from Age and Gender Identification to Language Variety Identification (invited talk) [O]. Paolo Rosso. 2017
