Venue: Nordic Conference of Computational Linguistics

Comparing the Performance of Feature Representations for the Categorization of the Easy-to-Read Variety vs Standard Language

Abstract

We explore the effectiveness of four feature representations (bag-of-words, word embeddings, principal components and autoencoders) for the binary categorization of the easy-to-read variety vs standard language. "Standard language" refers to the ordinary language variety used by a population as a whole or by a community, while the "easy-to-read" variety is a simpler (or simplified) version of the standard language. We test the efficiency of these feature representations on three corpora, which differ in size, class balance, unit of analysis, language and topic. We rely on supervised and unsupervised machine learning algorithms. Results show that bag-of-words is a robust and straightforward feature representation for this task and performs well in many experimental settings. Its performance is equal or comparable to that achieved with principal components and autoencoders, whose preprocessing is, however, more time-consuming. Word embeddings are less accurate than the other feature representations for this classification task.
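
To make the bag-of-words setting concrete, here is a minimal sketch of binary classification over raw word counts. It is not the authors' pipeline: the abstract does not name the classifiers or corpora in detail, so the multinomial Naive Bayes model, the function names `tokenize`, `train_nb` and `predict`, and the six toy sentences are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_nb(docs, labels):
    """Collect bag-of-words counts per class for multinomial Naive Bayes."""
    vocab = set()
    word_counts = {}              # label -> Counter of token frequencies
    doc_counts = Counter(labels)  # label -> number of training documents
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        vocab.update(tokens)
        word_counts.setdefault(label, Counter()).update(tokens)
    return {"vocab": vocab, "word_counts": word_counts, "doc_counts": doc_counts}


def predict(model, text):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    vocab_size = len(model["vocab"])
    total_docs = sum(model["doc_counts"].values())
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)
        class_total = sum(counts.values())
        for tok in tokenize(text):
            if tok in model["vocab"]:  # tokens unseen in training are skipped
                score += math.log((counts[tok] + 1) / (class_total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data invented for this sketch; it is not drawn from the paper's corpora.
docs = [
    "the cat is on the mat",
    "we go to the shop",
    "the sun is hot",
    "the committee postponed the legislative deliberations indefinitely",
    "economic indicators suggest considerable volatility in the markets",
    "the proposal requires substantial institutional restructuring",
]
labels = ["easy", "easy", "easy", "standard", "standard", "standard"]

model = train_nb(docs, labels)
print(predict(model, "the dog is on the mat"))
print(predict(model, "the committee requires substantial restructuring"))
```

The sketch needs no preprocessing beyond counting tokens, which is consistent with the abstract's description of bag-of-words as straightforward. Principal components or autoencoder features would add a separate fitting step before classification.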