Nordic Conference of Computational Linguistics

Comparing the Performance of Feature Representations for the Categorization of the Easy-to-Read Variety vs Standard Language



Abstract

We explore the effectiveness of four feature representations (bag-of-words, word embeddings, principal components and autoencoders) for the binary categorization of the easy-to-read variety vs standard language. 'Standard language' refers to the ordinary language variety used by a population as a whole or by a community, while the 'easy-to-read' variety is a simpler (or a simplified) version of the standard language. We test the efficiency of these feature representations on three corpora, which differ in size, class balance, unit of analysis, language and topic. We rely on supervised and unsupervised machine learning algorithms. Results show that bag-of-words is a robust and straightforward feature representation for this task and performs well in many experimental settings. Its performance is equal or equivalent to that achieved with principal components and autoencoders, whose preprocessing is, however, more time-consuming. Word embeddings are less accurate than the other feature representations for this classification task.
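As a rough illustration of what a bag-of-words representation for this binary task looks like, the following minimal sketch builds term-count vectors and classifies a text by cosine similarity to per-class profiles. The nearest-centroid classifier, the function names and the toy sentences are all assumptions made for illustration; the abstract does not name the algorithms or the data used in the paper.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Map a text to a sparse term-count vector (word -> count)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Sum the count vectors of one class into a single class profile."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_texts):
    """Build one bag-of-words profile per class label."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(bag_of_words(text))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(model, text):
    """Return the label whose profile is most similar to the text."""
    vec = bag_of_words(text)
    return max(model, key=lambda label: cosine(vec, model[label]))


# Hypothetical toy data, invented for illustration only (not from the paper).
TRAIN = [
    ("the cat is big", "easy"),
    ("the dog is small", "easy"),
    ("the committee subsequently ratified the aforementioned proposal", "standard"),
    ("municipal authorities subsequently implemented comprehensive regulations", "standard"),
]

model = train(TRAIN)
print(predict(model, "the bird is small"))
print(predict(model, "authorities ratified comprehensive proposal"))
```

The sketch needs no preprocessing beyond tokenization, which is in line with the abstract's point that bag-of-words is the more straightforward representation; principal components or autoencoders would add a fitting step on top of these count vectors.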

Bibliographic details

  • Conference location: Turku (FI)
  • Author affiliations

    RISE Research Institutes of Sweden (Division ICT - RISE SICS East) Stockholm Sweden;

    Linköping University (IDA) Linköping Sweden;

    RISE Research Institutes of Sweden Linköping University (IDA) Linköping Sweden;

  • Original format: PDF
  • Date indexed: 2022-08-26 14:42:09
