Conference: IEEE International Conference on Application of Information and Communication Technologies

Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

Abstract

We address the problem of normalizing user-generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e., difficult to process because of (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. We therefore propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, and show that in both cases normalization improves overall accuracy.
