Conference: Workshop on noisy user-generated text

Normalising Non-standardised Orthography in Algerian Code-switched User-generated Data

Abstract

We work with Algerian, an under-resourced non-standardised Arabic variety, for which we compile a new parallel corpus of user-generated textual data matched with normalised and corrected human annotations that follow a data-driven and linguistically motivated standard of our own. We use an end-to-end deep neural model designed to deal with context-dependent spelling correction and normalisation. Results indicate that a model with two CNN sub-network encoders and an LSTM decoder performs best, and that word context matters. Additionally, preprocessing the data token by token with an edit-distance based aligner significantly improves performance. We get promising results when spelling correction and normalisation are used as a pre-processing step for a downstream task, detecting binary Semantic Textual Similarity.
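The abstract does not say how the edit-distance based aligner works internally, so the following is only a minimal sketch of one plausible reading: pair each noisy token with its normalised counterpart by dynamic programming, with the substitution cost taken to be the character-level Levenshtein distance scaled by the longer token. The function names (`char_edit_distance`, `sub_cost`, `align_tokens`), the `gap` penalty of 0.75, and the Latin-script example tokens are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Optional, Tuple


def char_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def sub_cost(a: str, b: str) -> float:
    """Cost in [0, 1] of pairing token a with token b (assumed scaling)."""
    longest = max(len(a), len(b))
    return char_edit_distance(a, b) / longest if longest else 0.0


def align_tokens(
    src: List[str], tgt: List[str], gap: float = 0.75
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Align noisy tokens (src) with normalised tokens (tgt).

    Returns (src_token, tgt_token) pairs; None marks a token that has
    no counterpart on the other side. `gap` is a hypothetical penalty.
    """
    n, m = len(src), len(tgt)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),
                cost[i - 1][j] + gap,   # src token left unpaired
                cost[i][j - 1] + gap,   # tgt token left unpaired
            )

    # Walk back from the bottom-right corner to recover the pairing.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1])):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap):
            pairs.append((src[i - 1], None))
            i -= 1
        else:
            pairs.append((None, tgt[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    noisy = ["hello", "wrld", "!!"]   # made-up example tokens
    clean = ["hello", "world"]
    for pair in align_tokens(noisy, clean):
        print(pair)
```

Under these assumptions, aligned pairs of this kind could serve as token-by-token training examples for a normalisation model, and an unpaired token (a `None` on either side) would mark material that was inserted or dropped during annotation.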