Venue: Workshop on Language in Social Media

Processing Informal, Romanized Pakistani Text Messages


Abstract

Regardless of language, the standard character set for text messages (SMS) and many other social media platforms is the Roman alphabet. There are romanization conventions for some character sets, but they are used inconsistently in informal text, such as SMS. In this work, we convert informal, romanized Urdu messages into the native Arabic script and normalize non-standard SMS language. Doing so prepares the messages for existing downstream processing tools, such as machine translation, which are typically trained on well-formed, native script text. Our model combines information at the word and character levels, allowing it to handle out-of-vocabulary items. Compared with a baseline deterministic approach, our system reduces both word and character error rate by over 50%.
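
The abstract reports results as word and character error rates but does not say how they are computed. A minimal sketch, assuming the common definition (Levenshtein edit distance between system output and reference, divided by reference length), might look like the following. The function names `edit_distance`, `wer`, `cer`, and `relative_reduction` are illustrative and do not come from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points, spaces included."""
    return error_rate(list(reference), list(hypothesis))


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which the system lowers the baseline error rate."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline
```

Under these assumptions, a reduction of "over 50%" corresponds to `relative_reduction(baseline, system) > 0.5` for both `wer` and `cer`. Because the comparison is over code points, native-script output and references would need to share the same Unicode normalization form before scoring.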
