Conference: International Conference on Image Processing and Robotics

Standardizing Sinhala Code-Mixed Text using Dictionary based Approach


Abstract

Code-mixing is one of the biggest challenges in processing social media text. This paper presents a thorough review of the state of the art in code-mixed text processing and identifies the main challenges in processing Sinhala code-mixed text. The review covers how researchers have approached tasks such as normalization of code-mixed data and word-level language identification of code-mixed text. It identifies the challenges specific to Sinhala code-mixed text, including phonetic transliteration, borrowing of words, spelling errors, embedded languages, the use of numeric characters in words, and discourse marker switching. Given these challenges, it is necessary to standardize Singlish text to Sinhala letters, since the same word has many different representations. A dictionary is therefore proposed in which Sinhala letters are mapped to Singlish text, which could serve as a basis for standardization. Finally, the paper discusses planned future work on using the proposed dictionary for Sinhala code-mixed data analysis.
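The dictionary idea in the abstract can be illustrated with a minimal Python sketch. It assumes a mapping from each standard Sinhala form to a handful of Singlish (romanized) spellings, inverts it into a lookup table, and replaces known tokens while leaving everything else untouched. The entries, the names `SINHALA_TO_SINGLISH` and `standardize`, and the tokenization rule are all hypothetical and chosen for illustration; the abstract does not describe the actual dictionary contents or matching procedure.

```python
import re

# Hypothetical entries: each standard Sinhala form lists a few Singlish
# spellings. Illustrative only; this is not the paper's dictionary.
SINHALA_TO_SINGLISH = {
    "මම": ["mama", "mma"],
    "ඔයා": ["oya", "oyaa", "oyya"],
    "කොහොමද": ["kohomada", "kohomda", "khmd"],
}

# Invert the mapping so each Singlish variant points to its standard form.
VARIANT_TO_SINHALA = {
    variant: sinhala
    for sinhala, variants in SINHALA_TO_SINGLISH.items()
    for variant in variants
}


def standardize(text: str) -> str:
    """Replace known Singlish tokens with Sinhala; keep unknown tokens as-is."""

    def replace(match: re.Match) -> str:
        token = match.group(0)
        # Lookup is case-insensitive; unknown tokens (e.g. English words)
        # are returned unchanged.
        return VARIANT_TO_SINHALA.get(token.lower(), token)

    # Tokens are runs of ASCII letters and digits, since the abstract notes
    # that numeric characters can appear inside words (an assumed rule).
    return re.sub(r"[A-Za-z0-9]+", replace, text)
```

With these assumed entries, `standardize("Oyaa kohomda?")` returns `"ඔයා කොහොමද?"`, and English words in a mixed sentence pass through unchanged. A plain lookup like this only covers spellings that were listed in advance; unseen variants would need something beyond exact matching, which the abstract leaves to future work.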
