Roman to Gurmukhi Social Media Text Normalization

Jagroop Kaur; Jaswinder Singh

Abstract

Purpose - Normalization is an important step in any natural language processing application that handles social media text. Social media text poses problems that are not present in regular text. Recently, a considerable amount of work has been done in this direction, but mostly for the English language. Non-English speakers code-mix text with their native language and post it on social media using the Roman script. This kind of text further aggravates the normalization problem. This paper discusses the concept of normalization with respect to code-mixed social media text and proposes a model to normalize such text.

Design/methodology/approach - The system is divided into two phases: candidate generation and most probable sentence selection. Candidate generation is treated as a machine translation task in which the Roman text is the source language and the Gurmukhi text is the target language. A character-based translation system is proposed to generate candidate tokens. Once candidates are generated, the second phase uses beam search to select the most probable sentence based on a hidden Markov model.

Findings - Character error rate (CER) and bilingual evaluation understudy (BLEU) scores are reported. The proposed system has been compared with Akhar software and the RB_R2G system, which are also capable of transliterating Roman text to Gurmukhi. The proposed system outperforms Akhar software. The CER and BLEU scores are 0.268121 and 0.6807939, respectively, for ill-formed text.

Research limitations/implications - It was observed that the system produces dialectal variants of a word, or the word with minor errors such as a missing diacritic. A spell checker can improve the output of the system by correcting these minor errors. Extensive experimentation is needed to optimize the language identifier, which will further improve the output. The language model also needs further exploration. Inclusion of wider context, particularly from social media text, is an important area that deserves further investigation.

Practical implications - The practical implications of this study are: (1) development of a parallel dataset containing Roman and Gurmukhi text; (2) development of a dataset annotated with language tags; (3) development of the normalization system, which is the first of its kind and proposes a translation-based solution for normalizing noisy social media text from Roman to Gurmukhi, and which can be extended to any pair of scripts; (4) the proposed system can be used for better analysis of social media text. Theoretically, the study helps in better understanding text normalization in the social media context and opens the door to further research in multilingual social media text normalization.

Originality/value - Existing research focuses on normalizing monolingual text. This study contributes towards the development of a normalization system for multilingual text.
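
The second phase described in the abstract, beam search over per-token candidates scored by a hidden Markov model, can be illustrated with a minimal sketch. The abstract does not give the model details, so everything specific here is an assumption: the function name `beam_search`, the use of candidate scores as emission probabilities, the bigram transition table, the fallback probability for unseen word pairs, and the toy Gurmukhi candidates and numbers, none of which come from the paper.

```python
import math
from typing import Dict, List, Tuple


def beam_search(
    candidates: List[List[Tuple[str, float]]],
    bigram_logprob: Dict[Tuple[str, str], float],
    beam_width: int = 3,
    unseen_logprob: float = math.log(1e-6),
) -> List[str]:
    """Select a high-scoring word sequence from per-token candidate lists.

    candidates[i] holds (word, emission log-probability) pairs for token i.
    bigram_logprob maps (previous word, word) to a transition log-probability.
    Both tables are hypothetical stand-ins for the paper's models.
    """
    # Each beam entry is (cumulative log-probability, path so far).
    beams: List[Tuple[float, List[str]]] = [(0.0, ["<s>"])]
    for options in candidates:
        if not options:
            raise ValueError("every token needs at least one candidate")
        expanded = []
        for score, path in beams:
            for word, emit_lp in options:
                # Transition from the previous word, with a fixed
                # fallback for word pairs missing from the table.
                trans_lp = bigram_logprob.get((path[-1], word), unseen_logprob)
                expanded.append((score + trans_lp + emit_lp, path + [word]))
        # Keep only the beam_width best partial sentences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1][1:]  # drop the start symbol


# Toy candidates for the Roman input "main ghar" (illustrative values only).
toy_candidates = [
    [("ਮੈਂ", math.log(0.6)), ("ਮੈਨ", math.log(0.4))],
    [("ਗਰ", math.log(0.55)), ("ਘਰ", math.log(0.45))],
]
toy_bigrams = {
    ("<s>", "ਮੈਂ"): math.log(0.5),
    ("ਮੈਂ", "ਘਰ"): math.log(0.4),
}

best_sentence = beam_search(toy_candidates, toy_bigrams)
print(" ".join(best_sentence))
```

In this toy setup the second token's top-scoring candidate on its own is not the one selected, because the transition score from the previous word favours the alternative. This is the motivation for choosing the sentence jointly rather than taking the best candidate per token.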
