
A Modular Approach for Social Media Text Normalization




Normalized data is the backbone of many Natural Language Processing (NLP), Information Retrieval (IR), data mining, and Machine Translation (MT) applications. We therefore propose an approach to normalize the colloquial and abbreviated text posted on social media platforms such as Twitter and Facebook. The proposed approach is based on Levenshtein distance, the demetaphone algorithm, and dictionary mappings. The standard LexNorm 1.2 dataset of English tweets is used to validate the proposed modular approach, and the experimental results are compared with existing unsupervised approaches. The modular approach outperforms the other normalization techniques considered, achieving 83.6% precision, recall, and F-score, along with a BLEU score of 91.1%.
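To make the combination of dictionary mappings and Levenshtein distance concrete, the following is a minimal Python sketch. It is not the authors' implementation: the names `SLANG_MAP`, `VOCABULARY`, `normalize_token`, and the `max_distance` threshold are hypothetical, the word lists are toy data, and the phonetic matching step mentioned in the abstract is omitted entirely.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical toy resources; a real system would load much larger lists.
SLANG_MAP = {"u": "you", "2moro": "tomorrow", "gr8": "great"}
VOCABULARY = ["tomorrow", "great", "see", "you", "friend"]


def normalize_token(token: str, max_distance: int = 2) -> str:
    """Normalize one token: dictionary mapping first, then nearest word."""
    lower = token.lower()
    if lower in SLANG_MAP:        # step 1: direct dictionary mapping
        return SLANG_MAP[lower]
    if lower in VOCABULARY:       # step 2: already a known word
        return lower
    # step 3: closest vocabulary word, accepted only if it is near enough
    best = min(VOCABULARY, key=lambda w: levenshtein(lower, w))
    if levenshtein(lower, best) <= max_distance:
        return best
    return token                  # no confident candidate: leave unchanged


def normalize(text: str) -> str:
    """Normalize a whitespace-tokenized string token by token."""
    return " ".join(normalize_token(t) for t in text.split())


if __name__ == "__main__":
    print(normalize("see u 2moro frend"))
```

The distance threshold is the main design choice in this sketch: a small value keeps unknown tokens (names, hashtags) from being forced onto unrelated dictionary words, at the cost of leaving heavily abbreviated forms unchanged unless the mapping table covers them.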


