Automatically Constructing a Normalisation Dictionary for Microblogs

Abstract

Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.
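The pipeline described in the abstract (generate candidate pairs from context, rank them by string similarity, keep the highly ranked pairs, then normalise by substitution) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the one-word context window, the shared-context count, the thresholds, the toy corpus, and the use of `difflib.SequenceMatcher` as the string-similarity measure are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def context_counts(messages, window=1):
    """Map each token to a Counter of (offset, neighbour) context features."""
    contexts = defaultdict(Counter)
    for tokens in messages:
        for i, tok in enumerate(tokens):
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(tokens):
                    contexts[tok][(off, tokens[j])] += 1
    return contexts


def build_dictionary(messages, vocabulary, min_shared=2, min_similarity=0.5):
    """Build {variant: normalisation} (hypothetical helper, assumed thresholds)."""
    contexts = context_counts(messages)
    dictionary = {}
    for word, ctx in contexts.items():
        if word in vocabulary:
            continue  # only out-of-vocabulary tokens can be variants
        # Step 1: candidates are known words that share enough contexts.
        candidates = [
            iv for iv in vocabulary
            if iv in contexts and sum((ctx & contexts[iv]).values()) >= min_shared
        ]
        # Step 2: rank the candidates by string similarity to the variant.
        scored = sorted(
            ((SequenceMatcher(None, word, iv).ratio(), iv) for iv in candidates),
            reverse=True,
        )
        # Step 3: keep only a highly ranked pair.
        if scored and scored[0][0] >= min_similarity:
            dictionary[word] = scored[0][1]
    return dictionary


def normalise(tokens, dictionary):
    """Normalise a tokenised message by simple dictionary substitution."""
    return [dictionary.get(tok, tok) for tok in tokens]


# Toy data (made up for illustration only).
vocabulary = {"see", "you", "tomorrow", "night", "call", "me", "morning", "today"}
messages = [
    "see you tomorrow night".split(),
    "see you tmrw night".split(),
    "call me tomorrow morning".split(),
    "call me tmrw morning".split(),
    "see you today".split(),
    "call me today".split(),
]

dictionary = build_dictionary(messages, vocabulary)
print(dictionary)                                   # {'tmrw': 'tomorrow'}
print(normalise("see you tmrw".split(), dictionary))  # ['see', 'you', 'tomorrow']
```

In this toy corpus both "tomorrow" and "today" share contexts with "tmrw", so context alone cannot separate them; the string-similarity ranking is what selects "tomorrow". Once the dictionary is built, normalisation is a single lookup per token, which is what makes the approach lightweight for high-volume pre-processing.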
