Journal: IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics

Mining Pinyin-to-character conversion rules from large-scale corpus: a rough set approach

Abstract

This paper introduces a rough set technique for mining Pinyin-to-character (PTC) conversion rules. It first presents a text-structuring method that builds, for each pinyin, a language information table from a corpus; the method is then applied to a free-form textual corpus. Data generalization and rule extraction algorithms are then used to eliminate redundant information and extract consistent PTC conversion rules. The model design also addresses several important issues: the long-distance dependency problem, the storage requirements of the rule base, and the consistency of the extracted rules. The performance of the extracted rules and the effects of different model parameters are evaluated experimentally. The results show that, with the smoothing method, high conversion precision (0.947) and recall (0.84) can be achieved even for rules represented directly by pinyin rather than words. A comparison with the baseline tri-gram model also shows that our method and the tri-gram language model complement each other well.
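To make the idea of a per-pinyin information table concrete, here is a minimal Python sketch. It is not the paper's algorithm: the toy corpus, the one-syllable context window, and the helper names `build_table` and `extract_consistent_rules` are all assumptions chosen for illustration. It only shows how context attributes around a target pinyin can form a decision table, and how a rule is kept when its condition maps to exactly one character.

```python
from collections import defaultdict

# Toy aligned corpus: each sentence is a list of (pinyin, character) pairs.
# These sentences are illustrative assumptions, not data from the paper.
CORPUS = [
    [("ta", "他"), ("shi", "是"), ("xue", "学"), ("sheng", "生")],
    [("ta", "她"), ("shi", "是"), ("lao", "老"), ("shi", "师")],
    [("kao", "考"), ("shi", "试")],
    [("lao", "老"), ("shi", "师"), ("hao", "好")],
]


def build_table(corpus, target):
    """Build rows (left pinyin, right pinyin, character) for one target pinyin.

    The first two fields are condition attributes; the last is the decision.
    """
    rows = []
    for sentence in corpus:
        for i, (pinyin, char) in enumerate(sentence):
            if pinyin != target:
                continue
            left = sentence[i - 1][0] if i > 0 else "<s>"
            right = sentence[i + 1][0] if i + 1 < len(sentence) else "</s>"
            rows.append((left, right, char))
    return rows


def extract_consistent_rules(rows, attrs):
    """Keep condition -> character rules whose condition has a single decision.

    `attrs` selects which condition columns to use (0 = left, 1 = right).
    Conditions that map to more than one character are inconsistent and dropped.
    """
    decisions = defaultdict(set)
    for row in rows:
        condition = tuple(row[a] for a in attrs)
        decisions[condition].add(row[2])
    return {
        condition: next(iter(chars))
        for condition, chars in decisions.items()
        if len(chars) == 1
    }


if __name__ == "__main__":
    table = build_table(CORPUS, "shi")
    print(extract_consistent_rules(table, (0,)))  # left context only
    print(extract_consistent_rules(table, (1,)))  # right context only
```

In this toy data the left-context attribute alone yields a consistent rule for every row, while the right-context attribute leaves the sentence-final case ambiguous. That is the kind of redundancy a generalization step could exploit to shrink the rule base.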
