Journal: ACM Transactions on Asian Language Information Processing

A Seed-Based Method for Generating Chinese Confusion Sets




In natural language, people often misuse a word (called a "confused word") in place of other words (called "confusing words"). Many approaches to finding and correcting misspellings are based on a simple notion called a "confusion set": the confusion set of a confused word consists of its confusing words. In this article, we propose a new method of building Chinese character confusion sets.

Our method is composed of two major phases. In the first phase, we build a list of seed confusion sets for each Chinese character, based on similarity in character pinyin or similarity in character shape. In this phase, all confusion sets are constructed manually and organized into a graph, called a "seed confusion graph" (SCG), in which vertices denote characters and edges are pairs of characters in the form (confused character, confusing character).

In the second phase, we extend the SCG by acquiring more (confused character, confusing character) pairs from a large Chinese corpus. To do this, we use several word patterns to generate new confusion pairs and then verify the pairs before adding them to the SCG. Comprehensive experiments show that our method of extending confusion sets is effective. We also use the confusion sets in Chinese misspelling correction to show the utility of our method.
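The two phases described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the class name `SeedConfusionGraph`, its methods, the `verify` callback, and the example character pairs are all hypothetical, and the abstract does not say how candidate pairs are actually generated from word patterns or verified.

```python
from collections import defaultdict


class SeedConfusionGraph:
    """Directed graph whose edges are (confused character, confusing character) pairs.

    Hypothetical structure for illustration; the abstract only states that
    vertices are characters and edges are such pairs.
    """

    def __init__(self):
        # Maps each confused character to the set of its confusing characters.
        self._edges = defaultdict(set)

    def add_pair(self, confused, confusing):
        """Add one edge; a character is never confused with itself."""
        if confused != confusing:
            self._edges[confused].add(confusing)

    def confusion_set(self, confused):
        """Return a copy of the confusion set of `confused` (empty if unknown)."""
        return set(self._edges.get(confused, set()))

    def extend(self, candidate_pairs, verify):
        """Phase two: add candidate pairs that pass `verify`.

        `verify` is an assumed callback (confused, confusing) -> bool standing
        in for the unspecified verification step. Returns the number of new edges.
        """
        added = 0
        for confused, confusing in candidate_pairs:
            if confused == confusing:
                continue
            if confusing in self._edges.get(confused, ()):
                continue  # edge already present
            if verify(confused, confusing):
                self.add_pair(confused, confusing)
                added += 1
        return added


# Phase one: manually entered seed pairs (illustrative examples, not from the paper).
scg = SeedConfusionGraph()
scg.add_pair("在", "再")  # similar pinyin
scg.add_pair("己", "已")  # similar shape

# Phase two: candidates that would come from corpus patterns; here a toy list
# and a toy verifier that rejects one pair.
candidates = [("的", "得"), ("的", "地"), ("在", "再")]
new_edges = scg.extend(candidates, verify=lambda a, b: b != "地")
```

Storing the graph as a mapping from each confused character to a set makes the confusion-set lookup a single dictionary access, which is the operation a spelling corrector would need most often.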


