Web-based acquisition of Japanese katakana variants

Abstract

This paper describes a method for detecting Japanese Katakana variants in a large corpus. Katakana words, which are mainly used for loanwords, cause problems in information retrieval and related tasks because transliteration produces several spelling variations, all of which can be orthographically acceptable. Previous work manually defined Katakana rewrite rules for generating variants, such as ベ (be) and ヴェ (ve) being interchangeable, and also manually defined the weight of each edit operation used to transform one string into another in order to detect these variants. However, these approaches have not been able to keep up with the ever-increasing number of loanwords and their variants. In the method proposed in this paper, the weight of each edit operation is assigned automatically based on Web data. In experiments, it performed almost as well as a method with manually determined weights. The advantages of our method are therefore that 1) it requires no expertise in linguistics to determine the weight of each operation, and 2) it can keep up with new Katakana loanwords simply by collecting text data from the Web and acquiring new edit-operation weights automatically. It also achieved 98.6% recall and 86.3% precision in the task of extracting Katakana variant pairs from 38 years' worth of Japanese newspaper article corpora.
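
To make the idea of weighted edit operations concrete, the following is a minimal Python sketch of a character-level weighted edit distance between two Katakana strings. It is not the authors' implementation: the abstract does not say how weights are derived from Web data, so the tables `SUB_COST` and `INDEL_COST`, their values, and the function name `weighted_edit_distance` are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, Tuple

# Hypothetical weights, invented for this example. In the paper, weights
# are acquired from Web data; the procedure is not given in the abstract.
SUB_COST: Dict[Tuple[str, str], float] = {("ベ", "ヴ"): 0.2}
INDEL_COST: Dict[str, float] = {"ェ": 0.1}
DEFAULT_COST = 1.0  # cost of any operation without a specific weight


def weighted_edit_distance(
    src: str,
    dst: str,
    sub_cost: Dict[Tuple[str, str], float] = SUB_COST,
    indel_cost: Dict[str, float] = INDEL_COST,
    default: float = DEFAULT_COST,
) -> float:
    """Return the minimum total weight of edits turning src into dst."""
    n, m = len(src), len(dst)
    # dist[i][j] = cheapest way to turn src[:i] into dst[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + indel_cost.get(src[i - 1], default)
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + indel_cost.get(dst[j - 1], default)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = src[i - 1], dst[j - 1]
            if a == b:
                sub = 0.0
            else:
                # Substitution weights are treated as symmetric here.
                sub = sub_cost.get((a, b), sub_cost.get((b, a), default))
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost.get(a, default),  # delete a
                dist[i][j - 1] + indel_cost.get(b, default),  # insert b
                dist[i - 1][j - 1] + sub,                     # match/substitute
            )
    return dist[n][m]


if __name__ == "__main__":
    # Two spellings of "Venice": cheap under the hypothetical weights.
    print(weighted_edit_distance("ベネチア", "ヴェネチア"))
    # An unrelated substitution falls back to the default cost.
    print(weighted_edit_distance("ベネチア", "カネチア"))
```

Under these assumed weights, a pair such as ベネチア / ヴェネチア scores far lower than a pair differing by an arbitrary character, so a threshold on the distance could separate likely variants from unrelated words. The threshold itself would also have to be chosen empirically.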