Conference: International Conference on Speech and Computer

A Trainable Method for the Phonetic Similarity Search in German Proper Names

Abstract

Efficient methods for similarity search in word databases play a significant role in applications such as robust searching or indexing of names and addresses, spell-checking, and the monitoring of trademark rights. The underlying distance measures are meant to reflect the users' similarity criteria, and phonetic search algorithms have been well established for decades. Nonetheless, rule-based phonetic algorithms have weak points, e.g. their strong language dependency, the search overhead caused by high tolerance or, conversely, the risk of missing valid matches, which in some cases results in merely pseudo-phonetic behavior. In contrast, we propose a novel, adaptive method for similarity search in words, based on a trainable grapheme-to-phoneme (G2P) converter that generates the most likely and largely correct pronunciations. Only in a second step is the similarity search performed on the phonemic reference data, using a conventional string metric such as the Levenshtein distance (LD). The G2P algorithm achieves a string accuracy of up to 99.5% on a German pronunciation lexicon and can be trained for different languages or specific domains such as proper names. The similarity tolerance can easily be adjusted through parameters such as the admissible number or likelihood of pronunciation variants, as well as the phonemic or graphemic LD. As a proof of concept, we compare the similarity matches of the G2P-based search method with those of the conventional Cologne phonetics (Koelner Phonetik, KP) algorithm on a German surname database and a telephone book including first names, surnames and street names.
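
The two-step idea described in the abstract (convert names to phoneme strings first, then compare those strings with the Levenshtein distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the trained G2P converter is replaced by a hypothetical hand-written lookup table (`TOY_PRONUNCIATIONS`, `toy_g2p`) with made-up, simplified phoneme strings, and the function name `phonetic_search` and its `max_distance` parameter are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion, substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# ASSUMPTION: stand-in for a trained G2P converter. The phoneme strings are
# invented for this example and use one character per phoneme.
TOY_PRONUNCIATIONS: Dict[str, str] = {
    "Meier": "maia",
    "Meyer": "maia",
    "Maier": "maia",
    "Mayr": "maia",
    "Müller": "myla",
    "Miller": "mila",
    "Schmidt": "Smit",
    "Schmitt": "Smit",
}


def toy_g2p(name: str) -> str:
    """Hypothetical G2P step: look the pronunciation up in the toy table."""
    return TOY_PRONUNCIATIONS[name]


def phonetic_search(query: str,
                    names: Iterable[str],
                    g2p: Callable[[str], str],
                    max_distance: int = 0) -> List[Tuple[str, int]]:
    """Return (name, phonemic LD) pairs within max_distance of the query."""
    query_phonemes = g2p(query)
    hits = []
    for name in names:
        distance = levenshtein(query_phonemes, g2p(name))
        if distance <= max_distance:
            hits.append((name, distance))
    # Closest matches first, ties broken alphabetically.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    names = list(TOY_PRONUNCIATIONS)
    print(phonetic_search("Meyer", names, toy_g2p))
    print(phonetic_search("Müller", names, toy_g2p, max_distance=1))
```

With `max_distance=0` only names whose toy pronunciation is identical to the query's are returned; raising it widens the tolerance, which mirrors the adjustable phonemic LD mentioned in the abstract. Handling several pronunciation variants per name would require `g2p` to return a list of candidates, which this sketch leaves out.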
