A Trainable Method for the Phonetic Similarity Search in German Proper Names

Abstract

Efficient methods for similarity search in word databases play a significant role in applications such as robust search or indexing of names and addresses, spell-checking algorithms, and the monitoring of trademark rights. The underlying distance measures are tied to the users' similarity criteria, and phonetic search algorithms have been well established for decades. Nonetheless, rule-based phonetic algorithms exhibit some weak points, e.g. their strong language dependency, the search overhead caused by tolerance or, conversely, the risk of missing valid matches, which in some cases results in merely pseudo-phonetic functionality. In contrast, we suggest a novel, adaptive method for similarity search in words, based on a trainable grapheme-to-phoneme (G2P) converter that generates the most likely and largely correct pronunciations. Only in a second step is the similarity search performed on the phonemic reference data, using a conventional string metric such as the Levenshtein distance (LD). The G2P algorithm achieves a string accuracy of up to 99.5% on a German pronunciation lexicon and can be trained for different languages or specific domains such as proper names. The similarity tolerance can be easily adjusted by parameters such as the admissible number or likelihood of pronunciation variants as well as by the phonemic or graphemic LD. As a proof of concept, we compare the G2P-based search method with similarity matches from the conventional Cologne phonetics (Koelner Phonetik, KP) algorithm on a German surname database and on a telephone book including first name, surname and street name.
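
The two-step idea in the abstract (convert graphemes to phonemes first, then compare the phonemic strings with the Levenshtein distance) can be sketched as follows. This is only an illustration under stated assumptions: the lookup table `TOY_G2P` is a hypothetical stand-in for the trained G2P converter, its SAMPA-like transcriptions are rough examples rather than output of the authors' system, and the names `levenshtein`, `phonemic_search` and `max_ld` are chosen here for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a trained G2P converter: a fixed lookup table.
# The SAMPA-like transcriptions are rough illustrations, not model output.
TOY_G2P: Dict[str, str] = {
    "Meier": "maI6",
    "Meyer": "maI6",
    "Maier": "maI6",
    "Mayer": "maI6",
    "Müller": "mYl6",
    "Miller": "mIl6",
    "Schmidt": "SmIt",
    "Schmitt": "SmIt",
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def phonemic_search(query: str, names: List[str],
                    max_ld: int = 0) -> List[Tuple[str, int]]:
    """Return (name, distance) pairs whose phonemic LD to the query is <= max_ld."""
    query_phon = TOY_G2P[query]  # step 1: graphemes -> phonemes
    hits = []
    for name in names:
        d = levenshtein(query_phon, TOY_G2P[name])  # step 2: string metric
        if d <= max_ld:
            hits.append((name, d))
    return sorted(hits, key=lambda h: (h[1], h[0]))


if __name__ == "__main__":
    names = list(TOY_G2P)
    print(phonemic_search("Meier", names, max_ld=0))
    print(phonemic_search("Müller", names, max_ld=1))
```

In this toy setup, `max_ld` plays the role of the adjustable similarity tolerance: with `max_ld=0` only names with identical transcriptions match, and raising it admits phonemically close names. A real system would replace the lookup table with a trained converter that may return several pronunciation variants per name.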

Bibliographic details

Source
International Conference on Speech and Computer | 2017 | pp. 46-55 | 10 pages
Authors
Oliver Jokisch; Horst-Udo Hain
Original format
PDF
Keywords
Phonetic similarity search; Trainable G2P; Levenshtein distance

