Name Phylogeny: A Generative Model of String Variation


Abstract

Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
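
For context, the untrained Levenshtein baseline named in the abstract can be sketched as follows. This is only an illustration of that baseline, not the paper's learned transducer or its phylogeny model. The helper `nearest_variants`, its lowercasing step, and the sample names are assumptions made for the example.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_variants(query: str, names: List[str], k: int = 3) -> List[str]:
    """Hypothetical helper: rank candidate names by edit distance to query.

    Comparison is case-insensitive (an assumption); ties break alphabetically.
    """
    q = query.lower()
    ranked = sorted(names, key=lambda n: (levenshtein(q, n.lower()), n))
    return ranked[:k]


if __name__ == "__main__":
    candidates = ["Barak Obama", "Hillary Clinton", "Barack H. Obama"]
    print(nearest_variants("Barack Obama", candidates, k=2))
```

A fixed distance of this kind treats every edit as equally costly. The abstract's claim is that a transducer trained on the collection itself finds name variants more effectively than such untrained measures.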
