Journal: Information Retrieval

On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages

Abstract

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work has been reported on applying various NLP techniques to person name disambiguation in large web document collections, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods rely mainly on well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of these techniques. We also carried out some initial experiments with techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of online news articles, revealed that achieving lemmatization accuracy greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6-99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem in other highly inflectional languages with similar phenomena.
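
To make the idea of combining a string distance metric with suffix-based patterns more concrete, here is a minimal Python sketch. It is not the authors' implementation. The suffix rules are a handful of hand-written, hypothetical examples (the article acquires its patterns automatically), normalized Levenshtein similarity stands in for the metrics the article evaluates, and the names `SUFFIX_RULES`, `strip_suffix`, and `best_match` are assumptions made for illustration.

```python
# Hypothetical, hand-written suffix rules (inflected ending -> base ending).
# The article acquires such patterns automatically; these are illustrative only.
SUFFIX_RULES = [
    ("skiego", "ski"),  # e.g. Kowalskiego -> Kowalski
    ("skiemu", "ski"),  # e.g. Kowalskiemu -> Kowalski
    ("skim", "ski"),    # e.g. Kowalskim   -> Kowalski
    ("owi", ""),        # e.g. Janowi      -> Jan
    ("em", ""),         # e.g. Janem       -> Jan
    ("a", ""),          # e.g. Jana        -> Jan (over-strips names such as Anna)
]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def strip_suffix(token: str) -> str:
    """Apply the longest matching rule; return the token unchanged if none match."""
    for inflected, base in sorted(SUFFIX_RULES, key=lambda r: len(r[0]), reverse=True):
        if token.endswith(inflected) and len(token) > len(inflected):
            return token[: len(token) - len(inflected)] + base
    return token


def normalize(name: str) -> str:
    """Lowercase the name and strip a suffix from each token."""
    return " ".join(strip_suffix(token) for token in name.lower().split())


def best_match(inflected_name: str, base_forms: list[str]) -> str:
    """Pick the base form most similar to the raw or suffix-stripped name."""
    raw = inflected_name.lower()
    stripped = normalize(inflected_name)
    return max(
        base_forms,
        key=lambda base: max(similarity(raw, base.lower()),
                             similarity(stripped, base.lower())),
    )


candidates = ["Jan Kowalski", "Janusz Kowalczyk"]
print(best_match("Jana Kowalskiego", candidates))  # Jan Kowalski
```

Scoring each candidate against both the raw and the suffix-stripped form is a design choice of this sketch: it keeps an over-eager rule (such as stripping the final "a" from a name that is already in base form) from lowering the score of an exact match.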