Conference: World Multiconference on Systemics, Cybernetics and Informatics

Nationality Identification from Names Using N-gram Based Cumulative Frequency Addition


Abstract

This paper describes two N-gram based classification methods, Cumulative Frequency Addition and Naive Bayesian, that identify a person's nationality, or at least the nationality group, from his or her name. Language classification using N-gram based methods has been shown to be highly accurate and insensitive to typographical errors, and, as a result, these methods have been extensively researched and documented in the natural language processing literature. However, there has been little research on using names to identify nationality, especially using N-gram based linguistic features. The two classifiers described here accomplish that goal efficiently on name data from the 2004 Olympic Games. Although the two methods are similar in speed and accuracy, the novel Cumulative Frequency Addition method is somewhat simpler than the more conventional Naive Bayesian method. We obtained accuracies of 86% on a 14-country database and 96% on a 7-country database within the top 3 choices, which we argue is sufficient for applications such as text-to-speech systems, where it can significantly improve name pronunciation.
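
The abstract names the two classifiers but does not spell out their scoring rules. The following is a minimal sketch, not the authors' implementation. It assumes that Cumulative Frequency Addition ranks each class by the sum of the relative frequencies of the name's character N-grams in that class, and that the Naive Bayesian variant sums log frequencies instead. The function names, the boundary padding with `_`, the N-gram lengths 1 to 3, the `floor` value for unseen N-grams, and the toy training names are all illustrative assumptions; none of them come from the paper or its Olympic data.

```python
from collections import Counter
import math


def ngrams(name, n_max=3):
    """Character n-grams of lengths 1..n_max, with '_' marking name boundaries."""
    padded = f"_{name.lower()}_"
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


def train(names_by_label, n_max=3):
    """Build, for each label, a table of relative n-gram frequencies."""
    model = {}
    for label, names in names_by_label.items():
        counts = Counter(g for name in names for g in ngrams(name, n_max))
        total = sum(counts.values())
        model[label] = {g: c / total for g, c in counts.items()}
    return model


def rank_cfa(model, name, n_max=3):
    """Assumed CFA rule: add up the class frequencies of the name's n-grams."""
    grams = ngrams(name, n_max)
    scores = {label: sum(freqs.get(g, 0.0) for g in grams)
              for label, freqs in model.items()}
    return sorted(scores, key=scores.get, reverse=True)


def rank_nb(model, name, n_max=3, floor=1e-6):
    """Naive Bayes style rule: add up log frequencies (unseen n-grams get `floor`)."""
    grams = ngrams(name, n_max)
    scores = {label: sum(math.log(freqs.get(g, floor)) for g in grams)
              for label, freqs in model.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data, only to make the sketch runnable.
toy_data = {
    "IT": ["rossi", "bianchi", "ricci"],
    "JP": ["tanaka", "suzuki", "sato"],
}
model = train(toy_data)

# Both functions return labels ordered best-first, so a "top 3 choices"
# evaluation would check whether the true label is in ranking[:3].
print(rank_cfa(model, "rossini"))
print(rank_nb(model, "tanabe"))
```

Under these assumptions, the only difference between the two rankers is addition of raw frequencies versus addition of logarithms. The additive rule needs no smoothing for unseen N-grams, which is one way to read the abstract's remark that Cumulative Frequency Addition is somewhat simpler.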
