Nationality Identification from Names Using N-gram Based Cumulative Frequency Addition

Abstract

This paper describes two N-gram based classification methods, Cumulative Frequency Addition and Naive Bayesian, that identify a person's nationality, or at least the nationality group, from his or her name. Language classification using N-gram based methods has been shown to be highly accurate and insensitive to typographical errors, and as a result these methods have been extensively researched and documented in the natural language processing literature. However, there has been little research on using names to identify nationality, especially with N-gram based linguistic features. The two classifiers described here accomplish that goal efficiently on name data from the 2004 Olympic Games. Although the two are similar in speed and accuracy, the novel Cumulative Frequency Addition method is somewhat simpler than the more conventional Naive Bayesian method. We obtained accuracies of 86% on a 14-country database and 96% on a 7-country database within the top 3 choices, which we argue is sufficient for applications such as text-to-speech systems, where it can significantly improve name pronunciation.
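The abstract does not give the paper's N-gram sizes, normalization, or smoothing, so the following is only a minimal sketch of the general idea under stated assumptions: character bigrams and trigrams, per-country relative frequencies, a score that sums those frequencies (cumulative frequency addition), and a Naive Bayes style score that sums their logarithms instead. The function names, the padding character, the `floor` value, and the tiny `TOY_NAMES` table are all hypothetical and are not taken from the paper or its Olympic data.

```python
import math
from collections import Counter


def char_ngrams(name, n_values=(2, 3)):
    """Character N-grams of a lowercased name, padded with '_' (assumed sizes)."""
    padded = "_" + name.lower() + "_"
    return [padded[i:i + n] for n in n_values for i in range(len(padded) - n + 1)]


def train(names_by_country):
    """Per-country table mapping each N-gram to its relative frequency."""
    tables = {}
    for country, names in names_by_country.items():
        counts = Counter(g for name in names for g in char_ngrams(name))
        total = sum(counts.values())
        tables[country] = {g: c / total for g, c in counts.items()}
    return tables


def rank_cfa(name, tables, top_k=3):
    """Cumulative frequency addition: add up the frequencies of the name's N-grams."""
    grams = char_ngrams(name)
    scores = {c: sum(t.get(g, 0.0) for g in grams) for c, t in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def rank_naive_bayes(name, tables, top_k=3, floor=1e-6):
    """Naive Bayes style score: sum of log frequencies; unseen N-grams get `floor`."""
    grams = char_ngrams(name)
    scores = {c: sum(math.log(t.get(g, floor)) for g in grams) for c, t in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy training data, for illustration only.
TOY_NAMES = {
    "Italy": ["Rossi", "Bianchi", "Romano", "Ricci"],
    "Japan": ["Tanaka", "Suzuki", "Yamada", "Nakamura"],
    "Poland": ["Kowalski", "Nowak", "Wisniewski", "Lewandowski"],
}

tables = train(TOY_NAMES)
print(rank_cfa("Kowalczyk", tables)[0])          # Poland
print(rank_naive_bayes("Kowalczyk", tables)[0])  # Poland
```

Returning the top `top_k` candidates rather than a single label mirrors the abstract's "within the top 3 choices" evaluation. The addition-based score needs no handling for unseen N-grams, whereas the log-based score needs a floor, which is one plausible reading of why the abstract calls the former somewhat simpler.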

Bibliographic record

Source
World Multiconference on Systemics, Cybernetics and Informatics, 2005, 6 pages
Authors
Bashir Ahmed; Sung-Hyuk Cha; Charles Tappert
Original format: PDF
Chinese Library Classification: N945-53
Keywords
Nationality Identification; Language Identification; Cumulative Frequency Addition; Naive Bayesian Classification; Information Retrieval; Machine Translation

Similar literature

1. Age of acquisition persists as the main factor in picture naming when cumulative word frequency and frequency trajectory are controlled [J]. Perez MA. The Quarterly Journal of Experimental Psychology: QJEP. 2007, No. 1
2. Algorithmically generated malicious domain names detection based on n-grams features [J]. Cucchiarelli Alessandro, Morbidoni Christian, Spalazzi Luca. Expert Systems with Applications. 2021, May issue
3. Malicious Domain Names Detection Algorithm Based on N-Gram [J]. Hong Zhao, Zhaobin Chang, Guangbin Bao. Journal of Computer Networks and Communications. 2019, No. 1
4. Nationality Identification from Names Using N-gram Based Cumulative Frequency Addition [C]. Bashir Ahmed, Sung-Hyuk Cha, Charles Tappert. World Multiconference on Systemics, Cybernetics and Informatics. 2005
5. A Channel Capacity Based Attack to Quantify the Security of N-Gram Based Anomaly Detection Approaches [D]. Shanahan, Nicholas. 2017
6. Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets [O]. Dario Borrelli, Gabriela Gongora Svartzman, Carlo Lipizzi. 2020
7. Business Process Models Clustering Based on Multimodal Search, K-means, and Cumulative and No-Continuous N-Grams [O]. Hugo Ordoñez, Luis Merchán, Armando Ordoñez. 2016
