Conference: 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Technique

Phoneme inventory, trigrams and geographic location as features for clustering different Philippine languages

Abstract

In this paper, orthographic, geographic, and phonetic features were explored to cluster 32 Philippine languages and identify closely related languages. For the orthographic data, we collected religious text documents online and used 100,000 words per language as training data. These words were cleaned and trigram profiles were generated. For the geographic feature, we used the location where each language is spoken. For the phonetic feature, we used the phoneme inventory of each language. The languages were clustered using two algorithms, hierarchical clustering and k-means. Purity was used as the evaluation metric to validate the resulting clusters. For both hierarchical clustering and k-means, the highest purity value of a cluster is 0.67, which indicates that the members of that cluster have similar attributes. As future work, semantic features can be added to improve the data set, and additional languages can be considered.
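
The abstract names two computable pieces, trigram profiles and cluster purity, without giving their details. The following is a minimal sketch of both, not the authors' implementation. It assumes character-level trigrams with `_` as a word-boundary marker, relative-frequency normalization, and purity measured against some reference label per language (for example a known subgroup); the abstract specifies none of these choices, and all function names are hypothetical.

```python
from collections import Counter


def trigram_profile(words):
    """Relative frequencies of character trigrams over cleaned words.

    Assumption: words are lowercased and padded with '_' on both sides
    so that word-initial and word-final trigrams are distinguishable.
    """
    counts = Counter()
    for word in words:
        padded = f"_{word.lower()}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: n / total for gram, n in counts.items()}


def cluster_purity(labels):
    """Purity of one cluster: share of members with the majority label.

    `labels` holds one reference label per cluster member (hypothetical
    ground truth; the abstract does not say which classes were used).
    """
    if not labels:
        raise ValueError("cluster is empty")
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def overall_purity(clusters):
    """Size-weighted purity across all non-empty clusters."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        raise ValueError("no cluster members")
    majority_sum = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority_sum / total
```

Under these assumptions, a cluster of three languages in which two share a reference label has purity 2/3, and the profiles returned by `trigram_profile` can serve as feature vectors for either clustering algorithm once they are aligned to a common trigram vocabulary.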