Phoneme inventory, trigrams and geographic location as features for clustering different Philippine languages

Published in: 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques


This paper explores orthographic, geographic, and phonetic features for clustering 32 Philippine languages and identifying closely related ones. For the orthographic data, religious text documents were collected online, and 100,000 words per language were used as training data; these words were cleaned and trigram profiles were generated from them. The geographic feature is the location where each language is spoken, and the phonetic feature is each language's phoneme inventory. The languages were clustered with two algorithms, hierarchical clustering and k-means, and purity was used as the evaluation metric to validate the resulting clusters. For both algorithms, the highest purity value of any single cluster is 0.67, which indicates that the members of that cluster share similar attributes. As future work, semantic features can be added to improve the data set, and additional languages can be considered.
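
To make two of the steps named in the abstract concrete, the following is a minimal Python sketch of a character trigram profile and of cluster purity. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the words were cleaned, how trigrams were weighted, or which reference grouping purity was measured against. The function names (`trigram_profile`, `cluster_purities`, `overall_purity`), the relative-frequency weighting, and the toy labels are all hypothetical.

```python
from collections import Counter


def trigram_profile(words):
    """Relative frequency of each character trigram across a word list.

    Assumption: lowercasing is the only cleaning step applied here;
    the paper's actual cleaning procedure is not described in the abstract.
    """
    counts = Counter()
    for word in words:
        w = word.lower()
        for i in range(len(w) - 2):
            counts[w[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tri: n / total for tri, n in counts.items()}


def _label_counts_by_cluster(cluster_ids, reference_labels):
    """Group reference labels by the cluster each item was assigned to."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, reference_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    return by_cluster


def cluster_purities(cluster_ids, reference_labels):
    """Per-cluster purity: share of members carrying the cluster's majority label."""
    by_cluster = _label_counts_by_cluster(cluster_ids, reference_labels)
    return {
        cid: counts.most_common(1)[0][1] / sum(counts.values())
        for cid, counts in by_cluster.items()
    }


def overall_purity(cluster_ids, reference_labels):
    """Overall purity: majority-label members summed over clusters, divided by N."""
    by_cluster = _label_counts_by_cluster(cluster_ids, reference_labels)
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(reference_labels)


# Toy data (hypothetical): two words, and five items in two clusters.
profile = trigram_profile(["salamat", "mata"])
assignments = [0, 0, 0, 1, 1]             # cluster id per item
reference = ["x", "x", "y", "y", "y"]     # hypothetical reference groups
per_cluster = cluster_purities(assignments, reference)
total_purity = overall_purity(assignments, reference)
```

In this toy run, cluster 0 holds two items of group `x` and one of group `y`, so its purity is 2/3, while cluster 1 is pure. A per-cluster figure such as the 0.67 reported in the abstract is read the same way: about two thirds of that cluster's members share the majority reference label.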


