Journal: Natural Language Engineering

Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora

Abstract

This paper proposes a robust text classification and correspondence analysis approach to the identification of similar languages. In particular, we propose to use readily available information on clauses and word length distribution to model similar languages. The modeling and classification are based on the hypothesis that languages are self-adaptive complex systems and hence can be classified by dynamic features describing the system, especially the distributional relations among the constituents of the system. For similar languages, whose grammatical differences are often subtle, classification based on dynamic system features should be more effective. To test this hypothesis, we considered both regional and genre varieties of Mandarin Chinese for classification. The data are extracted from two comparable balanced corpora to minimize possible confounding factors. The two corpora are the Sinica Corpus from Taiwan and the Lancaster Corpus of Mandarin Chinese from Mainland China, and the two genres are reportage and review. Our text classification and correspondence analysis results show that the linguistically felicitous two-level constituency model, which combines power functions between words and clauses, effectively classifies the two varieties of Chinese for both genres. In addition, we found that genres do have a compounding effect on the classification of regional varieties. In particular, reportage in the two varieties is more likely to be classified than review, corroborating the complex system view of language variation. That is, language variation and change typically do not take place evenly across the complete language system. This further supports our hypothesis that dynamic complex system features, such as the power functions captured by the Menzerath-Altmann law, provide effective models for the classification of similar languages.
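To make the phrase "power functions captured by the Menzerath-Altmann law" concrete, here is a minimal sketch, assuming the commonly used truncated form of the law, y = a * x^b, where x is the length of a construct (for example, a clause measured in words) and y is the mean length of its constituents (for example, words measured in characters). The function name `fit_power_law` and the sample values are hypothetical: the data are synthetic, generated from known parameters, and are not taken from the paper or its corpora. The abstract does not say how the authors estimate their parameters, so the log-log least-squares fit below is only one plausible choice.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).

    Assumption: the truncated Menzerath-Altmann form without the
    exponential term. All x and y values must be positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log a
    return a, b


# Synthetic illustration only: points generated from y = 2.0 * x**-0.1.
clause_lengths = [2, 4, 6, 8, 10]
mean_word_lengths = [2.0 * x ** -0.1 for x in clause_lengths]
a, b = fit_power_law(clause_lengths, mean_word_lengths)
```

In a setup like the one the abstract describes, fitted parameters such as `a` and `b` for each text could serve as features for a classifier. That use is an assumption here, not a detail confirmed by the abstract.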
