A study on Chinese register characteristics based on regression analysis and text clustering

Abstract

This paper reports an innovative Chinese register study based on regression analysis of sentence length distributions and on text clustering. Although the end of a sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances are clauses. Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula $F = aL^{b}c^{L}$, where $F$ is the frequency and $L$ is the sentence or clause length. Texts from different registers give rise to different fitted values of the parameters $a$, $b$, and $c$, so these values can serve to differentiate the registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful clustering results further support the claim that the fitted parameters are reliable linguistic characteristics of different registers. In terms of linguistic theory, our study shows that modeling sentence length in Chinese using sociological words (i.e., characters) is just as effective as using linguistic words.
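
The segmentation and fitting steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It reads the abstract's formula as F = a * L**b * c**L, measures length in characters, and fits the parameters by ordinary least squares on the log-linearized form ln F = ln a + b ln L + L ln c. The function names, the choice of pause-marking punctuation, the decision to count sentence-internal punctuation toward sentence length, and the log-linear fitting strategy are all assumptions made for the example.

```python
import math
import re
from collections import Counter

SENTENCE_END = "。？！"                  # periods, question marks, exclamation marks
PAUSE_MARKS = "，、；：" + SENTENCE_END  # assumption: which marks count as pauses


def segment_lengths(text, delimiters):
    """Split text at any delimiter; return each non-empty segment's length in characters."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [len(s.strip()) for s in re.split(pattern, text) if s.strip()]


def fit_length_distribution(freq):
    """Fit F = a * L**b * c**L to a {length: frequency} mapping.

    Solves the normal equations of ln F = ln a + b*ln L + L*ln c.
    Needs at least three distinct lengths, all with positive frequency.
    """
    rows = [(1.0, math.log(L), float(L)) for L in freq]
    ys = [math.log(freq[L]) for L in freq]
    # Augmented 3x4 system [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * p for x, p in zip(A[r], A[col])]
    ln_a, b, ln_c = (A[i][3] / A[i][i] for i in range(3))
    return math.exp(ln_a), b, math.exp(ln_c)


text = "今天天气很好，我们去公园吧。你去吗？"
sentence_lengths = segment_lengths(text, SENTENCE_END)
clause_lengths = segment_lengths(text, PAUSE_MARKS)
# With a real corpus (many distinct lengths), one would then call:
#   a, b, c = fit_length_distribution(Counter(clause_lengths))
```

Under these assumptions, each text yields one fitted triple (a, b, c) for sentences and one for clauses, and those triples could serve as the feature vectors passed to a clustering algorithm. The abstract does not say which clustering method or fitting procedure was used, so that step is left out here.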

Bibliographic record

  • Authors

    Hou R; Huang CR; Liu H;

  • Year: 2016
  • Original format: PDF
  • Language: eng
