Conference: IEEE International Conference on Artificial Intelligence and Computer Applications

Chinese Sentence Pattern Feature Extraction Based on Massive Data Analysis

Abstract

In the era of Data Technology, data is characterized by huge scale, diverse modalities, and rapid growth, and the value of Chinese-language corpora has multiplied accordingly. Building on the Language Technology Platform (LTP), a Chinese language processing system, this paper uses data mining and machine learning to extract and apply Chinese sentence features, which offers a new perspective and entry point for Chinese information processing. Dependency grammar is selected for sentence pattern analysis, and a text representation model consisting of sequences and vectors is established. A specialized “Chinese Sentence Pattern Retrieve Library” containing 1,032,480 sentences and 92,451 kinds of sentence patterns is built to provide a sentence pattern database service for further studies of special sentence patterns. On the basis of this database, statistics and a preliminary analysis are carried out on the sentence patterns of articles in different genres. The analysis finds that Chinese has about 2,000 core sentence patterns and that commonly used patterns are relatively concentrated: the 10 most frequent sentence patterns account for 50% of usage. The proportion of some sentence patterns in certain articles is much higher or lower than in other articles. These results provide a basis for building sentence pattern feature vectors for articles and for later feature extraction and application.
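The abstract does not say how a "sentence pattern" is derived from a dependency parse or how the concentration figure is computed, so the following is only an illustrative sketch under stated assumptions. It assumes a pattern is the ordered sequence of relation labels on the root word and its direct dependents, and it uses hand-written toy parses with LTP-style labels (SBV, VOB, ADV, HED) rather than real LTP output. The function names `sentence_pattern` and `top_k_coverage` are hypothetical and do not come from the paper.

```python
from collections import Counter

# Toy, hand-written dependency parses (an assumption, not LTP output).
# Each token is (word, head_index, relation); head_index 0 marks the root,
# and token positions are 1-based.
parses = [
    [("我", 2, "SBV"), ("喜欢", 0, "HED"), ("苹果", 2, "VOB")],
    [("他", 2, "SBV"), ("读", 0, "HED"), ("书", 2, "VOB")],
    [("她", 3, "SBV"), ("很", 3, "ADV"), ("高兴", 0, "HED")],
    [("雨", 2, "SBV"), ("停", 0, "HED")],
]


def sentence_pattern(parse):
    """Return the relation labels of the root and its direct dependents, in word order.

    This pattern definition is an assumption made for illustration only.
    """
    root = next(i for i, (_, head, _) in enumerate(parse, start=1) if head == 0)
    return tuple(
        rel
        for i, (_, head, rel) in enumerate(parse, start=1)
        if i == root or head == root
    )


def top_k_coverage(patterns, k):
    """Fraction of all sentences covered by the k most frequent patterns."""
    counts = Counter(patterns)
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


patterns = [sentence_pattern(p) for p in parses]
print(Counter(patterns).most_common(1))  # [(('SBV', 'HED', 'VOB'), 2)]
print(top_k_coverage(patterns, 1))       # 0.5
```

On a real corpus, the same coverage calculation with k = 10 is the kind of statistic behind the abstract's concentration finding, although the paper's exact procedure may differ.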