An Estimation Method of the Words Tendency Based on Time-Series Variation

Abstract

Electronic texts are increasingly abundant, and computers are increasingly used to process them. The frequencies of the words in these texts change over time. Such words are frequently regarded as keywords because they are strongly related to the subject of the texts. However, traditional document processing systems do not consider the time-series information of words when calculating their importance. This paper presents a method for estimating word trends that takes time-series variation into account. First, we give an example showing how the ranking produced by a similar-text retrieval system changes between traditional methods and the proposed trend-based method. Using a decision tree, the proposed method classifies words into three stability classes: increasing, constant, and decreasing. The classification is learned from five attribute values of each word: the slope and intercept of the regression line, the correlation coefficient, the angle between two regression lines, and special noun attributes. The learned tree is then used to estimate the class of new words. These attribute values are defined to measure the frequency change of each word quantitatively, and we find that they are effective in terms of recall and precision. In the evaluation, we obtained the attribute values of 1,069 proper nouns extracted from 8,216 CNN news articles (1997-1999) about professional baseball; this data is called the Learning-Data. After the decision tree was trained on these attribute values, 696 proper nouns extracted from 1,272 CNN news articles (2000) were classified; this data is called the Test-Data. Comparing the decision tree results with human evaluation, the F-measures for the increasing, constant, and decreasing classes are 0.847, 0.851, and 0.768, respectively.
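
The regression-based attributes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two regression lines are chosen, so the sketch assumes one line is fitted to the first half of a word's frequency series and one to the second half. The function names `linear_fit` and `trend_attributes` are hypothetical, and the special noun attributes are omitted.

```python
import math


def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept, correlation coefficient).
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A flat series has no variance, so the correlation is reported as 0.
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r


def trend_attributes(freqs):
    """Regression attributes for one word's per-period frequency counts.

    Assumption (not stated in the abstract): the angle is measured between
    the line fitted to the first half and the line fitted to the second half.
    """
    if len(freqs) < 4:
        raise ValueError("need at least 4 periods to fit two half-lines")
    half = len(freqs) // 2
    slope, intercept, r = linear_fit(freqs)
    early_slope, _, _ = linear_fit(freqs[:half])
    late_slope, _, _ = linear_fit(freqs[half:])
    angle = math.degrees(math.atan(late_slope) - math.atan(early_slope))
    return {
        "slope": slope,
        "intercept": intercept,
        "correlation": r,
        "angle_deg": angle,
    }


if __name__ == "__main__":
    # Hypothetical counts of one proper noun over eight periods.
    print(trend_attributes([0, 0, 0, 0, 2, 4, 6, 8]))
```

Under these assumptions, a positive angle means the word's frequency is rising faster in the later periods than in the earlier ones. Attribute vectors of this kind, labeled increasing, constant, or decreasing, are what a decision tree learner would take as training input.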