Journal: Engineering Applications of Artificial Intelligence

Nonlinear transformation of term frequencies for term weighting in text categorization


Abstract

In automatic text categorization, the influence of each feature on the decision is set by its term weight, conventionally computed as the product of a term frequency factor and a collection frequency factor. Raw term frequencies or their logarithms are generally used as the term frequency factor, whereas the leading collection frequency factors take into account the document frequency of each term. This study first shows that the best-fitting form of the term frequency factor depends on the distribution of term frequency values in the dataset under consideration. Building on this observation, a novel collection frequency factor that considers term frequencies is proposed. Five datasets are first examined to show that the distribution of term frequency values is task dependent. The proposed method is then shown to provide better F_1 scores than two recent approaches on the majority of the datasets considered. The results confirm that using term frequencies in the collection frequency factor is beneficial on tasks that do not involve highly repeated terms. The best F_1 scores are also achieved on the majority of the datasets when a smaller number of features is considered.
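
The conventional scheme the abstract starts from (term weight = term frequency factor × collection frequency factor) can be illustrated with a minimal sketch. It assumes inverse document frequency as the collection frequency factor and 1 + ln(tf) as the logarithmic form; both are common choices but are assumptions here, and the function names and toy corpus are hypothetical. The paper's proposed collection frequency factor is not specified in the abstract, so it is not reproduced.

```python
import math
from collections import Counter

def tf_raw(tf):
    # Raw term frequency factor: the count itself.
    return float(tf)

def tf_log(tf):
    # Logarithmic term frequency factor (assumed form: 1 + ln(tf)).
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def idf(term, docs):
    # Assumed collection frequency factor: ln(N / document frequency).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def term_weights(doc, docs, tf_factor):
    # Weight of each term in `doc`: tf factor times collection factor.
    counts = Counter(doc)
    return {t: tf_factor(c) * idf(t, docs) for t, c in counts.items()}

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["price", "price", "price", "market"],
    ["market", "report"],
    ["price", "report", "report"],
]

raw = term_weights(docs[0], docs, tf_raw)
log = term_weights(docs[0], docs, tf_log)
print(raw)
print(log)
```

In this toy corpus "price" and "market" have the same document frequency, so the only difference between their weights comes from the term frequency factor: the raw form weights the repeated term three times as heavily, while the logarithmic form dampens that ratio. This is the kind of dataset-dependent trade-off the abstract refers to.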
