Computer Science On-line Conference

The Comparison of Effects of Relevant-Feature Selection Algorithms on Certain Social-Network Text-Mining Viewpoints


Abstract

This research addresses a well-known problem in text mining: the high computational complexity caused by the many irrelevant features (terms, words), which act as noise from the classification point of view and drive time and memory requirements non-linearly. Using a set of real-world textual documents, freely written in English and labelled by sentiment, drawn from three selected and extensively tracked Internet sources, a group of available relevant-feature selection algorithms (Gain Ratio, Chi Square, Info Gain, Symmetrical Uncertainty, Winnow, One R, Relief F, Principal Components, SVM, LSA) was tested on 10,000, 25,000, and 50,000 social-network entries. All the algorithms produced very similar sets of relevant features; typically, only the significance rank of the features differed slightly. Except for a few slower algorithms, the term-preselection time ranged from seconds and minutes to a couple of hours. After using only the relevant fraction of the features instead of all of them, the entry length decreased by several orders of magnitude, particularly for the larger data sets with very high dimensionality. Despite this extremely strong reduction in the number of words, the classification accuracy remained the same regardless of which relevant-feature selection algorithm was chosen.
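As an illustrative sketch only (the paper does not disclose its implementation), one of the listed scoring criteria, Info Gain, can be computed for each term as IG(t) = H(C) - H(C|t), where C is the class label and t indicates the term's presence in a document. The toy corpus, labels, and function names below are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(docs, labels, term):
    """IG(term) = H(C) - H(C | term present/absent)."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_t) / n) * entropy(with_t) \
                + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional

# Tiny hypothetical sentiment-labelled corpus; each document is a set of terms.
docs = [
    {"great", "movie", "loved"},      # positive
    {"loved", "the", "acting"},       # positive
    {"terrible", "movie", "boring"},  # negative
    {"boring", "and", "terrible"},    # negative
]
labels = ["pos", "pos", "neg", "neg"]

# Rank the vocabulary by Info Gain and keep only the top-k relevant terms.
vocab = set().union(*docs)
ranked = sorted(vocab, key=lambda t: info_gain(docs, labels, t), reverse=True)
top_k = ranked[:3]
```

Discriminative terms such as "loved" reach the maximum gain of 1 bit here, while terms spread evenly across both classes (e.g. "movie") score 0; keeping only the top-ranked fraction of the vocabulary is what shortens the entries so drastically in the study.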
