Unsupervised Feature Selection for Text Data

Abstract

Feature selection for unsupervised tasks is particularly challenging, especially when dealing with text data. The growth in online documents and email communication creates a need for tools that can operate without user supervision. In this paper we look at novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for features that are both representative and diverse, in two complementary ways: CLUSTER partitions the entire feature space and then selects one feature to represent each cluster, while GREEDY grows the feature subset one greedily selected feature at a time. In particular, we found that GREEDY's local search is suited to learning smaller feature subsets, while CLUSTER is able to improve the global quality of larger feature sets. Experiments with four email data sets show a significant improvement in retrieval accuracy with nearest-neighbour-based search methods compared to an existing frequency-based method. Importantly, both GREEDY and CLUSTER make significant progress towards the upper-bound performance set by a standard supervised feature selection method.
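The two search strategies can be illustrated with a minimal sketch. This is not the paper's implementation; the abstract does not name the similarity measure or the exact selection criteria, so everything below is an assumption chosen for illustration. Jensen-Shannon divergence stands in for the distributional measure, the greedy step is a farthest-first rule, the clustering step is a small k-medoids loop, and the term counts are invented toy data. The function names (`greedy_select`, `cluster_select`) are hypothetical.

```python
import math

def normalize(counts):
    """Turn a row of raw term counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed measure): 0 = identical, 1 = disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def greedy_select(dists, k):
    """Assumed greedy rule: grow the subset one feature at a time."""
    n = len(dists)
    # Seed with the most "representative" feature: smallest total divergence.
    seed = min(range(n),
               key=lambda i: sum(js_divergence(dists[i], dists[j]) for j in range(n)))
    selected = [seed]
    while len(selected) < k:
        rest = [i for i in range(n) if i not in selected]
        # Add the most "diverse" feature: farthest from its nearest selected one.
        selected.append(max(
            rest,
            key=lambda i: min(js_divergence(dists[i], dists[s]) for s in selected)))
    return selected

def cluster_select(dists, k, iters=10):
    """Assumed clustering rule: k-medoids over features, one medoid per cluster."""
    n = len(dists)
    medoids = greedy_select(dists, k)  # initial cluster representatives
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: js_divergence(dists[i], dists[m]))
            clusters[nearest].append(i)
        # Re-pick each representative as the member closest to its own cluster.
        # Empty clusters (possible with duplicate distributions) are dropped.
        new_medoids = [
            min(members,
                key=lambda c: sum(js_divergence(dists[c], dists[j]) for j in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids

# Invented toy data: rows are terms, columns are counts in four emails.
counts = {
    "meeting": [5, 3, 0, 0],
    "agenda":  [2, 4, 0, 0],
    "invoice": [0, 0, 6, 1],
    "payment": [0, 0, 3, 3],
}
names = list(counts)
dists = [normalize(counts[t]) for t in names]

greedy_pick = [names[i] for i in greedy_select(dists, 2)]
cluster_pick = [names[i] for i in cluster_select(dists, 2)]
print("GREEDY-style pick: ", greedy_pick)
print("CLUSTER-style pick:", cluster_pick)
```

On this toy data both strategies keep one scheduling term and one billing term, since those two groups never co-occur in the same email. The sketch does not reproduce the behaviour reported in the abstract for small versus large subset sizes; that depends on the paper's actual measure and data.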