Journal: IEEE Transactions on Knowledge and Data Engineering

Discovering Data Set Nature through Algorithmic Clustering Based on String Compression

Abstract

Text data sets can be represented using models that do not preserve text structure, or using models that preserve text structure. Our hypothesis is that, depending on the nature of the data set, there can be advantages to using a model that preserves text structure over one that does not, and vice versa. The key is to determine the best way of representing a particular data set, based on the data set itself. In this work, we propose to investigate this problem by combining text distortion and algorithmic clustering based on string compression. Specifically, a distortion technique previously developed by the authors is applied to destroy text structure progressively. Following this, a clustering algorithm based on string compression is used to analyze the effects of the distortion on the information contained in the texts. Several experiments are carried out on text data sets and artificially generated data sets. The results show that in strongly structural data sets the clustering results worsen as text structure is progressively destroyed. They also show that using a compressor that allows the number of left-context symbols to be chosen helps to determine the nature of the data sets. Finally, the results are contrasted with a method based on multidimensional projections, and analogous conclusions are obtained.
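The pipeline described in the abstract (distort the texts, then compare them with a compression-based measure) can be illustrated with a minimal sketch. The abstract does not name the distance or the compressor, so the code below assumes the normalized compression distance (NCD), a common choice for compression-based clustering, computed with `zlib`. The `shuffle_words` function is a hypothetical stand-in for the authors' distortion technique, which the abstract does not specify; it only shows one way of destroying word order progressively.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest similar
    inputs; values near 1 suggest unrelated inputs.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def shuffle_words(text: str, fraction: float, seed: int = 0) -> str:
    """Hypothetical distortion: permute a given fraction of word positions.

    This is NOT the authors' technique; it only illustrates progressive
    destruction of text structure while keeping the same bag of words.
    """
    rng = random.Random(seed)
    words = text.split()
    count = int(len(words) * fraction)
    positions = rng.sample(range(len(words)), count)
    targets = positions[:]
    rng.shuffle(targets)
    result = words[:]
    for src, dst in zip(positions, targets):
        result[dst] = words[src]
    return " ".join(result)


def distance_matrix(texts: list[str]) -> list[list[float]]:
    """Pairwise NCD matrix, usable as input to a clustering method."""
    encoded = [t.encode("utf-8") for t in texts]
    return [[ncd(a, b) for b in encoded] for a in encoded]
```

One way to use the sketch is to compute `distance_matrix` on a collection at increasing values of `fraction` and compare the resulting clusterings. Under the abstract's hypothesis, a strongly structural data set would show clustering quality degrading as `fraction` grows.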
