Conference: International Conference on Information Communication and Embedded Systems
Document grouping with concept based discriminative analysis and feature partition

Abstract

Clustering is one of the most important techniques in machine learning and data mining. Clustering techniques group similar documents, using a similarity measure to determine the associations between documents. Hierarchical clustering methods produce tree-structured results, while partition-based clustering models produce results in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods need the cluster count (K) before the document grouping process, and clustering accuracy decreases drastically when the cluster count is unsuitable. Document word features are automatically partitioned into two groups: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the contribution of non-discriminative words confuses the clustering process and leads to poor cluster solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents. The DPM clustering model utilizes both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and it runs without requiring the number of clusters as input. The discriminative word identification process is enhanced with a labeled document analysis mechanism. Concept relationships are analyzed with ontology support, and semantic weight analysis is used for the document similarity measure. The method improves scalability by using labels and concept relations for dimensionality reduction.
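
The clustering property of the Dirichlet Process mentioned in the abstract is commonly illustrated with the Chinese restaurant process, in which the number of clusters grows with the data instead of being fixed in advance. The sketch below is only an illustration of that prior. It is not the paper's DPMFP model or its variational inference, and it ignores the data likelihood entirely. The function name `crp_assignments` and the concentration parameter `alpha` are assumptions introduced here.

```python
import random
from collections import Counter


def crp_assignments(n_docs, alpha, seed=0):
    """Sample cluster labels for n_docs items from a Chinese restaurant process.

    alpha (> 0) is a hypothetical concentration parameter: larger values
    make opening a new cluster more likely.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of documents currently in cluster k
    for _ in range(n_docs):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = crp_assignments(n_docs=200, alpha=1.0)
    # The cluster count is an outcome of sampling, not an input.
    print(len(set(labels)), Counter(labels).most_common(3))
```

In a full DPM model these prior weights would be combined with a per-cluster likelihood of each document's words, which is the part this sketch leaves out.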