Incorporating background knowledge in document clustering.

Abstract

The explosive growth of unstructured text data in the present digital age has triggered strong interest in robust and scalable document clustering techniques that can automatically partition and summarize large collections of documents. Because document clustering is an unsupervised learning task, the quality of the partitions may be suboptimal due to the lack of guidance about which documents belong together in the same cluster. Augmenting the clustering algorithm with additional side information may lead to better clusters. To this end, this thesis focuses on the use of background knowledge from an ontology such as WordNet to enhance the performance of document clustering algorithms. Numerous challenges must be overcome for such an approach to succeed. The most notable is how to effectively map the original words in the documents to their corresponding concepts in an ontology. The concept mapping strategy is important because it may increase the dimensionality of the data or introduce erroneous concepts, both of which adversely affect the quality of the final partitions. The choice of ontology is another factor to take into consideration, since each ontology has its own structure, coverage, and content. Despite these challenges, a considerable amount of research on ontology-driven clustering has been done over the past decade, yet the results of previous studies have not been conclusive: some concluded that an ontology helps improve clustering performance, while others showed that it is of little help. This thesis investigates the various factors that affect the performance of such clustering algorithms, including the choice of ontology, the concept mapping approach, and the benchmark datasets and baseline algorithms used for evaluation.
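The effect of the concept mapping strategy on dimensionality can be illustrated with a small sketch. Everything here is an assumption made for illustration: `TOY_ONTOLOGY` is a hypothetical hand-written dictionary standing in for a real ontology such as WordNet, and the two strategies ("all" candidate concepts versus the "first" one only) are generic options, not necessarily the mappings studied in the thesis.

```python
from collections import Counter

# Hypothetical toy ontology (an assumption, not WordNet data):
# each word maps to one or more candidate concepts.
TOY_ONTOLOGY = {
    "bank": ["financial_institution", "river_bank"],
    "loan": ["debt"],
    "river": ["stream"],
    "money": ["currency"],
}


def map_to_concepts(tokens, ontology, strategy="all"):
    """Return a bag of concepts for a tokenized document.

    strategy="all"   adds every candidate concept of each word.
    strategy="first" keeps only the first-listed candidate.
    Words missing from the ontology are dropped.
    """
    bag = Counter()
    for token in tokens:
        candidates = ontology.get(token, [])
        if strategy == "first":
            candidates = candidates[:1]
        bag.update(candidates)
    return bag


doc = ["bank", "loan", "money", "bank"]
all_concepts = map_to_concepts(doc, TOY_ONTOLOGY, "all")
first_concepts = map_to_concepts(doc, TOY_ONTOLOGY, "first")
```

In this toy document about finance, the "all" strategy yields four concept features, including the erroneous `river_bank`, while the "first" strategy yields three. This is the kind of trade-off between coverage, dimensionality, and noise that the abstract describes.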
The contributions of this thesis are as follows. First, a noun-based approach is proposed as a simple but more stringent baseline for clustering. Second, a novel unsupervised information gain approach is developed for extracting a core subset of semantic features from an ontology that can be used effectively for clustering. Third, a hybrid ontology-driven ensemble clustering method is proposed that combines clusters of nouns with clusters of concepts extracted from an ontology. Finally, an approach for extracting concepts from Wikipedia is proposed and compared against existing work. These concepts are then used in conjunction with the concepts (synsets) from WordNet to study the effect of applying multiple ontologies to document clustering.
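One common way to combine two clusterings of the same documents is a co-association (consensus) matrix. The sketch below is only an illustration of that general technique, assuming one partition obtained from noun features and one from concept features. The function names, the threshold, and the example labels are all hypothetical, and the abstract does not state that the thesis uses this particular combination rule.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of documents shares a cluster."""
    n = len(partitions[0])
    votes = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1
    return [[v / len(partitions) for v in row] for row in votes]


def consensus_clusters(matrix, threshold=0.5):
    """Link documents whose co-association exceeds the threshold (union-find)."""
    n = len(matrix)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] > threshold:
                parent[find(j)] = find(i)

    # Relabel the component roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Hypothetical cluster labels for five documents (illustrative only).
noun_labels = [0, 0, 1, 1, 1]
concept_labels = [0, 0, 0, 1, 1]

matrix = co_association([noun_labels, concept_labels])
combined = consensus_clusters(matrix, threshold=0.5)
```

With two input partitions and a threshold of 0.5, two documents end up together only when both partitions agree. Here the third document, on which the noun and concept clusterings disagree, is left in a cluster of its own.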

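Bibliographic record

  • Author

    Fodeh, Samah Jamal

  • Author affiliation

    Michigan State University

  • Degree-granting institution: Michigan State University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 134 p.
  • Total pages: 134
  • Original format: PDF
  • Language of text: English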