Conference: International Conference on Recent Trends in Information Technology

Effect of multi-word features on the hierarchical clustering of web documents

Abstract

Contemporary search engines and other automated web tools must extract relevant information from huge web archives. This is considered a difficult task because web documents are semi-structured or unstructured. Users need automated ways of organizing and cataloging web documents so that they can be queried efficiently. Clustering is typically employed to organize web archives and subsequently to handle user queries. This paper analyzes the effect of including multi-word features on the performance of a hierarchical clustering algorithm. Noun sequences are the predominant features considered in our work, whereas most previous research uses n-grams as features. The paper also analyzes how combining link-based and content-based representations of web documents and their inter-relationships affects clustering performance. Empirical evaluation of the hierarchical clustering engine suggests that including multi-word features improves the precision of the hierarchical clustering algorithm.
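
The abstract does not describe the implementation, so the following is only a minimal sketch of the general idea: single-word counts are augmented with noun-sequence features, and the documents are then grouped by agglomerative (hierarchical) clustering. The tiny `NOUNS` lexicon (a stand-in for a real part-of-speech tagger), the sample documents, cosine similarity, and average linkage are all assumptions made for illustration, not details taken from the paper. The link-based representation is not covered here.

```python
import math
import re
from collections import Counter

# Hypothetical noun lexicon; a real system would use a part-of-speech tagger.
NOUNS = {"search", "engine", "web", "document",
         "football", "team", "league", "match"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def noun_sequences(tokens, nouns=NOUNS):
    """Return maximal runs of two or more consecutive nouns as phrases."""
    feats, run = [], []
    for tok in tokens + [""]:  # empty sentinel flushes the final run
        if tok in nouns:
            run.append(tok)
        else:
            if len(run) >= 2:
                feats.append(" ".join(run))
            run = []
    return feats

def featurize(text, use_multiword=True):
    """Bag of single words, optionally extended with noun sequences."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    if use_multiword:
        counts.update(noun_sequences(tokens))
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(vectors, k):
    """Average-link agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Illustrative documents (invented for this sketch).
docs = [
    "the search engine ranks each web document",
    "a web document is indexed by the search engine",
    "the football team won the league match",
    "the league match was played by the football team",
]
clusters = agglomerate([featurize(d) for d in docs], k=2)
print(clusters)  # [[0, 1], [2, 3]]
```

Calling `featurize(d, use_multiword=False)` gives the single-word baseline, so the two settings can be compared on a labeled collection. Whether multi-word features actually improve precision depends on the data and is the question the paper evaluates empirically.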
