International Journal of Data Science and Analytics

An effective approach for semantic-based clustering and topic-based ranking of web documents



Abstract

On the large, dynamic, and ever-expanding web, extracting the information a user query asks for is a significant problem for search engines. Clustering and ranking are two important techniques for addressing it. This study proposes a combined approach: semantic-based clustering and topic-based ranking of web documents. The proposed clustering approach combines latent semantic indexing (LSI) with the min-cut algorithm. To make the clustering more effective, a new feature selection method, called clustering-based feature selection, has been developed; it focuses on finding the feature set that captures the crux of the documents in the corpus without deteriorating the outcome of the construction process. While LSI completely overcomes the constraint of synonymy, the min-cut algorithm helps to generate efficient clusters at each stage of the clustering process. The number of clusters is decided using the silhouette coefficient, a measure that incorporates both the cohesion and the separation of clusters. To rank the documents in each semantic cluster, the proposed approach transforms the text into topics using latent Dirichlet allocation (LDA) and then applies inverted indexing to those topics. The 20-Newsgroups and DMOZ datasets are used for the experiments, and the results show that the clustering approach performs better than traditional clustering approaches and that the ranking approach is promising.
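The abstract describes the silhouette coefficient as combining cohesion and separation in order to choose the number of clusters. The sketch below illustrates that idea on toy data; it is not the paper's implementation. It assumes Euclidean distance, uses the common convention that a singleton cluster scores 0, and hard-codes two candidate labelings that, in the paper's pipeline, would instead come from the LSI and min-cut clustering step. The names `silhouette_score`, `points`, and `candidates` are hypothetical.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention (an assumption here): a singleton cluster scores 0.
            continue
        # Cohesion a(i): mean distance to the other members of i's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # Separation b(i): smallest mean distance to the members of another cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy 2-D "document vectors": two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Hypothetical candidate partitions for k = 2 and k = 3.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}

scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the two-cluster partition scores higher than the three-cluster one, so `best_k` is 2; with real document vectors the same comparison would be run over every candidate number of clusters.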
