Conference: International Conference on Intelligent Text Processing and Computational Linguistics

Website Community Mining from Query Logs with Two-Phase Clustering


Abstract

A website community is a set of websites that concentrate on the same or similar topics. The website community mining task poses two major challenges. First, websites on the same topic may not link to each other directly because of competition concerns. Second, one website may contain information about several topics. Accordingly, a website community mining method should be able to capture both phenomena and assign such a website to several communities. In this paper, we propose a method to automatically mine website communities by exploiting the query log data from Web search. Query log data can be regarded as a comprehensive summary of the real Web. The queries that lead to clicks on a particular website can be regarded as a summary of that website's content. Websites on the same topic are indirectly connected by the queries that convey the information needs of that topic. This observation helps us overcome the first challenge. The proposed two-phase method tackles the second challenge. In the first phase, we cluster the queries of the same host to obtain the different content aspects of that host. In the second phase, we further cluster the content aspects obtained from different hosts. Because of the two-phase clustering, one host may appear in more than one website community.
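The two phases described above can be illustrated with a minimal Python sketch. The abstract does not specify the clustering algorithm, the similarity measure, or any thresholds, so everything concrete here is an assumption made for illustration: queries are treated as bags of lowercase tokens, similarity is Jaccard overlap, clustering is a greedy single pass, and the names `greedy_cluster`, `mine_communities`, `t1`, `t2`, and the example hosts are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(items, threshold):
    """Greedy single-pass clustering (an assumption, not the paper's method).

    Each item is (label, token_set). An item joins the first cluster whose
    accumulated token set is similar enough; otherwise it starts a new one.
    """
    clusters = []
    for label, tokens in items:
        for cluster in clusters:
            if jaccard(tokens, cluster["tokens"]) >= threshold:
                cluster["tokens"] |= tokens
                cluster["members"].append(label)
                break
        else:  # no existing cluster matched
            clusters.append({"tokens": set(tokens), "members": [label]})
    return clusters


def mine_communities(click_log, t1=0.2, t2=0.2):
    """click_log: iterable of (query, clicked_host) pairs."""
    # Phase 1: cluster the queries of each host into content aspects.
    queries_by_host = defaultdict(list)
    for query, host in click_log:
        queries_by_host[host].append((query, set(query.lower().split())))

    aspects = []
    for host, queries in queries_by_host.items():
        for cluster in greedy_cluster(queries, t1):
            aspects.append((host, cluster["tokens"]))

    # Phase 2: cluster the aspects of all hosts; each cluster is a community.
    communities = greedy_cluster(aspects, t2)
    return [sorted(set(c["members"])) for c in communities]


# Hypothetical click log: portal.example covers two topics.
example_log = [
    ("cheap flights paris", "portal.example"),
    ("flights paris deals", "portal.example"),
    ("laptop reviews", "portal.example"),
    ("best laptop reviews", "portal.example"),
    ("cheap flights deals", "air.example"),
    ("laptop reviews 2024", "shop.example"),
]
communities = mine_communities(example_log)
# [['air.example', 'portal.example'], ['portal.example', 'shop.example']]
```

In this toy log, `portal.example` yields two content aspects in the first phase (flights and laptops), so it ends up in both communities after the second phase, even though none of the hosts link to each other.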
