Conference: ISKE 2013

A Multiple-Phase Stratification-Based Hierarchical Clustering Over a Deep Web Data Source

Abstract

Compared with the surface web, the deep web stores more high-quality data, so data mining over the deep web is more valuable. In the deep web, however, the entire data set is stored in back-end databases and cannot be accessed directly; data can only be retrieved over the Internet through query forms. The only way to mine a deep web data source is therefore to sample the data set, which raises several unique challenges. In this paper, following the idea of active learning, we replace the traditional one-time sample allocation with multiple phases of sample allocation, which improves the representativeness of the samples obtained. In the stratified sampling step of each phase, we draw a portion of representative samples for an initial clustering. Using the resulting clusters, we can identify the boundary points in them. A boundary point carries more uncertainty than other points; that is, it contains more information. Sampling at boundary points is therefore useful for obtaining more representative samples. In our experiments, our method performs better than random sampling and the two-phase sampling of Liu and Agrawal (Int Conf Data Mining 70-81, 2012) at the same sampling cost.
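The loop described in the abstract (stratified sampling, initial clustering, boundary-point detection, reallocation) can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes one-dimensional records, uses a tiny k-means as a stand-in for the initial clustering (the paper's title refers to hierarchical clustering), and treats a point as a boundary point when its two nearest centroids are almost equally distant. The names `multi_phase_sample`, `is_boundary`, `margin`, and the smoothed allocation weight are all hypothetical choices made for illustration.

```python
import random

def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means (Lloyd's algorithm); a stand-in for the initial clustering."""
    ordered = sorted(values)
    # Spread the initial centroids across the sorted sample.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # An empty group keeps its previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

def is_boundary(v, centroids, margin):
    """Assumed definition: the two nearest centroids are almost equally far (k >= 2)."""
    d = sorted(abs(v - c) for c in centroids)
    return d[1] - d[0] < margin

def multi_phase_sample(strata, budget_per_phase, phases, k=2, margin=1.0, seed=0):
    """strata maps a stratum name (e.g. one query-form value) to its hidden records."""
    rng = random.Random(seed)
    names = sorted(strata)
    samples = {name: [] for name in names}
    weights = {name: 1.0 for name in names}  # phase 1: uniform allocation
    centroids = []
    for _ in range(phases):
        total = sum(weights.values())
        for name in names:
            # Rounding means a phase may use slightly more or less than the budget.
            n = round(budget_per_phase * weights[name] / total)
            # Each draw models one query; draws are made with replacement.
            samples[name].extend(rng.choice(strata[name]) for _ in range(n))
        pooled = [v for drawn in samples.values() for v in drawn]
        centroids = kmeans_1d(pooled, k)
        for name in names:
            drawn = samples[name]
            hits = sum(is_boundary(v, centroids, margin) for v in drawn)
            # Smoothed boundary fraction: no stratum is ever starved completely.
            weights[name] = (hits + 1) / (len(drawn) + 2)
    return samples, centroids
```

In a toy setting with one stratum near 0, one near 10, and one spread across the whole range, only the spread-out stratum yields points near the cluster boundary, so later phases direct most of the budget to it.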
