首页> 外文期刊>JMLR: Workshop and Conference Proceedings >Counting Motifs with Graph Sampling
【24h】

Counting Motifs with Graph Sampling

机译:用图形采样计算主题

获取原文
       

摘要

Applied researchers often construct a network from data that has been collected from a random sample of nodes, with the goal to infer properties of the parent network from the sampled version. Two of the most widely used sampling schemes are subgraph sampling, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and neighborhood sampling, where we additionally observe the edges between the sampled vertices and their neighbors. In this paper, we study the problem of estimating the number of motifs as induced subgraphs under both models from a statistical perspective. We show that: for parent graph $G$ with maximal degree $d$, for any connected motif $h$ on $k$ vertices, to estimate the number of copies of $h$ in $G$, denoted by $s=mathsf{s}(h,G)$, with a multiplicative error of $epsilon$, For subgraph sampling, the optimal sampling ratio $p$ is $Theta_{k}(max{ (sepsilon^2)^{-rac{1}{k}}, ; rac{d^{k-1}}{sepsilon^{2}} })$, which only depends on the size of the motif but not its actual topology. Furthermore, we show that Horvitz-Thompson type estimators are universally optimal for any connected motifs.For neighborhood sampling, we propose a family of estimators, encompassing and outperforming the Horvitz-Thompson estimator and achieving the sampling ratio $O_{k}(min{ (rac{d}{sepsilon^2})^{rac{1}{k-1}}, ; sqrt{rac{d^{k-2}}{sepsilon^2}}})$, which again only depends on the size of $h$. This is shown to be optimal for all motifs with at most $4$ vertices and cliques of all sizes. The matching minimax lower bounds are established using certain algebraic properties of subgraph counts. These results allow us to quantify how much more informative neighborhood sampling is than subgraph sampling, as empirically verified by experiments on synthetic and real-world data. We also address the issue of adaptation to the unknown maximum degree, and study specific problems for parent graphs with additional structures, e.g., trees or planar graphs.
机译:应用研究人员通常根据从随机节点样本中收集的数据来构建网络,目的是从采样版本中推断出父网络的属性。两种最广泛使用的采样方案是子图采样,在子图采样中,我们以概率$ p $独立采样每个顶点,并观察由采样顶点诱导的子图;在邻域采样中,我们还观察到了采样顶点与其相邻像素之间的边缘。在本文中,我们从统计的角度研究了在两个模型下估计作为诱导子图的图案数量的问题。我们显示:对于最大度为d $的父图$ G $,对于$ k $顶点上的任何连通基元$ h $,估计$ G $中$ h $的副本数,表示为$ s = mathsf {s}(h,G)$,具有$ epsilon $的乘法误差,对于子图采样,最佳采样率$ p $为$ Theta_ {k}( max {(s epsilon ^ 2)^ {- frac {1} {k}},; frac {d ^ {k-1}} {s epsilon ^ {2}} })$,这仅取决于主题,但不是其实际拓扑。此外,我们证明了Horvitz-Thompson类型的估计器对于任何相连的主题都是普遍最优的。对于邻域采样,我们提出了一个估计器族,涵盖并超越了Horvitz-Thompson估计器,并实现了采样率$ O_ {k}( min {( frac {d} {s epsilon ^ 2})^ { frac {1} {k-1}},; sqrt { frac {d ^ {k-2}} {s epsilon ^ 2}} })$,这再次仅取决于$ h $的大小。已显示这是最适合所有最多有$ 4 $顶点和所有大小的集团的所有主题的。使用子图计数的某些代数属性建立匹配的minimax下界。这些结果使我们能够量化比子图采样多得多的信息性邻域采样,正如通过对合成数据和真实世界数据进行的实验所证实的那样。我们还将解决适应最大未知程度的问题,并研究具有其他结构(例如树或平面图)的父图的特定问题。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号