Counting Motifs with Graph Sampling

Jason M. Klusowski; Yihong Wu

首页> 外文期刊>JMLR: Workshop and Conference Proceedings >Counting Motifs with Graph Sampling

【24h】

Counting Motifs with Graph Sampling

机译：用图形采样计算主题

获取原文

掌桥外文数据库（机构版） >>

开具论文收录证明 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

Applied researchers often construct a network from data that has been collected from a random sample of nodes, with the goal to infer properties of the parent network from the sampled version. Two of the most widely used sampling schemes are subgraph sampling, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and neighborhood sampling, where we additionally observe the edges between the sampled vertices and their neighbors. In this paper, we study the problem of estimating the number of motifs as induced subgraphs under both models from a statistical perspective. We show that: for parent graph $G$ with maximal degree $d$, for any connected motif $h$ on $k$ vertices, to estimate the number of copies of $h$ in $G$, denoted by $s=mathsf{s}(h,G)$, with a multiplicative error of $epsilon$, For subgraph sampling, the optimal sampling ratio $p$ is $Theta_{k}(max{ (sepsilon^2)^{-rac{1}{k}}, ; rac{d^{k-1}}{sepsilon^{2}} })$, which only depends on the size of the motif but not its actual topology. Furthermore, we show that Horvitz-Thompson type estimators are universally optimal for any connected motifs.For neighborhood sampling, we propose a family of estimators, encompassing and outperforming the Horvitz-Thompson estimator and achieving the sampling ratio $O_{k}(min{ (rac{d}{sepsilon^2})^{rac{1}{k-1}}, ; sqrt{rac{d^{k-2}}{sepsilon^2}}})$, which again only depends on the size of $h$. This is shown to be optimal for all motifs with at most $4$ vertices and cliques of all sizes. The matching minimax lower bounds are established using certain algebraic properties of subgraph counts. These results allow us to quantify how much more informative neighborhood sampling is than subgraph sampling, as empirically verified by experiments on synthetic and real-world data. We also address the issue of adaptation to the unknown maximum degree, and study specific problems for parent graphs with additional structures, e.g., trees or planar graphs.

机译：应用研究人员通常根据从随机节点样本中收集的数据来构建网络，目的是从采样版本中推断出父网络的属性。两种最广泛使用的采样方案是子图采样，在子图采样中，我们以概率$ p $独立采样每个顶点，并观察由采样顶点诱导的子图;在邻域采样中，我们还观察到了采样顶点与其相邻像素之间的边缘。在本文中，我们从统计的角度研究了在两个模型下估计作为诱导子图的图案数量的问题。我们显示：对于最大度为d $的父图$ G $，对于$ k $顶点上的任何连通基元$ h $，估计$ G $中$ h $的副本数，表示为$ s = mathsf {s}（h，G）$，具有$ epsilon $的乘法误差，对于子图采样，最佳采样率$ p $为$ Theta_ {k}（ max {（s epsilon ^ 2）^ {- frac {1} {k}}，; frac {d ^ {k-1}} {s epsilon ^ {2}} }）$，这仅取决于主题，但不是其实际拓扑。此外，我们证明了Horvitz-Thompson类型的估计器对于任何相连的主题都是普遍最优的。对于邻域采样，我们提出了一个估计器族，涵盖并超越了Horvitz-Thompson估计器，并实现了采样率$ O_ {k}（ min {（ frac {d} {s epsilon ^ 2}）^ { frac {1} {k-1}}，; sqrt { frac {d ^ {k-2}} {s epsilon ^ 2}} }）$，这再次仅取决于$ h $的大小。已显示这是最适合所有最多有$ 4 $顶点和所有大小的集团的所有主题的。使用子图计数的某些代数属性建立匹配的minimax下界。这些结果使我们能够量化比子图采样多得多的信息性邻域采样，正如通过对合成数据和真实世界数据进行的实验所证实的那样。我们还将解决适应最大未知程度的问题，并研究具有其他结构（例如树或平面图）的父图的特定问题。

著录项

来源
《JMLR: Workshop and Conference Proceedings》 |2018年第2010期|共46页
作者
Jason M. Klusowski; Yihong Wu;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种
中图分类人工智能理论;
关键词

相似文献

外文文献
中文文献
专利

1. Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams [J] . De Stefani Lorenzo, Terolli Erisa, Upfal Eli ACM transactions on knowledge discovery from data . 2021,第5期

机译：分层采样：在大规模图形流中计算稀疏图案的有效方法
2. Counting Motifs with Graph Sampling [J] . Jason M. Klusowski, Yihong Wu JMLR: Workshop and Conference Proceedings . 2018,第3期

机译：用图形采样计算主题
3. Counting Motifs with Graph Sampling [J] . Jason M. Klusowski, Yihong Wu JMLR: Workshop and Conference Proceedings . 2017,第2期

机译：用图形采样计算主题
4. Tiered sampling: An efficient method for approximate counting sparse motifs in massive graph streams [C] . Lorenzo De Stefani, Erisa Terolli, Eli Upfal IEEE International Conference on Big Data . 2017

机译：分层采样：一种有效的方法，用于对大量图形流中的稀疏主题进行近似计数
5. Improved Triangle Counting in Graph Streams: Neighborhood Multi-sampling [D] . Hanjani, Kiana Mousavi 2018

机译：改进的图形流中的三角形计数：邻域多重采样
6. Single‐Molecule Counting of High‐Sensitivity Troponin I in Patients Referred for Diagnostic Angiography: Results From the CASABLANCA (Catheter Sampled Blood Archive in Cardiovascular Diseases) Study [O] . Cian P. McCarthy, Nasrien E. Ibrahim, Asya Lyass, 2018

机译：被诊断为血管造影的患者中高敏感性肌钙蛋白I的单分子计数：CASABLANCA（心血管疾病中的导管采样血液档案）研究的结果
7. Tiered Sampling: An Efficient Method for Approximate Counting Sparse Motifs in Massive Graph Streams [O] . De Stefani, Lorenzo, Terolli, Erisa, Upfal, Eli 2017

机译：分层抽样：一种有效的近似计数稀疏方法大规模图流中的图案

Counting Motifs with Graph Sampling

摘要

著录项

相似文献

相关主题

期刊订阅