How to enable efficient analytics over such data has been an increasinglyimportant research problem. Given the sheer size of such social networks, manyexisting studies resort to sampling techniques that draw random nodes from anonline social network through its restrictive web/API interface. Almost all ofthem use the exact same underlying technique of random walk - a Markov ChainMonte Carlo based method which iteratively transits from one node to its randomneighbor. Random walk fits naturally with this problem because, for most online socialnetworks, the only query we can issue through the interface is to retrieve theneighbors of a given node (i.e., no access to the full graph topology). Aproblem with random walks, however, is the "burn-in" period which requires alarge number of transitions/queries before the sampling distribution convergesto a stationary value that enables the drawing of samples in a statisticallyvalid manner. In this paper, we consider a novel problem of speeding up the fundamentaldesign of random walks (i.e., reducing the number of queries it requires)without changing the stationary distribution it achieves - thereby enabling amore efficient "drop-in" replacement for existing sampling-based analyticstechniques over online social networks. Our main idea is to leverage thehistory of random walks to construct a higher-ordered Markov chain. We developtwo algorithms, Circulated Neighbors and Groupby Neighbors Random Walk (CNRWand GNRW) and prove that, no matter what the social network topology is, CNRWand GNRW offer better efficiency than baseline random walks while achieving thesame stationary distribution. We demonstrate through extensive experiments onreal-world social networks and synthetic graphs the superiority of ourtechniques over the existing ones.
展开▼