We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ N^{1/2}, this guarantee matches the best possible rate achievable by a centralized algorithm with access to all N samples. The second algorithm is a novel method based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions.
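The divide-then-average structure of the first algorithm is simple enough to sketch directly. The following is a minimal Python sketch under stated assumptions: `local_minimize` is a hypothetical helper standing in for whatever empirical-risk minimizer each machine runs, and the per-machine solves are written as a sequential loop for clarity, though in practice they run in parallel. It illustrates the structure only, not the paper's implementation.

```python
import numpy as np

def average_mixture(samples, m, local_minimize):
    """Distribute `samples` evenly across m machines, minimize the
    empirical risk separately on each subset, then average the m
    local parameter estimates into one combined estimate."""
    subsets = np.array_split(samples, m)      # one subset per machine
    # Each machine solves its own subproblem (parallel in practice).
    local_estimates = [local_minimize(subset) for subset in subsets]
    # A single round of communication suffices: average the estimates.
    return np.mean(local_estimates, axis=0)
```

For instance, with `local_minimize` returning a least-squares fit on each subset, the averaged estimate enjoys the O(N^{-1} + (N/m)^{-2}) guarantee stated above, which matches the centralized rate whenever m ≤ N^{1/2}.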