TY - JOUR
AB - As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; however, for practical purposes, the authors proposed a heuristic variant, which we call QSGDinf, that demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.
AU - Ramezani-Kebrya, Ali
AU - Faghri, Fartash
AU - Markov, Ilya
AU - Aksenov, Vitalii
AU - Alistarh, Dan-Adrian
AU - Roy, Daniel M.
ID - 9571
IS - 114
JF - Journal of Machine Learning Research
SN - 1532-4435
TI - NUQSGD: Provably communication-efficient data-parallel SGD via nonuniform quantization
VL - 22
ER -
TY - CONF
AB - In this note, we introduce a distributed twist on the classic coupon collector problem: a set of m collectors wish to each obtain a set of n coupons; for this, they can each sample coupons uniformly at random, but can also meet in pairwise interactions, during which they can exchange coupons. By doing so, they hope to reduce the number of coupons that must be sampled by each collector in order to obtain a full set. This extension is natural when considering real-world manifestations of the coupon collector phenomenon, and has been remarked upon and studied empirically (Hayes and Hannigan 2006, Ahmad et al. 2014, Delmarcelle 2019).
We provide the first theoretical analysis for such a scenario. We find that “coupon collecting with friends” can indeed significantly reduce the number of coupons each collector must sample, and raises interesting connections to the more traditional variants of the problem. While our analysis is in most cases asymptotically tight, there are several open questions raised, regarding finer-grained analysis of both “coupon collecting with friends,” and of a long-studied variant of the original problem in which a collector requires multiple full sets of coupons.
AU - Alistarh, Dan-Adrian
AU - Davies, Peter
ID - 9620
SN - 0302-9743
T2 - Structural Information and Communication Complexity
TI - Collecting coupons is faster with friends
VL - 12810
ER -
TY - CONF
AB - We consider the problem of distributed mean estimation (DME), in which n machines are each given a local d-dimensional vector x_v ∈ R^d, and must cooperate to estimate the mean of their inputs μ = (1/n) ∑_{v=1}^{n} x_v, while minimizing total communication cost. DME is a fundamental construct in distributed machine learning, and there has been considerable work on variants of this problem, especially in the context of distributed variance reduction for stochastic gradients in parallel SGD. Previous work typically assumes an upper bound on the norm of the input vectors, and achieves an error bound in terms of this norm. However, in many real applications, the input vectors are concentrated around the correct output μ, but μ itself has large norm. In such cases, previous output error bounds perform poorly. In this paper, we show that output error bounds need not depend on input norm. We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction. The technique is based on a new connection with lattice theory. We also provide lower bounds showing that the communication-to-error trade-off of our algorithms is asymptotically optimal. As the lattices achieving optimal bounds under the ℓ2-norm can be computationally impractical, we also present an extension which leverages easy-to-use cubic lattices, and is loose only up to a logarithmic factor in d. We show experimentally that our method yields practical improvements for common applications, relative to prior approaches.
AU - Davies, Peter
AU - Gurunanthan, Vijaykrishna
AU - Moshrefi, Niusha
AU - Ashkboos, Saleh
AU - Alistarh, Dan-Adrian
ID - 9543
T2 - 9th International Conference on Learning Representations
TI - New bounds for distributed mean estimation and variance reduction
ER -
TY - CONF
AB - Approximate agreement is one of the few variants of consensus that can be solved in a wait-free manner in asynchronous systems where processes communicate by reading and writing to shared memory. In this work, we consider a natural generalisation of approximate agreement on arbitrary undirected connected graphs. Each process is given a vertex of the graph as input and, if non-faulty, must output a vertex such that (1) all the outputs are within distance 1 of one another, and (2) each output value lies on a shortest path between two input values.
From prior work, it is known that there is no wait-free algorithm among n ≥ 3 processes for this problem on any cycle of length c ≥ 4, by reduction from 2-set agreement (Castañeda et al. 2018).
In this work, we investigate the solvability and complexity of this task on general graphs. We give a new, direct proof of the impossibility of approximate agreement on cycles of length c ≥ 4, via a generalisation of Sperner's Lemma to convex polygons. We also extend the reduction from 2-set agreement to a larger class of graphs, showing that approximate agreement on these graphs is unsolvable. On the positive side, we present a wait-free algorithm for a class of graphs that properly contains the class of chordal graphs.
AU - Alistarh, Dan-Adrian
AU - Ellen, Faith
AU - Rybicki, Joel
ID - 9823
SN - 0302-9743
T2 - Structural Information and Communication Complexity
TI - Wait-free approximate agreement on graphs
VL - 12810
ER -
TY - JOUR
AB - Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1× on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
AU - Li, Shigang
AU - Ben-Nun, Tal
AU - Nadiradze, Giorgi
AU - Di Girolamo, Salvatore
AU - Dryden, Nikoli
AU - Alistarh, Dan-Adrian
AU - Hoefler, Torsten
ID - 8723
IS - 7
JF - IEEE Transactions on Parallel and Distributed Systems
SN - 1045-9219
TI - Breaking (global) barriers in parallel stochastic optimization with wait-avoiding group averaging
VL - 32
ER -
TY - CONF
AB - There has recently been a surge of interest in the computational and complexity properties of the population model, which assumes n anonymous, computationally-bounded nodes, interacting at random, with the goal of jointly computing global predicates. Significant work has gone towards investigating majority or consensus dynamics in this model: that is, assuming that every node is initially in one of two states X or Y, determine which state had higher initial count.
In this paper, we consider a natural generalization of majority/consensus, which we call comparison: in its simplest formulation, we are given two baseline states, X and Y, present in any initial configuration in fixed, but possibly small counts. One of these states has higher count than the other: we will assume |X_0| > C |Y_0| for some constant C > 1. The challenge is to design a protocol by which nodes can quickly and reliably decide on which of the baseline states X_0 and Y_0 has higher initial count. We begin by analyzing a simple and general dynamics solving the above comparison problem, which uses O(log n) states per node, and converges in O(log n) (parallel) time, with high probability, to a state where the whole population votes on opinions X or Y at rates proportional to the initial concentrations of |X_0| vs. |Y_0|. We then describe how this procedure can be bootstrapped to solve comparison, i.e. have every node in the population reach the "correct" decision, with probability 1 - o(1), at the cost of O(log log n) additional states. Further, we prove that this dynamics is self-stabilizing, in the sense that it converges to the correct decision from arbitrary initial states, and leak-robust, in the sense that it can withstand spurious faulty reactions, which are known to occur in practical implementations of population protocols. Our analysis is based on a new martingale concentration result relating the discrete-time evolution of a population protocol to its expected (steady-state) analysis, which should be a useful tool when analyzing opinion dynamics and epidemic dissemination in the population model.
AU - Alistarh, Dan-Adrian
AU - Töpfer, Martin
AU - Uznański, Przemysław
ID - 9951
SN - 9781450385480
T2 - Proceedings of the 2021 ACM Symposium on Principles of Distributed Computing
TI - Comparison dynamics in population protocols
ER -
TY - CONF
AB - There has been a significant amount of research on hardware and software support for efficient concurrent data structures; yet, the question of how to build correct, simple, and scalable data structures has not yet been definitively settled. In this paper, we revisit this question from a minimalist perspective, and ask: what is the smallest amount of synchronization required for correct and efficient concurrent search data structures, and how could this minimal synchronization support be provided in hardware?
To address these questions, we introduce memory tagging, a simple hardware mechanism which enables the programmer to "tag" a dynamic set of memory locations, at cache-line granularity, and later validate whether the memory has been concurrently modified, with the possibility of updating one of the underlying locations atomically if validation succeeds. We provide several examples showing that this mechanism can enable fast and arguably simple concurrent data structure designs, such as lists, binary search trees, balanced search trees, range queries, and Software Transactional Memory (STM) implementations. We provide an implementation of memory tags in the Graphite multi-core simulator, showing that the mechanism can be implemented entirely at the level of L1 cache, and that it can enable non-trivial speedups versus existing implementations of the above data structures.
AU - Alistarh, Dan-Adrian
AU - Brown, Trevor A
AU - Singhal, Nandini
ID - 8191
SN - 9781450369350
T2 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
TI - Memory tagging: Minimalist synchronization for scalable concurrent data structures
ER -
TY - JOUR
AB - Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss, and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensing frameworks for interferometry and medical imaging. We ask the following question: can the precision of the data representation be lowered for all inputs, with recovery guarantees and practical performance? Our first contribution is a theoretical analysis of the normalized Iterative Hard Thresholding (IHT) algorithm when all input data, meaning both the measurement matrix and the observation vector, are quantized aggressively. We present a variant of low-precision normalized IHT that, under mild conditions, can still provide recovery guarantees. The second contribution is the application of our quantization framework to radio astronomy and magnetic resonance imaging. We show that lowering the precision of the data can significantly accelerate image recovery. We evaluate our approach on telescope data and samples of brain images using CPU and FPGA implementations, achieving up to a 9× speedup with negligible loss of recovery quality.
AU - Gurel, Nezihe Merve
AU - Kara, Kaan
AU - Stojanov, Alen
AU - Smith, Tyler
AU - Lemmin, Thomas
AU - Alistarh, Dan-Adrian
AU - Püschel, Markus
AU - Zhang, Ce
ID - 8268
JF - IEEE Transactions on Signal Processing
SN - 1053-587X
TI - Compressive sensing using iterative hard thresholding with low precision data representation: Theory and applications
VL - 68
ER -
TY - CONF
AB - We consider the following dynamic load-balancing process: given an underlying graph G with n nodes, in each step t ≥ 0, one unit of load is created, and placed at a randomly chosen graph node. In the same step, the chosen node picks a random neighbor, and the two nodes balance their loads by averaging them. We are interested in the expected gap between the minimum and maximum loads at nodes as the process progresses, and its dependence on n and on the graph structure. Variants of the above graphical balanced allocation process have been studied previously by Peres, Talwar, and Wieder [Peres et al., 2015], and by Sauerwald and Sun [Sauerwald and Sun, 2015]. These authors left as open the question of characterizing the gap in the case of cycle graphs in the dynamic case, where weights are created during the algorithm's execution. For this case, the only known upper bound is of O(n log n), following from a majorization argument due to [Peres et al., 2015], which analyzes a related graphical allocation process. In this paper, we provide an upper bound of O(√n log n) on the expected gap of the above process for cycles of length n. We introduce a new potential analysis technique, which enables us to bound the difference in load between k-hop neighbors on the cycle, for any k ≤ n/2. We complement this with a "gap covering" argument, which bounds the maximum value of the gap by bounding its value across all possible subsets of a certain structure, and recursively bounding the gaps within each subset. We provide analytical and experimental evidence that our upper bound on the gap is tight up to a logarithmic factor.
AU - Alistarh, Dan-Adrian
AU - Nadiradze, Giorgi
AU - Sabour, Amirmojtaba
ID - 8286
SN - 1868-8969
T2 - 47th International Colloquium on Automata, Languages, and Programming
TI - Dynamic averaging load balancing on cycles
VL - 168
ER -
TY - CONF
AB - We introduce extension-based proofs, a class of impossibility proofs that includes valency arguments. They are modelled as an interaction between a prover and a protocol. Using proofs based on combinatorial topology, it has been shown that it is impossible to deterministically solve k-set agreement among n > k ≥ 2 processes in a wait-free manner. However, it was unknown whether proofs based on simpler techniques were possible. We explain why this impossibility result cannot be obtained by an extension-based proof and, hence, extension-based proofs are limited in power.
AU - Alistarh, Dan-Adrian
AU - Aspnes, James
AU - Ellen, Faith
AU - Gelashvili, Rati
AU - Zhu, Leqi
ID - 8383
SN - 9781450375825
T2 - Proceedings of the 39th Symposium on Principles of Distributed Computing
TI - Brief Announcement: Why Extension-Based Proofs Fail
ER -
TY - CONF
AB - Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves a 1.27× speedup over the state-of-the-art synchronous SGD, without losing accuracy.
AU - Li, Shigang
AU - Ben-Nun, Tal
AU - Di Girolamo, Salvatore
AU - Alistarh, Dan-Adrian
AU - Hoefler, Torsten
ID - 8722
T2 - Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
TI - Taming unbalanced training workloads in deep learning with partial collective operations
ER -
TY - CONF
AB - Union-Find (or Disjoint-Set Union) is one of the fundamental problems in computer science; it has been well-studied from both theoretical and practical perspectives in the sequential case. Recently, there has been mounting interest in analyzing this problem in the concurrent scenario, and several asymptotically-efficient algorithms have been proposed. Yet, to date, there is very little known about the practical performance of concurrent Union-Find. This work addresses this gap. We evaluate and analyze the performance of several concurrent Union-Find algorithms and optimization strategies across a wide range of platforms (Intel, AMD, and ARM) and workloads (social, random, and road networks, as well as integrations into more complex algorithms). We first observe that, due to the limited computational cost, the number of induced cache misses is the critical determining factor for the performance of existing algorithms. We introduce new techniques to reduce this cost by storing node priorities implicitly and by using plain reads and writes in a way that does not affect the correctness of the algorithms. Finally, we show that Union-Find implementations are an interesting application for Transactional Memory (TM): one of the fastest algorithm variants we discovered is a sequential one that uses coarse-grained locking with the lock elision optimization to reduce synchronization cost and increase scalability.
AU - Alistarh, Dan-Adrian
AU - Fedorov, Alexander
AU - Koval, Nikita
ID - 7605
SN - 1868-8969
T2 - 23rd International Conference on Principles of Distributed Systems
TI - In search of the fastest concurrent union-find algorithm
VL - 153
ER -
TY - CONF
AB - Concurrent programming can be notoriously complex and error-prone. Programming bugs can arise from a variety of sources, such as operation reordering or incomplete understanding of the memory model. A variety of formal and model-checking methods have been developed to address this fundamental difficulty. While technically interesting, existing academic methods are still hard to apply to the large codebases typical of industrial deployments, which limits their practical impact.
AU - Koval, Nikita
AU - Sokolova, Mariia
AU - Fedorov, Alexander
AU - Alistarh, Dan-Adrian
AU - Tsitelov, Dmitry
ID - 7635
SN - 9781450368186
T2 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
TI - Testing concurrency on the JVM with Lincheck
ER -
TY - CONF
AB - Balanced search trees typically use key comparisons to guide their operations, and achieve logarithmic running time. By relying on numerical properties of the keys, interpolation search achieves lower search complexity and better performance. Although interpolation-based data structures were investigated in the past, their non-blocking concurrent variants have received very little attention so far.
In this paper, we propose the first non-blocking implementation of the classic interpolation search tree (IST) data structure. For arbitrary key distributions, the data structure ensures worst-case O(log n + p) amortized time for search, insertion and deletion traversals. When the input key distributions are smooth, lookups run in expected O(log log n + p) time, and insertion and deletion run in expected amortized O(log log n + p) time, where p is a bound on the number of threads. To improve the scalability of concurrent insertion and deletion, we propose a novel parallel rebuilding technique, which should be of independent interest.
We evaluate whether the theoretical improvements translate to practice by implementing the concurrent interpolation search tree, and benchmarking it on uniform and nonuniform key distributions, for dataset sizes in the millions to billions of keys. Relative to the state-of-the-art concurrent data structures, the concurrent interpolation search tree achieves performance improvements of up to 15% under high update rates, and of up to 50% under moderate update rates. Further, ISTs exhibit up to 2× fewer cache misses, and consume 1.2–2.6× less memory compared to the next best alternative on typical dataset sizes. We find that the results are surprisingly robust to distributional skew, which suggests that our data structure can be a promising alternative to classic concurrent search structures.
AU - Brown, Trevor A
AU - Prokopec, Aleksandar
AU - Alistarh, Dan-Adrian
ID - 7636
SN - 9781450368186
T2 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
TI - Non-blocking interpolation search trees with doubly-logarithmic running time
ER -
TY - CONF
AB - The design and implementation of efficient concurrent data structures have seen significant attention. However, most of this work has focused on concurrent data structures providing good worst-case guarantees. In real workloads, objects are often accessed at different rates, since access distributions may be non-uniform. Efficient distribution-adaptive data structures are known in the sequential case, e.g. the splay-trees; however, they are often hard to translate efficiently to the concurrent case. In this paper, we investigate distribution-adaptive concurrent data structures and propose a new design called the splay-list. At a high level, the splay-list is similar to a standard skip-list, with the key distinction that the height of each element adapts dynamically to its access rate: popular elements "move up," whereas rarely-accessed elements decrease in height. We show that the splay-list provides order-optimal amortized complexity bounds for a subset of operations while being amenable to efficient concurrent implementation. Experimental results show that the splay-list can leverage distribution-adaptivity to improve on the performance of classic concurrent designs, and can outperform the only previously-known distribution-adaptive design in certain settings.
AU - Aksenov, Vitaly
AU - Alistarh, Dan-Adrian
AU - Drozdova, Alexandra
AU - Mohtashami, Amirkeivan
ID - 8725
SN - 1868-8969
T2 - 34th International Symposium on Distributed Computing
TI - The splay-list: A distribution-adaptive concurrent skip-list
VL - 179
ER -
TY - CONF
AB - We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms. Specifically, we analyze the situation in which a learning system obtains datasets from multiple sources, some of which might be biased or even adversarially perturbed. It is known that in the single-source case, an adversary with the power to corrupt a fixed fraction of the training data can prevent PAC-learnability; that is, even in the limit of infinite training data, no learning system can approach the optimal test error. In this work we show that, surprisingly, the same is not true in the multi-source setting, where the adversary can arbitrarily corrupt a fixed fraction of the data sources. Our main results are a generalization bound that provides finite-sample guarantees for this learning setting, as well as corresponding lower bounds. Besides establishing PAC-learnability, our results also show that in a cooperative learning setting, sharing data with other parties has provable benefits, even if some participants are malicious.
AU - Konstantinov, Nikola H
AU - Frantar, Elias
AU - Alistarh, Dan-Adrian
AU - Lampert, Christoph
ID - 8724
SN - 2640-3498
T2 - Proceedings of the 37th International Conference on Machine Learning
TI - On the sample complexity of adversarial multi-source PAC learning
VL - 119
ER -
TY - CONF
AB - Optimizing convolutional neural networks for fast inference has recently become an extremely active area of research. One of the go-to solutions in this context is weight pruning, which aims to reduce computational and memory footprint by removing large subsets of the connections in a neural network. Surprisingly, much less attention has been given to exploiting sparsity in the activation maps, which tend to be naturally sparse in many settings thanks to the structure of rectified linear (ReLU) activation functions. In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains. To induce highly sparse activation maps without accuracy loss, we introduce a new regularization technique, coupled with a new threshold-based sparsification method based on a parameterized activation function called Forced-Activation-Threshold Rectified Linear Unit (FATReLU). We examine the impact of our methods on popular image classification models, showing that most architectures can adapt to significantly sparser activation maps without any accuracy loss. Our second contribution is showing that these compression gains can be translated into inference speedups: we provide a new algorithm to enable fast convolution operations over networks with sparse activations, and show that it can enable significant speedups for end-to-end inference on a range of popular models on the large-scale ImageNet image classification task on modern Intel CPUs, with little or no retraining cost.
AU - Kurtz, Mark
AU - Kopinsky, Justin
AU - Gelashvili, Rati
AU - Matveev, Alexander
AU - Carr, John
AU - Goin, Michael
AU - Leiserson, William
AU - Moore, Sage
AU - Nell, Bill
AU - Shavit, Nir
AU - Alistarh, Dan-Adrian
ID - 9415
SN - 2640-3498
T2 - 37th International Conference on Machine Learning, ICML 2020
TI - Inducing and exploiting activation sparsity for fast neural network inference
VL - 119
ER -
TY - CONF
AB - Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations in this context. Our work examines this question, identifies issues with existing approaches, and proposes a method called WoodFisher to compute a faithful and efficient estimate of the inverse Hessian. Our main application is to neural network compression, where we build on the classic Optimal Brain Damage/Surgeon framework. We demonstrate that WoodFisher significantly outperforms popular state-of-the-art methods for one-shot pruning. Further, even when iterative, gradual pruning is allowed, our method results in a gain in test accuracy over the state-of-the-art approaches, for standard image classification datasets such as ImageNet ILSVRC. We examine how our method can be extended to take into account first-order information, as well as illustrate its ability to automatically set layer-wise pruning thresholds and perform compression in the limited-data regime. The code is available at the following link: https://github.com/IST-DASLab/WoodFisher.
AU - Singh, Sidak Pal
AU - Alistarh, Dan-Adrian
ID - 9632
SN - 1049-5258
T2 - Advances in Neural Information Processing Systems
TI - WoodFisher: Efficient second-order approximation for neural network compression
VL - 33
ER -
TY - CONF
AB - The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning. Consequently, there has been considerable effort invested into developing efficient parallel variants of classic machine learning algorithms. However, despite the wealth of knowledge on parallelization, some classic machine learning algorithms often prove hard to parallelize efficiently while maintaining convergence. In this paper, we focus on efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm. We address the challenge of efficiently parallelizing this classic paradigm by showing how to leverage scalable relaxed schedulers in this context. We present an extensive empirical study, showing that our approach outperforms previous parallel belief propagation implementations both in terms of scalability and in terms of wall-clock convergence time, on a range of practical applications.
AU - Aksenov, Vitaly
AU - Alistarh, Dan-Adrian
AU - Korhonen, Janne
ID - 9631
SN - 1049-5258
T2 - Advances in Neural Information Processing Systems
TI - Scalable belief propagation via relaxed scheduling
VL - 33
ER -
TY - CONF
AB - Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed in a "data-parallel" fashion, distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse or sparsifiable communications. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations, by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly-scalable machine learning frameworks.
AU - Renggli, Cedric
AU - Ashkboos, Saleh
AU - Aghagolzadeh, Mehdi
AU - Alistarh, Dan-Adrian
AU - Hoefler, Torsten
ID - 7201
SN - 2167-4329
T2 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
TI - SparCML: High-performance sparse communication for machine learning
ER -