TY - JOUR
AB - The Massively Parallel Computation (MPC) model is an emerging model that distills core aspects of distributed and parallel computation, developed as a tool to solve combinatorial (typically graph) problems in systems of many machines with limited space. Recent work has focused on the regime in which machines have sublinear (in n, the number of nodes in the input graph) space, with randomized algorithms presented for the fundamental problems of Maximal Matching and Maximal Independent Set. However, there have been no prior corresponding deterministic algorithms. A major challenge underlying the sublinear space setting is that the local space of each machine might be too small to store all edges incident to a single node. This poses a considerable obstacle compared to classical models in which each node is assumed to know and have easy access to its incident edges. To overcome this barrier, we introduce a new graph sparsification technique that deterministically computes a low-degree subgraph, with the additional property that solving the problem on this subgraph provides significant progress towards solving the problem for the original input graph. Using this framework to derandomize the well-known algorithm of Luby [SICOMP’86], we obtain O(log Δ + log log n)-round deterministic MPC algorithms for solving the problems of Maximal Matching and Maximal Independent Set with O(n^ε) space on each machine for any constant ε > 0. These algorithms also run in O(log Δ) rounds in the closely related model of CONGESTED CLIQUE, improving upon the state-of-the-art bound of O(log² Δ) rounds by Censor-Hillel et al. [DISC’17].
AU - Czumaj, Artur
AU - Davies, Peter
AU - Parter, Merav
ID - 9541
IS - 2
JF - ACM Transactions on Algorithms
SN - 1549-6325
TI - Graph sparsification for derandomizing massively parallel computation with low space
VL - 17
ER -
TY - THES
AB - Deep learning is best known for its empirical success across a wide range of applications spanning computer vision, natural language processing and speech. Of equal significance, though perhaps less known, are its ramifications for learning theory: deep networks have been observed to perform surprisingly well in the high-capacity regime, a.k.a. the overfitting or underspecified regime. Classically, this regime on the far right of the bias-variance curve is associated with poor generalisation; however, recent experiments with deep networks challenge this view. This thesis is devoted to investigating various aspects of underspecification in deep learning. First, we argue that deep learning models are underspecified on two levels: a) any given training dataset can be fit by many different functions, and b) any given function can be expressed by many different parameter configurations. We refer to the second kind of underspecification as parameterisation redundancy, and we precisely characterise its extent. Second, we characterise the implicit criteria (the inductive bias) that guide learning in the underspecified regime. Specifically, we consider a nonlinear but tractable classification setting, and show that given the choice, neural networks learn classifiers with a large margin. Third, we consider learning scenarios where the inductive bias is not by itself sufficient to deal with underspecification. We then study different ways of ‘tightening the specification’: i) In the setting of representation learning with variational autoencoders, we propose a hand-crafted regulariser based on mutual information. ii) In the setting of binary classification, we consider soft-label (real-valued) supervision. We derive a generalisation bound for linear networks supervised in this way and verify that soft labels facilitate fast learning. Finally, we explore an application of soft-label supervision to the training of multi-exit models.
AU - Bui Thi Mai, Phuong
ID - 9418
TI - Underspecification in Deep Learning
ER -
TY - CONF
AB - We consider the problem of distributed mean estimation (DME), in which n machines are each given a local d-dimensional vector x_v ∈ R^d, and must cooperate to estimate the mean of their inputs μ = (1/n) ∑_{v=1}^{n} x_v, while minimizing total communication cost. DME is a fundamental construct in distributed machine learning, and there has been considerable work on variants of this problem, especially in the context of distributed variance reduction for stochastic gradients in parallel SGD. Previous work typically assumes an upper bound on the norm of the input vectors, and achieves an error bound in terms of this norm. However, in many real applications, the input vectors are concentrated around the correct output μ, but μ itself has large norm. In such cases, previous output error bounds perform poorly. In this paper, we show that output error bounds need not depend on input norm. We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction. The technique is based on a new connection with lattice theory. We also provide lower bounds showing that the communication to error trade-off of our algorithms is asymptotically optimal. As the lattices achieving optimal bounds under the ℓ_2-norm can be computationally impractical, we also present an extension which leverages easy-to-use cubic lattices, and is loose only up to a logarithmic factor in d. We show experimentally that our method yields practical improvements for common applications, relative to prior approaches.
AU - Davies, Peter
AU - Gurunanthan, Vijaykrishna
AU - Moshrefi, Niusha
AU - Ashkboos, Saleh
AU - Alistarh, Dan-Adrian
ID - 9543
T2 - 9th International Conference on Learning Representations
TI - New bounds for distributed mean estimation and variance reduction
ER -
TY - CONF
AB - We study the inductive bias of two-layer ReLU networks trained by gradient flow. We identify a class of easy-to-learn (`orthogonally separable') datasets, and characterise the solution that ReLU networks trained on such datasets converge to. Irrespective of network width, the solution turns out to be a combination of two max-margin classifiers: one corresponding to the positive data subset and one corresponding to the negative data subset. The proof is based on the recently introduced concept of extremal sectors, for which we prove a number of properties in the context of orthogonal separability. In particular, we prove stationarity of activation patterns from some time onwards, which enables a reduction of the ReLU network to an ensemble of linear subnetworks.
AU - Bui Thi Mai, Phuong
AU - Lampert, Christoph
ID - 9416
T2 - 9th International Conference on Learning Representations
TI - The inductive bias of ReLU networks on orthogonally separable data
ER -
TY - JOUR
AB - We prove that every n-vertex tournament G has an acyclic subgraph with chromatic number at least n^{5/9−o(1)}, while there exists an n-vertex tournament G whose every acyclic subgraph has chromatic number at most n^{3/4+o(1)}. This establishes in a strong form a conjecture of Nassar and Yuster and improves on another result of theirs. Our proof combines probabilistic and spectral techniques together with some additional ideas. In particular, we prove a lemma showing that every tournament with many transitive subtournaments has a large subtournament that is almost transitive. This may be of independent interest.
AU - Fox, Jacob
AU - Kwan, Matthew Alan
AU - Sudakov, Benny
ID - 9572
IS - 2
JF - Bulletin of the London Mathematical Society
SN - 0024-6093
TI - Acyclic subgraphs of tournaments with high chromatic number
VL - 53
ER -