TY - CONF AB - Asynchronous programming has gained significant popularity over the last decade: support for this programming pattern is available in many popular languages via libraries and native language implementations, typically in the form of coroutines or the async/await construct. Instead of programming via shared memory, this concept assumes implicit synchronization through message passing. The key data structure enabling such communication is the rendezvous channel. Roughly, a rendezvous channel is a blocking queue of size zero, so both send(e) and receive() operations wait for each other, performing a rendezvous when they meet. To optimize the message passing pattern, channels are usually equipped with a fixed-size buffer, so sends do not suspend and put elements into the buffer until its capacity is exceeded. This primitive is known as a buffered channel. This paper presents a fast and scalable algorithm for both rendezvous and buffered channels. Similarly to modern queues, our solution is based on an infinite array with two positional counters for send(e) and receive() operations, leveraging the unconditional Fetch-And-Add instruction to update them. Yet, the algorithm requires non-trivial modifications of this classic pattern, in order to support the full channel semantics, such as buffering and cancellation of waiting requests. We compare the performance of our solution to that of the Kotlin implementation, as well as against other academic proposals, showing up to 9.8× speedup. To showcase its expressiveness and performance, we also integrated the proposed algorithm into the standard Kotlin Coroutines library, replacing the previous channel implementations. AU - Koval, Nikita AU - Alistarh, Dan-Adrian AU - Elizarov, Roman ID - 12735 SN - 9798400700156 T2 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming TI - Fast and scalable channels in Kotlin Coroutines ER - TY - JOUR AB - Writing concurrent code that is both correct and efficient is notoriously difficult. Thus, programmers often prefer to use synchronization abstractions, which render code simpler and easier to reason about. Despite a wealth of work on this topic, there is still a gap between the rich semantics provided by synchronization abstractions in modern programming languages—specifically, fair FIFO ordering of synchronization requests and support for abortable operations—and frameworks for implementing it correctly and efficiently. Supporting such semantics is critical given the rising popularity of constructs for asynchronous programming, such as coroutines, which abort frequently and are cheaper to suspend and resume compared to native threads. This paper introduces a new framework called CancellableQueueSynchronizer (CQS), which enables simple yet efficient implementations of a wide range of fair and abortable synchronization primitives: mutexes, semaphores, barriers, count-down latches, and blocking pools. Our main contribution is algorithmic, as implementing both fairness and abortability efficiently at this level of generality is non-trivial. Importantly, all our algorithms, including the CQS framework and the primitives built on top of it, come with formal proofs in the Iris framework for Coq for many of their properties. These proofs are modular, so it is easy to show correctness for new primitives implemented on top of CQS. From a practical perspective, implementation of CQS for native threads on the JVM improves throughput by up to two orders of magnitude over Java’s AbstractQueuedSynchronizer, the only practical abstraction offering similar semantics. Further, we successfully integrated CQS as a core component of the popular Kotlin Coroutines library, validating the framework’s practical impact and expressiveness in a real-world environment. In sum, CancellableQueueSynchronizer is the first framework to combine expressiveness with formal guarantees and solid practical performance. Our approach should be extensible to other languages and families of synchronization primitives. AU - Koval, Nikita AU - Khalanskiy, Dmitry AU - Alistarh, Dan-Adrian ID - 13179 JF - Proceedings of the ACM on Programming Languages TI - CQS: A formally-verified framework for fair and abortable synchronization VL - 7 ER - TY - CONF AB - This paper presents Lincheck, a new practical and user-friendly framework for testing concurrent algorithms on the Java Virtual Machine (JVM). Lincheck provides a simple and declarative way to write concurrent tests: instead of describing how to perform the test, users specify what to test by declaring all the operations to examine; the framework automatically handles the rest. As a result, tests written with Lincheck are concise and easy to understand. The framework automatically generates a set of concurrent scenarios, examines them using stress-testing or bounded model checking, and verifies that the results of each invocation are correct. Notably, if an error is detected via model checking, Lincheck provides an easy-to-follow trace to reproduce it, significantly simplifying the bug investigation. To the best of our knowledge, Lincheck is the first production-ready tool on the JVM that offers such a simple way of writing concurrent tests, without requiring special skills or expertise. We successfully integrated Lincheck in the development process of several large projects, such as Kotlin Coroutines, and identified new bugs in popular concurrency libraries, such as a race in Java’s standard ConcurrentLinkedDeque and a liveliness bug in Java’s AbstractQueuedSynchronizer framework, which is used in most of the synchronization primitives. We believe that Lincheck can significantly improve the quality and productivity of concurrent algorithms research and development and become the state-of-the-art tool for checking their correctness. AU - Koval, Nikita AU - Fedorov, Alexander AU - Sokolova, Maria AU - Tsitelov, Dmitry AU - Alistarh, Dan-Adrian ID - 14260 SN - 0302-9743 T2 - 35th International Conference on Computer Aided Verification TI - Lincheck: A practical framework for testing concurrent data structures on JVM VL - 13964 ER - TY - GEN AB - Lincheck is a new practical and user-friendly framework for testing concurrent data structures on the Java Virtual Machine (JVM). It provides a simple and declarative way to write concurrent tests. Instead of describing how to perform the test, users specify what to test by declaring all the operations to examine; the framework automatically handles the rest. As a result, tests written with Lincheck are concise and easy to understand. The artifact presents a collection of Lincheck tests that discover new bugs in popular libraries and implementations from the concurrency literature -- they are listed in Table 1, Section 3. To evaluate the performance of Lincheck analysis, the collection of tests also includes those which check correct data structures and, thus, always succeed. Similarly to Table 2, Section 3, the experiments demonstrate the reasonable time to perform a test. Finally, Lincheck provides user-friendly output with an easy-to-follow trace to reproduce a detected error, significantly simplifying further investigation. AU - Koval, Nikita AU - Fedorov, Alexander AU - Sokolova, Maria AU - Tsitelov, Dmitry AU - Alistarh, Dan-Adrian ID - 14995 TI - Lincheck: A practical framework for testing concurrent data structures on JVM ER - TY - CONF AB - Designing and implementing efficient parallel priority schedulers is an active research area. An intriguing proposed design is the Multi-Queue: given n threads and m ≥ n distinct priority queues, task insertions are performed uniformly at random, while, to delete, a thread picks two queues uniformly at random, and removes the observed task of higher priority. This approach scales well, and has probabilistic rank guarantees: roughly, the rank of each task removed, relative to remaining tasks in all other queues, is O (m) in expectation. Yet, the performance of this pattern is below that of well-engineered schedulers, which eschew theoretical guarantees for practical efficiency. We investigate whether it is possible to design and implement a Multi-Queue-based task scheduler that is both highly-efficient and has analytical guarantees. We propose a new variant called the Stealing Multi-Queue (SMQ), a cache-efficient variant of the Multi-Queue, which leverages both queue affinity---each thread has a local queue, from which tasks are usually removed; but, with some probability, threads also attempt to steal higher-priority tasks from the other queues---and task batching, that is, the processing of several tasks in a single insert / remove step. These ideas are well-known for task scheduling without priorities; our theoretical contribution is showing that, despite relaxations, this design can still provide rank guarantees, which in turn implies bounds on total work performed. We provide a general SMQ implementation which can surpass state-of-the-art schedulers such as OBIM and PMOD in terms of performance on popular graph-processing benchmarks. Notably, the performance improvement comes mainly from the superior rank guarantees provided by our scheduler, confirming that analytically-reasoned approaches can still provide performance improvements for priority task scheduling. AU - Postnikova, Anastasiia AU - Koval, Nikita AU - Nadiradze, Giorgi AU - Alistarh, Dan-Adrian ID - 11180 SN - 9781450392044 T2 - Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming TI - Multi-queues can be state-of-the-art priority schedulers ER - TY - GEN AB - The source code for replicating experiments presented in the paper. The implementation of the designed priority schedulers can be found in Galois-2.2.1/include/Galois/WorkList/: StealingMultiQueue.h is the StealingMultiQueue. MQOptimized/ contains MQ Optimized variants. We provide images that contain all the dependencies and datasets. Images can be pulled from npostnikova/mq-based-schedulers repository, or downloaded from Zenodo. See readme for more detail. AU - Postnikova, Anastasiia AU - Koval, Nikita AU - Nadiradze, Giorgi AU - Alistarh, Dan-Adrian ID - 13076 TI - Multi-queues can be state-of-the-art priority schedulers ER - TY - CONF AB - Union-Find (or Disjoint-Set Union) is one of the fundamental problems in computer science; it has been well-studied from both theoretical and practical perspectives in the sequential case. Recently, there has been mounting interest in analyzing this problem in the concurrent scenario, and several asymptotically-efficient algorithms have been proposed. Yet, to date, there is very little known about the practical performance of concurrent Union-Find. This work addresses this gap. We evaluate and analyze the performance of several concurrent Union-Find algorithms and optimization strategies across a wide range of platforms (Intel, AMD, and ARM) and workloads (social, random, and road networks, as well as integrations into more complex algorithms). We first observe that, due to the limited computational cost, the number of induced cache misses is the critical determining factor for the performance of existing algorithms. We introduce new techniques to reduce this cost by storing node priorities implicitly and by using plain reads and writes in a way that does not affect the correctness of the algorithms. Finally, we show that Union-Find implementations are an interesting application for Transactional Memory (TM): one of the fastest algorithm variants we discovered is a sequential one that uses coarse-grained locking with the lock elision optimization to reduce synchronization cost and increase scalability. AU - Alistarh, Dan-Adrian AU - Fedorov, Alexander AU - Koval, Nikita ID - 7605 SN - 18688969 T2 - 23rd International Conference on Principles of Distributed Systems TI - In search of the fastest concurrent union-find algorithm VL - 153 ER - TY - CONF AB - Concurrent programming can be notoriously complex and error-prone. Programming bugs can arise from a variety of sources, such as operation re-reordering, or incomplete understanding of the memory model. A variety of formal and model checking methods have been developed to address this fundamental difficulty. While technically interesting, existing academic methods are still hard to apply to the large codebases typical of industrial deployments, which limits their practical impact. AU - Koval, Nikita AU - Sokolova, Mariia AU - Fedorov, Alexander AU - Alistarh, Dan-Adrian AU - Tsitelov, Dmitry ID - 7635 SN - 9781450368186 T2 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP TI - Testing concurrency on the JVM with Lincheck ER - TY - GEN AB - Traditional concurrent programming involves manipulating shared mutable state. Alternatives to this programming style are communicating sequential processes (CSP) [1] and actor [2] models, which share data via explicit communication. Rendezvous channelis the common abstraction for communication between several processes, where senders and receivers perform a rendezvous handshake as a part of their protocol (senders wait for receivers and vice versa). Additionally to this, channels support the select expression. In this work, we present the first efficient lock-free channel algorithm, and compare it against Go [3] and Kotlin [4] baseline implementations. AU - Koval, Nikita AU - Alistarh, Dan-Adrian AU - Elizarov, Roman ID - 6485 SN - 9781450362252 T2 - Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming TI - Lock-free channels for programming via communicating sequential processes ER - TY - CONF AB - Traditional concurrent programming involves manipulating shared mutable state. Alternatives to this programming style are communicating sequential processes (CSP) and actor models, which share data via explicit communication. These models have been known for almost half a century, and have recently had started to gain significant traction among modern programming languages. The common abstraction for communication between several processes is the channel. Although channels are similar to producer-consumer data structures, they have different semantics and support additional operations, such as the select expression. Despite their growing popularity, most known implementations of channels use lock-based data structures and can be rather inefficient. In this paper, we present the first efficient lock-free algorithm for implementing a communication channel for CSP programming. We provide implementations and experimental results in the Kotlin and Go programming languages. Our new algorithm outperforms existing implementations on many workloads, while providing non-blocking progress guarantee. Our design can serve as an example of how to construct general communication data structures for CSP and actor models. AU - Koval, Nikita AU - Alistarh, Dan-Adrian AU - Elizarov, Roman ID - 7228 SN - 0302-9743 T2 - 25th Anniversary of Euro-Par TI - Scalable FIFO channels for programming via communicating sequential processes VL - 11725 ER - TY - CONF AB - Several classic problems in graph processing and computational geometry are solved via incremental algorithms, which split computation into a series of small tasks acting on shared state, which gets updated progressively. While the sequential variant of such algorithms usually specifies a fixed (but sometimes random) order in which the tasks should be performed, a standard approach to parallelizing such algorithms is to relax this constraint to allow for out-of-order parallel execution. This is the case for parallel implementations of Dijkstra's single-source shortest-paths (SSSP) algorithm, and for parallel Delaunay mesh triangulation. While many software frameworks parallelize incremental computation in this way, it is still not well understood whether this relaxed ordering approach can still provide any complexity guarantees. In this paper, we address this problem, and analyze the efficiency guarantees provided by a range of incremental algorithms when parallelized via relaxed schedulers. We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of k in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of O(łog n poly(k)), where n is the number of tasks to be executed. For SSSP, we show that the additional work is O(poly(k), dmax / wmin), where dmax is the maximum distance between two nodes, and wmin is the minimum such distance. In practical settings where n >> k, this suggests that the overheads of relaxation will be outweighed by the improved scalability of the relaxed scheduler. On the negative side, we provide lower bounds showing that certain algorithms will inherently incur a non-trivial amount of wasted work due to scheduler relaxation, even for relatively benign relaxed schedulers. AU - Alistarh, Dan-Adrian AU - Nadiradze, Giorgi AU - Koval, Nikita ID - 6673 SN - 9781450361842 T2 - 31st ACM Symposium on Parallelism in Algorithms and Architectures TI - Efficiency guarantees for parallel incremental algorithms under relaxed schedulers ER -