Distributed, Parallel, and Cluster Computing
- [1] arXiv:2405.10057 [pdf, ps, other]
-
Title: AMECOS: A Modular Event-based Framework for Concurrent Object SpecificationTimothé Albouy (IRISA), Antonio Fernández Anta, Chryssis Georgiou (UCY), Mathieu Gestin, Nicolas Nicolaou, Junlang WangSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
In this work, we introduce a modular framework for specifying distributed systems that we call AMECOS. Specifically, our framework departs from the traditional use of sequential specification, which presents limitations both on the specification expressiveness and implementation efficiency of inherently concurrent objects, as documented by Casta{ñ}eda, Rajsbaum and Raynal in CACM 2023. Our framework focuses on the interface between the various system components specified as concurrent objects. Interactions are described with sequences of object events. This provides a modular way of specifying distributed systems and separates legality (object semantics) from other issues, such as consistency. We demonstrate the usability of our framework by (i) specifying various well-known concurrent objects, such as shared memory, asynchronous message-passing, and reliable broadcast, (ii) providing hierarchies of ordering semantics (namely, consistency hierarchy, memory hierarchy, and reliable broadcast hierarchy), and (iii) presenting novel axiomatic proofs of the impossibility of the well-known Consensus and wait-free Set Agreement problems.
- [2] arXiv:2405.10058 [pdf, ps, html, other]
-
Title: Distributed Coloring in the SLEEPING ModelSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
In distributed network computing, a variant of the LOCAL model has been recently introduced, referred to as the SLEEPING model. In this model, nodes have the ability to decide on which round they are awake, and on which round they are sleeping. Two (adjacent) nodes can exchange messages in a round only if both of them are awake in that round. The SLEEPING model captures the ability of nodes to save energy when they are sleeping. In this framework, a major question is the following: is it possible to design algorithms that are energy efficient, i.e., where each node is awake for a few number of rounds only, without losing too much on the time efficiency, i.e., on the total number of rounds? This paper answers positively to this question, for one of the most fundamental problems in distributed network computing, namely $(\Delta+1)$-coloring networks of maximum degree $\Delta$. We provide a randomized algorithm with average awake-complexity constant, maximum awake-complexity $O(\log\log n)$ in $n$-node networks, and round-complexity $poly\!\log n$.
- [3] arXiv:2405.10249 [pdf, ps, html, other]
-
Title: Unifying Partial SynchronySubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The distributed computing literature considers multiple options for modeling communication. Most simply, communication is categorized as either synchronous or asynchronous. Synchronous communication assumes that messages get delivered within a publicly known timeframe and that parties' clocks are synchronized. Asynchronous communication, on the other hand, only assumes that messages get delivered eventually. A more nuanced approach, or a middle ground between the two extremes, is given by the partially synchronous model, which is arguably the most realistic option. This model comes in two commonly considered flavors:
(i) The Global Stabilization Time (GST) model: after an (unknown) amount of time, the network becomes synchronous. This captures scenarios where network issues are transient.
(ii) The Unknown Latency (UL) model: the network is, in fact, synchronous, but the message delay bound is unknown.
This work formally establishes that any time-agnostic property that can be achieved by a protocol in the UL model can also be achieved by a (possibly different) protocol in the GST model. By time-agnostic, we mean properties that can depend on the order in which events happen but not on time as measured by the parties. Most properties considered in distributed computing are time-agnostic. The converse was already known, even without the time-agnostic requirement, so our result shows that the two network conditions are, under one sensible assumption, equally demanding.
New submissions for Friday, 17 May 2024 (showing 3 of 3 entries )
- [4] arXiv:2405.09746 (cross-list from cs.IT) [pdf, ps, html, other]
-
Title: Algebraic Geometric Rook Codes for Coded Distributed ComputingComments: 6 pagesSubjects: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Discrete Mathematics (cs.DM); Algebraic Geometry (math.AG)
We extend coded distributed computing over finite fields to allow the number of workers to be larger than the field size. We give codes that work for fully general matrix multiplication and show that in this case we serendipitously have that all functions can be computed in a distributed fault-tolerant fashion over finite fields. This generalizes previous results on the topic. We prove that the associated codes achieve a recovery threshold similar to the ones for characteristic zero fields but now with a factor that is proportional to the genus of the underlying function field. In particular, we have that the recovery threshold of these codes is proportional to the classical complexity of matrix multiplication by a factor of at most the genus.
- [5] arXiv:2405.09892 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Balancing Similarity and Complementarity for Federated LearningKunda Yan, Sen Cui, Abudukelimu Wuerkaixi, Jingfeng Zhang, Bo Han, Gang Niu, Masashi Sugiyama, Changshui ZhangSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
In mobile and IoT systems, Federated Learning (FL) is increasingly important for effectively using data while maintaining user privacy. One key challenge in FL is managing statistical heterogeneity, such as non-i.i.d. data, arising from numerous clients and diverse data sources. This requires strategic cooperation, often with clients having similar characteristics. However, we are interested in a fundamental question: does achieving optimal cooperation necessarily entail cooperating with the most similar clients? Typically, significant model performance improvements are often realized not by partnering with the most similar models, but through leveraging complementary data. Our theoretical and empirical analyses suggest that optimal cooperation is achieved by enhancing complementarity in feature distribution while restricting the disparity in the correlation between features and targets. Accordingly, we introduce a novel framework, \texttt{FedSaC}, which balances similarity and complementarity in FL cooperation. Our framework aims to approximate an optimal cooperation network for each client by optimizing a weighted sum of model similarity and feature complementarity. The strength of \texttt{FedSaC} lies in its adaptability to various levels of data heterogeneity and multimodal scenarios. Our comprehensive unimodal and multimodal experiments demonstrate that \texttt{FedSaC} markedly surpasses other state-of-the-art FL methods.
- [6] arXiv:2405.09903 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Federated Learning for Misbehaviour Detection with Variational Autoencoders and Gaussian Mixture ModelsComments: 13 pages, 11 figures, 3 tablesSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Learning (FL) has become an attractive approach to collaboratively train Machine Learning (ML) models while data sources' privacy is still preserved. However, most of existing FL approaches are based on supervised techniques, which could require resource-intensive activities and human intervention to obtain labelled datasets. Furthermore, in the scope of cyberattack detection, such techniques are not able to identify previously unknown threats. In this direction, this work proposes a novel unsupervised FL approach for the identification of potential misbehavior in vehicular environments. We leverage the computing capabilities of public cloud services for model aggregation purposes, and also as a central repository of misbehavior events, enabling cross-vehicle learning and collective defense strategies. Our solution integrates the use of Gaussian Mixture Models (GMM) and Variational Autoencoders (VAE) on the VeReMi dataset in a federated environment, where each vehicle is intended to train only with its own data. Furthermore, we use Restricted Boltzmann Machines (RBM) for pre-training purposes, and Fedplus as aggregation function to enhance model's convergence. Our approach provides better performance (more than 80 percent) compared to recent proposals, which are usually based on supervised techniques and artificial divisions of the VeReMi dataset.
- [7] arXiv:2405.09975 (cross-list from cs.DS) [pdf, ps, html, other]
-
Title: Distributed Delta-Coloring under Bandwidth LimitationsSubjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)
We consider the problem of coloring graphs of maximum degree $\Delta$ with $\Delta$ colors in the distributed setting with limited bandwidth. Specifically, we give a $\mathsf{poly}\log\log n$-round randomized algorithm in the CONGEST model. This is close to the lower bound of $\Omega(\log \log n)$ rounds from [Brandt et al., STOC '16], which holds also in the more powerful LOCAL model. The core of our algorithm is a reduction to several special instances of the constructive Lovász local lemma (LLL) and the $deg+1$-list coloring problem.
- [8] arXiv:2405.10096 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: The Effect of Quantization in Federated Learning: A R\'enyi Differential Privacy PerspectiveComments: 6 pages, 5 figures, submitted to 2024 IEEE MeditComSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Learning (FL) is an emerging paradigm that holds great promise for privacy-preserving machine learning using distributed data. To enhance privacy, FL can be combined with Differential Privacy (DP), which involves adding Gaussian noise to the model weights. However, FL faces a significant challenge in terms of large communication overhead when transmitting these model weights. To address this issue, quantization is commonly employed. Nevertheless, the presence of quantized Gaussian noise introduces complexities in understanding privacy protection. This research paper investigates the impact of quantization on privacy in FL systems. We examine the privacy guarantees of quantized Gaussian mechanisms using Rényi Differential Privacy (RDP). By deriving the privacy budget of quantized Gaussian mechanisms, we demonstrate that lower quantization bit levels provide improved privacy protection. To validate our theoretical findings, we employ Membership Inference Attacks (MIA), which gauge the accuracy of privacy leakage. The numerical results align with our theoretical analysis, confirming that quantization can indeed enhance privacy protection. This study not only enhances our understanding of the correlation between privacy and communication in FL but also underscores the advantages of quantization in preserving privacy.
- [9] arXiv:2405.10123 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Asynchronous Federated Stochastic Optimization with Exact Averaging for Heterogeneous Local ObjectivesSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated learning (FL) was recently proposed to securely train models with data held over multiple locations ("clients") under the coordination of a central server. Two major challenges hindering the performance of FL algorithms are long training times caused by straggling clients and a decrease in training accuracy induced by non-iid local distributions ("client drift"). In this work we propose and analyze AREA, a new stochastic (sub)gradient algorithm that is robust to client drift and utilizes asynchronous communication to speed up convergence in the presence of stragglers. Moreover, AREA is, to the best of our knowledge, the first method that is both guaranteed to converge under arbitrarily long delays, and converges to an error neighborhood whose size depends only on the variance of the stochastic (sub)gradients used and thus is independent of both the heterogeneity between the local datasets and the length of client delays, without the use of delay-adaptive stepsizes. Our numerical results confirm our theoretical analysis and suggest that AREA outperforms state-of-the-art methods when local data are highly non-iid.
- [10] arXiv:2405.10271 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Automated Federated Learning via Informed PruningChristian Internò, Elena Raponi, Niki van Stein, Thomas Bäck, Markus Olhofer, Yaochu Jin, Barbara HammerSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Federated learning (FL) represents a pivotal shift in machine learning (ML) as it enables collaborative training of local ML models coordinated by a central aggregator, all without the need to exchange local data. However, its application on edge devices is hindered by limited computational capabilities and data communication challenges, compounded by the inherent complexity of Deep Learning (DL) models. Model pruning is identified as a key technique for compressing DL models on devices with limited resources. Nonetheless, conventional pruning techniques typically rely on manually crafted heuristics and demand human expertise to achieve a balance between model size, speed, and accuracy, often resulting in sub-optimal solutions.
In this study, we introduce an automated federated learning approach utilizing informed pruning, called AutoFLIP, which dynamically prunes and compresses DL models within both the local clients and the global server. It leverages a federated loss exploration phase to investigate model gradient behavior across diverse datasets and losses, providing insights into parameter significance. Our experiments showcase notable enhancements in scenarios with strong non-IID data, underscoring AutoFLIP's capacity to tackle computational constraints and achieve superior global convergence.
Cross submissions for Friday, 17 May 2024 (showing 7 of 7 entries )
- [11] arXiv:2306.04421 (replaced) [pdf, ps, html, other]
-
Title: The Renoir Dataflow Platform: Efficient Data Processing without ComplexityComments: Submitted to Future Generation Computer Systems, December 2023Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Today, data analysis drives the decision-making process in virtually every human activity. This demands for software platforms that offer simple programming abstractions to express data analysis tasks and that can execute them in an efficient and scalable way. State-of-the-art solutions range from low-level programming primitives, which give control to the developer about communication and resource usage, but require significant effort to develop and optimize new algorithms, to high-level platforms that hide most of the complexities of parallel and distributed processing, but often at the cost of reduced efficiency. To reconcile these requirements, we developed Renoir, a novel distributed data processing platform written in Rust. Renoir provides a high-level dataflow programming model as mainstream data processing systems. It supports static and streaming data, it enables data transformations, grouping, aggregation, iterative computations, and time-based analytics, incurring in a low overhead. This paper presents In this paper, we present the programming model and the implementation details of Renoir. We evaluate it under heterogeneous workloads. We compare it with state-of-the-art solutions for data analysis and high-performance computing, as well as alternative research products, which offer different programming abstractions and implementation strategies. Renoir programs are compact and easy to write: developers need not care about low-level concerns such as resource usage, data serialization, concurrency control, and communication. Renoir consistently presents comparable or better performance than competing solutions, by a large margin in several scenarios. We conclude that Renoir offers a good tradeoff between simplicity and performance, allowing developers to easily express complex data analysis tasks and achieve high performance and scalability.
- [12] arXiv:2403.09347 (replaced) [pdf, ps, html, other]
-
Title: BurstAttention: An Efficient Distributed Attention Framework for Extremely Long SequencesComments: 13 pages, 7 figuresSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named ``BurstAttention'' to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing 40% communication overheads and achieving 1.37 X speedup during training 128K sequence length on 32 X A100.
- [13] arXiv:2404.15888 (replaced) [pdf, ps, other]
-
Title: Near-Optimal Wafer-Scale ReduceComments: To appear at HPDC 2024Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.
- [14] arXiv:2311.07377 (replaced) [pdf, ps, html, other]
-
Title: Testing learning-enabled cyber-physical systems with Large-Language Models: A Formal ApproachXi Zheng, Aloysius K. Mok, Ruzica Piskac, Yong Jae Lee, Bhaskar Krishnamachari, Dakai Zhu, Oleg Sokolsky, Insup LeeSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Robotics (cs.RO)
The integration of machine learning (ML) into cyber-physical systems (CPS) offers significant benefits, including enhanced efficiency, predictive capabilities, real-time responsiveness, and the enabling of autonomous operations. This convergence has accelerated the development and deployment of a range of real-world applications, such as autonomous vehicles, delivery drones, service robots, and telemedicine procedures. However, the software development life cycle (SDLC) for AI-infused CPS diverges significantly from traditional approaches, featuring data and learning as two critical components. Existing verification and validation techniques are often inadequate for these new paradigms. In this study, we pinpoint the main challenges in ensuring formal safety for learningenabled CPS.We begin by examining testing as the most pragmatic method for verification and validation, summarizing the current state-of-the-art methodologies. Recognizing the limitations in current testing approaches to provide formal safety guarantees, we propose a roadmap to transition from foundational probabilistic testing to a more rigorous approach capable of delivering formal assurance.
- [15] arXiv:2401.11592 (replaced) [pdf, ps, html, other]
-
Title: Differentially-Private Hierarchical Federated LearningSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
While federated learning (FL) eliminates the transmission of raw data over a network, it is still vulnerable to privacy breaches from the communicated model parameters. In this work, we propose \underline{H}ierarchical \underline{F}ederated Learning with \underline{H}ierarchical \underline{D}ifferential \underline{P}rivacy ({\tt H$^2$FDP}), a DP-enhanced FL methodology for jointly optimizing privacy and performance in hierarchical networks. Building upon recent proposals for Hierarchical Differential Privacy (HDP), one of the key concepts of {\tt H$^2$FDP} is adapting DP noise injection at different layers of an established FL hierarchy -- edge devices, edge servers, and cloud servers -- according to the trust models within particular subnetworks. We conduct a comprehensive analysis of the convergence behavior of {\tt H$^2$FDP}, revealing conditions on parameter tuning under which the training process converges sublinearly to a finite stationarity gap that depends on the network hierarchy, trust model, and target privacy level.
Leveraging these relationships, we develop an adaptive control algorithm for {\tt H$^2$FDP} that tunes properties of local model training to minimize communication energy, latency, and the stationarity gap while striving to maintain a sub-linear convergence rate and meet desired privacy criteria.
Subsequent numerical evaluations demonstrate that {\tt H$^2$FDP} obtains substantial improvements in these metrics over baselines for different privacy budgets, and validate the impact of different system configurations. - [16] arXiv:2404.19725 (replaced) [pdf, ps, html, other]
-
Title: Fairness Without Demographics in Human-Centered Federated LearningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated learning (FL) enables collaborative model training while preserving data privacy, making it suitable for decentralized human-centered AI applications. However, a significant research gap remains in ensuring fairness in these systems. Current fairness strategies in FL require knowledge of bias-creating/sensitive attributes, clashing with FL's privacy principles. Moreover, in human-centered datasets, sensitive attributes may remain latent. To tackle these challenges, we present a novel bias mitigation approach inspired by "Fairness without Demographics" in machine learning. The presented approach achieves fairness without needing knowledge of sensitive attributes by minimizing the top eigenvalue of the Hessian matrix during training, ensuring equitable loss landscapes across FL participants. Notably, we introduce a novel FL aggregation scheme that promotes participating models based on error rates and loss landscape curvature attributes, fostering fairness across the FL system. This work represents the first approach to attaining "Fairness without Demographics" in human-centered FL. Through comprehensive evaluation, our approach demonstrates effectiveness in balancing fairness and efficacy across various real-world applications, FL setups, and scenarios involving single and multiple bias-inducing factors, representing a significant advancement in human-centered FL.
- [17] arXiv:2405.07601 (replaced) [pdf, ps, html, other]
-
Title: On-device Online Learning and Semantic Management of TinyML SystemsComments: Accepted by Journal Transactions on Embedded Computing Systems (TECS)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Recent advances in Tiny Machine Learning (TinyML) empower low-footprint embedded devices for real-time on-device Machine Learning. While many acknowledge the potential benefits of TinyML, its practical implementation presents unique challenges. This study aims to bridge the gap between prototyping single TinyML models and developing reliable TinyML systems in production: (1) Embedded devices operate in dynamically changing conditions. Existing TinyML solutions primarily focus on inference, with models trained offline on powerful machines and deployed as static objects. However, static models may underperform in the real world due to evolving input data distributions. We propose online learning to enable training on constrained devices, adapting local models towards the latest field conditions. (2) Nevertheless, current on-device learning methods struggle with heterogeneous deployment conditions and the scarcity of labeled data when applied across numerous devices. We introduce federated meta-learning incorporating online learning to enhance model generalization, facilitating rapid learning. This approach ensures optimal performance among distributed devices by knowledge sharing. (3) Moreover, TinyML's pivotal advantage is widespread adoption. Embedded devices and TinyML models prioritize extreme efficiency, leading to diverse characteristics ranging from memory and sensors to model architectures. Given their diversity and non-standardized representations, managing these resources becomes challenging as TinyML systems scale up. We present semantic management for the joint management of models and devices at scale. We demonstrate our methods through a basic regression example and then assess them in three real-world TinyML applications: handwritten character image classification, keyword audio classification, and smart building presence detection, confirming our approaches' effectiveness.