Distributed, Parallel, and Cluster Computing
See recent articles
Showing new listings for Tuesday, 1 October 2024
- [1] arXiv:2409.18972 [pdf, html, other]
-
Title: Instance Configuration for Sustainable Job Shop SchedulingComments: 26th International Workshop on Configuration (ConfWS 2024)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
The Job Shop Scheduling Problem (JSP) is a pivotal challenge in operations research and is essential for evaluating the effectiveness and performance of scheduling algorithms. Scheduling problems are a crucial domain in combinatorial optimization, where resources (machines) are allocated to job tasks to minimize the completion time (makespan) alongside other objectives like energy consumption. This research delves into the intricacies of JSP, focusing on optimizing performance metrics and minimizing energy consumption while considering various constraints such as deadlines and release dates. Recognizing the multi-dimensional nature of benchmarking in JSP, this study underscores the significance of reference libraries and datasets like JSPLIB in enriching algorithm evaluation. The research highlights the importance of problem instance characteristics, including job and machine numbers, processing times, and machine availability, emphasizing the complexities introduced by energy consumption considerations.
An innovative instance configurator is proposed, equipped with parameters such as the number of jobs, machines, tasks, and speeds, alongside distributions for processing times and energy consumption. The generated instances encompass various configurations, reflecting real-world scenarios and operational constraints. These instances facilitate comprehensive benchmarking and evaluation of scheduling algorithms, particularly in contexts of energy efficiency. A comprehensive set of 500 test instances has been generated and made publicly available, promoting further research and benchmarking in JSP. These instances enable robust analyses and foster collaboration in developing advanced, energy-efficient scheduling solutions by providing diverse scenarios. - [2] arXiv:2409.19286 [pdf, html, other]
-
Title: IM: Optimizing Byzantine Consensus for High-Performance Distributed NetworksQingming Zeng (Harbin Institute of Technology, Shenzhen), Mo Li (The Chinese University of Hongkong, Shenzhen), Ximing Fu (Harbin Institute of Technology, Shenzhen), Chuanyi Liu (Harbin Institute of Technology, Shenzhen, Peng Cheng Laboratory, Shenzhen), Hui Jiang (Tsinghua University, Baidu Inc)Comments: 16 pages, 5 figuresSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Byzantine Fault Tolerant (BFT) consensus, a crucial component of blockchains, has made significant advancements. However, the efficiency of existing protocols can still be damaged by certain attacks from faulty nodes and network instability. In this paper, we propose a novel Shared Mempool (SMP) protocol, namely IM, that enhances performance under these attacks. Technically, IM organizing microblocks into chains, combined with coding techniques, achieves totality and availability efficiently. IM can be easily integrated into a BFT protocol. We take Fast-HotStuff as an example and obtain the IM-FHS with guarantees of \emph{order keeping}, \emph{bandwidth adaptability} and \emph{over-distribution resistance}. IM-FHS is conducted in a system with up to 256 nodes, and experimental results validate the efficiency of our approach. IM-FHS achieves higher throughput and smaller latency with faulty nodes than Stratus-FHS, the state-of-the-art protocol, and the throughput gain increases as the number of fault nodes. In a system with 100 nodes with 33 faulty nodes, IM-FHS achieves 9 times the throughput of Stratus-FHS while maintaining 1/10 the latency when dealing with maximum resilience against faulty nodes.
- [3] arXiv:2409.19389 [pdf, other]
-
Title: Co-design of a novel CMOS highly parallel, low-power, multi-chip neural network acceleratorComments: neural network accelerator, low-power design, instruction set design, parallel processors, digital twinJournal-ref: IEEE Microelectronics Design & Test Symposium (MDTS 2024) https://ieeexplore.ieee.org/document/10570137Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Why do security cameras, sensors, and siri use cloud servers instead of on-board computation? The lack of very-low-power, high-performance chips greatly limits the ability to field untethered edge devices. We present the NV-1, a new low-power ASIC AI processor that greatly accelerates parallel processing (> 10X) with dramatic reduction in energy consumption (> 100X), via many parallel combined processor-memory units, i.e., a drastically non-von-Neumann architecture, allowing very large numbers of independent processing streams without bottlenecks due to typical monolithic memory. The current initial prototype fab arises from a successful co-development effort between algorithm- and software-driven architectural design and VLSI design realities. An innovative communication protocol minimizes power usage, and data transport costs among nodes were vastly reduced by eliminating the address bus, through local target address matching. Throughout the development process, the software and architecture teams were able to innovate alongside the circuit design team's implementation effort. A digital twin of the proposed hardware was developed early on to ensure that the technical implementation met the architectural specifications, and indeed the predicted performance metrics have now been thoroughly verified in real hardware test data. The resulting device is currently being used in a fielded edge sensor application; additional proofs of principle are in progress demonstrating the proof on the ground of this new real-world extremely low-power high-performance ASIC device.
- [4] arXiv:2409.19488 [pdf, html, other]
-
Title: A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via FaroComments: 13 pages, 16 figures, To appear in Eurosys 2025Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
This paper tackles the challenge of running multiple ML inference jobs (models) under time-varying workloads, on a constrained on-premises production cluster. Our system Faro takes in latency Service Level Objectives (SLOs) for each job, auto-distills them into utility functions, "sloppifies" these utility functions to make them amenable to mathematical optimization, automatically predicts workload via probabilistic prediction, and dynamically makes implicit cross-job resource allocations, in order to satisfy cluster-wide objectives, e.g., total utility, fairness, and other hybrid variants. A major challenge Faro tackles is that using precise utilities and high-fidelity predictors, can be too slow (and in a sense too precise!) for the fast adaptation we require. Faro's solution is to "sloppify" (relax) its multiple design components to achieve fast adaptation without overly degrading solution quality. Faro is implemented in a stack consisting of Ray Serve running atop a Kubernetes cluster. Trace-driven cluster deployments show that Faro achieves 2.3$\times$-23$\times$ lower SLO violations compared to state-of-the-art systems.
- [5] arXiv:2409.19564 [pdf, html, other]
-
Title: Hamster: A Fast Synchronous Byzantine Fault Tolerance ProtocolSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
This paper introduces Hamster, a novel synchronous Byzantine Fault Tolerance protocol that achieves better performance and has weaker dependency on synchrony. Specifically, Hamster employs coding techniques to significantly decrease communication complexity and addresses coding related security issues. Consequently, Hamster achieves a throughput gain that increases linearly with the number of nodes, compared to Sync HotStuff. By adjusting the block size, Hamster outperforms Sync HotStuff in terms of both throughput and latency. Moreover, With minor modifications, Hamster can also function effectively in mobile sluggish environments, further reducing its dependency on strict synchrony. We implement Hamster and the experimental results demonstrate its performance advantages. Specifically, Hamster's throughput in a network of $9$ nodes is $2.5\times$ that of Sync HotStuff, and this gain increases to $10$ as the network scales to $65$ nodes.
- [6] arXiv:2409.19653 [pdf, other]
-
Title: Data-Centric Design: Introducing An Informatics Domain Model And Core Data Ontology For Computational SystemsComments: Proceedings of the 9th International Conference on Data Mining & Knowledge Management (DaKM 2024)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The Core Data Ontology (CDO) and the Informatics Domain Model represent a transformative approach to computational systems, shifting from traditional node-centric designs to a data-centric paradigm. This paper introduces a framework where data is categorized into four modalities: objects, events, concepts, and actions. This quadrimodal structure enhances data security, semantic interoperability, and scalability across distributed data ecosystems. The CDO offers a comprehensive ontology that supports AI development, role-based access control, and multimodal data management. By focusing on the intrinsic value of data, the Informatics Domain Model redefines system architectures to prioritize data security, provenance, and auditability, addressing vulnerabilities in current models. The paper outlines the methodology for developing the CDO, explores its practical applications in fields such as AI, robotics, and legal compliance, and discusses future directions for scalable, decentralized, and interoperable data ecosystems.
- [7] arXiv:2409.19869 [pdf, other]
-
Title: Edge Intelligence in Satellite-Terrestrial Networks with Hybrid Quantum ComputingSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
This paper exploits the potential of edge intelligence empowered satellite-terrestrial networks, where users' computation tasks are offloaded to the satellites or terrestrial base stations. The computation task offloading in such networks involves the edge cloud selection and bandwidth allocations for the access and backhaul links, which aims to minimize the energy consumption under the delay and satellites' energy constraints. To address it, an alternating direction method of multipliers (ADMM)-inspired algorithm is proposed to decompose the joint optimization problem into small-scale subproblems. Moreover, we develop a hybrid quantum double deep Q-learning (DDQN) approach to optimize the edge cloud selection. This novel deep reinforcement learning architecture enables that classical and quantum neural networks process information in parallel. Simulation results confirm the efficiency of the proposed algorithm, and indicate that duality gap is tiny and a larger reward can be generated from a few data points compared to the classical DDQN.
- [8] arXiv:2409.20148 [pdf, html, other]
-
Title: Pragma driven shared memory parallelism in Zig by supporting OpenMP loop directivesComments: Accepted to the the Tenth Annual Workshop on the LLVM Compiler Infrastructure in HPCSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
The Zig programming language, which is designed to provide performance and safety as first class concerns, has become popular in recent years. Given that Zig is built upon LLVM, and-so enjoys many of the benefits provided by the ecosystem, including access to a rich set of backends, Zig has significant potential for high performance workloads. However, it is yet to gain acceptance in HPC and one of the reasons for this is that support for the pragma driven shared memory parallelism is missing.
In this paper we describe enhancing the Zig compiler to add support for OpenMP loop directives. Then exploring performance using NASA's NAS Parallel Benchmark (NPB) suite. We demonstrate that not only does our integration of OpenMP with Zig scale comparatively to Fortran and C reference implementations of NPB, but furthermore Zig provides up to a 1.25 times performance increase compared to Fortran. - [9] arXiv:2409.20247 [pdf, html, other]
-
Title: Resource Allocation for Stable LLM Training in Mobile Edge ComputingComments: This paper appears in the 2024 International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Systems and Control (eess.SY); Optimization and Control (math.OC)
As mobile devices increasingly become focal points for advanced applications, edge computing presents a viable solution to their inherent computational limitations, particularly in deploying large language models (LLMs). However, despite the advancements in edge computing, significant challenges remain in efficient training and deploying LLMs due to the computational demands and data privacy concerns associated with these models. This paper explores a collaborative training framework that integrates mobile users with edge servers to optimize resource allocation, thereby enhancing both performance and efficiency. Our approach leverages parameter-efficient fine-tuning (PEFT) methods, allowing mobile users to adjust the initial layers of the LLM while edge servers handle the more demanding latter layers. Specifically, we formulate a multi-objective optimization problem to minimize the total energy consumption and delay during training. We also address the common issue of instability in model performance by incorporating stability enhancements into our objective function. Through novel fractional programming technique, we achieve a stationary point for the formulated problem. Simulations demonstrate that our method reduces the energy consumption as well as the latency, and increases the reliability of LLMs across various mobile settings.
- [10] arXiv:2409.20476 [pdf, html, other]
-
Title: Intel(R) SHMEM: GPU-initiated OpenSHMEM using SYCLSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Modern high-end systems are increasingly becoming heterogeneous, providing users options to use general purpose Graphics Processing Units (GPU) and other accelerators for additional performance. High Performance Computing (HPC) and Artificial Intelligence (AI) applications are often carefully arranged to overlap communications and computation for increased efficiency on such platforms. This has led to efforts to extend popular communication libraries to support GPU awareness and more recently, GPU-initiated operations. In this paper, we present Intel SHMEM, a library that enables users to write programs that are GPU aware, in that API calls support GPU memory, and also support GPU-initiated communication operations by embedding OpenSHMEM style calls within GPU kernels. We also propose thread-collaborative extensions to the OpenSHMEM standard that can enable users to better exploit the strengths of GPUs. Our implementation adapts to choose between direct load/store from GPU and the GPU copy engine based transfer to optimize performance on different configurations.
New submissions (showing 10 of 10 entries)
- [11] arXiv:2409.19032 (cross-list from cs.CR) [pdf, html, other]
-
Title: A Systematisation of Knowledge: Connecting European Digital Identities with Web3Comments: Conference paperJournal-ref: 2024 IEEE International Conference on Blockchain (Blockchain), Copenhagen, Denmark, 2024, pp. 605-610Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
The terms self-sovereign identity (SSI) and decentralised identity are often used interchangeably, which results in increasing ambiguity when solutions are being investigated and compared. This article aims to provide a clear distinction between the two concepts in relation to the revised Regulation as Regards establishing the European Digital Identity Framework (eIDAS 2.0) by providing a systematisation of knowledge of technological developments that led up to implementation of eIDAS 2.0. Applying an inductive exploratory approach, relevant literature was selected iteratively in waves over a nine months time frame and covers literature between 2005 and 2024. The review found that the decentralised identity sector emerged adjacent to the OpenID Connect (OIDC) paradigm of Open Authentication, whereas SSI denotes the sector's shift towards blockchain-based solutions. In this study, it is shown that the interchangeable use of SSI and decentralised identity coincides with novel protocols over OIDC. While the first part of this paper distinguishes OIDC from decentralised identity, the second part addresses the incompatibility between OIDC under eIDAS 2.0 and Web3. The paper closes by suggesting further research for establishing a digital identity bridge for connecting applications on public-permissionless ledgers with data originating from eIDAS 2.0 and being presented using OIDC.
- [12] arXiv:2409.19256 (cross-list from cs.LG) [pdf, html, other]
-
Title: HybridFlow: A Flexible and Efficient RLHF FrameworkGuangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan WuSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53$\times$~20.57$\times$ throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code is available at this https URL.
- [13] arXiv:2409.19302 (cross-list from cs.CR) [pdf, html, other]
-
Title: Leveraging MTD to Mitigate Poisoning Attacks in Decentralized FL with Non-IID DataChao Feng, Alberto Huertas Celdrán, Zien Zeng, Zi Ye, Jan von der Assen, Gerome Bovet, Burkhard StillerSubjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Decentralized Federated Learning (DFL), a paradigm for managing big data in a privacy-preserved manner, is still vulnerable to poisoning attacks where malicious clients tamper with data or models. Current defense methods often assume Independently and Identically Distributed (IID) data, which is unrealistic in real-world applications. In non-IID contexts, existing defensive strategies face challenges in distinguishing between models that have been compromised and those that have been trained on heterogeneous data distributions, leading to diminished efficacy. In response, this paper proposes a framework that employs the Moving Target Defense (MTD) approach to bolster the robustness of DFL models. By continuously modifying the attack surface of the DFL system, this framework aims to mitigate poisoning attacks effectively. The proposed MTD framework includes both proactive and reactive modes, utilizing a reputation system that combines metrics of model similarity and loss, alongside various defensive techniques. Comprehensive experimental evaluations indicate that the MTD-based mechanism significantly mitigates a range of poisoning attack types across multiple datasets with different topologies.
- [14] arXiv:2409.19509 (cross-list from cs.LG) [pdf, html, other]
-
Title: Heterogeneity-Aware Resource Allocation and Topology Design for Hierarchical Federated Edge LearningComments: 12 pages, 9 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Learning (FL) provides a privacy-preserving framework for training machine learning models on mobile edge devices. Traditional FL algorithms, e.g., FedAvg, impose a heavy communication workload on these devices. To mitigate this issue, Hierarchical Federated Edge Learning (HFEL) has been proposed, leveraging edge servers as intermediaries for model aggregation. Despite its effectiveness, HFEL encounters challenges such as a slow convergence rate and high resource consumption, particularly in the presence of system and data heterogeneity. However, existing works are mainly focused on improving training efficiency for traditional FL, leaving the efficiency of HFEL largely unexplored. In this paper, we consider a two-tier HFEL system, where edge devices are connected to edge servers and edge servers are interconnected through peer-to-peer (P2P) edge backhauls. Our goal is to enhance the training efficiency of the HFEL system through strategic resource allocation and topology design. Specifically, we formulate an optimization problem to minimize the total training latency by allocating the computation and communication resources, as well as adjusting the P2P connections. To ensure convergence under dynamic topologies, we analyze the convergence error bound and introduce a model consensus constraint into the optimization problem. The proposed problem is then decomposed into several subproblems, enabling us to alternatively solve it online. Our method facilitates the efficient implementation of large-scale FL at edge networks under data and system heterogeneity. Comprehensive experiment evaluation on benchmark datasets validates the effectiveness of the proposed method, demonstrating significant reductions in training latency while maintaining the model accuracy compared to various baselines.
- [15] arXiv:2409.19756 (cross-list from cs.CR) [pdf, other]
-
Title: Advances in Privacy Preserving Federated Learning to Realize a Truly Learning Healthcare SystemSubjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
The concept of a learning healthcare system (LHS) envisions a self-improving network where multimodal data from patient care are continuously analyzed to enhance future healthcare outcomes. However, realizing this vision faces significant challenges in data sharing and privacy protection. Privacy-Preserving Federated Learning (PPFL) is a transformative and promising approach that has the potential to address these challenges by enabling collaborative learning from decentralized data while safeguarding patient privacy. This paper proposes a vision for integrating PPFL into the healthcare ecosystem to achieve a truly LHS as defined by the Institute of Medicine (IOM) Roundtable.
- [16] arXiv:2409.20123 (cross-list from cs.CR) [pdf, html, other]
-
Title: DBNode: A Decentralized Storage System for Big Data Storage in Consortium BlockchainsSubjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Storing big data directly on a blockchain poses a substantial burden due to the need to maintain a consistent ledger across all nodes. Numerous studies in decentralized storage systems have been conducted to tackle this particular challenge. Most state-of-the-art research concentrates on developing a general storage system that can accommodate diverse blockchain categories. However, it is essential to recognize the unique attributes of a consortium blockchain, such as data privacy and access control. Beyond ensuring high performance, these specific needs are often overlooked by general storage systems. This paper proposes a decentralized storage system for Hyperledger Fabric, which is a well-known consortium blockchain. First, we employ erasure coding to partition files, subsequently organizing these chunks into a hierarchical structure that fosters efficient and dependable data storage. Second, we design a two-layer hash-slots mechanism and a mirror strategy, enabling high data availability. Third, we design an access control mechanism based on a smart contract to regulate file access.
- [17] arXiv:2409.20135 (cross-list from cs.LG) [pdf, html, other]
-
Title: Federated Instruction Tuning of LLMs with Domain Coverage AugmentationSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Domain-specific Instruction Tuning (FedDIT) leverages a few cross-client private data and server-side public data for instruction augmentation, enhancing model performance in specific domains. While the factors affecting FedDIT remain unclear and existing instruction augmentation methods mainly focus on the centralized setting without considering the distributed environment. Firstly, our experiments show that cross-client domain coverage, rather than data heterogeneity, drives model performance in FedDIT. Thus, we propose FedDCA, which maximizes domain coverage through greedy client center selection and retrieval-based augmentation. To reduce client-side computation, FedDCA$^*$ uses heterogeneous encoders with server-side feature alignment. Extensive experiments across four domains (code, medical, financial, and mathematical) validate the effectiveness of both methods. Additionally, we explore the privacy protection against memory extraction attacks with various amounts of public data and results show that there is no significant correlation between the amount of public data and the privacy-preserving capability. However, as the fine-tuning round increases, the risk of privacy leakage reduces or converges.
Cross submissions (showing 7 of 7 entries)
- [18] arXiv:2207.07493 (replaced) [pdf, html, other]
-
Title: Communication-Efficient Diffusion Strategy for Performance Improvement of Federated Learning with Non-IID DataSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
In 6G mobile communication systems, various AI-based network functions and applications have been standardized. Federated learning (FL) is adopted as the core learning architecture for 6G systems to avoid privacy leakage from mobile user data. However, in FL, users with non-independent and identically distributed (non-IID) datasets can deteriorate the performance of the global model because the convergence direction of the gradient for each dataset is different, thereby inducing a weight divergence problem. To address this problem, we propose a novel diffusion strategy for machine learning (ML) models (FedDif) to maximize the performance of the global model with non-IID data. FedDif enables the local model to learn different distributions before parameter aggregation by passing the local models through users via device-to-device communication. Furthermore, we theoretically demonstrate that FedDif can circumvent the weight-divergence problem. Based on this theory, we propose a communication-efficient diffusion strategy for ML models that can determine the trade-off between learning performance and communication cost using auction theory. The experimental results show that FedDif improves the top-1 test accuracy by up to 34.89\% and reduces communication costs by 14.6% to a maximum of 63.49%.
- [19] arXiv:2304.08917 (replaced) [pdf, html, other]
-
Title: Coefficient Synthesis for Threshold AutomataSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Threshold automata are a formalism for modeling fault-tolerant distributed algorithms. The main feature of threshold automata is the notion of a threshold guard, which allows us to compare the number of received messages with the total number of different types of processes. In this paper, we consider the coefficient synthesis problem for threshold automata, in which we are given a sketch of a threshold automaton (with some of the constants in the threshold guards left unspecified) and a violation describing a collection of undesirable behaviors. We then want to synthesize a set of constants which when plugged into the sketch, gives a threshold automaton that does not have the undesirable behaviors. Our main result is that this problem is undecidable, even when the violation is given by a coverability property and the underlying sketch is acyclic.
We then consider the bounded coefficient synthesis problem, in which a bound on the constants to be synthesized is also provided. Though this problem is known to be in the second level of the polynomial hierarchy for coverability properties, the algorithm for this problem involves an exponential-sized encoding of the reachability relation into existential Presburger arithmetic. In this paper, we give a polynomial-sized encoding for this relation. We also provide a tight complexity lower bound for this problem against coverability properties. Finally, motivated by benchmarks appearing from the literature, we also consider a special class of threshold automata and prove that the complexity decreases in this case. - [20] arXiv:2304.10640 (replaced) [pdf, html, other]
-
Title: On the Effects of Data Heterogeneity on the Convergence Rates of Distributed Linear System SolversComments: 16 pages, 6 figuresSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
We consider the problem of solving a large-scale system of linear equations in a distributed or federated manner by a taskmaster and a set of machines, each possessing a subset of the equations. We provide a comprehensive comparison of two well-known classes of algorithms used to solve this problem: projection-based methods and optimization-based methods. First, we introduce a novel geometric notion of data heterogeneity called angular heterogeneity and discuss its generality. Using this notion, we characterize the optimal convergence rates of the most prominent algorithms from each class, capturing the effects of the number of machines, the number of equations, and that of both cross-machine and local data heterogeneity on these rates. Our analysis establishes the superiority of Accelerated Projected Consensus in realistic scenarios with significant data heterogeneity and offers several insights into how angular heterogeneity affects the efficiency of the methods studied. Additionally, we develop distributed algorithms for the efficient computation of the proposed angular heterogeneity metrics. Our extensive numerical analyses validate and complement our theoretical results.
- [21] arXiv:2305.15000 (replaced) [pdf, other]
-
Title: Chasing the Speed of Light: Low-Latency Planetary-Scale Adaptive Byzantine ConsensusComments: 17 pages, accepted at the 25th ACM/IFIP International Middleware Conference 2024 conferenceSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Blockchain technology sparked renewed interest in planetary-scale Byzantine fault-tolerant (BFT) state machine replication (SMR). While recent works predominantly focused on improving the scalability and throughput of these protocols, few of them addressed latency. We present Mercury, a novel transformation to autonomously optimize the latency of quorum-based BFT consensus. Mercury employs a dual resilience threshold that enables faster transaction ordering when the system contains few faulty replicas. Mercury allows forming compact quorums that substantially accelerate consensus using a smaller resilience threshold. Nevertheless, Mercury upholds standard SMR safety and liveness guarantees with optimal resilience, thanks to its judicious use of a dual operation mode and BFT forensics techniques. Our experiments spread tens of replicas across continents and reveal that Mercury can order transactions with finality in less than 0.4 seconds, half the time of a PBFT-like protocol (optimal in terms of number of communication steps and resilience) in the same network. Furthermore, Mercury matches the latency of running its base protocol on theoretically optimal internet links (transmitting at 67% of the speed of light).
- [22] arXiv:2306.04221 (replaced) [pdf, html, other]
-
Title: Dynamic Probabilistic Reliable BroadcastSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Byzantine reliable broadcast is a fundamental primitive in distributed systems that allows a set of processes to agree on a message broadcast by a dedicated process, even when some of them are malicious (Byzantine). It guarantees that no two correct processes deliver different messages, and if a message is delivered by a correct process, every correct process eventually delivers one. Byzantine reliable broadcast protocols are known to scale poorly, as they require $\Omega(n^2)$ message exchanges, where $n$ is the number of system members. The quadratic cost can be explained by the inherent need for every process to relay a message to every other process. In this paper, we explore ways to overcome this limitation, by casting the problem to the probabilistic setting. We propose a solution in which every broadcast message is validated by a small set of witnesses, which allows us to maintain low latency and small communication complexity. In order to tolerate the slow adaptive adversary, we dynamically select the witnesses through a novel stream-local hash function: given a stream of inputs, it generates a stream of output hashed values that adapts to small deviations of the inputs. Our performance analysis shows that the proposed solution exhibits significant scalability gains over state-of-the-art protocols.
- [23] arXiv:2407.11432 (replaced) [pdf, html, other]
-
Title: Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific ComputingHaochen Pan, Ryan Chard, Sicheng Zhou, Alok Kamatar, Rafael Vescovi, Valérie Hayot-Sasson, André Bauer, Maxime Gonthier, Kyle Chard, Ian FosterComments: 12 pages and 8 figures. Camera-ready version for FTXS'24 (this https URL)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Scientific research increasingly relies on distributed computational resources, storage systems, networks, and instruments, ranging from HPC and cloud systems to edge devices. Event-driven architecture (EDA) benefits applications targeting distributed research infrastructures by enabling the organization, communication, processing, reliability, and security of events generated from many sources. To support the development of scientific EDA, we introduce Octopus, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers. Octopus can be scaled to meet demand, permits the deployment of highly available Triggers for automatic event processing, and enforces fine-grained access control. We identify requirements in self-driving laboratories, scientific data automation, online task scheduling, epidemic modeling, and dynamic workflow management use cases, and present results demonstrating Octopus' ability to meet those requirements. Octopus supports producing and consuming events at a rate of over 4.2 M and 9.6 M events per second, respectively, from distributed clients.
- [24] arXiv:2409.06066 (replaced) [pdf, other]
-
Title: A Thorough Investigation of Content-Defined Chunking Algorithms for Data DeduplicationComments: Submitted to IEEE Transactions on Cloud Computing for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Data deduplication emerged as a powerful solution for reducing storage and bandwidth costs in cloud settings by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state-of-the-art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
- [25] arXiv:2301.05872 (replaced) [pdf, html, other]
-
Title: CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved ConvergenceComments: 16 pages, 8 figuresSubjects: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
In this paper, we consider solving the distributed optimization problem over a multi-agent network under the communication restricted setting. We study a compressed decentralized stochastic gradient method, termed ``compressed exact diffusion with adaptive stepsizes (CEDAS)", and show the method asymptotically achieves comparable convergence rate as centralized { stochastic gradient descent (SGD)} for both smooth strongly convex objective functions and smooth nonconvex objective functions under unbiased compression operators. In particular, to our knowledge, CEDAS enjoys so far the shortest transient time (with respect to the graph specifics) for achieving the convergence rate of centralized SGD, which behaves as $\mathcal{O}(n{C^3}/(1-\lambda_2)^{2})$ under smooth strongly convex objective functions, and $\mathcal{O}(n^3{C^6}/(1-\lambda_2)^4)$ under smooth nonconvex objective functions, where $(1-\lambda_2)$ denotes the spectral gap of the mixing matrix, and $C>0$ is the compression-related parameter. In particular, CEDAS exhibits the shortest transient times when $C < \mathcal{O}(1/(1 - \lambda_2)^2)$, which is common in practice. Numerical experiments further demonstrate the effectiveness of the proposed algorithm.
- [26] arXiv:2311.04824 (replaced) [pdf, other]
-
Title: Multi-Relational Algebra and Its Applications to Data InsightsXi Wu, Zichen Zhu, Xiangyao Yu, Shaleen Deep, Stratis Viglas, John Cieslewicz, Somesh Jha, Jeffrey F. NaughtonSubjects: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
A range of data insight analytical tasks involves analyzing a large set of tables of different schemas, possibly induced by various groupings, to find salient patterns. This paper presents Multi-Relational Algebra, an extension of the classic Relational Algebra, to facilitate such transformations and their compositions. Multi-Relational Algebra has two main characteristics: (1) Information Unit. The information unit is a slice $(r, X)$, where $r$ is a (region) tuple, and $X$ is a (feature) table. Specifically, a slice can encompass multiple columns, which surpasses the information unit of "a single tuple" or "a group of tuples of one column" in the classic relational algebra, (2) Schema Flexibility. Slices can have varying schemas, not constrained to a single schema. This flexibility further expands the expressive power of the algebra. Through various examples, we show that multi-relational algebra can effortlessly express many complex analytic problems, some of which are beyond the scope of traditional relational analytics. We have implemented and deployed a service for multi-relational analytics. Due to a unified logical design, we are able to conduct systematic optimization for a variety of seemingly different tasks. Our service has garnered interest from numerous internal teams who have developed data-insight applications using it, and serves millions of operators daily.
- [27] arXiv:2408.03220 (replaced) [pdf, html, other]
-
Title: Masked Random Noise for Communication Efficient Federated LearningShiwei Li, Yingyi Cheng, Haozhao Wang, Xing Tang, Shijie Xu, Weihong Luo, Yuhua Li, Dugang Liu, Xiuqiang He, Ruixuan LiComments: Accepted by MM 2024Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated learning is a promising distributed training paradigm that effectively safeguards data privacy. However, it may involve significant communication costs, which hinders training efficiency. In this paper, we aim to enhance communication efficiency from a new perspective. Specifically, we request the distributed clients to find optimal model updates relative to global model parameters within predefined random noise. For this purpose, we propose Federated Masked Random Noise (FedMRN), a novel framework that enables clients to learn a 1-bit mask for each model parameter and apply masked random noise (i.e., the Hadamard product of random noise and masks) to represent model updates. To make FedMRN feasible, we propose an advanced mask training strategy, called progressive stochastic masking (PSM). After local training, each client only need to transmit local masks and a random seed to the server. Additionally, we provide theoretical guarantees for the convergence of FedMRN under both strongly convex and non-convex assumptions. Extensive experiments are conducted on four popular datasets. The results show that FedMRN exhibits superior convergence speed and test accuracy compared to relevant baselines, while attaining a similar level of accuracy as FedAvg.
- [28] arXiv:2408.04836 (replaced) [pdf, html, other]
-
Title: Distributed Augmentation, Hypersweeps, and Branch Decomposition of Contour Trees for Scientific ExplorationComments: 9 pages, accepted by IEEE VIS 2024Subjects: Computational Geometry (cs.CG); Distributed, Parallel, and Cluster Computing (cs.DC)
Contour trees describe the topology of level sets in scalar fields and are widely used in topological data analysis and visualization. A main challenge of utilizing contour trees for large-scale scientific data is their computation at scale using high-performance computing. To address this challenge, recent work has introduced distributed hierarchical contour trees for distributed computation and storage of contour trees. However, effective use of these distributed structures in analysis and visualization requires subsequent computation of geometric properties and branch decomposition to support contour extraction and exploration. In this work, we introduce distributed algorithms for augmentation, hypersweeps, and branch decomposition that enable parallel computation of geometric properties, and support the use of distributed contour trees as query structures for scientific exploration. We evaluate the parallel performance of these algorithms and apply them to identify and extract important contours for scientific visualization.
- [29] arXiv:2408.08699 (replaced) [pdf, html, other]
-
Title: RBLA: Rank-Based-LoRA-Aggregation for Fine-tuning Heterogeneous Models in FLaaSSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Learning (FL) is a promising privacy-aware distributed learning framework that can be deployed on various devices, such as mobile phones, desktops, and devices equipped with CPUs or GPUs. In the context of server-based Federated Learning as a Service (FLaaS), FL enables a central server to coordinate the training process across multiple devices without direct access to local data, thereby enhancing privacy and data security. Low-Rank Adaptation (LoRA) is a method that efficiently fine-tunes models by focusing on a low-dimensional subspace of the model's parameters. This approach significantly reduces computational and memory costs compared to fine-tuning all parameters from scratch. When integrated with FL, particularly in a FLaaS environment, LoRA allows for flexible and efficient deployment across diverse hardware with varying computational capabilities by adjusting the local model's rank. However, in LoRA-enabled FL, different clients may train models with varying ranks, which poses challenges for model aggregation on the server. Current methods for aggregating models of different ranks involve padding weights to a uniform shape, which can degrade the global model's performance. To address this issue, we propose Rank-Based LoRA Aggregation (RBLA), a novel model aggregation method designed for heterogeneous LoRA structures. RBLA preserves key features across models with different ranks. This paper analyzes the issues with current padding methods used to reshape models for aggregation in a FLaaS environment. Then, we introduce RBLA, a rank-based aggregation method that maintains both low-rank and high-rank features. Finally, we demonstrate the effectiveness of RBLA through comparative experiments with state-of-the-art methods.
- [30] arXiv:2409.00605 (replaced) [pdf, html, other]
-
Title: Average-case optimization analysis for distributed consensus algorithms on regular graphsSubjects: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC)
The consensus problem in distributed computing involves a network of agents aiming to compute the average of their initial vectors through local communication, represented by an undirected graph. This paper focuses on the studying of this problem using an average-case analysis approach, particularly over regular graphs. Traditional algorithms for solving the consensus problem often rely on worst-case performance evaluation scenarios, which may not reflect typical performance in real-world applications. Instead, we apply average-case analysis, focusing on the expected spectral distribution of eigenvalues to obtain a more realistic view of performance. Key contributions include deriving the optimal method for consensus on regular graphs, showing its relation to the Heavy Ball method, analyzing its asymptotic convergence rate, and comparing it to various first-order methods through numerical experiments.