Computer Science
- [1] arXiv:2405.19337 [pdf, ps, other]
-
Title: Information-theoretic language of proteinoid gels: Boolean gates and QR codesComments: 7 pages, 5 figuresSubjects: Emerging Technologies (cs.ET); Information Theory (cs.IT); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
With an aim to build analog computers out of soft matter fluidic systems in future, this work attempts to invent a new information-theoretic language, in the form of two-dimensional Quick Response (QR) codes. This language is, effectively, a digital representation of the analog signals shown by the proteinoids. We use two different experimental techniques: (i) a voltage-sensitive dye and (ii) a pair of differential electrodes, to record the analog signals. The analog signals are digitally approximatied (synthesised) by sampling the analog signals into a series of discrete values, which are then converted into binary representations. We have shown the AND-OR-NOT-XOR-NOR-NAND-XNOR gate representation of the digitally sampled signal of proteinoids. Additional encoding schemes are applied to convert the binary code identified above to a two-dimensional QR code. As a result, the QR code becomes a digital, unique marker of a given proteinoid network. We show that it is possible to retrieve the analog signal from the QR code by scanning the QR code using a mobile phone. Our work shows that soft matter fluidic systems, such as proteinoids, can have a fundamental informatiom-theoretic language, unique to their internal information transmission properties (electrical activity in this case) - such a language can be made universal and accessible to everyone using 2D QR codes, which can digitally encode their internal properties and give an option to recover the original signal when required. On a more fundamental note, this study identifies the techniques of approximating continuum properties of soft matter fluidic systems using a series representation of gates and QR codes, which are a piece-wise digital representation, and thus one step closer to programming the fluids using information-theoretic methods, as suggested almost a decade ago by Tao's fluid program.
- [2] arXiv:2405.19339 [pdf, ps, other]
-
Title: MidSurfer: A Parameter-Free Approach for Mid-Surface Extraction from Segmented Volumetric DataEva Boneš, Dawar Khan, Ciril Bohak, Benjamin A. Barad, Danielle A. Grotjahn, Ivan Viola, Thomas TheußlSubjects: Graphics (cs.GR); Computational Geometry (cs.CG)
In the field of volumetric data processing and analysis, extracting mid-surfaces from thinly bounded compartments is crucial for tasks such as surface area estimation and accurate modeling of biological structures, yet it has lacked a standardized approach. To bridge this gap, we introduce MidSurfer--a novel parameter-free method for extracting mid-surfaces from segmented volumetric data. Our method produces smooth, uniformly triangulated meshes that accurately capture the structural features of interest. The process begins with the Ridge Field Transformation step that transforms the segmented input data, followed by the Mid-Polyline Extraction Algorithm that works on individual volume slices. Based on the connectivity of components, this step can result in either single or multiple polyline segments that represent the structural features. These segments form a coherent series across the volume, creating a backbone of regularly distributed points on each slice that represents the mid-surface. Subsequently, we employ a Polyline Zipper Algorithm for triangulation that connects these polyline segments across neighboring slices, yielding a detailed triangulated mid-surface mesh. Our findings demonstrate that this method surpasses previous techniques in versatility, simplicity of use, and accuracy. Our approach is now publicly available as a plugin for ParaView, a widely-used multi-platform tool for data analysis and visualization, and can be found at this https URL .
- [3] arXiv:2405.19342 [pdf, ps, html, other]
-
Title: Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice AssistantsChloé Sekkat, Fanny Leroy, Salima Mdhaffar, Blake Perry Smith, Yannick Estève, Joseph Dureau, Alice CouckeSubjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.
- [4] arXiv:2405.19343 [pdf, ps, html, other]
-
Title: Luganda Speech Intent Recognition for IoT ApplicationsComments: Presented as a conference paper at ICLR 2024/AfricaNLPSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
The advent of Internet of Things (IoT) technology has generated massive interest in voice-controlled smart homes. While many voice-controlled smart home systems are designed to understand and support widely spoken languages like English, speakers of low-resource languages like Luganda may need more support. This research project aimed to develop a Luganda speech intent classification system for IoT applications to integrate local languages into smart home environments. The project uses hardware components such as Raspberry Pi, Wio Terminal, and ESP32 nodes as microcontrollers. The Raspberry Pi processes Luganda voice commands, the Wio Terminal is a display device, and the ESP32 nodes control the IoT devices. The ultimate objective of this work was to enable voice control using Luganda, which was accomplished through a natural language processing (NLP) model deployed on the Raspberry Pi. The NLP model utilized Mel Frequency Cepstral Coefficients (MFCCs) as acoustic features and a Convolutional Neural Network (Conv2D) architecture for speech intent classification. A dataset of Luganda voice commands was curated for this purpose and this has been made open-source. This work addresses the localization challenges and linguistic diversity in IoT applications by incorporating Luganda voice commands, enabling users to interact with smart home devices without English proficiency, especially in regions where local languages are predominant.
- [5] arXiv:2405.19355 [pdf, ps, html, other]
-
Title: Enhancing Trust and Security in the Vehicular Metaverse: A Reputation-Based Mechanism for Participants with Moral HazardComments: Accepted in WCNC 2024Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
In this paper, we tackle the issue of moral hazard within the realm of the vehicular Metaverse. A pivotal facilitator of the vehicular Metaverse is the effective orchestration of its market elements, primarily comprised of sensing internet of things (SIoT) devices. These SIoT devices play a critical role by furnishing the virtual service provider (VSP) with real-time sensing data, allowing for the faithful replication of the physical environment within the virtual realm. However, SIoT devices with intentional misbehavior can identify a loophole in the system post-payment and proceeds to deliver falsified content, which cause the whole vehicular Metaverse to collapse. To combat this significant problem, we propose an incentive mechanism centered around a reputation-based strategy. Specifically, the concept involves maintaining reputation scores for participants based on their interactions with the VSP. These scores are derived from feedback received by the VSP from Metaverse users regarding the content delivered by the VSP and are managed using a subjective logic model. Nevertheless, to prevent ``good" SIoT devices with false positive ratings to leave the Metaverse market, we build a vanishing-like system of previous ratings so that the VSP can make informed decisions based on the most recent and accurate data available. Finally, we validate our proposed model through extensive simulations. Our primary results show that our mechanism can efficiently prevent malicious devices from starting their poisoning attacks. At the same time, trustworthy SIoT devices that had a previous miss-classification are not banned from the market.
- [6] arXiv:2405.19358 [pdf, ps, html, other]
-
Title: Robustifying Safety-Aligned Large Language Models through Clean Data CurationSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content, which leads to potential jailbreaking attacks in two scenarios: the integration of harmful texts within crowdsourced data used for pre-training and direct tampering with LLMs through fine-tuning. In both scenarios, adversaries can compromise the safety alignment of LLMs, exacerbating malfunctions. Motivated by the need to mitigate these adversarial influences, our research aims to enhance safety alignment by either neutralizing the impact of malicious texts in pre-training datasets or increasing the difficulty of jailbreaking during downstream fine-tuning. In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios. Our method operates under the assumption that we have no prior knowledge of attack details, focusing solely on curating clean texts. We introduce an iterative process aimed at revising texts to reduce their perplexity as perceived by LLMs, while simultaneously preserving their text quality. By pre-training or fine-tuning LLMs with curated clean texts, we observe a notable improvement in LLM robustness regarding safety alignment against harmful queries. For instance, when pre-training LLMs using a crowdsourced dataset containing 5\% harmful instances, adding an equivalent amount of curated texts significantly mitigates the likelihood of providing harmful responses in LLMs and reduces the attack success rate by 71\%. Our study represents a significant step towards mitigating the risks associated with training-based jailbreaking and fortifying the secure utilization of LLMs.
- [7] arXiv:2405.19360 [pdf, ps, html, other]
-
Title: ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign UsersSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found in this https URL.
- [8] arXiv:2405.19369 [pdf, ps, html, other]
-
Title: Sublinear Cuts are the Exception in BDF-GIRGsSubjects: Social and Information Networks (cs.SI); Combinatorics (math.CO); Metric Geometry (math.MG)
The introduction of geometry has proven instrumental in the efforts towards more realistic models for real-world networks. In Geometric Inhomogeneous Random Graphs (GIRGs), Euclidean Geometry induces clustering of the vertices, which is widely observed in networks in the wild. Euclidean Geometry in multiple dimensions however restricts proximity of vertices to those cases where vertices are close in each coordinate. We introduce a large class of GIRG extensions, called BDF-GIRGs, which capture arbitrary hierarchies of the coordinates within the distance function of the vertex feature space. These distance functions have the potential to allow more realistic modeling of the complex formation of social ties in real-world networks, where similarities between people lead to connections. Here, similarity with respect to certain features, such as familial kinship or a shared workplace, suffices for the formation of ties. It is known that - while many key properties of GIRGs, such as log-log average distance and sparsity, are independent of the distance function - the Euclidean metric induces small separators, i.e. sublinear cuts of the unique giant component in GIRGs, whereas no such sublinear separators exist under the component-wise minimum distance. Building on work of Lengler and Todorović, we give a complete classification for the existence of small separators in BDF-GIRGs. We further show that BDF-GIRGs all fulfill a stochastic triangle inequality and thus also exhibit clustering.
- [9] arXiv:2405.19375 [pdf, ps, html, other]
-
Title: Improving global awareness of linkset predictions using Cross-Attentive Modulation tokensComments: 17 pages, 2 figures, not published nor submitted yetSubjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Most of multiple link prediction or graph generation techniques rely on the attention mechanism or on Graph Neural Networks (GNNs), which consist in leveraging node-level information exchanges in order to form proper link predictions. Such node-level interactions do not process nodes as an ordered sequence, which would imply some kind of natural ordering of the nodes: they are said to be permutation invariant mechanisms. They are well suited for graph problems, but struggle at providing a global orchestration of the predicted links, which can result in a loss of performance. Some typical issues can be the difficulty to ensure high-level properties such as global connectedness, fixed diameter or to avoid information bottleneck effects such as oversmoothing and oversquashing, which respectively consist in abundant smoothing in dense areas leading to a loss of information and a tendency to exclude isolated nodes from the message passing scheme, and often result in irrelevant, unbalanced link predictions. To tackle this problem, we hereby present Cross-Attentive Modulation (CAM) tokens, which introduce cross-attentive units used to condition node and edge-level modulations in order to enable context-aware computations that improve the global consistency of the prediction links. We will implement it on a few permutation invariant architectures, and showcase benchmarks that prove the merits of our work.
- [10] arXiv:2405.19376 [pdf, ps, html, other]
-
Title: PureEBM: Universal Poison Purification via Mid-Run Dynamics of Energy-Based ModelsComments: arXiv admin note: substantial text overlap with arXiv:2405.18627Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Data poisoning attacks pose a significant threat to the integrity of machine learning models by leading to misclassification of target distribution test data by injecting adversarial examples during training. Existing state-of-the-art (SoTA) defense methods suffer from a variety of limitations, such as significantly reduced generalization performance, specificity to particular attack types and classifiers, and significant overhead during training, making them impractical or limited for real-world applications. In response to this challenge, we introduce a universal data purification method that defends naturally trained classifiers from malicious white-, gray-, and black-box image poisons by applying a universal stochastic preprocessing step $\Psi_{T}(x)$, realized by iterative Langevin sampling of a convergent Energy Based Model (EBM) initialized with an image $x.$ Mid-run dynamics of $\Psi_{T}(x)$ purify poison information with minimal impact on features important to the generalization of a classifier network. We show that the contrastive learning process of EBMs allows them to remain universal purifiers, even in the presence of poisoned EBM training data, and to achieve SoTA defense on leading triggered poison Narcissus and triggerless poisons Gradient Matching and Bullseye Polytope. This work is a subset of a larger framework introduced in PureGen with a more detailed focus on EBM purification and poison defense.
- [11] arXiv:2405.19377 [pdf, ps, html, other]
-
Title: HoloDevice: Holographic Cross-Device Interactions for Remote CollaborationSubjects: Human-Computer Interaction (cs.HC)
This paper introduces holographic cross-device interaction, a new class of remote cross-device interactions between local physical devices and holographically rendered remote devices. Cross-device interactions have enabled a rich set of interactions with device ecologies. Most existing research focuses on co-located settings (meaning when users and devices are in the same physical space) to achieve these rich interactions and affordances. In contrast, holographic cross-device interaction allows remote interactions between devices at distant locations by providing a rich visual affordance through real-time holographic rendering of the device's motion, content, and interactions on mixed reality head-mounted displays. This maintains the advantages of having a physical device, such as precise input through touch and pen interaction. Through holographic rendering, not only can remote devices interact as if they are co-located, but they can also be virtually augmented to further enrich interactions, going beyond what is possible with existing cross-device systems. To demonstrate this concept, we developed HoloDevice, a prototype system for holographic cross-device interaction using the Microsoft Hololens 2 augmented reality headset. Our contribution is threefold. First, we introduce the concept of holographic cross-device interaction. Second, we present a design space containing three unique benefits, which include: (1) spatial visualization of interaction and motion, (2) rich visual affordances for intermediate transition, and (3) dynamic and fluid configuration. Last we discuss a set of implementation demonstrations and use-case scenarios that further explore the space.
- [12] arXiv:2405.19383 [pdf, ps, html, other]
-
Title: Network Analytics for Anti-Money Laundering -- A Systematic Literature Review and Experimental EvaluationSubjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Money laundering presents a pervasive challenge, burdening society by financing illegal activities. To more effectively combat and detect money laundering, the use of network information is increasingly being explored, exploiting that money laundering necessarily involves interconnected parties. This has lead to a surge in literature on network analytics (NA) for anti-money laundering (AML). The literature, however, is fragmented and a comprehensive overview of existing work is missing. This results in limited understanding of the methods that may be applied and their comparative detection power. Therefore, this paper presents an extensive and systematic review of the literature. We identify and analyse 97 papers in the Web of Science and Scopus databases, resulting in a taxonomy of approaches following the fraud analytics framework of Bockel-Rickermann et al.. Moreover, this paper presents a comprehensive experimental framework to evaluate and compare the performance of prominent NA methods in a uniform setup. The framework is applied on the publicly available Elliptic data set and implements manual feature engineering, random walk-based methods, and deep learning GNNs. We conclude from the results that network analytics increases the predictive power of the AML model with graph neural networks giving the best results. An open source implementation of the experimental framework is provided to facilitate researchers and practitioners to extend upon these results and experiment on proprietary data. As such, we aim to promote a standardised approach towards the analysis and evaluation of network analytics for AML.
- [13] arXiv:2405.19387 [pdf, ps, html, other]
-
Title: Video Anomaly Detection in 10 Years: A Survey and OutlookSubjects: Computer Vision and Pattern Recognition (cs.CV)
Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring. While numerous surveys focus on conventional VAD methods, they often lack depth in exploring specific approaches and emerging trends. This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches. A prominent feature of this review is the investigation of core challenges within the VAD paradigms including large-scale datasets, features extraction, learning methods, loss functions, regularization, and anomaly score prediction. Moreover, this review also investigates the vision language models (VLMs) as potent feature extractors for VAD. VLMs integrate visual data with textual descriptions or spoken language from videos, enabling a nuanced understanding of scenes crucial for anomaly detection. By addressing these challenges and proposing future research directions, this review aims to foster the development of robust and efficient VAD systems leveraging the capabilities of VLMs for enhanced anomaly detection in complex real-world scenarios. This comprehensive analysis seeks to bridge existing knowledge gaps, provide researchers with valuable insights, and contribute to shaping the future of VAD research.
- [14] arXiv:2405.19413 [pdf, ps, html, other]
-
Title: VisTA-SR: Improving the Accuracy and Resolution of Low-Cost Thermal Imaging Cameras for AgricultureSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Thermal cameras are an important tool for agricultural research because they allow for non-invasive measurement of plant temperature, which relates to important photochemical, hydraulic, and agronomic traits. Utilizing low-cost thermal cameras can lower the barrier to introducing thermal imaging in agricultural research and production. This paper presents an approach to improve the temperature accuracy and image quality of low-cost thermal imaging cameras for agricultural applications. Leveraging advancements in computer vision techniques, particularly deep learning networks, we propose a method, called $\textbf{VisTA-SR}$ ($\textbf{Vis}$ual \& $\textbf{T}$hermal $\textbf{A}$lignment and $\textbf{S}$uper-$\textbf{R}$esolution Enhancement) that combines RGB and thermal images to enhance the capabilities of low-resolution thermal cameras. The research includes calibration and validation of temperature measurements, acquisition of paired image datasets, and the development of a deep learning network tailored for agricultural thermal imaging. Our study addresses the challenges of image enhancement in the agricultural domain and explores the potential of low-cost thermal cameras to replace high-resolution industrial cameras. Experimental results demonstrate the effectiveness of our approach in enhancing temperature accuracy and image sharpness, paving the way for more accessible and efficient thermal imaging solutions in agriculture.
- [15] arXiv:2405.19414 [pdf, ps, html, other]
-
Title: Safety through Permissibility: Shield Construction for Fast and Safe Reinforcement LearningComments: 9 pages, 3 figuresSubjects: Machine Learning (cs.LG)
Designing Reinforcement Learning (RL) solutions for real-life problems remains a significant challenge. A major area of concern is safety. "Shielding" is a popular technique to enforce safety in RL by turning user-defined safety specifications into safe agent behavior. However, these methods either suffer from extreme learning delays, demand extensive human effort in designing models and safe domains in the problem, or require pre-computation. In this paper, we propose a new permissibility-based framework to deal with safety and shield construction. Permissibility was originally designed for eliminating (non-permissible) actions that will not lead to an optimal solution to improve RL training efficiency. This paper shows that safety can be naturally incorporated into this framework, i.e. extending permissibility to include safety, and thereby we can achieve both safety and improved efficiency. Experimental evaluation using three standard RL applications shows the effectiveness of the approach.
- [16] arXiv:2405.19420 [pdf, ps, html, other]
-
Title: Using Contrastive Learning with Generative Similarity to Learn Spaces that Capture Human Inductive BiasesRaja Marjieh, Sreejan Kumar, Declan Campbell, Liyi Zhang, Gianluca Bencomo, Jake Snell, Thomas L. GriffithsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Humans rely on strong inductive biases to learn from few examples and abstract useful information from sensory data. Instilling such biases in machine learning models has been shown to improve their performance on various benchmarks including few-shot learning, robustness, and alignment. However, finding effective training procedures to achieve that goal can be challenging as psychologically-rich training data such as human similarity judgments are expensive to scale, and Bayesian models of human inductive biases are often intractable for complex, realistic domains. Here, we address this challenge by introducing a Bayesian notion of generative similarity whereby two datapoints are considered similar if they are likely to have been sampled from the same distribution. This measure can be applied to complex generative processes, including probabilistic programs. We show that generative similarity can be used to define a contrastive learning objective even when its exact form is intractable, enabling learning of spatial embeddings that express specific inductive biases. We demonstrate the utility of our approach by showing how it can be used to capture human inductive biases for geometric shapes, and to better distinguish different abstract drawing styles that are parameterized by probabilistic programs.
- [17] arXiv:2405.19423 [pdf, ps, html, other]
-
Title: Evaluating Vision-Language Models on Bistable ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the model preferences with humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from human initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors significantly affect model interpretations more than image manipulations showing a higher influence of the language priors on bistable image interpretations compared to image-text training data. All code and data is open sourced.
- [18] arXiv:2405.19424 [pdf, ps, html, other]
-
Title: Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based PoliciesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Diffusion models (DMs) have emerged as a promising approach for behavior cloning (BC). Diffusion policies (DP) based on DMs have elevated BC performance to new heights, demonstrating robust efficacy across diverse tasks, coupled with their inherent flexibility and ease of implementation. Despite the increasing adoption of DP as a foundation for policy generation, the critical issue of safety remains largely unexplored. While previous attempts have targeted deep policy networks, DP used diffusion models as the policy network, making it ineffective to be attacked using previous methods because of its chained structure and randomness injected. In this paper, we undertake a comprehensive examination of DP safety concerns by introducing adversarial scenarios, encompassing offline and online attacks, and global and patch-based attacks. We propose DP-Attacker, a suite of algorithms that can craft effective adversarial attacks across all aforementioned scenarios. We conduct attacks on pre-trained diffusion policies across various manipulation tasks. Through extensive experiments, we demonstrate that DP-Attacker has the capability to significantly decrease the success rate of DP for all scenarios. Particularly in offline scenarios, DP-Attacker can generate highly transferable perturbations applicable to all frames. Furthermore, we illustrate the creation of adversarial physical patches that, when applied to the environment, effectively deceive the model. Video results are put in: this https URL.
- [19] arXiv:2405.19425 [pdf, ps, html, other]
-
Title: Adaptive In-conversation Team Building for Language Model AgentsSubjects: Computation and Language (cs.CL)
Leveraging multiple large language model (LLM) agents has shown to be a promising approach for tackling complex tasks, while the effective design of multiple agents for a particular application remains an art. It is thus intriguing to answer a critical question: Given a task, how can we build a team of LLM agents to solve it effectively? Our new adaptive team-building paradigm offers a flexible solution, realized through a novel agent design named Captain Agent. It dynamically forms and manages teams for each step of a task-solving process, utilizing nested group conversations and reflection to ensure diverse expertise and prevent stereotypical outputs. It allows for a flexible yet structured approach to problem-solving and can help reduce redundancy and enhance output diversity. A comprehensive evaluation across six real-world scenarios demonstrates that Captain Agent significantly outperforms existing multi-agent methods with 21.94% improvement in average accuracy, providing outstanding performance without requiring task-specific prompt engineering.
- [20] arXiv:2405.19426 [pdf, ps, html, other]
-
Title: Deep Learning for Assessment of Oral Reading FluencySubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Reading fluency assessment is a critical component of literacy programmes, serving to guide and monitor early education interventions. Given the resource intensive nature of the exercise when conducted by teachers, the development of automatic tools that can operate on audio recordings of oral reading is attractive as an objective and highly scalable solution. Multiple complex aspects such as accuracy, rate and expressiveness underlie human judgements of reading fluency. In this work, we investigate end-to-end modeling on a training dataset of children's audio recordings of story texts labeled by human experts. The pre-trained wav2vec2.0 model is adopted due its potential to alleviate the challenges from the limited amount of labeled data. We report the performance of a number of system variations on the relevant measures, and also probe the learned embeddings for lexical and acoustic-prosodic features known to be important to the perception of reading fluency.
- [21] arXiv:2405.19429 [pdf, ps, html, other]
-
Title: Conformal Recursive Feature EliminationMarcos López-De-Castro (1 and 2), Alberto García-Galindo (1 and 2), Rubén Armañanzas (1 and 2) ((1) DATAI - Institute of Data Science and Artificial Intelligence, Universidad de Navarra, Pamplona, Spain,(2) TECNUN School of Engineering, Universidad de Navarra, Donostia-San Sebastian, Spain)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Unlike traditional statistical methods, Conformal Prediction (CP) allows for the determination of valid and accurate confidence levels associated with individual predictions based only on exchangeability of the data. We here introduce a new feature selection method that takes advantage of the CP framework. Our proposal, named Conformal Recursive Feature Elimination (CRFE), identifies and recursively removes features that increase the non-conformity of a dataset. We also present an automatic stopping criterion for CRFE, as well as a new index to measure consistency between subsets of features. CRFE selections are compared to the classical Recursive Feature Elimination (RFE) method on several multiclass datasets by using multiple partitions of the data. The results show that CRFE clearly outperforms RFE in half of the datasets, while achieving similar performance in the rest. The automatic stopping criterion provides subsets of effective and non-redundant features without computing any classification performance.
- [22] arXiv:2405.19430 [pdf, ps, other]
-
Title: Understanding Grasp Synergies during Reach-to-grasp using an Instrumented Data GloveSubjects: Robotics (cs.RO)
Data gloves play a crucial role in study of human grasping, and could provide insights into grasp synergies. Grasp synergies lead to identification of underlying patterns to develop control strategies for hand exoskeletons. This paper presents the design and implementation of a data glove that has been enhanced with instrumentation and fabricated using 3D printing technology. The glove utilizes flexible sensors for the fingers and force sensors integrated into the glove at the fingertips to accurately capture grasp postures and forces. Understanding the kinematics and dynamics of human grasp including reach-to-grasp is undertaken. A comprehensive study involving 10 healthy subjects was conducted. Grasp synergy analysis is carried out to identify underlying patterns for robotic grasping. The t-SNE visualization showcased clusters of grasp postures and forces, unveiling similarities and patterns among different GTs. These findings could serve as a comprehensive guide in design and control of tendon-driven soft hand exoskeletons for rehabilitation applications, enabling the replication of natural hand movements and grasp forces.
- [23] arXiv:2405.19433 [pdf, ps, other]
-
Title: Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed CounterfactualsSubjects: Computation and Language (cs.CL)
While current automated essay scoring (AES) methods show high agreement with human raters, their scoring mechanisms are not fully explored. Our proposed method, using counterfactual intervention assisted by Large Language Models (LLMs), reveals that when scoring essays, BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, as well as organization, indicating a more comprehensive alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions during feedback. Our approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions. The codes and data will be released at GitHub.
- [24] arXiv:2405.19438 [pdf, ps, html, other]
-
Title: Towards an Autonomous Minimally Invasive Spinal Fixation Surgery Using a Concentric Tube Steerable Drilling RobotComments: 6 pages, 3 figures, Accepted for publication at the 2024 International Symposium on Medical RoboticsSubjects: Robotics (cs.RO)
Towards performing a realistic autonomous minimally invasive spinal fixation procedure, in this paper, we introduce a unique robotic drilling system utilizing a concentric tube steerable drilling robot (CT-SDR) integrated with a seven degree-of-freedom robotic manipulator. The CT-SDR in integration with the robotic arm enables creating precise J-shape trajectories enabling access to the areas within the vertebral body that currently are not accessible utilizing existing rigid instruments. To ensure safety and accuracy of the autonomous drilling procedure, we also performed required calibration procedures. The performance of the proposed robotic system and the calibration steps were thoroughly evaluated by performing various drilling experiments on simulated Sawbone samples.
- [25] arXiv:2405.19440 [pdf, ps, html, other]
-
Title: On the Convergence of Multi-objective Optimization under Generalized SmoothnessSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions, which are typically unsatisfactory for neural networks, such as recurrent neural networks (RNNs) and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where totally $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples are needed for deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we propose a practical variant of GSMGrad named GSMGrad-FA using only constant-level time and space, while achieving the same performance guarantee as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
- [26] arXiv:2405.19442 [pdf, ps, html, other]
-
Title: Large-scale DSM registration via motion averagingComments: 9 FiguresJournal-ref: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. X-1-2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Generating wide-area digital surface models (DSMs) requires registering a large number of individual, and partially overlapped DSMs. This presents a challenging problem for a typical registration algorithm, since when a large number of observations from these multiple DSMs are considered, it may easily cause memory overflow. Sequential registration algorithms, although can significantly reduce the computation, are especially vulnerable for small overlapped pairs, leading to a large error accumulation. In this work, we propose a novel solution that builds the DSM registration task as a motion averaging problem: pair-wise DSMs are registered to build a scene graph, with edges representing relative poses between DSMs. Specifically, based on the grid structure of the large DSM, the pair-wise registration is performed using a novel nearest neighbor search method. We show that the scene graph can be optimized via an extremely fast motion average algorithm with O(N) complexity (N refers to the number of images). Evaluation of high-resolution satellite-derived DSM demonstrates significant improvement in computation and accuracy.
- [27] arXiv:2405.19444 [pdf, ps, html, other]
-
Title: MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn InteractionsSubjects: Artificial Intelligence (cs.AI)
Large language models (LLMs) have demonstrated impressive capabilities in mathematical problem solving, particularly in single turn question answering formats. However, real world scenarios often involve mathematical question answering that requires multi turn or interactive information exchanges, and the performance of LLMs on these tasks is still underexplored. This paper introduces MathChat, a comprehensive benchmark specifically designed to evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are structured to assess the models' abilities in multiturn interactions and open ended generation. We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding. To address the above limitations of existing LLMs when faced with multiturn and open ended tasks, we develop MathChat sync, a synthetic dialogue based math dataset for LLM finetuning, focusing on improving models' interaction and instruction following capabilities in conversations. Experimental results emphasize the need for training LLMs with diverse, conversational instruction tuning datasets like MathChatsync. We believe this work outlines one promising direction for improving the multiturn mathematical reasoning abilities of LLMs, thus pushing forward the development of LLMs that are more adept at interactive mathematical problem solving and real world applications.
- [28] arXiv:2405.19449 [pdf, ps, html, other]
-
Title: Tangible Scenography as a Holistic Design Method for Human-Robot InteractionSubjects: Human-Computer Interaction (cs.HC)
Traditional approaches to human-robot interaction design typically examine robot behaviors in controlled environments and narrow tasks. These methods are impractical for designing robots that interact with diverse user groups in complex human environments. Drawing from the field of theater, we present the construct of scenes -- individual environments consisting of specific people, objects, spatial arrangements, and social norms -- and tangible scenography, as a holistic design approach for human-robot interactions. We created a design tool, Tangible Scenography Kit (TaSK), with physical props to aid in design brainstorming. We conducted design sessions with eight professional designers to generate exploratory designs. Designers used tangible scenography and TaSK components to create multiple scenes with specific interaction goals, characterize each scene's social environment, and design scene-specific robot behaviors. From these sessions, we found that this method can encourage designers to think beyond a robot's narrow capabilities and consider how they can facilitate complex social interactions.
- [29] arXiv:2405.19450 [pdf, ps, html, other]
-
Title: FourierMamba: Fourier Learning Integration with State Space Models for Image DerainingSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Currently, some research that employs the Fourier transform has proved to be effective for image deraining, due to it acting as an effective frequency prior for capturing rain streaks. However, despite there exists dependency of low frequency and high frequency in images, these Fourier-based methods rarely exploit the correlation of different frequencies for conjuncting their learning procedures, limiting the full utilization of frequency information for image deraining. Alternatively, the recently emerged Mamba technique depicts its effectiveness and efficiency for modeling correlation in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into its unexplored Fourier spaces to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owning to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where the low-high frequency order formats exhibit differently in the spatial dimension (unarranged in axis) and channel dimension (arranged in axis). Therefore, we design FourierMamba that correlates Fourier space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial dimension Fourier space, we introduce the zigzag coding to scan the frequencies to rearrange the orders from low to high frequencies, thereby orderly correlating the connections between frequencies; in the channel dimension Fourier space with arranged orders of frequencies in axis, we can directly use Mamba to perform frequency correlation and improve the channel information representation.
- [30] arXiv:2405.19452 [pdf, ps, html, other]
-
Title: Gaitor: Learning a Unified Representation Across Gaits for Real-World Quadruped LocomotionComments: 10 pages, 8 figures, 2 tablesSubjects: Robotics (cs.RO); Machine Learning (cs.LG)
The current state-of-the-art in quadruped locomotion is able to produce robust motion for terrain traversal but requires the segmentation of a desired robot trajectory into a discrete set of locomotion skills such as trot and crawl. In contrast, in this work we demonstrate the feasibility of learning a single, unified representation for quadruped locomotion enabling continuous blending between gait types and characteristics. We present Gaitor, which learns a disentangled representation of locomotion skills, thereby sharing information common to all gait types seen during training. The structure emerging in the learnt representation is interpretable in that it is found to encode phase correlations between the different gait types. These can be leveraged to produce continuous gait transitions. In addition, foot swing characteristics are disentangled and directly addressable. Together with a rudimentary terrain encoding and a learned planner operating in this structured latent representation, Gaitor is able to take motion commands including desired gait type and characteristics from a user while reacting to uneven terrain. We evaluate Gaitor in both simulated and real-world settings on the ANYmal C platform. To the best of our knowledge, this is the first work learning such a unified and interpretable latent representation for multiple gaits, resulting in on-demand continuous blending between different locomotion modes on a real quadruped robot.
- [31] arXiv:2405.19453 [pdf, ps, html, other]
-
Title: Optimizing Split Points for Error-Resilient SplitFed LearningComments: Accepted for poster presentation at the Women in Computer Vision (WiCV) workshop in CVPR 2024Subjects: Artificial Intelligence (cs.AI)
Recent advancements in decentralized learning, such as Federated Learning (FL), Split Learning (SL), and Split Federated Learning (SplitFed), have expanded the potentials of machine learning. SplitFed aims to minimize the computational burden on individual clients in FL and parallelize SL while maintaining privacy. This study investigates the resilience of SplitFed to packet loss at model split points. It explores various parameter aggregation strategies of SplitFed by examining the impact of splitting the model at different points-either shallow split or deep split-on the final global model performance. The experiments, conducted on a human embryo image segmentation task, reveal a statistically significant advantage of a deeper split point.
- [32] arXiv:2405.19454 [pdf, ps, html, other]
-
Title: Deep Grokking: Would Deep Neural Networks Generalize Better?Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While the existing research primarily focus on shallow networks such as 2-layer MLP and 1-layer Transformer, we explore grokking on deep networks (e.g. 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than its shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increase the depth of the MLP model where the test accuracy exhibits a secondary surge, which is scarcely seen on shallow models. We further uncover compelling correspondences between the decreasing of feature ranks and the phase transition from overfitting to the generalization stage during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model's generalization behavior compared to the weight-norm. We believe our work is the first one to dive into grokking in deep neural networks, and investigate the relationship of feature rank and generalization performance.
- [33] arXiv:2405.19456 [pdf, ps, html, other]
-
Title: An Automated Startup Evaluation Pipeline: Startup Success Forecasting Framework (SSFF)Comments: For relevant code: this https URLSubjects: Artificial Intelligence (cs.AI)
Evaluating startups in their early stages is a complex task that requires detailed analysis by experts. While automating this process on a large scale can significantly impact businesses, the inherent complexity poses challenges. This paper addresses this challenge by introducing the Startup Success Forecasting Framework (SSFF), a new automated system that combines traditional machine learning with advanced language models. This intelligent agent-based architecture is designed to reason, act, synthesize, and decide like a venture capitalist to perform the analysis end-to-end. The SSFF is made up of three main parts: - Prediction Block: Uses random forests and neural networks to make predictions. - Analyst Block: Simulates VC analysis scenario and uses SOTA prompting techniques - External Knowledge Block: Gathers real-time information from external sources. This framework requires minimal input data about the founder and startup description, enhances it with additional data from external resources, and performs a detailed analysis with high accuracy, all in an automated manner
- [34] arXiv:2405.19457 [pdf, ps, html, other]
-
Title: Construction of a Byzantine Linearizable SWMR Atomic Register from SWSR Atomic RegistersComments: 18 pagesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
The SWMR atomic register is a fundamental building block in shared memory distributed systems and implementing it from SWSR atomic registers is an important problem. While this problem has been solved in crash-prone systems, it has received less attention in Byzantine systems. Recently, Hu and Toueg gave such an implementation of the SWMR register from SWSR registers. While their definition of register linearizability is consistent with the definition of Byzantine linearizability of a concurrent history of Cohen and Keidar, it has these drawbacks. (1) If the writer is Byzantine, the register is linearizable no matter what values the correct readers return. (2) It ignores values written consistently by a Byzantine writer. We need a stronger notion of a {\em correct write operation}. (3) It allows a value written to just one or a few readers' SWSR registers to be returned, thereby not validating the intention of the writer to write that value honestly. (4) Its notion of a ``current'' value returned by a correct reader is not related to the most recent value written by a correct write operation of a Byzantine writer. We need a more up to date version of the value that can be returned by a correct reader. In this paper, we give a stronger definition of a Byzantine linearizable register that overcomes the above drawbacks. Then we give a construction of a Byzantine linearizable SWMR atomic register from SWSR registers that meets our stronger definition. The construction is correct when $n>3f$, where $n$ is the number of readers, $f$ is the maximum number of Byzantine readers, and the writer can also be Byzantine. The construction relies on a public-key infrastructure.
- [35] arXiv:2405.19458 [pdf, ps, html, other]
-
Title: MemControl: Mitigating Memorization in Medical Diffusion Models via Automated Parameter SelectionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Diffusion models show a remarkable ability in generating images that closely mirror the training distribution. However, these models are prone to training data memorization, leading to significant privacy, ethical, and legal concerns, particularly in sensitive fields such as medical imaging. We hypothesize that memorization is driven by the overparameterization of deep models, suggesting that regularizing model capacity during fine-tuning could be an effective mitigation strategy. Parameter-efficient fine-tuning (PEFT) methods offer a promising approach to capacity control by selectively updating specific parameters. However, finding the optimal subset of learnable parameters that balances generation quality and memorization remains elusive. To address this challenge, we propose a bi-level optimization framework that guides automated parameter selection by utilizing memorization and generation quality metrics as rewards. Our framework successfully identifies the optimal parameter set to be updated to satisfy the generation-memorization tradeoff. We perform our experiments for the specific task of medical image generation and outperform existing state-of-the-art training-time mitigation strategies by fine-tuning as few as 0.019% of model parameters. Furthermore, we show that the strategies learned through our framework are transferable across different datasets and domains. Our proposed framework is scalable to large datasets and agnostic to the choice of reward functions. Finally, we show that our framework can be combined with existing approaches for further memorization mitigation.
- [36] arXiv:2405.19460 [pdf, ps, html, other]
-
Title: Evaluating Micro Parsons Problems as Exam QuestionsComments: This work is to appear in ITiCSE 2024. Both authors contributed equally to this researchSubjects: Human-Computer Interaction (cs.HC)
Parsons problems are a type of programming activity that present learners with blocks of existing code and requiring them to arrange those blocks to form a program rather than write the code from scratch. Micro Parsons problems extend this concept by having students assemble segments of code to form a single line of code rather than an entire program. Recent investigations into micro Parsons problems have primarily focused on supporting learners leaving open the question of micro Parsons efficacy as an exam item and how students perceive it when preparing for exams.
To fill this gap, we included a variety of micro Parsons problems on four exams in an introductory programming course taught in Python. We use Item Response Theory to investigate the difficulty of the micro Parsons problems as well as the ability of the questions to differentiate between high and low ability students. We then compare these results to results for related questions where students are asked to write a single line of code from scratch. Finally, we conduct a thematic analysis of the survey responses to investigate how students' perceptions of micro Parsons both when practicing for exams and as they appear on exams. - [37] arXiv:2405.19461 [pdf, ps, html, other]
-
Title: Clustering-Based Validation Splits for Domain GeneralisationSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
This paper considers the problem of model selection under domain shift. In this setting, it is proposed that a high maximum mean discrepancy (MMD) between the training and validation sets increases the generalisability of selected models. A data splitting algorithm based on kernel k-means clustering, which maximises this objective, is presented. The algorithm leverages linear programming to control the size, label, and (optionally) group distributions of the splits, and comes with convergence guarantees. The technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms, for both domain generalisation (DG) and unsupervised domain adaptation (UDA) tasks. Analysis also shows the MMD between the training and validation sets to be strongly rank-correlated ($\rho=0.63$) with test domain accuracy, further substantiating the validity of this approach.
- [38] arXiv:2405.19462 [pdf, ps, html, other]
-
Title: Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data PruningComments: Accepted to ACL 2024 FindingsSubjects: Computation and Language (cs.CL)
Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
- [39] arXiv:2405.19464 [pdf, ps, html, other]
-
Title: Leveraging Generative AI for Smart City Digital Twins: A Survey on the Autonomous Generation of Data, Scenarios, 3D City Models, and Urban DesignsSubjects: Artificial Intelligence (cs.AI)
The digital transformation of modern cities by integrating advanced information, communication, and computing technologies has marked the epoch of data-driven smart city applications for efficient and sustainable urban management. Despite their effectiveness, these applications often rely on massive amounts of high-dimensional and multi-domain data for monitoring and characterizing different urban sub-systems, presenting challenges in application areas that are limited by data quality and availability, as well as costly efforts for generating urban scenarios and design alternatives. As an emerging research area in deep learning, Generative Artificial Intelligence (AI) models have demonstrated their unique values in data and code generation. This survey paper aims to explore the innovative integration of generative AI techniques and urban digital twins to address challenges in the realm of smart cities in various urban sectors, such as transportation and mobility management, energy system operations, building and infrastructure management, and urban design. The survey starts with the introduction of popular generative AI models with their application areas, followed by a structured review of the existing urban science applications that leverage the autonomous capability of the generative AI techniques to facilitate (a) data augmentation for promoting urban monitoring and predictive analytics, (b) synthetic data and scenario generation, (c) automated 3D city modeling, and (d) generative urban design and optimization. Based on the review, this survey discusses potential opportunities and technical strategies that integrate generative AI models into the next-generation urban digital twins for more reliable, scalable, and automated management of smart cities.
- [40] arXiv:2405.19465 [pdf, ps, html, other]
-
Title: RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated AdapterMeng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge LiComments: Accepted by ACL 2024 FindingsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained visionlanguage models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-andcorrelated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. Besides, we introduce an asynchronous self-attention mechanism that first selects the top responsive visual patches and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that RAP achieves superior or comparable performance compared to the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.
- [41] arXiv:2405.19466 [pdf, ps, html, other]
-
Title: Posterior Sampling via Autoregressive GenerationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Real-world decision-making requires grappling with a perpetual lack of data as environments change; intelligent agents must comprehend uncertainty and actively gather information to resolve it. We propose a new framework for learning bandit algorithms from massive historical data, which we demonstrate in a cold-start recommendation problem. First, we use historical data to pretrain an autoregressive model to predict a sequence of repeated feedback/rewards (e.g., responses to news articles shown to different users over time). In learning to make accurate predictions, the model implicitly learns an informed prior based on rich action features (e.g., article headlines) and how to sharpen beliefs as more rewards are gathered (e.g., clicks as each article is recommended). At decision-time, we autoregressively sample (impute) an imagined sequence of rewards for each action, and choose the action with the largest average imputed reward. Far from a heuristic, our approach is an implementation of Thompson sampling (with a learned prior), a prominent active exploration algorithm. We prove our pretraining loss directly controls online decision-making performance, and we demonstrate our framework on a news recommendation task where we integrate end-to-end fine-tuning of a pretrained language model to process news article headline text to improve performance.
- [42] arXiv:2405.19471 [pdf, ps, other]
-
Title: The Data Minimization Principle in Machine LearningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
The principle of data minimization aims to reduce the amount of data collected, processed or retained to minimize the potential for misuse, unauthorized access, or data breaches. Rooted in privacy-by-design principles, data minimization has been endorsed by various global data protection regulations. However, its practical implementation remains a challenge due to the lack of a rigorous formulation. This paper addresses this gap and introduces an optimization framework for data minimization based on its legal definitions. It then adapts several optimization algorithms to perform data minimization and conducts a comprehensive evaluation in terms of their compliance with minimization objectives as well as their impact on user privacy. Our analysis underscores the mismatch between the privacy expectations of data minimization and the actual privacy benefits, emphasizing the need for approaches that account for multiple facets of real-world privacy risks.
- [43] arXiv:2405.19479 [pdf, ps, html, other]
-
Title: Participation in the age of foundation modelsComments: 13 pages, 2 figures. Appeared at FAccT '24Journal-ref: In The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24), June 3-6, 2024, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 13 pagesSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Growing interest and investment in the capabilities of foundation models has positioned such systems to impact a wide array of public services. Alongside these opportunities is the risk that these systems reify existing power imbalances and cause disproportionate harm to marginalized communities. Participatory approaches hold promise to instead lend agency and decision-making power to marginalized stakeholders. But existing approaches in participatory AI/ML are typically deeply grounded in context - how do we apply these approaches to foundation models, which are, by design, disconnected from context? Our paper interrogates this question.
First, we examine existing attempts at incorporating participation into foundation models. We highlight the tension between participation and scale, demonstrating that it is intractable for impacted communities to meaningfully shape a foundation model that is intended to be universally applicable. In response, we develop a blueprint for participatory foundation models that identifies more local, application-oriented opportunities for meaningful participation. In addition to the "foundation" layer, our framework proposes the "subfloor'' layer, in which stakeholders develop shared technical infrastructure, norms and governance for a grounded domain, and the "surface'' layer, in which affected communities shape the use of a foundation model for a specific downstream task. The intermediate "subfloor'' layer scopes the range of potential harms to consider, and affords communities more concrete avenues for deliberation and intervention. At the same time, it avoids duplicative effort by scaling input across relevant use cases. Through three case studies in clinical care, financial services, and journalism, we illustrate how this multi-layer model can create more meaningful opportunities for participation than solely intervening at the foundation layer. - [44] arXiv:2405.19480 [pdf, ps, html, other]
-
Title: RANFusion: A Comprehensive Tool for Simulating Handover In Next-G RANComments: 7 pages, 5 Figure, 2 tableSubjects: Networking and Internet Architecture (cs.NI)
The rapid advancement of 5G networks and the upcoming transition to 6G necessitate the use of the Open Radio Access Network (O-RAN) architecture to enable greater flexibility, interoperability, and innovation. This shift towards 6G and O-RAN requires the development of advanced simulation tools for testing, analyzing, and optimizing Radio Access Network (RAN) operations. This need becomes critical due to the complex dynamics of mobility management inherent in the 6G vision and next-generation networks. These networks anticipate advanced handover methods for mobile users, UAVs, IoT devices, and beyond. Addressing this gap, this paper introduces RANFusion: a robust RAN simulator specifically created to explore a variety of handover scenarios and to test and balance resources between users. This tool enables precise simulations for refining handover strategies within RAN and O-RAN environments, thereby ensuring optimal performance and reliability in these advanced network infrastructures.
- [45] arXiv:2405.19483 [pdf, ps, other]
-
Title: IMEX methods for thin-film equations and Cahn-Hilliard equations with variable mobilityComments: 23 pages, 9 figures, accepted in 2024Journal-ref: Computational Materials Science -2024 - in productionSubjects: Numerical Analysis (math.NA); Mathematical Physics (math-ph)
We explore a class of splitting schemes employing implicit-explicit (IMEX) time-stepping to achieve accurate and energy-stable solutions for thin-film equations and Cahn-Hilliard models with variable mobility. This splitting method incorporates a linear, constant coefficient implicit step, facilitating efficient computational implementation. We investigate the influence of stabilizing splitting parameters on the numerical solution computationally, considering various initial conditions. Furthermore, we generate energy-stability plots for the proposed methods, examining different choices of splitting parameter values and timestep sizes. These methods enhance the accuracy of the original bi-harmonic-modified (BHM) approach, while preserving its energy-decreasing property and achieving second-order accuracy. We present numerical experiments to illustrate the performance of the proposed methods.
- [46] arXiv:2405.19487 [pdf, ps, other]
-
Title: A Full-duplex Speech Dialogue Scheme Based On Large Language ModelsSubjects: Computation and Language (cs.CL)
We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate simultaneously, allowing the system to simultaneously speak and listen to the user. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next token prediction on a serialized view of the dialogue in real-time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than 3 folds compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running a LLM with only 8 billion parameters, our system exhibits a 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.
- [47] arXiv:2405.19491 [pdf, ps, html, other]
-
Title: Calibration and Validation of a Phase-Field Model of Brittle Fracture within the Damage Mechanics ChallengeJonas Heinzmann, Pietro Carrara, Chenyi Luo, Manav Manav, Akanksha Mishra, Sindhu Nagaraja, Hamza Oudich, Francesco Vicentini, Laura De LorenzisSubjects: Computational Engineering, Finance, and Science (cs.CE)
In the context of the Damage Mechanics Challenge, we adopt a phase-field model of brittle fracture to blindly predict the behavior up to failure of a notched three-point-bending specimen loaded under mixed-mode conditions. The beam is additively manufactured using a geo-architected gypsum based on the combination of bassanite and a water-based binder. The calibration of the material parameters involved in the model is based on a set of available independent experimental tests and on a two-stage procedure. In the first stage an estimate of most of the elastic parameters is obtained, whereas the remaining parameters are optimized in the second stage so as to minimize the discrepancy between the numerical predictions and a set of experimental results on notched three-point-bending beams. The good agreement between numerical predictions and experimental results in terms of load-displacement curves and crack paths demonstrates the predictive ability of the model and the reliability of the calibration procedure.
- [48] arXiv:2405.19493 [pdf, ps, html, other]
-
Title: Fast Gaussian Distributed Pseudorandom Number Generation in Java via the Ziggurat AlgorithmSubjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Probability (math.PR)
We report on experiments with the ziggurat algorithm for generating Gaussian distributed random numbers. The study utilizes our open source Java implementation that was introduced originally for Java 11 at a time when the Java API only provided the much slower polar method. Our Java implementation of the ziggurat algorithm is a port of the GNU Scientific Library's C implementation. Java 17 introduced a significant overhaul of pseudorandom number generation, including several modern pseudorandom number generators (PRNGs) as well as additional functionality, among which includes switching from the polar method to a modified ziggurat algorithm. In the experiments of this paper, we explore whether there is still a need for our implementation for Java 17+ applications. Our results show that Java 17's modified ziggurat is faster than our implementation for the PRNGs that support it. However, Java 17+ continues to use the polar method for the legacy PRNGs Random, SecureRandom, and ThreadLocalRandom. The linear congruential method of Java's Random class lacks the statistical properties required by Java's modified ziggurat implementation; and SecureRandom and ThreadLocalRandom unfortunately use the polar method as a side-effect of extending Random. Our implementation of the original ziggurat algorithm does not require the same statistical properties of the underlying PRNG as Java 17's optimized version, and can be used with any of these PRNGs, and is especially relevant where pre-Java 17 support is required.
- [49] arXiv:2405.19498 [pdf, ps, html, other]
-
Title: Machine Psychology: Integrating Operant Conditioning with the Non-Axiomatic Reasoning System for Advancing Artificial General Intelligence ResearchSubjects: Artificial Intelligence (cs.AI)
This paper introduces an interdisciplinary framework called Machine Psychology, which merges principles from operant learning psychology with a specific Artificial Intelligence model, the Non-Axiomatic Reasoning System (NARS), to enhance Artificial General Intelligence (AGI) research. The core premise of this framework is that adaptation is crucial to both biological and artificial intelligence and can be understood through operant conditioning principles. The study assesses this approach via three operant learning tasks using OpenNARS for Applications (ONA): simple discrimination, changing contingencies, and conditional discrimination tasks.
In the simple discrimination task, NARS demonstrated rapid learning, achieving perfect accuracy during both training and testing phases. The changing contingencies task showcased NARS's adaptability, as it successfully adjusted its behavior when task conditions were reversed. In the conditional discrimination task, NARS handled complex learning scenarios effectively, achieving high accuracy by forming and utilizing intricate hypotheses based on conditional cues.
These findings support the application of operant conditioning as a framework for creating adaptive AGI systems. NARS's ability to operate under conditions of insufficient knowledge and resources, coupled with its sensorimotor reasoning capabilities, establishes it as a robust model for AGI. The Machine Psychology framework, by incorporating elements of natural intelligence such as continuous learning and goal-driven behavior, offers a scalable and flexible approach for real-world applications. Future research should investigate using enhanced NARS systems, more advanced tasks, and applying this framework to diverse, complex challenges to further progress the development of human-level AI. - [50] arXiv:2405.19499 [pdf, ps, html, other]
-
Title: Momentum for the Win: Collaborative Federated Reinforcement Learning across Heterogeneous EnvironmentsJournal-ref: Proceedings of the 41st International Conference on Machine Learning, 2024 LearningSubjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
We explore a Federated Reinforcement Learning (FRL) problem where $N$ agents collaboratively learn a common policy without sharing their trajectory data. To date, existing FRL work has primarily focused on agents operating in the same or ``similar" environments. In contrast, our problem setup allows for arbitrarily large levels of environment heterogeneity. To obtain the optimal policy which maximizes the average performance across all potentially completely different environments, we propose two algorithms: FedSVRPG-M and FedHAPG-M. In contrast to existing results, we demonstrate that both FedSVRPG-M and FedHAPG-M, both of which leverage momentum mechanisms, can exactly converge to a stationary point of the average performance function, regardless of the magnitude of environment heterogeneity. Furthermore, by incorporating the benefits of variance-reduction techniques or Hessian approximation, both algorithms achieve state-of-the-art convergence results, characterized by a sample complexity of $\mathcal{O}\left(\epsilon^{-\frac{3}{2}}/N\right)$. Notably, our algorithms enjoy linear convergence speedups with respect to the number of agents, highlighting the benefit of collaboration among agents in finding a common policy.
- [51] arXiv:2405.19500 [pdf, ps, html, other]
-
Title: The Numerical Solution of the External Dirichlet Generalized Harmonic Problem for a Sphere by the Method of Probabilistic SolutionMamuli Zakradze, Zaza Tabagari, Nana Koblishvili, Tinatin Davitashvili, Jose Maria Sanchez, Francisco Criado-AldeanuevaJournal-ref: Mathematics 11, 3 (2023)Subjects: Numerical Analysis (math.NA)
In the present paper, an algorithm for the numerical solution of the external Dirichlet generalized harmonic problem for a sphere by the method of probabilistic solution (MPS) is given, where generalized indicates that a boundary function has a finite number of first kind discontinuity curves. The algorithm consists of the following main stages: (1) the transition from an infinite domain to a finite domain by an inversion; (2) the consideration of a new Dirichlet generalized harmonic problem on the basis of Kelvin theorem for the obtained finite domain; (3) the numerical solution of the new problem for the finite domain by the MPS, which in turn is based on a computer simulation of the Weiner process; (4) finding the probabilistic solution of the posed generalized problem at any fixed points of the infinite domain by the solution of the new problem. For illustration, numerical examples are considered and results are presented.
- [52] arXiv:2405.19501 [pdf, ps, html, other]
-
Title: MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision TransformerSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In this paper, we present a novel methodology we call MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network) for enhancing visual saliency prediction or eye-tracking. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional ImageNet backbone. The framework adopts an encoder-decoder structure, with the encoder utilizing a Swin transformer to efficiently embed most important features. This process involves a Transfer Learning method, wherein layers from the Vision Transformer are converted by the Encoder Transformer and seamlessly integrated into a CNN Decoder. This methodology ensures minimal information loss from the original input image. The decoder employs a multi-decoding technique, utilizing dual decoders to generate two distinct attention maps. These maps are subsequently combined into a singular output via an additional CNN model. Our trained model MDS-ViTNet achieves state-of-the-art results across several benchmarks. Committed to fostering further collaboration, we intend to make our code, models, and datasets accessible to the public.
- [53] arXiv:2405.19504 [pdf, ps, html, other]
-
Title: MUVERA: Multi-Vector Retrieval via Fixed Dimensional EncodingsSubjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Information Retrieval (cs.IR)
Neural embedding models have become a fundamental component of modern information retrieval (IR) pipelines. These models produce a single embedding $x \in \mathbb{R}^d$ per data-point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT paper, multi-vector models, which produce a set of embedding per data point, have achieved markedly superior performance for IR tasks. Unfortunately, using these models for IR is computationally expensive due to the increased complexity of multi-vector retrieval and scoring.
In this paper, we introduce MUVERA (MUlti-VEctor Retrieval Algorithm), a retrieval mechanism which reduces multi-vector similarity search to single-vector similarity search. This enables the usage of off-the-shelf MIPS solvers for multi-vector retrieval. MUVERA asymmetrically generates Fixed Dimensional Encodings (FDEs) of queries and documents, which are vectors whose inner product approximates multi-vector similarity. We prove that FDEs give high-quality $\epsilon$-approximations, thus providing the first single-vector proxy for multi-vector similarity with theoretical guarantees. Empirically, we find that FDEs achieve the same recall as prior state-of-the-art heuristics while retrieving 2-5$\times$ fewer candidates. Compared to prior state of the art implementations, MUVERA achieves consistently good end-to-end recall and latency across a diverse set of the BEIR retrieval datasets, achieving an average of 10$\%$ improved recall with $90\%$ lower latency. - [54] arXiv:2405.19507 [pdf, ps, html, other]
-
Title: Data-Efficient Discovery of Hyperelastic TPMS Metamaterials with Extreme Energy DissipationSubjects: Graphics (cs.GR)
Triply periodic minimal surfaces (TPMS) are a class of metamaterials with a variety of applications and well-known primitives. We present a new method for discovering novel microscale TPMS structures with exceptional energy-dissipation capabilities, achieving double the energy absorption of the best existing TPMS primitive structure. Our approach employs a parametric representation, allowing seamless interpolation between structures and representing a rich TPMS design space. We show that simulations are intractable for optimizing microscale hyperelastic structures, and instead propose a sample-efficient computational strategy for rapidly discovering structures with extreme energy dissipation using limited amounts of empirical data from 3D-printed and tested microscale metamaterials. This strategy ensures high-fidelity results but involves time-consuming 3D printing and testing. To address this, we leverage an uncertainty-aware Deep Ensembles model to predict microstructure behaviors and identify which structures to 3D-print and test next. We iteratively refine our model through batch Bayesian optimization, selecting structures for fabrication that maximize exploration of the performance space and exploitation of our energy-dissipation objective. Using our method, we produce the first open-source dataset of hyperelastic microscale TPMS structures, including a set of novel structures that demonstrate extreme energy dissipation capabilities. We show several potential applications of these structures in protective equipment and bone implants.
- [55] arXiv:2405.19509 [pdf, ps, html, other]
-
Title: Leveraging partial stragglers within gradient codingComments: 12 pages, 7 figuresSubjects: Information Theory (cs.IT)
Within distributed learning, workers typically compute gradients on their assigned dataset chunks and send them to the parameter server (PS), which aggregates them to compute either an exact or approximate version of $\nabla L$ (gradient of the loss function $L$). However, in large-scale clusters, many workers are slower than their promised speed or even failure-prone. A gradient coding solution introduces redundancy within the assignment of chunks to the workers and uses coding theoretic ideas to allow the PS to recover $\nabla L$ (exactly or approximately), even in the presence of stragglers. Unfortunately, most existing gradient coding protocols are inefficient from a computation perspective as they coarsely classify workers as operational or failed; the potentially valuable work performed by slow workers (partial stragglers) is ignored. In this work, we present novel gradient coding protocols that judiciously leverage the work performed by partial stragglers. Our protocols are efficient from a computation and communication perspective and numerically stable. For an important class of chunk assignments, we present efficient algorithms for optimizing the relative ordering of chunks within the workers; this ordering affects the overall execution time. For exact gradient reconstruction, our protocol is around $2\times$ faster than the original class of protocols and for approximate gradient reconstruction, the mean-squared-error of our reconstructed gradient is several orders of magnitude better.
- [56] arXiv:2405.19513 [pdf, ps, html, other]
-
Title: Decentralized Optimization in Time-Varying Networks with Arbitrary DelaysComments: arXiv admin note: text overlap with arXiv:2401.11344Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
We consider a decentralized optimization problem for networks affected by communication delays. Examples of such networks include collaborative machine learning, sensor networks, and multi-agent systems. To mimic communication delays, we add virtual non-computing nodes to the network, resulting in directed graphs. This motivates investigating decentralized optimization solutions on directed graphs. Existing solutions assume nodes know their out-degrees, resulting in limited applicability. To overcome this limitation, we introduce a novel gossip-based algorithm, called DT-GO, that does not need to know the out-degrees. The algorithm is applicable in general directed networks, for example networks with delays or limited acknowledgment capabilities. We derive convergence rates for both convex and non-convex objectives, showing that our algorithm achieves the same complexity order as centralized Stochastic Gradient Descent. In other words, the effects of the graph topology and delays are confined to higher-order terms. Additionally, we extend our analysis to accommodate time-varying network topologies. Numerical simulations are provided to support our theoretical findings.
- [57] arXiv:2405.19514 [pdf, ps, html, other]
-
Title: Wavefront Threading Enables Effective High-Level SynthesisBlake Pelton, Adam Sapek, Ken Eguro, Daniel Lo, Alessandro Forin, Matt Humphrey, Jinwen Xi, David Cox, Rajas Karandikar, Johannes de Fine Licht, Evgeny Babin, Adrian Caulfield, Doug BurgerComments: Accepted to PLDI'24Subjects: Programming Languages (cs.PL)
Digital systems are growing in importance and computing hardware is growing more heterogeneous. Hardware design, however, remains laborious and expensive, in part due to the limitations of conventional hardware description languages (HDLs) like VHDL and Verilog. A longstanding research goal has been programming hardware like software, with high-level languages that can generate efficient hardware designs. This paper describes Kanagawa, a language that takes a new approach to combine the programmer productivity benefits of traditional High-Level Synthesis (HLS) approaches with the expressibility and hardware efficiency of Register-Transfer Level (RTL) design. The language's concise syntax, matched with a hardware design-friendly execution model, permits a relatively simple toolchain to map high-level code into efficient hardware implementations.
- [58] arXiv:2405.19519 [pdf, ps, html, other]
-
Title: Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit dataSudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed SarkerSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Retrieval augmented generation (RAG) provides the capability to constrain generative model outputs, and mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, thus limiting the volume of knowledge from which to generate an answer. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof-of-concept for this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource constrained settings to enable researchers in obtaining near real-time data from users.
- [59] arXiv:2405.19521 [pdf, ps, html, other]
-
Title: Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous ItemsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In applied statistics and machine learning, the "gold standards" used for training are often biased and almost always noisy. Dawid and Skene's justifiably popular crowdsourcing model adjusts for rater (coder, annotator) sensitivity and specificity, but fails to capture distributional properties of rating data gathered for training, which in turn biases training. In this study, we introduce a general purpose measurement-error model with which we can infer consensus categories by adding item-level effects for difficulty, discriminativeness, and guessability. We further show how to constrain the bimodal posterior of these models to avoid (or if necessary, allow) adversarial raters. We validate our model's goodness of fit with posterior predictive checks, the Bayesian analogue of $\chi^2$ tests. Dawid and Skene's model is rejected by goodness of fit tests, whereas our new model, which adjusts for item heterogeneity, is not rejected. We illustrate our new model with two well-studied data sets, binary rating data for caries in dental X-rays and implication in natural language.
- [60] arXiv:2405.19522 [pdf, ps, other]
-
Title: Artificial Intelligence Index Report 2024Nestor Maslej, Loredana Fattorini, Raymond Perrault, Vanessa Parli, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, Jack ClarkSubjects: Artificial Intelligence (cs.AI)
The 2024 Index is our most comprehensive to date and arrives at an important moment when AI's influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development. Featuring more original data than ever before, this edition introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and an entirely new chapter dedicated to AI's impact on science and medicine. The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI). Our mission is to provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI. The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on artificial intelligence. Previous editions have been cited in major newspapers, including the The New York Times, Bloomberg, and The Guardian, have amassed hundreds of academic citations, and been referenced by high-level policymakers in the United States, the United Kingdom, and the European Union, among other places. This year's edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.
- [61] arXiv:2405.19524 [pdf, ps, html, other]
-
Title: AI Risk Management Should Incorporate Both Safety and SecurityXiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, Luxi He, Kaixuan Huang, Udari Madhushani, Vikash Sehwag, Weijia Shi, Boyi Wei, Tinghao Xie, Danqi Chen, Pin-Yu Chen, Jeffrey Ding, Ruoxi Jia, Jiaqi Ma, Arvind Narayanan, Weijie J Su, Mengdi Wang, Chaowei Xiao, Bo Li, Dawn Song, Peter Henderson, Prateek MittalSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
The exposure of security vulnerabilities in safety-aligned language models, e.g., susceptibility to adversarial attacks, has shed light on the intricate interplay between AI safety and AI security. Although the two disciplines now come together under the overarching goal of AI risk management, they have historically evolved separately, giving rise to differing perspectives. Therefore, in this paper, we advocate that stakeholders in AI risk management should be aware of the nuances, synergies, and interplay between safety and security, and unambiguously take into account the perspectives of both disciplines in order to devise mostly effective and holistic risk mitigation approaches. Unfortunately, this vision is often obfuscated, as the definitions of the basic concepts of "safety" and "security" themselves are often inconsistent and lack consensus across communities. With AI risk management being increasingly cross-disciplinary, this issue is particularly salient. In light of this conceptual challenge, we introduce a unified reference framework to clarify the differences and interplay between AI safety and AI security, aiming to facilitate a shared understanding and effective collaboration across communities.
- [62] arXiv:2405.19525 [pdf, ps, html, other]
-
Title: Lifelong Learning Using a Dynamically Growing Tree of Sub-networks for Domain Generalization in Video Object SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Current state-of-the-art video object segmentation models have achieved great success using supervised learning with massive labeled training datasets. However, these models are trained using a single source domain and evaluated using videos sampled from the same source domain. When these models are evaluated using videos sampled from a different target domain, their performance degrades significantly due to poor domain generalization, i.e., their inability to learn from multi-domain sources simultaneously using traditional supervised learning. In this paper, We propose a dynamically growing tree of sub-networks (DGT) to learn effectively from multi-domain sources. DGT uses a novel lifelong learning technique that allows the model to continuously and effectively learn from new domains without forgetting the previously learned domains. Hence, the model can generalize to out-of-domain videos. The proposed work is evaluated using single-source in-domain (traditional video object segmentation), multi-source in-domain, and multi-source out-of-domain video object segmentation. The results of DGT show a single source in-domain performance gain of 0.2% and 3.5% on the DAVIS16 and DAVIS17 datasets, respectively. However, when DGT is evaluated using in-domain multi-sources, the results show superior performance compared to state-of-the-art video object segmentation and other lifelong learning techniques with an average performance increase in the F-score of 6.9% with minimal catastrophic forgetting. Finally, in the out-of-domain experiment, the performance of DGT is 2.7% and 4% better than state-of-the-art in 1 and 5-shots, respectively.
- [63] arXiv:2405.19527 [pdf, ps, other]
-
Title: Flexible Agent-based Modeling Framework to Evaluate Integrated Microtransit and Fixed-route Transit Designs: Mode Choice, Supernetworks, and Fleet SimulationComments: 49 pages, 25 figures, 8 tables; Submitted To: Transportation Research Part C: Emerging Technologies on May 1st, 2024Subjects: Systems and Control (eess.SY)
The integration of traditional fixed-route transit (FRT) and more flexible microtransit has been touted as a means of improving mobility and access to opportunity, increasing transit ridership, and promoting environmental sustainability. To help evaluate integrated FRT and microtransit public transit (PT) system (henceforth ``integrated fixed-flex PT system'') designs, we propose a high-fidelity modeling framework that provides reliable estimates for a wide range of (i) performance metrics and (ii) integrated fixed-flex PT system designs. We formulate the mode choice equilibrium problem as a fixed-point problem wherein microtransit demand is a function of microtransit performance, and microtransit performance depends on microtransit demand. We propose a detailed agent-based simulation modeling framework that includes (i) a binary logit mode choice model (private auto vs. transit), (ii) a supernetwork-based model and pathfinding algorithm for multi-modal transit path choice where the supernetwork includes pedestrian, FRT, and microtransit layers, (iii) a detailed mobility-on-demand fleet simulator called FleetPy to model the supply-demand dynamics of the microtransit service. In this paper, we illustrate the capabilities of the modeling framework by analyzing integrated fixed-flex PT system designs that vary the following design parameters: FRT frequencies and microtransit fleet size, service region structure, virtual stop coverage, and operating hours. We include case studies in downtown San Diego and Lemon Grove, California. The computational results show that the proposed modeling framework converges to a mode choice equilibrium. Moreover, the scenario results imply that introducing a new microtransit service decreases FRT ridership and requires additional subsidies, but it significantly increases job accessibility and slightly reduces total VMT.
- [64] arXiv:2405.19528 [pdf, ps, html, other]
-
Title: Predicting Long-Term Human Behaviors in Discrete Representations via Physics-Guided DiffusionSubjects: Robotics (cs.RO)
Long-term human trajectory prediction is a challenging yet critical task in robotics and autonomous systems. Prior work that studied how to predict accurate short-term human trajectories with only unimodal features often failed in long-term prediction. Reinforcement learning provides a good solution for learning human long-term behaviors but can suffer from challenges in data efficiency and optimization. In this work, we propose a long-term human trajectory forecasting framework that leverages a guided diffusion model to generate diverse long-term human behaviors in a high-level latent action space, obtained via a hierarchical action quantization scheme using a VQ-VAE to discretize continuous trajectories and the available context. The latent actions are predicted by our guided diffusion model, which uses physics-inspired guidance at test time to constrain generated multimodal action distributions. Specifically, we use reachability analysis during the reverse denoising process to guide the diffusion steps toward physically feasible latent actions. We evaluate our framework on two publicly available human trajectory forecasting datasets: SFU-Store-Nav and JRDB, and extensive experimental results show that our framework achieves superior performance in long-term human trajectory forecasting.
- [65] arXiv:2405.19531 [pdf, ps, html, other]
-
Title: Real-Time Dynamic Robot-Assisted Hand-Object Interaction via Motion PrimitivesComments: 8 pages, 10 figuresSubjects: Robotics (cs.RO); Machine Learning (cs.LG)
Advances in artificial intelligence (AI) have been propelling the evolution of human-robot interaction (HRI) technologies. However, significant challenges remain in achieving seamless interactions, particularly in tasks requiring physical contact with humans. These challenges arise from the need for accurate real-time perception of human actions, adaptive control algorithms for robots, and the effective coordination between human and robotic movements. In this paper, we propose an approach to enhancing physical HRI with a focus on dynamic robot-assisted hand-object interaction (HOI). Our methodology integrates hand pose estimation, adaptive robot control, and motion primitives to facilitate human-robot collaboration. Specifically, we employ a transformer-based algorithm to perform real-time 3D modeling of human hands from single RGB images, based on which a motion primitives model (MPM) is designed to translate human hand motions into robotic actions. The robot's action implementation is dynamically fine-tuned using the continuously updated 3D hand models. Experimental validations, including a ring-wearing task, demonstrate the system's effectiveness in adapting to real-time movements and assisting in precise task executions.
- [66] arXiv:2405.19532 [pdf, ps, html, other]
-
Title: Contrasting Multiple Representations with the Multi-Marginal Matching GapComments: To be presented at ICML 2024Subjects: Machine Learning (cs.LG)
Learning meaningful representations of complex objects that can be seen through multiple ($k\geq 3$) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to $k$ views, either by instantiating $\tfrac12k(k-1)$ loss-pairs, or by using reduced embeddings, following a \textit{one vs. average-of-rest} strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to simultaneously incorporate all $k$ views. Given a batch of $n$ points, each seen as a $k$-tuple of views subsequently transformed into $k$ embeddings, our loss contrasts the cost of matching these $n$ ground-truth $k$-tuples with the MM-OT polymatching cost, which seeks $n$ optimally arranged $k$-tuples chosen within these $n\times k$ vectors. While the exponential complexity $O(n^k$) of the MM-OT problem may seem daunting, we show in experiments that a suitable generalization of the Sinkhorn algorithm for that problem can scale to, e.g., $k=3\sim 6$ views using mini-batches of size $64~\sim128$. Our experiments demonstrate improved performance over multiview extensions of pairwise losses, for both self-supervised and multimodal tasks.
- [67] arXiv:2405.19534 [pdf, ps, html, other]
-
Title: Preference Learning Algorithms Do Not Learn Preference RankingsAngelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun ChoSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via $\textit{ranking accuracy}$. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the $\textit{idealized ranking accuracy}$ that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant $\textit{alignment gap}$ -- $\textit{i.e.}$, a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
- [68] arXiv:2405.19538 [pdf, ps, html, other]
-
Title: CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts, Images and PatientsPierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, Curtis P. LanglotzComments: 13 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: this https URL Models are available at the following URL: this https URL
- [69] arXiv:2405.19540 [pdf, ps, html, other]
-
Title: Computing Low-Entropy Couplings for Large-Support DistributionsSamuel Sokota, Dylan Sam, Christian Schroeder de Witt, Spencer Compton, Jakob Foerster, J. Zico KolterSubjects: Information Theory (cs.IT); Cryptography and Security (cs.CR)
Minimum-entropy coupling (MEC) -- the process of finding a joint distribution with minimum entropy for given marginals -- has applications in areas such as causality and steganography. However, existing algorithms are either computationally intractable for large-support distributions or limited to specific distribution types and sensitive to hyperparameter choices. This work addresses these limitations by unifying a prior family of iterative MEC (IMEC) approaches into a generalized partition-based formalism. From this framework, we derive a novel IMEC algorithm called ARIMEC, capable of handling arbitrary discrete distributions, and introduce a method to make IMEC robust to suboptimal hyperparameter settings. These innovations facilitate the application of IMEC to high-throughput steganography with language models, among other settings. Our codebase is available at this https URL .
- [70] arXiv:2405.19544 [pdf, ps, html, other]
-
Title: One-Shot Safety Alignment for Large Language Models via Optimal DualizationSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
The growing safety concerns surrounding Large Language Models (LLMs) raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, common Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, thus greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based scenarios (MoCAN and PeCAN, respectively). A broad range of experiments demonstrate the effectiveness of our methods.
- [71] arXiv:2405.19547 [pdf, ps, html, other]
-
Title: CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive LearningComments: This paper supercedes our previous VAS paper (arXiv:2402.02055)Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Data selection has emerged as a core issue for large-scale visual-language model pretaining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce negCLIPLoss, a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Secondly, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark, DataComp~\cite{gadre2023datacomp}. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3\% improvement on ImageNet-1k and a 2.8\% improvement on 38 downstream evaluation tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods DFN~\cite{fang2023data} and HYPE~\cite{kim2024hype}, we can boost average performance on downstream tasks by 0.9\%, achieving a new state-of-the-art.
- [72] arXiv:2405.19548 [pdf, ps, html, other]
-
Title: RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement LearningComments: 25 pages, 19 figuresSubjects: Machine Learning (cs.LG)
Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks. However, extrinsic rewards frequently fall short in complex environments due to the significant human effort needed for their design and annotation. This limitation underscores the necessity for intrinsic rewards, which offer auxiliary and dense signals and can enable agents to learn in an unsupervised manner. Although various intrinsic reward formulations have been proposed, their implementation and optimization details are insufficiently explored and lack standardization, thereby hindering research progress. To address this gap, we introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward algorithms. Furthermore, we conduct an in-depth study that identifies critical implementation details and establishes well-justified standard practices in intrinsically-motivated RL. The source code for RLeXplore is available at this https URL.
- [73] arXiv:2405.19550 [pdf, ps, html, other]
-
Title: Stress-Testing Capability Elicitation With Password-Locked ModelsSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities. Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available, e.g. as may be the case when models' (hidden) capabilities exceed those of human demonstrators.
- [74] arXiv:2405.19554 [pdf, ps, html, other]
-
Title: Numerical analysis of a 1/2-equation model of turbulenceSubjects: Numerical Analysis (math.NA)
The recent 1/2-equation model of turbulence is a simplification of the standard Kolmogorov-Prandtl 1-equation URANS model. Surprisingly, initial numerical tests indicated that the 1/2-equation model produces comparable velocity statistics at reduced cost. It is also a test problem and first step for developing numerical analysis to address a full 1-equation model. This report begins the numerical analysis of the 1/2 equation model. Stability, convergence and error estimates are proven for a semi-discrete and fully discrete approximation. Finally, numerical tests are conducted to validate our convergence theory.
- [75] arXiv:2405.19559 [pdf, ps, html, other]
-
Title: Clustering Mixtures of Discrete Distributions: A Note on Mitra's AlgorithmSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this note, we provide a refined analysis of Mitra's algorithm \cite{mitra2008clustering} for classifying general discrete mixture distribution models. Built upon spectral clustering \cite{mcsherry2001spectral}, this algorithm offers compelling conditions for probability distributions. We enhance this analysis by tailoring the model to bipartite stochastic block models, resulting in more refined conditions. Compared to those derived in \cite{mitra2008clustering}, our improved separation conditions are obtained.
- [76] arXiv:2405.19561 [pdf, ps, html, other]
-
Title: Quo Vadis ChatGPT? From Large Language Models to Large Knowledge ModelsSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The startling success of ChatGPT and other large language models (LLMs) using transformer-based generative neural network architecture in applications such as natural language processing and image synthesis has many researchers excited about potential opportunities in process systems engineering (PSE). The almost human-like performance of LLMs in these areas is indeed very impressive, surprising, and a major breakthrough. Their capabilities are very useful in certain tasks, such as writing first drafts of documents, code writing assistance, text summarization, etc. However, their success is limited in highly scientific domains as they cannot yet reason, plan, or explain due to their lack of in-depth domain knowledge. This is a problem in domains such as chemical engineering as they are governed by fundamental laws of physics and chemistry (and biology), constitutive relations, and highly technical knowledge about materials, processes, and systems. Although purely data-driven machine learning has its immediate uses, the long-term success of AI in scientific and engineering domains would depend on developing hybrid AI systems that use first principles and technical knowledge effectively. We call these hybrid AI systems Large Knowledge Models (LKMs), as they will not be limited to only NLP-based techniques or NLP-like applications. In this paper, we discuss the challenges and opportunities in developing such systems in chemical engineering.
- [77] arXiv:2405.19562 [pdf, ps, html, other]
-
Title: Selective ExplanationsSubjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Feature attribution methods explain black-box machine learning (ML) models by assigning importance scores to input features. These methods can be computationally expensive for large ML models. To address this challenge, there has been increasing efforts to develop amortized explainers, where a machine learning model is trained to predict feature attribution scores with only one inference. Despite their efficiency, amortized explainers can produce inaccurate predictions and misleading explanations. In this paper, we propose selective explanations, a novel feature attribution method that (i) detects when amortized explainers generate low-quality explanations and (ii) improves these explanations using a technique called explanations with initial guess. Our selective explanation method allows practitioners to specify the fraction of samples that receive explanations with initial guess, offering a principled way to bridge the gap between amortized explainers and their high-quality counterparts.
- [78] arXiv:2405.19563 [pdf, ps, html, other]
-
Title: Unlearning Climate Misinformation in Large Language ModelsMichael Fore, Simranjit Singh, Chaehong Lee, Amritanshu Pandey, Antonios Anastasopoulos, Dimitrios StamoulisSubjects: Computation and Language (cs.CL)
Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model's responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks.
- [79] arXiv:2405.19567 [pdf, ps, html, other]
-
Title: Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical GroundingShenghuan Sun, Gregory M. Goldgof, Alexander Schubert, Zhiqing Sun, Thomas Hartvigsen, Atul J. Butte, Ahmed AlaaComments: Code available at: this https URLSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. This challenge is particularly pronounced in the medical domain, where we do not only require VLM outputs to be accurate in single interactions but also to be consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. For this purpose, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are utilized to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.
- [80] arXiv:2405.19568 [pdf, ps, html, other]
-
Title: Organizing Background to Explore Latent Classes for Incremental Few-shot Semantic SegmentationComments: 10 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
The goal of incremental Few-shot Semantic Segmentation (iFSS) is to extend pre-trained segmentation models to new classes via few annotated images without access to old training data. During incrementally learning novel classes, the data distribution of old classes will be destroyed, leading to catastrophic forgetting. Meanwhile, the novel classes have only few samples, making models impossible to learn the satisfying representations of novel classes. For the iFSS problem, we propose a network called OINet, i.e., the background embedding space \textbf{O}rganization and prototype \textbf{I}nherit Network. Specifically, when training base classes, OINet uses multiple classification heads for the background and sets multiple sub-class prototypes to reserve embedding space for the latent novel classes. During incrementally learning novel classes, we propose a strategy to select the sub-class prototypes that best match the current learning novel classes and make the novel classes inherit the selected prototypes' embedding space. This operation allows the novel classes to be registered in the embedding space using few samples without affecting the distribution of the base classes. Results on Pascal-VOC and COCO show that OINet achieves a new state of the art.
- [81] arXiv:2405.19569 [pdf, ps, html, other]
-
Title: Improved Convex Decomposition with Ensembling and Boolean PrimitivesComments: 15 pages, 8 figures, 5 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Describing a scene in terms of primitives -- geometrically simple shapes that offer a parsimonious but accurate abstraction of structure -- is an established vision problem. This is a good model of a difficult fitting problem: different scenes require different numbers of primitives and primitives interact strongly, but any proposed solution can be evaluated at inference time. The state of the art method involves a learned regression procedure to predict a start point consisting of a fixed number of primitives, followed by a descent method to refine the geometry and remove redundant primitives. Methods are evaluated by accuracy in depth and normal prediction and in scene segmentation. This paper shows that very significant improvements in accuracy can be obtained by (a) incorporating a small number of negative primitives and (b) ensembling over a number of different regression procedures. Ensembling is by refining each predicted start point, then choosing the best by fitting loss. Extensive experiments on a standard dataset confirm that negative primitives are useful in a large fraction of images, and that our refine-then-choose strategy outperforms choose-then-refine, confirming that the fitting problem is very difficult.
- [82] arXiv:2405.19570 [pdf, ps, html, other]
-
Title: Distributed Online Planning for Min-Max Problems in Networked Markov GamesComments: Accepted to appear in the IEEE Robotics and Automation LettersSubjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Min-max problems are important in multi-agent sequential decision-making because they improve the performance of the worst-performing agent in the network. However, solving the multi-agent min-max problem is challenging. We propose a modular, distributed, online planning-based algorithm that is able to approximate the solution of the min-max objective in networked Markov games, assuming that the agents communicate within a network topology and the transition and reward functions are neighborhood-dependent. This set-up is encountered in the multi-robot setting. Our method consists of two phases at every planning step. In the first phase, each agent obtains sample returns based on its local reward function, by performing online planning. Using the samples from online planning, each agent constructs a concave approximation of its underlying local return as a function of only the action of its neighborhood at the next planning step. In the second phase, the agents deploy a distributed optimization framework that converges to the optimal immediate next action for each agent, based on the function approximations of the first phase. We demonstrate our algorithm's performance through formation control simulations.
- [83] arXiv:2405.19572 [pdf, ps, html, other]
-
Title: Blind Image Restoration via Fast Diffusion InversionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, various methods have been proposed to solve Image Restoration (IR) tasks using a pre-trained diffusion model leading to state-of-the-art performance. However, most of these methods assume that the degradation operator in the IR task is completely known. Furthermore, a common characteristic among these approaches is that they alter the diffusion sampling process in order to satisfy the consistency with the degraded input image. This choice has recently been shown to be sub-optimal and to cause the restored image to deviate from the data manifold. To address these issues, we propose Blind Image Restoration via fast Diffusion inversion (BIRD) a blind IR method that jointly optimizes for the degradation model parameters and the restored image. To ensure that the restored images lie onto the data manifold, we propose a novel sampling technique on a pre-trained diffusion model. A key idea in our method is not to modify the reverse sampling, i.e., not to alter all the intermediate latents, once an initial noise is sampled. This is ultimately equivalent to casting the IR task as an optimization problem in the space of the input noise. Moreover, to mitigate the computational cost associated with inverting a fully unrolled diffusion model, we leverage the inherent capability of these models to skip ahead in the forward diffusion process using large time steps. We experimentally validate BIRD on several image restoration tasks and show that it achieves state of the art performance on all of them. Our code is available at this https URL.
- [84] arXiv:2405.19575 [pdf, ps, html, other]
-
Title: A Deep Convolutional Neural Network-based Model for Aspect and Polarity Classification in Hausa Movie ReviewsComments: To be published in the proceedings of ICCAIT 2023Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Aspect-based Sentiment Analysis (ABSA) is crucial for understanding sentiment nuances in text, especially across diverse languages and cultures. This paper introduces a novel Deep Convolutional Neural Network (CNN)-based model tailored for aspect and polarity classification in Hausa movie reviews, an underrepresented language in sentiment analysis research. A comprehensive Hausa ABSA dataset is created, filling a significant gap in resource availability. The dataset, preprocessed using sci-kit-learn for TF-IDF transformation, includes manually annotated aspect-level feature ontology words and sentiment polarity assignments. The proposed model combines CNNs with attention mechanisms for aspect-word prediction, leveraging contextual information and sentiment polarities. With 91% accuracy on aspect term extraction and 92% on sentiment polarity classification, the model outperforms traditional machine models, offering insights into specific aspects and sentiments. This study advances ABSA research, particularly in underrepresented languages, with implications for cross-cultural linguistic research.
- [85] arXiv:2405.19576 [pdf, ps, html, other]
-
Title: Transforming Information Systems Management: A Reference Model for Digital Engineering IntegrationSubjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Digital engineering practices offer significant yet underutilized potential for improving information assurance and system lifecycle management. This paper examines how capabilities like model-based engineering, digital threads, and integrated product lifecycles can address gaps in prevailing frameworks. A reference model demonstrates applying digital engineering techniques to a reference information system, exhibiting enhanced traceability, risk visibility, accuracy, and integration. The model links strategic needs to requirements and architecture while reusing authoritative elements across views. Analysis of the model shows digital engineering closes gaps in compliance, monitoring, change management, and risk assessment. Findings indicate purposeful digital engineering adoption could transform cybersecurity, operations, service delivery, and system governance through comprehensive digital system representations. This research provides a foundation for maturing application of digital engineering for information systems as organizations modernize infrastructure and pursue digital transformation.
- [86] arXiv:2405.19578 [pdf, ps, html, other]
-
Title: The Accuracy of Domain Specific and Descriptive Analysis Generated by Large Language ModelsComments: 12 pagesSubjects: Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)
Large language models (LLMs) have attracted considerable attention as they are capable of showcasing impressive capabilities generating comparable high-quality responses to human inputs. LLMs, can not only compose textual scripts such as emails and essays but also executable programming code. Contrary, the automated reasoning capability of these LLMs in performing statistically-driven descriptive analysis, particularly on user-specific data and as personal assistants to users with limited background knowledge in an application domain who would like to carry out basic, as well as advanced statistical and domain-specific analysis is not yet fully explored. More importantly, the performance of these LLMs has not been compared and discussed in detail when domain-specific data analysis tasks are needed. This study, consequently, explores whether LLMs can be used as generative AI-based personal assistants to users with minimal background knowledge in an application domain infer key data insights. To demonstrate the performance of the LLMs, the study reports a case study through which descriptive statistical analysis, as well as Natural Language Processing (NLP) based investigations, are performed on a number of phishing emails with the objective of comparing the accuracy of the results generated by LLMs to the ones produced by analysts. The experimental results show that LangChain and the Generative Pre-trained Transformer (GPT-4) excel in numerical reasoning tasks i.e., temporal statistical analysis, achieve competitive correlation with human judgments on feature engineering tasks while struggle to some extent on domain specific knowledge reasoning, where domain-specific knowledge is required.
- [87] arXiv:2405.19580 [pdf, ps, html, other]
-
Title: Facilitating Mixed-Methods Analysis with Computational NotebooksComments: Appeared at 1st ACM CHI Workshop on Human-Notebook InteractionsSubjects: Human-Computer Interaction (cs.HC)
Data exploration is an important aspect of the workflow of mixed-methods researchers, who conduct both qualitative and quantitative analysis. However, there currently exists few tools that adequately support both types of analysis simultaneously, forcing researchers to context-switch between different tools and increasing their mental burden when integrating the results. To address this gap, we propose a unified environment that facilitates mixed-methods analysis in a computational notebook-based settings. We conduct a scenario study with three HCI mixed-methods researchers to gather feedback on our design concept and to understand our users' needs and requirements.
- [88] arXiv:2405.19581 [pdf, ps, html, other]
-
Title: Source Code Foundation Models are Transferable Binary Analysis Knowledge BasesSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary and source code, aiming to lift binary code to human-readable content relevant to source code, thereby bridging the binary-source semantic gap. Recent advancements in uni-modal code model pre-training, particularly in generative Source Code Foundation Models (SCFMs) and binary understanding models, have laid the groundwork for transfer learning applicable to HOBRE. However, existing approaches for HOBRE rely heavily on uni-modal models like SCFMs for supervised fine-tuning or general LLMs for prompting, resulting in sub-optimal performance. Inspired by recent progress in large multi-modal models, we propose that it is possible to harness the strengths of uni-modal code models from both sides to bridge the semantic gap effectively. In this paper, we introduce a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis. Our approach leverages the pre-trained knowledge within SCFMs to synthesize relevant, symbol-rich code fragments as context. This additional context enables black-box LLMs to enhance recovery accuracy. We demonstrate significant improvements in zero-shot binary summarization and binary function name recovery, with a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric for summarization, as well as a 6.7% and 7.4% absolute increase in token-level precision and recall for name recovery, respectively. These results highlight the effectiveness of our approach in automating and improving binary code analysis.
- [89] arXiv:2405.19582 [pdf, ps, html, other]
-
Title: Evaluation of Resonances via AAA Rational Approximation of Randomly Scalarized Boundary Integral ResolventsSubjects: Numerical Analysis (math.NA)
This paper presents a novel algorithm, based on use of rational approximants of randomly scalarized boundary integral resolvents, for the evaluation of acoustic and electromagnetic resonances in open and closed cavities; for simplicity we restrict treatment to cavities in two-dimensional space. The desired open cavity resonances (also known as ``eigenvalues'' for interior problems, and ``scattering poles'' for exterior and open problems) are obtained as the poles of associated rational approximants; both the approximants and their poles are obtained by means of the recently introduced AAA rational-approximation algorithm. In fact, the proposed resonance-search method applies to any nonlinear eigenvalue problem (NEP) associated with a given function $F: U \to \mathbb{C}^{n\times n}$, wherein a complex value $k$ is sought for which $F_kw = 0$ for some nonzero $w\in \mathbb{C}^n$. For the cavity problems considered in this paper, $F_k$ is taken as a spectrally discretized version of a Green function-based boundary integral operator at spatial frequency $k$. In all cases, the scalarized resolvent is given by an expression of the form $u^* F_k^{-1} v$, where $u,v \in \mathbb{C}^n$ are fixed random vectors. A variety of numerical results are presented for both scattering resonances and other NEPs, demonstrating the accuracy of the method even for high frequency states.
- [90] arXiv:2405.19586 [pdf, ps, html, other]
-
Title: SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied ManipulationComments: ICML 2024. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
- [91] arXiv:2405.19590 [pdf, ps, html, other]
-
Title: Weights Augmentation: it has never ever ever ever let her model downSubjects: Machine Learning (cs.LG)
Weight play an essential role in deep learning network models. Unlike network structure design, this article proposes the concept of weight augmentation, focusing on weight exploration. The core of Weight Augmentation Strategy (WAS) is to adopt random transformed weight coefficients training and transformed coefficients, named Shadow Weight(SW), for networks that can be used to calculate loss function to affect parameter updates. However, stochastic gradient descent is applied to Plain Weight(PW), which is referred to as the original weight of the network before the random transformation. During training, numerous SW collectively form high-dimensional space, while PW is directly learned from the distribution of SW instead of the data. The weight of the accuracy-oriented mode(AOM) relies on PW, which guarantees the network is highly robust and accurate. The desire-oriented mode(DOM) weight uses SW, which is determined by the network model's unique functions based on WAT's performance desires, such as lower computational complexity, lower sensitivity to particular data, etc. The dual mode be switched at anytime if needed. WAT extends the augmentation technique from data augmentation to weight, and it is easy to understand and implement, but it can improve almost all networks amazingly. Our experimental results show that convolutional neural networks, such as VGG-16, ResNet-18, ResNet-34, GoogleNet, MobilementV2, and Efficientment-Lite, can benefit much at little or no cost. The accuracy of models is on the CIFAR100 and CIFAR10 datasets, which can be evaluated to increase by 7.32\% and 9.28\%, respectively, with the highest values being 13.42\% and 18.93\%, respectively. In addition, DOM can reduce floating point operations (FLOPs) by up to 36.33\%. The code is available at this https URL.
- [92] arXiv:2405.19592 [pdf, ps, html, other]
-
Title: Why Larger Language Models Do In-context Learning Differently?Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large language models (LLM) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL), where they can perform well on unseen tasks based on a brief series of task examples without necessitating any adjustments to the model parameters. One recent interesting mysterious observation is that models of different scales may have different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically aiming to improve the understanding of LLM and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer multiple attention heads transformers (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention to and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.
- [93] arXiv:2405.19595 [pdf, ps, other]
-
Title: The RSNA Abdominal Traumatic Injury CT (RATIC) DatasetJeffrey D. Rudie, Hui-Ming Lin, Robyn L. Ball, Sabeena Jalal, Luciano M. Prevedello, Savvas Nicolaou, Brett S. Marinelli, Adam E. Flanders, Kirti Magudia, George Shih, Melissa A. Davis, John Mongan, Peter D. Chang, Ferco H. Berger, Sebastiaan Hermans, Meng Law, Tyler Richards, Jan-Peter Grunz, Andreas Steven Kunz, Shobhit Mathur, Sandro Galea-Soler, Andrew D. Chung, Saif Afat, Chin-Chi Kuo, Layal Aweidah, Ana Villanueva Campos, Arjuna Somasundaram, Felipe Antonio Sanchez Tijmes, Attaporn Jantarangkoon, Leonardo Kayat Bittencourt, Michael Brassil, Ayoub El Hajjami, Hakan Dogan, Muris Becircic, Agrahara G. Bharatkumar, Eduardo Moreno Júdice de Mattos Farina, Dataset Curator Group, Dataset Contributor Group, Dataset Annotator Group, Errol ColakComments: 40 pages, 2 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
The RSNA Abdominal Traumatic Injury CT (RATIC) dataset is the largest publicly available collection of adult abdominal CT studies annotated for traumatic injuries. This dataset includes 4,274 studies from 23 institutions across 14 countries. The dataset is freely available for non-commercial use via Kaggle at this https URL. Created for the RSNA 2023 Abdominal Trauma Detection competition, the dataset encourages the development of advanced machine learning models for detecting abdominal injuries on CT scans. The dataset encompasses detection and classification of traumatic injuries across multiple organs, including the liver, spleen, kidneys, bowel, and mesentery. Annotations were created by expert radiologists from the American Society of Emergency Radiology (ASER) and Society of Abdominal Radiology (SAR). The dataset is annotated at multiple levels, including the presence of injuries in three solid organs with injury grading, image-level annotations for active extravasations and bowel injury, and voxelwise segmentations of each of the potentially injured organs. With the release of this dataset, we hope to facilitate research and development in machine learning and abdominal trauma that can lead to improved patient care and outcomes.
- [94] arXiv:2405.19596 [pdf, ps, html, other]
-
Title: The weight hierarchies of three classes of linear codesSubjects: Information Theory (cs.IT)
Studying the generalized Hamming weights of linear codes is a significant research area within coding theory, as it provides valuable structural information about the codes and plays a crucial role in determining their performance in various applications. However, determining the generalized Hamming weights of linear codes, particularly their weight hierarchy, is generally a challenging task. In this paper, we focus on investigating the generalized Hamming weights of three classes of linear codes over finite fields. These codes are constructed by different defining sets. By analysing the intersections between the definition sets and the duals of all $r$-dimensional subspaces, we get the inequalities on the sizes of these intersections. Then constructing subspaces that reach the upper bounds of these inequalities, we successfully determine the complete weight hierarchies of these codes.
- [95] arXiv:2405.19597 [pdf, ps, html, other]
-
Title: SVFT: Parameter-Efficient Fine-Tuning with Singular VectorsVijay Lingam, Atula Tejaswi, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Alex Dimakis, Eunsol Choi, Aleksandar Bojchevski, Sujay SanghaviComments: 17 pages, 5 figures, 14 tablesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Popular parameter-efficient fine-tuning (PEFT) methods, such as LoRA and its variants, freeze pre-trained model weights \(W\) and inject learnable matrices \(\Delta W\). These \(\Delta W\) matrices are structured for efficient parameterization, often using techniques like low-rank approximations or scaling vectors. However, these methods typically show a performance gap compared to full fine-tuning. Although recent PEFT methods have narrowed this gap, they do so at the cost of additional learnable parameters. We propose SVFT, a simple approach that fundamentally differs from existing methods: the structure imposed on \(\Delta W\) depends on the specific weight matrix \(W\). Specifically, SVFT updates \(W\) as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations. This approach allows fine-grained control over expressivity through the number of coefficients. Extensive experiments on language and vision benchmarks show that SVFT recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance using 0.03 to 0.8% of the trainable parameter budget.
- [96] arXiv:2405.19598 [pdf, ps, html, other]
-
Title: Evaluating the Effectiveness and Robustness of Visual Similarity-based Phishing Detection ModelsComments: 12 pagesSubjects: Cryptography and Security (cs.CR)
Phishing attacks pose a significant threat to Internet users, with cybercriminals elaborately replicating the visual appearance of legitimate websites to deceive victims. Visual similarity-based detection systems have emerged as an effective countermeasure, but their effectiveness and robustness in real-world scenarios have been unexplored. In this paper, we comprehensively scrutinize and evaluate state-of-the-art visual similarity-based anti-phishing models using a large-scale dataset of 450K real-world phishing websites. Our analysis reveals that while certain models maintain high accuracy, others exhibit notably lower performance than results on curated datasets, highlighting the importance of real-world evaluation. In addition, we observe the real-world tactic of manipulating visual components that phishing attackers employ to circumvent the detection systems. To assess the resilience of existing models against adversarial attacks and robustness, we apply visible and perturbation-based manipulations to website logos, which adversaries typically target. We then evaluate the models' robustness in handling these adversarial samples. Our findings reveal vulnerabilities in several models, emphasizing the need for more robust visual similarity techniques capable of withstanding sophisticated evasion attempts. We provide actionable insights for enhancing the security of phishing defense systems, encouraging proactive actions. To the best of our knowledge, this work represents the first large-scale, systematic evaluation of visual similarity-based models for phishing detection in real-world settings, necessitating the development of more effective and robust defenses.
- [97] arXiv:2405.19600 [pdf, ps, html, other]
-
Title: Do spectral cues matter in contrast-based graph self-supervised learning?Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The recent surge in contrast-based graph self-supervised learning has prominently featured an intensified exploration of spectral cues. However, an intriguing paradox emerges, as methods grounded in seemingly conflicting assumptions or heuristic approaches regarding the spectral domain demonstrate notable enhancements in learning performance. This paradox prompts a critical inquiry into the genuine contribution of spectral information to contrast-based graph self-supervised learning. This study undertakes an extensive investigation into this inquiry, conducting a thorough study of the relationship between spectral characteristics and the learning outcomes of contemporary methodologies. Based on this analysis, we claim that the effectiveness and significance of spectral information need to be questioned. Instead, we revisit simple edge perturbation: random edge dropping designed for node-level self-supervised learning and random edge adding intended for graph-level self-supervised learning. Compelling evidence is presented that these simple yet effective strategies consistently yield superior performance while demanding significantly fewer computational resources compared to all prior spectral augmentation methods. The proposed insights represent a significant leap forward in the field, potentially reshaping the understanding and implementation of graph self-supervised learning.
- [98] arXiv:2405.19606 [pdf, ps, html, other]
-
Title: Relation Modeling and Distillation for Learning with Noisy LabelsSubjects: Artificial Intelligence (cs.AI)
Learning with noisy labels has become an effective strategy for enhancing the robustness of models, which enables models to better tolerate inaccurate data. Existing methods either focus on optimizing the loss function to mitigate the interference from noise, or design procedures to detect potential noise and correct errors. However, their effectiveness is often compromised in representation learning due to the dilemma where models overfit to noisy labels. To address this issue, this paper proposes a relation modeling and distillation framework that models inter-sample relationships via self-supervised learning and employs knowledge distillation to enhance understanding of latent associations, which mitigate the impact of noisy labels. Specifically, the proposed method, termed RMDNet, includes two main modules, where the relation modeling (RM) module implements the contrastive learning technique to learn representations of all data, an unsupervised approach that effectively eliminates the interference of noisy tags on feature extraction. The relation-guided representation learning (RGRL) module utilizes inter-sample relation learned from the RM module to calibrate the representation distribution for noisy samples, which is capable of improving the generalization of the model in the inference phase. Notably, the proposed RMDNet is a plug-and-play framework that can integrate multiple methods to its advantage. Extensive experiments were conducted on two datasets, including performance comparison, ablation study, in-depth analysis and case study. The results show that RMDNet can learn discriminative representations for noisy data, which results in superior performance than the existing methods.
- [99] arXiv:2405.19609 [pdf, ps, html, other]
-
Title: SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture AnnotationsYujiao Jiang, Qingmin Liao, Zhaolong Wang, Xiangru Lin, Zongqing Lu, Yuxi Zhao, Hanqing Wei, Jingrui Ye, Yu Zhang, Zhijing ShaoComments: ICME 2024;Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Recovering photorealistic and drivable full-body avatars is crucial for numerous applications, including virtual reality, 3D games, and tele-presence. Most methods, whether reconstruction or generation, require large numbers of human motion sequences and corresponding textured meshes. To easily learn a drivable avatar, a reasonable parametric body model with unified topology is paramount. However, existing human body datasets either have images or textured models and lack parametric models which fit clothes well. We propose a new parametric model SMPLX-Lite-D, which can fit detailed geometry of the scanned mesh while maintaining stable geometry in the face, hand and foot regions. We present SMPLX-Lite dataset, the most comprehensive clothing avatar dataset with multi-view RGB sequences, keypoints annotations, textured scanned meshes, and textured SMPLX-Lite-D models. With the SMPLX-Lite dataset, we train a conditional variational autoencoder model that takes human pose and facial keypoints as input, and generates a photorealistic drivable human avatar.
- [100] arXiv:2405.19612 [pdf, ps, html, other]
-
Title: Keyword-driven Retrieval-Augmented Large Language Models for Cold-start User RecommendationsComments: 10 pages, 10 figures, 4 tablesSubjects: Information Retrieval (cs.IR)
Recent advancements in Large Language Models (LLMs) have shown significant potential in enhancing recommender systems. However, addressing the cold-start recommendation problem, where users lack historical data, remains a considerable challenge. In this paper, we introduce KALM4Rec (Keyword-driven Retrieval-Augmented Large Language Models for Cold-start User Recommendations), a novel framework specifically designed to tackle this problem by requiring only a few input keywords from users in a practical scenario of cold-start user restaurant recommendations. KALM4Rec operates in two main stages: candidates retrieval and LLM-based candidates re-ranking. In the first stage, keyword-driven retrieval models are used to identify potential candidates, addressing LLMs' limitations in processing extensive tokens and reducing the risk of generating misleading information. In the second stage, we employ LLMs with various prompting strategies, including zero-shot and few-shot techniques, to re-rank these candidates by integrating multiple examples directly into the LLM prompts. Our evaluation, using a Yelp restaurant dataset with user reviews from three English-speaking cities, shows that our proposed framework significantly improves recommendation quality. Specifically, the integration of in-context instructions with LLMs for re-ranking markedly enhances the performance of the cold-start user recommender system.
- [101] arXiv:2405.19614 [pdf, ps, html, other]
-
Title: TAMBRIDGE: Bridging Frame-Centered Tracking and 3D Gaussian Splatting for Enhanced SLAMSubjects: Robotics (cs.RO)
The limited robustness of 3D Gaussian Splatting (3DGS) to motion blur and camera noise, along with its poor real-time performance, restricts its application in robotic SLAM tasks. Upon analysis, the primary causes of these issues are the density of views with motion blur and the cumulative errors in dense pose estimation from calculating losses based on noisy original images and rendering results, which increase the difficulty of 3DGS rendering convergence. Thus, a cutting-edge 3DGS-based SLAM system is introduced, leveraging the efficiency and flexibility of 3DGS to achieve real-time performance while remaining robust against sensor noise, motion blur, and the challenges posed by long-session SLAM. Central to this approach is the Fusion Bridge module, which seamlessly integrates tracking-centered ORB Visual Odometry with mapping-centered online 3DGS. Precise pose initialization is enabled by this module through joint optimization of re-projection and rendering loss, as well as strategic view selection, enhancing rendering convergence in large-scale scenes. Extensive experiments demonstrate state-of-the-art rendering quality and localization accuracy, positioning this system as a promising solution for real-world robotics applications that require stable, near-real-time performance. Our project is available at this https URL
- [102] arXiv:2405.19616 [pdf, ps, html, other]
-
Title: Easy Problems That LLMs Get WrongComments: AutogenAI Ltd. Associated code at this https URLSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease. It also highlights the potential of prompt engineering to mitigate some errors and underscores the necessity for better training methodologies. Our findings stress the importance of grounding LLMs with human reasoning and common sense, emphasising the need for human-in-the-loop for enterprise applications. We hope this work paves the way for future research to enhance the usefulness and reliability of new models.
- [103] arXiv:2405.19620 [pdf, ps, html, other]
-
Title: SparseDrive: End-to-End Autonomous Driving via Sparse Scene RepresentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
The well-established modular autonomous driving system is decoupled into different standalone tasks, e.g. perception, prediction and planning, suffering from information loss and error accumulation across modules. In contrast, end-to-end paradigms unify multi-tasks into a fully differentiable framework, allowing for optimization in a planning-oriented spirit. Despite the great potential of end-to-end paradigms, both the performance and efficiency of existing methods are not satisfactory, particularly in terms of planning safety. We attribute this to the computationally expensive BEV (bird's eye view) features and the straightforward design for prediction and planning. To this end, we explore the sparse representation and review the task design for end-to-end autonomous driving, proposing a new paradigm named SparseDrive. Concretely, SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. For motion prediction and planning, we review the great similarity between these two tasks, leading to a parallel design for motion planner. Based on this parallel design, which models planning as a multi-modal problem, we propose a hierarchical planning selection strategy , which incorporates a collision-aware rescore module, to select a rational and safe trajectory as the final planning output. With such effective designs, SparseDrive surpasses previous state-of-the-arts by a large margin in performance of all tasks, while achieving much higher training and inference efficiency. Code will be avaliable at this https URL for facilitating future research.
- [104] arXiv:2405.19622 [pdf, ps, html, other]
-
Title: On shortest products for nonnegative matrix mortalitySubjects: Discrete Mathematics (cs.DM); Formal Languages and Automata Theory (cs.FL); Combinatorics (math.CO)
Given a finite set of matrices with integer entries, the matrix mortality problem asks if there exists a product of these matrices equal to the zero matrix. We consider a special case of this problem where all entries of the matrices are nonnegative. This case is equivalent to the NFA mortality problem, which, given an NFA, asks for a word $w$ such that the image of every state under $w$ is the empty set. The size of the alphabet of the NFA is then equal to the number of matrices in the set. We study the length of shortest such words depending on the size of the alphabet. We show that this length for an NFA with $n$ states can be at least $2^n - 1$, $2^{(n - 4)/2}$ and $2^{(n - 2)/3}$ if the size of the alphabet is, respectively, equal to $n$, three and two.
- [105] arXiv:2405.19623 [pdf, ps, html, other]
-
Title: A Novel Approach for Automated Design Information Mining from Issue LogsSubjects: Software Engineering (cs.SE)
Software architectures are usually meticulously designed to address multiple quality concerns and support long-term maintenance. However, due to the imbalance between the cost and value for developers to document design rationales (i.e., the design alternatives and the underlying arguments for making or rejecting decisions), these rationales are often obsolete or even missing. The lack of design knowledge has motivated a number of studies to extract design information from various platforms in recent years. Unfortunately, despite the wealth of discussion records related to design information provided by platforms like open-source communities, existing research often overlooks the underlying arguments behind alternatives due to challenges such as the intricate semantics of discussions and the lack of benchmarks for design rationale extraction. In this paper, we propose a novel method, named by DRMiner, to automatically mine latent design rationales from developers' live discussion in open-source community (i.e., issue logs in Jira). To better identify solutions and the arguments supporting them, DRMiner skillfully decomposes the problem into multiple text classification tasks and tackles them using prompt tuning of language models and customized text-related features. To evaluate DRMiner, we acquire issue logs from Cassandra, Flink, and Solr repositories in Jira, and then annotate and process them under a rigorous scheme, ultimately forming a dataset for design rationale mining. Experimental results show that DRMiner achieves an F1 score of 65% for mining design rationales, outperforming all baselines with a 7% improvement over GPT-4.0. Furthermore, we investigate the usefulness of the design rationales mined by DRMiner for automated program repair (APR) and find that the design rationales significantly enhance APR, achieving 14 times higher full-match repairs on average.
- [106] arXiv:2405.19626 [pdf, ps, html, other]
-
Title: Position: CXL Shared Memory Programming: Barely Distributed and Almost PersistentSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
While Compute Express Link (CXL) enables support for cache-coherent shared memory among multiple nodes, it also introduces new types of failures--processes can fail before data does, or data might fail before a process does. The lack of a failure model for CXL-based shared memory makes it challenging to understand and mitigate these failures.
To solve these challenges, in this paper, we describe a model categorizing and handling the CXL-based shared memory's failures: data and process failures. Data failures in CXL-based shared memory render data inaccessible or inconsistent for a currently running application. We argue that such failures are unlike data failures in distributed storage systems and require CXL-specific handling. To address this, we look into traditional data failure mitigation techniques like erasure coding and replication and propose new solutions to better handle data failures in CXL-based shared memory systems. Next, we look into process failures and compare the failures and potential solutions with PMEM's failure model and programming solutions. We argue that although PMEM shares some of CXL's characteristics, it does not fully address CXL's volatile nature and low access latencies. Finally, taking inspiration from PMEM programming solutions, we propose techniques to handle these new failures.
Thus, this paper is the first work to define the CXL-based shared memory failure model and propose tailored solutions that address challenges specific to CXL-based systems. - [107] arXiv:2405.19628 [pdf, ps, other]
-
Title: Deep Learning Model for Detecting Abnormal Corn KernelsSubjects: Computational Engineering, Finance, and Science (cs.CE)
This research aims to detect the physical characteristics of corn kernels and analyze images using a deep learning model. The data analysis based on the CRISP-DM framework which consists of six steps, business understanding, data understanding, data preparation, modelling, evaluation, and deployment. The business goal reduces the cost of the separation of abnormal corn kernels. The dataset comprises 1,800 images of corn kernels and divided equally between normal and abnormal corn kernels. The dataset was divided into three subsets: 1,000 images for training the deep learning model, 600 images for validation and 200 images for testing. The tools for analysis in this research are Jupyter Lab, Python, TensorFlow Keras, and Convolutional Neural Networks. The results revealed that the deep learning model achieved the accuracy rate of 99% in differentiating between normal and abnormal corn kernel images that is a highly effective model in this context.
- [108] arXiv:2405.19629 [pdf, ps, html, other]
-
Title: YotoR-You Only Transform One RepresentationComments: 16 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper introduces YotoR (You Only Transform One Representation), a novel deep learning model for object detection that combines Swin Transformers and YoloR architectures. Transformers, a revolutionary technology in natural language processing, have also significantly impacted computer vision, offering the potential to enhance accuracy and computational efficiency. YotoR combines the robust Swin Transformer backbone with the YoloR neck and head. In our experiments, YotoR models TP5 and BP4 consistently outperform YoloR P6 and Swin Transformers in various evaluations, delivering improved object detection performance and faster inference speeds than Swin Transformer models. These results highlight the potential for further model combinations and improvements in real-time object detection with Transformers. The paper concludes by emphasizing the broader implications of YotoR, including its potential to enhance transformer-based models for image-related tasks.
- [109] arXiv:2405.19630 [pdf, ps, other]
-
Title: The use of a humanoid robot for older people with dementia in aged care facilitiesComments: Accepted for the Second Workshop on Care Robots for Older Adults (CROA), RO-MAN 2023, Busan, KoreaSubjects: Robotics (cs.RO)
This paper presents an interdisciplinary PhD project using a humanoid robot to encourage interactive activities for people with dementia living in two aged care facilities. The aim of the project was to develop software and use technologies to achieve successful robot-led engagement with older people with dementia. This paper outlines the qualitative findings from the project's feasibility stage. The researcher's observations, the participants' attitudes and the feedback from carers are presented and discussed.
- [110] arXiv:2405.19631 [pdf, ps, html, other]
-
Title: Leveraging Open-Source Large Language Models for encoding Social Determinants of Health using an Intelligent RouterSubjects: Artificial Intelligence (cs.AI)
Social Determinants of Health (SDOH) play a significant role in patient health outcomes. The Center of Disease Control (CDC) introduced a subset of ICD-10 codes called Z-codes in an attempt to officially recognize and measure SDOH in the health care system. However, these codes are rarely annotated in a patient's Electronic Health Record (EHR), and instead, in many cases, need to be inferred from clinical notes. Previous research has shown that large language models (LLMs) show promise on extracting unstructured data from EHRs. However, with thousands of models to choose from with unique architectures and training sets, it's difficult to choose one model that performs the best on coding tasks. Further, clinical notes contain trusted health information making the use of closed-source language models from commercial vendors difficult, so the identification of open source LLMs that can be run within health organizations and exhibits high performance on SDOH tasks is an urgent problem. Here, we introduce an intelligent routing system for SDOH coding that uses a language model router to direct medical record data to open source LLMs that demonstrate optimal performance on specific SDOH codes. The intelligent routing system exhibits state of the art performance of 97.4% accuracy averaged across 5 codes, including homelessness and food insecurity, on par with closed models such as GPT-4o. In order to train the routing system and validate models, we also introduce a synthetic data generation and validation paradigm to increase the scale of training data without needing privacy protected medical records. Together, we demonstrate an architecture for intelligent routing of inputs to task-optimal language models to achieve high performance across a set of medical coding sub-tasks.
- [111] arXiv:2405.19635 [pdf, ps, html, other]
-
Title: GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM DeploymentSubjects: Computation and Language (cs.CL)
The burgeoning size of Large Language Models (LLMs) has led to enhanced capabilities in generating responses, albeit at the expense of increased inference times and elevated resource demands. Existing methods of acceleration, predominantly hinged on knowledge distillation, generally necessitate fine-tuning of considerably large models, such as Llama-7B, posing a challenge for average users. Furthermore, present techniques for expediting inference and reducing costs operate independently. To address these issues, we introduce a novel and intuitive Guidance-based Knowledge Transfer (GKT) framework. This approach leverages a larger LLM as a ''teacher'' to create guidance prompts, paired with a smaller ''student'' model to finalize responses. Remarkably, GKT requires no fine-tuning and doesn't necessitate the teacher and student models to have the same vocabulary, allowing for extensive batch generation to accelerate the process while ensuring user customization. GKT can be seamlessly integrated into cloud-edge collaboration architectures, and is versatile enough for plug-and-play application across various models. It excels in both efficiency and affordability, epitomizing a ''cheap and cheerful'' solution. GKT achieves a maximum accuracy improvement of 14.18%, along with a 10.72 times speed-up on GSM8K and an accuracy improvement of 14.00 % along with a 7.73 times speed-up in CSQA. When utilizing ChatGPT as teacher model and Llama2-70B as the student model, we can achieve 95.00% of ChatGPT's performance at 52% of the cost. The results highlight substantial enhancements in accuracy and processing speed on the GSM8K and CSQA datasets, surpassing the performance of using either the student or teacher models in isolation.
- [112] arXiv:2405.19636 [pdf, ps, html, other]
-
Title: Creating Language-driven Spatial Variations of Icon ImagesSubjects: Graphics (cs.GR)
Editing 2D icon images can require significant manual effort from designers. It involves manipulating multiple geometries while maintaining the logical or physical coherence of the objects depicted in the image. Previous language driven image editing methods can change the texture and geometry of objects in the image but fail at producing spatial variations, i.e. modifying spatial relations between objects while maintaining their identities. We present a language driven editing method that can produce spatial variations of icon images. Our method takes in an icon image along with a user's editing request text prompt and outputs an edited icon image reflecting the user's editing request. Our method is designed based on two key observations: (1) A user's editing requests can be translated by a large language model (LLM), with help from a domain specific language (DSL) library, into to a set of geometrical constraints defining the relationships between segments in an icon image. (2) Optimizing the affine transformations of the segments with respect to these geometrical constraints can produce icon images that fulfill the editing request and preserve overall physical and logical coherence. Quantitative and qualitative results show that our system outperforms multiple baselines, enabling natural editing of icon images.
- [113] arXiv:2405.19638 [pdf, ps, html, other]
-
Title: Learning Robust Correlation with Foundation Model for Weakly-Supervised Few-Shot SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Existing few-shot segmentation (FSS) only considers learning support-query correlation and segmenting unseen categories under the precise pixel masks. However, the cost of a large number of pixel masks during training is expensive. This paper considers a more challenging scenario, weakly-supervised few-shot segmentation (WS-FSS), which only provides category ($i.e.$ image-level) labels. It requires the model to learn robust support-query information when the generated mask is inaccurate. In this work, we design a Correlation Enhancement Network (CORENet) with foundation model, which utilizes multi-information guidance to learn robust correlation. Specifically, correlation-guided transformer (CGT) utilizes self-supervised ViT tokens to learn robust correlation from both local and global perspectives. From the perspective of semantic categories, the class-guided module (CGM) guides the model to locate valuable correlations through the pre-trained CLIP. Finally, the embedding-guided module (EGM) implicitly guides the model to supplement the inevitable information loss during the correlation learning by the original appearance embedding and finally generates the query mask. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ have shown that CORENet exhibits excellent performance compared to existing methods.
- [114] arXiv:2405.19641 [pdf, ps, html, other]
-
Title: Reconciling Safety Measurement and Dynamic AssuranceComments: 12 pages, 4 figures, Accepted to the 43rd International Conference on Computer Safety, Reliability, and Security (SAFECOMP 2024), Florence, ItalySubjects: Software Engineering (cs.SE)
We propose a new framework to facilitate dynamic assurance within a safety case approach by associating safety performance measurement with the core assurance artifacts of a safety case. The focus is mainly on the safety architecture, whose underlying risk assessment model gives the concrete link from safety measurement to operational risk. Using an aviation domain example of autonomous taxiing, we describe our approach to derive safety indicators and revise the risk assessment based on safety measurement. We then outline a notion of consistency between a collection of safety indicators and the safety case, as a formal basis for implementing the proposed framework in our tool, AdvoCATE.
- [115] arXiv:2405.19642 [pdf, ps, other]
-
Title: Few-shot fault diagnosis based on multi-scale graph convolution filtering for industryMengjie Gan, Penglong Lian, Zhiheng Su, Jiyang Zhang, Jialong Huang, Benhao Wang, Jianxiao Zou, Shicai FanComments: 6 pages, 2 figures, 2 tables, 63rd IEEE Conference on Decision and ControlSubjects: Artificial Intelligence (cs.AI)
Industrial equipment fault diagnosis often encounter challenges such as the scarcity of fault data, complex operating conditions, and varied types of failures. Signal analysis, data statistical learning, and conventional deep learning techniques face constraints under these conditions due to their substantial data requirements and the necessity for transfer learning to accommodate new failure modes. To effectively leverage information and extract the intrinsic characteristics of faults across different domains under limited sample conditions, this paper introduces a fault diagnosis approach employing Multi-Scale Graph Convolution Filtering (MSGCF). MSGCF enhances the traditional Graph Neural Network (GNN) framework by integrating both local and global information fusion modules within the graph convolution filter block. This advancement effectively mitigates the over-smoothing issue associated with excessive layering of graph convolutional layers while preserving a broad receptive field. It also reduces the risk of overfitting in few-shot diagnosis, thereby augmenting the model's representational capacity. Experiments on the University of Paderborn bearing dataset (PU) demonstrate that the MSGCF method proposed herein surpasses alternative approaches in accuracy, thereby offering valuable insights for industrial fault diagnosis in few-shot learning scenarios.
- [116] arXiv:2405.19644 [pdf, ps, html, other]
-
Title: EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery VideosComments: Early accepted by MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Surgical phase recognition has gained significant attention due to its potential to offer solutions to numerous demands of the modern operating room. However, most existing methods concentrate on minimally invasive surgery (MIS), leaving surgical phase recognition for open surgery understudied. This discrepancy is primarily attributed to the scarcity of publicly available open surgery video datasets for surgical phase recognition. To address this issue, we introduce a new egocentric open surgery video dataset for phase recognition, named EgoSurgery-Phase. This dataset comprises 15 hours of real open surgery videos spanning 9 distinct surgical phases all captured using an egocentric camera attached to the surgeon's head. In addition to video, the EgoSurgery-Phase offers eye gaze. As far as we know, it is the first real open surgery video dataset for surgical phase recognition publicly available. Furthermore, inspired by the notable success of masked autoencoders (MAEs) in video understanding tasks (e.g., action recognition), we propose a gaze-guided masked autoencoder (GGMAE). Considering the regions where surgeons' gaze focuses are often critical for surgical phase recognition (e.g., surgical field), in our GGMAE, the gaze information acts as an empirical semantic richness prior to guiding the masking process, promoting better attention to semantically rich spatial regions. GGMAE significantly improves the previous state-of-the-art recognition method (6.4% in Jaccard) and the masked autoencoder-based method (3.1% in Jaccard) on EgoSurgery-Phase. The dataset will be released at this https URL.
- [117] arXiv:2405.19646 [pdf, ps, html, other]
-
Title: FaceLift: Semi-supervised 3D Facial Landmark LocalizationComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
3D facial landmark localization has proven to be of particular use for applications, such as face tracking, 3D face modeling, and image-based 3D face reconstruction. In the supervised learning case, such methods usually rely on 3D landmark datasets derived from 3DMM-based registration that often lack spatial definition alignment, as compared with that chosen by hand-labeled human consensus, e.g., how are eyebrow landmarks defined? This creates a gap between landmark datasets generated via high-quality 2D human labels and 3DMMs, and it ultimately limits their effectiveness. To address this issue, we introduce a novel semi-supervised learning approach that learns 3D landmarks by directly lifting (visible) hand-labeled 2D landmarks and ensures better definition alignment, without the need for 3D landmark datasets. To lift 2D landmarks to 3D, we leverage 3D-aware GANs for better multi-view consistency learning and in-the-wild multi-frame videos for robust cross-generalization. Empirical experiments demonstrate that our method not only achieves better definition alignment between 2D-3D landmarks but also outperforms other supervised learning 3D landmark localization methods on both 3DMM labeled and photogrammetric ground truth evaluation datasets. Project Page: this https URL
- [118] arXiv:2405.19647 [pdf, ps, html, other]
-
Title: FTS: A Framework to Find a Faithful TimeSieveSubjects: Machine Learning (cs.LG)
The field of time series forecasting has garnered significant attention in recent years, prompting the development of advanced models like TimeSieve, which demonstrates impressive performance. However, an analysis reveals certain unfaithfulness issues, including high sensitivity to random seeds and minute input noise perturbations. Recognizing these challenges, we embark on a quest to define the concept of \textbf{\underline{F}aithful \underline{T}ime\underline{S}ieve \underline{(FTS)}}, a model that consistently delivers reliable and robust predictions. To address these issues, we propose a novel framework aimed at identifying and rectifying unfaithfulness in TimeSieve. Our framework is designed to enhance the model's stability and resilience, ensuring that its outputs are less susceptible to the aforementioned factors. Experimentation validates the effectiveness of our proposed framework, demonstrating improved faithfulness in the model's behavior. Looking forward, we plan to expand our experimental scope to further validate and optimize our algorithm, ensuring comprehensive faithfulness across a wide range of scenarios. Ultimately, we aspire to make this framework can be applied to enhance the faithfulness of not just TimeSieve but also other state-of-the-art temporal methods, thereby contributing to the reliability and robustness of temporal modeling as a whole.
- [119] arXiv:2405.19648 [pdf, ps, html, other]
-
Title: Detecting Hallucinations in Large Language Model Generation: A Token Probability ApproachComments: ICAI'24 - The 26th Int'l Conf on Artificial IntelligenceSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Concerns regarding the propensity of Large Language Models (LLMs) to produce inaccurate outputs, also known as hallucinations, have escalated. Detecting them is vital for ensuring the reliability of applications relying on LLM-generated content. Current methods often demand substantial resources and rely on extensive LLMs or employ supervised learning with multidimensional features or intricate linguistic and semantic analyses difficult to reproduce and largely depend on using the same LLM that hallucinated. This paper introduces a supervised learning approach employing two simple classifiers utilizing only four numerical features derived from tokens and vocabulary probabilities obtained from other LLM evaluators, which are not necessarily the same. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks. Additionally, we provide a comprehensive examination of the strengths and weaknesses of our approach, highlighting the significance of the features utilized and the LLM employed as an evaluator. We have released our code publicly at this https URL.
- [120] arXiv:2405.19649 [pdf, ps, html, other]
-
Title: Towards Deeper Understanding of PPR-based Embedding Approaches: A Topological PerspectiveSubjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Node embedding learns low-dimensional vectors for nodes in the graph. Recent state-of-the-art embedding approaches take Personalized PageRank (PPR) as the proximity measure and factorize the PPR matrix or its adaptation to generate embeddings. However, little previous work analyzes what information is encoded by these approaches, and how the information correlates with their superb performance in downstream tasks. In this work, we first show that state-of-the-art embedding approaches that factorize a PPR-related matrix can be unified into a closed-form framework. Then, we study whether the embeddings generated by this strategy can be inverted to better recover the graph topology information than random-walk based embeddings. To achieve this, we propose two methods for recovering graph topology via PPR-based embeddings, including the analytical method and the optimization method. Extensive experimental results demonstrate that the embeddings generated by factorizing a PPR-related matrix maintain more topological information, such as common edges and community structures, than that generated by random walks, paving a new way to systematically comprehend why PPR-based node embedding approaches outperform random walk-based alternatives in various downstream tasks. To the best of our knowledge, this is the first work that focuses on the interpretability of PPR-based node embedding approaches.
- [121] arXiv:2405.19650 [pdf, ps, html, other]
-
Title: Few for Many: Tchebycheff Set Scalarization for Many-Objective OptimizationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Multi-objective optimization can be found in many real-world applications where some conflicting objectives can not be optimized by a single solution. Existing optimization methods often focus on finding a set of Pareto solutions with different optimal trade-offs among the objectives. However, the required number of solutions to well approximate the whole Pareto optimal set could be exponentially large with respect to the number of objectives, which makes these methods unsuitable for handling many optimization objectives. In this work, instead of finding a dense set of Pareto solutions, we propose a novel Tchebycheff set scalarization method to find a few representative solutions (e.g., 5) to cover a large number of objectives (e.g., $>100$) in a collaborative and complementary manner. In this way, each objective can be well addressed by at least one solution in the small solution set. In addition, we further develop a smooth Tchebycheff set scalarization approach for efficient optimization with good theoretical guarantees. Experimental studies on different problems with many optimization objectives demonstrate the effectiveness of our proposed method.
- [122] arXiv:2405.19652 [pdf, ps, html, other]
-
Title: Dual sparse training framework: inducing activation map sparsity via Transformed $\ell1$ regularizationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Although deep convolutional neural networks have achieved rapid development, it is challenging to widely promote and apply these models on low-power devices, due to computational and storage limitations. To address this issue, researchers have proposed techniques such as model compression, activation sparsity induction, and hardware accelerators. This paper presents a method to induce the sparsity of activation maps based on Transformed $\ell1$ regularization, so as to improve the research in the field of activation sparsity induction. Further, the method is innovatively combined with traditional pruning, constituting a dual sparse training framework. Compared to previous methods, Transformed $\ell1$ can achieve higher sparsity and better adapt to different network structures. Experimental results show that the method achieves improvements by more than 20\% in activation map sparsity on most models and corresponding datasets without compromising the accuracy. Specifically, it achieves a 27.52\% improvement for ResNet18 on the ImageNet dataset, and a 44.04\% improvement for LeNet5 on the MNIST dataset. In addition, the dual sparse training framework can greatly reduce the computational load and provide potential for reducing the required storage during runtime. Specifically, the ResNet18 and ResNet50 models obtained by the dual sparse training framework respectively reduce 81.7\% and 84.13\% of multiplicative floating-point operations, while maintaining accuracy and a low pruning rate.
- [123] arXiv:2405.19653 [pdf, ps, html, other]
-
Title: SysCaps: Language Interfaces for Simulation Surrogates of Complex SystemsComments: 17 pages. Under reviewSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Systems and Control (eess.SY)
Data-driven simulation surrogates help computational scientists study complex systems. They can also help inform impactful policy decisions. We introduce a learning framework for surrogate modeling where language is used to interface with the underlying system being simulated. We call a language description of a system a "system caption", or SysCap. To address the lack of datasets of paired natural language SysCaps and simulation runs, we use large language models (LLMs) to synthesize high-quality captions. Using our framework, we train multimodal text and timeseries regression models for two real-world simulators of complex energy systems. Our experiments demonstrate the feasibility of designing language interfaces for real-world surrogate models at comparable accuracy to standard baselines. We qualitatively and quantitatively show that SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible. We will release the generated SysCaps datasets and our code to support follow-on studies.
- [124] arXiv:2405.19654 [pdf, ps, html, other]
-
Title: Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-trainingComments: Accepted at ICML 2024Subjects: Artificial Intelligence (cs.AI)
Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at this https URL.
- [125] arXiv:2405.19656 [pdf, ps, html, other]
-
Title: Accurate and Reliable Predictions with Mutual-Transport EnsembleSubjects: Artificial Intelligence (cs.AI)
Deep Neural Networks (DNNs) have achieved remarkable success in a variety of tasks, especially when it comes to prediction accuracy. However, in complex real-world scenarios, particularly in safety-critical applications, high accuracy alone is not enough. Reliable uncertainty estimates are crucial. Modern DNNs, often trained with cross-entropy loss, tend to be overconfident, especially with ambiguous samples. To improve uncertainty calibration, many techniques have been developed, but they often compromise prediction accuracy. To tackle this challenge, we propose the ``mutual-transport ensemble'' (MTE). This approach introduces a co-trained auxiliary model and adaptively regularizes the cross-entropy loss using Kullback-Leibler (KL) divergence between the prediction distributions of the primary and auxiliary models. We conducted extensive studies on various benchmarks to validate the effectiveness of our method. The results show that MTE can simultaneously enhance both accuracy and uncertainty calibration. For example, on the CIFAR-100 dataset, our MTE method on ResNet34/50 achieved significant improvements compared to previous state-of-the-art method, with absolute accuracy increases of 2.4%/3.7%, relative reductions in ECE of $42.3%/29.4%, and relative reductions in classwise-ECE of 11.6%/15.3%.
- [126] arXiv:2405.19657 [pdf, ps, html, other]
-
Title: Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D GaussianComments: 10pagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
3D Gaussian splatting has demonstrated impressive performance in real-time novel view synthesis. However, achieving successful reconstruction from RGB images generally requires multiple input views captured under static conditions. To address the challenge of sparse input views, previous approaches have incorporated depth supervision into the training of 3D Gaussians to mitigate overfitting, using dense predictions from pretrained depth networks as pseudo-ground truth. Nevertheless, depth predictions from monocular depth estimation models inherently exhibit significant uncertainty in specific areas. Relying solely on pixel-wise L2 loss may inadvertently incorporate detrimental noise from these uncertain areas. In this work, we introduce a novel method to supervise the depth distribution of 3D Gaussians, utilizing depth priors with integrated uncertainty estimates. To address these localized errors in depth predictions, we integrate a patch-wise optimal transport strategy to complement traditional L2 loss in depth supervision. Extensive experiments conducted on the LLFF, DTU, and Blender datasets demonstrate that our approach, UGOT, achieves superior novel view synthesis and consistently outperforms state-of-the-art methods.
- [127] arXiv:2405.19659 [pdf, ps, html, other]
-
Title: CSANet: Channel Spatial Attention Network for Robust 3D Face Alignment and ReconstructionComments: 10 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Our project proposes an end-to-end 3D face alignment and reconstruction network. The backbone of our model is built by Bottle-Neck structure via Depth-wise Separable Convolution. We integrate Coordinate Attention mechanism and Spatial Group-wise Enhancement to extract more representative features. For more stable training process and better convergence, we jointly use Wing loss and the Weighted Parameter Distance Cost to learn parameters for 3D Morphable model and 3D vertices. Our proposed model outperforms all baseline models both quantitatively and qualitatively.
- [128] arXiv:2405.19660 [pdf, ps, html, other]
-
Title: PATIENT-{\Psi}: Using Large Language Models to Simulate Patients for Training Mental Health ProfessionalsRuiyi Wang, Stephanie Milani, Jamie C. Chiu, Shaun M. Eack, Travis Labrum, Samuel M. Murphy, Nev Jones, Kate Hardy, Hong Shen, Fei Fang, Zhiyu Zoey ChenComments: Work in progressSubjects: Computation and Language (cs.CL)
Mental illness remains one of the most critical public health issues, with a significant gap between the available mental health support and patient needs. Many mental health professionals highlight a disconnect between their training and real-world patient interactions, leaving some trainees feeling unprepared and potentially affecting their early career success. In this paper, we propose PATIENT-{\Psi}, a novel patient simulation framework for cognitive behavior therapy (CBT) training. To build PATIENT-{\Psi}, we constructed diverse patient profiles and their corresponding cognitive models based on CBT principles, and then used large language models (LLMs) programmed with the patient cognitive models to act as a simulated therapy patient. We propose an interactive training scheme, PATIENT-{\Psi}-TRAINER, for mental health trainees to practice a key skill in CBT -- formulating the cognitive model of the patient -- through role-playing a therapy session with PATIENT-{\Psi}. To evaluate PATIENT-{\Psi}, we conducted a user study of 4 mental health trainees and 10 experts. The results demonstrate that practice using PATIENT-{\Psi}-TRAINER greatly enhances the perceived skill acquisition and confidence of the trainees beyond existing forms of training such as textbooks, videos, and role-play with non-patients. Based on the experts' perceptions, PATIENT-{\Psi} is perceived to be closer to real patient interactions than GPT-4, and PATIENT-{\Psi}-TRAINER holds strong promise to improve trainee competencies. Our pioneering patient simulation training framework, using LLMs, holds great potential to enhance and advance mental health training, ultimately leading to improved patient care and outcomes. We will release all our data, code, and the training platform.
- [129] arXiv:2405.19661 [pdf, ps, html, other]
-
Title: MGCP: A Multi-Grained Correlation based Prediction Network for Multivariate Time SeriesSubjects: Machine Learning (cs.LG)
Multivariate time series prediction is widely used in daily life, which poses significant challenges due to the complex correlations that exist at multi-grained levels. Unfortunately, the majority of current time series prediction models fail to simultaneously learn the correlations of multivariate time series at multi-grained levels, resulting in suboptimal performance. To address this, we propose a Multi-Grained Correlations-based Prediction (MGCP) Network, which simultaneously considers the correlations at three granularity levels to enhance prediction performance. Specifically, MGCP utilizes Adaptive Fourier Neural Operators and Graph Convolutional Networks to learn the global spatiotemporal correlations and inter-series correlations, enabling the extraction of potential features from multivariate time series at fine-grained and medium-grained levels. Additionally, MGCP employs adversarial training with an attention mechanism-based predictor and conditional discriminator to optimize prediction results at coarse-grained level, ensuring high fidelity between the generated forecast results and the actual data distribution. Finally, we compare MGCP with several state-of-the-art time series prediction algorithms on real-world benchmark datasets, and our results demonstrate the generality and effectiveness of the proposed model.
- [130] arXiv:2405.19665 [pdf, ps, other]
-
Title: A novel fault localization with data refinement for hydroelectric unitsJialong Huang, Junlin Song, Penglong Lian, Mengjie Gan, Zhiheng Su, Benhao Wang, Wenji Zhu, Xiaomin Pu, Jianxiao Zou, Shicai FanComments: 6pages,4 figures,Conference on Decision and Control(CDC) conferenceSubjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Due to the scarcity of fault samples and the complexity of non-linear and non-smooth characteristics data in hydroelectric units, most of the traditional hydroelectric unit fault localization methods are difficult to carry out accurate localization. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)- manifold-boosted deep learning (SG-WMBDL) based fault localization method for hydroelectric units is proposed. To overcome the data scarcity, a SAE is embedded into the GAN to generate more high-quality samples in the data generation module. Considering the signals involving non-linear and non-smooth characteristics, the improved WNR which combining both soft and hard thresholding and local linear embedding (LLE) are utilized to the data preprocessing module in order to reduce the noise and effectively capture the local features. In addition, to seek higher performance, the novel Adaptive Boost (AdaBoost) combined with multi deep learning is proposed to achieve accurate fault localization. The experimental results show that the SG-WMBDL can locate faults for hydroelectric units under a small number of fault samples with non-linear and non-smooth characteristics on higher precision and accuracy compared to other frontier methods, which verifies the effectiveness and practicality of the proposed method.
- [131] arXiv:2405.19667 [pdf, ps, html, other]
-
Title: Reconciling Model Multiplicity for Downstream Decision MakingComments: 16 pages main body, 6 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We consider the problem of model multiplicity in downstream decision-making, a setting where two predictive models of equivalent accuracy cannot agree on the best-response action for a downstream loss function. We show that even when the two predictive models approximately agree on their individual predictions almost everywhere, it is still possible for their induced best-response actions to differ on a substantial portion of the population. We address this issue by proposing a framework that calibrates the predictive models with regard to both the downstream decision-making problem and the individual probability prediction. Specifically, leveraging tools from multi-calibration, we provide an algorithm that, at each time-step, first reconciles the differences in individual probability prediction, then calibrates the updated models such that they are indistinguishable from the true probability distribution to the decision-maker. We extend our results to the setting where one does not have direct access to the true probability distribution and instead relies on a set of i.i.d data to be the empirical distribution. Finally, we provide a set of experiments to empirically evaluate our methods: compared to existing work, our proposed algorithm creates a pair of predictive models with both improved downstream decision-making losses and agrees on their best-response actions almost everywhere.
- [132] arXiv:2405.19668 [pdf, ps, other]
-
Title: AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided OptimizationComments: Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV)
Despite the widespread application of large language models (LLMs) across various tasks, recent studies indicate that they are susceptible to jailbreak attacks, which can render their defense mechanisms ineffective. However, previous jailbreak research has frequently been constrained by limited universality, suboptimal efficiency, and a reliance on manual crafting. In response, we rethink the approach to jailbreaking LLMs and formally define three essential properties from the attacker' s perspective, which contributes to guiding the design of jailbreak methods. We further introduce AutoBreach, a novel method for jailbreaking LLMs that requires only black-box access. Inspired by the versatility of wordplay, AutoBreach employs a wordplay-guided mapping rule sampling strategy to generate a variety of universal mapping rules for creating adversarial prompts. This generation process leverages LLMs' automatic summarization and reasoning capabilities, thus alleviating the manual burden. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Additionally, we propose a two-stage mapping rule optimization strategy that initially optimizes mapping rules before querying target LLMs to enhance the efficiency of AutoBreach. AutoBreach can efficiently identify security vulnerabilities across various LLMs, including three proprietary models: Claude-3, GPT-3.5, GPT-4 Turbo, and two LLMs' web platforms: Bingchat, GPT-4 Web, achieving an average success rate of over 80% with fewer than 10 queries
- [133] arXiv:2405.19669 [pdf, ps, html, other]
-
Title: Texture-guided Coding for Deep FeaturesSubjects: Computer Vision and Pattern Recognition (cs.CV)
With the rapid development of machine vision technology in recent years, many researchers have begun to focus on feature compression that is better suited for machine vision tasks. The target of feature compression is deep features, which arise from convolution in the middle layer of a pre-trained convolutional neural network. However, due to the large volume of data and high level of abstraction of deep features, their application is primarily limited to machine-centric scenarios, which poses significant constraints in situations requiring human-computer interaction. This paper investigates features and textures and proposes a texture-guided feature compression strategy based on their characteristics. Specifically, the strategy comprises feature layers and texture layers. The feature layers serve the machine, including a feature selection module and a feature reconstruction network. With the assistance of texture images, they selectively compress and transmit channels relevant to visual tasks, reducing feature data while providing high-quality features for the machine. The texture layers primarily serve humans and consist of an image reconstruction network. This image reconstruction network leverages features and texture images to reconstruct preview images for humans. Our method fully exploits the characteristics of texture and features. It eliminates feature redundancy, reconstructs high-quality preview images for humans, and supports decision-making. The experimental results demonstrate excellent performance when employing our proposed method to compress the deep features.
- [134] arXiv:2405.19670 [pdf, ps, html, other]
-
Title: One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language ModelsComments: working in progress, repo: this https URLSubjects: Computation and Language (cs.CL)
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune the LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capacities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across nine question-answering tasks demonstrate the superiority of our approach.
- [135] arXiv:2405.19671 [pdf, ps, html, other]
-
Title: GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene ReconstructionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, 3D Gaussian Splatting(3DGS) has revolutionized neural rendering with its high-quality rendering and real-time speed. However, when it comes to indoor scenes with a significant number of textureless areas, 3DGS yields incomplete and noisy reconstruction results due to the poor initialization of the point cloud and under-constrained optimization. Inspired by the continuity of signed distance field (SDF), which naturally has advantages in modeling surfaces, we present a unified optimizing framework integrating neural SDF with 3DGS. This framework incorporates a learnable neural SDF field to guide the densification and pruning of Gaussians, enabling Gaussians to accurately model scenes even with poor initialized point clouds. At the same time, the geometry represented by Gaussians improves the efficiency of the SDF field by piloting its point sampling. Additionally, we regularize the optimization with normal and edge priors to eliminate geometry ambiguity in textureless areas and improve the details. Extensive experiments in ScanNet and ScanNet++ show that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis.
- [136] arXiv:2405.19673 [pdf, ps, html, other]
-
Title: Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion ModelsMasatoshi Uehara, Yulai Zhao, Ehsan Hajiramezanali, Gabriele Scalia, Gökcen Eraslan, Avantika Lal, Sergey Levine, Tommaso BiancalaniComments: Under reviewSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
AI-driven design problems, such as DNA/protein sequence design, are commonly tackled from two angles: generative modeling, which efficiently captures the feasible design space (e.g., natural images or biological sequences), and model-based optimization, which utilizes reward models for extrapolation. To combine the strengths of both approaches, we adopt a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL. Although prior work has explored similar avenues, they primarily focus on scenarios where accurate reward models are accessible. In contrast, we concentrate on an offline setting where a reward model is unknown, and we must learn from static offline datasets, a common scenario in scientific domains. In offline scenarios, existing approaches tend to suffer from overoptimization, as they may be misled by the reward model in out-of-distribution regions. To address this, we introduce a conservative fine-tuning approach, BRAID, by optimizing a conservative reward model, which includes additional penalization outside of offline data distributions. Through empirical and theoretical analysis, we demonstrate the capability of our approach to outperform the best designs in offline data, leveraging the extrapolation capabilities of reward models while avoiding the generation of invalid designs through pre-trained diffusion models.
- [137] arXiv:2405.19675 [pdf, ps, html, other]
-
Title: Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents TrainingAisha Urooj Khan, John Garrett, Tyler Bradshaw, Lonie Salkowski, Jiwoong Jason Jeong, Amara Tariq, Imon BanerjeeSubjects: Computer Vision and Pattern Recognition (cs.CV)
A visual-language model (VLM) pre-trained on natural images and text pairs poses a significant barrier when applied to medical contexts due to domain shift. Yet, adapting or fine-tuning these VLMs for medical use presents considerable hurdles, including domain misalignment, limited access to extensive datasets, and high-class imbalances. Hence, there is a pressing need for strategies to effectively adapt these VLMs to the medical domain, as such adaptations would prove immensely valuable in healthcare applications. In this study, we propose a framework designed to adeptly tailor VLMs to the medical domain, employing selective sampling and hard-negative mining techniques for enhanced performance in retrieval tasks. We validate the efficacy of our proposed approach by implementing it across two distinct VLMs: the in-domain VLM (MedCLIP) and out-of-domain VLMs (ALBEF). We assess the performance of these models both in their original off-the-shelf state and after undergoing our proposed training strategies, using two extensive datasets containing mammograms and their corresponding reports. Our evaluation spans zero-shot, few-shot, and supervised scenarios. Through our approach, we observe a notable enhancement in Recall@K performance for the image-text retrieval task.
- [138] arXiv:2405.19677 [pdf, ps, other]
-
Title: Large Language Model Watermark Stealing With Mixed Integer ProgrammingZhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang, Leo Yu Zhang, Chao Chen, Shengshan Hu, Asif Gill, Shirui PanComments: 12 pagesSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
The Large Language Model (LLM) watermark is a newly emerging technique that shows promise in addressing concerns surrounding LLM copyright, monitoring AI-generated text, and preventing its misuse. The LLM watermark scheme commonly includes generating secret keys to partition the vocabulary into green and red lists, applying a perturbation to the logits of tokens in the green list to increase their sampling likelihood, thus facilitating watermark detection to identify AI-generated text if the proportion of green tokens exceeds a threshold. However, recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks, such as token editing, synonym substitution, and paraphrasing, with robustness declining as the number of keys increases. Therefore, the state-of-the-art watermark schemes that employ fewer or single keys have been demonstrated to be more robust against text editing and paraphrasing. In this paper, we propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme and systematically examine its vulnerability to this attack. We formalize the attack as a mixed integer programming problem with constraints. We evaluate our attack under a comprehensive threat model, including an extreme scenario where the attacker has no prior knowledge, lacks access to the watermark detector API, and possesses no information about the LLM's parameter settings or watermark injection/detection scheme. Extensive experiments on LLMs, such as OPT and LLaMA, demonstrate that our attack can successfully steal the green list and remove the watermark across all settings.
- [139] arXiv:2405.19678 [pdf, ps, html, other]
-
Title: View-Consistent Hierarchical 3D SegmentationUsing Ultrametric Feature FieldsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of ``coarse" or ``fine" granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency. We additionally provide qualitative examples of our model's 3D hierarchical segmentations in real world scenes.\footnote{The code and dataset are available at:
- [140] arXiv:2405.19679 [pdf, ps, html, other]
-
Title: Efficient Trajectory Inference in Wasserstein Space Using Consecutive AveragingSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Capturing data from dynamic processes through cross-sectional measurements is seen in many fields such as computational biology. Trajectory inference deals with the challenge of reconstructing continuous processes from such observations. In this work, we propose methods for B-spline approximation and interpolation of point clouds through consecutive averaging that is instrinsic to the Wasserstein space. Combining subdivision schemes with optimal transport-based geodesic, our methods carry out trajectory inference at a chosen level of precision and smoothness, and can automatically handle scenarios where particles undergo division over time. We rigorously evaluate our method by providing convergence guarantees and testing it on simulated cell data characterized by bifurcations and merges, comparing its performance against state-of-the-art trajectory inference and interpolation methods. The results not only underscore the effectiveness of our method in inferring trajectories, but also highlight the benefit of performing interpolation and approximation that respect the inherent geometric properties of the data.
- [141] arXiv:2405.19682 [pdf, ps, html, other]
-
Title: Fully Test-Time Adaptation for Monocular 3D Object DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Monocular 3D object detection (Mono 3Det) aims to identify 3D objects from a single RGB image. However, existing methods often assume training and test data follow the same distribution, which may not hold in real-world test scenarios. To address the out-of-distribution (OOD) problems, we explore a new adaptation paradigm for Mono 3Det, termed Fully Test-time Adaptation. It aims to adapt a well-trained model to unlabeled test data by handling potential data distribution shifts at test time without access to training data and test labels. However, applying this paradigm in Mono 3Det poses significant challenges due to OOD test data causing a remarkable decline in object detection scores. This decline conflicts with the pre-defined score thresholds of existing detection methods, leading to severe object omissions (i.e., rare positive detections and many false negatives). Consequently, the limited positive detection and plenty of noisy predictions cause test-time adaptation to fail in Mono 3Det. To handle this problem, we propose a novel Monocular Test-Time Adaptation (MonoTTA) method, based on two new strategies. 1) Reliability-driven adaptation: we empirically find that high-score objects are still reliable and the optimization of high-score objects can enhance confidence across all detections. Thus, we devise a self-adaptive strategy to identify reliable objects for model adaptation, which discovers potential objects and alleviates omissions. 2) Noise-guard adaptation: since high-score objects may be scarce, we develop a negative regularization term to exploit the numerous low-score objects via negative learning, preventing overfitting to noise and trivial solutions. Experimental results show that MonoTTA brings significant performance gains for Mono 3Det models in OOD test scenarios, approximately 190% gains by average on KITTI and 198% gains on nuScenes.
- [142] arXiv:2405.19683 [pdf, ps, html, other]
-
Title: Breaking Indistinguishability with Transfer Learning: A First Look at SPECK32/64 Lightweight Block CiphersSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
In this research, we introduce MIND-Crypt, a novel attack framework that uses deep learning (DL) and transfer learning (TL) to challenge the indistinguishability of block ciphers, specifically SPECK32/64 encryption algorithm in CBC mode (Cipher Block Chaining) against Known Plaintext Attacks (KPA). Our methodology includes training a DL model with ciphertexts of two messages encrypted using the same key. The selected messages have the same byte-length and differ by only one bit at the binary level. This DL model employs a residual network architecture. For the TL, we use the trained DL model as a feature extractor, and these features are then used to train a shallow machine learning, such as XGBoost. This dual strategy aims to distinguish ciphertexts of two encrypted messages, addressing traditional cryptanalysis challenges.
Our findings demonstrate that the DL model achieves an accuracy of approximately 99% under consistent cryptographic conditions (Same Key or Rounds) with the SPECK32/64 cipher. However, performance degrades to random guessing levels (50%) when tested with ciphertext generated from different keys or different encryption rounds of SPECK32/64. To enhance the results, the DL model requires retraining with different keys or encryption rounds using larger datasets (10^7 samples). To overcome this limitation, we implement TL, achieving an accuracy of about 53% with just 10,000 samples, which is better than random guessing. Further training with 580,000 samples increases accuracy to nearly 99%, showing a substantial reduction in data requirements by over 94%. This shows that an attacker can utilize machine learning models to break indistinguishability by accessing pairs of plaintexts and their corresponding ciphertexts encrypted with the same key, without directly interacting with the communicating parties. - [143] arXiv:2405.19684 [pdf, ps, html, other]
-
Title: A Comprehensive Survey on Underwater Image Enhancement Based on Deep LearningComments: A survey on the underwater image enhancement taskSubjects: Computer Vision and Pattern Recognition (cs.CV)
Underwater image enhancement (UIE) is a challenging research task in the field of computer vision. Although hundreds of UIE algorithms have been proposed, a comprehensive and systematic review is still lacking. To promote future research, we summarize the UIE task from multiple perspectives. First, the physical models, data construction processes, evaluation metrics, and loss functions are introduced. Second, according to the contributions brought by different literatures, recent proposed algorithms are discussed and classified from six perspectives, namely network architecture, learning strategy, learning stage, assistance task, domain perspective and disentanglement fusion, respectively. Third, considering the inconsistencies in experimental settings in different literatures, a comprehensive and fair comparison does not yet exist. To this end, we quantitatively and qualitatively evaluate state-of-the-art algorithms on multiple benchmark datasets. Finally, issues worthy of further research in the UIE task are raised. A collection of useful materials is available at this https URL.
- [144] arXiv:2405.19686 [pdf, ps, html, other]
-
Title: Knowledge Graph Tuning: Real-time Large Language Model Personalization based on Human FeedbackSubjects: Artificial Intelligence (cs.AI)
Large language models (LLMs) have demonstrated remarkable proficiency in a range of natural language processing tasks. Once deployed, LLMs encounter users with personalized factual knowledge, and such personalized knowledge is consistently reflected through users' interactions with the LLMs. To enhance user experience, real-time model personalization is essential, allowing LLMs to adapt user-specific knowledge based on user feedback during human-LLM interactions. Existing methods mostly require back-propagation to finetune the model parameters, which incurs high computational and memory costs. In addition, these methods suffer from low interpretability, which will cause unforeseen impacts on model performance during long-term use, where the user's personalized knowledge is accumulated this http URL address these challenges, we propose Knowledge Graph Tuning (KGT), a novel approach that leverages knowledge graphs (KGs) to personalize LLMs. KGT extracts personalized factual knowledge triples from users' queries and feedback and optimizes KGs without modifying the LLM parameters. Our method improves computational and memory efficiency by avoiding back-propagation and ensures interpretability by making the KG adjustments comprehensible to humans.Experiments with state-of-the-art LLMs, including GPT-2, Llama2, and Llama3, show that KGT significantly improves personalization performance while reducing latency and GPU memory costs. Ultimately, KGT offers a promising solution of effective, efficient, and interpretable real-time LLM personalization during user interactions with the LLMs.
- [145] arXiv:2405.19687 [pdf, ps, html, other]
-
Title: Autonomous Driving with Spiking Neural NetworksSubjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Autonomous driving demands an integrated approach that encompasses perception, prediction, and planning, all while operating under strict energy constraints to enhance scalability and environmental sustainability. We present Spiking Autonomous Driving (\name{}), the first unified Spiking Neural Network (SNN) to address the energy challenges faced by autonomous driving systems through its event-driven and energy-efficient nature. SAD is trained end-to-end and consists of three main modules: perception, which processes inputs from multi-view cameras to construct a spatiotemporal bird's eye view; prediction, which utilizes a novel dual-pathway with spiking neurons to forecast future states; and planning, which generates safe trajectories considering predicted occupancy, traffic rules, and ride comfort. Evaluated on the nuScenes dataset, SAD achieves competitive performance in perception, prediction, and planning tasks, while drawing upon the energy efficiency of SNNs. This work highlights the potential of neuromorphic computing to be applied to energy-efficient autonomous driving, a critical step toward sustainable and safety-critical automotive technology. Our code is available at \url{this https URL}.
- [146] arXiv:2405.19688 [pdf, ps, html, other]
-
Title: DNPM: A Neural Parametric Model for the Synthesis of Facial Geometric DetailsHaitao Cao, Baoping Cheng, Qiran Pu, Haocheng Zhang, Bin Luo, Yixiang Zhuang, Juncong Lin, Liyan Chen, Xuan ChengSubjects: Computer Vision and Pattern Recognition (cs.CV)
Parametric 3D models have enabled a wide variety of computer vision and graphics tasks, such as modeling human faces, bodies and hands. In 3D face modeling, 3DMM is the most widely used parametric model, but can't generate fine geometric details solely from identity and expression inputs. To tackle this limitation, we propose a neural parametric model named DNPM for the facial geometric details, which utilizes deep neural network to extract latent codes from facial displacement maps encoding details and wrinkles. Built upon DNPM, a novel 3DMM named Detailed3DMM is proposed, which augments traditional 3DMMs by including the synthesis of facial details only from the identity and expression inputs. Moreover, we show that DNPM and Detailed3DMM can facilitate two downstream applications: speech-driven detailed 3D facial animation and 3D face reconstruction from a degraded image. Extensive experiments have shown the usefulness of DNPM and Detailed3DMM, and the progressiveness of two proposed applications.
- [147] arXiv:2405.19689 [pdf, ps, html, other]
-
Title: Uncertainty-aware sign language video retrieval with probability distribution modelingSubjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods achieve the mapping between sign language video and text through fine-grained modal alignment. However, due to the scarcity of fine-grained annotation, the uncertainty inherent in sign language video is underestimated, limiting the further development of sign language retrieval tasks. To address this challenge, we propose a novel Uncertainty-aware Probability Distribution Retrieval (UPRet), that conceptualizes the mapping process of sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).
- [148] arXiv:2405.19690 [pdf, ps, html, other]
-
Title: Diffusion Policies creating a Trust Region for Offline Reinforcement LearningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Offline reinforcement learning (RL) leverages pre-collected datasets to train optimal policies. Diffusion Q-Learning (DQL), introducing diffusion models as a powerful and expressive policy class, significantly boosts the performance of offline RL. However, its reliance on iterative denoising sampling to generate actions slows down both training and inference. While several recent attempts have tried to accelerate diffusion-QL, the improvement in training and/or inference speed often results in degraded performance. In this paper, we introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy. We bridge the two polices by a newly introduced diffusion trust region loss. The diffusion policy maintains expressiveness, while the trust region loss directs the one-step policy to explore freely and seek modes within the region defined by the diffusion policy. DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient. We evaluate its effectiveness and algorithmic characteristics against popular Kullback-Leibler (KL) based distillation methods in 2D bandit scenarios and gym tasks. We then show that DTQL could not only outperform other methods on the majority of the D4RL benchmark tasks but also demonstrate efficiency in training and inference speeds. The PyTorch implementation will be made available.
- [149] arXiv:2405.19691 [pdf, ps, html, other]
-
Title: Designing Prompt Analytics Dashboards to Analyze Student-ChatGPT Interactions in EFL WritingMinsun Kim, SeonGyeom Kim, Suyoun Lee, Yoosang Yoon, Junho Myung, Haneul Yoo, Hyungseung Lim, Jieun Han, Yoonsu Kim, So-Yeon Ahn, Juho Kim, Alice Oh, Hwajung Hong, Tak Yeon LeeSubjects: Human-Computer Interaction (cs.HC)
While ChatGPT has significantly impacted education by offering personalized resources for students, its integration into educational settings poses unprecedented risks, such as inaccuracies and biases in AI-generated content, plagiarism and over-reliance on AI, and privacy and security issues. To help teachers address such risks, we conducted a two-phase iterative design process that comprises surveys, interviews, and prototype demonstration involving six EFL (English as a Foreign Language) teachers, who integrated ChatGPT into semester-long English essay writing classes. Based on the needs identified during the initial survey and interviews, we developed a prototype of Prompt Analytics Dashboard (PAD) that integrates the essay editing history and chat logs between students and ChatGPT. Teacher's feedback on the prototype informs additional features and unmet needs for designing future PAD, which helps them (1) analyze contextual analysis of student behaviors, (2) design an overall learning loop, and (3) develop their teaching skills.
- [150] arXiv:2405.19694 [pdf, ps, html, other]
-
Title: Grade Like a Human: Rethinking Automated Assessment with Large Language ModelsSubjects: Artificial Intelligence (cs.AI)
While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading~process.
In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs. - [151] arXiv:2405.19695 [pdf, ps, html, other]
-
Title: Distribution Aligned Semantics Adaption for Lifelong Person Re-IdentificationSubjects: Computer Vision and Pattern Recognition (cs.CV)
In real-world scenarios, person Re-IDentification (Re-ID) systems need to be adaptable to changes in space and time. Therefore, the adaptation of Re-ID models to new domains while preserving previously acquired knowledge is crucial, known as Lifelong person Re-IDentification (LReID). Advanced LReID methods rely on replaying exemplars from old domains and applying knowledge distillation in logits with old models. However, due to privacy concerns, retaining previous data is inappropriate. Additionally, the fine-grained and open-set characteristics of Re-ID limit the effectiveness of the distillation paradigm for accumulating knowledge. We argue that a Re-ID model trained on diverse and challenging pedestrian images at a large scale can acquire robust and general human semantic knowledge. These semantics can be readily utilized as shared knowledge for lifelong applications. In this paper, we identify the challenges and discrepancies associated with adapting a pre-trained model to each application domain, and introduce the Distribution Aligned Semantics Adaption (DASA) framework. It efficiently adjusts Batch Normalization (BN) to mitigate interference from data distribution discrepancy and freezes the pre-trained convolutional layers to preserve shared knowledge. Additionally, we propose the lightweight Semantics Adaption (SA) module, which effectively adapts learned semantics to enhance pedestrian representations. Extensive experiments demonstrate the remarkable superiority of our proposed framework over advanced LReID methods, and it exhibits significantly reduced storage consumption. DASA presents a novel and cost-effective perspective on effectively adapting pre-trained models for LReID.
- [152] arXiv:2405.19699 [pdf, ps, html, other]
-
Title: Fairness in AI-Driven Recruitment: Challenges, Metrics, Methods, and Future DirectionsComments: Submitted to IEEE Transactions on Technology and SocietySubjects: Computers and Society (cs.CY)
The recruitment process is crucial to an organization's ability to position itself for success, from finding qualified and well-fitting job candidates to impacting its output and culture. Therefore, over the past century, human resources experts and industrial-organizational psychologists have established hiring practices such as attracting candidates with job ads, gauging a candidate's skills with assessments, and using interview questions to assess organizational fit. However, the advent of big data and machine learning has led to a rapid transformation in the traditional recruitment process as many organizations have moved to using artificial intelligence (AI). Given the prevalence of AI-based recruitment, there is growing concern that human biases may carry over to decisions made by these systems, which can amplify the effect through systematic application. Empirical studies have identified prevalent biases in candidate ranking software and chatbot interactions, catalyzing a growing body of research dedicated to AI fairness over the last decade. This paper provides a comprehensive overview of this emerging field by discussing the types of biases encountered in AI-driven recruitment, exploring various fairness metrics and mitigation methods, and examining tools for auditing these systems. We highlight current challenges and outline future directions for developing fair AI recruitment applications, ensuring equitable candidate treatment and enhancing organizational outcomes.
- [153] arXiv:2405.19701 [pdf, ps, html, other]
-
Title: Significance of Chain of Thought in Gender Bias Mitigation for English-Dravidian Machine TranslationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Gender bias in machine translation (MT) systems poses a significant challenge to achieving accurate and inclusive translations. This paper examines gender bias in machine translation systems for languages such as Telugu and Kannada from the Dravidian family, analyzing how gender inflections affect translation accuracy and neutrality using Google Translate and ChatGPT. It finds that while plural forms can reduce bias, individual-centric sentences often maintain the bias due to historical stereotypes. The study evaluates the Chain of Thought processing, noting significant bias mitigation from 80% to 4% in Telugu and from 40% to 0% in Kannada. It also compares Telugu and Kannada translations, emphasizing the need for language specific strategies to address these challenges and suggesting directions for future research to enhance fairness in both data preparation and prompts during inference.
- [154] arXiv:2405.19703 [pdf, ps, html, other]
-
Title: Towards a Better Evaluation of Out-of-Domain GeneralizationSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
The objective of Domain Generalization (DG) is to devise algorithms and models capable of achieving high performance on previously unseen test distributions. In the pursuit of this objective, average measure has been employed as the prevalent measure for evaluating models and comparing algorithms in the existing DG studies. Despite its significance, a comprehensive exploration of the average measure has been lacking and its suitability in approximating the true domain generalization performance has been questionable. In this study, we carefully investigate the limitations inherent in the average measure and propose worst+gap measure as a robust alternative. We establish theoretical grounds of the proposed measure by deriving two theorems starting from two different assumptions. We conduct extensive experimental investigations to compare the proposed worst+gap measure with the conventional average measure. Given the indispensable need to access the true DG performance for studying measures, we modify five existing datasets to come up with SR-CMNIST, C-Cats&Dogs, L-CIFAR10, PACS-corrupted, and VLCS-corrupted datasets. The experiment results unveil an inferior performance of the average measure in approximating the true DG performance and confirm the robustness of the theoretically supported worst+gap measure.
- [155] arXiv:2405.19705 [pdf, ps, html, other]
-
Title: Universal Online Convex Optimization with $1$ Projection per RoundSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
To address the uncertainty in function types, recent progress in online convex optimization (OCO) has spurred the development of universal algorithms that simultaneously attain minimax rates for multiple types of convex functions. However, for a $T$-round online problem, state-of-the-art methods typically conduct $O(\log T)$ projections onto the domain in each round, a process potentially time-consuming with complicated feasible sets. In this paper, inspired by the black-box reduction of Cutkosky and Orabona (2018), we employ a surrogate loss defined over simpler domains to develop universal OCO algorithms that only require $1$ projection. Embracing the framework of prediction with expert advice, we maintain a set of experts for each type of functions and aggregate their predictions via a meta-algorithm. The crux of our approach lies in a uniquely designed expert-loss for strongly convex functions, stemming from an innovative decomposition of the regret into the meta-regret and the expert-regret. Our analysis sheds new light on the surrogate loss, facilitating a rigorous examination of the discrepancy between the regret of the original loss and that of the surrogate loss, and carefully controlling meta-regret under the strong convexity condition. In this way, with only $1$ projection per round, we establish optimal regret bounds for general convex, exponentially concave, and strongly convex functions simultaneously. Furthermore, we enhance the expert-loss to exploit the smoothness property, and demonstrate that our algorithm can attain small-loss regret for multiple types of convex and smooth functions.
- [156] arXiv:2405.19706 [pdf, ps, html, other]
-
Title: Bridging eResearch Infrastructure and Experimental Materials Science Process in the Quantum Data HubSubjects: Software Engineering (cs.SE); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
Experimental materials science is experiencing significant growth due to automated experimentation and AI techniques. Integrated autonomous platforms are emerging, combining generative models, robotics, simulations, and automated systems for material synthesis. However, two major challenges remain: democratizing access to these technologies and creating accessible infrastructure for under-resourced scientists. This paper introduces the Quantum Data Hub (QDH), a community-accessible research infrastructure aimed at researchers working with quantum materials. QDH integrates with the National Data Platform, adhering to FAIR principles while proposing additional UNIT principles for usability, navigability, interpretability, and timeliness. The QDH facilitates collaboration and extensibility, allowing seamless integration of new researchers, instruments, and data into the system.
- [157] arXiv:2405.19707 [pdf, ps, html, other]
-
Title: DeMamba: AI-Generated Video Detection on Million-Scale GenVideo BenchmarkHaoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, Huaxiong LiSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing between fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, including over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conducted extensive studies of the dataset and proposed two evaluation methods tailored for real-world-like scenarios to assess the detectors' performance: the cross-generator video classification task assesses the generalizability of trained detectors on generators; the degraded video classification task evaluates the robustness of detectors to handle videos that have degraded in quality during dissemination. Moreover, we introduced a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance the detectors by identifying AI-generated videos through the analysis of inconsistencies in temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba's superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be aviliable at \url{this https URL}.
- [158] arXiv:2405.19708 [pdf, ps, html, other]
-
Title: Text Guided Image Editing with Automatic Concept Locating and ForgettingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
With the advancement of image-to-image diffusion models guided by text, significant progress has been made in image editing. However, a persistent challenge remains in seamlessly incorporating objects into images based on textual instructions, without relying on extra user-provided guidance. Text and images are inherently distinct modalities, bringing out difficulties in fully capturing the semantic intent conveyed through language and accurately translating that into the desired visual modifications. Therefore, text-guided image editing models often produce generations with residual object attributes that do not fully align with human expectations. To address this challenge, the models should comprehend the image content effectively away from a disconnect between the provided textual editing prompts and the actual modifications made to the image. In our paper, we propose a novel method called Locate and Forget (LaF), which effectively locates potential target concepts in the image for modification by comparing the syntactic trees of the target prompt and scene descriptions in the input image, intending to forget their existence clues in the generated image. Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively.
- [159] arXiv:2405.19711 [pdf, ps, other]
-
Title: SimiSketch: Efficiently Estimating Similarity of streaming MultisetsSubjects: Data Structures and Algorithms (cs.DS)
The challenge of estimating similarity between sets has been a significant concern in data science, finding diverse applications across various domains. However, previous approaches, such as MinHash, have predominantly centered around hashing techniques, which are well-suited for sets but less naturally adaptable to multisets, a common occurrence in scenarios like network streams and text data. Moreover, with the increasing prevalence of data arriving in streaming patterns, many existing methods struggle to handle cases where set items are presented in a continuous stream. Consequently, our focus in this paper is on the challenging scenario of multisets with item streams. To address this, we propose SimiSketch, a sketching algorithm designed to tackle this specific problem. The paper begins by presenting two simpler versions that employ intuitive sketches for similarity estimation. Subsequently, we formally introduce SimiSketch and leverage SALSA to enhance accuracy. To validate our algorithms, we conduct extensive testing on synthetic datasets, real-world network traffic, and text articles. Our experiment shows that compared with the state-of-the-art, SimiSketch can improve the accuracy by up to 42 times, and increase the throughput by up to 360 times. The complete source code is open-sourced and available on GitHub for reference.
- [160] arXiv:2405.19712 [pdf, ps, html, other]
-
Title: HINT: Learning Complete Human Neural Representations from Limited ViewpointsSubjects: Computer Vision and Pattern Recognition (cs.CV)
No augmented application is possible without animated humanoid avatars. At the same time, generating human replicas from real-world monocular hand-held or robotic sensor setups is challenging due to the limited availability of views. Previous work showed the feasibility of virtual avatars but required the presence of 360 degree views of the targeted subject. To address this issue, we propose HINT, a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles. We achieve this by introducing a symmetry prior, regularization constraints, and training cues from large human datasets. In particular, we introduce a sagittal plane symmetry prior to the appearance of the human, directly supervise the density function of the human model using explicit 3D body modeling, and leverage a co-learned human digitization network as additional supervision for the unseen angles. As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR compared to previous state-of-the-art algorithms.
- [161] arXiv:2405.19713 [pdf, ps, html, other]
-
Title: Summing divergent matrix seriesComments: 39 pages, 11 figuresSubjects: Numerical Analysis (math.NA); Classical Analysis and ODEs (math.CA)
We extend several celebrated methods in classical analysis for summing series of complex numbers to series of complex matrices. These include the summation methods of Abel, Borel, Cesáro, Euler, Lambert, Nörlund, and Mittag-Leffler, which are frequently used to sum scalar series that are divergent in the conventional sense. One feature of our matrix extensions is that they are fully noncommutative generalizations of their scalar counterparts -- not only is the scalar series replaced by a matrix series, positive weights are replaced by positive definite matrix weights, order on $\mathbb{R}$ replaced by Loewner order, exponential function replaced by matrix exponential function, etc. We will establish the regularity of our matrix summation methods, i.e., when applied to a matrix series convergent in the conventional sense, we obtain the same value for the sum. Our second goal is to provide numerical algorithms that work in conjunction with these summation methods. We discuss how the block and mixed-block summation algorithms, the Kahan compensated summation algorithm, may be applied to matrix sums with similar roundoff error bounds. These summation methods and algorithms apply not only to power or Taylor series of matrices but to any general matrix series including matrix Fourier and Dirichlet series. We will demonstrate the utility of these summation methods: establishing a Fejér's theorem and alleviating the Gibbs phenomenon for matrix Fourier series; extending the domains of matrix functions and accurately evaluating them; enhancing the matrix Padé approximation and Schur--Parlett algorithms; and more.
- [162] arXiv:2405.19715 [pdf, ps, html, other]
-
Title: SpecDec++: Boosting Speculative Decoding via Adaptive Candidate LengthsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
- [163] arXiv:2405.19716 [pdf, ps, html, other]
-
Title: Enhancing Large Vision Language Models with Self-Training on Image ComprehensionComments: 19 pages, 14 figures, 6 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.
- [164] arXiv:2405.19718 [pdf, ps, html, other]
-
Title: LED: A Large-scale Real-world Paired Dataset for Event Camera DenoisingComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Event camera has significant advantages in capturing dynamic scene information while being prone to noise interference, particularly in challenging conditions like low threshold and low illumination. However, most existing research focuses on gentle situations, hindering event camera applications in realistic complex scenarios. To tackle this limitation and advance the field, we construct a new paired real-world event denoising dataset (LED), including 3K sequences with 18K seconds of high-resolution (1200*680) event streams and showing three notable distinctions compared to others: diverse noise levels and scenes, larger-scale with high-resolution, and high-quality GT. Specifically, it contains stepped parameters and varying illumination with diverse scenarios. Moreover, based on the property of noise events inconsistency and signal events consistency, we propose a novel effective denoising framework(DED) using homogeneous dual events to generate the GT with better separating noise from the raw. Furthermore, we design a bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with dynamic thresholds to realize accurate denoising. The experimental results demonstrate that the remarkable performance of the proposed approach on different datasets.The dataset and code are at this https URL.
- [165] arXiv:2405.19722 [pdf, ps, html, other]
-
Title: QClusformer: A Quantum Transformer-based Framework for Unsupervised Visual ClusteringSubjects: Computer Vision and Pattern Recognition (cs.CV)
Unsupervised vision clustering, a cornerstone in computer vision, has been studied for decades, yielding significant outcomes across numerous vision tasks. However, these algorithms involve substantial computational demands when confronted with vast amounts of unlabeled data. Conversely, Quantum computing holds promise in expediting unsupervised algorithms when handling large-scale databases. In this study, we introduce QClusformer, a pioneering Transformer-based framework leveraging Quantum machines to tackle unsupervised vision clustering challenges. Specifically, we design the Transformer architecture, including the self-attention module and transformer blocks, from a Quantum perspective to enable execution on Quantum hardware. In addition, we present QClusformer, a variant based on the Transformer architecture, tailored for unsupervised vision clustering tasks. By integrating these elements into an end-to-end framework, QClusformer consistently outperforms previous methods running on classical computers. Empirical evaluations across diverse benchmarks, including MS-Celeb-1M and DeepFashion, underscore the superior performance of QClusformer compared to state-of-the-art methods.
- [166] arXiv:2405.19723 [pdf, ps, html, other]
-
Title: Encoding and Controlling Global Semantics for Long-form Video Question AnsweringComments: Work in progressSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Seeking answers effectively for long videos is essential to build video question answering (videoQA) systems. Previous methods adaptively select frames and regions from long videos to save computations. However, this fails to reason over the whole sequence of video, leading to sub-optimal performance. To address this problem, we introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video, which mitigates the video information loss caused by frame and region selection modules. Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations. To further enhance the controllability, we introduce a cross-modal compositional congruence (C^3) objective to encourage global semantics aligned with the question. To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length, i.e. 17.5 minutes and 1.9 hours, respectively. Extensive experiments demonstrate the superiority of our framework on these new as well as existing datasets.
- [167] arXiv:2405.19726 [pdf, ps, html, other]
-
Title: Streaming Video Diffusion: Online Video Editing with Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV)
We present a novel task called online video editing, which is designed to edit \textbf{streaming} frames while maintaining temporal consistency. Unlike existing offline video editing assuming all frames are pre-established and accessible, online video editing is tailored to real-life applications such as live streaming and online chat, requiring (1) fast continual step inference, (2) long-term temporal modeling, and (3) zero-shot video editing capability. To solve these issues, we propose Streaming Video Diffusion (SVDiff), which incorporates the compact spatial-aware temporal recurrence into off-the-shelf Stable Diffusion and is trained with the segment-level scheme on large-scale long videos. This simple yet effective setup allows us to obtain a single model that is capable of executing a broad range of videos and editing each streaming frame with temporal coherence. Our experiments indicate that our model can edit long, high-quality videos with remarkable results, achieving a real-time inference speed of 15.2 FPS at a resolution of 512x512.
- [168] arXiv:2405.19727 [pdf, ps, html, other]
-
Title: Automatic Dance Video Segmentation for Understanding ChoreographyComments: 9 pages, 11 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Segmenting dance video into short movements is a popular way to easily understand dance choreography. However, it is currently done manually and requires a significant amount of effort by experts. That is, even if many dance videos are available on social media (e.g., TikTok and YouTube), it remains difficult for people, especially novices, to casually watch short video segments to practice dance choreography. In this paper, we propose a method to automatically segment a dance video into each movement. Given a dance video as input, we first extract visual and audio features: the former is computed from the keypoints of the dancer in the video, and the latter is computed from the Mel spectrogram of the music in the video. Next, these features are passed to a Temporal Convolutional Network (TCN), and segmentation points are estimated by picking peaks of the network output. To build our training dataset, we annotate segmentation points to dance videos in the AIST Dance Video Database, which is a shared database containing original street dance videos with copyright-cleared dance music. The evaluation study shows that the proposed method (i.e., combining the visual and audio features) can estimate segmentation points with high accuracy. In addition, we developed an application to help dancers practice choreography using the proposed method.
- [169] arXiv:2405.19729 [pdf, ps, html, other]
-
Title: Dynamic feature selection in medical predictive monitoring by reinforcement learningComments: preview versionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In this paper, we investigate dynamic feature selection within multivariate time-series scenario, a common occurrence in clinical prediction monitoring where each feature corresponds to a bio-test result. Many existing feature selection methods fall short in effectively leveraging time-series information, primarily because they are designed for static data. Our approach addresses this limitation by enabling the selection of time-varying feature subsets for each patient. Specifically, we employ reinforcement learning to optimize a policy under maximum cost restrictions. The prediction model is subsequently updated using synthetic data generated by trained policy. Our method can seamlessly integrate with non-differentiable prediction models. We conducted experiments on a sizable clinical dataset encompassing regression and classification tasks. The results demonstrate that our approach outperforms strong feature selection baselines, particularly when subjected to stringent cost limitations. Code will be released once paper is accepted.
- [170] arXiv:2405.19730 [pdf, ps, other]
-
Title: Research on Foundation Model for Spatial Data Intelligence: China's 2024 White Paper on Strategic Development of Spatial Data IntelligenceShaohua Wang (1), Xing Xie (2), Yong Li (3), Danhuai Guo (4), Zhi Cai (5), Yu Liu (6), Yang Yue (7), Xiao Pan (8), Feng Lu (9), Huayi Wu (10), Zhipeng Gui (10), Zhiming Ding (11), Bolong Zheng (12), Fuzheng Zhang (13), Tao Qin (2), Jingyuan Wang (14), Chuang Tao (15), Zhengchao Chen (1), Hao Lu (16), Jiayi Li (10), Hongyang Chen (17), Peng Yue (10), Wenhao Yu (18), Yao Yao (18), Leilei Sun (14), Yong Zhang (5), Longbiao Chen (19), Xiaoping Du (20), Xiang Li (21), Xueying Zhang (22), Kun Qin (10), Zhaoya Gong (6), Weihua Dong (23), Xiaofeng Meng (24) ((1) Aerospace Information Research Institute, Chinese Academy of Sciences,(2) Microsoft Research Asia, (3) Tsinghua University, (4) Beijing University of Chemical Technology, (5) Beijing University of Technology, (6) Peking University, (7) Shenzhen University, (8) Shijiazhuang Tiedao University, (9) Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, (10) Wuhan University, (11) Institute of Software, Chinese Academy of Sciences, (12) Huazhong University of Science and Technology, (13) Kuaishou Natural Language Processing Center and Audio Center, (14) Beijing University of Aeronautics and Astronautics, (15) Shanghai Figure Interesting Information Technology Co., Ltd., (16) SuperMap Software Co., Ltd., (17) Zhejiang Lab, (18) China University of Geosciences (Wuhan), (19) Xiamen University, (20) Key Laboratory of Digital Earth, Chinese Academy of Sciences, (21) East China Normal University, (22) Nanjing Normal University, (23) Beijing Normal University, (24) Renmin University of China)Comments: in Chinese languageSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios. Additionally, it summarizes the latest application cases of spatial data intelligent large models in themes such as urban development, multimodal systems, remote sensing, smart transportation, and resource environments. Finally, the report concludes with an overview and outlook on the development prospects of spatial data intelligent large models.
- [171] arXiv:2405.19731 [pdf, ps, html, other]
-
Title: Some New Approaches to MPI ImplementationsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
This paper provides some new approaches to MPI implementations to improve MPI performance. These approaches include dynamically composable libraries, reducing average layer numbers of MPI libraries, and a single entity of MPI-network, MPI-protocol, and MPI.
- [172] arXiv:2405.19732 [pdf, ps, html, other]
-
Title: Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based OptimizationSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Learning a skill generally relies on both practical experience by doer and insightful high-level guidance by instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at this https URL.
- [173] arXiv:2405.19735 [pdf, ps, html, other]
-
Title: Twin Deformable Point Convolutions for Point Cloud Semantic Segmentation in Remote Sensing ScenesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Thanks to the application of deep learning technology in point cloud processing of the remote sensing field, point cloud segmentation has become a research hotspot in recent years, which can be applied to real-world 3D, smart cities, and other fields. Although existing solutions have made unprecedented progress, they ignore the inherent characteristics of point clouds in remote sensing fields that are strictly arranged according to latitude, longitude, and altitude, which brings great convenience to the segmentation of point clouds in remote sensing fields. To consider this property cleverly, we propose novel convolution operators, termed Twin Deformable point Convolutions (TDConvs), which aim to achieve adaptive feature learning by learning deformable sampling points in the latitude-longitude plane and altitude direction, respectively. First, to model the characteristics of the latitude-longitude plane, we propose a Cylinder-wise Deformable point Convolution (CyDConv) operator, which generates a two-dimensional cylinder map by constructing a cylinder-like grid in the latitude-longitude direction. Furthermore, to better integrate the features of the latitude-longitude plane and the spatial geometric features, we perform a multi-scale fusion of the extracted latitude-longitude features and spatial geometric features, and realize it through the aggregation of adjacent point features of different scales. In addition, a Sphere-wise Deformable point Convolution (SpDConv) operator is introduced to adaptively offset the sampling points in three-dimensional space by constructing a sphere grid structure, aiming at modeling the characteristics in the altitude direction. Experiments on existing popular benchmarks conclude that our TDConvs achieve the best segmentation performance, surpassing the existing state-of-the-art methods.
- [174] arXiv:2405.19736 [pdf, ps, html, other]
-
Title: Learning Task-relevant Sequence Representations via Intrinsic Dynamics Characteristics in Reinforcement LearningSubjects: Artificial Intelligence (cs.AI)
Learning task-relevant state representations is crucial to solving the problem of scene generalization in visual deep reinforcement learning. Prior work typically establishes a self-supervised auxiliary learner, introducing elements (e.g., rewards and actions) to extract task-relevant state information from observations through behavioral similarity metrics. However, the methods often ignore the inherent relationships between the elements (e.g., dynamics relationships) that are essential for learning accurate representations, and they are also limited to single-step metrics, which impedes the discrimination of short-term similar task/behavior information in long-term dynamics transitions. To solve the issues, we propose an intrinsic dynamic characteristics-driven sequence representation learning method (DSR) over a common DRL frame. Concretely, inspired by the fact of state transition in the underlying system, it constrains the optimization of the encoder via modeling the dynamics equations related to the state transition, which prompts the latent encoding information to satisfy the state transition process and thereby distinguishes state space and noise space. Further, to refine the ability of encoding similar tasks based on dynamics constraints, DSR also sequentially models inherent dynamics equation relationships from the perspective of sequence elements' frequency domain and multi-step prediction. Finally, experimental results show that DSR has achieved a significant performance boost in the Distracting DMControl Benchmark, with an average of 78.9% over the backbone baseline. Further results indicate that it also achieves the best performance in real-world autonomous driving tasks in the CARLA simulator. Moreover, the qualitative analysis results of t-SNE visualization validate that our method possesses superior representation ability on visual tasks.
- [175] arXiv:2405.19737 [pdf, ps, html, other]
-
Title: Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning DistillationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion ($\approx 4.7\%$) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher's reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistak\textbf{E}-\textbf{D}riven key reason\textbf{I}ng step distilla\textbf{T}ion (\textbf{EDIT}), a novel method that further aids SLMs learning key reasoning steps rather than mere simple fine-tuning. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs\footnote{Code can be found at \url{this https URL}}.
- [176] arXiv:2405.19740 [pdf, ps, html, other]
-
Title: PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant PerturbationsComments: 23 pages, 12 figures, 10 tablesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Expert-designed close-ended benchmarks serve as vital tools in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of transition analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six state-of-the-art LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 21% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs' uncertainty to specious knowledge, potentially being resolved through rote memorization and leading to inflated performance. We also find that the detailed transition analyses by PertEval could illuminate weaknesses in existing LLMs' knowledge mastery and guide the development of refinement. Given these insights, we posit that PertEval can act as an essential tool that, when applied alongside any close-ended benchmark, unveils the true knowledge capacity of LLMs, marking a significant step toward more trustworthy LLM evaluation.
- [177] arXiv:2405.19743 [pdf, ps, html, other]
-
Title: May the Dance be with You: Dance Generation Framework for Non-HumanoidsComments: 13 pages, 6 Figures, Rejected at Neurips 2023Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
We hypothesize dance as a motion that forms a visual rhythm from music, where the visual rhythm can be perceived from an optical flow. If an agent can recognize the relationship between visual rhythm and music, it will be able to dance by generating a motion to create a visual rhythm that matches the music. Based on this, we propose a framework for any kind of non-humanoid agents to learn how to dance from human videos. Our framework works in two processes: (1) training a reward model which perceives the relationship between optical flow (visual rhythm) and music from human dance videos, (2) training the non-humanoid dancer based on that reward model, and reinforcement learning. Our reward model consists of two feature encoders for optical flow and music. They are trained based on contrastive learning which makes the higher similarity between concurrent optical flow and music features. With this reward model, the agent learns dancing by getting a higher reward when its action creates an optical flow whose feature has a higher similarity with the given music feature. Experiment results show that generated dance motion can align with the music beat properly, and user study result indicates that our framework is more preferred by humans compared to the baselines. To the best of our knowledge, our work of non-humanoid agents which learn dance from human videos is unprecedented. An example video can be found at this https URL.
- [178] arXiv:2405.19744 [pdf, ps, html, other]
-
Title: X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual InstructionsComments: ACL 2024. Our codes, data and model weights are available at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models respond well in high-resource languages like English but struggle in low-resource languages. It may arise from the lack of high-quality instruction following data in these languages. Directly translating English samples into these languages can be a solution but unreliable, leading to responses with translation errors and lacking language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction following samples with instruction in English and response in low-resource languages. Specifically, the language model first learns to generate appropriate English instructions according to the natural web texts in other languages as responses. The candidate cross-lingual instruction tuning samples are further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset on 10 languages, namely X-Instruction. The instruction data built using our method incorporate more language-specific knowledge compared with the naive translation method. Experimental results have shown that the response quality of the model tuned on X-Instruction greatly exceeds the model distilled from a powerful teacher model, reaching or even surpassing the ones of ChatGPT. In addition, we find that models tuned on cross-lingual instruction following samples can follow the instruction in the output language without further tuning.
- [179] arXiv:2405.19745 [pdf, ps, html, other]
-
Title: GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View SynthesisComments: Accepted to SIGGRAPH 2024 Conference. Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a challenge yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a novel framework that empowers 3D Gaussian representations with dynamic scene modeling and future scenario synthesis in dynamic environments. GaussianPrediction can forecast future states from any viewpoint, using video observations of dynamic scenes. To this end, we first propose a 3D Gaussian canonical space with deformation modeling to capture the appearance and geometry of dynamic scenes, and integrate the lifecycle property into Gaussians for irreversible deformations. To make the prediction feasible and efficient, a concentric motion distillation approach is developed by distilling the scene motion with key points. Finally, a Graph Convolutional Network is employed to predict the motions of key points, enabling the rendering of photorealistic images of future scenarios. Our framework shows outstanding performance on both synthetic and real-world datasets, demonstrating its efficacy in predicting and rendering future environments.
- [180] arXiv:2405.19746 [pdf, ps, html, other]
-
Title: DenseSeg: Joint Learning for Semantic Segmentation and Landmark Detection Using Dense Image-to-Shape RepresentationRon Keuth, Lasse Hansen, Maren Balks, Ronja Jäger, Anne-Nele Schröder, Ludger Tüshaus, Mattias HeinrichSubjects: Computer Vision and Pattern Recognition (cs.CV)
Purpose: Semantic segmentation and landmark detection are fundamental tasks of medical image processing, facilitating further analysis of anatomical objects. Although deep learning-based pixel-wise classification has set a new-state-of-the-art for segmentation, it falls short in landmark detection, a strength of shape-based approaches.
Methods: In this work, we propose a dense image-to-shape representation that enables the joint learning of landmarks and semantic segmentation by employing a fully convolutional architecture. Our method intuitively allows the extraction of arbitrary landmarks due to its representation of anatomical correspondences. We benchmark our method against the state-of-the-art for semantic segmentation (nnUNet), a shape-based approach employing geometric deep learning and a CNN-based method for landmark detection.
Results: We evaluate our method on two medical dataset: one common benchmark featuring the lungs, heart, and clavicle from thorax X-rays, and another with 17 different bones in the paediatric wrist. While our method is on pair with the landmark detection baseline in the thorax setting (error in mm of $2.6\pm0.9$ vs $2.7\pm0.9$), it substantially surpassed it in the more complex wrist setting ($1.1\pm0.6$ vs $1.9\pm0.5$).
Conclusion: We demonstrate that dense geometric shape representation is beneficial for challenging landmark detection tasks and outperforms previous state-of-the-art using heatmap regression. While it does not require explicit training on the landmarks themselves, allowing for the addition of new landmarks without necessitating retraining.} - [181] arXiv:2405.19747 [pdf, ps, other]
-
Title: Understanding and mitigating difficulties in posterior predictive evaluationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Predictive posterior densities (PPDs) are of interest in approximate Bayesian inference. Typically, these are estimated by simple Monte Carlo (MC) averages using samples from the approximate posterior. We observe that the signal-to-noise ratio (SNR) of such estimators can be extremely low. An analysis for exact inference reveals SNR decays exponentially as there is an increase in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of the test data relative to the training data. Further analysis extends these results to approximate inference. To remedy the low SNR problem, we propose replacing simple MC sampling with importance sampling using a proposal distribution optimized at test time on a variational proxy for the SNR and demonstrate that this yields greatly improved estimates.
- [182] arXiv:2405.19749 [pdf, ps, html, other]
-
Title: Generating Query Recommendations via LLMsComments: Generating Query Recommendations via LLMsSubjects: Information Retrieval (cs.IR)
Query recommendation systems are ubiquitous in modern search engines, assisting users in producing effective queries to meet their information needs. However, these systems require a large amount of data to produce good recommendations, such as a large collection of documents to index and query logs. In particular, query logs and user data are not available in cold start scenarios. Query logs are expensive to collect and maintain and require complex and time-consuming cascading pipelines for creating, combining, and ranking recommendations. To address these issues, we frame the query recommendation problem as a generative task, proposing a novel approach called Generative Query Recommendation (GQR). GQR uses an LLM as its foundation and does not require to be trained or fine-tuned to tackle the query recommendation problem. We design a prompt that enables the LLM to understand the specific recommendation task, even using a single example. We then improved our system by proposing a version that exploits query logs called Retriever-Augmented GQR (RA-GQR). RA-GQr dynamically composes its prompt by retrieving similar queries from query logs. GQR approaches reuses a pre-existing neural architecture resulting in a simpler and more ready-to-market approach, even in a cold start scenario. Our proposed GQR obtains state-of-the-art performance in terms of NDCG@10 and clarity score against two commercial search engines and the previous state-of-the-art approach on the Robust04 and ClueWeb09B collections, improving on average the NDCG@10 performance up to ~4% on Robust04 and ClueWeb09B w.r.t the previous best competitor. RA-GQR further improve the NDCG@10 obtaining an increase of ~11%, ~6\% on Robust04 and ClueWeb09B w.r.t the best competitor. Furthermore, our system obtained ~59% of user preferences in a blind user study, proving that our method produces the most engaging queries.
- [183] arXiv:2405.19751 [pdf, ps, html, other]
-
Title: HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid QuantizationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial and academic fields for their superior visual generation capabilities, outperforming traditional diffusion models that use U-Net. However,the enhanced performance of DiTs also comes with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones. To address these challenges, we introduce the Hybrid Floating-point Quantization for DiT(HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference. Compared to fixed-point quantization (e.g., INT8), FP quantization, complemented by our proposed clipping range selection mechanism, naturally aligns with the data distribution within DiT, resulting in a minimal quantization error. Furthermore, HQ-DiT also implements a universal identity mathematical transform to mitigate the serious quantization error caused by the outliers. The experimental results demonstrate that DiT can achieve extremely low-precision quantization (i.e., 4 bits) with negligible impact on performance. Our approach marks the first instance where both weights and activations in DiTs are quantized to just 4 bits, with only a 0.12 increase in sFID on ImageNet.
- [184] arXiv:2405.19752 [pdf, ps, other]
-
Title: Understanding Memory-Regret Trade-Off for Streaming Stochastic Multi-Armed BanditsSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study the stochastic multi-armed bandit problem in the $P$-pass streaming model. In this problem, the $n$ arms are present in a stream and at most $m<n$ arms and their statistics can be stored in the memory. We give a complete characterization of the optimal regret in terms of $m, n$ and $P$. Specifically, we design an algorithm with $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ regret and complement it with an $\tilde \Omega\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ lower bound when the number of rounds $T$ is sufficiently large. Our results are tight up to a logarithmic factor in $n$ and $P$.
- [185] arXiv:2405.19754 [pdf, ps, html, other]
-
Title: Mitigating annotation shift in cancer classification using single image generative modelsComments: Preprint of paper accepted at SPIE IWBI 2024 ConferenceSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Artificial Intelligence (AI) has emerged as a valuable tool for assisting radiologists in breast cancer detection and diagnosis. However, the success of AI applications in this domain is restricted by the quantity and quality of available data, posing challenges due to limited and costly data annotation procedures that often lead to annotation shifts. This study simulates, analyses and mitigates annotation shifts in cancer classification in the breast mammography domain. First, a high-accuracy cancer risk prediction model is developed, which effectively distinguishes benign from malignant lesions. Next, model performance is used to quantify the impact of annotation shift. We uncover a substantial impact of annotation shift on multiclass classification performance particularly for malignant lesions. We thus propose a training data augmentation approach based on single-image generative models for the affected class, requiring as few as four in-domain annotations to considerably mitigate annotation shift, while also addressing dataset imbalance. Lastly, we further increase performance by proposing and validating an ensemble architecture based on multiple models trained under different data augmentation regimes. Our study offers key insights into annotation shift in deep learning breast cancer classification and explores the potential of single-image generative models to overcome domain shift challenges.
- [186] arXiv:2405.19755 [pdf, ps, html, other]
-
Title: Developing a Comprehensive Measurement Tool for Assessing the Rate of BIM Adoption in the Construction IndustrySubjects: Robotics (cs.RO)
Building Information Modeling (BIM) is a crucial technology in the construction industry, offering benefits such as enhanced collaboration, real-time decision-making, and significant cost and time savings. Despite its advantages, BIM adoption faces numerous barriers. This study aims to create a reliable tool to assess the Rate of BIM Adoption (RBA), drawing on Attributes of Innovation theory and empirical data from the literature. This research integrates theoretical insights with empirical data, providing quantitative items to measure BAR in the construction industry. The quantitative approach helps decision-makers and policymakers to mandate BIM and establish appropriate implementation standards. Its implications are significant for the construction industry, policymakers, and the academic community, offering a systematic approach to assess BIM adoption, identify barriers, and implement targeted strategies. The reliability of this approach is ensured through a solid theoretical foundation, item development, pilot testing, and statistical analysis, making it a valuable resource for improving BIM implementation and fostering innovation in the construction industry.
- [187] arXiv:2405.19757 [pdf, ps, html, other]
-
Title: Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise FilteringSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Recent advances in a generative neural network model extend the development of data augmentation methods. However, the augmentation methods based on the modern generative models fail to achieve notable performance for class imbalance data compared to the conventional model, the SMOTE. We investigate the problem of the generative model for imbalanced classification and introduce a framework to enhance the SMOTE algorithm using Variational Autoencoders (VAE). Our approach systematically quantifies the density of data points in a low-dimensional latent space using the VAE, simultaneously incorporating information on class labels and classification difficulty. Then, the data points potentially degrading the augmentation are systematically excluded, and the neighboring observations are directly augmented on the data space. Empirical studies on several imbalanced datasets represent that this simple process innovatively improves the conventional SMOTE algorithm over the deep learning models. Consequently, we conclude that the selection of minority data and the interpolation in the data space are beneficial for imbalanced classification problems with a relatively small number of data points.
- [188] arXiv:2405.19758 [pdf, ps, html, other]
-
Title: InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task PlanningComments: RSS 2024; this https URLSubjects: Robotics (cs.RO)
Learning abstract state representations and knowledge is crucial for long-horizon robot planning. We present InterPreT, an LLM-powered framework for robots to learn symbolic predicates from language feedback of human non-experts during embodied interaction. The learned predicates provide relational abstractions of the environment state, facilitating the learning of symbolic operators that capture action preconditions and effects. By compiling the learned predicates and operators into a PDDL domain on-the-fly, InterPreT allows effective planning toward arbitrary in-domain goals using a PDDL planner. In both simulated and real-world robot manipulation domains, we demonstrate that InterPreT reliably uncovers the key predicates and operators governing the environment dynamics. Although learned from simple training tasks, these predicates and operators exhibit strong generalization to novel tasks with significantly higher complexity. In the most challenging generalization setting, InterPreT attains success rates of 73% in simulation and 40% in the real world, substantially outperforming baseline methods.
- [189] arXiv:2405.19761 [pdf, ps, html, other]
-
Title: Revisiting CNNs for Trajectory Similarity LearningSubjects: Artificial Intelligence (cs.AI)
Similarity search is a fundamental but expensive operator in querying trajectory data, due to its quadratic complexity of distance computation. To mitigate the computational burden for long trajectories, neural networks have been widely employed for similarity learning and each trajectory is encoded as a high-dimensional vector for similarity search with linear complexity. Given the sequential nature of trajectory data, previous efforts have been primarily devoted to the utilization of RNNs or Transformers.
In this paper, we argue that the common practice of treating trajectory as sequential data results in excessive attention to capturing long-term global dependency between two sequences. Instead, our investigation reveals the pivotal role of local similarity, prompting a revisit of simple CNNs for trajectory similarity learning. We introduce ConvTraj, incorporating both 1D and 2D convolutions to capture sequential and geo-distribution features of trajectories, respectively. In addition, we conduct a series of theoretical analyses to justify the effectiveness of ConvTraj. Experimental results on three real-world large-scale datasets demonstrate that ConvTraj achieves state-of-the-art accuracy in trajectory similarity search. Owing to the simple network structure of ConvTraj, the training and inference speed on the Porto dataset with 1.6 million trajectories are increased by at least $240$x and $2.16$x, respectively. The source code and dataset can be found at \textit{\url{this https URL}}. - [190] arXiv:2405.19762 [pdf, ps, html, other]
-
Title: The Kosmosis Use-Case of Crypto Rug Pull Detection and PreventionSubjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Current methods to prevent crypto asset fraud are based on the analysis of transaction graphs within blockchain networks. While effective for identifying transaction patterns indicative of fraud, it does not capture the semantics of transactions and is constrained to blockchain data. Consequently, preventive methods based on transaction graphs are inherently limited. In response to these limitations, we propose the Kosmosis approach, which aims to incrementally construct a knowledge graph as new blockchain and social media data become available. During construction, it aims to extract the semantics of transactions and connect blockchain addresses to their real-world entities by fusing blockchain and social media data in a knowledge graph. This enables novel preventive methods against rug pulls as a form of crypto asset fraud. To demonstrate the effectiveness and practical applicability of the Kosmosis approach, we examine a series of real-world rug pulls from 2021. Through this case, we illustrate how Kosmosis can aid in identifying and preventing such fraudulent activities by leveraging the insights from the constructed knowledge graph.
- [191] arXiv:2405.19763 [pdf, ps, html, other]
-
Title: Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language UnderstandingComments: Accept at ACL2024 MainSubjects: Computation and Language (cs.CL)
Recent strides in large language models (LLMs) have yielded remarkable performance, leveraging reinforcement learning from human feedback (RLHF) to significantly enhance generation and alignment capabilities. However, RLHF encounters numerous challenges, including the objective mismatch issue, leading to suboptimal performance in Natural Language Understanding (NLU) tasks. To address this limitation, we propose a novel Reinforcement Learning framework enhanced with Label-sensitive Reward (RLLR) to amplify the performance of LLMs in NLU tasks. By incorporating label-sensitive pairs into reinforcement learning, our method aims to adeptly capture nuanced label-sensitive semantic features during RL, thereby enhancing natural language understanding. Experiments conducted on five diverse foundation models across eight tasks showcase promising results. In comparison to Supervised Fine-tuning models (SFT), RLLR demonstrates an average performance improvement of 1.54%. Compared with RLHF models, the improvement averages at 0.69%. These results reveal the effectiveness of our method for LLMs in NLU tasks. Code and data available at: this https URL.
- [192] arXiv:2405.19765 [pdf, ps, html, other]
-
Title: Towards Unified Multi-granularity Text Detection with Interactive AttentionComments: ICML 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. This design enables DAT to efficiently manage text instances at different granularities, including *word*, *line*, *paragraph* and *page*. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances at varying granularities by correlating structural information across different text queries. As a result, it enables the model to achieve mutually beneficial detection performances across multiple text granularities. Additionally, a prompt-based segmentation module refines detection outcomes for texts of arbitrary curvature and complex layouts, thereby improving DAT's accuracy and expanding its real-world applicability. Experimental results demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.
- [193] arXiv:2405.19769 [pdf, ps, html, other]
-
Title: All-In-One Medical Image Restoration via Task-Adaptive RoutingComments: This article has been early accepted by MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Although single-task medical image restoration (MedIR) has witnessed remarkable success, the limited generalizability of these methods poses a substantial obstacle to wider application. In this paper, we focus on the task of all-in-one medical image restoration, aiming to address multiple distinct MedIR tasks with a single universal model. Nonetheless, due to significant differences between different MedIR tasks, training a universal model often encounters task interference issues, where different tasks with shared parameters may conflict with each other in the gradient update direction. This task interference leads to deviation of the model update direction from the optimal path, thereby affecting the model's performance. To tackle this issue, we propose a task-adaptive routing strategy, allowing conflicting tasks to select different network paths in spatial and channel dimensions, thereby mitigating task interference. Experimental results demonstrate that our proposed \textbf{A}ll-in-one \textbf{M}edical \textbf{I}mage \textbf{R}estoration (\textbf{AMIR}) network achieves state-of-the-art performance in three MedIR tasks: MRI super-resolution, CT denoising, and PET synthesis, both in single-task and all-in-one settings. The code and data will be available at \href{this https URL}{this https URL}.
- [194] arXiv:2405.19771 [pdf, ps, html, other]
-
Title: Data Service Maximization in Integrated Terrestrial-Non-Terrestrial 6G Networks: A Deep Reinforcement Learning ApproachComments: 5 pages, 4 figuresSubjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Integrating terrestrial and non-terrestrial networks has emerged as a promising paradigm to fulfill the constantly growing demand for connectivity, low transmission delay, and quality of services (QoS). This integration brings together the strengths of terrestrial and non-terrestrial networks, such as the reliability of terrestrial networks, broad coverage, and service continuity of non-terrestrial networks like low earth orbit (LEO) satellites. In this work, we study a data service maximization problem in an integrated terrestrial-non-terrestrial network (I-TNT) where the ground base stations (GBSs) and LEO satellites cooperatively serve the coexisting aerial users (AUs) and ground users (GUs). Then, by considering the spectrum scarcity, interference, and QoS requirements of the users, we jointly optimize the user association, AUE's trajectory, and power allocation. To tackle the formulated mixed-integer non-convex problem, we disintegrate it into two subproblems: 1) user association problem and 2) trajectory and power allocation problem. Since the user association problem is a binary integer programming problem, we use the standard convex optimization method to solve it. Meanwhile, the trajectory and power allocation problem is solved by the deep deterministic policy gradient (DDPG) method to cope with the problem's non-convexity and dynamic network environments. Then, the two subproblems are alternately solved by the proposed iterative algorithm. By comparing with the baselines in the existing literature, extensive simulations are conducted to evaluate the performance of the proposed framework.
- [195] arXiv:2405.19773 [pdf, ps, html, other]
-
Title: VQA Training Sets are Self-play Environments for Generating Few-shot PoolsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Large-language models and large-vision models are increasingly capable of solving compositional reasoning tasks, as measured by breakthroughs in visual-question answering benchmarks. However, state-of-the-art solutions often involve careful construction of large pre-training and fine-tuning datasets, which can be expensive. The use of external tools, whether other ML models, search engines, or APIs, can significantly improve performance by breaking down high-level reasoning questions into sub-questions that are answerable by individual tools, but this approach has similar dataset construction costs to teach fine-tuned models how to use the available tools. We propose a technique in which existing training sets can be directly used for constructing computational environments with task metrics as rewards. This enables a model to autonomously teach itself to use itself or another model as a tool. By doing so, we augment training sets by integrating external signals. The proposed method starts with zero-shot prompts and iteratively refines them by selecting few-shot examples that maximize the task metric on the training set. Our experiments showcase how Gemini learns how to use itself, or another smaller and specialized model such as ScreenAI, to iteratively improve performance on training sets. Our approach successfully generalizes and improves upon zeroshot performance on charts, infographics, and document visual question-answering datasets
- [196] arXiv:2405.19775 [pdf, ps, html, other]
-
Title: Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion NetworkComments: 11 pages, 11 figures, to be published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Style transfer aims to render an image with the artistic features of a style image, while maintaining the original structure. Various methods have been put forward for this task, but some challenges still exist. For instance, it is difficult for CNN-based methods to handle global information and long-range dependencies between input images, for which transformer-based methods have been proposed. Although transformers can better model the relationship between content and style images, they require high-cost hardware and time-consuming inference. To address these issues, we design a novel transformer model that includes only the encoder, thus significantly reducing the computational cost. In addition, we also find that existing style transfer methods may lead to images under-stylied or missing content. In order to achieve better stylization, we design a content feature extractor and a style feature extractor, based on which pure content and style images can be fed to the transformer. Finally, we propose a novel network termed Puff-Net, i.e., pure content and style feature fusion network. Through qualitative and quantitative experiments, we demonstrate the advantages of our model compared to state-of-the-art ones in the literature.
- [197] arXiv:2405.19778 [pdf, ps, html, other]
-
Title: Enhancing Consistency and Role-Specific Knowledge Capturing by Rebuilding Fictional Character's PersonaComments: preprintSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
With the recent introduction of Assistants API, it is expected that document-based language models will be actively used in various domains, especially Role-playing. However, a key challenge lies in utilizing protagonist's persona: Assistants API often fails to achieve with its search because the information extraction part is different each time and it often omits important information such as protagonist's backstory or relationships. It is hard to maintain a consistent persona simply by using the persona document as input to the Assistants API. To address the challenge of achieving stable persona consistency, we propose CharacterGPT, a novel persona reconstruction framework to alleviate the shortcomings of the Assistants API. Our method involves Character Persona Training (CPT), an effective persona rebuilding process that updates the character persona by extracting the character's traits from given summary of the novel for each character as if the story in a novel progresses. In our experiments, we ask each character to take the Big Five Inventory personality test in various settings and analyze the results. To assess whether it can think outside the box, we let each character generate short novels. Extensive experiments and human evaluation demonstrate that CharacterGPT presents new possibilities for role-playing agent research.
- [198] arXiv:2405.19779 [pdf, ps, html, other]
-
Title: Automatic Graph Topology-Aware TransformerComments: This work has been submitted to the IEEE (Under Second Review). Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Neural and Evolutionary Computing (cs.NE); Graphics (cs.GR); Machine Learning (cs.LG)
Existing efforts are dedicated to designing many topologies and graph-aware strategies for the graph Transformer, which greatly improve the model's representation capabilities. However, manually determining the suitable Transformer architecture for a specific graph dataset or task requires extensive expert knowledge and laborious trials. This paper proposes an evolutionary graph Transformer architecture search framework (EGTAS) to automate the construction of strong graph Transformers. We build a comprehensive graph Transformer search space with the micro-level and macro-level designs. EGTAS evolves graph Transformer topologies at the macro level and graph-aware strategies at the micro level. Furthermore, a surrogate model based on generic architectural coding is proposed to directly predict the performance of graph Transformers, substantially reducing the evaluation cost of evolutionary search. We demonstrate the efficacy of EGTAS across a range of graph-level and node-level tasks, encompassing both small-scale and large-scale graph datasets. Experimental results and ablation studies show that EGTAS can construct high-performance architectures that rival state-of-the-art manual and automated baselines.
- [199] arXiv:2405.19782 [pdf, ps, html, other]
-
Title: Dataflow-Guided Retrieval Augmentation for Repository-Level Code CompletionComments: Accepted in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Recent years have witnessed the deployment of code language models (LMs) in various code intelligence tasks such as code completion. Yet, it is challenging for pre-trained LMs to generate correct completions in private repositories. Previous studies retrieve cross-file context based on import relations or text similarity, which is insufficiently relevant to completion targets. In this paper, we propose a dataflow-guided retrieval augmentation approach, called DraCo, for repository-level code completion. DraCo parses a private repository into code entities and establishes their relations through an extended dataflow analysis, forming a repo-specific context graph. Whenever triggering code completion, DraCo precisely retrieves relevant background knowledge from the repo-specific context graph and generates well-formed prompts to query code LMs. Furthermore, we construct a large Python dataset, ReccEval, with more diverse completion targets. Our experiments demonstrate the superior accuracy and applicable efficiency of DraCo, improving code exact match by 3.43% and identifier F1-score by 3.27% on average compared to the state-of-the-art approach.
- [200] arXiv:2405.19783 [pdf, ps, html, other]
-
Title: Instruction-Guided Visual MaskingJinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan ZhanComments: preprint, 21 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Instruction following is crucial in contemporary LLM. However, when extended to multimodal setting, it often suffers from misalignment between specific textual instruction and targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMM and robot model. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create an IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL) for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code is available at this https URL.
- [201] arXiv:2405.19784 [pdf, ps, other]
-
Title: PixelsDB: Serverless and Natural-Language-Aided Data Analytics with Flexible Service Levels and PricesComments: 4 pages, 3 figuresSubjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Serverless query processing has become increasingly popular due to its advantages, including automated hardware and software management, high elasticity, and pay-as-you-go pricing. For users who are not system experts, serverless query processing greatly reduces the cost of owning a data analytic system. However, it is still a significant challenge for non-expert users to transform their complex and evolving data analytic needs into proper SQL queries and select a serverless query engine that delivers satisfactory performance and price for each type of query.
This paper presents PixelsDB, an open-source data analytic system that allows users who lack system or SQL expertise to explore data efficiently. It allows users to generate and debug SQL queries using a natural language interface powered by fine-tuned language models. The queries are then executed by a serverless query engine that offers varying prices for different service levels on query urgency. The service levels are natively supported by dedicated architecture design and heterogeneous resource scheduling that can apply cost-efficient resources to process non-urgent queries. We envision that the combination of a serverless paradigm, a natural-language-aided interface, and flexible service levels and prices will substantially improve the user experience in data analysis. - [202] arXiv:2405.19785 [pdf, ps, html, other]
-
Title: Recurrent Deep Kernel Learning of Dynamical SystemsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Digital twins require computationally-efficient reduced-order models (ROMs) that can accurately describe complex dynamics of physical assets. However, constructing ROMs from noisy high-dimensional data is challenging. In this work, we propose a data-driven, non-intrusive method that utilizes stochastic variational deep kernel learning (SVDKL) to discover low-dimensional latent spaces from data and a recurrent version of SVDKL for representing and predicting the evolution of latent dynamics. The proposed method is demonstrated with two challenging examples -- a double pendulum and a reaction-diffusion system. Results show that our framework is capable of (i) denoising and reconstructing measurements, (ii) learning compact representations of system states, (iii) predicting system evolution in low-dimensional latent spaces, and (iv) quantifying modeling uncertainties.
- [203] arXiv:2405.19787 [pdf, ps, html, other]
-
Title: From Symbolic Tasks to Code Generation: Diversification Yields Better Task PerformersSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Instruction tuning -- tuning large language models on instruction-output pairs -- is a promising technique for making models better adapted to the real world. Yet, the key factors driving the model's capability to understand and follow instructions not seen during training remain under-explored. Our investigation begins with a series of synthetic experiments within the theoretical framework of a Turing-complete algorithm called Markov algorithm, which allows fine-grained control over the instruction-tuning data. Generalization and robustness with respect to the training distribution emerge once a diverse enough set of tasks is provided, even though very few examples are provided for each task. We extend these initial results to a real-world application scenario of code generation and find that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation. Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model's ability to follow instructions and perform tasks.
- [204] arXiv:2405.19789 [pdf, ps, html, other]
-
Title: Estimating before Debiasing: A Bayesian Approach to Detaching Prior Bias in Federated Semi-Supervised LearningComments: Accepted by IJCAI 2024Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Semi-Supervised Learning (FSSL) leverages both labeled and unlabeled data on clients to collaboratively train a this http URL FSSL, the heterogeneous data can introduce prediction bias into the model, causing the model's prediction to skew towards some certain classes. Existing FSSL methods primarily tackle this issue by enhancing consistency in model parameters or outputs. However, as the models themselves are biased, merely constraining their consistency is not sufficient to alleviate prediction bias. In this paper, we explore this bias from a Bayesian perspective and demonstrate that it principally originates from label prior bias within the training data. Building upon this insight, we propose a debiasing method for FSSL named FedDB. FedDB utilizes the Average Prediction Probability of Unlabeled Data (APP-U) to approximate the biased prior.During local training, FedDB employs APP-U to refine pseudo-labeling through Bayes' theorem, thereby significantly reducing the label prior bias. Concurrently, during the model aggregation, FedDB uses APP-U from participating clients to formulate unbiased aggregate weights, thereby effectively diminishing bias in the global model. Experimental results show that FedDB can surpass existing FSSL methods. The code is available at this https URL.
- [205] arXiv:2405.19793 [pdf, ps, html, other]
-
Title: PDDLEGO: Iterative Planning in Textual EnvironmentsComments: In *SEM 2024Subjects: Computation and Language (cs.CL)
Planning in textual environments have been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially no sufficient information to plan for the end-goal. We propose PDDLEGO that iteratively construct a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).
- [206] arXiv:2405.19794 [pdf, ps, html, other]
-
Title: Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree CameraComments: CVPR2024 EgoVis WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction. To alleviate these challenges, we introduce a novel visual question answering dataset. Our dataset offers two significant advancements over previous datasets: Firstly, it features videos captured using a 360-degree egocentric wearable camera, enabling observation of the entire surroundings, departing from the static image-centric nature of prior datasets. Secondly, unlike datasets centered on singular challenges, ours addresses multiple real-life obstacles simultaneously through an innovative visual-question answering framework. We validate our dataset using various state-of-the-art VideoQA methods and diverse metrics. Results indicate that while progress has been made, satisfactory performance levels for AI-powered assistive services remain elusive for visually impaired individuals. Additionally, our evaluation highlights the distinctive features of the proposed dataset, featuring ego-motion in videos captured via 360-degree cameras across varied scenarios.
- [207] arXiv:2405.19795 [pdf, ps, html, other]
-
Title: SLM as Guardian: Pioneering AI Safety with Small Language ModelsOhjoon Kwon, Donghyeon Jeon, Nayoung Choi, Gyu-Hwung Cho, Changbong Kim, Hyunwoo Lee, Inho Kang, Sun Kim, Taiwoo ParkSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Most prior safety research of large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of humans. However, internalizing such safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness. To overcome such challenges, a modular approach employing a smaller LLM to detect harmful user queries is regarded as a convenient solution in designing LLM-based system with safety requirements.
In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism fusing the two tasks into a single model. We demonstrate the effectiveness of our approach, providing on par or surpassing harmful query detection and safeguard response performance compared to the publicly available LLMs. - [208] arXiv:2405.19796 [pdf, ps, html, other]
-
Title: Explainable Attribute-Based Speaker VerificationSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the Voxceleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.
- [209] arXiv:2405.19799 [pdf, ps, html, other]
-
Title: Unsupervised Mutual Learning of Dialogue Discourse Parsing and Topic SegmentationSubjects: Computation and Language (cs.CL)
The advancement of large language models (LLMs) has propelled the development of dialogue systems. Unlike the popular ChatGPT-like assistant model, which only satisfies the user's preferences, task-oriented dialogue systems have also faced new requirements and challenges in the broader business field. They are expected to provide correct responses at each dialogue turn, at the same time, achieve the overall goal defined by the task. By understanding rhetorical structures and topic structures via topic segmentation and discourse parsing, a dialogue system may do a better planning to achieve both objectives. However, while both structures belong to discourse structure in linguistics, rhetorical structure and topic structure are mostly modeled separately or with one assisting the other in the prior work. The interaction between these two structures has not been considered for joint modeling and mutual learning. Furthermore, unsupervised learning techniques to achieve the above are not well explored. To fill this gap, we propose an unsupervised mutual learning framework of two structures leveraging the global and local connections between them. We extend the topic modeling between non-adjacent discourse units to ensure global structural relevance with rhetorical structures. We also incorporate rhetorical structures into the topic structure through a graph neural network model to ensure local coherence consistency. Finally, we utilize the similarity between the two fused structures for mutual learning. The experimental results demonstrate that our methods outperform all strong baselines on two dialogue rhetorical datasets (STAC and Molweni), as well as dialogue topic datasets (Doc2Dial and TIAGE).
- [210] arXiv:2405.19802 [pdf, ps, html, other]
-
Title: Exploring the Robustness of Decision-Level Through Adversarial Attacks on LLM-Based Embodied ModelsSubjects: Multimedia (cs.MM)
Embodied intelligence empowers agents with a profound sense of perception, enabling them to respond in a manner closely aligned with real-world situations. Large Language Models (LLMs) delve into language instructions with depth, serving a crucial role in generating plans for intricate tasks. Thus, LLM-based embodied models further enhance the agent's capacity to comprehend and process information. However, this amalgamation also ushers in new challenges in the pursuit of heightened intelligence. Specifically, attackers can manipulate LLMs to produce irrelevant or even malicious outputs by altering their prompts. Confronted with this challenge, we observe a notable absence of multi-modal datasets essential for comprehensively evaluating the robustness of LLM-based embodied models. Consequently, we construct the Embodied Intelligent Robot Attack Dataset (EIRAD), tailored specifically for robustness evaluation. Additionally, two attack strategies are devised, including untargeted attacks and targeted attacks, to effectively simulate a range of diverse attack scenarios. At the same time, during the attack process, to more accurately ascertain whether our method is successful in attacking the LLM-based embodied model, we devise a new attack success evaluation method utilizing the BLIP2 model. Recognizing the time and cost-intensive nature of the GCG algorithm in attacks, we devise a scheme for prompt suffix initialization based on various target tasks, thus expediting the convergence process. Experimental results demonstrate that our method exhibits a superior attack success rate when targeting LLM-based embodied models, indicating a lower level of decision-level robustness in these models.
- [211] arXiv:2405.19804 [pdf, ps, other]
-
Title: Exploring Key Factors for Long-Term Vessel Incident Risk PredictionSubjects: Machine Learning (cs.LG)
Factor analysis acts a pivotal role in enhancing maritime safety. Most previous studies conduct factor analysis within the framework of incident-related label prediction, where the developed models can be categorized into short-term and long-term prediction models. The long-term models offer a more strategic approach, enabling more proactive risk management, compared to the short-term ones. Nevertheless, few studies have devoted to rigorously identifying the key factors for the long-term prediction and undertaking comprehensive factor analysis. Hence, this study aims to delve into the key factors for predicting the incident risk levels in the subsequent year given a specific datestamp. The majority of candidate factors potentially contributing to the incident risk are collected from vessels' historical safety performance data spanning up to five years. An improved embedded feature selection, which integrates Random Forest classifier with a feature filtering process is proposed to identify key risk-contributing factors from the candidate pool. The results demonstrate superior performance of the proposed method in incident prediction and factor interpretability. Comprehensive analysis is conducted upon the key factors, which could help maritime stakeholders formulate management strategies for incident prevenion.
- [212] arXiv:2405.19805 [pdf, ps, html, other]
-
Title: Complexity of Deciding Injectivity and Surjectivity of ReLU Neural NetworksComments: 17 pagesSubjects: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Neural networks with ReLU activation play a key role in modern machine learning. In view of safety-critical applications, the verification of trained networks is of great importance and necessitates a thorough understanding of essential properties of the function computed by a ReLU network, including characteristics like injectivity and surjectivity. Recently, Puthawala et al. [JMLR 2022] came up with a characterization for injectivity of a ReLU layer, which implies an exponential time algorithm. However, the exact computational complexity of deciding injectivity remained open. We answer this question by proving coNP-completeness of deciding injectivity of a ReLU layer. On the positive side, as our main result, we present a parameterized algorithm which yields fixed-parameter tractability of the problem with respect to the input dimension. In addition, we also characterize surjectivity for two-layer ReLU networks with one-dimensional output. Remarkably, the decision problem turns out to be the complement of a basic network verification task. We prove NP-hardness for surjectivity, implying a stronger hardness result than previously known for the network verification problem. Finally, we reveal interesting connections to computational convexity by formulating the surjectivity problem as a zonotope containment problem
- [213] arXiv:2405.19806 [pdf, ps, html, other]
-
Title: Preference Alignment with Flow MatchingSubjects: Machine Learning (cs.LG)
We present Preference Flow Matching (PFM), a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models. Existing PbRL methods require fine-tuning pre-trained models, which presents challenges such as scalability, inefficiency, and the need for model modifications, especially with black-box APIs like GPT-4. In contrast, PFM utilizes flow matching techniques to directly learn from preference data, thereby reducing the dependency on extensive fine-tuning of pre-trained models. By leveraging flow-based models, PFM transforms less preferred data into preferred outcomes, and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation, thus avoiding common issues like overfitting in reward models. We provide theoretical insights that support our method's alignment with standard PbRL objectives. Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference.
- [214] arXiv:2405.19807 [pdf, ps, other]
-
Title: MetaCURL: Non-stationary Concave Utility Reinforcement LearningBianca Marin Moreno (UGA, Thoth, EDF R&D, FiME Lab), Margaux Brégère (LPSM, EDF R&D), Pierre Gaillard (UGA, Thoth), Nadia Oudjane (EDF R&D, FiME Lab)Subjects: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
We explore online learning in episodic loop-free Markov decision processes on non-stationary environments (changing losses and probability transitions). Our focus is on the Concave Utility Reinforcement Learning problem (CURL), an extension of classical RL for handling convex performance criteria in state-action distributions induced by agent policies. While various machine learning problems can be written as CURL, its non-linearity invalidates traditional Bellman equations. Despite recent solutions to classical CURL, none address non-stationary MDPs. This paper introduces MetaCURL, the first CURL algorithm for non-stationary MDPs. It employs a meta-algorithm running multiple black-box algorithms instances over different intervals, aggregating outputs via a sleeping expert framework. The key hurdle is partial information due to MDP uncertainty. Under partial information on the probability transitions (uncertainty and non-stationarity coming only from external noise, independent of agent state-action pairs), we achieve optimal dynamic regret without prior knowledge of MDP changes. Unlike approaches for RL, MetaCURL handles full adversarial losses, not just stochastic ones. We believe our approach for managing non-stationarity with experts can be of interest to the RL community.
- [215] arXiv:2405.19808 [pdf, ps, other]
-
Title: AI with Alien Content and Alien MetasemanticsComments: 20 pages, book chapterJournal-ref: in Ernie Lepore and Luvell Anderson (Eds), The Oxford Handbook of Applied Philosophy of Language, Oxford Handbooks (2024)Subjects: Artificial Intelligence (cs.AI)
AlphaGo plays chess and Go in a creative and novel way. It is natural for us to attribute contents to it, such as that it doesn't view being several pawns behind, if it has more board space, as bad. The framework introduced in Cappelen and Dever (2021) provides a way of thinking about the semantics and the metasemantics of AI content: does AlphaGo entertain contents like this, and if so, in virtue of what does a given state of the program mean that particular content? One salient question Cappelen and Dever didn't consider was the possibility of alien content. Alien content is content that is not or cannot be expressed by human beings. It's highly plausible that AlphaGo, or any other sophisticated AI system, expresses alien contents. That this is so, moreover, is plausibly a metasemantic fact: a fact that has to do with how AI comes to entertain content in the first place, one that will heed the vastly different etiology of AI and human content. This chapter explores the question of alien content in AI from a semantic and metasemantic perspective. It lays out the logical space of possible responses to the semantic and metasemantic questions alien content poses, considers whether and how we humans could communicate with entities who express alien content, and points out that getting clear about such questions might be important for more 'applied' issues in the philosophy of AI, such as existential risk and XAI.
- [216] arXiv:2405.19811 [pdf, ps, html, other]
-
Title: Approximate Global Convergence of Independent Learning in Multi-Agent SystemsSubjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Independent learning (IL), despite being a popular approach in practice to achieve scalability in large-scale multi-agent systems, usually lacks global convergence guarantees. In this paper, we study two representative algorithms, independent $Q$-learning and independent natural actor-critic, within value-based and policy-based frameworks, and provide the first finite-sample analysis for approximate global convergence. The results imply a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ up to an error term that captures the dependence among agents and characterizes the fundamental limit of IL in achieving global convergence. To establish the result, we develop a novel approach for analyzing IL by constructing a separable Markov decision process (MDP) for convergence analysis and then bounding the gap due to model difference between the separable MDP and the original one. Moreover, we conduct numerical experiments using a synthetic MDP and an electric vehicle charging example to verify our theoretical findings and to demonstrate the practical applicability of IL.
- [217] arXiv:2405.19813 [pdf, ps, html, other]
-
Title: SLAM-based Joint Calibration of Multiple Asynchronous Microphone Arrays and Sound Source LocalizationJiang Wang, Yuanzheng He, Daobilige Su, Katsutoshi Itoyama, Kazuhiro Nakadai, Junfeng Wu, Shoudong Huang, Youfu Li, He KongComments: This paper was accepted to and going to appear in the IEEE Transactions on RoboticsSubjects: Robotics (cs.RO)
Robot audition systems with multiple microphone arrays have many applications in practice. However, accurate calibration of multiple microphone arrays remains challenging because there are many unknown parameters to be identified, including the relative transforms (i.e., orientation, translation) and asynchronous factors (i.e., initial time offset and sampling clock difference) between microphone arrays. To tackle these challenges, in this paper, we adopt batch simultaneous localization and mapping (SLAM) for joint calibration of multiple asynchronous microphone arrays and sound source localization. Using the Fisher information matrix (FIM) approach, we first conduct the observability analysis (i.e., parameter identifiability) of the above-mentioned calibration problem and establish necessary/sufficient conditions under which the FIM and the Jacobian matrix have full column rank, which implies the identifiability of the unknown parameters. We also discover several scenarios where the unknown parameters are not uniquely identifiable. Subsequently, we propose an effective framework to initialize the unknown parameters, which is used as the initial guess in batch SLAM for multiple microphone arrays calibration, aiming to further enhance optimization accuracy and convergence. Extensive numerical simulations and real experiments have been conducted to verify the performance of the proposed method. The experiment results show that the proposed pipeline achieves higher accuracy with fast convergence in comparison to methods that use the noise-corrupted ground truth of the unknown parameters as the initial guess in the optimization and other existing frameworks.
- [218] arXiv:2405.19815 [pdf, ps, html, other]
-
Title: Efficient Stimuli Generation using Reinforcement Learning in Design VerificationComments: Accepted for publication at the 20th International Conference on Synthesis, Modeling, Analysis and Simulation Methods, and Applications to Circuit Design (SMACD'24), Jul 2-5 2024, Volos, GreeceSubjects: Artificial Intelligence (cs.AI)
The increasing design complexity of System-on-Chips (SoCs) has led to significant verification challenges, particularly in meeting coverage targets within a timely manner. At present, coverage closure is heavily dependent on constrained random and coverage driven verification methodologies where the randomized stimuli are bounded to verify certain scenarios and to reach coverage goals. This process is said to be exhaustive and to consume a lot of project time. In this paper, a novel methodology is proposed to generate efficient stimuli with the help of Reinforcement Learning (RL) to reach the maximum code coverage of the Design Under Verification (DUV). Additionally, an automated framework is created using metamodeling to generate a SystemVerilog testbench and an RL environment for any given design. The proposed approach is applied to various designs and the produced results proves that the RL agent provides effective stimuli to achieve code coverage faster in comparison with baseline random simulations. Furthermore, various RL agents and reward schemes are analyzed in our work.
- [219] arXiv:2405.19816 [pdf, ps, other]
-
Title: Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them OptimallySubjects: Artificial Intelligence (cs.AI)
Machine learning tasks are generally formulated as optimization problems, where one searches for an optimal function within a certain functional space. In practice, parameterized functional spaces are considered, in order to be able to perform gradient descent. Typically, a neural network architecture is chosen and fixed, and its parameters (connection weights) are optimized, yielding an architecture-dependent result. This way of proceeding however forces the evolution of the function during training to lie within the realm of what is expressible with the chosen architecture, and prevents any optimization across architectures. Costly architectural hyper-parameter optimization is often performed to compensate for this. Instead, we propose to adapt the architecture on the fly during training. We show that the information about desirable architectural changes, due to expressivity bottlenecks when attempting to follow the functional gradient, can be extracted from %the backpropagation. To do this, we propose a mathematical definition of expressivity bottlenecks, which enables us to detect, quantify and solve them while training, by adding suitable neurons when and where needed. Thus, while the standard approach requires large networks, in terms of number of neurons per layer, for expressivity and optimization reasons, we are able to start with very small neural networks and let them grow appropriately. As a proof of concept, we show results~on the CIFAR dataset, matching large neural network accuracy, with competitive training time, while removing the need for standard architectural hyper-parameter search.
- [220] arXiv:2405.19817 [pdf, ps, other]
-
Title: Performance Examination of Symbolic Aggregate Approximation in IoT ApplicationsComments: Embedded World Conference, Nuremberg, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Symbolic Aggregate approXimation (SAX) is a common dimensionality reduction approach for time-series data which has been employed in a variety of domains, including classification and anomaly detection in time-series data. Domains also include shape recognition where the shape outline is converted into time-series data forinstance epoch classification of archived arrowheads. In this paper we propose a dimensionality reduction and shape recognition approach based on the SAX algorithm, an application which requires responses on cost efficient, IoT-like, platforms. The challenge is largely dealing with the computational expense of the SAX algorithm in IoT-like applications, from simple time-series dimension reduction through shape recognition. The approach is based on lowering the dimensional space while capturing and preserving the most representative features of the shape. We present three scenarios of increasing computational complexity backing up our statements with measurement of performance characteristics
- [221] arXiv:2405.19818 [pdf, ps, html, other]
-
Title: WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale BenchmarkComments: GitHub project: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Underwater object tracking (UOT) is a foundational task for identifying and tracing submerged entities in underwater video sequences. However, current UOT datasets suffer from limitations in scale, diversity of target categories and scenarios covered, hindering the training and evaluation of modern tracking algorithms. To bridge this gap, we take the first step and introduce WebUOT-1M, \ie, the largest public UOT benchmark to date, sourced from complex and realistic underwater environments. It comprises 1.1 million frames across 1,500 video clips filtered from 408 target categories, largely surpassing previous UOT datasets, \eg, UVOT400. Through meticulous manual annotation and verification, we provide high-quality bounding boxes for underwater targets. Additionally, WebUOT-1M includes language prompts for video sequences, expanding its application areas, \eg, underwater vision-language tracking. Most existing trackers are tailored for open-air environments, leading to performance degradation when applied to UOT due to domain gaps. Retraining and fine-tuning these trackers are challenging due to sample imbalances and limited real-world underwater datasets. To tackle these challenges, we propose a novel omni-knowledge distillation framework based on WebUOT-1M, incorporating various strategies to guide the learning of the student Transformer. To the best of our knowledge, this framework is the first to effectively transfer open-air domain knowledge to the UOT model through knowledge distillation, as demonstrated by results on both existing UOT datasets and the newly proposed WebUOT-1M. Furthermore, we comprehensively evaluate WebUOT-1M using 30 deep trackers, showcasing its value as a benchmark for UOT research by presenting new challenges and opportunities for future studies. The complete dataset, codes and tracking results, will be made publicly available.
- [222] arXiv:2405.19819 [pdf, ps, html, other]
-
Title: Gated Fields: Learning Scene Reconstruction from Gated VideosSubjects: Computer Vision and Pattern Recognition (cs.CV)
Reconstructing outdoor 3D scenes from temporal observations is a challenge that recent work on neural fields has offered a new avenue for. However, existing methods that recover scene properties, such as geometry, appearance, or radiance, solely from RGB captures often fail when handling poorly-lit or texture-deficient regions. Similarly, recovering scenes with scanning LiDAR sensors is also difficult due to their low angular sampling rate which makes recovering expansive real-world scenes difficult. Tackling these gaps, we introduce Gated Fields - a neural scene reconstruction method that utilizes active gated video sequences. To this end, we propose a neural rendering approach that seamlessly incorporates time-gated capture and illumination. Our method exploits the intrinsic depth cues in the gated videos, achieving precise and dense geometry reconstruction irrespective of ambient illumination conditions. We validate the method across day and night scenarios and find that Gated Fields compares favorably to RGB and LiDAR reconstruction methods. Our code and datasets are available at this https URL.
- [223] arXiv:2405.19822 [pdf, ps, html, other]
-
Title: Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline MethodologyFrank A. Ruis, Alma M. Liezenga, Friso G. Heslinga, Luca Ballan, Thijs A. Eker, Richard J. M. den Hollander, Martin C. van Leeuwen, Judith Dijk, Wyke HuizingaComments: Submitted to and presented at SPIE Defense + Commercial Sensing 2024, 13 pages, 4 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Collecting and annotating real-world data for the development of object detection models is a time-consuming and expensive process. In the military domain in particular, data collection can also be dangerous or infeasible. Training models on synthetic data may provide a solution for cases where access to real-world training data is restricted. However, bridging the reality gap between synthetic and real data remains a challenge. Existing methods usually build on top of baseline Convolutional Neural Network (CNN) models that have been shown to perform well when trained on real data, but have limited ability to perform well when trained on synthetic data. For example, some architectures allow for fine-tuning with the expectation of large quantities of training data and are prone to overfitting on synthetic data. Related work usually ignores various best practices from object detection on real data, e.g. by training on synthetic data from a single environment with relatively little variation. In this paper we propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images. Based on the state of the art, we incorporate data augmentation methods and a Transformer backbone. Besides reaching relatively strong performance without any specialized synthetic data transfer methods, we show that our methods improve the state of the art on synthetic data trained object detection for the RarePlanes and DGTA-VisDrone datasets, and reach near-perfect performance on an in-house vehicle detection dataset.
- [224] arXiv:2405.19823 [pdf, ps, html, other]
-
Title: Joint Selective State Space Model and Detrending for Robust Time Series Anomaly DetectionComments: Submitted to IEEE Signal Processing LettersSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Deep learning-based sequence models are extensively employed in Time Series Anomaly Detection (TSAD) tasks due to their effective sequential modeling capabilities. However, the ability of TSAD is limited by two key challenges: (i) the ability to model long-range dependency and (ii) the generalization issue in the presence of non-stationary data. To tackle these challenges, an anomaly detector that leverages the selective state space model known for its proficiency in capturing long-term dependencies across various domains is proposed. Additionally, a multi-stage detrending mechanism is introduced to mitigate the prominent trend component in non-stationary data to address the generalization issue. Extensive experiments conducted on realworld public datasets demonstrate that the proposed methods surpass all 12 compared baseline methods.
- [225] arXiv:2405.19831 [pdf, ps, html, other]
-
Title: Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten TextComments: 10 pages, 2 figures, 2 tables. Accepted to ARES 2024 (IWAPS)Subjects: Computation and Language (cs.CL)
The study of Differential Privacy (DP) in Natural Language Processing often views the task of text privatization as a $\textit{rewriting}$ task, in which sensitive input texts are rewritten to hide explicit or implicit private information. In order to evaluate the privacy-preserving capabilities of a DP text rewriting mechanism, $\textit{empirical privacy}$ tests are frequently employed. In these tests, an adversary is modeled, who aims to infer sensitive information (e.g., gender) about the author behind a (privatized) text. Looking to improve the empirical protections provided by DP rewriting methods, we propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts, where DP rewritten texts are rewritten $\textit{again}$. Our results shown that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts which score on average better in empirical privacy evaluations. Therefore, our approach raises the bar for DP rewriting methods in their empirical privacy evaluations, providing an extra layer of protection against malicious adversaries.
- [226] arXiv:2405.19832 [pdf, ps, other]
-
Title: AI Safety: A Climb To Armageddon?Comments: 20 page articleSubjects: Artificial Intelligence (cs.AI)
This paper presents an argument that certain AI safety measures, rather than mitigating existential risk, may instead exacerbate it. Under certain key assumptions - the inevitability of AI failure, the expected correlation between an AI system's power at the point of failure and the severity of the resulting harm, and the tendency of safety measures to enable AI systems to become more powerful before failing - safety efforts have negative expected utility. The paper examines three response strategies: Optimism, Mitigation, and Holism. Each faces challenges stemming from intrinsic features of the AI safety landscape that we term Bottlenecking, the Perfection Barrier, and Equilibrium Fluctuation. The surprising robustness of the argument forces a re-examination of core assumptions around AI safety and points to several avenues for further research.
- [227] arXiv:2405.19833 [pdf, ps, html, other]
-
Title: KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree RotationComments: Accepted by CVPR24Subjects: Computer Vision and Pattern Recognition (cs.CV)
2D keypoints are commonly used as an additional cue to refine estimated 3D human meshes. Current methods optimize the pose and shape parameters with a reprojection loss on the provided 2D keypoints. Such an approach, while simple and intuitive, has limited effectiveness because the optimal solution is hard to find in ambiguous parameter space and may sacrifice depth. Additionally, divergent gradients from distal joints complicate and deviate the refinement of proximal joints in the kinematic chain. To address these, we introduce Kinematic-Tree Rotation (KITRO), a novel mesh refinement strategy that explicitly models depth and human kinematic-tree structure. KITRO treats refinement from a bone-wise perspective. Unlike previous methods which perform gradient-based optimizations, our method calculates bone directions in closed form. By accounting for the 2D pose, bone length, and parent joint's depth, the calculation results in two possible directions for each child joint. We then use a decision tree to trace binary choices for all bones along the human skeleton's kinematic-tree to select the most probable hypothesis. Our experiments across various datasets and baseline models demonstrate that KITRO significantly improves 3D joint estimation accuracy and achieves an ideal 2D fit simultaneously. Our code available at: this https URL.
- [228] arXiv:2405.19836 [pdf, ps, html, other]
-
Title: The Merit of River Network Topology for Neural Flood ForecastingComments: this https URLJournal-ref: ICML 2024Subjects: Machine Learning (cs.LG)
Climate change exacerbates riverine floods, which occur with higher frequency and intensity than ever. The much-needed forecasting systems typically rely on accurate river discharge predictions. To this end, the SOTA data-driven approaches treat forecasting at spatially distributed gauge stations as isolated problems, even within the same river network. However, incorporating the known topology of the river network into the prediction model has the potential to leverage the adjacency relationship between gauges. Thus, we model river discharge for a network of gauging stations with GNNs and compare the forecasting performance achieved by different adjacency definitions. Our results show that the model fails to benefit from the river network topology information, both on the entire network and small subgraphs. The learned edge weights correlate with neither of the static definitions and exhibit no regular pattern. Furthermore, the GNNs struggle to predict sudden, narrow discharge spikes. Our work hints at a more general underlying phenomenon of neural prediction not always benefitting from graphical structure and may inspire a systematic study of the conditions under which this happens.
- [229] arXiv:2405.19837 [pdf, ps, other]
-
Title: Lifelong learning challenges in the era of artificial intelligence: a computational thinking perspectiveMargarida Romero (LINE, COMUE UCA, ULaval, Mnemosyne)Journal-ref: IRMBAM, Ipag, Jul 2024, Nice, FranceSubjects: Artificial Intelligence (cs.AI)
The rapid advancement of artificial intelligence (AI) has brought significant challenges to the education and workforce skills required to take advantage of AI for human-AI collaboration in the workplace. As AI continues to reshape industries and job markets, the need to define how AI literacy can be considered in lifelong learning has become increasingly critical (Cetindamar et al., 2022; Laupichler et al., 2022; Romero et al., 2023). Like any new technology, AI is the subject of both hopes and fears, and what it entails today presents major challenges (Cugurullo \& Acheampong, 2023; Villani et al., 2018). It also raises profound questions about our own humanity. Will the machine surpass the intelligence of the humans who designed it? What will be the relationship between so-called AI and our human intelligences? How could human-AI collaboration be regulated in a way that serves the Sustainable Development Goals (SDGs)? This paper provides a review of the challenges of lifelong learning in the era of AI from a computational thinking, critical thinking, and creative competencies perspective, highlighting the implications for management and leadership in organizations.
- [230] arXiv:2405.19842 [pdf, ps, html, other]
-
Title: Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs DistillationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) exhibit enhanced reasoning at larger scales, driving efforts to distill these capabilities into smaller models via teacher-student learning. Previous works simply fine-tune student models on teachers' generated Chain-of-Thoughts (CoTs) data. Although these methods enhance in-domain (IND) reasoning performance, they struggle to generalize to out-of-domain (OOD) tasks. We believe that the widespread spurious correlations between questions and answers may lead the model to preset a specific answer which restricts the diversity and generalizability of its reasoning process. In this paper, we propose Cascading Decomposed CoTs Distillation (CasCoD) to address these issues by decomposing the traditional single-step learning process into two cascaded learning steps. Specifically, by restructuring the training objectives -- removing the answer from outputs and concatenating the question with the rationale as input -- CasCoD's two-step learning process ensures that students focus on learning rationales without interference from the preset answers, thus improving reasoning generalizability. Extensive experiments demonstrate the effectiveness of CasCoD on both IND and OOD benchmark reasoning datasets. Code can be found at this https URL.
- [231] arXiv:2405.19843 [pdf, ps, html, other]
-
Title: How Gold to Make the Golden Snitch: Designing the "Game Changer" in EsportsSubjects: Computer Science and Game Theory (cs.GT)
Many battling games utilize a special item (e.g. Roshan in Defense of the Ancients 2 (DOTA 2), Baron Nashor in League of Legends (LOL), Golden Snitch in Quidditch) as a potential ``Game Changer''. The reward of this item can enable the underdog to make a comeback. However, if the reward is excessively high, the whole game may devolve into a chase for the ``Game Changer''. Our research initiates with a Quidditch case study, a fictional sport in Harry Potter series, wherein we architect the Golden Snitch's reward to maximize the audience's surprise. Surprisingly, we discover that for equally competent teams, the optimal Snitch reward is zero. Moreover, we establish that under most circumstances the optimal score aligns with the game's expected duration multiplied by the teams' strength difference. Finally, we explore the correlation between the ``Game Changer's'' reward and audience surprise in Multiplayer Online Battle Arena (MOBA) games including DOTA 2 and LOL, finding that the optimal reward escalates with increasing team strength inequality.
- [232] arXiv:2405.19844 [pdf, ps, other]
-
Title: The Neumann boundary condition for the two-dimensional Lax-Wendroff scheme. IIAntoine Benoit (LMPA), Jean-François Coulombel (IMT)Subjects: Numerical Analysis (math.NA)
We study the stability of a two-dimensional Lax-Wendroff scheme in a quarter-plane. Following our previous work, we aim here at adapting the energy method in order to study second order extrapolation boundary conditions. We first show on the one-dimensional problem why modifying the energy is a necessity in order to obtain stability estimates. We then study the two-dimensional case and propose a modified energy as well as second order extrapolation boundary and corner conditions in order to maintain second order accuracy and stability of the whole scheme, including near the corner.
- [233] arXiv:2405.19845 [pdf, ps, html, other]
-
Title: Assessing the impact of weather-induced uncertainties in large-scale electricity systemsSubjects: Systems and Control (eess.SY)
The future energy system will largely depend on volatile renewable energy sources and temperature-dependent loads, which makes the weather a central influencing factor. This article presents a novel approach for simulating weather scenarios for robust large-scale power system analysis. By applying different signal analysis methods, historical weather data is decomposed into its spectral components, processed appropriately, and then used to generate random, self-consistent weather data. In this process, any weather parameters of different locations can be considered, while their respective dependencies are mapped. The added value is demonstrated by coupling with a state-of-the-art large-scale energy system model for Europe. It is shown that the integrated consideration of different weather influences allows a quantification of the range of fluctuation of various parameters - such as the feed-in of wind and solar power - and thus provides the basis for future resilient grid planning approaches.
- [234] arXiv:2405.19846 [pdf, ps, html, other]
-
Title: Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language ModelSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose a Query-centric data synthesis method, abbreviated as Quest. Quest is an interpretable method based on the observation that documents retrieved by similar queries are relevant but low-redundant, thus well-suited for synthesizing long-context data. The method is also scalable and capable of constructing large amounts of long-context data. Using Quest, we synthesize a long-context dataset up to 128k context length, significantly outperforming other data synthesis methods on multiple long-context benchmark datasets. In addition, we further verify that the Quest method is predictable through scaling law experiments, making it a reliable solution for advancing long-context models.
- [235] arXiv:2405.19850 [pdf, ps, html, other]
-
Title: Deciphering Human Mobility: Inferring Semantics of Trajectories with Large Language ModelsSubjects: Artificial Intelligence (cs.AI)
Understanding human mobility patterns is essential for various applications, from urban planning to public safety. The individual trajectory such as mobile phone location data, while rich in spatio-temporal information, often lacks semantic detail, limiting its utility for in-depth mobility analysis. Existing methods can infer basic routine activity sequences from this data, lacking depth in understanding complex human behaviors and users' characteristics. Additionally, they struggle with the dependency on hard-to-obtain auxiliary datasets like travel surveys. To address these limitations, this paper defines trajectory semantic inference through three key dimensions: user occupation category, activity sequence, and trajectory description, and proposes the Trajectory Semantic Inference with Large Language Models (TSI-LLM) framework to leverage LLMs infer trajectory semantics comprehensively and deeply. We adopt spatio-temporal attributes enhanced data formatting (STFormat) and design a context-inclusive prompt, enabling LLMs to more effectively interpret and infer the semantics of trajectory data. Experimental validation on real-world trajectory datasets demonstrates the efficacy of TSI-LLM in deciphering complex human mobility patterns. This study explores the potential of LLMs in enhancing the semantic analysis of trajectory data, paving the way for more sophisticated and accessible human mobility research.
- [236] arXiv:2405.19851 [pdf, ps, html, other]
-
Title: Guardians of DNS Integrity: A Remote Method for Identifying DNSSEC Validators Across the InternetSubjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
DNS Security Extensions (DNSSEC) provide the most effective way to fight DNS cache poisoning attacks. Yet, very few DNS resolvers perform DNSSEC validation. Identifying such systems is non-trivial and the existing methods are not suitable for Internet-scale measurements. In this paper, we propose a novel remote technique for identifying DNSSEC-validating resolvers. The proposed method consists of two steps. In the first step, we identify open resolvers by scanning 3.1 billion end hosts and request every non-forwarder to resolve one correct and seven deliberately misconfigured domains. We then build a classifier that discriminates validators from non-validators based on query patterns and DNS response codes. We find that while most open resolvers are DNSSEC-enabled, less than 18% in IPv4 (38% in IPv6) validate received responses. In the second step, we remotely identify closed non-forwarders in networks that do not have inbound Source Address Validation (SAV) in place. Using the classifier built in step one, we identify 37.4% IPv4 (42.9% IPv6) closed DNSSEC validators and cross-validate the results using RIPE Atlas probes. Finally, we show that the discovered (non)-validators actively send requests to DNS root servers, suggesting that we deal with operational recursive resolvers rather than misconfigured machines.
- [237] arXiv:2405.19854 [pdf, ps, html, other]
-
Title: RTGen: Generating Region-Text Pairs for Open-Vocabulary Object DetectionComments: Technical reportSubjects: Computer Vision and Pattern Recognition (cs.CV)
Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which could be learned from massive region-text pairs. However, such data is limited in practice due to significant annotation costs. In this work, we propose RTGen to generate scalable open-vocabulary region-text pairs and demonstrate its capability to boost the performance of open-vocabulary object detection. RTGen includes both text-to-region and region-to-text generation processes on scalable image-caption data. The text-to-region generation is powered by image inpainting, directed by our proposed scene-aware inpainting guider for overall layout harmony. For region-to-text generation, we perform multiple region-level image captioning with various prompts and select the best matching text according to CLIP similarity. To facilitate detection training on region-text pairs, we also introduce a localization-aware region-text contrastive loss that learns object proposals tailored with different localization qualities. Extensive experiments demonstrate that our RTGen can serve as a scalable, semantically rich, and effective source for open-vocabulary object detection and continue to improve the model performance when more data is utilized, delivering superior performance compared to the existing state-of-the-art methods.
- [238] arXiv:2405.19856 [pdf, ps, html, other]
-
Title: DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code RepositoriesJia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin LiComments: Accepted by the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). arXiv admin note: substantial text overlap with arXiv:2404.00599, arXiv:2401.06401Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs.
To address the knowledge gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs' predictions have been released. - [239] arXiv:2405.19861 [pdf, ps, other]
-
Title: Hierarchical Object-Centric Learning with Capsule NetworksComments: Updated version of my PhD thesis (Nov 2023), with fixed typos. Will keep updated as new typos are discovered!Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Capsule networks (CapsNets) were introduced to address convolutional neural networks limitations, learning object-centric representations that are more robust, pose-aware, and interpretable. They organize neurons into groups called capsules, where each capsule encodes the instantiation parameters of an object or one of its parts. Moreover, a routing algorithm connects capsules in different layers, thereby capturing hierarchical part-whole relationships in the data.
This thesis investigates the intriguing aspects of CapsNets and focuses on three key questions to unlock their full potential. First, we explore the effectiveness of the routing algorithm, particularly in small-sized networks. We propose a novel method that anneals the number of routing iterations during training, enhancing performance in architectures with fewer parameters.
Secondly, we investigate methods to extract more effective first-layer capsules, also known as primary capsules. By exploiting pruned backbones, we aim to improve computational efficiency by reducing the number of capsules while achieving high generalization. This approach reduces CapsNets memory requirements and computational effort.
Third, we explore part-relationship learning in CapsNets. Through extensive research, we demonstrate that capsules with low entropy can extract more concise and discriminative part-whole relationships compared to traditional capsule networks, even with reasonable network sizes.
Lastly, we showcase how CapsNets can be utilized in real-world applications, including autonomous localization of unmanned aerial vehicles, quaternion-based rotations prediction in synthetic datasets, and lung nodule segmentation in biomedical imaging.
The findings presented in this thesis contribute to a deeper understanding of CapsNets and highlight their potential to address complex computer vision challenges. - [240] arXiv:2405.19864 [pdf, ps, other]
-
Title: Out-of-distribution Reject Option Method for Dataset Shift Problem in Early Disease Onset PredictionTaisei Tosaki, Eiichiro Uchino, Ryosuke Kojima, Yohei Mineharu, Mikio Arita, Nobuyuki Miyai, Yoshinori Tamada, Tatsuya Mikami, Koichi Murashita, Shigeyuki Nakaji, Yasushi OkunoSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Machine learning is increasingly used to predict lifestyle-related disease onset using health and medical data. However, the prediction effectiveness is hindered by dataset shift, which involves discrepancies in data distribution between the training and testing datasets, misclassifying out-of-distribution (OOD) data. To diminish dataset shift effects, this paper proposes the out-of-distribution reject option for prediction (ODROP), which integrates OOD detection models to preclude OOD data from the prediction phase. We investigated the efficacy of five OOD detection methods (variational autoencoder, neural network ensemble std, neural network ensemble epistemic, neural network energy, and neural network gaussian mixture based energy measurement) across two datasets, the Hirosaki and Wakayama health checkup data, in the context of three disease onset prediction tasks: diabetes, dyslipidemia, and hypertension. To evaluate the ODROP method, we trained disease onset prediction models and OOD detection models on Hirosaki data and used AUROC-rejection curve plots from Wakayama data. The variational autoencoder method showed superior stability and magnitude of improvement in Area Under the Receiver Operating Curve (AUROC) in five cases: AUROC in the Wakayama data was improved from 0.80 to 0.90 at a 31.1% rejection rate for diabetes onset and from 0.70 to 0.76 at a 34% rejection rate for dyslipidemia. We categorized dataset shifts into two types using SHAP clustering - those that considerably affect predictions and those that do not. We expect that this classification will help standardize measuring instruments. This study is the first to apply OOD detection to actual health and medical data, demonstrating its potential to substantially improve the accuracy and reliability of disease prediction models amidst dataset shift.
- [241] arXiv:2405.19869 [pdf, ps, html, other]
-
Title: Semantic Landmark Detection & Classification Using Neural Networks For 3D In-Air SonarSubjects: Robotics (cs.RO)
In challenging environments where traditional sensing modalities struggle, in-air sonar offers resilience to optical interference. Placing a priori known landmarks in these environments can eliminate accumulated errors in autonomous mobile systems such as Simultaneous Localization and Mapping (SLAM) and autonomous navigation. We present a novel approach using a convolutional neural network to detect and classify ten different reflector landmarks with varying radii using in-air 3D sonar. Additionally, the network predicts the orientation angle of the detected landmarks. The neural network is trained on cochleograms, representing echoes received by the sensor in a time-frequency domain. Experimental results in cluttered indoor settings show promising performance. The CNN achieves a 97.3% classification accuracy on the test dataset, accurately detecting both the presence and absence of landmarks. Moreover, the network predicts landmark orientation angles with an RMSE lower than 10 degrees, enhancing the utility in SLAM and autonomous navigation applications. This advancement improves the robustness and accuracy of autonomous systems in challenging environments.
- [242] arXiv:2405.19870 [pdf, ps, html, other]
-
Title: On Vessel Location Forecasting and the Effect of Federated LearningJournal-ref: 2024 IEEE International Conference on Mobile Data Management (MDM), June 24 - June 27, 2024, Brussels, BelgiumSubjects: Machine Learning (cs.LG)
The wide spread of Automatic Identification System (AIS) has motivated several maritime analytics operations. Vessel Location Forecasting (VLF) is one of the most critical operations for maritime awareness. However, accurate VLF is a challenging problem due to the complexity and dynamic nature of maritime traffic conditions. Furthermore, as privacy concerns and restrictions have grown, training data has become increasingly fragmented, resulting in dispersed databases of several isolated data silos among different organizations, which in turn decreases the quality of learning models. In this paper, we propose an efficient VLF solution based on LSTM neural networks, in two variants, namely Nautilus and FedNautilus for the centralized and the federated learning approach, respectively. We also demonstrate the superiority of the centralized approach with respect to current state of the art and discuss the advantages and disadvantages of the federated against the centralized approach.
- [243] arXiv:2405.19871 [pdf, ps, html, other]
-
Title: Don't Get Hijacked: Prevalence, Mitigation, and Impact of Non-Secure DNS Dynamic UpdatesSubjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
DNS dynamic updates represent an inherently vulnerable mechanism deliberately granting the potential for any host to dynamically modify DNS zone files. Consequently, this feature exposes domains to various security risks such as domain hijacking, compromise of domain control validation, and man-in-the-middle attacks. Originally devised without the implementation of authentication mechanisms, non-secure DNS updates were widely adopted in DNS software, subsequently leaving domains susceptible to a novel form of attack termed zone poisoning. In order to gauge the extent of this issue, our analysis encompassed over 353 million domain names, revealing the presence of 381,965 domains that openly accepted unsolicited DNS updates. We then undertook a comprehensive three-phase campaign involving the notification of Computer Security Incident Response Teams (CSIRTs). Following extensive discussions spanning six months, we observed substantial remediation, with nearly 54\% of nameservers and 98% of vulnerable domains addressing the issue. This outcome serves as evidence that engaging with CSIRTs can prove to be an effective approach for reporting security vulnerabilities. Moreover, our notifications had a lasting impact, as evidenced by the sustained low prevalence of vulnerable domains.
- [244] arXiv:2405.19872 [pdf, ps, html, other]
-
Title: Detection of the papermilling behaviorComments: 14 pages, 6 figuresSubjects: Digital Libraries (cs.DL)
Based on the analysis of the data obtainable from the Web of Science publication and citation database, typical signs of possible papermilling behavior are described, quantified, and illustrated by examples. A MATLAB function is provided for the analysis of the outputs from the Web of Science. A new quantitative indicator -- integrity index, or I-index -- is proposed for using it along with standard bibliographic and scientometric indicators.
- [245] arXiv:2405.19874 [pdf, ps, html, other]
-
Title: Is In-Context Learning Sufficient for Instruction Following in LLMs?Comments: Preprint. Code at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In-context learning (ICL) allows LLMs to learn from examples without changing their weights, which is a particularly promising capability for long-context LLMs that can potentially learn from many examples. Recently, Lin et al. (2024) proposed URIAL, a method using only three in-context examples to align base LLMs, achieving non-trivial instruction following performance. In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on established benchmarks such as MT-Bench and AlpacaEval 2.0 (LC), especially with more capable base LMs. Unlike for tasks such as classification, translation, or summarization, adding more ICL demonstrations for long-context LLMs does not systematically improve instruction following performance. To address this limitation, we derive a greedy selection approach for ICL examples that noticeably improves performance, yet without bridging the gap to instruction fine-tuning. Finally, we provide a series of ablation studies to better understand the reasons behind the remaining gap, and we show how some aspects of ICL depart from the existing knowledge and are specific to the instruction tuning setting. Overall, our work advances the understanding of ICL as an alignment technique. We provide our code at this https URL.
- [246] arXiv:2405.19876 [pdf, ps, html, other]
-
Title: IReNe: Instant Recoloring in Neural Radiance FieldsAlessio Mazzucchelli, Adrian Garcia-Garcia, Elena Garces, Fernando Rivas-Manzaneque, Francesc Moreno-Noguer, Adrian Penate-SanchezSubjects: Computer Vision and Pattern Recognition (cs.CV)
Advances in NERFs have allowed for 3D scene reconstructions and novel view synthesis. Yet, efficiently editing these representations while retaining photorealism is an emerging challenge. Recent methods face three primary limitations: they're slow for interactive use, lack precision at object boundaries, and struggle to ensure multi-view consistency. We introduce IReNe to address these limitations, enabling swift, near real-time color editing in NeRF. Leveraging a pre-trained NeRF model and a single training image with user-applied color edits, IReNe swiftly adjusts network parameters in seconds. This adjustment allows the model to generate new scene views, accurately representing the color changes from the training image while also controlling object boundaries and view-specific effects. Object boundary control is achieved by integrating a trainable segmentation module into the model. The process gains efficiency by retraining only the weights of the last network layer. We observed that neurons in this layer can be classified into those responsible for view-dependent appearance and those contributing to diffuse appearance. We introduce an automated classification approach to identify these neuron types and exclusively fine-tune the weights of the diffuse neurons. This further accelerates training and ensures consistent color edits across different views. A thorough validation on a new dataset, with edited object colors, shows significant quantitative and qualitative advancements over competitors, accelerating speeds by 5x to 500x.
- [247] arXiv:2405.19877 [pdf, ps, html, other]
-
Title: KNOW: A Real-World Ontology for Knowledge Capture with Large Language ModelsComments: 5 pages, 1 figureSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
We present KNOW--the Knowledge Navigator Ontology for the World--the first ontology designed to capture everyday knowledge to augment large language models (LLMs) in real-world generative AI use cases such as personal AI assistants. Our domain is human life, both its everyday concerns and its major milestones. We have limited the initial scope of the modeled concepts to only established human universals: spacetime (places, events) plus social (people, groups, organizations). The inclusion criteria for modeled concepts are pragmatic, beginning with universality and utility. We compare and contrast previous work such as this http URL and Cyc--as well as attempts at a synthesis of knowledge graphs and language models--noting how LLMs already encode internally much of the commonsense tacit knowledge that took decades to capture in the Cyc project. We also make available code-generated software libraries for the 12 most popular programming languages, enabling the direct use of ontology concepts in software engineering. We emphasize simplicity and developer experience in promoting AI interoperability.
- [248] arXiv:2405.19878 [pdf, ps, html, other]
-
Title: Learning from Random Demonstrations: Offline Reinforcement Learning with Importance-Sampled Diffusion ModelsSubjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Generative models such as diffusion have been employed as world models in offline reinforcement learning to generate synthetic data for more effective learning. Existing work either generates diffusion models one-time prior to training or requires additional interaction data to update it. In this paper, we propose a novel approach for offline reinforcement learning with closed-loop policy evaluation and world-model adaptation. It iteratively leverages a guided diffusion world model to directly evaluate the offline target policy with actions drawn from it, and then performs an importance-sampled world model update to adaptively align the world model with the updated policy. We analyzed the performance of the proposed method and provided an upper bound on the return gap between our method and the real environment under an optimal policy. The result sheds light on various factors affecting learning performance. Evaluations in the D4RL environment show significant improvement over state-of-the-art baselines, especially when only random or medium-expertise demonstrations are available -- thus requiring improved alignment between the world model and offline policy evaluation.
- [249] arXiv:2405.19882 [pdf, ps, html, other]
-
Title: PixOOD: Pixel-Level Out-of-Distribution DetectionComments: under review at ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a dense image prediction out-of-distribution detection algorithm, called PixOOD, which does not require training on samples of anomalous data and is not designed for a specific application which avoids traditional training biases. In order to model the complex intra-class variability of the in-distribution data at the pixel level, we propose an online data condensation algorithm which is more robust than standard K-means and is easily trainable through SGD. We evaluate PixOOD on a wide range of problems. It achieved state-of-the-art results on four out of seven datasets, while being competitive on the rest. The source code is available at this https URL.
- [250] arXiv:2405.19883 [pdf, ps, html, other]
-
Title: From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous SystemsComments: Accepted by ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
In this work, from a theoretical lens, we aim to understand why large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. To this end, consider a hierarchical reinforcement learning (RL) model where the LLM Planner and the Actor perform high-level task planning and low-level execution, respectively. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. Under proper assumptions on the pretraining data, we prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning. Additionally, we highlight the necessity for exploration beyond the subgoals derived from BAIL by proving that naively executing the subgoals returned by LLM leads to a linear regret. As a remedy, we introduce an $\epsilon$-greedy exploration strategy to BAIL, which is proven to incur sublinear regret when the pretraining error is small. Finally, we extend our theoretical framework to include scenarios where the LLM Planner serves as a world model for inferring the transition model of the environment and to multi-agent settings, enabling coordination among multiple Actors.
- [251] arXiv:2405.19885 [pdf, ps, html, other]
-
Title: Fourier Controller Networks for Real-Time Decision-Making in Embodied LearningSubjects: Machine Learning (cs.LG); Robotics (cs.RO)
Reinforcement learning is able to obtain generalized low-level robot policies on diverse robotics datasets in embodied learning scenarios, and Transformer has been widely used to model time-varying features. However, it still suffers from the issues of low data efficiency and high inference latency. In this paper, we propose to investigate the task from a new perspective of the frequency domain. We first observe that the energy density in the frequency domain of a robot's trajectory is mainly concentrated in the low-frequency part. Then, we present the Fourier Controller Network (FCNet), a new network that utilizes the Short-Time Fourier Transform (STFT) to extract and encode time-varying features through frequency domain interpolation. We further achieve parallel training and efficient recurrent inference by using FFT and Sliding DFT methods in the model architecture for real-time decision-making. Comprehensive analyses in both simulated (e.g., D4RL) and real-world environments (e.g., robot locomotion) demonstrate FCNet's substantial efficiency and effectiveness over existing methods such as Transformer, e.g., FCNet outperforms Transformer on multi-environmental robotics datasets of all types of sizes (from 1.9M to 120M). The project page and code can be found this https URL.
- [252] arXiv:2405.19886 [pdf, ps, html, other]
-
Title: Federated Learning with Multi-resolution Model BroadcastSubjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
In federated learning, a server must periodically broadcast a model to the agents. We propose to use multi-resolution coding and modulation (also known as non-uniform modulation) for this purpose. In the simplest instance, broadcast transmission is used, whereby all agents are targeted with one and the same transmission (typically without any particular favored beam direction), which is coded using multi-resolution coding/modulation. This enables high-SNR agents, with high path gains to the server, to receive a more accurate model than the low-SNR agents do, without consuming more downlink resources. As one implementation, we use transmission with a non-uniform 8-PSK constellation, where a high-SNR receiver (agent) can separate all 8 constellation points (hence receive 3 bits) whereas a low-SNR receiver can only separate 4 points (hence receive 2 bits). By encoding the least significant information in the third bit, the high-SNR receivers can obtain the model with higher accuracy, while the low-SNR receiver can still obtain the model although with reduced accuracy, thereby facilitating at least some basic participation of the low-SNR receiver. We show the effectiveness of our proposed scheme via experimentation using federated learning with the MNIST data-set.
- [253] arXiv:2405.19888 [pdf, ps, html, other]
-
Title: Parrot: Efficient Serving of LLM-based Applications with Semantic VariableComments: To appear on USENIX OSDI 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications.
This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications. - [254] arXiv:2405.19893 [pdf, ps, html, other]
-
Title: Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered ThoughtsChunjing Gan, Dan Yang, Binbin Hu, Hanxiao Zhang, Siyuan Li, Ziqi Liu, Yue Shen, Lin Ju, Zhiqiang Zhang, Jinjie Gu, Lei Liang, Jun ZhouComments: 12 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
In recent years, large language models (LLMs) have made remarkable achievements in various domains. However, the untimeliness and cost of knowledge updates coupled with hallucination issues of LLMs have curtailed their applications in knowledge intensive tasks, where retrieval augmented generation (RAG) can be of help. Nevertheless, existing retrieval augmented models typically use similarity as a bridge between queries and documents and follow a retrieve then read procedure. In this work, we argue that similarity is not always the panacea and totally relying on similarity would sometimes degrade the performance of retrieval augmented generation. To this end, we propose MetRag, a Multi layEred Thoughts enhanced Retrieval Augmented Generation framework. To begin with, beyond existing similarity oriented thought, we embrace a small scale utility model that draws supervision from an LLM for utility oriented thought and further come up with a smarter model by comprehensively combining the similarity and utility oriented thoughts. Furthermore, given the fact that the retrieved document set tends to be huge and using them in isolation makes it difficult to capture the commonalities and characteristics among them, we propose to make an LLM as a task adaptive summarizer to endow retrieval augmented generation with compactness-oriented thought. Finally, with multi layered thoughts from the precedent stages, an LLM is called for knowledge augmented generation. Extensive experiments on knowledge-intensive tasks have demonstrated the superiority of MetRag.
- [255] arXiv:2405.19895 [pdf, ps, html, other]
-
Title: Dispersion of personal spacesComments: This is the preprint of the paper presented at the 6th International Conference on the Dynamics of Information Systems (DIS 2023), September 3-6, 2023, Prague, Czech Republic. The paper was accepted for publication in conference proceedings in Lecture Notes in Computer ScienceSubjects: Multiagent Systems (cs.MA)
There are many entities that disseminate in the physical space - information, gossip, mood, innovation etc. Personal spaces are also entities that disperse and interplay. In this work we study the emergence of configurations formed by participants when choosing a place to sit in a rectangular auditorium. Based on experimental questionnaire data we design several models and assess their relevancy to a real time-lapse footage of lecture hall being filled up. The main focus is to compare the evolution of entropy of occupied seat configurations in time. Even though the process of choosing a seat is complex and could depend on various properties of participants or environment, some of the developed models can capture at least basic essence of the real processes. After introducing the problem of seat selection and related results in close research areas, we introduce preliminary collected data and build models of seat selection based on them. We compare the resulting models to the real observational data and discuss areas of future research directions.
- [256] arXiv:2405.19896 [pdf, ps, other]
-
Title: Fast Numerical Approximation of Linear, Second-Order Hyperbolic Problems Using Model Order Reduction and the Laplace TransformComments: arXiv admin note: text overlap with arXiv:2403.02847Subjects: Numerical Analysis (math.NA)
We extend our previous work [F. Henríquez and J. S. Hesthaven, arXiv:2403.02847 (2024)] to the linear, second-order wave equation in bounded domains. This technique, referred to as the Laplace Transform Reduced Basis (LT-RB) method, uses two widely known mathematical tools to construct a fast and efficient method for the solution of linear, time-dependent problems: The Laplace transform and the Reduced Basis method, hence the name.
The application of the Laplace transform yields a time-independent problem parametrically depending on the Laplace variable. Following the two-phase paradigm of the RB method, firstly in an offline stage we sample the Laplace parameter, compute the full-order or high-fidelity solution, and then resort to a Proper Orthogonal Decomposition (POD) to extract a basis of reduced dimension. Then, in an online phase, we project the time-dependent problem onto this basis and proceed to solve the evolution problem using any suitable time-stepping method. We prove exponential convergence of the reduced solution computed by the LT-RB method toward the high-fidelity one as the dimension of the reduced space increases.
Finally, we present a set of numerical experiments portraying the performance of the method in terms of accuracy and, in particular, speed-up when compared to the full-order model. - [257] arXiv:2405.19899 [pdf, ps, html, other]
-
Title: Open-Set Domain Adaptation for Semantic SegmentationComments: 14 pages, 5 figures, 13 tables, CVPR 2024 PosterSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer the pixel-wise knowledge from the labeled source domain to the unlabeled target domain. However, current UDA methods typically assume a shared label space between source and target, limiting their applicability in real-world scenarios where novel categories may emerge in the target domain. In this paper, we introduce Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) for the first time, where the target domain includes unknown classes. We identify two major problems in the OSDA-SS scenario as follows: 1) the existing UDA methods struggle to predict the exact boundary of the unknown classes, and 2) they fail to accurately predict the shape of the unknown classes. To address these issues, we propose Boundary and Unknown Shape-Aware open-set domain adaptation, coined BUS. Our BUS can accurately discern the boundaries between known and unknown classes in a contrastive manner using a novel dilation-erosion-based contrastive loss. In addition, we propose OpenReMix, a new domain mixing augmentation method that guides our model to effectively learn domain and size-invariant features for improving the shape detection of the known and unknown classes. Through extensive experiments, we demonstrate that our proposed BUS effectively detects unknown classes in the challenging OSDA-SS scenario compared to the previous methods by a large margin. The code is available at this https URL.
- [258] arXiv:2405.19901 [pdf, ps, html, other]
-
Title: Urban Air Pollution Forecasting: a Machine Learning Approach leveraging Satellite Observations and Meteorological ForecastsComments: 5 pages, 2 figures, submitted to IEEE MetroLivEnv 2024Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Air pollution poses a significant threat to public health and well-being, particularly in urban areas. This study introduces a series of machine-learning models that integrate data from the Sentinel-5P satellite, meteorological conditions, and topological characteristics to forecast future levels of five major pollutants. The investigation delineates the process of data collection, detailing the combination of diverse data sources utilized in the study. Through experiments conducted in the Milan metropolitan area, the models demonstrate their efficacy in predicting pollutant levels for the forthcoming day, achieving a percentage error of around 30%. The proposed models are advantageous as they are independent of monitoring stations, facilitating their use in areas without existing infrastructure. Additionally, we have released the collected dataset to the public, aiming to stimulate further research in this field. This research contributes to advancing our understanding of urban air quality dynamics and emphasizes the importance of amalgamating satellite, meteorological, and topographical data to develop robust pollution forecasting models.
- [259] arXiv:2405.19902 [pdf, ps, html, other]
-
Title: Learning Discriminative Dynamics with Label Corruption for Noisy Label DetectionComments: Accepted to CVPR 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Label noise, commonly found in real-world datasets, has a detrimental impact on a model's generalization. To effectively detect incorrectly labeled instances, previous works have mostly relied on distinguishable training signals, such as training loss, as indicators to differentiate between clean and noisy labels. However, they have limitations in that the training signals incompletely reveal the model's behavior and are not effectively generalized to various noise types, resulting in limited detection accuracy. In this paper, we propose DynaCor framework that distinguishes incorrectly labeled instances from correctly labeled ones based on the dynamics of the training signals. To cope with the absence of supervision for clean and noisy labels, DynaCor first introduces a label corruption strategy that augments the original dataset with intentionally corrupted labels, enabling indirect simulation of the model's behavior on noisy labels. Then, DynaCor learns to identify clean and noisy instances by inducing two clearly distinguishable clusters from the latent representations of training dynamics. Our comprehensive experiments show that DynaCor outperforms the state-of-the-art competitors and shows strong robustness to various noise types and noise rates.
- [260] arXiv:2405.19909 [pdf, ps, html, other]
-
Title: Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement LearningComments: ICML 2024, 19 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
In offline reinforcement learning, the challenge of out-of-distribution (OOD) is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods often suffer from the issue of unnecessary conservativeness, hampering policy improvement. This occurs due to the indiscriminate use of all actions from the behavior policy that generates the offline dataset as constraints. The problem becomes particularly noticeable when the quality of the dataset is suboptimal. Thus, we propose Adaptive Advantage-guided Policy Regularization (A2PR), obtaining high-advantage actions from an augmented behavior policy combined with VAE to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset, while still effectively maintaining conservatism from OOD actions. This is achieved by harnessing the VAE capacity to generate samples matching the distribution of the data points. We theoretically prove that the improvement of the behavior policy is guaranteed. Besides, it effectively mitigates value overestimation with a bounded performance gap. Empirically, we conduct a series of experiments on the D4RL benchmark, where A2PR demonstrates state-of-the-art performance. Furthermore, experimental results on additional suboptimal mixed datasets reveal that A2PR exhibits superior performance. Code is available at this https URL.
- [261] arXiv:2405.19911 [pdf, ps, html, other]
-
Title: Full weight spectrum one-orbit cyclic subspace codesSubjects: Information Theory (cs.IT); Combinatorics (math.CO)
For a linear Hamming metric code of length n over a finite field, the number of distinct weights of its codewords is at most n. The codes achieving the equality in the above bound were called full weight spectrum codes. In this paper we will focus on the analogous class of codes within the framework of cyclic subspace codes. Cyclic subspace codes have garnered significant attention, particularly for their applications in random network coding to correct errors and erasures. We investigate one-orbit cyclic subspace codes that are full weight spectrum in this context. Utilizing number theoretical results and combinatorial arguments, we provide a complete classification of full weight spectrum one-orbit cyclic subspace codes.
- [262] arXiv:2405.19914 [pdf, ps, html, other]
-
Title: Towards RGB-NIR Cross-modality Image Registration and BeyondComments: 18 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper focuses on the area of RGB(visible)-NIR(near-infrared) cross-modality image registration, which is crucial for many downstream vision tasks to fully leverage the complementary information present in visible and infrared images. In this field, researchers face two primary challenges - the absence of a correctly-annotated benchmark with viewpoint variations for evaluating RGB-NIR cross-modality registration methods and the problem of inconsistent local features caused by the appearance discrepancy between RGB-NIR cross-modality images. To address these challenges, we first present the RGB-NIR Image Registration (RGB-NIR-IRegis) benchmark, which, for the first time, enables fair and comprehensive evaluations for the task of RGB-NIR cross-modality image registration. Evaluations of previous methods highlight the significant challenges posed by our RGB-NIR-IRegis benchmark, especially on RGB-NIR image pairs with viewpoint variations. To analyze the causes of the unsatisfying performance, we then design several metrics to reveal the toxic impact of inconsistent local features between visible and infrared images on the model performance. This further motivates us to develop a baseline method named Semantic Guidance Transformer (SGFormer), which utilizes high-level semantic guidance to mitigate the negative impact of local inconsistent features. Despite the simplicity of our motivation, extensive experimental results show the effectiveness of our method.
- [263] arXiv:2405.19915 [pdf, ps, html, other]
-
Title: P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision TransformerSubjects: Artificial Intelligence (cs.AI)
Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield non-negligible re-quantization overhead, limiting ViTs' hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose \emph{P$^2$-ViT}, the first \underline{P}ower-of-Two (PoT) \underline{p}ost-training quantization and acceleration framework to accelerate fully quantized ViTs. Specifically, {as for quantization,} we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the re-quantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency trade-offs. {In terms of hardware,} we develop {a dedicated chunk-based accelerator} featuring multiple tailored sub-processors to individually handle ViTs' different types of operations, alleviating reconfigurable overhead. Additionally, we design {a tailored row-stationary dataflow} to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P$^2$-ViT's effectiveness. {Particularly, we offer comparable or even superior quantization performance with PoT scaling factors when compared to the counterpart with floating-point scaling factors. Besides, we achieve up to $\mathbf{10.1\times}$ speedup and $\mathbf{36.8\times}$ energy saving over GPU's Turing Tensor Cores, and up to $\mathbf{1.84\times}$ higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Codes are available at \url{this https URL}.
- [264] arXiv:2405.19917 [pdf, ps, html, other]
-
Title: Multimodal Cross-Domain Few-Shot Learning for Egocentric Action RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV)
We address a novel cross-domain few-shot learning task (CD-FSL) with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (\eg, daily life vs. industrial domain) and (2) the computational cost for real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and improve inference speed. To address the first challenge, we propose the incorporation of multimodal distillation into the student RGB model using teacher models. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model's adaptability to the target domain. We further introduce ensemble masked inference, a technique that reduces the number of input tokens through masking. In this approach, ensemble prediction mitigates the performance degradation caused by masking, effectively addressing the second issue. Our approach outperformed the state-of-the-art CD-FSL approaches with a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points for 1-shot/5-shot settings while achieving $2.2$ times faster inference speed. Project page: this https URL
- [265] arXiv:2405.19919 [pdf, ps, html, other]
-
Title: Unraveling the Impact of Heterophilic Structures on Graph Positive-Unlabeled LearningComments: ICML 2024Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
While Positive-Unlabeled (PU) learning is vital in many real-world scenarios, its application to graph data still remains under-explored. We unveil that a critical challenge for PU learning on graph lies on the edge heterophily, which directly violates the irreducibility assumption for Class-Prior Estimation (class prior is essential for building PU learning algorithms) and degenerates the latent label inference on unlabeled nodes during classifier training. In response to this challenge, we introduce a new method, named Graph PU Learning with Label Propagation Loss (GPL). Specifically, GPL considers learning from PU nodes along with an intermediate heterophily reduction, which helps mitigate the negative impact of the heterophilic structure. We formulate this procedure as a bilevel optimization that reduces heterophily in the inner loop and efficiently learns a classifier in the outer loop. Extensive experiments across a variety of datasets have shown that GPL significantly outperforms baseline methods, confirming its effectiveness and superiority.
- [266] arXiv:2405.19921 [pdf, ps, html, other]
-
Title: MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and MotionSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Autonomous systems, such as self-driving cars, rely on reliable semantic environment perception for decision making. Despite great advances in video semantic segmentation, existing approaches ignore important inductive biases and lack structured and interpretable internal representations. In this work, we propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and ego-motion of the camera, while also estimating the motion of external objects. Our model leverages these representations to improve the temporal consistency of semantic segmentation without sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion approach in which scene geometry and camera motion are first used to compensate for ego-motion, then residual flow is used to compensate motion of dynamic objects, and finally the predicted scene features are fused with the current features to obtain a temporally consistent scene segmentation. Our model parses automotive scenes into multiple decoupled interpretable representations such as scene geometry, ego-motion, and object motion. Quantitative evaluation shows that MCDS-VSS achieves superior temporal consistency on video sequences while retaining competitive segmentation performance.
- [267] arXiv:2405.19928 [pdf, ps, html, other]
-
Title: BAN: Detecting Backdoors Activated by Adversarial Neuron NoiseSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient with 9.99% higher detect success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available.\url{https://anonymous.4open.science/r/ban-4B32}
- [268] arXiv:2405.19931 [pdf, ps, html, other]
-
Title: Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural NetworksComments: Preprint. Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement, significantly reducing training costs and enabling personalized AI applications. However, we explore the training dynamics of DMs and observe an unanticipated phenomenon: during the training process, image fidelity initially improves, then unexpectedly deteriorates with the emergence of noisy patterns, only to recover later with severe overfitting. We term the stage with generated noisy patterns as corruption stage. To understand this corruption stage, we begin by theoretically modeling the one-shot fine-tuning scenario, and then extend this modeling to more general cases. Through this modeling, we identify the primary cause of this corruption stage: a narrowed learning distribution inherent in the nature of few-shot fine-tuning. To tackle this, we apply Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly broaden the learned distribution, and present that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss and a further regularization with the pretrained DMs. This approach is highly compatible with current few-shot fine-tuning methods in DMs and does not introduce any extra inference costs. Experimental results demonstrate that our method significantly mitigates corruption, and improves the fidelity, quality and diversity of the generated images in both object-driven and subject-driven generation tasks.
- [269] arXiv:2405.19933 [pdf, ps, html, other]
-
Title: Learning Latent Graph Structures and their UncertaintySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Within a prediction task, Graph Neural Networks (GNNs) use relational information as an inductive bias to enhance the model's accuracy. As task-relevant relations might be unknown, graph structure learning approaches have been proposed to learn them while solving the downstream prediction task. In this paper, we demonstrate that minimization of a point-prediction loss function, e.g., the mean absolute error, does not guarantee proper learning of the latent relational information and its associated uncertainty. Conversely, we prove that a suitable loss function on the stochastic model outputs simultaneously grants (i) the unknown adjacency matrix latent distribution and (ii) optimal performance on the prediction task. Finally, we propose a sampling-based method that solves this joint learning task. Empirical results validate our theoretical claims and demonstrate the effectiveness of the proposed approach.
- [270] arXiv:2405.19934 [pdf, ps, html, other]
-
Title: Estimating Population Burden of Stroke with an Agent-Based ModelComments: 18th Social Simulation ConferenceSubjects: Computers and Society (cs.CY)
Stroke is one of the leading causes of death and disability worldwide but it is believed to be highly preventable. The majority of stroke prevention focuses on targeting high-risk individuals but its is important to understand how the targeting of high-risk individuals might impact the overall societal burden of stroke. We propose using an agent-based model that follows agents through their pre-stroke and stroke journey to assess the impacts of different interventions at the population level. We present a case study looking at the impacts of agents being informed of their stroke risk at certain ages and those agents taking measure to reduce their risk. The results of our study show that if agents are aware of their risk and act accordingly we see a significant reduction in strokes and population DALYs. The case study highlights the importance of individuals understanding their own stroke risk for stroke prevention and the usefulness of agent-based models in assessing the impact of stroke interventions.
- [271] arXiv:2405.19941 [pdf, ps, html, other]
-
Title: Synthetic Patients: Simulating Difficult Conversations with Multimodal Generative AI for Medical EducationSubjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Problem: Effective patient-centered communication is a core competency for physicians. However, both seasoned providers and medical trainees report decreased confidence in leading conversations on sensitive topics such as goals of care or end-of-life discussions. The significant administrative burden and the resources required to provide dedicated training in leading difficult conversations has been a long-standing problem in medical education.
Approach: In this work, we present a novel educational tool designed to facilitate interactive, real-time simulations of difficult conversations in a video-based format through the use of multimodal generative artificial intelligence (AI). Leveraging recent advances in language modeling, computer vision, and generative audio, this tool creates realistic, interactive scenarios with avatars, or "synthetic patients." These synthetic patients interact with users throughout various stages of medical care using a custom-built video chat application, offering learners the chance to practice conversations with patients from diverse belief systems, personalities, and ethnic backgrounds.
Outcomes: While the development of this platform demanded substantial upfront investment in labor, it offers a highly-realistic simulation experience with minimal financial investment. For medical trainees, this educational tool can be implemented within programs to simulate patient-provider conversations and can be incorporated into existing palliative care curriculum to provide a scalable, high-fidelity simulation environment for mastering difficult conversations.
Next Steps: Future developments will explore enhancing the authenticity of these encounters by working with patients to incorporate their histories and personalities, as well as employing the use of AI-generated evaluations to offer immediate, constructive feedback to learners post-simulation. - [272] arXiv:2405.19943 [pdf, ps, html, other]
-
Title: Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution WeightingComments: AAAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information under large scenes. Besides, a large synthetic dataset is adopted to enhance the model's generalization ability and enable more practical evaluation and comparison. The model's performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. See code here: https://vcc.tech/research/2024/MVD.
- [273] arXiv:2405.19944 [pdf, ps, other]
-
Title: Discrete-Time I&I Adaptive Interconnection and Damping Passivity-Based Control for Nonlinearly Parameterized Port-Controlled Hamiltonian SystemsComments: 31 pages, 9 figuresSubjects: Systems and Control (eess.SY)
In this paper, a discrete-time I&I-based adaptive IDA-PBC controller for uncertain nonlinearly parameterized port-controlled Hamiltonian systems (PCH), where the parameter uncertainties are assumed in the energy function, is constructed. A proper formulation for the uncertain system dynamics is established where the uncertainties appear in nonlinearly parameterized form in the gradient of the Hamiltonian function. The adaptive IDA-PBC controller is constructed considering this formulation. For the adaptation mechanism of the IDA-PBC controller, a discrete-time parameter estimator is derived based on the immersion and invariance (I&I) approach. A structure for a free design function in the I&I-based estimator is proposed including some other free design functions. If these free design functions are selected to satisfy some conditions, derived in this paper, the Lyapunov asymptotic stability of the estimator dynamics is guaranteed. Besides, assuming these conditions are satisfied, local asymptotic stability of the closed-loop system, in a sufficiently large set is shown. The proposed method is applied to the two physical system examples and the performance of the adaptive controller is tested by simulation. It is demonstrated that the performance of the certain IDA-PBC controller is maintained by the adaptive IDA-PBC controller successfully.
- [274] arXiv:2405.19946 [pdf, ps, other]
-
Title: Learning to Discuss Strategically: A Case Study on One Night Ultimate WerewolfComments: 27 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI)
Communication is a fundamental aspect of human society, facilitating the exchange of information and beliefs among people. Despite the advancements in large language models (LLMs), recent agents built with these often neglect the control over discussion tactics, which are essential in communication scenarios and games. As a variant of the famous communication game Werewolf, One Night Ultimate Werewolf (ONUW) requires players to develop strategic discussion policies due to the potential role changes that increase the uncertainty and complexity of the game. In this work, we first present the existence of the Perfect Bayesian Equilibria (PBEs) in two scenarios of the ONUW game: one with discussion and one without. The results showcase that the discussion greatly changes players' utilities by affecting their beliefs, emphasizing the significance of discussion tactics. Based on the insights obtained from the analyses, we propose an RL-instructed language agent framework, where a discussion policy trained by reinforcement learning (RL) is employed to determine appropriate discussion tactics to adopt. Our experimental results on several ONUW game settings demonstrate the effectiveness and generalizability of our proposed framework.
- [275] arXiv:2405.19948 [pdf, ps, html, other]
-
Title: Scalable Test Generation to Trigger Rare Targets in High-Level Synthesizable IPs for Cloud FPGAsSubjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
High-Level Synthesis (HLS) has transformed the development of complex Hardware IPs (HWIP) by offering abstraction and configurability through languages like SystemC/C++, particularly for Field Programmable Gate Array (FPGA) accelerators in high-performance and cloud computing contexts. These IPs can be synthesized for different FPGA boards in cloud, offering compact area requirements and enhanced flexibility. HLS enables designs to execute directly on ARM processors within modern FPGAs without the need for Register Transfer Level (RTL) synthesis, thereby conserving FPGA resources. While HLS offers flexibility and efficiency, it also introduces potential vulnerabilities such as the presence of hidden circuitry, including the possibility of hosting hardware trojans within designs. In cloud environments, these vulnerabilities pose significant security concerns such as leakage of sensitive data, IP functionality disruption and hardware damage, necessitating the development of robust testing frameworks. This research presents an advanced testing approach for HLS-developed cloud IPs, specifically targeting hidden malicious functionalities that may exist in rare conditions within the design. The proposed method leverages selective instrumentation, combining greybox fuzzing and concolic execution techniques to enhance test generation capabilities. Evaluation conducted on various HLS benchmarks, possessing characteristics of FPGA-based cloud IPs with embedded cloud related threats, demonstrates the effectiveness of our framework in detecting trojans and rare scenarios, showcasing improvements in coverage, time efficiency, memory usage, and testing costs compared to existing methods.
- [276] arXiv:2405.19949 [pdf, ps, html, other]
-
Title: Hyper-Transformer for Amodal CompletionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information. Learning shape priors is crucial for effective amodal completion, but traditional methods often rely on two-stage processes or additional information, leading to inefficiencies and potential error accumulation. To address these shortcomings, we introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN). This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks. Specifically, H-TAN uses a dual-branch structure to extract multi-scale features from both images and masks. The multi-scale features from the image branch guide the hyper transformer in learning shape priors and in generating the weights for dynamic convolution tailored to each instance. The dynamic convolution head then uses the features from the mask branch to predict precise amodal masks. We extensively evaluate our model on three benchmark datasets: KINS, COCOA-cls, and D2SA, where H-TAN demonstrated superior performance compared to existing methods. Additional experiments validate the effectiveness and stability of the novel hyper transformer in our framework.
- [277] arXiv:2405.19950 [pdf, ps, html, other]
-
Title: MM-Lego: Modular Biomedical Multimodal Models with Minimal Fine-TuningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Learning holistic computational representations in physical, chemical or biological systems requires the ability to process information from different distributions and modalities within the same model. Thus, the demand for multimodal machine learning models has sharply risen for modalities that go beyond vision and language, such as sequences, graphs, time series, or tabular data. While there are many available multimodal fusion and alignment approaches, most of them require end-to-end training, scale quadratically with the number of modalities, cannot handle cases of high modality imbalance in the training set, or are highly topology-specific, making them too restrictive for many biomedical learning tasks. This paper presents Multimodal Lego (MM-Lego), a modular and general-purpose fusion and model merging framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We achieve this by introducing a wrapper for unimodal encoders that enforces lightweight dimensionality assumptions between modalities and harmonises their representations by learning features in the frequency domain to enable model merging with little signal interference. We show that MM-Lego 1) can be used as a model merging method which achieves competitive performance with end-to-end fusion models without any fine-tuning, 2) can operate on any unimodal encoder, and 3) is a model fusion method that, with minimal fine-tuning, achieves state-of-the-art results on six benchmarked multimodal biomedical tasks.
- [278] arXiv:2405.19954 [pdf, ps, html, other]
-
Title: GenKubeSec: LLM-Based Kubernetes Misconfiguration Detection, Localization, Reasoning, and RemediationSubjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
A key challenge associated with Kubernetes configuration files (KCFs) is that they are often highly complex and error-prone, leading to security vulnerabilities and operational setbacks. Rule-based (RB) tools for KCF misconfiguration detection rely on static rule sets, making them inherently limited and unable to detect newly-discovered misconfigurations. RB tools also suffer from misdetection, since mistakes are likely when coding the detection rules. Recent methods for detecting and remediating KCF misconfigurations are limited in terms of their scalability and detection coverage, or due to the fact that they have high expertise requirements and do not offer automated remediation along with misconfiguration detection. Novel approaches that employ LLMs in their pipeline rely on API-based, general-purpose, and mainly commercial models. Thus, they pose security challenges, have inconsistent classification performance, and can be costly. In this paper, we propose GenKubeSec, a comprehensive and adaptive, LLM-based method, which, in addition to detecting a wide variety of KCF misconfigurations, also identifies the exact location of the misconfigurations and provides detailed reasoning about them, along with suggested remediation. When empirically compared with three industry-standard RB tools, GenKubeSec achieved equivalent precision (0.990) and superior recall (0.999). When a random sample of KCFs was examined by a Kubernetes security expert, GenKubeSec's explanations as to misconfiguration localization, reasoning and remediation were 100% correct, informative and useful. To facilitate further advancements in this domain, we share the unique dataset we collected, a unified misconfiguration index we developed for label standardization, our experimentation code, and GenKubeSec itself as an open-source tool.
- [279] arXiv:2405.19956 [pdf, ps, html, other]
-
Title: HOLMES: to Detect Adversarial Examples with Multiple DetectorsSubjects: Artificial Intelligence (cs.AI)
Deep neural networks (DNNs) can easily be cheated by some imperceptible but purposeful noise added to images, and erroneously classify them. Previous defensive work mostly focused on retraining the models or detecting the noise, but has either shown limited success rates or been attacked by new adversarial examples. Instead of focusing on adversarial images or the interior of DNN models, we observed that adversarial examples generated by different algorithms can be identified based on the output of DNNs (logits). Logit can serve as an exterior feature to train detectors. Then, we propose HOLMES (Hierarchically Organized Light-weight Multiple dEtector System) to reinforce DNNs by detecting potential adversarial examples to minimize the threats they may bring in practical. HOLMES is able to distinguish \textit{unseen} adversarial examples from multiple attacks with high accuracy and low false positive rates than single detector systems even in an adaptive model. To ensure the diversity and randomness of detectors in HOLMES, we use two methods: training dedicated detectors for each label and training detectors with top-k logits. Our effective and inexpensive strategies neither modify original DNN models nor require its internal parameters. HOLMES is not only compatible with all kinds of learning models (even only with external APIs), but also complementary to other defenses to achieve higher detection rates (may also fully protect the system against various adversarial examples).
- [280] arXiv:2405.19957 [pdf, ps, html, other]
-
Title: PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian SplattingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
As text-conditioned diffusion models (DMs) achieve breakthroughs in image, video, and 3D generation, the research community's focus has shifted to the more challenging task of text-to-4D synthesis, which introduces a temporal dimension to generate dynamic 3D objects. In this context, we identify Score Distillation Sampling (SDS), a widely used technique for text-to-3D synthesis, as a significant hindrance to text-to-4D performance due to its Janus-faced and texture-unrealistic problems coupled with high computational costs. In this paper, we propose \textbf{P}ixel-\textbf{L}evel \textbf{A}lignments for Text-to-\textbf{4D} Gaussian Splatting (\textbf{PLA4D}), a novel method that utilizes text-to-video frames as explicit pixel alignment targets to generate static 3D objects and inject motion into them. Specifically, we introduce Focal Alignment to calibrate camera poses for rendering and GS-Mesh Contrastive Learning to distill geometry priors from rendered image contrasts at the pixel level. Additionally, we develop Motion Alignment using a deformation network to drive changes in Gaussians and implement Reference Refinement for smooth 4D object surfaces. These techniques enable 4D Gaussian Splatting to align geometry, texture, and motion with generated videos at the pixel level. Compared to previous methods, PLA4D produces synthesized outputs with better texture details in less time and effectively mitigates the Janus-faced problem. PLA4D is fully implemented using open-source models, offering an accessible, user-friendly, and promising direction for 4D digital content creation. Our project page: \href{this https URL}{this https URL}.
- [281] arXiv:2405.19958 [pdf, ps, html, other]
-
Title: Multi-Aspect Controllable Text Generation with Disentangled Counterfactual AugmentationComments: Accepted in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Multi-aspect controllable text generation aims to control the generated texts in attributes from multiple aspects (e.g., "positive" from sentiment and "sport" from topic). For ease of obtaining training samples, existing works neglect attribute correlations formed by the intertwining of different attributes. Particularly, the stereotype formed by imbalanced attribute correlations significantly affects multi-aspect control. In this paper, we propose MAGIC, a new multi-aspect controllable text generation method with disentangled counterfactual augmentation. We alleviate the issue of imbalanced attribute correlations during training using counterfactual feature vectors in the attribute latent space by disentanglement. During inference, we enhance attribute correlations by target-guided counterfactual augmentation to further improve multi-aspect control. Experiments show that MAGIC outperforms state-of-the-art baselines in both imbalanced and balanced attribute correlation scenarios. Our source code and data are available at this https URL.
- [282] arXiv:2405.19961 [pdf, ps, html, other]
-
Title: Collective Variable Free Transition Path Sampling with Generative Flow NetworkComments: 9 pages, 5 figures, 2 tablesSubjects: Machine Learning (cs.LG)
Understanding transition paths between meta-stable states in molecular systems is fundamental for material design and drug discovery. However, sampling these paths via molecular dynamics simulations is computationally prohibitive due to the high-energy barriers between the meta-stable states. Recent machine learning approaches are often restricted to simple systems or rely on collective variables (CVs) extracted from expensive domain knowledge. In this work, we propose to leverage generative flow networks (GFlowNets) to sample transition paths without relying on CVs. We reformulate the problem as amortized energy-based sampling over molecular trajectories and train a bias potential by minimizing the squared log-ratio between the target distribution and the generator, derived from the flow matching objective of GFlowNets. Our evaluation on three proteins (Alanine Dipeptide, Polyproline, and Chignolin) demonstrates that our approach, called TPS-GFN, generates more realistic and diverse transition paths than the previous CV-free machine learning approach.
- [283] arXiv:2405.19965 [pdf, ps, html, other]
-
Title: Several classes of BCH codes of length $n=\frac{q^{m}-1}{2}$Subjects: Information Theory (cs.IT)
BCH codes are an important class of cyclic codes, and have wide applications in communication and storage systems. In this paper, we study the negacyclic BCH code and cyclic BCH code of length $n=\frac{q^m-1}{2}$.For negacyclic BCH code, we give the dimensions of $\mathcal C_{(n,-1,\delta,0)}$ for $\widetilde{\delta} =a\frac{q^m-1}{q-1},aq^{m-1}-1$($1\leq a <\frac{q-1}{2}$) and $\widetilde{\delta} =a\frac{q^m-1}{q-1}+b\frac{q^m-1}{q^2-1},aq^{m-1}+(a+b)q^{m-2}-1$ $(2\mid m,1\leq a+b \leq q-1$,$\left\lceil \frac{q-a-2}{2}\right\rceil\geq 1)$. The dimensions of negacyclic BCH codes $\mathcal C_{(n,-1,\delta,0)}$ with few nonzeros and $\mathcal C_{(n,-1,\delta,b)}$ with $b\neq 1$ are settled.For cyclic BCH code, we give the weight distributions of extended codes $\overline{\mathcal C}_{(n,1,\delta,1)}$ for $\delta=\delta_1,\delta_2$ and the parameters of dual code $\mathcal C^{\perp}_{(n,1,\delta,1)}$ for $\delta_2\leq \delta \leq \delta_1$.
- [284] arXiv:2405.19967 [pdf, ps, html, other]
-
Title: Improved Out-of-Scope Intent Classification with Dual Encoding and Threshold-based Re-ClassificationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Detecting out-of-scope user utterances is essential for task-oriented dialogues and intent classification. Current methodologies face difficulties with the unpredictable distribution of outliers and often rely on assumptions about data distributions. We present the Dual Encoder for Threshold-Based Re-Classification (DETER) to address these challenges. This end-to-end framework efficiently detects out-of-scope intents without requiring assumptions on data distributions or additional post-processing steps. The core of DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and the Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance embeddings, which are classified through a branched neural architecture. Further, DETER generates synthetic outliers using self-supervision and incorporates out-of-scope phrases from open-domain datasets. This approach ensures a comprehensive training set for out-of-scope detection. Additionally, a threshold-based re-classification mechanism refines the model's initial predictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77 datasets demonstrate DETER's efficacy. Our model outperforms previous benchmarks, increasing up to 13% and 5% in F1 score for known and unknown intents on CLINC-150 and Stackoverflow, and 16% for known and 24% % for unknown intents on Banking77. The source code has been released at this https URL\_Classification\_OOS.
- [285] arXiv:2405.19968 [pdf, ps, html, other]
-
Title: A Dynamic Logic for Information Evaluation in IntelligenceSubjects: Logic in Computer Science (cs.LO)
In the field of human intelligence, officers use an alphanumeric scale, known as the Admiralty System, to rate the credibility of messages and the reliability of their sources (NATO AJP-2.1, 2016). During this evaluation, they are expected to estimate the credibility and reliability dimensions independently of each other (NATO STANAG, 2003). However, empirical results show that officers perceive these dimensions as strongly correlated (Baker et al., 1968). More precisely, they consider credibility as playing the leading role over reliability, the importance of which is only secondary (Samet, 1975). In this paper, we present a formal evaluative procedure, called L(intel), in line with these findings. We adapt dynamic belief revision to make credibility the main dimension of evaluation and introduce dynamic operators to update credibility ratings with the source's reliability. In addition to being empirically sound, we show that L(intel) provides an effective procedure to classify intelligence messages along the descriptive taxonomy presented in Icard (2023).
- [286] arXiv:2405.19969 [pdf, ps, html, other]
-
Title: Stable semi-implicit SDC methods for conservation lawsSubjects: Numerical Analysis (math.NA)
Semi-implicit spectral deferred correction (SDC) methods provide a systematic approach to construct time integration methods of arbitrarily high order for nonlinear evolution equations including conservation laws. They converge towards $A$- or even $L$-stable collocation methods, but are often not sufficiently robust themselves. In this paper, a family of SDC methods inspired by an implicit formulation of the Lax-Wendroff method is developed. Compared to fully implicit approaches, the methods have the advantage that they only require the solution of positive definite or semi-definite linear systems. Numerical evidence suggests that the proposed semi-implicit SDC methods with Radau points are $L$-stable up to order 11 and require very little diffusion for orders 13 and 15. The excellent stability and accuracy of these methods is confirmed by numerical experiments with 1D conservation problems, including the convection-diffusion, Burgers, Euler and Navier-Stokes equations.
- [287] arXiv:2405.19970 [pdf, ps, html, other]
-
Title: Strategies to Counter Artificial Intelligence in Law Enforcement: Cross-Country Comparison of Citizens in Greece, Italy and SpainPetra Saskia Bayerl, Babak Akhgar, Ernesto La Mattina, Barbara Pirillo, Ioana Cotoi, Davide Ariu, Matteo Mauri, Jorge Garcia, Dimitris Kavallieros, Antonia Kardara, Konstantina KaragiorgouComments: 20th International Conference on Information and Knowledge Engineering (IKE'21), 3 papges, 1 figureSubjects: Artificial Intelligence (cs.AI)
This paper investigates citizens' counter-strategies to the use of Artificial Intelligence (AI) by law enforcement agencies (LEAs). Based on information from three countries (Greece, Italy and Spain) we demonstrate disparities in the likelihood of ten specific counter-strategies. We further identified factors that increase the propensity for counter-strategies. Our study provides an important new perspective to societal impacts of security-focused AI applications by illustrating the conscious, strategic choices by citizens when confronted with AI capabilities for LEAs.
- [288] arXiv:2405.19971 [pdf, ps, html, other]
-
Title: GasTrace: Detecting Sandwich Attack Malicious Accounts in EthereumSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
The openness and transparency of Ethereum transaction data make it easy to be exploited by any entities, executing malicious attacks. The sandwich attack manipulates the Automated Market Maker (AMM) mechanism, profiting from manipulating the market price through front or after-running transactions. To identify and prevent sandwich attacks, we propose a cascade classification framework GasTrace. GasTrace analyzes various transaction features to detect malicious accounts, notably through the analysis and modeling of Gas features. In the initial classification, we utilize the Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel to generate the predicted probabilities of accounts, further constructing a detailed transaction network. Subsequently, the behavior features are captured by the Graph Attention Network (GAT) technique in the second classification. Through cascade classification, GasTrace can analyze and classify the sandwich attacks. Our experimental results demonstrate that GasTrace achieves a remarkable detection and generation capability, performing an accuracy of 96.73\% and an F1 score of 95.71\% for identifying sandwich attack accounts.
- [289] arXiv:2405.19976 [pdf, ps, html, other]
-
Title: Testing in the Evolving World of DL Systems:Insights from Python GitHub ProjectsComments: 11 pages, 3 figures, The 24th IEEE International Conference on Software Quality, Reliability, and Security (QRS) 2024Subjects: Software Engineering (cs.SE)
In the ever-evolving field of Deep Learning (DL), ensuring project quality and reliability remains a crucial challenge. This research investigates testing practices within DL projects in GitHub. It quantifies the adoption of testing methodologies, focusing on aspects like test automation, the types of tests (e.g., unit, integration, and system), test suite growth rate, and evolution of testing practices across different project versions. We analyze a subset of 300 carefully selected repositories based on quantitative and qualitative criteria. This study reports insights on the prevalence of testing practices in DL projects within the open-source community.
- [290] arXiv:2405.19977 [pdf, ps, html, other]
-
Title: Consistent Submodular MaximizationComments: To appear at ICML 24Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Maximizing monotone submodular functions under cardinality constraints is a classic optimization task with several applications in data mining and machine learning. In this paper we study this problem in a dynamic environment with consistency constraints: elements arrive in a streaming fashion and the goal is maintaining a constant approximation to the optimal solution while having a stable solution (i.e., the number of changes between two consecutive solutions is bounded). We provide algorithms in this setting with different trade-offs between consistency and approximation quality. We also complement our theoretical results with an experimental analysis showing the effectiveness of our algorithms in real-world instances.
- [291] arXiv:2405.19978 [pdf, ps, html, other]
-
Title: Domain Adaptation with Cauchy-Schwarz DivergenceComments: Accepted by UAI-24Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The CS divergence offers a theoretically tighter generalization error bound than the popular Kullback-Leibler divergence. This holds for the general case of supervised learning, including multi-class classification and regression. Furthermore, we illustrate that the CS divergence enables a simple estimator on the discrepancy of both marginal and conditional distributions between source and target domains in the representation space, without requiring any distributional assumptions. We provide multiple examples to illustrate how the CS divergence can be conveniently used in both distance metric- or adversarial training-based UDA frameworks, resulting in compelling performance.
- [292] arXiv:2405.19982 [pdf, ps, other]
-
Title: A Deep Reinforcement Learning Approach for Trading Optimization in the Forex Market with Multi-Agent Asynchronous DistributionSubjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
In today's forex market traders increasingly turn to algorithmic trading, leveraging computers to seek more profits. Deep learning techniques as cutting-edge advancements in machine learning, capable of identifying patterns in financial data. Traders utilize these patterns to execute more effective trades, adhering to algorithmic trading rules. Deep reinforcement learning methods (DRL), by directly executing trades based on identified patterns and assessing their profitability, offer advantages over traditional DL approaches. This research pioneers the application of a multi-agent (MA) RL framework with the state-of-the-art Asynchronous Advantage Actor-Critic (A3C) algorithm. The proposed method employs parallel learning across multiple asynchronous workers, each specialized in trading across multiple currency pairs to explore the potential for nuanced strategies tailored to different market conditions and currency pairs. Two different A3C with lock and without lock MA model was proposed and trained on single currency and multi-currency. The results indicate that both model outperform on Proximal Policy Optimization model. A3C with lock outperforms other in single currency training scenario and A3C without Lock outperforms other in multi-currency scenario. The findings demonstrate that this approach facilitates broader and faster exploration of different currency pairs, significantly enhancing trading returns. Additionally, the agent can learn a more profitable trading strategy in a shorter time.
- [293] arXiv:2405.19988 [pdf, ps, html, other]
-
Title: Video-Language Critic: Transferable Reward Functions for Language-Conditioned RoboticsMinttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai YuanComments: 10 pages in the main text, 16 pages including references and supplementary materials. 4 figures and 3 tables in the main text, 1 table in supplementary materialsSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate reinforcement learning actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.
- [294] arXiv:2405.19990 [pdf, ps, html, other]
-
Title: DiffPhysBA: Diffusion-based Physical Backdoor Attack against Person Re-Identification in Real-WorldSubjects: Computer Vision and Pattern Recognition (cs.CV)
Person Re-Identification (ReID) systems pose a significant security risk from backdoor attacks, allowing adversaries to evade tracking or impersonate others. Beyond recognizing this issue, we investigate how backdoor attacks can be deployed in real-world scenarios, where a ReID model is typically trained on data collected in the digital domain and then deployed in a physical environment. This attack scenario requires an attack flow that embeds backdoor triggers in the digital domain realistically enough to also activate the buried backdoor in person ReID models in the physical domain. This paper realizes this attack flow by leveraging a diffusion model to generate realistic accessories on pedestrian images (e.g., bags, hats, etc.) as backdoor triggers. However, the noticeable domain gap between the triggers generated by the off-the-shelf diffusion model and their physical counterparts results in a low attack success rate. Therefore, we introduce a novel diffusion-based physical backdoor attack (DiffPhysBA) method that adopts a training-free similarity-guided sampling process to enhance the resemblance between generated and physical triggers. Consequently, DiffPhysBA can generate realistic attributes as semantic-level triggers in the digital domain and provides higher physical ASR compared to the direct paste method by 25.6% on the real-world test set. Through evaluations on newly proposed real-world and synthetic ReID test sets, DiffPhysBA demonstrates an impressive success rate exceeding 90% in both the digital and physical domains. Notably, it excels in digital stealth metrics and can effectively evade state-of-the-art defense methods.
- [295] arXiv:2405.19991 [pdf, ps, html, other]
-
Title: OpenTM: An Open-source, Single-GPU, Large-scale Thermal Microstructure Design FrameworkSubjects: Computational Engineering, Finance, and Science (cs.CE); Optimization and Control (math.OC)
Thermal microstructures are artificially engineered materials designed to manipulate and control heat flow in unconventional ways. This paper presents an educational framework, called \emph{OpenTM}, to use a single GPU for designing periodic 3D high-resolution thermal microstructures to match the predefined thermal conductivity matrices with volume fraction constraints. Specifically, we use adaptive volume fraction to make the Optimality Criteria (OC) method run stably to obtain the thermal microstructures without a large memory overhead.Practical examples with a high resolution $128 \times 128 \times 128$ run under 90 seconds per structure on an NVIDIA GeForce GTX 4070Ti GPU with a peak GPU memory of 355 MB. Our open-source, high-performance implementation is publicly accessible at \url{this https URL}, and it is easy to install using Anaconda. Moreover, we provide a Python interface to make OpenTM well-suited for novices in C/C++.
- [296] arXiv:2405.19994 [pdf, ps, html, other]
-
Title: High-order parallel-in-time method for the monodomain equation in cardiac electrophysiologySubjects: Numerical Analysis (math.NA)
Simulation of the monodomain equation, crucial for modeling the heart's electrical activity, faces scalability limits when traditional numerical methods only parallelize in space. To optimize the use of large multi-processor computers by distributing the computational load more effectively, time parallelization is essential. We introduce a high-order parallel-in-time method addressing the substantial computational challenges posed by the stiff, multiscale, and nonlinear nature of cardiac dynamics. Our method combines the semi-implicit and exponential spectral deferred correction methods, yielding a hybrid method that is extended to parallel-in-time employing the PFASST framework. We thoroughly evaluate the stability, accuracy, and robustness of the proposed parallel-in-time method through extensive numerical experiments, using practical ionic models such as the ten-Tusscher-Panfilov. The results underscore the method's potential to significantly enhance real-time and high-fidelity simulations in biomedical research and clinical applications.
- [297] arXiv:2405.19996 [pdf, ps, html, other]
-
Title: DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the WildSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Image quality assessment (IQA) plays a critical role in selecting high-quality images and guiding compression and enhancement methods in a series of applications. The blind IQA, which assesses the quality of in-the-wild images containing complex authentic distortions without reference images, poses greater challenges. Existing methods are limited to modeling a uniform distribution with local patches and are bothered by the gap between low and high-level visions (caused by widely adopted pre-trained classification networks). In this paper, we propose a novel IQA method called diffusion priors-based IQA (DP-IQA), which leverages the prior knowledge from the pre-trained diffusion model with its excellent powers to bridge semantic gaps in the perception of the visual quality of images. Specifically, we use pre-trained stable diffusion as the backbone, extract multi-level features from the denoising U-Net during the upsampling process at a specified timestep, and decode them to estimate the image quality score. The text and image adapters are adopted to mitigate the domain gap for downstream tasks and correct the information loss caused by the variational autoencoder bottleneck. Finally, we distill the knowledge in the above model into a CNN-based student model, significantly reducing the parameter to enhance applicability, with the student model performing similarly or even better than the teacher model surprisingly. Experimental results demonstrate that our DP-IQA achieves state-of-the-art results on various in-the-wild datasets with better generalization capability, which shows the superiority of our method in global modeling and utilizing the hierarchical feature clues of diffusion for evaluating image quality.
- [298] arXiv:2405.19998 [pdf, ps, html, other]
-
Title: LAGMA: LAtent Goal-guided Multi-Agent Reinforcement LearningComments: Accepted at ICML 2024Subjects: Multiagent Systems (cs.MA)
In cooperative multi-agent reinforcement learning (MARL), agents collaborate to achieve common goals, such as defeating enemies and scoring a goal. However, learning goal-reaching paths toward such a semantic goal takes a considerable amount of time in complex tasks and the trained model often fails to find such paths. To address this, we present LAtent Goal-guided Multi-Agent reinforcement learning (LAGMA), which generates a goal-reaching trajectory in latent space and provides a latent goal-guided incentive to transitions toward this reference trajectory. LAGMA consists of three major components: (a) quantized latent space constructed via a modified VQ-VAE for efficient sample utilization, (b) goal-reaching trajectory generation via extended VQ codebook, and (c) latent goal-guided intrinsic reward generation to encourage transitions towards the sampled goal-reaching path. The proposed method is evaluated by StarCraft II with both dense and sparse reward settings and Google Research Football. Empirical results show further performance improvement over state-of-the-art baselines.
- [299] arXiv:2405.20000 [pdf, ps, html, other]
-
Title: Combining physics-informed graph neural network and finite difference for solving forward and inverse spatiotemporal PDEsSubjects: Numerical Analysis (math.NA)
The great success of Physics-Informed Neural Networks (PINN) in solving partial differential equations (PDEs) has significantly advanced our simulation and understanding of complex physical systems in science and engineering. However, many PINN-like methods are poorly scalable and are limited to in-sample scenarios. To address these challenges, this work proposes a novel discrete approach termed Physics-Informed Graph Neural Network (PIGNN) to solve forward and inverse nonlinear PDEs. In particular, our approach seamlessly integrates the strength of graph neural networks (GNN), physical equations and finite difference to approximate solutions of physical systems. Our approach is compared with the PINN baseline on three well-known nonlinear PDEs (heat, Burgers and FitzHugh-Nagumo). We demonstrate the excellent performance of the proposed method to work with irregular meshes, longer time steps, arbitrary spatial resolutions, varying initial conditions (ICs) and boundary conditions (BCs) by conducting extensive numerical experiments. Numerical results also illustrate the superiority of our approach in terms of accuracy, time extrapolability, generalizability and scalability. The main advantage of our approach is that models trained in small domains with simple settings have excellent fitting capabilities and can be directly applied to more complex situations in large domains.
- [300] arXiv:2405.20003 [pdf, ps, html, other]
-
Title: Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic SimilaritiesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Uncertainty quantification in Large Language Models (LLMs) is crucial for applications where safety and reliability are important. In particular, uncertainty can be used to improve the trustworthiness of LLMs by detecting factually incorrect model responses, commonly called hallucinations. Critically, one should seek to capture the model's semantic uncertainty, i.e., the uncertainty over the meanings of LLM outputs, rather than uncertainty over lexical or syntactic variations that do not affect answer correctness. To address this problem, we propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs. KLE defines positive semidefinite unit trace kernels to encode the semantic similarities of LLM outputs and quantifies uncertainty using the von Neumann entropy. It considers pairwise semantic dependencies between answers (or semantic clusters), providing more fine-grained uncertainty estimates than previous methods based on hard clustering of answers. We theoretically prove that KLE generalizes the previous state-of-the-art method called semantic entropy and empirically demonstrate that it improves uncertainty quantification performance across multiple natural language generation datasets and LLM architectures.
- [301] arXiv:2405.20006 [pdf, ps, html, other]
-
Title: SWIFT: A Monotonic, Flux-Form Semi-Lagrangian Tracer Transport Scheme for Flow with Large Courant NumbersSubjects: Numerical Analysis (math.NA); Atmospheric and Oceanic Physics (physics.ao-ph); Fluid Dynamics (physics.flu-dyn)
Local conservation of mass and entropy are becoming increasingly desirable properties for modern numerical weather and climate models. This work presents a Flux-Form Semi-Lagrangian (FFSL) transport scheme, called SWIFT, that facilitates this conservation for tracer variables, whilst maintaining other vital properties such as preservation of a constant, monotonicity and positivity. Importantly, these properties all hold for large Courant numbers and multi-dimensional flow, making the scheme appropriate for use within a dynamical core which takes large time steps.
The SWIFT scheme presented here can be seen as an evolution of the FFSL methods of Leonard et al and Lin and Rood. Two-dimensional and three-dimensional schemes consist of a splitting into a sequence of one-dimensional calculations. The new SWIFT splitting presented here allows monotonic and positivity properties from the one-dimensional calculations to be inherited by the multi-dimensional scheme. These one-dimensional calculations involve separating the mass flux into terms that correspond to integer and fractional parts of the Courant number. Key to achieving conservation is coupling the transport of tracers to the transport of the fluid density, through re-use of the discrete mass flux that was calculated from the fluid density in the transport of the tracers. This work also describes how these properties can still be attained when the tracer is vertically-staggered from the density in a Charney-Phillips grid. - [302] arXiv:2405.20008 [pdf, ps, html, other]
-
Title: Sharing Key Semantics in Transformer Makes Efficient Image RestorationBin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu SebeComments: 9 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the Vision Transformers (ViTs) emergence has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information, thereby impeding efficiency. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR's state-of-the-art performance, quantitatively and qualitatively showcasing advancements.
- [303] arXiv:2405.20012 [pdf, ps, html, other]
-
Title: FlexiDrop: Theoretical Insights and Practical Advances in Random Dropout Method on GNNsSubjects: Machine Learning (cs.LG)
Graph Neural Networks (GNNs) are powerful tools for handling graph-type data. Recently, GNNs have been widely applied in various domains, but they also face some issues, such as overfitting, over-smoothing and non-robustness. The existing research indicates that random dropout methods are an effective way to address these issues. However, random dropout methods in GNNs still face unresolved problems. Currently, the choice of dropout rate, often determined by heuristic or grid search methods, can increase the generalization error, contradicting the principal aims of dropout. In this paper, we propose a novel random dropout method for GNNs called FlexiDrop. First, we conduct a theoretical analysis of dropout in GNNs using rademacher complexity and demonstrate that the generalization error of traditional random dropout methods is constrained by a function related to the dropout rate. Subsequently, we use this function as a regularizer to unify the dropout rate and empirical loss within a single loss function, optimizing them simultaneously. Therefore, our method enables adaptive adjustment of the dropout rate and theoretically balances the trade-off between model complexity and generalization ability. Furthermore, extensive experimental results on benchmark datasets show that FlexiDrop outperforms traditional random dropout methods in GNNs.
- [304] arXiv:2405.20013 [pdf, ps, html, other]
-
Title: Repeatable and Reliable Efforts of Accelerated Risk AssessmentSubjects: Robotics (cs.RO)
Risk assessment of a robot in controlled environments, such as laboratories and proving grounds, is a common means to assess, certify, validate, verify, and characterize the robots' safety performance before, during, and even after their commercialization in the real-world. A standard testing program that acquires the risk estimate is expected to be (i) repeatable, such that it obtains similar risk assessments of the same testing subject among multiple trials or attempts with the similar testing effort by different stakeholders, and (ii) reliable against a variety of testing subjects produced by different vendors and manufacturers. Both repeatability and reliability are fundamental and crucial for a testing algorithm's validity, fairness, and practical feasibility, especially for standardization. However, these properties are rarely satisfied or ensured, especially as the subject robots become more complex, uncertain, and varied. This issue was present in traditional risk assessments through Monte-Carlo sampling, and remains a bottleneck for the recent accelerated risk assessment methods, primarily those using importance sampling. This study aims to enhance existing accelerated testing frameworks by proposing a new algorithm that provably integrates repeatability and reliability with the already established formality and efficiency. It also features demonstrations assessing the risk of instability from frontal impacts, initiated by push-over disturbances on a controlled inverted pendulum and a 7-DoF planar bipedal robot Rabbit managed by various control algorithms.
- [305] arXiv:2405.20014 [pdf, ps, html, other]
-
Title: subMFL: Compatiple subModel Generation for Federated Learning in Device Heterogenous EnvironmentComments: 12 pages, 7 figures, European Conference on Parallel Processing, pp. between 52 and 64, Springer, 2023Subjects: Machine Learning (cs.LG)
Federated Learning (FL) is commonly used in systems with distributed and heterogeneous devices with access to varying amounts of data and diverse computing and storage capacities. FL training process enables such devices to update the weights of a shared model locally using their local data and then a trusted central server combines all of those models to generate a global model. In this way, a global model is generated while the data remains local to devices to preserve privacy. However, training large models such as Deep Neural Networks (DNNs) on resource-constrained devices can take a prohibitively long time and consume a large amount of energy. In the current process, the low-capacity devices are excluded from the training process, although they might have access to unseen data. To overcome this challenge, we propose a model compression approach that enables heterogeneous devices with varying computing capacities to participate in the FL process. In our approach, the server shares a dense model with all devices to train it: Afterwards, the trained model is gradually compressed to obtain submodels with varying levels of sparsity to be used as suitable initial global models for resource-constrained devices that were not capable of train the first dense model. This results in an increased participation rate of resource-constrained devices while the transferred weights from the previous round of training are preserved. Our validation experiments show that despite reaching about 50 per cent global sparsity, generated submodels maintain their accuracy while can be shared to increase participation by around 50 per cent.
- [306] arXiv:2405.20015 [pdf, ps, html, other]
-
Title: Efficient LLM-Jailbreaking by Introducing Visual ModalitySubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
This paper focuses on jailbreaking attacks against large language models (LLMs), eliciting them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreaks that directly orient to LLMs, our approach begins by constructing a multimodal large language model (MLLM) through the incorporation of a visual module into the target LLM. Subsequently, we conduct an efficient MLLM-jailbreak to generate jailbreaking embeddings embJS. Finally, we convert the embJS into text space to facilitate the jailbreaking of the target LLM. Compared to direct LLM-jailbreaking, our approach is more efficient, as MLLMs are more vulnerable to jailbreaking than pure LLM. Additionally, to improve the attack success rate (ASR) of jailbreaking, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class jailbreaking capabilities.
- [307] arXiv:2405.20018 [pdf, ps, other]
-
Title: Safe Multi-agent Reinforcement Learning with Natural Language ConstraintsComments: 23 pages, 6 figuresSubjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
The role of natural language constraints in Safe Multi-agent Reinforcement Learning (MARL) is crucial, yet often overlooked. While Safe MARL has vast potential, especially in fields like robotics and autonomous vehicles, its full potential is limited by the need to define constraints in pre-designed mathematical terms, which requires extensive domain expertise and reinforcement learning knowledge, hindering its broader adoption. To address this limitation and make Safe MARL more accessible and adaptable, we propose a novel approach named Safe Multi-agent Reinforcement Learning with Natural Language constraints (SMALL). Our method leverages fine-tuned language models to interpret and process free-form textual constraints, converting them into semantic embeddings that capture the essence of prohibited states and behaviours. These embeddings are then integrated into the multi-agent policy learning process, enabling agents to learn policies that minimize constraint violations while optimizing rewards. To evaluate the effectiveness of SMALL, we introduce the LaMaSafe, a multi-task benchmark designed to assess the performance of multiple agents in adhering to natural language constraints. Empirical evaluations across various environments demonstrate that SMALL achieves comparable rewards and significantly fewer constraint violations, highlighting its effectiveness in understanding and enforcing natural language constraints.
- [308] arXiv:2405.20020 [pdf, ps, html, other]
-
Title: About truncated Chebyshev spectral method for solving numerical differentiation and summationSubjects: Numerical Analysis (math.NA)
The problems of optimal recovering univariate functions and their derivatives are studied. To solve these problems, two variants of the truncation method are constructed, which are order-optimal both in the sense of accuracy and in terms of the amount of involved Galerkin information. For numerical summation, it has been established how the parameters characterizing the problem being solved affect its stability.
- [309] arXiv:2405.20024 [pdf, ps, html, other]
-
Title: Applications of Generative AI (GAI) for Mobile and Wireless Networking: A SurveyThai-Hoc Vu, Senthil Kumar Jagatheesaperumal, Minh-Duong Nguyen, Nguyen Van Huynh, Sunghwan Kim, Quoc-Viet PhamSubjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
The success of Artificial Intelligence (AI) in multiple disciplines and vertical domains in recent years has promoted the evolution of mobile networking and the future Internet toward an AI-integrated Internet-of-Things (IoT) era. Nevertheless, most AI techniques rely on data generated by physical devices (e.g., mobile devices and network nodes) or specific applications (e.g., fitness trackers and mobile gaming). To bypass this circumvent, Generative AI (GAI), a.k.a. AI-generated content (AIGC), has emerged as a powerful AI paradigm; thanks to its ability to efficiently learn complex data distributions and generate synthetic data to represent the original data in various forms. This impressive feature is projected to transform the management of mobile networking and diversify the current services and applications provided. On this basis, this work presents a concise tutorial on the role of GAIs in mobile and wireless networking. In particular, this survey first provides the fundamentals of GAI and representative GAI models, serving as an essential preliminary to the understanding of the applications of GAI in mobile and wireless networking. Then, this work provides a comprehensive review of state-of-the-art studies and GAI applications in network management, wireless security, semantic communication, and lessons learned from the open literature. Finally, this work summarizes the current research on GAI for mobile and wireless networking by outlining important challenges that need to be resolved to facilitate the development and applicability of GAI in this edge-cutting area.
- [310] arXiv:2405.20025 [pdf, ps, html, other]
-
Title: From Forest to Zoo: Great Ape Behavior Recognition with ChimpBehaveComments: CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling In conjunction with Computer Vision and Pattern Recognition 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper addresses the significant challenge of recognizing behaviors in non-human primates, specifically focusing on chimpanzees. Automated behavior recognition is crucial for both conservation efforts and the advancement of behavioral research. However, it is significantly hindered by the labor-intensive process of manual video annotation. Despite the availability of large-scale animal behavior datasets, the effective application of machine learning models across varied environmental settings poses a critical challenge, primarily due to the variability in data collection contexts and the specificity of annotations.
In this paper, we introduce ChimpBehave, a novel dataset featuring over 2 hours of video (approximately 193,000 video frames) of zoo-housed chimpanzees, meticulously annotated with bounding boxes and behavior labels for action recognition. ChimpBehave uniquely aligns its behavior classes with existing datasets, allowing for the study of domain adaptation and cross-dataset generalization methods between different visual settings. Furthermore, we benchmark our dataset using a state-of-the-art CNN-based action recognition model, providing the first baseline results for both within and cross-dataset settings. The dataset, models, and code can be accessed at: this https URL - [311] arXiv:2405.20026 [pdf, ps, html, other]
-
Title: The CFG Complexity of Singleton SetsSubjects: Formal Languages and Automata Theory (cs.FL)
Let G be a context-free grammar (CFG) in Chomsky normal form. We take the number of rules in G to be the size of G. We also assume all CFGs are in Chomsky normal form.
We consider the question of, given a string w of length n, what is the smallest CFG such that L(G)={w}? We show the following:
1) For all w, |w|=n, there is a CFG of size with O(n/log n) rules, such that L(G)={w}.
2) There exists a string w, |w|=n, such that every CFG G with L(G)={w} is of size Omega(n/log n). We give two proofs of: one nonconstructive, the other constructive. - [312] arXiv:2405.20027 [pdf, ps, html, other]
-
Title: SEA Cache: A Performance-Efficient Countermeasure for Contention-based AttacksSubjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Many cache designs have been proposed to guard against contention-based side-channel attacks. One well-known type of cache is the randomized remapping cache. Many randomized remapping caches provide fixed or over protection, which leads to permanent performance degradation, or they provide flexible protection, but sacrifice performance against strong contention-based attacks. To improve the secure cache design, we extend an existing secure cache design, CEASER-SH cache, and propose the SEA cache. The novel cache configurations in both caches are logical associativity, which allows the cache line to be placed not only in its mapped cache set but also in the subsequent cache sets. SEA cache allows each user or each process to have a different local logical associativity. Hence, only those users or processes that request extra protection against contention-based attacks are protected with high logical associativity. Other users or processes can access the cache with lower latency and higher performance. Compared to a CEASER-SH cache with logical associativity of 8, an SEA cache with logical associativity of 1 for normal protection users and 16 for high protection users has a Cycles Per Instruction penalty that is about 0.6% less for users under normal protections and provides better security against contention-based attacks. Based on a 45nm technology library, and compared to a conventional cache, we estimate the power overhead is about 20% and the area overhead is 3.4%.
- [313] arXiv:2405.20028 [pdf, ps, html, other]
-
Title: A Simple and Adaptive Learning Rate for FTRL in Online Learning with Minimax Regret of $\Theta(T^{2/3})$ and its Application to Best-of-Both-WorldsComments: 31 pagesSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Follow-the-Regularized-Leader (FTRL) is a powerful framework for various online learning problems. By designing its regularizer and learning rate to be adaptive to past observations, FTRL is known to work adaptively to various properties of an underlying environment. However, most existing adaptive learning rates are for online learning problems with a minimax regret of $\Theta(\sqrt{T})$ for the number of rounds $T$, and there are only a few studies on adaptive learning rates for problems with a minimax regret of $\Theta(T^{2/3})$, which include several important problems dealing with indirect feedback. To address this limitation, we establish a new adaptive learning rate framework for problems with a minimax regret of $\Theta(T^{2/3})$. Our learning rate is designed by matching the stability, penalty, and bias terms that naturally appear in regret upper bounds for problems with a minimax regret of $\Theta(T^{2/3})$. As applications of this framework, we consider two major problems dealing with indirect feedback: partial monitoring and graph bandits. We show that FTRL with our learning rate and the Tsallis entropy regularizer improves existing Best-of-Both-Worlds (BOBW) regret upper bounds, which achieve simultaneous optimality in the stochastic and adversarial regimes. The resulting learning rate is surprisingly simple compared to the existing learning rates for BOBW algorithms for problems with a minimax regret of $\Theta(T^{2/3})$.
- [314] arXiv:2405.20029 [pdf, ps, other]
-
Title: A Random Forest-based Prediction Model for Turning Points in Antagonistic event-group CompetitionsSubjects: Machine Learning (cs.LG)
At present, most of the prediction studies related to antagonistic event-group competitions focus on the prediction of competition results, and less on the prediction of the competition process, which can not provide real-time feedback of the athletes' state information in the actual competition, and thus can not analyze the changes of the competition situation. In order to solve this problem, this paper proposes a prediction model based on Random Forest for the turning point of the antagonistic event-group. Firstly, the quantitative equation of competitive potential energy is proposed; Secondly, the quantitative value of competitive potential energy is obtained by using the dynamic combination of weights method, and the turning point of the competition situation of the antagonistic event-group is marked according to the quantitative time series graph; Finally, the random forest prediction model based on the optimisation of the KM-SMOTE algorithm and the grid search method is established. The experimental analysis shows that: the quantitative equation of competitive potential energy can effectively reflect the dynamic situation of the competition; The model can effectively predict the turning point of the competition situation of the antagonistic event-group, and the recall rate of the model in the test set is 86.13%; the model has certain significance for the future study of the competition situation of the antagonistic event-group.
- [315] arXiv:2405.20030 [pdf, ps, html, other]
-
Title: EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric VideosSubjects: Computer Vision and Pattern Recognition (cs.CV)
Predicting future human behavior from egocentric videos is a challenging but critical task for human intention understanding. Existing methods for forecasting 2D hand positions rely on visual representations and mainly focus on hand-object interactions. In this paper, we investigate the hand forecasting task and tackle two significant issues that persist in the existing methods: (1) 2D hand positions in future frames are severely affected by ego-motions in egocentric videos; (2) prediction based on visual information tends to overfit to background or scene textures, posing a challenge for generalization on novel scenes or human behaviors. To solve the aforementioned problems, we propose EMAG, an ego-motion-aware and generalizable 2D hand forecasting method. In response to the first problem, we propose a method that considers ego-motion, represented by a sequence of homography matrices of two consecutive frames. We further leverage modalities such as optical flow, trajectories of hands and interacting objects, and ego-motions, thereby alleviating the second issue. Extensive experiments on two large-scale egocentric video datasets, Ego4D and EPIC-Kitchens 55, verify the effectiveness of the proposed method. In particular, our model outperforms prior methods by $7.0$\% on cross-dataset evaluations. Project page: this https URL
- [316] arXiv:2405.20031 [pdf, ps, html, other]
-
Title: Structure Gaussian SLAM with Manhattan World HypothesisSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Gaussian SLAM systems have made significant advancements in improving the efficiency and fidelity of real-time reconstructions. However, these systems often encounter incomplete reconstructions in complex indoor environments, characterized by substantial holes due to unobserved geometry caused by obstacles or limited view angles. To address this challenge, we present Manhattan Gaussian SLAM (MG-SLAM), an RGB-D system that leverages the Manhattan World hypothesis to enhance geometric accuracy and completeness. By seamlessly integrating fused line segments derived from structured scenes, MG-SLAM ensures robust tracking in textureless indoor areas. Moreover, The extracted lines and planar surface assumption allow strategic interpolation of new Gaussians in regions of missing geometry, enabling efficient scene completion. Extensive experiments conducted on both synthetic and real-world scenes demonstrate that these advancements enable our method to achieve state-of-the-art performance, marking a substantial improvement in the capabilities of Gaussian SLAM systems.
- [317] arXiv:2405.20032 [pdf, ps, html, other]
-
Title: Promptus: Can Prompts Streaming Replace Video Streaming with Stable DiffusionSubjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
With the exponential growth of video traffic, traditional video streaming systems are approaching their limits in compression efficiency and communication capacity. To further reduce bitrate while maintaining quality, we propose Promptus, a disruptive novel system that streaming prompts instead of video content with Stable Diffusion, which converts video frames into a series of "prompts" for delivery. To ensure pixel alignment, a gradient descent-based prompt fitting framework is proposed. To achieve adaptive bitrate for prompts, a low-rank decomposition-based bitrate control algorithm is introduced. For inter-frame compression of prompts, a temporal smoothing-based prompt interpolation algorithm is proposed. Evaluations across various video domains and real network traces demonstrate Promptus can enhance the perceptual quality by 0.111 and 0.092 (in LPIPS) compared to VAE and H.265, respectively, and decreases the ratio of severely distorted frames by 89.3% and 91.7%. Moreover, Promptus achieves real-time video generation from prompts at over 150 FPS. To the best of our knowledge, Promptus is the first attempt to replace video codecs with prompt inversion and the first to use prompt streaming instead of video streaming. Our work opens up a new paradigm for efficient video communication beyond the Shannon limit.
- [318] arXiv:2405.20037 [pdf, ps, html, other]
-
Title: Linguistic Landscape of Generative AI Perception: A Global Twitter Analysis Across 14 LanguagesSubjects: Computers and Society (cs.CY)
The advent of generative AI tools has had a profound impact on societies globally, transcending geographical boundaries. Understanding these tools' global reception and utilization is crucial for service providers and policymakers in shaping future policies. Therefore, to unravel the perceptions and engagements of individuals within diverse linguistic communities with regard to generative AI tools, we extensively analyzed over 6.8 million tweets in 14 different languages. Our findings reveal a global trend in the perception of generative AI, accompanied by language-specific nuances. While sentiments toward these tools vary significantly across languages, there is a prevalent positive inclination toward Image tools and a negative one toward Chat tools. Notably, the ban of ChatGPT in Italy led to a sentiment decline and initiated discussions across languages. Furthermore, we established a taxonomy for interactions with chatbots, creating a framework for social analysis underscoring variations in generative AI usage among linguistic communities. We find that the Chinese community predominantly employs chatbots as substitutes for search, while the Italian community tends to present more intricate prompts. Our research provides a robust foundation for further explorations of the social dynamics surrounding generative AI tools and offers invaluable insights for decision-makers in policy, technology, and education.
- [319] arXiv:2405.20038 [pdf, ps, html, other]
-
Title: Deep Reinforcement Learning for Intrusion Detection in IoT: A SurveyJournal-ref: 2023 2nd International Conference on Electronics, Energy and Measurement (IC2EM)Subjects: Cryptography and Security (cs.CR)
The rise of new complex attacks scenarios in Internet of things (IoT) environments necessitate more advanced and intelligent cyber defense techniques such as various Intrusion Detection Systems (IDSs) which are responsible for detecting and mitigating malicious activities in IoT networks without human intervention. To address this issue, deep reinforcement learning (DRL) has been proposed in recent years, to automatically tackle intrusions/attacks. In this paper, a comprehensive survey of DRL-based IDS on IoT is presented. Furthermore, in this survey, the state-of-the-art DRL-based IDS methods have been classified into five categories including wireless sensor network (WSN), deep Q-network (DQN), healthcare, hybrid, and other techniques. In addition, the most crucial performance metrics, namely accuracy, recall, precision, false negative rate (FNR), false positive rate (FPR), and F-measure, are detailed, in order to evaluate the performance of each proposed method. The paper provides a summary of datasets utilized in the studies as well.
- [320] arXiv:2405.20042 [pdf, ps, html, other]
-
Title: CycleFormer : TSP Solver Based on Language ModelingSubjects: Machine Learning (cs.LG)
We propose a new transformer model for the Traveling Salesman Problem (TSP) called CycleFormer. We identified distinctive characteristics that need to be considered when applying a conventional transformer model to TSP and aimed to fully incorporate these elements into the TSP-specific transformer. Unlike the token sets in typical language models, which are limited and static, the token (node) set in TSP is unlimited and dynamic. To exploit this fact to the fullest, we equated the encoder output with the decoder linear layer and directly connected the context vector of the encoder to the decoder encoding. Additionally, we added a positional encoding to the encoder tokens that reflects the two-dimensional nature of TSP, and devised a circular positional encoding for the decoder tokens that considers the cyclic properties of a tour. By incorporating these ideas, CycleFormer outperforms state-of-the-art (SOTA) transformer models for TSP from TSP-50 to TSP-500. Notably, on TSP-500, the optimality gap was reduced by approximately 2.8 times, from 3.09% to 1.10%, compared to the existing SOTA. The code will be made available at this https URL.
- [321] arXiv:2405.20044 [pdf, ps, html, other]
-
Title: A Point-Neighborhood Learning Framework for Nasal Endoscope Image SegmentationComments: 10 pages, 10 figures,Subjects: Computer Vision and Pattern Recognition (cs.CV)
The lesion segmentation on endoscopic images is challenging due to its complex and ambiguous features. Fully-supervised deep learning segmentation methods can receive good performance based on entirely pixel-level labeled dataset but greatly increase experts' labeling burden. Semi-supervised and weakly supervised methods can ease labeling burden, but heavily strengthen the learning difficulty. To alleviate this difficulty, weakly semi-supervised segmentation adopts a new annotation protocol of adding a large number of point annotation samples into a few pixel-level annotation samples. However, existing methods only mine points' limited information while ignoring reliable prior surrounding the point annotations. In this paper, we propose a weakly semi-supervised method called Point-Neighborhood Learning (PNL) framework. To mine the prior of the pixels surrounding the annotated point, we transform a single-point annotation into a circular area named a point-neighborhood. We propose point-neighborhood supervision loss and pseudo-label scoring mechanism to enhance training supervision. Point-neighborhoods are also used to augment the data diversity. Our method greatly improves performance without changing the structure of segmentation network. Comprehensive experiments show the superiority of our method over the other existing methods, demonstrating its effectiveness in point-annotated medical images. The project code will be available on: this https URL.
- [322] arXiv:2405.20045 [pdf, ps, html, other]
-
Title: Iterative Learning Control of Fast, Nonlinear, Oscillatory Dynamics (Preprint)Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
The sudden onset of deleterious and oscillatory dynamics (often called instabilities) is a known challenge in many fluid, plasma, and aerospace systems. These dynamics are difficult to address because they are nonlinear, chaotic, and are often too fast for active control schemes. In this work, we develop an alternative active controls system using an iterative, trajectory-optimization and parameter-tuning approach based on Iterative Learning Control (ILC), Time-Lagged Phase Portraits (TLPP) and Gaussian Process Regression (GPR). The novelty of this approach is that it can control a system's dynamics despite the controller being much slower than the dynamics. We demonstrate this controller on the Lorenz system of equations where it iteratively adjusts (tunes) the system's input parameters to successfully reproduce a desired oscillatory trajectory or state. Additionally, we investigate the system's dynamical sensitivity to its control parameters, identify continuous and bounded regions of desired dynamical trajectories, and demonstrate that the controller is robust to missing information and uncontrollable parameters as long as certain requirements are met. The controller presented in this work provides a framework for low-speed control for a variety of fast, nonlinear systems that may aid in instability suppression and mitigation.
- [323] arXiv:2405.20046 [pdf, ps, html, other]
-
Title: Cross-Training with Multi-View Knowledge Fusion for Heterogenous Federated LearningSubjects: Artificial Intelligence (cs.AI)
Federated learning benefits from cross-training strategies, which enables models to train on data from distinct sources to improve the generalization capability. However, the data heterogeneity between sources may lead models to gradually forget previously acquired knowledge when undergoing cross-training to adapt to new tasks or data sources. We argue that integrating personalized and global knowledge to gather information from multiple perspectives could potentially improve performance. To achieve this goal, this paper presents a novel approach that enhances federated learning through a cross-training scheme incorporating multi-view information. Specifically, the proposed method, termed FedCT, includes three main modules, where the consistency-aware knowledge broadcasting module aims to optimize model assignment strategies, which enhances collaborative advantages between clients and achieves an efficient federated learning process. The multi-view knowledge-guided representation learning module leverages fused prototypical knowledge from both global and local views to enhance the preservation of local knowledge before and after model exchange, as well as to ensure consistency between local and global knowledge. The mixup-based feature augmentation module aggregates rich information to further increase the diversity of feature spaces, which enables the model to better discriminate complex samples. Extensive experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study. The results demonstrated that FedCT alleviates knowledge forgetting from both local and global views, which enables it outperform state-of-the-art methods.
- [324] arXiv:2405.20047 [pdf, ps, html, other]
-
Title: Schubert Subspace CodesSubjects: Information Theory (cs.IT); Combinatorics (math.CO)
In this paper, we initiate the study of constant dimension subspace codes restricted to Schubert varieties, which we call Schubert subspace codes. These codes have a very natural geometric description, as objects that we call intersecting sets with respect to a fixed subspace. We provide a geometric construction of maximum size constant dimension subspace codes in some Schubert varieties with the largest possible value for the minimum subspace distance. Finally, we generalize the problem to different values of the minimum distance.
- [325] arXiv:2405.20051 [pdf, ps, html, other]
-
Title: Threshold-Independent Fair Matching through Score CalibrationSubjects: Machine Learning (cs.LG); Databases (cs.DB)
Entity Matching (EM) is a critical task in numerous fields, such as healthcare, finance, and public administration, as it identifies records that refer to the same entity within or across different databases. EM faces considerable challenges, particularly with false positives and negatives. These are typically addressed by generating matching scores and apply thresholds to balance false positives and negatives in various contexts. However, adjusting these thresholds can affect the fairness of the outcomes, a critical factor that remains largely overlooked in current fair EM research. The existing body of research on fair EM tends to concentrate on static thresholds, neglecting their critical impact on fairness. To address this, we introduce a new approach in EM using recent metrics for evaluating biases in score based binary classification, particularly through the lens of distributional parity. This approach enables the application of various bias metrics like equalized odds, equal opportunity, and demographic parity without depending on threshold settings. Our experiments with leading matching methods reveal potential biases, and by applying a calibration technique for EM scores using Wasserstein barycenters, we not only mitigate these biases but also preserve accuracy across real world datasets. This paper contributes to the field of fairness in data cleaning, especially within EM, which is a central task in data cleaning, by promoting a method for generating matching scores that reduce biases across different thresholds.
- [326] arXiv:2405.20053 [pdf, ps, html, other]
-
Title: Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference HeadsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.
- [327] arXiv:2405.20055 [pdf, ps, html, other]
-
Title: Hypergraph-Aided Task-Resource Matching for Maximizing Value of Task Completion in Collaborative IoT SystemsComments: This paper has been published in IEEE Transactions on Mobile Computing, May 2024Subjects: Systems and Control (eess.SY)
With the growing scale and intrinsic heterogeneity of Internet of Things (IoT) systems, distributed device collaboration becomes essential for effective task completion by dynamically utilizing limited communication and computing resources. However, the separated design and situation-agnostic operation of computing, communication and application layers create a fundamental challenge for rapid task-resource matching, which further deteriorate the overall task completion effectiveness. To overcome this challenge, we utilize hypergraph as a new tool to vertically unify computing, communication, and task aspects of IoT systems for an effective matching by accurately capturing the relationships between tasks and communication and computing resources. Specifically, a state-of-the-art task-resource matching hypergraph (TRM-hypergraph) model is proposed in this paper, which is used to effectively transform the process of allocating complex heterogeneous resources to convoluted tasks into a hypergraph matching problem. Taking into account computational complexity and storage, a game-theoretic hypergraph matching algorithm is proposed via considering the hypergraph matching problem as a non-cooperative multi-player clustering game. Numerical results demonstrate that the proposed TRM-hypergraph model achieves superior performance in matching of tasks and resources compared with comparison algorithms.
- [328] arXiv:2405.20057 [pdf, ps, other]
-
Title: A General Automata Model for First-Order Temporal Logics (Extended Version)Subjects: Logic in Computer Science (cs.LO)
First-order linear temporal logic (FOLTL) is a flexible and expressive formalism capable of naturally describing complex behaviors and properties. Although the logic is in general highly undecidable, the idea of using it as a specification language for the verification of complex infinite-state systems is appealing. However, a missing piece, which has proved to be an invaluable tool in dealing with other temporal logics, is an automaton model capable of capturing the logic. In this paper we address this issue, by defining and studying such a model, which we call first-order automaton. We define this very general class of automata, and the corresponding notion of regular first-order language, showing their closure under most common language-theoretic operations. We show how they can capture any FOLTL formula over any signature and theory, and provide sufficient conditions for the semi-decidability of their non-emptiness problem. Then, to show the usefulness of the formalism, we prove the decidability of monodic FOLTL, a classic result known in the literature, with a simpler and direct proof.
- [329] arXiv:2405.20058 [pdf, ps, html, other]
-
Title: Enhancing Plant Disease Detection: A Novel CNN-Based Approach with Tensor Subspace Learning and HOWSVD-MDAbdelmalik Ouamane, Ammar Chouchane, Yassine Himeur, Abderrazak Debilou, Abbes Amira, Shadi Atalla, Wathiq Mansoor, Hussain Al AhmadComments: 17 pages, 9 figures and 8 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Machine learning has revolutionized the field of agricultural science, particularly in the early detection and management of plant diseases, which are crucial for maintaining crop health and productivity. Leveraging advanced algorithms and imaging technologies, researchers are now able to identify and classify plant diseases with unprecedented accuracy and speed. Effective management of tomato diseases is crucial for enhancing agricultural productivity. The development and application of tomato disease classification methods are central to this objective. This paper introduces a cutting-edge technique for the detection and classification of tomato leaf diseases, utilizing insights from the latest pre-trained Convolutional Neural Network (CNN) models. We propose a sophisticated approach within the domain of tensor subspace learning, known as Higher-Order Whitened Singular Value Decomposition (HOWSVD), designed to boost the discriminatory power of the system. Our approach to Tensor Subspace Learning is methodically executed in two phases, beginning with HOWSVD and culminating in Multilinear Discriminant Analysis (MDA). The efficacy of this innovative method was rigorously tested through comprehensive experiments on two distinct datasets, namely PlantVillage and the Taiwan dataset. The findings reveal that HOWSVD-MDA outperforms existing methods, underscoring its capability to markedly enhance the precision and dependability of diagnosing tomato leaf diseases. For instance, up to 98.36\% and 89.39\% accuracy scores have been achieved under PlantVillage and the Taiwan datasets, respectively.
- [330] arXiv:2405.20059 [pdf, ps, html, other]
-
Title: Spectral Mapping of Singing Voices: U-Net-Assisted Vocal SegmentationSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study tackles the distinct separation of vocal components from musical spectrograms. We employ the Short Time Fourier Transform (STFT) to extract audio waves into detailed frequency-time spectrograms, utilizing the benchmark MUSDB18 dataset for music separation. Subsequently, we implement a UNet neural network to segment the spectrogram image, aiming to delineate and extract singing voice components accurately. We achieved noteworthy results in audio source separation using of our U-Net-based models. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating a high level of accuracy in preserving the quality of the original signal during separation. This setup also recorded impressive Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) scores of 25.2 dB and 7.2 dB, respectively. These values significantly outperformed other configurations, particularly those using Quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material can be found at the project's GitHub repository: this https URL
- [331] arXiv:2405.20062 [pdf, ps, html, other]
-
Title: Can the accuracy bias by facial hairstyle be reduced through balancing the training data?Subjects: Computer Vision and Pattern Recognition (cs.CV)
Appearance of a face can be greatly altered by growing a beard and mustache. The facial hairstyles in a pair of images can cause marked changes to the impostor distribution and the genuine distribution. Also, different distributions of facial hairstyle across demographics could cause a false impression of relative accuracy across demographics. We first show that, even though larger training sets boost the recognition accuracy on all facial hairstyles, accuracy variations caused by facial hairstyles persist regardless of the size of the training set. Then, we analyze the impact of having different fractions of the training data represent facial hairstyles. We created balanced training sets using a set of identities available in Webface42M that both have clean-shaven and facial hair images. We find that, even when a face recognition model is trained with a balanced clean-shaven / facial hair training set, accuracy variation on the test data does not diminish. Next, data augmentation is employed to further investigate the effect of facial hair distribution in training data by manipulating facial hair pixels with the help of facial landmark points and a facial hair segmentation model. Our results show facial hair causes an accuracy gap between clean-shaven and facial hair images, and this impact can be significantly different between African-Americans and Caucasians.
- [332] arXiv:2405.20065 [pdf, ps, html, other]
-
Title: Variationally Correct Neural Residual Regression for Parametric PDEs: On the Viability of Controlled AccuracySubjects: Numerical Analysis (math.NA)
This paper is about learning the parameter-to-solution map for systems of partial differential equations (PDEs) that depend on a potentially large number of parameters covering all PDE types for which a stable variational formulation (SVF) can be found. A central constituent is the notion of variationally correct residual loss function meaning that its value is always uniformly proportional to the squared solution error in the norm determined by the SVF, hence facilitating rigorous a posteriori accuracy control. It is based on a single variational problem, associated with the family of parameter dependent fiber problems, employing the notion of direct integrals of Hilbert spaces. Since in its original form the loss function is given as a dual test norm of the residual a central objective is to develop equivalent computable expressions. A first critical role is played by hybrid hypothesis classes, whose elements are piecewise polynomial in (low-dimensional) spatio-temporal variables with parameter-dependent coefficients that can be represented, e.g. by neural networks. Second, working with first order SVFs, we distinguish two scenarios: (i) the test space can be chosen as an $L_2$-space (e.g. for elliptic or parabolic problems) so that residuals live in $L_2$ and can be evaluated directly; (ii) when trial and test spaces for the fiber problems (e.g. for transport equations) depend on the parameters, we use ultraweak formulations. In combination with Discontinuous Petrov Galerkin concepts the hybrid format is then instrumental to arrive at variationally correct computable residual loss functions. Our findings are illustrated by numerical experiments representing (i) and (ii), namely elliptic boundary value problems with piecewise constant diffusion coefficients and pure transport equations with parameter dependent convection field.
- [333] arXiv:2405.20067 [pdf, ps, html, other]
-
Title: N-Dimensional Gaussians for Fitting of High Dimensional FunctionsSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
In the wake of many new ML-inspired approaches for reconstructing and representing high-quality 3D content, recent hybrid and explicitly learned representations exhibit promising performance and quality characteristics. However, their scaling to higher dimensions is challenging, e.g. when accounting for dynamic content with respect to additional parameters such as material properties, illumination, or time. In this paper, we tackle these challenges for an explicit representations based on Gaussian mixture models. With our solutions, we arrive at efficient fitting of compact N-dimensional Gaussian mixtures and enable efficient evaluation at render time: For fast fitting and evaluation, we introduce a high-dimensional culling scheme that efficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For adaptive refinement yet compact representation, we introduce a loss-adaptive density control scheme that incrementally guides the use of additional capacity towards missing details. With these tools we can for the first time represent complex appearance that depends on many input dimensions beyond position or viewing angle within a compact, explicit representation optimized in minutes and rendered in milliseconds.
- [334] arXiv:2405.20072 [pdf, ps, html, other]
-
Title: Faces of the Mind: Unveiling Mental Health States Through Facial Expressions in 11,427 AdolescentsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Mood disorders, including depression and anxiety, often manifest through facial expressions. While previous research has explored the connection between facial features and emotions, machine learning algorithms for estimating mood disorder severity have been hindered by small datasets and limited real-world application. To address this gap, we analyzed facial videos of 11,427 participants, a dataset two orders of magnitude larger than previous studies. This comprehensive collection includes standardized facial expression videos from reading tasks, along with a detailed psychological scale that measures depression, anxiety, and stress. By examining the relationships among these emotional states and employing clustering analysis, we identified distinct subgroups embodying different emotional profiles. We then trained tree-based classifiers and deep learning models to estimate emotional states from facial features. Results indicate that models previously effective on small datasets experienced decreased performance when applied to our large dataset, highlighting the importance of data scale and mitigating overfitting in practical settings. Notably, our study identified subtle shifts in pupil dynamics and gaze orientation as potential markers of mood disorders, providing valuable information on the interaction between facial expressions and mental health. This research marks the first large-scale and comprehensive investigation of facial expressions in the context of mental health, laying the groundwork for future data-driven advancements in this field.
- [335] arXiv:2405.20073 [pdf, ps, html, other]
-
Title: Power Allocation for Cell-Free Massive MIMO ISAC Systems with OTFS SignalComments: This work is submitted to IEEE for possible publicationSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Applying integrated sensing and communication (ISAC) to a cell-free massive multiple-input multiple-output (CF mMIMO) architecture has attracted increasing attention. This approach equips CF mMIMO networks with sensing capabilities and resolves the problem of unreliable service at cell edges in conventional cellular networks. However, existing studies on CF-ISAC systems have focused on the application of traditional integrated signals. To address this limitation, this study explores the employment of the orthogonal time frequency space (OTFS) signal as a representative of innovative signals in the CF-ISAC system, and the system's overall performance is optimized and evaluated. A universal downlink spectral efficiency (SE) expression is derived regarding multi-antenna access points (APs) and optional sensing beams. To streamline the analysis and optimization of the CF-ISAC system with the OTFS signal, we introduce a lower bound on the achievable SE that is applicable to OTFS-signal-based systems. Based on this, a power allocation algorithm is proposed to maximize the minimum communication signal-to-interference-plus-noise ratio (SINR) of users while guaranteeing a specified sensing SINR value and meeting the per-AP power constraints. The results demonstrate the tightness of the proposed lower bound and the efficiency of the proposed algorithm. Finally, the superiority of using the OTFS signals is verified by a 13-fold expansion of the SE performance gap over the application of orthogonal frequency division multiplexing signals. These findings could guide the future deployment of the CF-ISAC systems, particularly in the field of millimeter waves with a large bandwidth.
- [336] arXiv:2405.20078 [pdf, ps, other]
-
Title: NeRF View Synthesis: Subjective Quality Assessment and Objective Metrics EvaluationSubjects: Multimedia (cs.MM)
Neural radiance fields (NeRF) are a groundbreaking computer vision technology that enables the generation of high-quality, immersive visual content from multiple viewpoints. This capability holds significant advantages for applications such as virtual/augmented reality, 3D modelling and content creation for the film and entertainment industry. However, the evaluation of NeRF methods poses several challenges, including a lack of comprehensive datasets, reliable assessment methodologies, and objective quality metrics. This paper addresses the problem of NeRF quality assessment thoroughly, by conducting a rigorous subjective quality assessment test that considers several scene classes and recently proposed NeRF view synthesis methods. Additionally, the performance of a wide range of state-of-the-art conventional and learning-based full-reference 2D image and video quality assessment metrics is evaluated against the subjective scores of the subjective study. The experimental results are analyzed in depth, providing a comparative evaluation of several NeRF methods and objective quality metrics, across different classes of visual scenes, including real and synthetic content for front-face and 360-degree camera trajectories.
- [337] arXiv:2405.20079 [pdf, ps, html, other]
-
Title: Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language LearningElena Grazia Gado, Tommaso Martorella, Luca Zunino, Paola Mejia-Domenzain, Vinitra Swamy, Jibril Frej, Tanja KäserComments: Accepted as a poster paper at EDM 2024: 17th International Conference on Educational Data Mining in Atlanta, USASubjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Intelligent Tutoring Systems (ITS) enhance personalized learning by predicting student answers to provide immediate and customized instruction. However, recent research has primarily focused on the correctness of the answer rather than the student's performance on specific answer choices, limiting insights into students' thought processes and potential misconceptions. To address this gap, we present MCQStudentBert, an answer forecasting model that leverages the capabilities of Large Language Models (LLMs) to integrate contextual understanding of students' answering history along with the text of the questions and answers. By predicting the specific answer choices students are likely to make, practitioners can easily extend the model to new answer choices or remove answer choices for the same multiple-choice question (MCQ) without retraining the model. In particular, we compare MLP, LSTM, BERT, and Mistral 7B architectures to generate embeddings from students' past interactions, which are then incorporated into a finetuned BERT's answer-forecasting mechanism. We apply our pipeline to a dataset of language learning MCQ, gathered from an ITS with over 10,000 students to explore the predictive accuracy of MCQStudentBert, which incorporates student interaction patterns, in comparison to correct answer prediction and traditional mastery-learning feature-based approaches. This work opens the door to more personalized content, modularization, and granular support.
- [338] arXiv:2405.20081 [pdf, ps, html, other]
-
Title: NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language ModelsKai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, Chengjie WangComments: updatingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Multimodal large language models (MLLMs) contribute a powerful mechanism to understanding visual information building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating lengthy, detailed descriptions for images. Our analysis reveals that hallucinations stem from the inherent summarization mechanism of large language models, leading to excessive dependence on linguistic tokens while neglecting vision information. In this paper, we propose NoiseBoost, a broadly applicable and simple method for alleviating hallucinations for MLLMs through the integration of noise feature perturbations. Noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights among visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Further, NoiseBoost pioneerly enables semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% with human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at this https URL.
- [339] arXiv:2405.20082 [pdf, ps, html, other]
-
Title: Segment, Shuffle, and Stitch: A Simple Mechanism for Improving Time-Series RepresentationsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Existing approaches for learning representations of time-series keep the temporal arrangement of the time-steps intact with the presumption that the original order is the most optimal for learning. However, non-adjacent sections of real-world time-series may have strong dependencies. Accordingly we raise the question: Is there an alternative arrangement for time-series which could enable more effective representation learning? To address this, we propose a simple plug-and-play mechanism called Segment, Shuffle, and Stitch (S3) designed to improve time-series representation learning of existing models. S3 works by creating non-overlapping segments from the original sequence and shuffling them in a learned manner that is the most optimal for the task at hand. It then re-attaches the shuffled segments back together and performs a learned weighted sum with the original input to capture both the newly shuffled sequence along with the original sequence. S3 is modular and can be stacked to create various degrees of granularity, and can be added to many forms of neural architectures including CNNs or Transformers with negligible computation overhead. Through extensive experiments on several datasets and state-of-the-art baselines, we show that incorporating S3 results in significant improvements for the tasks of time-series classification and forecasting, improving performance on certain datasets by up to 68\%. We also show that S3 makes the learning more stable with a smoother training loss curve and loss landscape compared to the original baseline. The code is available at this https URL .
- [340] arXiv:2405.20083 [pdf, ps, other]
-
Title: Tachis: Higher-Order Separation Logic with Credits for Expected CostsPhilipp G. Haselwarter, Kwing Hei Li, Markus de Medeiros, Simon Oddershede Gregersen, Alejandro Aguirre, Joseph Tassarotti, Lars BirkedalSubjects: Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
We present Tachis, a higher-order separation logic to reason about the expected cost of probabilistic programs. Inspired by the uses of time credits for reasoning about the running time of deterministic programs, we introduce a novel notion of probabilistic cost credit. Probabilistic cost credits are a separation logic resource that can be used to pay for the cost of operations in programs, and that can be distributed across all possible branches of sampling instructions according to their weight, thus enabling us to reason about expected cost. The representation of cost credits as separation logic resources gives Tachis a great deal of flexibility and expressivity. In particular, it permits reasoning about amortized expected cost by storing excess credits as potential into data structures to pay for future operations. Tachis further supports a range of cost models, including running time and entropy usage. We showcase the versatility of this approach by applying our techniques to prove upper bounds on the expected cost of a variety of probabilistic algorithms and data structures, including randomized quicksort, hash tables, and meldable heaps.
All of our results have been mechanized using Coq, Iris, and the Coquelicot real analysis library. - [341] arXiv:2405.20084 [pdf, ps, html, other]
-
Title: Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation ApproachComments: 15 pages (with references)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Human pose estimation is a key task in computer vision with various applications such as activity recognition and interactive systems. However, the lack of consistency in the annotated skeletons across different datasets poses challenges in developing universally applicable models. To address this challenge, we propose a novel approach integrating multi-teacher knowledge distillation with a unified skeleton representation. Our networks are jointly trained on the COCO and MPII datasets, containing 17 and 16 keypoints, respectively. We demonstrate enhanced adaptability by predicting an extended set of 21 keypoints, 4 (COCO) and 5 (MPII) more than original annotations, improving cross-dataset generalization. Our joint models achieved an average accuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a single dataset and evaluated on both. Moreover, we also evaluate all 21 predicted points by our two models by reporting an AP of 66.84 and 72.75 on the Halpe dataset. This highlights the potential of our technique to address one of the most pressing challenges in pose estimation research and application - the inconsistency in skeletal annotations.
- [342] arXiv:2405.20085 [pdf, ps, html, other]
-
Title: Soft Partitioning of Latent Space for Semantic Channel EqualizationSubjects: Machine Learning (cs.LG); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Semantic channel equalization has emerged as a solution to address language mismatch in multi-user semantic communications. This approach aims to align the latent spaces of an encoder and a decoder which were not jointly trained and it relies on a partition of the semantic (latent) space into atoms based on the the semantic meaning. In this work we explore the role of the semantic space partition in scenarios where the task structure involves a one-to-many mapping between the semantic space and the action space. In such scenarios, partitioning based on hard inference results results in loss of information which degrades the equalization performance. We propose a soft criterion to derive the atoms of the partition which leverages the soft decoder's output and offers a more comprehensive understanding of the semantic space's structure. Through empirical validation, we demonstrate that soft partitioning yields a more descriptive and regular partition of the space, consequently enhancing the performance of the equalization algorithm.
- [343] arXiv:2405.20089 [pdf, ps, html, other]
-
Title: The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM AbilitiesComments: Accepted to ACL 2024 (long, main)Subjects: Computation and Language (cs.CL)
Fine-tuning large language models (LLMs) for machine translation has shown improvements in overall translation quality. However, it is unclear what is the impact of fine-tuning on desirable LLM behaviors that are not present in neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to produce less literal translations. We perform an extensive translation evaluation on the LLaMA and Falcon family of models with model size ranging from 7 billion up to 65 billion parameters. Our results show that while fine-tuning improves the general translation quality of LLMs, several abilities degrade. In particular, we observe a decline in the ability to perform formality steering, to produce technical translations through few-shot examples, and to perform document-level translation. On the other hand, we observe that the model produces less literal translations after fine-tuning on parallel data. We show that by including monolingual data as part of the fine-tuning data we can maintain the abilities while simultaneously enhancing overall translation quality. Our findings emphasize the need for fine-tuning strategies that preserve the benefits of LLMs for machine translation.
- [344] arXiv:2405.20090 [pdf, ps, html, other]
-
Title: Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Following the advent of the Artificial Intelligence (AI) era of large models, Multimodal Large Language Models (MLLMs) with the ability to understand cross-modal interactions between vision and text have attracted wide attention. Adversarial examples with human-imperceptible perturbation are shown to possess a characteristic known as transferability, which means that a perturbation generated by one model could also mislead another different model. Augmenting the diversity in input data is one of the most significant methods for enhancing adversarial transferability. This method has been certified as a way to significantly enlarge the threat impact under black-box conditions. Research works also demonstrate that MLLMs can be exploited to generate adversarial examples in the white-box scenario. However, the adversarial transferability of such perturbations is quite limited, failing to achieve effective black-box attacks across different models. In this paper, we propose the Typographic-based Semantic Transfer Attack (TSTA), which is inspired by: (1) MLLMs tend to process semantic-level information; (2) Typographic Attack could effectively distract the visual information captured by MLLMs. In the scenarios of Harmful Word Insertion and Important Information Protection, our TSTA demonstrates superior performance.
- [345] arXiv:2405.20091 [pdf, ps, html, other]
-
Title: Visual Attention Analysis in Online LearningComments: Accepted in CEDI 2024 (VII Congreso Español de Informática), A Coruña, SpainSubjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD (an acronym for Visual Attention Analysis Dashboard). These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.
- [346] arXiv:2405.20092 [pdf, ps, html, other]
-
Title: Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code GenerationSubjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity and leverage self-tests to refine the generated program. Yet, planning deep-inside requirements in advance can be challenging, and the tests need to be accurate to accomplish self-improvement. To this end, we propose FunCoder, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FunCoder recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composited to attain more complex objectives. Additionally, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: With FunCoder, StableCode-3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and the functional consensus prevails over self-testing in correctness evaluation.
- [347] arXiv:2405.20093 [pdf, ps, html, other]
-
Title: Rapid Wildfire Hotspot Detection Using Self-Supervised Learning on Temporal Remote Sensing DataSubjects: Computer Vision and Pattern Recognition (cs.CV)
Rapid detection and well-timed intervention are essential to mitigate the impacts of wildfires. Leveraging remote sensed data from satellite networks and advanced AI models to automatically detect hotspots (i.e., thermal anomalies caused by active fires) is an effective way to build wildfire monitoring systems. In this work, we propose a novel dataset containing time series of remotely sensed data related to European fire events and a Self-Supervised Learning (SSL)-based model able to analyse multi-temporal data and identify hotspots in potentially near real time. We train and evaluate the performance of our model using our dataset and Thraws, a dataset of thermal anomalies including several fire events, obtaining an F1 score of 63.58.
- [348] arXiv:2405.20094 [pdf, ps, html, other]
-
Title: Low-dimensional approximations of the conditional law of Volterra processes: a non-positive curvature approachComments: Main body: 25 Pages, Appendices 29 Pages, 14 Tables, 6 FiguresSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Differential Geometry (math.DG); Computational Finance (q-fin.CP)
Predicting the conditional evolution of Volterra processes with stochastic volatility is a crucial challenge in mathematical finance. While deep neural network models offer promise in approximating the conditional law of such processes, their effectiveness is hindered by the curse of dimensionality caused by the infinite dimensionality and non-smooth nature of these problems. To address this, we propose a two-step solution. Firstly, we develop a stable dimension reduction technique, projecting the law of a reasonably broad class of Volterra process onto a low-dimensional statistical manifold of non-positive sectional curvature. Next, we introduce a sequentially deep learning model tailored to the manifold's geometry, which we show can approximate the projected conditional law of the Volterra process. Our model leverages an auxiliary hypernetwork to dynamically update its internal parameters, allowing it to encode non-stationary dynamics of the Volterra process, and it can be interpreted as a gating mechanism in a mixture of expert models where each expert is specialized at a specific point in time. Our hypernetwork further allows us to achieve approximation rates that would seemingly only be possible with very large networks.
- [349] arXiv:2405.20099 [pdf, ps, html, other]
-
Title: Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak AttacksSubjects: Cryptography and Security (cs.CR)
Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed interpretable suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and interpretable solution applicable to various LLM platforms.
- [350] arXiv:2405.20100 [pdf, ps, html, other]
-
Title: Dynamic Slack BusSubjects: Systems and Control (eess.SY)
This letter proposes a general dynamic formulation of slack bus. With this aim, the angle constraint imposed by the slack bus is redefined as a set of differential equations and an energy source. The existence and role of the transient component of this source is also discussed in the letter. Based on this framework, the letter shows that the swing equations of synchronous machines can be interpreted as distributed, dynamic, multi-variable, local slack buses. Other relevant cases, including primary and secondary frequency regulation, passive loads as well as grid following and grid forming converters are discussed.
- [351] arXiv:2405.20101 [pdf, ps, html, other]
-
Title: Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech InpaintingSubjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Most speech self-supervised learning (SSL) models are trained with a pretext task which consists in predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). Learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is reconstructing a missing portion of a speech signal from its surrounding context, i.e., fulfilling a downstream task that is very similar to the pretext task. To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input, by freezing one and fine-tuning the other, and vice versa. Performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., the position of the mask is known or unknown, respectively), with different objective metrics and a perceptual evaluation. Performances show that if both solutions allow to correctly reconstruct signal portions up to the size of 200ms (and even 400ms in some cases), fine-tuning the SSL encoder provides a more accurate signal reconstruction in the single-speaker setting case, while freezing it (and training the neural vocoder instead) is a better strategy when dealing with multi-speaker data.
- [352] arXiv:2405.20104 [pdf, ps, html, other]
-
Title: Object-centric Reconstruction and Tracking of Dynamic Unknown Objects using 3D Gaussian SplattingComments: Accepted at IEEE International Conference on Space Robotics 2024Subjects: Robotics (cs.RO)
Generalizable perception is one of the pillars of high-level autonomy in space robotics. Estimating the structure and motion of unknown objects in dynamic environments is fundamental for such autonomous systems. Traditionally, the solutions have relied on prior knowledge of target objects, multiple disparate representations, or low-fidelity outputs unsuitable for robotic operations. This work proposes a novel approach to incrementally reconstruct and track a dynamic unknown object using a unified representation -- a set of 3D Gaussian blobs that describe its geometry and appearance. The differentiable 3D Gaussian Splatting framework is adapted to a dynamic object-centric setting. The input to the pipeline is a sequential set of RGB-D images. 3D reconstruction and 6-DoF pose tracking tasks are tackled using first-order gradient-based optimization. The formulation is simple, requires no pre-training, assumes no prior knowledge of the object or its motion, and is suitable for online applications. The proposed approach is validated on a dataset of 10 unknown spacecraft of diverse geometry and texture under arbitrary relative motion. The experiments demonstrate successful 3D reconstruction and accurate 6-DoF tracking of the target object in proximity operations over a short to medium duration. The causes of tracking drift are discussed and potential solutions are outlined.
- [353] arXiv:2405.20109 [pdf, ps, html, other]
-
Title: FMARS: Annotating Remote Sensing Images for Disaster Management using Foundation ModelsEdoardo Arnaudo, Jacopo Lungo Vaschetti, Lorenzo Innocenti, Luca Barco, Davide Lisi, Vanina Fissore, Claudio RossiComments: Accepted at IGARSS 2024, 5 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Very-High Resolution (VHR) remote sensing imagery is increasingly accessible, but often lacks annotations for effective machine learning applications. Recent foundation models like GroundingDINO and Segment Anything (SAM) provide opportunities to automatically generate annotations. This study introduces FMARS (Foundation Model Annotations in Remote Sensing), a methodology leveraging VHR imagery and foundation models for fast and robust annotation. We focus on disaster management and provide a large-scale dataset with labels obtained from pre-event imagery over 19 disaster events, derived from the Maxar Open Data initiative. We train segmentation models on the generated labels, using Unsupervised Domain Adaptation (UDA) techniques to increase transferability to real-world scenarios. Our results demonstrate the effectiveness of leveraging foundation models to automatically annotate remote sensing data at scale, enabling robust downstream models for critical applications. Code and dataset are available at \url{this https URL}.
- [354] arXiv:2405.20110 [pdf, ps, other]
-
Title: Autonomous programmable microscopic electronic lablets optimized with digital controlComments: This article was originally submitted (2016) for review as one of a number of preprints as supporting information for the final review of the EU MICREAgents Project # 318671 (2012-2016). Here it is presented in slightly revised formSubjects: Robotics (cs.RO); Materials Science (cond-mat.mtrl-sci); Instrumentation and Detectors (physics.ins-det)
Lablets are autonomous microscopic particles with programmable CMOS electronics that can control electrokinetic phenomena and electrochemical reactions in solution via actuator and sensor microelectrodes. In this paper, we describe the design and fabrication of optimized singulated lablets (CMOS3) with dimensions 140x140x50 micrometers carrying an integrated coplanar encapsulated supercapacitor as a rechargeable power supply. The lablets are designed to allow docking to one another or to a smart surface for interchange of energy, electronic information, and chemicals. The paper focusses on the digital and analog design of the lablets to allow significant programmable functionality in a microscopic footprint, including the control of autonomous actuation and sensing up to the level of being able to support a complete lablet self-reproduction life cycle, although experimentally this remains to be proven. The potential of lablets in autonomous sensing and control and for evolutionary experimentation are discussed.
- [355] arXiv:2405.20112 [pdf, ps, html, other]
-
Title: RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
The rapid advances in generative AI models have empowered the creation of highly realistic images with arbitrary content, raising concerns about potential misuse and harm, such as Deepfakes. Current research focuses on training detectors using large datasets of generated images. However, these training-based solutions are often computationally expensive and show limited generalization to unseen generated images. In this paper, we propose a training-free method to distinguish between real and AI-generated images. We first observe that real images are more robust to tiny noise perturbations than AI-generated images in the representation space of vision foundation models. Based on this observation, we propose RIGID, a training-free and model-agnostic method for robust AI-generated image detection. RIGID is a simple yet effective approach that identifies whether an image is AI-generated by comparing the representation similarity between the original and the noise-perturbed counterpart. Our evaluation on a diverse set of AI-generated images and benchmarks shows that RIGID significantly outperforms existing trainingbased and training-free detectors. In particular, the average performance of RIGID exceeds the current best training-free method by more than 25%. Importantly, RIGID exhibits strong generalization across different image generation methods and robustness to image corruptions.
- [356] arXiv:2405.20114 [pdf, ps, other]
-
Title: Near Optimal Decentralized Optimization with Compression and Momentum TrackingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challenging problem by developing algorithms with compressed communication for decentralized non-convex optimization problems. Despite considerable efforts, the current results suffer from various issues such as non-scalability with the number of clients, requirements for large batches, or bounded gradient assumption. In this paper, we introduce MoTEF, a novel approach that integrates communication compression with Momentum Tracking and Error Feedback. Our analysis demonstrates that MoTEF achieves most of the desired properties, and significantly outperforms existing methods under arbitrary data heterogeneity. We provide numerical experiments to validate our theoretical findings and confirm the practical superiority of MoTEF.
- [357] arXiv:2405.20117 [pdf, ps, html, other]
-
Title: Infinite 3D Landmarks: Improving Continuous 2D Facial Landmark DetectionComments: 12 pages, 13 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
In this paper, we examine 3 important issues in the practical use of state-of-the-art facial landmark detectors and show how a combination of specific architectural modifications can directly improve their accuracy and temporal stability. First, many facial landmark detectors require face normalization as a preprocessing step, which is accomplished by a separately-trained neural network that crops and resizes the face in the input image. There is no guarantee that this pre-trained network performs the optimal face normalization for landmark detection. We instead analyze the use of a spatial transformer network that is trained alongside the landmark detector in an unsupervised manner, and jointly learn optimal face normalization and landmark detection. Second, we show that modifying the output head of the landmark predictor to infer landmarks in a canonical 3D space can further improve accuracy. To convert the predicted 3D landmarks into screen-space, we additionally predict the camera intrinsics and head pose from the input image. As a side benefit, this allows to predict the 3D face shape from a given image only using 2D landmarks as supervision, which is useful in determining landmark visibility among other things. Finally, when training a landmark detector on multiple datasets at the same time, annotation inconsistencies across datasets forces the network to produce a suboptimal average. We propose to add a semantic correction network to address this issue. This additional lightweight neural network is trained alongside the landmark detector, without requiring any additional supervision. While the insights of this paper can be applied to most common landmark detectors, we specifically target a recently-proposed continuous 2D landmark detector to demonstrate how each of our additions leads to meaningful improvements over the state-of-the-art on standard benchmarks.
- [358] arXiv:2405.20118 [pdf, ps, html, other]
-
Title: Assistance-Seeking in Human-Supervised Autonomy: Role of Trust and Secondary Task Engagement (Extended Version)Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Using a dual-task paradigm, we explore how robot actions, performance, and the introduction of a secondary task influence human trust and engagement. In our study, a human supervisor simultaneously engages in a target-tracking task while supervising a mobile manipulator performing an object collection task. The robot can either autonomously collect the object or ask for human assistance. The human supervisor also has the choice to rely upon or interrupt the robot. Using data from initial experiments, we model the dynamics of human trust and engagement using a linear dynamical system (LDS). Furthermore, we develop a human action model to define the probability of human reliance on the robot. Our model suggests that participants are more likely to interrupt the robot when their trust and engagement are low during high-complexity collection tasks. Using Model Predictive Control (MPC), we design an optimal assistance-seeking policy. Evaluation experiments demonstrate the superior performance of the MPC policy over the baseline policy for most participants.
- [359] arXiv:2405.20121 [pdf, ps, html, other]
-
Title: A Structure-Aware Lane Graph Transformer Model for Vehicle Trajectory PredictionSubjects: Artificial Intelligence (cs.AI)
Accurate prediction of future trajectories for surrounding vehicles is vital for the safe operation of autonomous vehicles. This study proposes a Lane Graph Transformer (LGT) model with structure-aware capabilities. Its key contribution lies in encoding the map topology structure into the attention mechanism. To address variations in lane information from different directions, four Relative Positional Encoding (RPE) matrices are introduced to capture the local details of the map topology structure. Additionally, two Shortest Path Distance (SPD) matrices are employed to capture distance information between two accessible lanes. Numerical results indicate that the proposed LGT model achieves a significantly higher prediction performance on the Argoverse 2 dataset. Specifically, the minFDE$_6$ metric was decreased by 60.73% compared to the Argoverse 2 baseline model (Nearest Neighbor) and the b-minFDE$_6$ metric was reduced by 2.65% compared to the baseline LaneGCN model. Furthermore, ablation experiments demonstrated that the consideration of map topology structure led to a 4.24% drop in the b-minFDE$_6$ metric, validating the effectiveness of this model.
- [360] arXiv:2405.20126 [pdf, ps, html, other]
-
Title: Federated and Transfer Learning for Cancer Detection Based on Image AnalysisSubjects: Computer Vision and Pattern Recognition (cs.CV)
This review article discusses the roles of federated learning (FL) and transfer learning (TL) in cancer detection based on image analysis. These two strategies powered by machine learning have drawn a lot of attention due to their potential to increase the precision and effectiveness of cancer diagnosis in light of the growing importance of machine learning techniques in cancer detection. FL enables the training of machine learning models on data distributed across multiple sites without the need for centralized data sharing, while TL allows for the transfer of knowledge from one task to another. A comprehensive assessment of the two methods, including their strengths, and weaknesses is presented. Moving on, their applications in cancer detection are discussed, including potential directions for the future. Finally, this article offers a thorough description of the functions of TL and FL in image-based cancer detection. The authors also make insightful suggestions for additional study in this rapidly developing area.
- [361] arXiv:2405.20130 [pdf, ps, other]
-
Title: LinApart: optimizing the univariate partial fraction decompositionComments: 22 pages, 5 figuresSubjects: Symbolic Computation (cs.SC); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th)
We present LinApart, a routine designed for efficiently performing the univariate partial fraction decomposition of large symbolic expressions. Our method is based on an explicit closed formula for the decomposition of rational functions with fully factorized denominators. We provide implementations in both the Wolfram Mathematica and C languages, made available at this https URL . The routine can provide very significant performance gains over available tools such as the Apart command in Mathematica.
- [362] arXiv:2405.20131 [pdf, ps, html, other]
-
Title: Language Models Need Inductive Biases to Count InductivelySubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, investigating length generalization does occur throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automata. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures ranging from RNNs, Transformers, State-Space Models and RWKV. We present carefully-designed task formats, auxiliary tasks and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose merits of a recurrent nature.
- [363] arXiv:2405.20132 [pdf, ps, html, other]
-
Title: LLaMEA: A Large Language Model Evolutionary Algorithm for Automatically Generating MetaheuristicsComments: Submitted to IEEE TEVCSubjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) such as GPT-4 have demonstrated their ability to understand natural language and generate complex code snippets. This paper introduces a novel Large Language Model Evolutionary Algorithm (LLaMEA) framework, leveraging GPT models for the automated generation and refinement of algorithms. Given a set of criteria and a task definition (the search space), LLaMEA iteratively generates, mutates and selects algorithms based on performance metrics and feedback from runtime evaluations. This framework offers a unique approach to generating optimized algorithms without requiring extensive prior expertise. We show how this framework can be used to generate novel black-box metaheuristic optimization algorithms automatically. LLaMEA generates multiple algorithms that outperform state-of-the-art optimization algorithms (Covariance Matrix Adaptation Evolution Strategy and Differential Evolution) on the five dimensional black box optimization benchmark (BBOB). The results demonstrate the feasibility of the framework and identify future directions for automated generation and optimization of algorithms via LLMs.
- [364] arXiv:2405.20136 [pdf, ps, html, other]
-
Title: A Multimodal Dangerous State Recognition and Early Warning System for Elderly with Intermittent DementiaComments: 13 pages,9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
In response to the social issue of the increasing number of elderly vulnerable groups going missing due to the aggravating aging population in China, our team has developed a wearable anti-loss device and intelligent early warning system for elderly individuals with intermittent dementia using artificial intelligence and IoT technology. This system comprises an anti-loss smart helmet, a cloud computing module, and an intelligent early warning application on the caregiver's mobile device. The smart helmet integrates a miniature camera module, a GPS module, and a 5G communication module to collect first-person images and location information of the elderly. Data is transmitted remotely via 5G, FTP, and TCP protocols. In the cloud computing module, our team has proposed for the first time a multimodal dangerous state recognition network based on scene and location information to accurately assess the risk of elderly individuals going missing. Finally, the application software interface designed for the caregiver's mobile device implements multi-level early warnings. The system developed by our team requires no operation or response from the elderly, achieving fully automatic environmental perception, risk assessment, and proactive alarming. This overcomes the limitations of traditional monitoring devices, which require active operation and response, thus avoiding the issue of the digital divide for the elderly. It effectively prevents accidental loss and potential dangers for elderly individuals with dementia.
- [365] arXiv:2405.20138 [pdf, ps, html, other]
-
Title: Separation and Collapse of Equilibria Inequalities on AND-OR Trees without Shape ConstraintsComments: 42 pages, 1 figureSubjects: Artificial Intelligence (cs.AI)
Herein, we investigate the randomized complexity, which is the least cost against the worst input, of AND-OR tree computation by imposing various restrictions on the algorithm to find the Boolean value of the root of that tree and no restrictions on the tree shape. When a tree satisfies a certain condition regarding its symmetry, directional algorithms proposed by Saks and Wigderson (1986), special randomized algorithms, are known to achieve the randomized complexity. Furthermore, there is a known example of a tree that is so unbalanced that no directional algorithm achieves the randomized complexity (Vereshchagin 1998). In this study, we aim to identify where deviations arise between the general randomized Boolean decision tree and its special case, directional algorithms. In this paper, we show that for any AND-OR tree, randomized depth-first algorithms, which form a broader class compared with directional algorithms, have the same equilibrium as that of the directional algorithms. Thus, we get the collapse result on equilibria inequalities that holds for an arbitrary AND-OR tree. This implies that there exists a case where even depth-first algorithms cannot be the fastest, leading to the separation result on equilibria inequality. Additionally, a new algorithm is introduced as a key concept for proof of the separation result.
- [366] arXiv:2405.20139 [pdf, ps, html, other]
-
Title: GNN-RAG: Graph Neural Retrieval for Large Language Model ReasoningSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions outperforming competing approaches by 8.9--15.5% points at answer F1.
- [367] arXiv:2405.20141 [pdf, ps, html, other]
-
Title: OpenDAS: Domain Adaptation for Open-Vocabulary SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
The advent of Vision Language Models (VLMs) transformed image understanding from closed-set classifications to dynamic image-language interactions, enabling open-vocabulary segmentation. Despite this flexibility, VLMs often fall behind closed-set classifiers in accuracy due to their reliance on ambiguous image captions and lack of domain-specific knowledge. We, therefore, introduce a new task domain adaptation for open-vocabulary segmentation, enhancing VLMs with domain-specific priors while preserving their open-vocabulary nature. Existing adaptation methods, when applied to segmentation tasks, improve performance on training queries but can reduce VLM performance on zero-shot text inputs. To address this shortcoming, we propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy. This strategy is designed to enhance open-vocabulary generalization while adapting to the visual domain. Our results outperform other parameter-efficient adaptation strategies in open-vocabulary segment classification tasks across indoor and outdoor datasets. Notably, our approach is the only one that consistently surpasses the original VLM on zero-shot queries. Our adapted VLMs can be plug-and-play integrated into existing open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU on ADE20K, and OpenMask3D by +4.1% AP on ScanNet++ Offices without any changes to the methods.
- [368] arXiv:2405.20142 [pdf, ps, html, other]
-
Title: MSSC-BiMamba: Multimodal Sleep Stage Classification and Early Diagnosis of Sleep Disorders with Bidirectional MambaComments: 10 pagesSubjects: Artificial Intelligence (cs.AI)
Background and Objectives: Monitoring sleep states is crucial for assessing sleep quality and diagnosing sleep disorders. Traditional manual staging methods are not only time-consuming but also subject to subjective judgment, leading to inconsistent results. This study developed an automated sleep staging and sleep disorder classification model through deep learning technology, aimed at improving diagnostic accuracy and efficiency.
Methods: Considering the characteristics of polysomnography (PSG) multi-lead sleep monitoring, we designed a sleep state classification model, MSSC-BiMamba, that combines an Efficient Channel Attention (ECA) mechanism with a Bidirectional State Space Model (BSSM). The ECA module allows for weighting data from different sensor channels, thereby amplifying the influence of diverse sensor inputs. Additionally, the implementation of mamba enables the model to effectively capture the multidimensional features and long-range dependencies of PSG data.
Results: The developed model demonstrated impressive performance on sleep stage classification tasks. Furthermore, the model exhibited an accuracy of 0.952 for sleep health prediction when evaluated on a combined dataset consisting of ISRUC and Sleep-EDF.
Conclusion: Our model is the first to apply the bidirectional Mamba to sleep staging with complex PSG data, showing substantial gains in computational and memory efficiency over traditional Transformer-style models. This method not only makes health monitoring more accessible but also broadens the reach of advanced healthcare, thereby enhancing sleep health management with innovative technology. - [369] arXiv:2405.20145 [pdf, ps, other]
-
Title: Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical TransformersComments: Accepted for publication at the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP-WS) 2024; 11 pages, 1 figure, 9 tablesSubjects: Computation and Language (cs.CL)
Historical languages present unique challenges to the NLP community, with one prominent hurdle being the limited resources available in their closed corpora. This work describes our submission to the constrained subtask of the SIGTYP 2024 shared task, focusing on PoS tagging, morphological tagging, and lemmatization for 13 historical languages. For PoS and morphological tagging we adapt a hierarchical tokenization method from Sun et al. (2023) and combine it with the advantages of the DeBERTa-V3 architecture, enabling our models to efficiently learn from every character in the training data. We also demonstrate the effectiveness of character-level T5 models on the lemmatization task. Pre-trained from scratch with limited data, our models achieved first place in the constrained subtask, nearly reaching the performance levels of the unconstrained task's winner. Our code is available at this https URL
- [370] arXiv:2405.20150 [pdf, ps, html, other]
-
Title: Can the a.c.s. notion and the GLT theory handle approximated PDEs/FDEs with either moving or unbounded domains?Comments: 57 pages, 64 figuresSubjects: Numerical Analysis (math.NA)
In the current note we consider matrix-sequences $\{B_{n,t}\}_n$ of increasing sizes depending on $n$ and equipped with a parameter $t>0$. For every fixed $t>0$, we assume that each $\{B_{n,t}\}_n$ possesses a canonical spectral/singular values symbol $f_t$ defined on $D_t\subset \R^{d}$ of finite measure, $d\ge 1$. Furthermore, we assume that $ \{ \{ B_{n,t}\} : \, t > 0 \} $ is an approximating class of sequences (a.c.s.) for $ \{ A_n \} $ and that $ \bigcup_{t > 0} D_t = D $ with $ D_{t + 1} \supset D_t $. Under such assumptions and via the notion of a.c.s, we prove results on the canonical distributions of $ \{ A_n \} $, whose symbol, when it exists, can be defined on the, possibly unbounded, domain $D$ of finite or even infinite measure.
We then extend the concept of a.c.s. to the case where the approximating sequence $ \{ B_{n,t}\}_n $ has possibly a different dimension than the one of $ \{ A_n\} $. This concept seems to be particularly natural when dealing, e.g., with the approximation both of a partial differential equation (PDE) and of its (possibly unbounded, or moving) domain $D$, using an exhausting sequence of domains $\{ D_t \}$.
Examples coming from approximated PDEs/FDEs with either moving or unbounded domains are presented in connection with the classical and the new notion of a.c.s., while numerical tests and a list of open questions conclude the present work. - [371] arXiv:2405.20152 [pdf, ps, html, other]
-
Title: Uncovering Bias in Large Vision-Language Models at Scale with CounterfactualsSubjects: Computer Vision and Pattern Recognition (cs.CV)
With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different models under this counterfactual generation setting at scale, producing over 57 million responses from popular LVLMs. Our multi-dimensional analysis reveals that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence the generation of toxic content, competency-associated words, harmful stereotypes, and numerical ratings of depicted individuals. We additionally explore the relationship between social bias in LVLMs and their corresponding LLMs, as well as inference-time strategies to mitigate bias.
- [372] arXiv:2405.20155 [pdf, ps, html, other]
-
Title: MotionDreamer: Zero-Shot 3D Mesh Animation from Video Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Animation techniques bring digital 3D worlds and characters to life. However, manual animation is tedious and automated techniques are often specialized to narrow shape classes. In our work, we propose a technique for automatic re-animation of arbitrary 3D shapes based on a motion prior extracted from a video diffusion model. Unlike existing 4D generation methods, we focus solely on the motion, and we leverage an explicit mesh-based representation compatible with existing computer-graphics pipelines. Furthermore, our utilization of diffusion features enhances accuracy of our motion fitting. We analyze efficacy of these features for animation fitting and we experimentally validate our approach for two different diffusion models and four animation models. Finally, we demonstrate that our time-efficient zero-shot method achieves a superior performance re-animating a diverse set of 3D shapes when compared to existing techniques in a user study. The project website is located at this https URL.
- [373] arXiv:2405.20156 [pdf, ps, other]
-
Title: Scaling up archival text analysis with the blockmodeling of n-gram networks -- A case study of Bulgaria's representation in the Osservatore Romano (January-May 1877)Comments: 7 pages, 2 figures, 1 tableJournal-ref: Bulgarski ezik i literatura [Bulgarian Language and Literature], vol. 66, no. 3, 2024Subjects: Digital Libraries (cs.DL); Applications (stat.AP); Computation (stat.CO); Other Statistics (stat.OT)
This paper seeks to bridge the gap between archival text analysis and network analysis by applying network clustering methods to analyze the coverage of Bulgaria in 123 issues of the newspaper Osservatore Romano published between January and May 1877. Utilizing optical character recognition and generalized homogeneity blockmodeling, the study constructs networks of relevant keywords. Those including the sets Bulgaria and Russia are rather isomorphic and they largely overlap with those for Germany, Britain, and War. In structural terms, the blockmodel of the two networks exhibits a clear core-semiperiphery-periphery structure that reflects relations between concepts in the newpaper's coverage. The newspaper's lexical choices effectively delegitimised the Bulgarian national revival, highlighting the influence of the Holy See on the newspaper's editorial line.
- [374] arXiv:2405.20161 [pdf, ps, html, other]
-
Title: Landslide mapping from Sentinel-2 imagery through change detectionComments: to be published in IEEE IGARSS 2024 conference proceedingsSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Landslides are one of the most critical and destructive geohazards. Widespread development of human activities and settlements combined with the effects of climate change on weather are resulting in a high increase in the frequency and destructive power of landslides, making them a major threat to human life and the economy. In this paper, we explore methodologies to map newly-occurred landslides using Sentinel-2 imagery automatically. All approaches presented are framed as a bi-temporal change detection problem, requiring only a pair of Sentinel-2 images, taken respectively before and after a landslide-triggering event. Furthermore, we introduce a novel deep learning architecture for fusing Sentinel-2 bi-temporal image pairs with Digital Elevation Model (DEM) data, showcasing its promising performances w.r.t. other change detection models in the literature. As a parallel task, we address limitations in existing datasets by creating a novel geodatabase, which includes manually validated open-access landslide inventories over heterogeneous ecoregions of the world. We release both code and dataset with an open-source license.
- [375] arXiv:2405.20163 [pdf, ps, html, other]
-
Title: Reasoning about concepts with LLMs: Inconsistencies aboundComments: 15 pages, 5 figures, 3 tablesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The ability to summarize and organize knowledge into abstract concepts is key to learning and reasoning. Many industrial applications rely on the consistent and systematic use of concepts, especially when dealing with decision-critical knowledge. However, we demonstrate that, when methodically questioned, large language models (LLMs) often display and demonstrate significant inconsistencies in their knowledge. Computationally, the basic aspects of the conceptualization of a given domain can be represented as Is-A hierarchies in a knowledge graph (KG) or ontology, together with a few properties or axioms that enable straightforward reasoning. We show that even simple ontologies can be used to reveal conceptual inconsistencies across several LLMs. We also propose strategies that domain experts can use to evaluate and improve the coverage of key domain concepts in LLMs of various sizes. In particular, we have been able to significantly enhance the performance of LLMs of various sizes with openly available weights using simple knowledge-graph (KG) based prompting strategies.
- [376] arXiv:2405.20166 [pdf, ps, html, other]
-
Title: An approximation for return time distributions of random walks on sparse networksSubjects: Social and Information Networks (cs.SI)
We propose an approximation for the first return time distribution of random walks on undirected networks. We combine a message-passing solution with a mean-field approximation, to account for the short- and long-term behaviours respectively. We test this approximation on several classes of large graphs and find excellent agreement between our approximations and the true distributions. While the statistical properties of a random walk will depend on the structure of the network, the observed agreement between our approximations and numerical calculations implies that while local structure is clearly very influential, global structure is only important in a relatively superficial way, namely through the total number of edges.
- [377] arXiv:2405.20168 [pdf, ps, html, other]
-
Title: Enhancing Battlefield Awareness: An Aerial RIS-assisted ISAC System with Deep Reinforcement LearningSubjects: Systems and Control (eess.SY)
This paper considers a joint communication and sensing technique for enhancing situational awareness in practical battlefield scenarios. In particular, we propose an aerial reconfigurable intelligent surface (ARIS)-assisted integrated sensing and communication (ISAC) system consisting of a single access point (AP), an ARIS, multiple users, and a sensing target. With deep reinforcement learning (DRL), we jointly optimize the transmit beamforming of the AP, the RIS phase shifts, and the trajectory of the ARIS under signal-to-interference-noise ratio (SINR) constraints. Numerical results demonstrate that the proposed technique outperforms the conventional benchmark schemes by suppressing the self-interference and clutter echo signals or optimizing the RIS phase shifts.
- [378] arXiv:2405.20172 [pdf, ps, html, other]
-
Title: Iterative Feature Boosting for Explainable Speech Emotion RecognitionComments: Published in: 2023 International Conference on Machine Learning and Applications (ICMLA)Journal-ref: 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 543-549Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
In speech emotion recognition (SER), using predefined features without considering their practical importance may lead to high dimensional datasets, including redundant and irrelevant information. Consequently, high-dimensional learning often results in decreasing model accuracy while increasing computational complexity. Our work underlines the importance of carefully considering and analyzing features in order to build efficient SER systems. We present a new supervised SER method based on an efficient feature engineering approach. We pay particular attention to the explainability of results to evaluate feature relevance and refine feature sets. This is performed iteratively through feature evaluation loop, using Shapley values to boost feature selection and improve overall framework performance. Our approach allows thus to balance the benefits between model performance and transparency. The proposed method outperforms human-level performance (HLP) and state-of-the-art machine learning methods in emotion recognition on the TESS dataset.
- [379] arXiv:2405.20174 [pdf, ps, html, other]
-
Title: Tropical Expressivity of Neural NetworksSubjects: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
We propose an algebraic geometric framework to study the expressivity of linear activation neural networks. A particular quantity that has been actively studied in the field of deep learning is the number of linear regions, which gives an estimate of the information capacity of the architecture. To study and evaluate information capacity and expressivity, we work in the setting of tropical geometry -- a combinatorial and polyhedral variant of algebraic geometry -- where there are known connections between tropical rational maps and feedforward neural networks. Our work builds on and expands this connection to capitalize on the rich theory of tropical geometry to characterize and study various architectural aspects of neural networks. Our contributions are threefold: we provide a novel tropical geometric approach to selecting sampling domains among linear regions; an algebraic result allowing for a guided restriction of the sampling domain for network architectures with symmetries; and an open source library to analyze neural networks as tropical Puiseux rational maps. We provide a comprehensive set of proof-of-concept numerical experiments demonstrating the breadth of neural network architectures to which tropical geometric theory can be applied to reveal insights on expressivity characteristics of a network. Our work provides the foundations for the adaptation of both theory and existing software from computational tropical geometry and symbolic computation to deep learning.
- [380] arXiv:2405.20175 [pdf, ps, html, other]
-
Title: InstructionCP: A fast approach to transfer Large Language Models into target languageComments: 10 pages, 1 figureSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The rapid development of large language models (LLMs) in recent years has largely focused on English, resulting in models that respond exclusively in English. To adapt these models to other languages, continual pre-training (CP) is often employed, followed by supervised fine-tuning (SFT) to maintain conversational abilities. However, CP and SFT can reduce a model's ability to filter harmful content. We propose Instruction Continual Pre-training (InsCP), which integrates instruction tags into the CP process to prevent loss of conversational proficiency while acquiring new languages. Our experiments demonstrate that InsCP retains conversational and Reinforcement Learning from Human Feedback (RLHF) abilities. Empirical evaluations on language alignment, reliability, and knowledge benchmarks confirm the efficacy of InsCP. Notably, this approach requires only 0.1 billion tokens of high-quality instruction-following data, thereby reducing resource consumption.
- [381] arXiv:2405.20178 [pdf, ps, other]
-
Title: Non-intrusive data-driven model order reduction for circuits based on Hammerstein architecturesComments: 13 pages, 13 figures; submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and SystemsSubjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
We demonstrate that data-driven system identification techniques can provide a basis for effective, non-intrusive model order reduction (MOR) for common circuits that are key building blocks in microelectronics. Our approach is motivated by the practical operation of these circuits and utilizes a canonical Hammerstein architecture. To demonstrate the approach we develop a parsimonious Hammerstein model for a non-linear CMOS differential amplifier. We train this model on a combination of direct current (DC) and transient Spice (Xyce) circuit simulation data using a novel sequential strategy to identify the static nonlinear and linear dynamical parts of the model. Simulation results show that the Hammerstein model is an effective surrogate for the differential amplifier circuit that accurately and efficiently reproduces its behavior over a wide range of operating points and input frequencies.
- [382] arXiv:2405.20179 [pdf, ps, html, other]
-
Title: Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Large language models (LLMs) have shown great promise at generating robot programs from natural language given domain-specific robot application programming interfaces (APIs). However, the performance gap between proprietary LLMs and smaller open-weight LLMs remains wide. This raises a question: Can we fine-tune smaller open-weight LLMs for generating domain-specific robot programs to close the performance gap with proprietary LLMs? While Self-Instruct is a promising solution by generating a diverse set of training data, it cannot verify the correctness of these programs. In contrast, a robot simulator with a well-defined world can identify execution errors but limits the diversity of programs that it can verify. In this work, we introduce Robo-Instruct, which brings the best of both worlds -- it promotes the diversity of Self-Instruct while providing the correctness of simulator-based checking. Robo-Instruct introduces RoboSim to synthesize a consistent world state on the fly by inferring properties relevant to the program being checked, and simulating actions accordingly. Furthermore, the instructions and programs generated by Self-Instruct may be subtly inconsistent -- such as the program missing a step implied by the instruction. Robo-Instruct further addresses this with InstAlign, an instruction-program alignment procedure that revises the task instruction to reflect the actual results of the generated program. Given a few seed task descriptions and the robot APIs, Robo-Instruct is capable of generating a training dataset using only a small open-weight model. This dataset can then be used to fine-tune small open-weight language models, enabling them to match or even exceed the performance of several proprietary LLMs, such as GPT-3.5-Turbo and Gemini-Pro.
- [383] arXiv:2405.20180 [pdf, ps, html, other]
-
Title: Transformers and Slot Encoding for Sample Efficient Physical World ModellingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Recent applications of the Transformer architecture to the problem of world modelling from video input show notable improvements in sample efficiency. However, existing approaches tend to work only at the image level thus disregarding that the environment is composed of objects interacting with each other. In this paper, we propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene. We describe the resulting neural architecture and report experimental results showing an improvement over the existing solutions in terms of sample efficiency and a reduction of the variation of the performance over the training examples. The code for our architecture and experiments is available at this https URL
- [384] arXiv:2405.20183 [pdf, ps, html, other]
-
Title: A Survey Study on the State of the Art of Programming Exercise Generation using Large Language ModelsComments: 5 pages, 0 figures, CSEE&T 2024Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
This paper analyzes Large Language Models (LLMs) with regard to their programming exercise generation capabilities. Through a survey study, we defined the state of the art, extracted their strengths and weaknesses and finally proposed an evaluation matrix, helping researchers and educators to decide which LLM is the best fitting for the programming exercise generation use case. We also found that multiple LLMs are capable of producing useful programming exercises. Nevertheless, there exist challenges like the ease with which LLMs might solve exercises generated by LLMs. This paper contributes to the ongoing discourse on the integration of LLMs in education.
- [385] arXiv:2405.20188 [pdf, ps, html, other]
-
Title: SPARE: Symmetrized Point-to-Plane Distance for Robust Non-Rigid RegistrationSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Existing optimization-based methods for non-rigid registration typically minimize an alignment error metric based on the point-to-point or point-to-plane distance between corresponding point pairs on the source surface and target surface. However, these metrics can result in slow convergence or a loss of detail. In this paper, we propose SPARE, a novel formulation that utilizes a symmetrized point-to-plane distance for robust non-rigid registration. The symmetrized point-to-plane distance relies on both the positions and normals of the corresponding points, resulting in a more accurate approximation of the underlying geometry and can achieve higher accuracy than existing methods. To solve this optimization problem efficiently, we propose an alternating minimization solver using a majorization-minimization strategy. Moreover, for effective initialization of the solver, we incorporate a deformation graph-based coarse alignment that improves registration quality and efficiency. Extensive experiments show that the proposed method greatly improves the accuracy of non-rigid registration problems and maintains relatively high solution efficiency. The code is publicly available at this https URL.
- [386] arXiv:2405.20189 [pdf, ps, html, other]
-
Title: Nadine: An LLM-driven Intelligent Social Robot with Affective Capabilities and Human-like MemorySubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
In this work, we describe our approach to developing an intelligent and robust social robotic system for the Nadine social robot platform. We achieve this by integrating Large Language Models (LLMs) and skilfully leveraging the powerful reasoning and instruction-following capabilities of these types of models to achieve advanced human-like affective and cognitive capabilities. This approach is novel compared to the current state-of-the-art LLM-based agents which do not implement human-like long-term memory or sophisticated emotional appraisal. The naturalness of social robots, consisting of multiple modules, highly depends on the performance and capabilities of each component of the system and the seamless integration of the components. We built a social robot system that enables generating appropriate behaviours through multimodal input processing, bringing episodic memories accordingly to the recognised user, and simulating the emotional states of the robot induced by the interaction with the human partner. In particular, we introduce an LLM-agent frame for social robots, SoR-ReAct, serving as a core component for the interaction module in our system. This design has brought forth the advancement of social robots and aims to increase the quality of human-robot interaction.
- [387] arXiv:2405.20192 [pdf, ps, html, other]
-
Title: TAIA: Large Language Models are Out-of-Distribution Data LearnersComments: 25 pagesSubjects: Computation and Language (cs.CL)
Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: \uline{T}raining \uline{A}ll parameters but \uline{I}nferring with only \uline{A}ttention (\trainallInfAttn). We empirically validate \trainallInfAttn using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that \trainallInfAttn achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of \trainallInfAttn to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data.
- [388] arXiv:2405.20194 [pdf, ps, other]
-
Title: Occam Gradient DescentSubjects: Machine Learning (cs.LG)
Deep learning neural network models must be large enough to adapt to their problem domain, while small enough to avoid overfitting training data during gradient descent. To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large data sets, and hence inefficient with both computing resources and training data. In response to these inefficiencies, we exploit learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size to minimize generalization error, with gradient descent on model weights to minimize fitting error. In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error. Our algorithm simultaneously descends the space of weights and topological size of any neural network without modification, and is effective in our experiments in outperforming traditional gradient descent with or without post-train pruning in accuracy, compute and model compression.
- [389] arXiv:2405.20195 [pdf, ps, html, other]
-
Title: Using Large Language Models for Humanitarian Frontline Negotiation: Opportunities and ConsiderationsZilin Ma, 1 Susannah (Cheng)Su, Nathan Zhao, Linn Bieske, Blake Bullwinkel, Yanyi Zhang, Sophia (Yanrui)Yang, Ziqing Luo, Siyao Li, Gekai Liao, Boxiang Wang, Jinglun Gao, Zihan Wen, Claude Bruderlein, Weiwei PanSubjects: Human-Computer Interaction (cs.HC)
Humanitarian negotiations in conflict zones, called \emph{frontline negotiation}, are often highly adversarial, complex, and high-risk. Several best-practices have emerged over the years that help negotiators extract insights from large datasets to navigate nuanced and rapidly evolving scenarios. Recent advances in large language models (LLMs) have sparked interest in the potential for AI to aid decision making in frontline negotiation. Through in-depth interviews with 13 experienced frontline negotiators, we identified their needs for AI-assisted case analysis and creativity support, as well as concerns surrounding confidentiality and model bias. We further explored the potential for AI augmentation of three standard tools used in frontline negotiation planning. We evaluated the quality and stability of our ChatGPT-based negotiation tools in the context of two real cases. Our findings highlight the potential for LLMs to enhance humanitarian negotiations and underscore the need for careful ethical and practical considerations.
- [390] arXiv:2405.20199 [pdf, ps, html, other]
-
Title: Generation planning and operation under power stability constraints: A Hydro-Quebec use caseAlexandre Besner, Alexandre Blondin Massé, Abderrahman Bani, Mouad Morabit, Luc Charest, David Ialongo, Simon Couture-Gagnon, Julien FournierComments: 10 pages, 3 figures, 13 tables, 1 algorithmSubjects: Systems and Control (eess.SY)
Hydro-Quebec (HQ) is a vertically integrated utility that produces, transmits, and distributes most of the electricity in the province of Quebec. The power grid it operates has a particular architecture created by large hydroelectric dams located far north and the extensive 735kV transmission grid that allows the generated power to reach the majority of the load located thousands of kilometers away in the southern region of Quebec. The specificity of the grid has led HQ to develop monitoring tools responsible for generating so-called stability limits. Those stability limits take into account several nonlinear phenomena such as angular stability, frequency stability, or voltage stability. Since generation planning and operation tools rely mostly on mixed integer linear programming formulation, HQ had to adapt its tools to integrate stability limits into them. This paper presents the challenges it faced, especially considering its reserve monitoring tool and unit commitment tool.
- [391] arXiv:2405.20200 [pdf, ps, html, other]
-
Title: Unified Explanations in Machine Learning Models: A Perturbation ApproachSubjects: Machine Learning (cs.LG)
A high-velocity paradigm shift towards Explainable Artificial Intelligence (XAI) has emerged in recent years. Highly complex Machine Learning (ML) models have flourished in many tasks of intelligence, and the questions have started to shift away from traditional metrics of validity towards something deeper: What is this model telling me about my data, and how is it arriving at these conclusions? Inconsistencies between XAI and modeling techniques can have the undesirable effect of casting doubt upon the efficacy of these explainability approaches. To address these problems, we propose a systematic, perturbation-based analysis against a popular, model-agnostic method in XAI, SHapley Additive exPlanations (Shap). We devise algorithms to generate relative feature importance in settings of dynamic inference amongst a suite of popular machine learning and deep learning methods, and metrics that allow us to quantify how well explanations generated under the static case hold. We propose a taxonomy for feature importance methodology, measure alignment, and observe quantifiable similarity amongst explanation models across several datasets.
- [392] arXiv:2405.20202 [pdf, ps, html, other]
-
Title: One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient DeploymentsSubjects: Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights to eliminate the interference and incorporate Low-Rank adapters for training efficiency. Furthermore, we observe the imbalance allocation of training resources from the traditional uniform sampling. A non-parametric scheduler is introduced to adjust the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on LLaMA2 families, and downstream evaluation confirms our ability to maintain high performance while significantly reducing deployment time faced with multiple scenarios.
- [393] arXiv:2405.20204 [pdf, ps, html, other]
-
Title: Jina CLIP: Your CLIP Model Is Also Your Text RetrieverAndreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han XiaoComments: 4 pages, ICML2024 workshop submissionSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
- [394] arXiv:2405.20211 [pdf, ps, other]
-
Title: Low-rank and sparse approximations for contact mechanicsComments: Phd thesis, July 2022. Ecole Centrale de Nantes. Preferably cite using HAL ID. https://theses.hal.science/tel-03901111Subjects: Numerical Analysis (math.NA)
(Rephrased) Non-conformance decision-making processes in high-precision manufacturing of engineering structures are often delayed due to numerical simulations that are needed for analyzing the defective parts and assemblies. Interfaces between parts of assemblies can only be simulated using the modeling of contact. Thus, efficient parametric ROMs are necessary for performing contact mechanics simulations in near real-time scenarios. Typical strategies for reducing the cost of contact models use low-rank approximations. Assumptions include the existence of a low-dimensional subspace for displacement and a low-dimensional non-negative subcone for contact pressure. However, the contact pressure exhibits a local nature, as the position of contact can vary with parameters like loading or geometry. The adequacy of low-rank approximations for contact mechanics is investigated and alternative routes based on sparse regression techniques are explored. It is shown that the local nature leads to loss of linear separability of contact pressure, thereby limiting the accuracy of low-rank methods. The applicability of the low-rank assumption to contact pressure is analyzed using 3 different criteria: compactness, generalization and specificity. Subsequently, over-complete dictionaries with a large number of snapshots to mitigate the inseparability issues is investigated. Two strategies are devised: a greedy active-set method where the dictionary elements are selected greedily and a convex hull approximation method that eliminates the necessity of explicitly enforcing non-penetration constraints in convex problems. Lastly, Dynamic Time Warping is studied as a possible non-linear interpolation method that permits the exploration of the non-linear manifoldm synthesising snapshots not computed in the training set with low complexity; reducing the offline costs.
- [395] arXiv:2405.20213 [pdf, ps, html, other]
-
Title: PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular OptimizationSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
A poster from a long input document can be considered as a one-page easy-to-read multimodal (text and images) summary presented on a nice template with good design elements. Automatic transformation of a long document into a poster is a very less studied but challenging task. It involves content summarization of the input document followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground truth summaries to extract multimodal content from the document and explicitly ensures good coverage, diversity and alignment of text and images. Then, we use an LLM based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.
- [396] arXiv:2405.20215 [pdf, ps, html, other]
-
Title: TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language ModelsSubjects: Computation and Language (cs.CL)
Mainstream approaches to aligning large language models (LLMs) heavily rely on human preference data, particularly when models require periodic updates. The standard process for iterative alignment of LLMs involves collecting new human feedback for each update. However, the data collection process is costly and challenging to scale. To address this issue, we introduce the "TS-Align" framework, which fine-tunes a policy model using pairwise feedback data automatically mined from its outputs. This automatic mining process is efficiently accomplished through the collaboration between a large-scale teacher model and a small-scale student model. The policy fine-tuning process can be iteratively repeated using on-policy generations within our proposed teacher-student collaborative framework. Through extensive experiments, we demonstrate that our final aligned policy outperforms the base policy model with an average win rate of 69.7% across seven conversational or instruction-following datasets. Furthermore, we show that the ranking capability of the teacher is effectively distilled into the student through our pipeline, resulting in a small-scale yet effective reward model for policy model alignment.
- [397] arXiv:2405.20216 [pdf, ps, html, other]
-
Title: Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI FeedbackComments: 28 pages, 18 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The generation of high-quality human images through text-to-image (T2I) methods is a significant yet challenging task. Distinct from general image generation, human image synthesis must satisfy stringent criteria related to human pose, anatomy, and alignment with textual prompts, making it particularly difficult to achieve realistic results. Recent advancements in T2I generation based on diffusion models have shown promise, yet challenges remain in meeting human-specific preferences. In this paper, we introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO). Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback. We also propose a modified loss function that enhances the DPO training process by minimizing artifacts and improving image fidelity. Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation. Through comprehensive evaluations, we show that our approach significantly advances the state of human image generation, achieving superior results in terms of natural anatomies, poses, and text-image alignment.
- [398] arXiv:2405.20218 [pdf, ps, other]
-
Title: ESG-FTSE: A corpus of news articles with ESG relevance labels and use casesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
We present ESG-FTSE, the first corpus comprised of news articles with Environmental, Social and Governance (ESG) relevance annotations. In recent years, investors and regulators have pushed ESG investing to the mainstream due to the urgency of climate change. This has led to the rise of ESG scores to evaluate an investment's credentials as socially responsible. While demand for ESG scores is high, their quality varies wildly. Quantitative techniques can be applied to improve ESG scores, thus, responsible investing. To contribute to resource building for ESG and financial text mining, we pioneer the ESG-FTSE corpus. We further present the first of its kind ESG annotation schema. It has three levels: a binary classification (relevant versus irrelevant news articles), ESG classification (ESG-related news articles), and target company. Both supervised and unsupervised learning experiments for ESG relevance detection were conducted to demonstrate that the corpus can be used in different settings to derive accurate ESG predictions. Keywords: corpus annotation, ESG labels, annotation schema, news article, natural language processing
- [399] arXiv:2405.20219 [pdf, ps, html, other]
-
Title: System Identification for Lithium-Ion Batteries with Nonlinear Coupled Electro-Thermal Dynamics via Bayesian OptimizationComments: 2024 American Control Conference(ACC)Subjects: Systems and Control (eess.SY)
Essential to various practical applications of lithium-ion batteries is the availability of accurate equivalent circuit models. This paper presents a new coupled electro-thermal model for batteries and studies how to extract it from data. We consider the problem of maximum likelihood parameter estimation, which, however, is nontrivial to solve as the model is nonlinear in both its dynamics and measurement. We propose to leverage the Bayesian optimization approach, owing to its machine learning-driven capability in handling complex optimization problems and searching for global optima. To enhance the parameter search efficiency, we dynamically narrow and refine the search space in Bayesian optimization. The proposed system identification approach can efficiently determine the parameters of the coupled electro-thermal model. It is amenable to practical implementation, with few requirements on the experiment, data types, and optimization setups, and well applicable to many other battery models.
- [400] arXiv:2405.20220 [pdf, ps, html, other]
-
Title: BeerReview: A Blockchain-enabled Peer Review PlatformSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computers and Society (cs.CY)
In an era of increasing concerns over intellectual property rights, traditional peer review systems face challenges including plagiarism, malicious attacks, and unauthorized data access. BeerReview, a blockchain-enabled peer review platform, offers a robust solution, enabling experts and scholars to participate actively in the review process without concerns about plagiarism or security threats. Following the completion of its alpha testing, BeerReview demonstrates the potential for expanded deployment. This platform offers improved convenience and more robust intellectual property protection within the peer review process with open source initiative.
- [401] arXiv:2405.20222 [pdf, ps, html, other]
-
Title: MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion ModelSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
We present MOFA-Video, an advanced controllable image animation method that generates video from the given image using various additional controllable signals (such as human landmarks reference, manual trajectories, and another even provided video) or their combinations. This is different from previous methods which only can work on a specific motion domain or show weak control abilities with diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (\ie, MOFA-Adapters) to control the generated motions in the video generation pipeline. For MOFA-Adapters, we consider the temporal motion consistency of the video and generate the dense motion flow from the given sparse control conditions first, and then, the multi-scale features of the given image are wrapped as a guided feature for stable video diffusion generation. We naively train two motion adapters for the manual trajectories and the human landmarks individually since they both contain sparse information about the control. After training, the MOFA-Adapters in different domains can also work together for more controllable video generation.
- [402] arXiv:2405.20224 [pdf, ps, html, other]
-
Title: EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry ImagesComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D scene reconstruction and novel view synthesis. However, its training heavily depends on high-quality, sharp images and accurate camera poses. Fulfilling these requirements can be challenging in non-ideal real-world scenarios, where motion-blurred images are commonly encountered in high-speed moving cameras or low-light environments that require long exposure times. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (EvaGaussians), a novel approach that integrates event streams captured by an event camera to assist in reconstructing high-quality 3D-GS from blurry images. Capitalizing on the high temporal resolution and dynamic range offered by the event camera, we leverage the event streams to explicitly model the formation process of motion-blurred images and guide the deblurring reconstruction of 3D-GS. By jointly optimizing the 3D-GS parameters and recovering camera motion trajectories during the exposure time, our method can robustly facilitate the acquisition of high-fidelity novel views with intricate texture details. We comprehensively evaluated our method and compared it with previous state-of-the-art deblurring rendering methods. Both qualitative and quantitative comparisons demonstrate that our method surpasses existing techniques in restoring fine details from blurry images and producing high-fidelity novel views.
- [403] arXiv:2405.20230 [pdf, ps, html, other]
-
Title: Feature Fusion for Improved Classification: Combining Dempster-Shafer Theory and Multiple CNN ArchitecturesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Addressing uncertainty in Deep Learning (DL) is essential, as it enables the development of models that can make reliable predictions and informed decisions in complex, real-world environments where data may be incomplete or ambiguous. This paper introduces a novel algorithm leveraging Dempster-Shafer Theory (DST) to integrate multiple pre-trained models to form an ensemble capable of providing more reliable and enhanced classifications. The main steps of the proposed method include feature extraction, mass function calculation, fusion, and expected utility calculation. Several experiments have been conducted on CIFAR-10 and CIFAR-100 datasets, demonstrating superior classification accuracy of the proposed DST-based method, achieving improvements of 5.4% and 8.4%, respectively, compared to the best individual pre-trained models. Results highlight the potential of DST as a robust framework for managing uncertainties related to data when applying DL in real-world scenarios.
- [404] arXiv:2405.20231 [pdf, ps, html, other]
-
Title: The Empirical Impact of Neural Parameter Symmetries, or Lack ThereofComments: 27 pages. Preparing code for releaseSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries -- transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss-landscapes. However, theoretical analysis of the relationship between parameter space symmetries and these phenomena is difficult. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity between our networks without alignment of weight spaces, and we find that our networks allow for faster and more effective Bayesian neural network training.
- [405] arXiv:2405.20232 [pdf, ps, html, other]
-
Title: Distributed maze exploration using multiple agents and optimal goal assignmentComments: 11 pages, 9 figuresSubjects: Multiagent Systems (cs.MA)
Robotic exploration has long captivated researchers aiming to map complex environments efficiently. Techniques such as potential fields and frontier exploration have traditionally been employed in this pursuit, primarily focusing on solitary agents. Recent advancements have shifted towards optimizing exploration efficiency through multiagent systems. However, many existing approaches overlook critical real-world factors, such as broadcast range limitations, communication costs, and coverage overlap. This paper addresses these gaps by proposing a distributed maze exploration strategy (CU-LVP) that assumes constrained broadcast ranges and utilizes Voronoi diagrams for better area partitioning. By adapting traditional multiagent methods to distributed environments with limited broadcast ranges, this study evaluates their performance across diverse maze topologies, demonstrating the efficacy and practical applicability of the proposed method. The code and experimental results supporting this study are available in the following repository: this https URL.
- [406] arXiv:2405.20233 [pdf, ps, html, other]
-
Title: Grokfast: Accelerated Grokking by Amplifying Slow GradientsComments: 15 pages, 12 figures. Project page: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
One puzzling artifact in machine learning dubbed grokking is where delayed generalization is achieved tenfolds of iterations after near perfect overfitting to the training data. Focusing on the long delay itself on behalf of machine learning practitioners, our goal is to accelerate generalization of a model under grokking phenomenon. By regarding a series of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: the fast-varying, overfitting-yielding component and the slow-varying, generalization-inducing component. This analysis allows us to accelerate the grokking phenomenon more than $\times 50$ with only a few lines of code that amplifies the slow-varying components of gradients. The experiments show that our algorithm applies to diverse tasks involving images, languages, and graphs, enabling practical availability of this peculiar artifact of sudden generalization. Our code is available at \url{this https URL}.
- [407] arXiv:2405.20234 [pdf, ps, html, other]
-
Title: Context Injection Attacks on Large Language ModelsSubjects: Artificial Intelligence (cs.AI)
Large Language Models (LLMs) such as ChatGPT and Llama-2 have become prevalent in real-world applications, exhibiting impressive text generation performance. LLMs are fundamentally developed from a scenario where the input data remains static and lacks a clear structure. To behave interactively over time, LLM-based chat systems must integrate additional contextual information (i.e., chat history) into their inputs, following a pre-defined structure. This paper identifies how such integration can expose LLMs to misleading context from untrusted sources and fail to differentiate between system and user inputs, allowing users to inject context. We present a systematic methodology for conducting context injection attacks aimed at eliciting disallowed responses by introducing fabricated context. This could lead to illegal actions, inappropriate content, or technology misuse. Our context fabrication strategies, acceptance elicitation and word anonymization, effectively create misleading contexts that can be structured with attacker-customized prompt templates, achieving injection through malicious user messages. Comprehensive evaluations on real-world LLMs such as ChatGPT and Llama-2 confirm the efficacy of the proposed attack with success rates reaching 97%. We also discuss potential countermeasures that can be adopted for attack detection and developing more secure models. Our findings provide insights into the challenges associated with the real-world deployment of LLMs for interactive and structured data scenarios.
- [408] arXiv:2405.20245 [pdf, ps, html, other]
-
Title: Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool UseComments: Accepted by IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks.
The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpasses current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs + RASG is oftentimes superior given real-world applications and constraints of BDIE. - [409] arXiv:2405.20247 [pdf, ps, html, other]
-
Title: KerasCV and KerasNLP: Vision and Language Power-UpsMatthew Watson, Divyashree Shivakumar Sreepathihalli, Francois Chollet, Martin Gorner, Kiranbir Sodhia, Ramesh Sampath, Tirth Patel, Haifeng Jin, Neel Kovelamudi, Gabriel Rasskin, Samaneh Saadat, Luke Wood, Chen Qian, Jonathan Bischof, Ian StenbitComments: Submitted to Journal of Machine Learning Open Source SoftwareSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
We present the Keras domain packages KerasCV and KerasNLP, extensions of the Keras API for Computer Vision and Natural Language Processing workflows, capable of running on either JAX, TensorFlow, or PyTorch. These domain packages are designed to enable fast experimentation, with a focus on ease-of-use and performance. We adopt a modular, layered design: at the library's lowest level of abstraction, we provide building blocks for creating models and data preprocessing pipelines, and at the library's highest level of abstraction, we provide pretrained ``task" models for popular architectures such as Stable Diffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, T5, etc. Task models have built-in preprocessing, pretrained weights, and can be fine-tuned on raw inputs. To enable efficient training, we support XLA compilation for all models, and run all preprocessing via a compiled graph of TensorFlow operations using the tf.data API. The libraries are fully open-source (Apache 2.0 license) and available on GitHub.
- [410] arXiv:2405.20248 [pdf, ps, html, other]
-
Title: Image-to-Joint Inverse Kinematic of a Supportive Continuum Arm Using Deep LearningComments: Accepted in CRV2024 (this https URL)Subjects: Robotics (cs.RO)
In this work, a deep learning-based technique is used to study the image-to-joint inverse kinematics of a tendon-driven supportive continuum arm. An eye-off-hand configuration is considered by mounting a camera at a fixed pose with respect to the inertial frame attached at the arm base. This camera captures an image for each distinct joint variable at each sampling time to construct the training dataset. This dataset is then employed to adapt a feed-forward deep convolutional neural network, namely the modified VGG-16 model, to estimate the joint variable. One thousand images are recorded to train the deep network, and transfer learning and fine-tuning techniques are applied to the modified VGG-16 to further improve the training. Finally, training is also completed with a larger dataset of images that are affected by various types of noises, changes in illumination, and partial occlusion. The main contribution of this research is the development of an image-to-joint network that can estimate the joint variable given an image of the arm, even if the image is not captured in an ideal condition. The key benefits of this research are twofold: 1) image-to-joint mapping can offer a real-time alternative to computationally complex inverse kinematic mapping through analytical models; and 2) the proposed technique can provide robustness against noise, occlusion, and changes in illumination. The dataset is publicly available on Kaggle.
- [411] arXiv:2405.20252 [pdf, ps, html, other]
-
Title: Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt OptimizationSubjects: Computation and Language (cs.CL)
Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts according to themselves. Specifically, we include a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.
- [412] arXiv:2405.20253 [pdf, ps, html, other]
-
Title: Evaluating Large Language Model Biases in Persona-Steered GenerationComments: Accepted to Findings of ACL 2024. Code and data available at this https URLSubjects: Computation and Language (cs.CL)
The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g. political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with its demographic rather than the target stance. Models that we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.
- [413] arXiv:2405.20254 [pdf, ps, html, other]
-
Title: Conversational Agents to Facilitate Deliberation on Harmful Content in WhatsApp GroupsComments: To appear at CSCW 2024Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
WhatsApp groups have become a hotbed for the propagation of harmful content including misinformation, hate speech, polarizing content, and rumors, especially in Global South countries. Given the platform's end-to-end encryption, moderation responsibilities lie on group admins and members, who rarely contest such content. Another approach is fact-checking, which is unscalable, and can only contest factual content (e.g., misinformation) but not subjective content (e.g., hate speech). Drawing on recent literature, we explore deliberation -- open and inclusive discussion -- as an alternative. We investigate the role of a conversational agent in facilitating deliberation on harmful content in WhatsApp groups. We conducted semi-structured interviews with 21 Indian WhatsApp users, employing a design probe to showcase an example agent. Participants expressed the need for anonymity and recommended AI assistance to reduce the effort required in deliberation. They appreciated the agent's neutrality but pointed out the futility of deliberation in echo chamber groups. Our findings highlight design tensions for such an agent, including privacy versus group dynamics and freedom of speech in private spaces. We discuss the efficacy of deliberation using deliberative theory as a lens, compare deliberation with moderation and fact-checking, and provide design recommendations for future such systems. Ultimately, this work advances CSCW by offering insights into designing deliberative systems for combating harmful content in private group chats on social media.
- [414] arXiv:2405.20259 [pdf, ps, html, other]
-
Title: FaceMixup: Enhancing Facial Expression Recognition through Mixed Face RegularizationComments: 29 pages, 9 figures, paper is under review on journalSubjects: Computer Vision and Pattern Recognition (cs.CV)
The proliferation of deep learning solutions and the scarcity of large annotated datasets pose significant challenges in real-world applications. Various strategies have been explored to overcome this challenge, with data augmentation (DA) approaches emerging as prominent solutions. DA approaches involve generating additional examples by transforming existing labeled data, thereby enriching the dataset and helping deep learning models achieve improved generalization without succumbing to overfitting. In real applications, where solutions based on deep learning are widely used, there is facial expression recognition (FER), which plays an essential role in human communication, improving a range of knowledge areas (e.g., medicine, security, and marketing). In this paper, we propose a simple and comprehensive face data augmentation approach based on mixed face component regularization that outperforms the classical DA approaches from the literature, including the MixAugment which is a specific approach for the target task in two well-known FER datasets existing in the literature.
- [415] arXiv:2405.20260 [pdf, ps, html, other]
-
Title: Ancillary Services Provision by Cross-Voltage-Level Power Flow Control using Flexibility RegionsComments: preprintSubjects: Systems and Control (eess.SY)
The large-scale integration of distributed renewable energy sources into the electricity grid requires the investigation of new methods to ensure stability. For example, Active Distribution Networks (ADNs) can be used at (sub-) transmission levels for emergency operation, provided robust and efficient control is available. This paper investigates the use of Feasible Operating Regions (FORs) and Flexibility Regions (FRs) for Cross-Voltage-Level Power Flow Control (CPFC). The enhancement of network stability due to the provision of ancillary services is illustrated, as is the need for strengthened cooperation between Transmission (TSOs) and Distribution System Operators (DSOs). Optimal power flow methods are considered, focusing on computational advances through PieceWise Linearization (PWL) and convex relaxation techniques aiming to speed up runtime while keeping high accuracy. To illustrate the algorithms' benefits and drawbacks, they are analyzed using exemplary medium voltage grids.
- [416] arXiv:2405.20261 [pdf, ps, html, other]
-
Title: Speed Profile Definition for GLOSA Implementation on Buses Based on Statistical Analysis of Experimental DataSubjects: Systems and Control (eess.SY)
Intelligent Transportation Systems (ITS) are pushing an increasing interest and development when dealing with eco-driving systems. In this framework, this paper presents a method to define speed profiles specifically designed for Green Light Optimal Speed Advisory (GLOSA) systems on buses. GLOSA aims to optimize traffic flow by providing vehicles with real-time speed recommendations synchronized with traffic signal timings. Leveraging statistical analysis of experimental data collected from an urban bus, the study develops a methodology to extract meaningful insights into bus behaviour and traffic dynamics. The proposed approach considers road topology, scheduled bus stops, and signal timings to define simple although suitable speed profiles considering the peculiarities of the motion of a bus in an urban scenario. Through extensive data collection robust statistical data are defined, allowing the definition of vehicle motion profile for effectively develop and implement GLOSA systems. This research contributes to the advancement of Intelligent Transportation Systems by providing realistic data and practical insights for optimizing bus operations in urban environments.
- [417] arXiv:2405.20267 [pdf, ps, html, other]
-
Title: Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee DiscussionsSubjects: Computation and Language (cs.CL)
As LLMs evolve on a daily basis, there is an urgent need for a trustworthy evaluation method that can provide robust evaluation results in a timely fashion. Currently, as static benchmarks are prone to contamination concerns, users tend to trust human voting platforms, such as Chatbot Arena. However, human annotations require extensive manual efforts. To provide an automatic, robust, and trustworthy evaluation framework, we innovatively propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. Firstly, an examiner LLM devises queries. Then, a pair of candidate LLMs engage in a multi-round peer-battle around the query, during which the LLM's true performance gaps become visible. Finally, a committee of LLM judges collectively discuss and determine the winner, which alleviates bias and promotes fairness. In our extensive experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences, providing a promising alternative to human evaluation platforms.
- [418] arXiv:2405.20269 [pdf, ps, other]
-
Title: IsraParlTweet: The Israeli Parliamentary and Twitter ResourceComments: Presented at LREC-COLING 2024Subjects: Computation and Language (cs.CL)
We introduce IsraParlTweet, a new linked corpus of Hebrew-language parliamentary discussions from the Knesset (Israeli Parliament) between the years 1992-2023 and Twitter posts made by Members of the Knesset between the years 2008-2023, containing a total of 294.5 million Hebrew tokens. In addition to raw text, the corpus contains comprehensive metadata on speakers and Knesset sessions as well as several linguistic annotations. As a result, IsraParlTweet can be used to conduct a wide variety of quantitative and qualitative analyses and provide valuable insights into political discourse in Israel.
- [419] arXiv:2405.20271 [pdf, ps, html, other]
-
Title: ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane ReflectionsComments: Accepted to ICML 2024. Code available at this https URLSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters ($\sim$$10$-$100$ times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility. The code is available at this https URL.
- [420] arXiv:2405.20272 [pdf, ps, html, other]
-
Title: Reconstruction Attacks on Machine Unlearning: Simple Models are VulnerableSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Machine unlearning is motivated by desire for data autonomy: a person can request to have their data's influence removed from deployed models, and those models should be updated as if they were retrained without the person's data. We show that, counter-intuitively, these updates expose individuals to high-accuracy reconstruction attacks which allow the attacker to recover their data in its entirety, even when the original models are so simple that privacy risk might not otherwise have been a concern. We show how to mount a near-perfect attack on the deleted data point from linear regression models. We then generalize our attack to other loss functions and architectures, and empirically demonstrate the effectiveness of our attacks across a wide range of datasets (capturing both tabular and image data). Our work highlights that privacy risk is significant even for extremely simple model classes when individuals can request deletion of their data from the model.
- [421] arXiv:2405.20274 [pdf, ps, html, other]
-
Title: ROAST: Review-level Opinion Aspect Sentiment Target Joint DetectionComments: arXiv admin note: text overlap with arXiv:2309.13297Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Aspect-Based Sentiment Analysis (ABSA) has experienced tremendous expansion and diversity due to various shared tasks spanning several languages and fields and organized via SemEval workshops and Germeval. Nonetheless, a few shortcomings still need to be addressed, such as the lack of low-resource language evaluations and the emphasis on sentence-level analysis. To thoroughly assess ABSA techniques in the context of complete reviews, this research presents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST). ROAST seeks to close the gap between sentence-level and text-level ABSA by identifying every ABSA constituent at the review level. We extend the available datasets to enable ROAST, addressing the drawbacks noted in previous research by incorporating low-resource languages, numerous languages, and a variety of topics. Through this effort, ABSA research will be able to cover more ground and get a deeper comprehension of the task and its practical application in a variety of languages and domains (this https URL).
- [422] arXiv:2405.20277 [pdf, ps, html, other]
-
Title: Pre-train and Refine: Towards Higher Efficiency in K-Agnostic Community Detection without Quality DegradationComments: Accepted by ACM KDD 2024Subjects: Social and Information Networks (cs.SI)
Community detection (CD) is a classic graph inference task that partitions nodes of a graph into densely connected groups. While many CD methods have been proposed with either impressive quality or efficiency, balancing the two aspects remains a challenge. This study explores the potential of deep graph learning to achieve a better trade-off between the quality and efficiency of K-agnostic CD, where the number of communities K is unknown. We propose PRoCD (Pre-training & Refinement fOr Community Detection), a simple yet effective method that reformulates K-agnostic CD as the binary node pair classification. PRoCD follows a pre-training & refinement paradigm inspired by recent advances in pre-training techniques. We first conduct the offline pre-training of PRoCD on small synthetic graphs covering various topology properties. Based on the inductive inference across graphs, we then generalize the pre-trained model (with frozen parameters) to large real graphs and use the derived CD results as the initialization of an existing efficient CD method (e.g., InfoMap) to further refine the quality of CD results. In addition to benefiting from the transfer ability regarding quality, the online generalization and refinement can also help achieve high inference efficiency, since there is no time-consuming model optimization. Experiments on public datasets with various scales demonstrate that PRoCD can ensure higher efficiency in K-agnostic CD without significant quality degradation.
- [423] arXiv:2405.20278 [pdf, ps, html, other]
-
Title: Length independent generalization bounds for deep SSM architectures with stability constraintsComments: 25 pages, no figures, under submissionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Many state-of-the-art models trained on long-range sequences, for example S4, S5 or LRU, are made of sequential blocks combining State-Space Models (SSMs) with neural networks. In this paper we provide a PAC bound that holds for these kind of architectures with stable SSM blocks and does not depend on the length of the input sequence. Imposing stability of the SSM blocks is a standard practice in the literature, and it is known to help performance. Our results provide a theoretical justification for the use of stable SSM blocks as the proposed PAC bound decreases as the degree of stability of the SSM blocks increases.
- [424] arXiv:2405.20279 [pdf, ps, html, other]
-
Title: CV-VAE: A Compatible Video VAE for Latent Generative Video ModelsComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latent extracted by 2D VAEs without quantization. The temporal compression is simply realized by uniform frame sampling which results in unsmooth motion between consecutive frames. Currently, there lacks of a commonly used continuous video (3D) VAE for latent diffusion-based video models in the research community. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering the compatibility with existing T2I models will result in a latent space gap between them, which will take huge computational resources for training to bridge the gap even with the T2I models as initialization. To address this issue, we propose a method for training a video VAE of latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which involves formulating a regularization loss using the image VAE. Benefiting from the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.
- [425] arXiv:2405.20281 [pdf, ps, other]
-
Title: Tight Characterizations for Preprocessing against Cryptographic SaltingSubjects: Cryptography and Security (cs.CR); Quantum Physics (quant-ph)
Cryptography often considers the strongest yet plausible attacks in the real world. Preprocessing (a.k.a. non-uniform attack) plays an important role in both theory and practice: an efficient online attacker can take advantage of advice prepared by a time-consuming preprocessing stage.
Salting is a heuristic strategy to counter preprocessing attacks by feeding a small amount of randomness to the cryptographic primitive. We present general and tight characterizations of preprocessing against cryptographic salting, with upper bounds matching the advantages of the most intuitive attack. Our result quantitatively strengthens the previous work by Coretti, Dodis, Guo, and Steinberger (EUROCRYPT'18). Our proof exploits a novel connection between the non-uniform security of salted games and direct product theorems for memoryless algorithms.
For quantum adversaries, we give similar characterizations for property finding games, resolving an open problem of the quantum non-uniform security of salted collision resistant hash by Chung, Guo, Liu, and Qian (FOCS'20). Our proof extends the compressed oracle framework of Zhandry (CRYPTO'19) to prove quantum strong direct product theorems for property finding games in the average-case hardness. - [426] arXiv:2405.20282 [pdf, ps, html, other]
-
Title: SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified FlowSubjects: Computer Vision and Pattern Recognition (cs.CV)
Semantic segmentation and semantic image synthesis are two representative tasks in visual perception and generation. While existing methods consider them as two distinct tasks, we propose a unified diffusion-based framework (SemFlow) and model them as a pair of reverse problems. Specifically, motivated by rectified flow theory, we train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks. As the training object is symmetric, samples belonging to the two distributions, images and semantic masks, can be effortlessly transferred reversibly. For semantic segmentation, our approach solves the contradiction between the randomness of diffusion outputs and the uniqueness of segmentation results. For image synthesis, we propose a finite perturbation approach to enhance the diversity of generated results without changing the semantic categories. Experiments show that our SemFlow achieves competitive results on semantic segmentation and semantic image synthesis tasks. We hope this simple framework will motivate people to rethink the unification of low-level and high-level vision. Project page: this https URL.
- [427] arXiv:2405.20283 [pdf, ps, html, other]
-
Title: TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric MeshesSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We present TetSphere splatting, an explicit, Lagrangian representation for reconstructing 3D shapes with high-quality geometry. In contrast to conventional object reconstruction methods which predominantly use Eulerian representations, including both neural implicit (e.g., NeRF, NeuS) and explicit representations (e.g., DMTet), and often struggle with high computational demands and suboptimal mesh quality, TetSphere splatting utilizes an underused but highly effective geometric primitive -- tetrahedral meshes. This approach directly yields superior mesh quality without relying on neural networks or post-processing. It deforms multiple initial tetrahedral spheres to accurately reconstruct the 3D shape through a combination of differentiable rendering and geometric energy optimization, resulting in significant computational efficiency. Serving as a robust and versatile geometry representation, Tet-Sphere splatting seamlessly integrates into diverse applications, including single-view 3D reconstruction, image-/text-to-3D content generation. Experimental results demonstrate that TetSphere splatting outperforms existing representations, delivering faster optimization speed, enhanced mesh quality, and reliable preservation of thin structures.
- [428] arXiv:2405.20285 [pdf, ps, html, other]
-
Title: Who Writes the Review, Human or AI?Panagiotis C. Theocharopoulos, Spiros V. Georgakopoulos, Sotiris K. Tasoulis, Vassilis P. PlagianakosSubjects: Computation and Language (cs.CL)
With the increasing use of Artificial Intelligence in Natural Language Processing, concerns have been raised regarding the detection of AI-generated text in various domains. This study aims to investigate this issue by proposing a methodology to accurately distinguish AI-generated and human-written book reviews. Our approach utilizes transfer learning, enabling the model to identify generated text across different topics while improving its ability to detect variations in writing style and vocabulary. To evaluate the effectiveness of the proposed methodology, we developed a dataset consisting of real book reviews and AI-generated reviews using the recently proposed Vicuna open-source language model. The experimental results demonstrate that it is feasible to detect the original source of text, achieving an accuracy rate of 96.86%. Our efforts are oriented toward the exploration of the capabilities and limitations of Large Language Models in the context of text identification. Expanding our knowledge in these aspects will be valuable for effectively navigating similar models in the future and ensuring the integrity and authenticity of human-generated content.
- [429] arXiv:2405.20287 [pdf, ps, html, other]
-
Title: Flexible SE(2) graph neural networks with applications to PDE surrogatesComments: 9 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
This paper presents a novel approach for constructing graph neural networks equivariant to 2D rotations and translations and leveraging them as PDE surrogates on non-gridded domains. We show that aligning the representations with the principal axis allows us to sidestep many constraints while preserving SE(2) equivariance. By applying our model as a surrogate for fluid flow simulations and conducting thorough benchmarks against non-equivariant models, we demonstrate significant gains in terms of both data efficiency and accuracy.
- [430] arXiv:2405.20289 [pdf, ps, html, other]
-
Title: DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music GenerationSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T -Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide-variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation over 10-20x, but simultaneously improves control adherence and generation quality all at once. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at this https URL.
- [431] arXiv:2405.20291 [pdf, ps, html, other]
-
Title: Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor ActivenessSubjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The security threat of backdoor attacks is a central concern for deep neural networks (DNNs). Recently, without poisoned data, unlearning models with clean data and then learning a pruning mask have contributed to backdoor defense. Additionally, vanilla fine-tuning with those clean data can help recover the lost clean accuracy. However, the behavior of clean unlearning is still under-explored, and vanilla fine-tuning unintentionally induces back the backdoor effect. In this work, we first investigate model unlearning from the perspective of weight changes and gradient norms, and find two interesting observations in the backdoored model: 1) the weight changes between poison and clean unlearning are positively correlated, making it possible for us to identify the backdoored-related neurons without using poisoned data; 2) the neurons of the backdoored model are more active (i.e., larger changes in gradient norm) than those in the clean model, suggesting the need to suppress the gradient norm during fine-tuning. Then, we propose an effective two-stage defense method. In the first stage, an efficient Neuron Weight Change (NWC)-based Backdoor Reinitialization is proposed based on observation 1). In the second stage, based on observation 2), we design an Activeness-Aware Fine-Tuning to replace the vanilla fine-tuning. Extensive experiments, involving eight backdoor attacks on three benchmark datasets, demonstrate the superior performance of our proposed method compared to recent state-of-the-art backdoor defense approaches.
- [432] arXiv:2405.20299 [pdf, ps, html, other]
-
Title: Scaling White-Box Transformers for VisionComments: project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question which this paper aims to address. Specifically, we propose CRATE-$\alpha$, featuring strategic yet minimal modifications to the sparse coding block in the CRATE architecture design, and a light training recipe designed to improve the scalability of CRATE. Through extensive experiments, we demonstrate that CRATE-$\alpha$ can effectively scale with larger model sizes and datasets. For example, our CRATE-$\alpha$-B substantially outperforms the prior best CRATE-B model accuracy on ImageNet classification by 3.7%, achieving an accuracy of 83.2%. Meanwhile, when scaling further, our CRATE-$\alpha$-L obtains an ImageNet classification accuracy of 85.1%. More notably, these model performance improvements are achieved while preserving, and potentially even enhancing the interpretability of learned CRATE models, as we demonstrate through showing that the learned token representations of increasingly larger trained CRATE-$\alpha$ models yield increasingly higher-quality unsupervised object segmentation of images. The project page is this https URL.
- [433] arXiv:2405.20303 [pdf, ps, other]
-
Title: Truthful Budget Aggregation: Beyond Moving-Phantom MechanismsSubjects: Computer Science and Game Theory (cs.GT)
We study a budget-aggregation setting in which a number of voters report their ideal distribution of a budget over a set of alternatives, and a mechanism aggregates these reports into an allocation. Ideally, such mechanisms are truthful, i.e., voters should not be incentivized to misreport their preferences. For the case of two alternatives, the set of mechanisms that are truthful and additionally meet a range of basic desiderata (anonymity, neutrality, and continuity) exactly coincides with the so-called moving-phantom mechanisms, but whether this space is richer for more alternatives was repeatedly stated as an open question. We answer this question in the affirmative by presenting a new mechanism that is not a moving-phantom mechanism but satisfies the four properties. Since moving-phantom mechanisms can only provide limited fairness guarantees (measured as the worst-case distance to a fair share solution), one motivation for broadening the class of truthful mechanisms is the hope for improved fairness guarantees. We dispel this hope by showing that lower bounds holding for the class of moving-phantom mechanisms extend to all truthful, anonymous, neutral, and continuous mechanisms.
- [434] arXiv:2405.20304 [pdf, ps, other]
-
Title: Group Robust Preference Optimization in Reward-free RLHFShyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija BogunovicComments: PreprintSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
- [435] arXiv:2405.20305 [pdf, ps, html, other]
-
Title: Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language ModelsComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real-world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation, we explore the generative capability of a large video-language model in our work and further, develop the understanding of plausibility in an action sequence by introducing two objective functions, a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with plausible action sequence learning loss. This loss helps the model to differentiate between plausible and not plausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.
- [436] arXiv:2405.20309 [pdf, ps, other]
-
Title: Large Language Models Can Self-Improve At Web Agent TasksAjay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp HochreiterSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31\% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.
- [437] arXiv:2405.20310 [pdf, ps, html, other]
-
Title: A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D ReconstructionComments: preprint, under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV)
Learning 3D scene representation from a single-view image is a long-standing fundamental problem in computer vision, with the inherent ambiguity in predicting contents unseen from the input view. Built on the recently proposed 3D Gaussian Splatting (3DGS), the Splatter Image method has made promising progress on fast single-image novel view synthesis via learning a single 3D Gaussian for each pixel based on the U-Net feature map of an input image. However, it has limited expressive power to represent occluded components that are not observable in the input view. To address this problem, this paper presents a Hierarchical Splatter Image method in which a pixel is worth more than one 3D Gaussians. Specifically,
each pixel is represented by a parent 3D Gaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are learned as done in the vanilla Splatter Image. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron (MLP) which takes as input the projected image features of a parent 3D Gaussian and the embedding of a target camera view. Both parent and child 3D Gaussians are learned end-to-end in a stage-wise way. The joint condition of input image features from eyes of the parent Gaussians and the target camera position facilitates learning to allocate child Gaussians to ``see the unseen'', recovering the occluded details that are often missed by parent Gaussians.
In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D datasets with state-of-the-art performance obtained, especially showing promising capabilities of reconstructing occluded contents in the input view. - [438] arXiv:2405.20313 [pdf, ps, html, other]
-
Title: Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone GenerationGuillaume Huguet, James Vuckovic, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, Alexander Tong, Avishek Joey BoseComments: preprintSubjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.
- [439] arXiv:2405.20314 [pdf, ps, html, other]
-
Title: S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUsSubjects: Computation and Language (cs.CL)
Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. However, despite the high speedups they offer, speculative decoding methods often achieve optimal performance on high-end devices or with a substantial GPU memory overhead. Given limited memory and the necessity of quantization, a high-performing model on a high-end GPU can slow down by up to 7 times. To this end, we propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. When compared against recent effective open-source SD systems, our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data. Leveraging our memory efficiency, we created a smaller yet more effective SD model based on Phi-3. It is 1.4 to 2 times faster than the quantized EAGLE model and operates in half-precision while using less VRAM.
- [440] arXiv:2405.20315 [pdf, ps, html, other]
-
Title: ANAH: Analytical Annotation of Hallucinations in Large Language ModelsComments: Accepted by ACL 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Reducing the `$\textit{hallucination}$' problem of Large Language Models (LLMs) is crucial for their wide applications. A comprehensive and fine-grained measurement of the hallucination is the first key step for the governance of this issue but is under-explored in the community. Thus, we present $\textbf{ANAH}$, a bilingual dataset that offers $\textbf{AN}$alytical $\textbf{A}$nnotation of $\textbf{H}$allucinations in LLMs within Generative Question Answering. Each answer sentence in our dataset undergoes rigorous annotation, involving the retrieval of a reference fragment, the judgment of the hallucination type, and the correction of hallucinated content. ANAH consists of ~12k sentence-level annotations for ~4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline. Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs progressively accumulate in the answer and use ANAH to train and evaluate hallucination annotators. We conduct extensive experiments on studying generative and discriminative annotators and show that, although current open-source LLMs have difficulties in fine-grained hallucination annotation, the generative annotator trained with ANAH can surpass all open-source LLMs and GPT-3.5, obtain performance competitive with GPT-4, and exhibits better generalization ability on unseen questions.
- [441] arXiv:2405.20318 [pdf, ps, other]
-
Title: CausalQuest: Collecting Natural Causal Questions for AI AgentsRoberto Ceraolo, Dmitrii Kharlapenko, Amélie Reymond, Rada Mihalcea, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing JinSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Humans have an innate drive to seek out causality. Whether fuelled by curiosity or specific goals, we constantly question why things happen, how they are interconnected, and many other related phenomena. To develop AI agents capable of addressing this natural human quest for causality, we urgently need a comprehensive dataset of natural causal questions. Unfortunately, existing datasets either contain only artificially-crafted questions that do not reflect real AI usage scenarios or have limited coverage of questions from specific sources. To address this gap, we present CausalQuest, a dataset of 13,500 naturally occurring questions sourced from social networks, search engines, and AI assistants. We formalize the definition of causal questions and establish a taxonomy for finer-grained classification. Through a combined effort of human annotators and large language models (LLMs), we carefully label the dataset. We find that 42% of the questions humans ask are indeed causal, with the majority seeking to understand the causes behind given effects. Using this dataset, we train efficient classifiers (up to 2.85B parameters) for the binary task of identifying causal questions, achieving high performance with F1 scores of up to 0.877. We conclude with a rich set of future research directions that can build upon our data and models.
- [442] arXiv:2405.20319 [pdf, ps, html, other]
-
Title: ParSEL: Parameterized Shape Editing with LanguageSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Symbolic Computation (cs.SC)
The ability to edit 3D assets from natural language presents a compelling paradigm to aid in the democratization of 3D content creation. However, while natural language is often effective at communicating general intent, it is poorly suited for specifying precise manipulation. To address this gap, we introduce ParSEL, a system that enables controllable editing of high-quality 3D assets from natural language. Given a segmented 3D mesh and an editing request, ParSEL produces a parameterized editing program. Adjusting the program parameters allows users to explore shape variations with a precise control over the magnitudes of edits. To infer editing programs which align with an input edit request, we leverage the abilities of large-language models (LLMs). However, while we find that LLMs excel at identifying initial edit operations, they often fail to infer complete editing programs, and produce outputs that violate shape semantics. To overcome this issue, we introduce Analytical Edit Propagation (AEP), an algorithm which extends a seed edit with additional operations until a complete editing program has been formed. Unlike prior methods, AEP searches for analytical editing operations compatible with a range of possible user edits through the integration of computer algebra systems for geometric analysis. Experimentally we demonstrate ParSEL's effectiveness in enabling controllable editing of 3D objects through natural language requests over alternative system designs.
- [443] arXiv:2405.20320 [pdf, ps, html, other]
-
Title: Improving the Training of Rectified FlowsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Diffusion models have shown great promise for image and video generation, but sampling from state-of-the-art models requires expensive numerical integration of a generative ODE. One approach for tackling this problem is rectified flows, which iteratively learn smooth ODE paths that are less susceptible to truncation error. However, rectified flows still require a relatively large number of function evaluations (NFEs). In this work, we propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low NFE setting. Our main insight is that under realistic settings, a single iteration of the Reflow algorithm for training rectified flows is sufficient to learn nearly straight trajectories; hence, the current practice of using multiple Reflow iterations is unnecessary. We thus propose techniques to improve one-round training of rectified flows, including a U-shaped timestep distribution and LPIPS-Huber premetric. With these techniques, we improve the FID of the previous 2-rectified flow by up to 72% in the 1 NFE setting on CIFAR-10. On ImageNet 64$\times$64, our improved rectified flow outperforms the state-of-the-art distillation methods such as consistency distillation and progressive distillation in both one-step and two-step settings and rivals the performance of improved consistency training (iCT) in FID. Code is available at this https URL.
- [444] arXiv:2405.20321 [pdf, ps, html, other]
-
Title: Vision-based Manipulation from Single Human Video with Open-World Object GraphsSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos. We investigate the problem of imitating robot manipulation from a single human video in the open-world setting, where a robot must learn to manipulate novel objects from one video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices such as an iPad and generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, demonstrating the efficacy of ORION in learning from a single human video in the open world. Videos can be found in the project website this https URL.
- [445] arXiv:2405.20323 [pdf, ps, html, other]
-
Title: $\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous DrivingNan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang ZhangComments: Code is available at: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian ($\textit{S}^3$Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our $\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations. Code is available at: this https URL.
- [446] arXiv:2405.20324 [pdf, ps, html, other]
-
Title: Don't drop your samples! Coherence-aware training benefits Conditional diffusionComments: Accepted at CVPR 2024 as a Highlight. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.
- [447] arXiv:2405.20325 [pdf, ps, html, other]
-
Title: MotionFollower: Editing Video Motion via Lightweight Score-Guided DiffusionShuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang JiangComments: 23 pages, 18 figures. Project page at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, MotionFollower leverages two of our proposed lightweight signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture, including the reconstruction and editing branches, which significantly enhance the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers and losses during the score estimation. The resulting gradients thus inject appropriate guidance to the intermediate latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate the competitive motion editing ability of MotionFollower qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory while delivering superior motion editing performance and exclusively supporting large camera movements and actions.
- [448] arXiv:2405.20327 [pdf, ps, html, other]
-
Title: GECO: Generative Image-to-3D within a SECOndComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D generation has seen remarkable progress in recent years. Existing techniques, such as score distillation methods, produce notable results but require extensive per-scene optimization, impacting time efficiency. Alternatively, reconstruction-based approaches prioritize efficiency but compromise quality due to their limited handling of uncertainty. We introduce GECO, a novel method for high-quality 3D generative modeling that operates within a second. Our approach addresses the prevalent issues of uncertainty and inefficiency in current methods through a two-stage approach. In the initial stage, we train a single-step multi-view generative model with score distillation. Then, a second-stage distillation is applied to address the challenge of view inconsistency from the multi-view prediction. This two-stage process ensures a balanced approach to 3D generation, optimizing both quality and efficiency. Our comprehensive experiments demonstrate that GECO achieves high-quality image-to-3D generation with an unprecedented level of efficiency.
- [449] arXiv:2405.20330 [pdf, ps, html, other]
-
Title: 4DHands: Reconstructing Interactive Hands in 4D with TransformersComments: More demo videos can be seen at our project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
In this paper, we introduce 4DHands, a robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Our approach addresses two major limitations of previous methods: lacking a unified solution for handling various hand image inputs and neglecting the positional relationship of two hands within images. To overcome these challenges, we develop a transformer-based architecture with novel tokenization and feature fusion strategies. Specifically, we propose a Relation-aware Two-Hand Tokenization (RAT) method to embed positional relation information into the hand tokens. In this way, our network can handle both single-hand and two-hand inputs and explicitly leverage relative hand positions, facilitating the reconstruction of intricate hand interactions in real-world scenarios. As such tokenization indicates the relative relationship of two hands, it also supports more effective feature fusion. To this end, we further develop a Spatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D with attention and decode them into 3D hand meshes and relative temporal movements. The efficacy of our approach is validated on several benchmark datasets. The results on in-the-wild videos and real-world scenarios demonstrate the superior performances of our approach for interactive hand reconstruction. More video results can be found on the project page: this https URL.
- [450] arXiv:2405.20331 [pdf, ps, html, other]
-
Title: CoSy: Evaluating Textual Explanations of NeuronsLaura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M.-C. Höhne, Kirill BykovComments: 10 pages, 5 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
A crucial aspect of understanding the complex nature of Deep Neural Networks (DNNs) is the ability to explain learned concepts within their latent representations. While various methods exist to connect neurons to textual descriptions of human-understandable concepts, evaluating the quality of these explanation methods presents a major challenge in the field due to a lack of unified, general-purpose quantitative evaluation. In this work, we introduce CoSy (Concept Synthesis) -- a novel, architecture-agnostic framework to evaluate the quality of textual explanations for latent neurons. Given textual explanations, our proposed framework leverages a generative model conditioned on textual input to create data points representing the textual explanation. Then, the neuron's response to these explanation data points is compared with the response to control data points, providing a quality estimate of the given explanation. We ensure the reliability of our proposed framework in a series of meta-evaluation experiments and demonstrate practical value through insights from benchmarking various concept-based textual explanation methods for Computer Vision tasks, showing that tested explanation methods significantly differ in quality.
- [451] arXiv:2405.20333 [pdf, ps, html, other]
-
Title: SurgiTrack: Fine-Grained Multi-Class Multi-Tool Tracking in Surgical VideosComments: 15 pages, 7 figures, 9 tables, 1 video. Supplementary video available at: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate tool tracking is essential for the success of computer-assisted intervention. Previous efforts often modeled tool trajectories rigidly, overlooking the dynamic nature of surgical procedures, especially tracking scenarios like out-of-body and out-of-camera views. Addressing this limitation, the new CholecTrack20 dataset provides detailed labels that account for multiple tool trajectories in three perspectives: (1) intraoperative, (2) intracorporeal, and (3) visibility, representing the different types of temporal duration of tool tracks. These fine-grained labels enhance tracking flexibility but also increase the task complexity. Re-identifying tools after occlusion or re-insertion into the body remains challenging due to high visual similarity, especially among tools of the same category. This work recognizes the critical role of the tool operators in distinguishing tool track instances, especially those belonging to the same tool category. The operators' information are however not explicitly captured in surgical videos. We therefore propose SurgiTrack, a novel deep learning method that leverages YOLOv7 for precise tool detection and employs an attention mechanism to model the originating direction of the tools, as a proxy to their operators, for tool re-identification. To handle diverse tool trajectory perspectives, SurgiTrack employs a harmonizing bipartite matching graph, minimizing conflicts and ensuring accurate tool identity association. Experimental results on CholecTrack20 demonstrate SurgiTrack's effectiveness, outperforming baselines and state-of-the-art methods with real-time inference capability. This work sets a new standard in surgical tool tracking, providing dynamic trajectories for more adaptable and precise assistance in minimally invasive surgeries.
- [452] arXiv:2405.20334 [pdf, ps, html, other]
-
Title: VividDream: Generating 3D Scene with Ambient DynamicsComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We introduce VividDream, a method for generating explorable 4D scenes with ambient dynamics from a single input image or text prompt. VividDream first expands an input image into a static 3D point cloud through iterative inpainting and geometry merging. An ensemble of animated videos is then generated using video diffusion models with quality refinement techniques and conditioned on renderings of the static 3D scene from the sampled camera trajectories. We then optimize a canonical 4D scene representation using an animated video ensemble, with per-video motion embeddings and visibility masks to mitigate inconsistencies. The resulting 4D scene enables free-view exploration of a 3D scene with plausible ambient scene dynamics. Experiments demonstrate that VividDream can provide human viewers with compelling 4D experiences generated based on diverse real images and text prompts.
- [453] arXiv:2405.20335 [pdf, ps, html, other]
-
Title: Xwin-LM: Strong and Scalable Alignment Practice for LLMsSubjects: Computation and Language (cs.CL)
In this work, we present Xwin-LM, a comprehensive suite of alignment methodologies for large language models (LLMs). This suite encompasses several key techniques, including supervised finetuning (SFT), reward modeling (RM), rejection sampling finetuning (RS), and direct preference optimization (DPO). The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt is linked to 64 unique responses generated by Xwin-LM-SFT and scored by Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the DPO algorithm. Our evaluations on AlpacaEval and MT-bench demonstrate consistent and significant improvements across the pipeline, demonstrating the strength and scalability of Xwin-LM. The repository this https URL will be continually updated to foster community research.
- [454] arXiv:2405.20336 [pdf, ps, html, other]
-
Title: RapVerse: Coherent Vocals and Whole-Body Motions Generations from TextJiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang GanComments: Project website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at this https URL.
- [455] arXiv:2405.20337 [pdf, ps, html, other]
-
Title: OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous DrivingComments: Code is available at: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods mode scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffer from inefficiency in modeling long-term temporal evolutions. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving. Code is available at: this https URL.
- [456] arXiv:2405.20338 [pdf, ps, html, other]
-
Title: Mixed finite element methods for fourth order obstacle problems in linearised elasticitySubjects: Numerical Analysis (math.NA)
This paper is devoted to the study of a novel mixed Finite Element Method for approximating the solutions of fourth order variational problems subjected to a constraint.
The first problem we consider consists in establishing the convergence of the error of the numerical approximation of the solution of a biharmonic obstacle problem. The contents of this section are meant to generalise the approach originally proposed by Ciarlet \& Raviart, and then complemented by Ciarlet \& Glowinski.
The second problem we consider amounts to studying a two-dimensional variational problem for linearly elastic shallow shells subjected to remaining confined in a prescribed half-space. We first study the case where the parametrisation of the middle surface for the linearly elastic shallow shell under consideration has non-zero curvature, and we observe that the numerical approximation of this model via a mixed Finite Element Method based on conforming elements requires the implementation of the additional constraint according to which the gradient matrix of the dual variable has to be symmetric. However, differently from the biharmonic obstacle problem previously studied, we show that the numerical implementation of this result cannot be implemented by solely resorting to Courant triangles.
Finally, we show that if the middle surface of the linearly elastic shallow shell under consideration is flat, the symmetry constraint required for formulating the constrained mixed variational problem announced in the second part of the paper is not required, and the solution can thus be approximated by solely resorting to Courant triangles.
The theoretical results we derived are complemented by a series of numerical experiments. - [457] arXiv:2405.20339 [pdf, ps, html, other]
-
Title: Visual Perception by Large Language Model's WeightsFeipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan SunSubjects: Computer Vision and Pattern Recognition (cs.CV)
Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.
- [458] arXiv:2405.20340 [pdf, ps, html, other]
-
Title: MotionLLM: Understanding Human Behaviors from Human Motions and VideosComments: MotionLLM version 1.0, project page see https://lhchen.top/MotionLLMSubjects: Computer Vision and Pattern Recognition (cs.CV)
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose the MoVid-Bench, with carefully manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in the caption, spatial-temporal comprehension, and reasoning ability.
- [459] arXiv:2405.20341 [pdf, ps, html, other]
-
Title: From Zero to Hero: Cold-Start Anomaly DetectionComments: ACL 2024. Our code is available at this https URLSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
When first deploying an anomaly detection system, e.g., to detect out-of-scope queries in chatbots, there are no observed data, making data-driven approaches ineffective. Zero-shot anomaly detection methods offer a solution to such "cold-start" cases, but unfortunately they are often not accurate enough. This paper studies the realistic but underexplored cold-start setting where an anomaly detection model is initialized using zero-shot guidance, but subsequently receives a small number of contaminated observations (namely, that may include anomalies). The goal is to make efficient use of both the zero-shot guidance and the observations. We propose ColdFusion, a method that effectively adapts the zero-shot anomaly detector to contaminated observations. To support future development of this new setting, we propose an evaluation suite consisting of evaluation protocols and metrics.
- [460] arXiv:2405.20343 [pdf, ps, html, other]
-
Title: Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single ImageComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization time with inconsistent issues. Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution. To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, we propose a novel framework Unique3D that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps, a multi-level upscale process to progressively improve the resolution of generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into mesh results. Extensive experiments demonstrate that our Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details.
New submissions for Friday, 31 May 2024 (showing 460 of 460 entries )
- [461] arXiv:2405.19338 (cross-list from eess.SP) [pdf, ps, other]
-
Title: Accurate Patient Alignment without Unnecessary Imaging Dose via Synthesizing Patient-specific 3D CT Images from 2D kV ImagesYuzhen Ding, Jason M. Holmes, Hongying Feng, Baoxin Li, Lisa A. McGee, Jean-Claude M. Rwigema, Sujay A. Vora, Daniel J. Ma, Robert L. Foote, Samir H. Patel, Wei LiuComments: 17 pages, 8 figures and tablesSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
In radiotherapy, 2D orthogonally projected kV images are used for patient alignment when 3D-on-board imaging(OBI) unavailable. But tumor visibility is constrained due to the projection of patient's anatomy onto a 2D plane, potentially leading to substantial setup errors. In treatment room with 3D-OBI such as cone beam CT(CBCT), the field of view(FOV) of CBCT is limited with unnecessarily high imaging dose, thus unfavorable for pediatric patients. A solution to this dilemma is to reconstruct 3D CT from kV images obtained at the treatment position. Here, we propose a dual-models framework built with hierarchical ViT blocks. Unlike a proof-of-concept approach, our framework considers kV images as the solo input and can synthesize accurate, full-size 3D CT in real time(within milliseconds). We demonstrate the feasibility of the proposed approach on 10 patients with head and neck (H&N) cancer using image quality(MAE: <45HU), dosimetrical accuracy(Gamma passing rate (2%/2mm/10%)>97%) and patient position uncertainty(shift error: <0.4mm). The proposed framework can generate accurate 3D CT faithfully mirroring real-time patient position, thus significantly improving patient setup accuracy, keeping imaging dose minimum, and maintaining treatment veracity.
- [462] arXiv:2405.19340 (cross-list from eess.SP) [pdf, ps, other]
-
Title: Obtaining physical layer data of latest generation networks for investigating adversary attacksSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
The field of machine learning is developing rapidly and is being used in various fields of science and technology. In this way, machine learning can be used to optimize the functions of latest generation data networks such as 5G and 6G. This also applies to functions at a lower level. A feature of the use of machine learning in the radio path for targeted radiation generation in modern ultra-massive MIMO, reconfigurable intelligent interfaces and other technologies is the complex acquisition and processing of data from the physical layer. Additionally, adversarial measures that manipulate the behaviour of intelligent machine learning models are becoming a major concern, as many machine learning models are sensitive to incorrect input data. To obtain data on attacks directly from processing service information, a simulation model is proposed that works in conjunction with machine learning applications.
- [463] arXiv:2405.19345 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Review of Deep Representation Learning Techniques for Brain-Computer Interfaces and RecommendationsComments: Submitted to: Journal of Neural Engineering (JNE)Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
In the field of brain-computer interfaces (BCIs), the potential for leveraging deep learning techniques for representing electroencephalogram (EEG) signals has gained substantial interest. This review synthesizes empirical findings from a collection of articles using deep representation learning techniques for BCI decoding, to provide a comprehensive analysis of the current state-of-the-art. Each article was scrutinized based on three criteria: (1) the deep representation learning technique employed, (2) the underlying motivation for its utilization, and (3) the approaches adopted for characterizing the learned representations. Among the 81 articles finally reviewed in depth, our analysis reveals a predominance of 31 articles using autoencoders. We identified 13 studies employing self-supervised learning (SSL) techniques, among which ten were published in 2022 or later, attesting to the relative youth of the field. However, at the time being, none of these have led to standard foundation models that are picked up by the BCI community. Likewise, only a few studies have introspected their learned representations. We observed that the motivation in most studies for using representation learning techniques is for solving transfer learning tasks, but we also found more specific motivations such as to learn robustness or invariances, as an algorithmic bridge, or finally to uncover the structure of the data. Given the potential of foundation models to effectively tackle these challenges, we advocate for a continued dedication to the advancement of foundation models specifically designed for EEG signal decoding by using SSL techniques. We also underline the imperative of establishing specialized benchmarks and datasets to facilitate the development and continuous improvement of such foundation models.
- [464] arXiv:2405.19346 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Subject-Adaptive Transfer Learning Using Resting State EEG Signals for Cross-Subject EEG Motor Imagery ClassificationComments: Early Accepted at MICCAI 2024Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Electroencephalography (EEG) motor imagery (MI) classification is a fundamental, yet challenging task due to the variation of signals between individuals i.e., inter-subject variability. Previous approaches try to mitigate this using task-specific (TS) EEG signals from the target subject in training. However, recording TS EEG signals requires time and limits its applicability in various fields. In contrast, resting state (RS) EEG signals are a viable alternative due to ease of acquisition with rich subject information. In this paper, we propose a novel subject-adaptive transfer learning strategy that utilizes RS EEG signals to adapt models on unseen subject data. Specifically, we disentangle extracted features into task- and subject-dependent features and use them to calibrate RS EEG signals for obtaining task information while preserving subject characteristics. The calibrated signals are then used to adapt the model to the target subject, enabling the model to simulate processing TS EEG signals of the target subject. The proposed method achieves state-of-the-art accuracy on three public benchmarks, demonstrating the effectiveness of our method in cross-subject EEG MI classification. Our findings highlight the potential of leveraging RS EEG signals to advance practical brain-computer interface systems.
- [465] arXiv:2405.19347 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Near-Field Spot Beamfocusing: A Correlation-Aware Transfer Learning ApproachSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
3D spot beamfocusing (SBF), in contrast to conventional angular-domain beamforming, concentrates radiating power within very small volume in both radial and angular domains in the near-field zone. Recently the implementation of channel-state-information (CSI)-independent machine learning (ML)-based approaches have been developed for effective SBF using extremely-largescale-programable-metasurface (ELPMs). These methods involve dividing the ELPMs into subarrays and independently training them with Deep Reinforcement Learning to jointly focus the beam at the Desired Focal Point (DFP). This paper explores near-field SBF using ELPMs, addressing challenges associated with lengthy training times resulting from independent training of subarrays. To achieve a faster CSIindependent solution, inspired by the correlation between the beamfocusing matrices of the subarrays, we leverage transfer learning techniques. First, we introduce a novel similarity criterion based on the Phase Distribution Image of subarray apertures. Then we devise a subarray policy propagation scheme that transfers the knowledge from trained to untrained subarrays. We further enhance learning by introducing Quasi-Liquid-Layers as a revised version of the adaptive policy reuse technique. We show through simulations that the proposed scheme improves the training speed about 5 times. Furthermore, for dynamic DFP management, we devised a DFP policy blending process, which augments the convergence rate up to 8-fold.
- [466] arXiv:2405.19348 (cross-list from eess.SP) [pdf, ps, other]
-
Title: NERULA: A Dual-Pathway Self-Supervised Learning Framework for Electrocardiogram Signal AnalysisComments: Paper in reviewSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Electrocardiogram (ECG) signals are critical for diagnosing heart conditions and capturing detailed cardiac patterns. As wearable single-lead ECG devices become more common, efficient analysis methods are essential. We present NERULA (Non-contrastive ECG and Reconstruction Unsupervised Learning Algorithm), a self-supervised framework designed for single-lead ECG signals. NERULA's dual-pathway architecture combines ECG reconstruction and non-contrastive learning to extract detailed cardiac features. Our 50% masking strategy, using both masked and inverse-masked signals, enhances model robustness against real-world incomplete or corrupted data. The non-contrastive pathway aligns representations of masked and inverse-masked signals, while the reconstruction pathway comprehends and reconstructs missing features. We show that combining generative and discriminative paths into the training spectrum leads to better results by outperforming state-of-the-art self-supervised learning benchmarks in various tasks, demonstrating superior performance in ECG analysis, including arrhythmia classification, gender classification, age regression, and human activity recognition. NERULA's dual-pathway design offers a robust, efficient solution for comprehensive ECG signal interpretation.
- [467] arXiv:2405.19349 (cross-list from eess.SP) [pdf, ps, other]
-
Title: Beyond Isolated Frames: Enhancing Sensor-Based Human Activity Recognition through Intra- and Inter-Frame AttentionSubjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Human Activity Recognition (HAR) has become increasingly popular with ubiquitous computing, driven by the popularity of wearable sensors in fields like healthcare and sports. While Convolutional Neural Networks (ConvNets) have significantly contributed to HAR, they often adopt a frame-by-frame analysis, concentrating on individual frames and potentially overlooking the broader temporal dynamics inherent in human activities. To address this, we propose the intra- and inter-frame attention model. This model captures both the nuances within individual frames and the broader contextual relationships across multiple frames, offering a comprehensive perspective on sequential data. We further enrich the temporal understanding by proposing a novel time-sequential batch learning strategy. This learning strategy preserves the chronological sequence of time-series data within each batch, ensuring the continuity and integrity of temporal patterns in sensor-based HAR.
- [468] arXiv:2405.19351 (cross-list from eess.SP) [pdf, ps, other]
-
Title: Resonate-and-Fire Spiking Neurons for Target Detection and Hand Gesture Recognition: A Hybrid ApproachSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Hand gesture recognition using radar often relies on computationally expensive fast Fourier transforms. This paper proposes an alternative approach that bypasses fast Fourier transforms using resonate-and-fire neurons. These neurons directly detect the hand in the time-domain signal, eliminating the need for fast Fourier transforms to retrieve range information. Following detection, a simple Goertzel algorithm is employed to extract five key features, eliminating the need for a second fast Fourier transform. These features are then fed into a recurrent neural network, achieving an accuracy of 98.21% for classifying five gestures. The proposed approach demonstrates competitive performance with reduced complexity compared to traditional methods
- [469] arXiv:2405.19354 (cross-list from math.GM) [pdf, ps, other]
-
Title: Rotations of G\"odel algebras with modal operatorsSubjects: General Mathematics (math.GM); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
The present paper is devoted to study the effect of connected and disconnected rotations of Gödel algebras with operators grounded on directly indecomposable structures. The structures resulting from this construction we will present are nilpotent minimum (with or without negation fixpoint, depending on whether the rotation is connected or disconnected) with special modal operators defined on a directly indecomposable algebra. In this paper we will present a (quasi-)equational definition of these latter structures. Our main results show that directly indecomposable nilpotent minimum algebras (with or without negation fixpoint) with modal operators are fully characterized as connected and disconnected rotations of directly indecomposable Gödel algebras endowed with modal operators.
- [470] arXiv:2405.19356 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: An LSTM Feature Imitation Network for Hand Movement Recognition from sEMG SignalsComments: This work has been submitted to RA-L, and under reviewSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Surface Electromyography (sEMG) is a non-invasive signal that is used in the recognition of hand movement patterns, the diagnosis of diseases, and the robust control of prostheses. Despite the remarkable success of recent end-to-end Deep Learning approaches, they are still limited by the need for large amounts of labeled data. To alleviate the requirement for big data, researchers utilize Feature Engineering, which involves decomposing the sEMG signal into several spatial, temporal, and frequency features. In this paper, we propose utilizing a feature-imitating network (FIN) for closed-form temporal feature learning over a 300ms signal window on Ninapro DB2, and applying it to the task of 17 hand movement recognition. We implement a lightweight LSTM-FIN network to imitate four standard temporal features (entropy, root mean square, variance, simple square integral). We then explore transfer learning capabilities by applying the pre-trained LSTM-FIN for tuning to a downstream hand movement recognition task. We observed that the LSTM network can achieve up to 99\% R2 accuracy in feature reconstruction and 80\% accuracy in hand movement recognition. Our results also showed that the model can be robustly applied for both within- and cross-subject movement recognition, as well as simulated low-latency environments. Overall, our work demonstrates the potential of the FIN modeling paradigm in data-scarce scenarios for sEMG signal processing.
- [471] arXiv:2405.19359 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Modally Reduced Representation Learning of Multi-Lead ECG Signals through Simultaneous Alignment and ReconstructionComments: Accepted as a Workshop Paper at TS4H@ICLR2024Journal-ref: ICLR 2024 Workshop on Learning from Time Series For HealthSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Electrocardiogram (ECG) signals, profiling the electrical activities of the heart, are used for a plethora of diagnostic applications. However, ECG systems require multiple leads or channels of signals to capture the complete view of the cardiac system, which limits their application in smartwatches and wearables. In this work, we propose a modally reduced representation learning method for ECG signals that is capable of generating channel-agnostic, unified representations for ECG signals. Through joint optimization of reconstruction and alignment, we ensure that the embeddings of the different channels contain an amalgamation of the overall information across channels while also retaining their specific information. On an independent test dataset, we generated highly correlated channel embeddings from different ECG channels, leading to a moderate approximation of the 12-lead signals from a single-channel embedding. Our generated embeddings can work as competent features for ECG signals for downstream tasks.
- [472] arXiv:2405.19363 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Medformer: A Multi-Granularity Patching Transformer for Medical Time-Series ClassificationComments: 20pages (14 pages main paper + 6 pages supplementary materials)Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Medical time series data, such as Electroencephalography (EEG) and Electrocardiography (ECG), play a crucial role in healthcare, such as diagnosing brain and heart diseases. Existing methods for medical time series classification primarily rely on handcrafted biomarkers extraction and CNN-based models, with limited exploration of transformers tailored for medical time series. In this paper, we introduce Medformer, a multi-granularity patching transformer tailored specifically for medical time series classification. Our method incorporates three novel mechanisms to leverage the unique characteristics of medical time series: cross-channel patching to leverage inter-channel correlations, multi-granularity embedding for capturing features at different scales, and two-stage (intra- and inter-granularity) multi-granularity self-attention for learning features and correlations within and among granularities. We conduct extensive experiments on five public datasets under both subject-dependent and challenging subject-independent setups. Results demonstrate Medformer's superiority over 10 baselines, achieving top averaged ranking across five datasets on all six evaluation metrics. These findings underscore the significant impact of our method on healthcare applications, such as diagnosing Myocardial Infarction, Alzheimer's, and Parkinson's disease. We release the source code at \url{this https URL}.
- [473] arXiv:2405.19366 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological TextSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
The utilization of deep learning on electrocardiogram (ECG) analysis has brought the advanced accuracy and efficiency of cardiac healthcare diagnostics. By leveraging the capabilities of deep learning in semantic understanding, especially in feature extraction and representation learning, this study introduces a new multimodal contrastive pretaining framework that aims to improve the quality and robustness of learned representations of 12-lead ECG signals. Our framework comprises two key components, including Cardio Query Assistant (CQA) and ECG Semantics Integrator(ESI). CQA integrates a retrieval-augmented generation (RAG) pipeline to leverage large language models (LLMs) and external medical knowledge to generate detailed textual descriptions of ECGs. The generated text is enriched with information about demographics and waveform patterns. ESI integrates both contrastive and captioning loss to pretrain ECG encoders for enhanced representations. We validate our approach through various downstream tasks, including arrhythmia detection and ECG-based subject identification. Our experimental results demonstrate substantial improvements over strong baselines in these tasks. These baselines encompass supervised and self-supervised learning methods, as well as prior multimodal pretraining approaches.
- [474] arXiv:2405.19373 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Multi-modal Mood Reader: Pre-trained Model Empowers Cross-Subject Emotion RecognitionComments: Accepted by International Conference on Neural Computing for Advanced Applications, 2024Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Emotion recognition based on Electroencephalography (EEG) has gained significant attention and diversified development in fields such as neural signal processing and affective computing. However, the unique brain anatomy of individuals leads to non-negligible natural differences in EEG signals across subjects, posing challenges for cross-subject emotion recognition. While recent studies have attempted to address these issues, they still face limitations in practical effectiveness and model framework unity. Current methods often struggle to capture the complex spatial-temporal dynamics of EEG signals and fail to effectively integrate multimodal information, resulting in suboptimal performance and limited generalizability across subjects. To overcome these limitations, we develop a Pre-trained model based Multimodal Mood Reader for cross-subject emotion recognition that utilizes masked brain signal modeling and interlinked spatial-temporal attention mechanism. The model learns universal latent representations of EEG signals through pre-training on large scale dataset, and employs Interlinked spatial-temporal attention mechanism to process Differential Entropy(DE) features extracted from EEG data. Subsequently, a multi-level fusion layer is proposed to integrate the discriminative features, maximizing the advantages of features across different dimensions and modalities. Extensive experiments on public datasets demonstrate Mood Reader's superior performance in cross-subject emotion recognition tasks, outperforming state-of-the-art methods. Additionally, the model is dissected from attention perspective, providing qualitative analysis of emotion-related brain areas, offering valuable insights for affective research in neural signal processing.
- [475] arXiv:2405.19374 (cross-list from stat.ML) [pdf, ps, html, other]
-
Title: Optimal Multiclass U-Calibration Error and BeyondSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider the problem of online multiclass U-calibration, where a forecaster aims to make sequential distributional predictions over $K$ classes with low U-calibration error, that is, low regret with respect to all bounded proper losses simultaneously. Kleinberg et al. (2023) developed an algorithm with U-calibration error $O(K\sqrt{T})$ after $T$ rounds and raised the open question of what the optimal bound is. We resolve this question by showing that the optimal U-calibration error is $\Theta(\sqrt{KT})$ -- we start with a simple observation that the Follow-the-Perturbed-Leader algorithm of Daskalakis and Syrgkanis (2016) achieves this upper bound, followed by a matching lower bound constructed with a specific proper loss (which, as a side result, also proves the optimality of the algorithm of Daskalakis and Syrgkanis (2016) in the context of online learning against an adversary with finite choices). We also strengthen our results under natural assumptions on the loss functions, including $\Theta(\log T)$ U-calibration error for Lipschitz proper losses, $O(\log T)$ U-calibration error for a certain class of decomposable proper losses, U-calibration error bounds for proper losses with a low covering number, and others.
- [476] arXiv:2405.19380 (cross-list from stat.ML) [pdf, ps, other]
-
Title: Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ RegretComments: 61 pages, 6 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a meticulously designed preconditioner as well as a simple excitation mechanism. We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Moreover, we identify nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without the unrealistic restrictive assumptions on parameter sets that are often used in the literature.
- [477] arXiv:2405.19384 (cross-list from astro-ph.EP) [pdf, ps, html, other]
-
Title: NeuralODEs for VLEO simulations: Introducing thermoNET for Thermosphere ModelingComments: Paper presented and published in the 29th ISSFD Conference, Darmstadt, GermanySubjects: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Space Physics (physics.space-ph)
We introduce a novel neural architecture termed thermoNET, designed to represent thermospheric density in satellite orbital propagation using a reduced amount of differentiable computations. Due to the appearance of a neural network on the right-hand side of the equations of motion, the resulting satellite dynamics is governed by a NeuralODE, a neural Ordinary Differential Equation, characterized by its fully differentiable nature, allowing the derivation of variational equations (hence of the state transition matrix) and facilitating its use in connection to advanced numerical techniques such as Taylor-based numerical propagation and differential algebraic techniques. Efficient training of the network parameters occurs through two distinct approaches. In the first approach, the network undergoes training independently of spacecraft dynamics, engaging in a pure regression task against ground truth models, including JB-08 and NRLMSISE-00. In the second paradigm, network parameters are learned based on observed dynamics, adapting through ODE sensitivities. In both cases, the outcome is a flexible, compact model of the thermosphere density greatly enhancing numerical propagation efficiency while maintaining accuracy in the orbital predictions.
- [478] arXiv:2405.19397 (cross-list from cond-mat.str-el) [pdf, ps, html, other]
-
Title: Ground state phases of the two-dimension electron gas with a unified variational approachSubjects: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
The two-dimensional electron gas (2DEG) is a fundamental model, which is drawing increasing interest because of recent advances in experimental and theoretical studies of 2D materials. Current understanding of the ground state of the 2DEG relies on quantum Monte Carlo calculations, based on variational comparisons of different ansatze for different phases. We use a single variational ansatz, a general backflow-type wave function using a message-passing neural quantum state architecture, for a unified description across the entire density range. The variational optimization consistently leads to lower ground-state energies than previous best results. Transition into a Wigner crystal (WC) phase occurs automatically at rs = 37 +/- 1, a density lower than currently believed. Between the liquid and WC phases, the same ansatz and variational search strongly suggest the existence of intermediate states in a broad range of densities, with enhanced short-range nematic spin correlations.
- [479] arXiv:2405.19398 (cross-list from hep-th) [pdf, ps, html, other]
-
Title: Neural Scaling Laws From Large-N Field Theory: Solvable Model Beyond the Ridgeless LimitComments: 51 pages, 3 figuresSubjects: High Energy Physics - Theory (hep-th); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
Many machine learning models based on neural networks exhibit scaling laws: their performance scales as power laws with respect to the sizes of the model and training data set. We use large-N field theory methods to solve a model recently proposed by Maloney, Roberts and Sully which provides a simplified setting to study neural scaling laws. Our solution extends the result in this latter paper to general nonzero values of the ridge parameter, which are essential to regularize the behavior of the model. In addition to obtaining new and more precise scaling laws, we also uncover a duality transformation at the diagrams level which explains the symmetry between model and training data set sizes. The same duality underlies recent efforts to design neural networks to simulate quantum field theories.
- [480] arXiv:2405.19463 (cross-list from stat.ML) [pdf, ps, html, other]
-
Title: Stochastic Optimization Algorithms for Instrumental Variable Regression with Streaming DataSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Optimization and Control (math.OC)
We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms neither require matrix inversions nor mini-batches and provides a fully online approach for performing instrumental variable regression with streaming data. When the true model is linear, we derive rates of convergence in expectation, that are of order $\mathcal{O}(\log T/T)$ and $\mathcal{O}(1/T^{1-\iota})$ for any $\iota>0$, respectively under the availability of two-sample and one-sample oracles, respectively, where $T$ is the number of iterations. Importantly, under the availability of the two-sample oracle, our procedure avoids explicitly modeling and estimating the relationship between confounder and the instrumental variables, demonstrating the benefit of the proposed approach over recent works based on reformulating the problem as minimax optimization problems. Numerical experiments are provided to corroborate the theoretical results.
- [481] arXiv:2405.19486 (cross-list from stat.ML) [pdf, ps, html, other]
-
Title: Online Nonparametric Supervised Learning for Massive DataComments: 24 pages, 10 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Despite their benefits in terms of simplicity, low computational cost and data requirement, parametric machine learning algorithms, such as linear discriminant analysis, quadratic discriminant analysis or logistic regression, suffer from serious drawbacks including linearity, poor fit of features to the usually imposed normal distribution and high dimensionality. Batch kernel-based nonparametric classifier, which overcomes the linearity and normality of features constraints, represent an interesting alternative for supervised classification problem. However, it suffers from the ``curse of dimension". The problem can be alleviated by the explosive sample size in the era of big data, while large-scale data size presents some challenges in the storage of data and the calculation of the classifier. These challenges make the classical batch nonparametric classifier no longer applicable. This motivates us to develop a fast algorithm adapted to the real-time calculation of the nonparametric classifier in massive as well as streaming data frameworks. This online classifier includes two steps. First, we consider an online principle components analysis to reduce the dimension of the features with a very low computation cost. Then, a stochastic approximation algorithm is deployed to obtain a real-time calculation of the nonparametric classifier. The proposed methods are evaluated and compared to some commonly used machine learning algorithms for real-time fetal well-being monitoring. The study revealed that, in terms of accuracy, the offline (or Batch), as well as, the online classifiers are good competitors to the random forest algorithm. Moreover, we show that the online classifier gives the best trade-off accuracy/computation cost compared to the offline classifier.
- [482] arXiv:2405.19492 (cross-list from eess.IV) [pdf, ps, other]
-
Title: TotalSegmentator MRI: Sequence-Independent Segmentation of 59 Anatomical Structures in MR imagesTugba Akinci D'Antonoli, Lucas K. Berger, Ashraya K. Indrakanti, Nathan Vishwanathan, Jakob Weiß, Matthias Jung, Zeynep Berkarda, Alexander Rau, Marco Reisert, Thomas Küstner, Alexandra Walter, Elmar M. Merkle, Martin Segeroth, Joshy Cyriac, Shan Yang, Jakob WasserthalSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Purpose: To develop an open-source and easy-to-use segmentation model that can automatically and robustly segment most major anatomical structures in MR images independently of the MR sequence.
Materials and Methods: In this study we extended the capabilities of TotalSegmentator to MR images. 298 MR scans and 227 CT scans were used to segment 59 anatomical structures (20 organs, 18 bones, 11 muscles, 7 vessels, 3 tissue types) relevant for use cases such as organ volumetry, disease characterization, and surgical planning. The MR and CT images were randomly sampled from routine clinical studies and thus represent a real-world dataset (different ages, pathologies, scanners, body parts, sequences, contrasts, echo times, repetition times, field strengths, slice thicknesses and sites). We trained an nnU-Net segmentation algorithm on this dataset and calculated Dice similarity coefficients (Dice) to evaluate the model's performance.
Results: The model showed a Dice score of 0.824 (CI: 0.801, 0.842) on the test set, which included a wide range of clinical data with major pathologies. The model significantly outperformed two other publicly available segmentation models (Dice score, 0.824 versus 0.762; p<0.001 and 0.762 versus 0.542; p<0.001). On the CT image test set of the original TotalSegmentator paper it almost matches the performance of the original TotalSegmentator (Dice score, 0.960 versus 0.970; p<0.001).
Conclusion: Our proposed model extends the capabilities of TotalSegmentator to MR images. The annotated dataset (this https URL) and open-source toolkit (this https URL) are publicly available. - [483] arXiv:2405.19495 (cross-list from quant-ph) [pdf, ps, html, other]
-
Title: Qiskit Code Assistant: Training LLMs for generating Quantum Computing CodeNicolas Dupuis, Luca Buratti, Sanjay Vishwakarma, Aitana Viudes Forrat, David Kremer, Ismael Faro, Ruchir Puri, Juan Cruz-BenitoSubjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Code Large Language Models (Code LLMs) have emerged as powerful tools, revolutionizing the software development landscape by automating the coding process and reducing time and effort required to build applications. This paper focuses on training Code LLMs to specialize in the field of quantum computing. We begin by discussing the unique needs of quantum computing programming, which differ significantly from classical programming approaches or languages. A Code LLM specializing in quantum computing requires a foundational understanding of quantum computing and quantum information theory. However, the scarcity of available quantum code examples and the rapidly evolving field, which necessitates continuous dataset updates, present significant challenges. Moreover, we discuss our work on training Code LLMs to produce high-quality quantum code using the Qiskit library. This work includes an examination of the various aspects of the LLMs used for training and the specific training conditions, as well as the results obtained with our current models. To evaluate our models, we have developed a custom benchmark, similar to HumanEval, which includes a set of tests specifically designed for the field of quantum computing programming using Qiskit. Our findings indicate that our model outperforms existing state-of-the-art models in quantum computing tasks. We also provide examples of code suggestions, comparing our model to other relevant code LLMs. Finally, we introduce a discussion on the potential benefits of Code LLMs for quantum computing computational scientists, researchers, and practitioners. We also explore various features and future work that could be relevant in this context.
- [484] arXiv:2405.19497 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Gaussian Flow Bridges for Audio Domain Transfer with Unpaired DataComments: Submitted to IWAENC 2024Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Audio domain transfer is the process of modifying audio signals to match characteristics of a different domain, while retaining the original content. This paper investigates the potential of Gaussian Flow Bridges, an emerging approach in generative modeling, for this problem. The presented framework addresses the transport problem across different distributions of audio signals through the implementation of a series of two deterministic probability flows. The proposed framework facilitates manipulation of the target distribution properties through a continuous control variable, which defines a certain aspect of the target domain. Notably, this approach does not rely on paired examples for training. To address identified challenges on maintaining the speech content consistent, we recommend a training strategy that incorporates chunk-based minibatch Optimal Transport couplings of data samples and noise. Comparing our unsupervised method with established baselines, we find competitive performance in tasks of reverberation and distortion manipulation. Despite encoutering limitations, the intriguing results obtained in this study underscore potential for further exploration.
- [485] arXiv:2405.19516 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Enabling Visual Recognition at Radio FrequencySubjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. PanoRadar utilizes a rotating single-chip mmWave radar, along with a combination of novel signal processing and machine learning algorithms, to create high-resolution 3D images of the surroundings. Our system accurately estimates robot motion, allowing for coherent imaging through a dense grid of synthetic antennas. It also exploits the high azimuth resolution to enhance elevation resolution using learning-based methods. Furthermore, PanoRadar tackles 3D learning via 2D convolutions and addresses challenges due to the unique characteristics of RF signals. Our results demonstrate PanoRadar's robust performance across 12 buildings.
- [486] arXiv:2405.19518 (cross-list from physics.ao-ph) [pdf, ps, html, other]
-
Title: Exploring the Potential of Hybrid Machine-Learning/Physics-Based Modeling for Atmospheric/Oceanic Prediction Beyond the Medium RangeSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
This paper explores the potential of a hybrid modeling approach that combines machine learning (ML) with conventional physics-based modeling for weather prediction beyond the medium range. It extends the work of Arcomano et al. (2022), which tested the approach for short- and medium-range weather prediction, and the work of Arcomano et al. (2023), which investigated its potential for climate modeling. The hybrid model used for the forecast experiments of the paper is based on the low-resolution, simplified parameterization atmospheric general circulation model (AGCM) SPEEDY. In addition to the hybridized prognostic variables of SPEEDY, the current version of the model has three purely ML-based prognostic variables. One of these is 6~h cumulative precipitation, another is the sea surface temperature, while the third is the heat content of the top 300 m deep layer of the ocean. The model has skill in predicting the El Niño cycle and its global teleconnections with precipitation for 3-7 months depending on the season. The model captures equatorial variability of the precipitation associated with Kelvin and Rossby waves and MJO. Predictions of the precipitation in the equatorial region have skill for 15 days in the East Pacific and 11.5 days in the West Pacific. Though the model has low spatial resolution, for these tasks it has prediction skill comparable to what has been published for high-resolution, purely physics-based, conventional operational forecast models.
- [487] arXiv:2405.19542 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Anatomical Region Recognition and Real-time Bone Tracking Methods by Dynamically Decoding A-Mode Ultrasound SignalsSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Robotics (cs.RO)
Accurate bone tracking is crucial for kinematic analysis in orthopedic surgery and prosthetic robotics. Traditional methods (e.g., skin markers) are subject to soft tissue artifacts, and the bone pins used in surgery introduce the risk of additional trauma and infection. For electromyography (EMG), its inability to directly measure joint angles requires complex algorithms for kinematic estimation. To address these issues, A-mode ultrasound-based tracking has been proposed as a non-invasive and safe alternative. However, this approach suffers from limited accuracy in peak detection when processing received ultrasound signals. To build a precise and real-time bone tracking approach, this paper introduces a deep learning-based method for anatomical region recognition and bone tracking using A-mode ultrasound signals, specifically focused on the knee joint. The algorithm is capable of simultaneously performing bone tracking and identifying the anatomical region where the A-mode ultrasound transducer is placed. It contains the fully connection between all encoding and decoding layers of the cascaded U-Nets to focus only on the signal region that is most likely to have the bone peak, thus pinpointing the exact location of the peak and classifying the anatomical region of the signal. The experiment showed a 97% accuracy in the classification of the anatomical regions and a precision of around 0.5$\pm$1mm under dynamic tracking conditions for various anatomical areas surrounding the knee joint. In general, this approach shows great potential beyond the traditional method, in terms of the accuracy achieved and the recognition of the anatomical region where the ultrasound has been attached as an additional functionality.
- [488] arXiv:2405.19546 (cross-list from physics.ao-ph) [pdf, ps, html, other]
-
Title: Convex Optimization of Initial Perturbations toward Quantitative Weather ControlComments: submitted to Geophysical Research LettersSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Systems and Control (eess.SY); Optimization and Control (math.OC)
We propose a convex optimization approach to determine perturbations in the initial conditions of a weather phenomenon as control inputs for quantitative weather control. We first construct a sensitivity matrix of outputs, such as accumulated precipitation, to the initial conditions, such as temperature and humidity, through sensitivity analysis of a numerical weather prediction model. We then solve a convex optimization problem to find optimal perturbations in the initial conditions to realize the desired spatial distribution of the targeting outputs. We implement the proposed method in a benchmark of a warm bubble experiment and show that it realizes desired spatial distributions of accumulated precipitation, such as a reference distribution and the reduced maximum value.
- [489] arXiv:2405.19553 (cross-list from math.ST) [pdf, ps, html, other]
-
Title: Convergence Bounds for Sequential Monte Carlo on Multimodal Distributions using Soft DecompositionSubjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
We prove bounds on the variance of a function $f$ under the empirical measure of the samples obtained by the Sequential Monte Carlo (SMC) algorithm, with time complexity depending on local rather than global Markov chain mixing dynamics. SMC is a Markov Chain Monte Carlo (MCMC) method, which starts by drawing $N$ particles from a known distribution, and then, through a sequence of distributions, re-weights and re-samples the particles, at each instance applying a Markov chain for smoothing. In principle, SMC tries to alleviate problems from multi-modality. However, most theoretical guarantees for SMC are obtained by assuming global mixing time bounds, which are only efficient in the uni-modal setting. We show that bounds can be obtained in the truly multi-modal setting, with mixing times that depend only on local MCMC dynamics.
- [490] arXiv:2405.19565 (cross-list from physics.soc-ph) [pdf, ps, html, other]
-
Title: Unbending strategies shepherd cooperation and suppress extortion in spatial populationsComments: 21 pages, 6 figuresSubjects: Physics and Society (physics.soc-ph); Computer Science and Game Theory (cs.GT); Populations and Evolution (q-bio.PE)
Evolutionary game dynamics on networks typically consider the competition among simple strategies such as cooperation and defection in the Prisoner's Dilemma and summarize the effect of population structure as network reciprocity. However, it remains largely unknown regarding the evolutionary dynamics involving multiple powerful strategies typically considered in repeated games, such as the zero-determinant (ZD) strategies that are able to enforce a linear payoff relationship between them and their co-players. Here, we consider the evolutionary dynamics of always cooperate (AllC), extortionate ZD (extortioners), and unbending players in lattice populations based on the commonly used death-birth updating. Out of the class of unbending strategies, we consider a particular candidate, PSO Gambler, a machine-learning-optimized memory-one strategy, which can foster reciprocal cooperation and fairness among extortionate players. We derive analytical results under weak selection and rare mutations, including pairwise fixation probabilities and long-term frequencies of strategies. In the absence of the third unbending type, extortioners can achieve a half-half split in equilibrium with unconditional cooperators for sufficiently large extortion factors. However, the presence of unbending players fundamentally changes the dynamics and tilts the system to favor unbending cooperation. Most surprisingly, extortioners cannot dominate at all regardless of how large their extortion factor is, and the long-term frequency of unbending players is maintained almost as a constant. Our analytical method is applicable to studying the evolutionary dynamics of multiple strategies in structured populations. Our work provides insights into the interplay between network reciprocity and direct reciprocity, revealing the role of unbending strategies in enforcing fairness and suppressing extortion.
- [491] arXiv:2405.19603 (cross-list from math.CO) [pdf, ps, html, other]
-
Title: Morse and Lusternik-Schnirelmann for graphsComments: 36 pagesSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
Both Morse theory and Lusternik-Schnirelmann theory link algebra, topology and analysis in a geometric setting. The two theories can be formulated in finite geometries like graph theory or within finite abstract simplicial complexes. We work here mostly in graph theory and review the Morse inequalities b(k)-b(k-1) + ... + b(0) less of equal than c(k)-c(k-1) + ... + c(0) for the Betti numbers b(k) and the minimal number c(k) of Morse critical points of index k and the Lusternik-Schnirelmann inequalities cup+1 less or equal than cat less or equal than cri, between the algebraic cup length cup, the topological category cat and the analytic number cri counting the minimal number of critical points of a function.
- [492] arXiv:2405.19610 (cross-list from stat.ML) [pdf, ps, html, other]
-
Title: Factor Augmented Tensor-on-Tensor Neural NetworksSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
This paper studies the prediction task of tensor-on-tensor regression in which both covariates and responses are multi-dimensional arrays (a.k.a., tensors) across time with arbitrary tensor order and data dimension. Existing methods either focused on linear models without accounting for possibly nonlinear relationships between covariates and responses, or directly employed black-box deep learning algorithms that failed to utilize the inherent tensor structure. In this work, we propose a Factor Augmented Tensor-on-Tensor Neural Network (FATTNN) that integrates tensor factor models into deep neural networks. We begin with summarizing and extracting useful predictive information (represented by the ``factor tensor'') from the complex structured tensor covariates, and then proceed with the prediction task using the estimated factor tensor as input of a temporal convolutional neural network. The proposed methods effectively handle nonlinearity between complex data structures, and improve over traditional statistical models and conventional deep learning approaches in both prediction accuracy and computational cost. By leveraging tensor factor models, our proposed methods exploit the underlying latent factor structure to enhance the prediction, and in the meantime, drastically reduce the data dimensionality that speeds up the computation. The empirical performances of our proposed methods are demonstrated via simulation studies and real-world applications to three public datasets. Numerical results show that our proposed algorithms achieve substantial increases in prediction accuracy and significant reductions in computational time compared to benchmark methods.
- [493] arXiv:2405.19672 (cross-list from eess.IV) [pdf, ps, html, other]
-
Title: CRIS: Collaborative Refinement Integrated with Segmentation for Polyp SegmentationSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Accurate detection of colorectal cancer and early prevention heavily rely on precise polyp identification during gastrointestinal colonoscopy. Due to limited data, many current state-of-the-art deep learning methods for polyp segmentation often rely on post-processing of masks to reduce noise and enhance results. In this study, we propose an approach that integrates mask refinement and binary semantic segmentation, leveraging a novel collaborative training strategy that surpasses current widely-used refinement strategies. We demonstrate the superiority of our approach through comprehensive evaluation on established benchmark datasets and its successful application across various medical image segmentation architectures.
- [494] arXiv:2405.19681 (cross-list from stat.ML) [pdf, ps, html, other]
-
Title: Bayesian Online Natural Gradient (BONG)Comments: 41 pages, 11 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
We propose a novel approach to sequential Bayesian inference based on variational Bayes. The key insight is that, in the online setting, we do not need to add the KL term to regularize to the prior (which comes from the posterior at the previous timestep); instead we can optimize just the expected log-likelihood, performing a single step of natural gradient descent starting at the prior predictive. We prove this method recovers exact Bayesian inference if the model is conjugate, and empirically outperforms other online VB methods in the non-conjugate setting, such as online learning for neural networks, especially when controlling for computational costs.
- [495] arXiv:2405.19697 (cross-list from math.OC) [pdf, ps, html, other]
-
Title: Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexityComments: 43 pages, 1 figure, 1 tableSubjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, to be an impediment to developing bilevel optimization methods. By employing the fixed point equation associated with the regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of hyper-gradient from the general AID-based bilevel frameworks since we take advantage of the specific structure of RL problems. Moreover, we propose both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms are provable to enjoy the convergence rate $\mathcal{O}(\epsilon^{-1})$. To the best of our knowledge, this is the first time that AID-based bilevel RL gets rid of additional assumptions on the lower-level problem. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.
- [496] arXiv:2405.19704 (cross-list from stat.ML) [pdf, ps, html, other]
-
Title: Enhancing Sufficient Dimension Reduction via Hellinger CorrelationSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
In this work, we develop a new theory and method for sufficient dimension reduction (SDR) in single-index models, where SDR is a sub-field of supervised dimension reduction based on conditional independence. Our work is primarily motivated by the recent introduction of the Hellinger correlation as a dependency measure. Utilizing this measure, we develop a method capable of effectively detecting the dimension reduction subspace, complete with theoretical justification. Through extensive numerical experiments, we demonstrate that our proposed method significantly enhances and outperforms existing SDR methods. This improvement is largely attributed to our proposed method's deeper understanding of data dependencies and the refinement of existing SDR techniques.
- [497] arXiv:2405.19725 (cross-list from quant-ph) [pdf, ps, html, other]
-
Title: Quantum Visual Feature Encoding RevisitedSubjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Although quantum machine learning has been introduced for a while, its applications in computer vision are still limited. This paper, therefore, revisits the quantum visual encoding strategies, the initial step in quantum machine learning. Investigating the root cause, we uncover that the existing quantum encoding design fails to ensure information preservation of the visual features after the encoding process, thus complicating the learning process of the quantum machine learning models. In particular, the problem, termed "Quantum Information Gap" (QIG), leads to a gap of information between classical and corresponding quantum features. We provide theoretical proof and practical demonstrations of that found and underscore the significance of QIG, as it directly impacts the performance of quantum machine learning algorithms. To tackle this challenge, we introduce a simple but efficient new loss function named Quantum Information Preserving (QIP) to minimize this gap, resulting in enhanced performance of quantum machine learning algorithms. Extensive experiments validate the effectiveness of our approach, showcasing superior performance compared to current methodologies and consistently achieving state-of-the-art results in quantum modeling.
- [498] arXiv:2405.19760 (cross-list from stat.ML) [pdf, ps, html, other]
-
Title: Identifiability of a statistical model with two latent vectors: Importance of the dimensionality relation and application to graph embeddingSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Identifiability of statistical models is a key notion in unsupervised representation learning. Recent work of nonlinear independent component analysis (ICA) employs auxiliary data and has established identifiable conditions. This paper proposes a statistical model of two latent vectors with single auxiliary data generalizing nonlinear ICA, and establishes various identifiability conditions. Unlike previous work, the two latent vectors in the proposed model can have arbitrary dimensions, and this property enables us to reveal an insightful dimensionality relation among two latent vectors and auxiliary data in identifiability conditions. Furthermore, surprisingly, we prove that the indeterminacies of the proposed model has the same as \emph{linear} ICA under certain conditions: The elements in the latent vector can be recovered up to their permutation and scales. Next, we apply the identifiability theory to a statistical model for graph data. As a result, one of the identifiability conditions includes an appealing implication: Identifiability of the statistical model could depend on the maximum value of link weights in graph data. Then, we propose a practical method for identifiable graph embedding. Finally, we numerically demonstrate that the proposed method well-recovers the latent vectors and model identifiability clearly depends on the maximum value of link weights, which supports the implication of our theoretical results
- [499] arXiv:2405.19889 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Deep Joint Semantic Coding and Beamforming for Near-Space Airship-Borne Massive MIMO NetworkComments: Major Revision by IEEE JSACSubjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG); Multimedia (cs.MM)
Near-space airship-borne communication network is recognized to be an indispensable component of the future integrated ground-air-space network thanks to airships' advantage of long-term residency at stratospheric altitudes, but it urgently needs reliable and efficient Airship-to-X link. To improve the transmission efficiency and capacity, this paper proposes to integrate semantic communication with massive multiple-input multiple-output (MIMO) technology. Specifically, we propose a deep joint semantic coding and beamforming (JSCBF) scheme for airship-based massive MIMO image transmission network in space, in which semantics from both source and channel are fused to jointly design the semantic coding and physical layer beamforming. First, we design two semantic extraction networks to extract semantics from image source and channel state information, respectively. Then, we propose a semantic fusion network that can fuse these semantics into complex-valued semantic features for subsequent physical-layer transmission. To efficiently transmit the fused semantic features at the physical layer, we then propose the hybrid data and model-driven semantic-aware beamforming networks. At the receiver, a semantic decoding network is designed to reconstruct the transmitted images. Finally, we perform end-to-end deep learning to jointly train all the modules, using the image reconstruction quality at the receivers as a metric. The proposed deep JSCBF scheme fully combines the efficient source compressibility and robust error correction capability of semantic communication with the high spectral efficiency of massive MIMO, achieving a significant performance improvement over existing approaches.
- [500] arXiv:2405.19891 (cross-list from quant-ph) [pdf, ps, other]
-
Title: Improving the Fidelity of CNOT Circuits on NISQ HardwareComments: 67 pages, 33 figures, and 9 tablesSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT)
We introduce an improved CNOT synthesis algorithm that considers nearest-neighbour interactions and CNOT gate error rates in noisy intermediate-scale quantum (NISQ) hardware. Compared to IBM's Qiskit compiler, it improves the fidelity of a synthesized CNOT circuit by about 2 times on average (up to 9 times). It lowers the synthesized CNOT count by a factor of 13 on average (up to a factor of 162).
Our contribution is twofold. First, we define a $\textsf{Cost}$ function by approximating the average gate fidelity $F_{avg}$. According to the simulation results, $\textsf{Cost}$ fits the error probability of a noisy CNOT circuit, $\textsf{Prob} = 1 - F_{avg}$, much tighter than the commonly used cost functions. On IBM's fake Nairobi backend, it matches $\textsf{Prob}$ to within $10^{-3}$. On other backends, it fits $\textsf{Prob}$ to within $10^{-1}$. $\textsf{Cost}$ accurately quantifies the dynamic error characteristics and shows remarkable scalability. Second, we propose a noise-aware CNOT routing algorithm, NAPermRowCol, by adapting the leading Steiner-tree-based connectivity-aware CNOT synthesis algorithms. A weighted edge is used to encode a CNOT gate error rate and $\textsf{Cost}$-instructed heuristics are applied to each reduction step. NAPermRowCol does not use ancillary qubits and is not restricted to certain initial qubit maps. Compared with algorithms that are noise-agnostic, it improves the fidelity of a synthesized CNOT circuit across varied NISQ hardware. Depending on the benchmark circuit and the IBM backend selected, it lowers the synthesized CNOT count up to $56.95\%$ compared to ROWCOL and up to $21.62\%$ compared to PermRowCol. It reduces the synthesis $\textsf{Cost}$ up to $25.71\%$ compared to ROWCOL and up to $9.12\%$ compared to PermRowCol. Our method can be extended to route a more general quantum circuit, giving a powerful new tool for compiling on NISQ devices.