Computer Science
See recent articles
- [1] arXiv:2407.17471 [pdf, html, other]
-
Title: Real-Time Automated donning and doffing detection of PPE based on Yolov4-tinySubjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Maintaining patient safety and the safety of healthcare workers (HCWs) in hospitals and clinics highly depends on following the proper protocol for donning and taking off personal protective equipment (PPE). HCWs can benefit from a feedback system during the putting on and removal process because the process is cognitively demanding and errors are common. Centers for Disease Control and Prevention (CDC) provided guidelines for correct PPE use which should be followed. A real time object detection along with a unique sequencing algorithms are used to identify and determine the donning and doffing process in real time. The purpose of this technical research is two-fold: The user gets real time alert to the step they missed in the sequence if they don't follow the proper procedure during donning or doffing. Secondly, the use of tiny machine learning (yolov4-tiny) in embedded system architecture makes it feasible and cost-effective to deploy in different healthcare settings.
- [2] arXiv:2407.17472 [pdf, other]
-
Title: Promoting Health via mHealth Applications Using a French Version of the Mobile App Rating Scale: Adaptation and Validation StudyIna Saliasi (P2S), Prescilla Martinon (P2S), Emily Darlington (P2S), Colette Smentek (P2S), Delphine Tardivo (ADES, APHM, AMU ODONTO), Denis Bourgeois (P2S), Claude Dussart (P2S), Florence Carrouel (P2S), Laurie Fraticelli (P2S)Journal-ref: JMIR mHealth and uHealth, 2021, 9 (8), pp.e30480Subjects: Computers and Society (cs.CY)
Background In the recent decades, the number of apps promoting health behaviors and health-related strategies and interventions has increased alongside the number of smartphone users. Nevertheless, the validity process for measuring and reporting app quality remains unsatisfactory for health professionals and end users and represents a public health concern. The Mobile Application Rating Scale (MARS) is a tool validated and widely used in the scientific literature to evaluate and compare mHealth app functionalities. However, MARS is not adapted to the French culture nor to the language. Objective This study aims to translate, adapt, and validate the equivalent French version of MARS (ie, MARS-F). Methods The original MARS was first translated to French by two independent bilingual scientists, and their common version was blind back-translated twice by two native English speakers, culminating in a final well-established MARS-F. Its comprehensibility was then evaluated by 6 individuals (3 researchers and 3 nonacademics), and the final MARS-F version was created. Two bilingual raters independently completed the evaluation of 63 apps using MARS and MARS-F. Interrater reliability was assessed using intraclass correlation coefficients. In addition, internal consistency and validity of both scales were assessed. Mokken scale analysis was used to investigate the scalability of both MARS and MARS-F. Results MARS-F had a good alignment with the original MARS, with properties comparable between the two scales. The correlation coefficients (r) between the corresponding dimensions of MARS and MARS-F ranged from 0.97 to 0.99. The internal consistencies of the MARS-F dimensions engagement ($\omega$=0.79), functionality ($\omega$=0.79), esthetics ($\omega$=0.78), and information quality ($\omega$=0.61) were acceptable and that for the overall MARS score ($\omega$=0.86) was good. Mokken scale analysis revealed a strong scalability for MARS (Loevinger H=0.37) and a good scalability for MARS-F (H=0.35). Conclusions MARS-F is a valid tool, and it would serve as a crucial aid for researchers, health care professionals, public health authorities, and interested third parties, to assess the quality of mHealth apps in French-speaking countries.
- [3] arXiv:2407.17473 [pdf, html, other]
-
Title: Improving engagement, diversity, and retention in computer science with RadGrad: Results of a case studySubjects: Computers and Society (cs.CY)
RadGrad is a curriculum initiative implemented via an application that combines features of social networks, degree planners, individual learning plans, and serious games. RadGrad redefines traditional meanings of "progress" and "success" in the undergraduate computer science degree program in an attempt to improve engagement, retention, and diversity. In this paper, we describe the RadGrad Project and report on an evaluation study designed to assess the impact of RadGrad on student engagement, diversity, and retention. We also present opportunities and challenges that result from the use of the system.
- [4] arXiv:2407.17474 [pdf, html, other]
-
Title: "My Kind of Woman": Analysing Gender Stereotypes in AI through The Averageness Theory and EU LawComments: presented at IAIL 2024 the Imagining the AI Landscape After the AI ACT, in conjunction with HHAI2024, Malmö, Sweden, June 10, 2024Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
This study delves into gender classification systems, shedding light on the interaction between social stereotypes and algorithmic determinations. Drawing on the "averageness theory," which suggests a relationship between a face's attractiveness and the human ability to ascertain its gender, we explore the potential propagation of human bias into artificial intelligence (AI) systems. Utilising the AI model Stable Diffusion 2.1, we have created a dataset containing various connotations of attractiveness to test whether the correlation between attractiveness and accuracy in gender classification observed in human cognition persists within AI. Our findings indicate that akin to human dynamics, AI systems exhibit variations in gender classification accuracy based on attractiveness, mirroring social prejudices and stereotypes in their algorithmic decisions. This discovery underscores the critical need to consider the impacts of human perceptions on data collection and highlights the necessity for a multidisciplinary and intersectional approach to AI development and AI data training. By incorporating cognitive psychology and feminist legal theory, we examine how data used for AI training can foster gender diversity and fairness under the scope of the AI Act and GDPR, reaffirming how psychological and feminist legal theories can offer valuable insights for ensuring the protection of gender equality and non-discrimination in AI systems.
- [5] arXiv:2407.17475 [pdf, html, other]
-
Title: An Approach to Detect Abnormal Submissions for CodeWorkout DatasetSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Students interactions while solving problems in learning environments (i.e. log data) are often used to support students learning. For example, researchers use log data to develop systems that can provide students with personalized problem recommendations based on their knowledge level. However, anomalies in the students log data, such as cheating to solve programming problems, could introduce a hidden bias in the log data. As a result, these systems may provide inaccurate problem recommendations, and therefore, defeat their purpose. Classical cheating detection methods, such as MOSS, can be used to detect code plagiarism. However, these methods cannot detect other abnormal events such as a student gaming a system with multiple attempts of similar solutions to a particular programming problem. This paper presents a preliminary study to analyze log data with anomalies. The goal of our work is to overcome the abnormal instances when modeling personalizable recommendations in programming learning environments.
- [6] arXiv:2407.17476 [pdf, html, other]
-
Title: ORCDF: An Oversmoothing-Resistant Cognitive Diagnosis Framework for Student Learning in Online Education SystemsJournal-ref: KDD 2024Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Cognitive diagnosis models (CDMs) are designed to learn students' mastery levels using their response logs. CDMs play a fundamental role in online education systems since they significantly influence downstream applications such as teachers' guidance and computerized adaptive testing. Despite the success achieved by existing CDMs, we find that they suffer from a thorny issue that the learned students' mastery levels are too similar. This issue, which we refer to as oversmoothing, could diminish the CDMs' effectiveness in downstream tasks. CDMs comprise two core parts: learning students' mastery levels and assessing mastery levels by fitting the response logs. This paper contends that the oversmoothing issue arises from that existing CDMs seldom utilize response signals on exercises in the learning part but only use them as labels in the assessing part. To this end, this paper proposes an oversmoothing-resistant cognitive diagnosis framework (ORCDF) to enhance existing CDMs by utilizing response signals in the learning part. Specifically, ORCDF introduces a novel response graph to inherently incorporate response signals as types of edges. Then, ORCDF designs a tailored response-aware graph convolution network (RGC) that effectively captures the crucial response signals within the response graph. Via ORCDF, existing CDMs are enhanced by replacing the input embeddings with the outcome of RGC, allowing for the consideration of response signals on exercises in the learning part. Extensive experiments on real-world datasets show that ORCDF not only helps existing CDMs alleviate the oversmoothing issue but also significantly enhances the models' prediction and interpretability performance. Moreover, the effectiveness of ORCDF is validated in the downstream task of computerized adaptive testing.
- [7] arXiv:2407.17477 [pdf, other]
-
Title: Toward Automated Detection of Biased Social Signals from the Content of Clinical ConversationsFeng Chen, Manas Satish Bedmutha, Ray-Yuan Chung, Janice Sabin, Wanda Pratt, Brian R. Wood, Nadir Weibel, Andrea L. Hartzler, Trevor CohenComments: Accepted by AMIA 2024 Annual SymposiumSubjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Implicit bias can impede patient-provider interactions and lead to inequities in care. Raising awareness is key to reducing such bias, but its manifestations in the social dynamics of patient-provider communication are difficult to detect. In this study, we used automated speech recognition (ASR) and natural language processing (NLP) to identify social signals in patient-provider interactions. We built an automated pipeline to predict social signals from audio recordings of 782 primary care visits that achieved 90.1% average accuracy across codes, and exhibited fairness in its predictions for white and non-white patients. Applying this pipeline, we identified statistically significant differences in provider communication behavior toward white versus non-white patients. In particular, providers expressed more patient-centered behaviors towards white patients including more warmth, engagement, and attentiveness. Our study underscores the potential of automated tools in identifying subtle communication signals that may be linked with bias and impact healthcare quality and equity.
- [8] arXiv:2407.17478 [pdf, other]
-
Title: LLM4PM: A case study on using Large Language Models for Process Modeling in Enterprise OrganizationsComments: 10 pages, 6 figuresSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
We investigate the potential of using Large Language Models (LLM) to support process model creation in organizational contexts. Specifically, we carry out a case study wherein we develop and test an LLM-based chatbot, PRODIGY (PROcess moDellIng Guidance for You), in a multinational company, the Hilti Group. We are particularly interested in understanding how LLM can aid (human) modellers in creating process flow diagrams. To this purpose, we first conduct a preliminary user study (n=10) with professional process modellers from Hilti, inquiring for various pain-points they encounter in their daily routines. Then, we use their responses to design and implement PRODIGY. Finally, we evaluate PRODIGY by letting our user study's participants use PRODIGY, and then ask for their opinion on the pros and cons of PRODIGY. We coalesce our results in actionable takeaways. Through our research, we showcase the first practical application of LLM for process modelling in the real world, shedding light on how industries can leverage LLM to enhance their Business Process Management activities.
- [9] arXiv:2407.17479 [pdf, html, other]
-
Title: The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and SpainComments: 16 pages and 10 figures in total, includes English and Spanish version, to be published in proceedings North American Chapter of the Association for Computational Linguistics Conference: LatinX in AI (LXAI) Research Workshop, Mexico City, Mexico, June 2024Subjects: Computation and Language (cs.CL)
We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean and Spain needs to be represented in Artificial Intelligence (AI) systems. Despite being the 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs), nor a leaderboard to evaluate and compare them. In this paper, we present how we have created as an international open-source community the first versions of the instruction and evaluation datasets, indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.
- [10] arXiv:2407.17480 [pdf, html, other]
-
Title: Universal Approximation Theory: The basic theory for deep learning-based computer vision modelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Computer vision (CV) is one of the most crucial fields in artificial intelligence. In recent years, a variety of deep learning models based on convolutional neural networks (CNNs) and Transformers have been designed to tackle diverse problems in CV. These algorithms have found practical applications in areas such as robotics and facial recognition. Despite the increasing power of current CV models, several fundamental questions remain unresolved: Why do CNNs require deep layers? What ensures the generalization ability of CNNs? Why do residual-based networks outperform fully convolutional networks like VGG? What is the fundamental difference between residual-based CNNs and Transformer-based networks? Why can CNNs utilize LoRA and pruning techniques? The root cause of these questions lies in the lack of a robust theoretical foundation for deep learning models in CV. To address these critical issues and techniques, we employ the Universal Approximation Theorem (UAT) to provide a theoretical basis for convolution- and Transformer-based models in CV. By doing so, we aim to elucidate these questions from a theoretical perspective.
- [11] arXiv:2407.17481 [pdf, other]
-
Title: Human Oversight of Artificial Intelligence and Technical StandardisationMarion Ho-Dac (UA, CDEP), Baptiste Martinez (UA, CDEP)Comments: in French languageSubjects: Computers and Society (cs.CY)
The adoption of human oversight measures makes it possible to regulate, to varying degrees and in different ways, the decision-making process of Artificial Intelligence (AI) systems, for example by placing a human being in charge of supervising the system and, upstream, by developing the AI system to enable such supervision. Within the global governance of AI, the requirement for human oversight is embodied in several regulatory formats, within a diversity of normative sources. On the one hand, it reinforces the accountability of AI systems' users (for example, by requiring them to carry out certain checks) and, on the other hand, it better protects the individuals affected by the AI-based decision (for example, by allowing them to request a review of the decision). In the European context, the AI Act imposes obligations on providers of high-risk AI systems (and to some extent also on professional users of these systems, known as deployers), including the introduction of human oversight tools throughout the life cycle of AI systems, including by design (and their implementation by deployers). The EU legislator is therefore going much further than in the past in "spelling out" the legal requirement for human oversight. But it does not intend to provide for all implementation details; it calls on standardisation to technically flesh out this requirement (and more broadly all the requirements of section 2 of chapter III) on the basis of article 40 of the AI Act. In this multi-level regulatory context, the question of the place of humans in the AI decision-making process should be given particular attention. Indeed, depending on whether it is the law or the technical standard that sets the contours of human oversight, the "regulatory governance" of AI is not the same: its nature, content and scope are different. This analysis is at the heart of the contribution made (or to be made) by legal experts to the central reflection on the most appropriate regulatory governance -- in terms of both its institutional format and its substance -- to ensure the effectiveness of human oversight and AI trustworthiness.
- [12] arXiv:2407.17482 [pdf, other]
-
Title: Reinforcement Learning from Human Feedback: Whose Culture, Whose Values, Whose Perspectives?Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
We argue for the epistemic and ethical advantages of pluralism in Reinforcement Learning from Human Feedback (RLHF) in the context of Large Language Models (LLM). Drawing on social epistemology and pluralist philosophy of science, we suggest ways in which RHLF can be made more responsive to human needs and how we can address challenges along the way. The paper concludes with an agenda for change, i.e. concrete, actionable steps to improve LLM development.
- [13] arXiv:2407.17483 [pdf, other]
-
Title: Tackling CS education in K-12: Implementing a Google CS4HS Grant Program in a Rural Underserved AreaComments: Presented at and published in the proceedings of MICS 2017: this https URL 8 pages, 2 figuresJournal-ref: Harms, S. K. (2017). Tackling CS education in K-12: Implementing a Google CS4HS Grant Program. La Crosse, WI: 2017 Midwest Instructional Computing Symposium ProceedingsSubjects: Computers and Society (cs.CY)
Providing computer science (CS) offerings in the K-12 education system is often limited by the lack of experienced teachers, especially in small or rural underserved school districts. By helping teachers in underserved areas develop CS curriculum and helping them become certified to teach CS courses, more young people in underserved areas are aware of IT-career opportunities, and prepared for CS education at the university level, which ultimately helps tackle the IT workforce deficit in the United States.
This paper discusses a successful implementation of a Google CS4HS grant to a rural underserved area, as well as lessons learned through the implementation of the program. Key elements in the implementation included a face-to-face hands-on workshop, followed by a seven week graduate-level online summer course for the teachers to learn and develop curriculum that covers the CS concepts they will be teaching. The teachers were supported with an online community of practice for the year as they implemented the curriculum. - [14] arXiv:2407.17484 [pdf, html, other]
-
Title: A Survey of Accessible Explainable Artificial Intelligence ResearchChukwunonso Henry Nwokoye, Maria J. P. Peixoto, Akriti Pandey, Lauren Pardy, Mahadeo Sukhai, Peter R. LewisComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The increasing integration of Artificial Intelligence (AI) into everyday life makes it essential to explain AI-based decision-making in a way that is understandable to all users, including those with disabilities. Accessible explanations are crucial as accessibility in technology promotes digital inclusion and allows everyone, regardless of their physical, sensory, or cognitive abilities, to use these technologies effectively. This paper presents a systematic literature review of the research on the accessibility of Explainable Artificial Intelligence (XAI), specifically considering persons with sight loss. Our methodology includes searching several academic databases with search terms to capture intersections between XAI and accessibility. The results of this survey highlight the lack of research on Accessible XAI (AXAI) and stress the importance of including the disability community in XAI development to promote digital inclusion and accessibility and remove barriers. Most XAI techniques rely on visual explanations, such as heatmaps or graphs, which are not accessible to persons who are blind or have low vision. Therefore, it is necessary to develop explanation methods through non-visual modalities, such as auditory and tactile feedback, visual modalities accessible to persons with low vision, and personalized solutions that meet the needs of individuals, including those with multiple disabilities. We further emphasize the importance of integrating universal design principles into AI development practices to ensure that AI technologies are usable by everyone.
- [15] arXiv:2407.17486 [pdf, html, other]
-
Title: Learning from Memory: Non-Parametric Memory Augmented Self-Supervised Learning of Visual FeaturesComments: To appear in ICML 2024. Code at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This paper introduces a novel approach to improving the training stability of self-supervised learning (SSL) methods by leveraging a non-parametric memory of seen concepts. The proposed method involves augmenting a neural network with a memory component to stochastically compare current image views with previously encountered concepts. Additionally, we introduce stochastic memory blocks to regularize training and enforce consistency between image views. We extensively benchmark our method on many vision tasks, such as linear probing, transfer learning, low-shot classification, and image retrieval on many datasets. The experimental results consolidate the effectiveness of the proposed approach in achieving stable SSL training without additional regularizers while learning highly transferable representations and requiring less computing time and resources.
- [16] arXiv:2407.17487 [pdf, html, other]
-
Title: Explainable Natural Language Processing for Corporate Sustainability AnalysisKeane Ong, Rui Mao, Ranjan Satapathy, Ricardo Shirota Filho, Erik Cambria, Johan Sulaeman, Gianmarco MengaldoSubjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Sustainability commonly refers to entities, such as individuals, companies, and institutions, having a non-detrimental (or even positive) impact on the environment, society, and the economy. With sustainability becoming a synonym of acceptable and legitimate behaviour, it is being increasingly demanded and regulated. Several frameworks and standards have been proposed to measure the sustainability impact of corporations, including United Nations' sustainable development goals and the recently introduced global sustainability reporting framework, amongst others. However, the concept of corporate sustainability is complex due to the diverse and intricate nature of firm operations (i.e. geography, size, business activities, interlinks with other stakeholders). As a result, corporate sustainability assessments are plagued by subjectivity both within data that reflect corporate sustainability efforts (i.e. corporate sustainability disclosures) and the analysts evaluating them. This subjectivity can be distilled into distinct challenges, such as incompleteness, ambiguity, unreliability and sophistication on the data dimension, as well as limited resources and potential bias on the analyst dimension. Put together, subjectivity hinders effective cost attribution to entities non-compliant with prevailing sustainability expectations, potentially rendering sustainability efforts and its associated regulations futile. To this end, we argue that Explainable Natural Language Processing (XNLP) can significantly enhance corporate sustainability analysis. Specifically, linguistic understanding algorithms (lexical, semantic, syntactic), integrated with XAI capabilities (interpretability, explainability, faithfulness), can bridge gaps in analyst resources and mitigate subjectivity problems within data.
- [17] arXiv:2407.17488 [pdf, other]
-
Title: Untangling Cognitive Processes Underlying Knowledge WorkSubjects: Human-Computer Interaction (cs.HC)
In a post-industrial society, the workplace is dominated primarily by Knowledge Work, which is achieved mostly through human cognitive processing, such as analysis, comprehension, evaluation, and decision-making. Many of these processes have limited support from technology in the same way that physical tasks have been enabled through a host of tools from hammers to shovels and hydraulic lifts. To develop a suite of cognitive tools, we first need to understand which processes humans use to complete work tasks. In the past century several classifications (e.g., Blooms) of cognitive processes have emerged, and we assessed their viability as the basis for designing tools that support cognitive work. This study re-used an existing data set composed of interviews of environmental scientists about their core work. While the classification uncovered many instances of cognitive process, the results showed that the existing cognitive process classifications do not provide a sufficiently comprehensive deconstruction of the human cognitive processes; the work is quite simply too abstract to be operational.
- [18] arXiv:2407.17489 [pdf, html, other]
-
Title: Collective Attention in Human-AI TeamsSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); General Economics (econ.GN)
How does the presence of an AI assistant affect the collective attention of a team? We study 20 human teams of 3-4 individuals paired with one voice-only AI assistant during a challenging puzzle task. Teams are randomly assigned to an AI assistant with a human- or robotic-sounding voice that provides either helpful or misleading information about the task. Treating each individual AI interjection as a treatment intervention, we identify the causal effects of the AI on dynamic group processes involving language use. Our findings demonstrate that the AI significantly affects what teams discuss, how they discuss it, and the alignment of their mental models. Teams adopt AI-introduced language for both terms directly related to the task and for peripheral terms, even when they (a) recognize the unhelpful nature of the AI, (b) do not consider the AI a genuine team member, and (c) do not trust the AI. The process of language adaptation appears to be automatic, despite doubts about the AI's competence. The presence of an AI assistant significantly impacts team collective attention by modulating various aspects of shared cognition. This study contributes to human-AI teaming research by highlighting collective attention as a central mechanism through which AI systems in team settings influence team performance. Understanding this mechanism will help CSCW researchers design AI systems that enhance team collective intelligence by optimizing collective attention.
- [19] arXiv:2407.17490 [pdf, html, other]
-
Title: AMEX: Android Multi-annotation Expo Dataset for Mobile GUI AgentsYuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng LiSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
AI agents have drawn increasing attention mostly on their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents. Their capabilities of completing complex tasks by directly interacting with the graphical user interface (GUI) on mobile devices are trained and evaluated with the proposed dataset. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels. Unlike existing mobile device-control datasets, e.g., MoTIF, AitW, etc., AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions, each averaging 13 steps with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we develop a baseline model SPHINX Agent and compare its performance across state-of-the-art agents trained on other datasets. To facilitate further research, we open-source our dataset, models, and relevant evaluation tools. The project is available at this https URL
- [20] arXiv:2407.17491 [pdf, html, other]
-
Title: Robust Adaptation of Foundation Models with Black-Box Visual PromptingComments: Extended work from the CVPR'23 paper: arXiv:2303.14773; This paper has been submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) for possible publicationSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
With the surge of large-scale pre-trained models (PTMs), adapting these models to numerous downstream tasks becomes a crucial problem. Consequently, parameter-efficient transfer learning (PETL) of large models has grasped huge attention. While PETL methods show impressive performance, they commonly rely on two optimistic assumptions: 1) the entire parameters of a PTM are available, and 2) a sufficiently large memory capacity is equipped for caching all the intermediate activations to compute gradients. However, in most real-world applications, PTMs are served as black-box APIs or proprietary software without explicit parameter accessibility. Besides, it is hard to meet a large memory requirement for modern PTMs. This work proposes black-box visual prompting (BlackVIP), which efficiently adapts the PTMs without knowledge about model architectures and parameters. BlackVIP has two components; 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent visual prompts, which allow the target PTM to adapt in the wild. SPSA-GC efficiently estimates the gradient of PTM to update the Coordinator. Besides, we propose a variant, BlackVIP-SE, which significantly reduces the runtime and computational cost of BlackVIP. Extensive experiments on 19 datasets demonstrate that BlackVIPs enable robust adaptation to diverse domains and tasks with minimal memory requirements. We further provide theoretical analysis on the generalization of visual prompting methods by presenting their connection to the certified robustness of randomized smoothing.
- [21] arXiv:2407.17493 [pdf, html, other]
-
Title: ReDiFine: Reusable Diffusion Finetuning for Mitigating Degradation in the Chain of DiffusionComments: 27 pageSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Diffusion models have achieved tremendous improvements in generative modeling for images, enabling high-quality generation that is indistinguishable by humans from real images. The qualities of images have reached a threshold at which we can reuse synthetic images for training machine learning models again. This attracts the area as it can relieve the high cost of data collection and fundamentally solve many problems in data-limited areas. In this paper, we focus on a practical scenario in which pretrained text-to-image diffusion models are iteratively finetuned using a set of synthetic images, which we call the Chain of Diffusion. Finetuned models generate images that are used for the next iteration of finetuning. We first demonstrate how these iterative processes result in severe degradation in image qualities. Thorough investigations reveal the most impactful factor for the degradation, and we propose finetuning and generation strategies that can effectively resolve the degradation. Our method, Reusable Diffusion Finetuning (ReDiFine), combines condition drop finetuning and CFG scheduling to maintain the qualities of generated images throughout iterations. ReDiFine works effectively for multiple datasets and models without further hyperparameter search, making synthetic images reusable to finetune future generative models.
- [22] arXiv:2407.17494 [pdf, other]
-
Title: AI in Remote Patient MonitoringComments: A chapter for an upcoming Springer book titled "Transformation in Health Care"Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
The rapid evolution of Artificial Intelligence (AI) has significantly transformed healthcare, particularly in the domain of Remote Patient Monitoring (RPM). This chapter explores the integration of AI in RPM, highlighting real-life applications, system architectures, and the benefits it brings to patient care and healthcare systems. Through a comprehensive analysis of current technologies, methodologies, and case studies, I present a detailed overview of how AI enhances monitoring accuracy, predictive analytics, and personalized treatment plans. The chapter also discusses the challenges and future directions in this field, providing a comprehensive view of AI's role in revolutionizing remote patient care.
- [23] arXiv:2407.17495 [pdf, other]
-
Title: The Human-GenAI Value Loop in Human-Centered Innovation: Beyond the Magical NarrativeCamille Grange (1), Theophile Demazure (1), Mickael Ringeval (1), Simon Bourdeau (2) ((1) HEC Montreal, (2) UQAM)Comments: 28 pages, 3 figuresSubjects: Human-Computer Interaction (cs.HC)
Organizations across various industries are still exploring the potential of Generative Artificial Intelligence (GenAI) to enhance knowledge work. While innovation is often viewed as a product of individual creativity, it more commonly unfolds through a highly structured, collaborative process where creativity intertwines with knowledge work. However, the extent and effectiveness of GenAI in supporting this process remain open questions. Our study investigates this issue using a collaborative practice research approach focused on three GenAI-enabled innovation projects conducted over a year within three different organizations. We explored how, why, and when GenAI could be integrated into design sprints, a highly structured, collaborative, and human-centered innovation method. Our research identified challenges and opportunities in synchronizing AI capabilities with human intelligence and creativity. To translate these insights into practical strategies, we propose four recommendations for organizations eager to leverage GenAI to both streamline and bring more value to their innovation processes: (1) establish a collaborative intelligence value loop with GenAI; (2) build trust in GenAI, (3) develop robust data collection and curation workflows, and (4) cultivate a craftsmanship mindset.
- [24] arXiv:2407.17496 [pdf, other]
-
Title: Accessibility evaluation of major assistive mobile applications available for the visually impairedComments: 13 pages, 12 tables, Published in ITU Journal on Future and Evolving Technologies, Volume 4, Issue 4, December 2023, this https URLJournal-ref: ITU Journal on Future and Evolving Technologies, 4(4), 631-643 (2023)Subjects: Human-Computer Interaction (cs.HC)
People with visual impairments face numerous challenges in their daily lives, including mobility, access to information, independent living, and employment. Artificial Intelligence (AI) with Computer Vision (CV) has the potential to improve their daily lives, provide them with necessary independence, and it will also spawn new opportunities in education and employment. However, while many such AI/CV-based mobile applications are now available, these apps are still not the preferred choice amongst visually impaired persons and are generally limited to advanced users only, due to certain limitations. This study evaluates the challenges faced by visually impaired persons when using AI/CV-based mobile apps. Four popular AI/CV- based apps, namely Seeing AI, Supersense, Envision and Lookout, are assessed by blind and low-vision users. Hence these mobile applications are evaluated on a set of parameters, including generic parameters based on the Web Content Accessibility Guidelines (WCAG) and specific parameters related to mobile app testing. The evaluation not only focused on the guidelines but also on the feedback that was gathered from these users on parameters covering the apps' accuracy, response time, reliability, accessibility, privacy, energy efficiency and usability. The paper also identifies the areas of improvement in the development and innovation of these assistive apps. This work will help developers create better accessible AI-based apps for the visually impaired.
- [25] arXiv:2407.17497 [pdf, html, other]
-
Title: A Distributed Edge FLISR Solution & Network Simulation Test PlatformComments: 11 pages, 16 figures, 2 tables, submitted to IEEE Transactions on Smart GridSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The energy sector is experiencing a paradigm shift with the swift adoption of distributed energy sources, renewables, electric vehicles, and an evolving consumer-utility relationship. This necessitates the strategic integration of advanced Information and Communication Technologies (ICT) and the Internet of Things (IoT) to address the emerging challenges. Grid resilience is paramount, as a dependable energy supply is the cornerstone of societal well-being and economic activity. The primary contribution of this research is to investigate the implementation of a novel grid resiliency strategy for the Irish context, employing Fault Location, Isolation and Service Restoration (FLISR) techniques in conjunction with Edge Computing. Through a comprehensive review of existing literature, original research activities, and meticulous data analysis, we aim to develop a solution that bolsters grid resilience and mitigates the impact of service disruptions for both consumers and utilities. Additionally, our work delves into the specific context of the Irish energy grid, including relevant policies and regulations, to ensure the proposed FLISR strategy is not only effective but also readily implementable.
- [26] arXiv:2407.17498 [pdf, other]
-
Title: Antenna Model for Safe Human Exposure in Future 6G Smartphones: A Network PerspectiveJournal-ref: IEEE Transactions on Green Communications and Networking, 2024Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
In this article we present the biological effect of antenna topology on a users body. At different values of exposed frequency, the absorbent nature varies in human body. One of the major factors to be taken into consideration for designing 6G mobile antenna is the biological effect and Electromagnetic Field Exposure (EMF).
- [27] arXiv:2407.17499 [pdf, html, other]
-
Title: Sky$^\epsilon$-Tree: Embracing the Batch Updates of B$^\epsilon$-trees through Access Port Parallelism on Skyrmion Racetrack MemorySubjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Owing to the characteristics of high density and unlimited write cycles, skyrmion racetrack memory (SK-RM) has demonstrated great potential as either the next-generation main memory or the last-level cache of processors with non-volatility. Nevertheless, the distinct skyrmion manipulations, such as injecting and shifting, demand a fundamental change in widely-used memory structures to avoid excessive energy and performance overhead. For instance, while B{\epsilon}-trees yield an excellent query and insert performance trade-off between B-trees and Log-Structured Merge (LSM)-trees, the applicability of deploying B{\epsilon}-trees onto SK-RM receives much less attention. In addition, even though optimizing designs have been proposed for B+-trees on SK-RM, those designs are not directly applicable to B{\epsilon}-trees owing to the batch update behaviors between tree nodes of B{\epsilon}-trees. Such an observation motivates us to propose the concept of Sky{\epsilon}-tree to effectively utilize the access port parallelism of SK-RM to embrace the excellent query and insert performance of B{\epsilon}-trees. Experimental results have shown promising improvements in access performance and energy conservation.
- [28] arXiv:2407.17501 [pdf, html, other]
-
Title: PatchEX: High-Quality Real-Time Temporal Supersampling through Patch-based Parallel ExtrapolationSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
High-refresh rate displays have become very popular in recent years due to the need for superior visual quality in gaming, professional displays and specialized applications like medical imaging. However, high-refresh rate displays alone do not guarantee a superior visual experience; the GPU needs to render frames at a matching rate. Otherwise, we observe disconcerting visual artifacts such as screen tearing and stuttering. Temporal supersampling is an effective technique to increase frame rates by predicting new frames from other rendered frames. There are two methods in this space: interpolation and extrapolation. Interpolation-based methods provide good image quality at the cost of a higher latency because they also require the next rendered frame. On the other hand, extrapolation methods are much faster at the cost of quality. This paper introduces PatchEX, a novel frame extrapolation method that aims to provide the quality of interpolation at the speed of extrapolation. It smartly partitions the extrapolation task into sub-tasks and executes them in parallel to improve both quality and latency. It then uses a patch-based inpainting method and a custom shadow prediction approach to fuse the generated sub-frames. This approach significantly reduces the overall latency while maintaining the quality of the output. Our results demonstrate that PatchEX achieves a 65.29% and 48.46% improvement in PSNR over the latest extrapolation methods ExtraNet and ExtraSS, respectively, while being 6x and 2x faster, respectively.
- [29] arXiv:2407.17502 [pdf, other]
-
Title: Meta-Reinforcement Learning for Universal Quadrupedal Locomotion ControlComments: The supplementary video is available at this https URLSubjects: Robotics (cs.RO)
This work presents a deep reinforcement learning-based approach to develop a policy for robot-agnostic locomotion control. Our method involves training an agent equipped with memory, implemented as a recurrent policy, on a diverse set of procedurally generated quadruped robots. We demonstrate that the policies trained by our framework transfer seamlessly to both simulated and real-world quadrupeds not encountered during training, maintaining high-quality motion across platforms. Through a series of simulation and hardware experiments, we highlight the critical role of the recurrent unit in enabling generalization, rapid adaptation to changes in the robot's dynamic properties, and sample efficiency.
- [30] arXiv:2407.17503 [pdf, html, other]
-
Title: Challenges and Considerations in Annotating Legal Data: A Comprehensive OverviewSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
The process of annotating data within the legal sector is filled with distinct challenges that differ from other fields, primarily due to the inherent complexities of legal language and documentation. The initial task usually involves selecting an appropriate raw dataset that captures the intricate aspects of legal texts. Following this, extracting text becomes a complicated task, as legal documents often have complex structures, footnotes, references, and unique terminology. The importance of data cleaning is magnified in this context, ensuring that redundant information is eliminated while maintaining crucial legal details and context. Creating comprehensive yet straightforward annotation guidelines is imperative, as these guidelines serve as the road map for maintaining uniformity and addressing the subtle nuances of legal terminology. Another critical aspect is the involvement of legal professionals in the annotation process. Their expertise is valuable in ensuring that the data not only remains contextually accurate but also adheres to prevailing legal standards and interpretations. This paper provides an expanded view of these challenges and aims to offer a foundational understanding and guidance for researchers and professionals engaged in legal data annotation projects. In addition, we provide links to our created and fine-tuned datasets and language models. These resources are outcomes of our discussed projects and solutions to challenges faced while working on them.
- [31] arXiv:2407.17508 [pdf, html, other]
-
Title: Artificial Intelligence Based Navigation in Quasi Structured EnvironmentComments: 10 pages, 8 figuresSubjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
The proper planning of different types of public transportation such as metro, highway, waterways, and so on, can increase the efficiency, reduce the congestion and improve the safety of the country. There are certain challenges associated with route planning, such as high cost of implementation, need for adequate resource & infrastructure and resistance to change. The goal of this research is to examine the working, applications, complexity factors, advantages & disadvantages of Floyd- Warshall, Bellman-Ford, Johnson, Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), & Grey Wolf Optimizer (GWO), to find the best choice for the above application. In this paper, comparative analysis of above-mentioned algorithms is presented. The Floyd-Warshall method and ACO algorithm are chosen based on the comparisons. Also, a combination of modified Floyd-Warshall with ACO algorithm is proposed. The proposed algorithm showed better results with less time complexity, when applied on randomly structured points within a boundary called quasi-structured points. In addition, this paper also discusses the future works of integrating Floyd-Warshall with ACO to develop a real-time model for overcoming above mentioned-challenges during transportation route planning.
- [32] arXiv:2407.17509 [pdf, other]
-
Title: Unveiling Legitimacy in the unexpected events context : An Inquiry into Information System Consultancy companies and international organizations through Topic Modeling AnalysisOussama Abidi (AMU ECO, CERGAM)Comments: Includes translation of the abstract in French language. 29e Conf{é}rence de l'Association Information et Management 27-29 mai 2024 {à} Montpellier - La Grande-Motte, May 2024, Montpellier, FranceSubjects: Computers and Society (cs.CY); Information Retrieval (cs.IR)
In an increasingly dynamic and modern market, the recurrence of unexpected events necessitates proactive responses from information system (IS) stakeholders. Each IS actor strives to legitimize its actions and communicate its strategy. This study delves into the realm of IS legitimation, focusing on the communication of two key stakeholders: IS consultancy companies and international organizations, particularly in the context of unexpected events. To achieve this objective, we examined a diverse array of publications released by both actors. Employing a topic modeling methodology, we analyzed these documents to extract valuable insights regarding their methods of legitimation. Through this research, we aim to contribute to the legitimation discourse literature by offering an exploration of two key IS stakeholders responding to the challenges posed by unexpected events.
- [33] arXiv:2407.17511 [pdf, other]
-
Title: Thermal Radiation (TR) mode: A Deployment Perspective for 5G NRJournal-ref: IEEE Potentials, 2023Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
The 5G New Radio NR technology is under standardization process by 3GPP to provide outline for a new radio interface for the next generation of cellular networks. The aim of the 5G networks include not only to provide enhanced capacity coverage but also support advanced services such as enhanced mobile broadband (eMBB) Ultra-Reliable Low Latency Communication URLLC massive Machine Type Communication mMTC. Key features of NR include Ultra lean carrier design to minimize the power consumption by limiting the always-on signal transmissions and to reduce interference in the neighboring cells . Another feature is the use of massive number of antennas for transmission as well as reception of signals. This rise in the number of antennas to provide a greater coverage brings about various challenges and impact in the system. With the increase in investigations in the mmWave frequencies, there is a need to investigate the health hazards they have on human body and the environment at large. This paper intends to provide an insight into the harmful impacts of Radio Frequency RF fields. The radiation metric to study the RF impact for far field is power density and for near field is Specific Absorption Rate SAR. These are the two main EM radiation metrics to find out the exposure due to uplink and downlink phenomenon in mobile communications. Mobile communication systems are addressed particularly to discuss the Electromagnetic EM Radiation impact as smart phones are used in close proximity to the body. A proposal in the form of Thermal Radiation TR mode is given to reduce the radiations emitted from a mobile phone. The performance of the proposed mode is validated from the results by achieving reduced power density, complexity and exposure ratio.
- [34] arXiv:2407.17512 [pdf, other]
-
Title: Green and Safe 6G Wireless Networks: A Hybrid ApproachJournal-ref: IEEE Transactions on Green Communications and Networking 2024Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
With the wireless internet access being increasingly popular with services such as HD video streaming and so on, the demand for high data consuming applications is also rising. This increment in demand is coupled with a proportional rise in the power consumption. It is required that the internet traffic is offloaded to technologies that serve the users and contribute in energy consumption. There is a need to decrease the carbon footprint in the atmosphere and also make the network safe and reliable. In this article we propose a hybrid system of RF (Radio Frequency) and VLC (Visible Light Communication) for indoor communication that can provide communication along with illumination with least power consumption. The hybrid network is viable as it utilizes power with respect to the user demand and maintains the required Quality of ServiceQoS and Quality of Experience QoE for a particular application in use. This scheme aims for Green Communication and reduction in Electromagnetic EM Radiation. A comparative analysis for RF communication, Hybrid RF+ VLC and pure VLC is made and simulations are carried out using Python, Scilab and MathWorks tool. The proposal achieves high energy efficiency of about 37% low Specific Absorption Rate (SAR) lower incident and absorbed power density complexity and temperature elevation in human body tissues exposed to the radiation. It also enhances the battery lifetime of the mobile device in use by increasing the lifetime approximately by 7 hours as validated from the obtained results. Thus the overall network reliability and safety factor is enhanced with the proposed approach.
- [35] arXiv:2407.17513 [pdf, html, other]
-
Title: Graph Linear Canonical Transform Based on CM-CC-CM DecompositionSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The graph linear canonical transform (GLCT) is presented as an extension of the graph Fourier transform (GFT) and the graph fractional Fourier transform (GFrFT), offering more flexibility as an effective tool for graph signal processing. In this paper, we introduce a GLCT based on chirp multiplication-chirp convolution-chirp multiplication decomposition (CM-CC-CM-GLCT), which irrelevant to sampling periods and without oversampling operation. Various properties and special cases of the CM-CC-CM-GLCT are derived and discussed. In terms of computational complexity, additivity, and reversibility, we compare the CM-CC-CM-GLCT and the GLCT based on the central discrete dilated Hermite function (CDDHFs-GLCT). Theoretical analysis demonstrates that the computational complexity of the CM-CC-CM-GLCT is significantly reduced. Simulation results indicate that the CM-CC-CM-GLCT achieves similar additivity to the CDDHFs-GLCT. Notably, the CM-CC-CM-GLCT exhibits better reversibility.
- [36] arXiv:2407.17515 [pdf, html, other]
-
Title: Quality Diversity for Robot Learning: Limitations and Future DirectionsComments: Accepted to GECCO 2024Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Quality Diversity (QD) has shown great success in discovering high-performing, diverse policies for robot skill learning. While current benchmarks have led to the development of powerful QD methods, we argue that new paradigms must be developed to facilitate open-ended search and generalizability. In particular, many methods focus on learning diverse agents that each move to a different xy position in MAP-Elites-style bounded archives. Here, we show that such tasks can be accomplished with a single, goal-conditioned policy paired with a classical planner, achieving O(1) space complexity w.r.t. the number of policies and generalization to task variants. We hypothesize that this approach is successful because it extracts task-invariant structural knowledge by modeling a relational graph between adjacent cells in the archive. We motivate this view with emerging evidence from computational neuroscience and explore connections between QD and models of cognitive maps in human and other animal brains. We conclude with a discussion exploring the relationships between QD and cognitive maps, and propose future research directions inspired by cognitive maps towards future generalizable algorithms capable of truly open-ended search.
- [37] arXiv:2407.17516 [pdf, html, other]
-
Title: Amplifying the Kinematics of Origami Mechanisms With Spring JointsComments: 14 pages, 10 figuresSubjects: Robotics (cs.RO); Algebraic Geometry (math.AG)
Due to its rigid foldability and predictable kinematics, the reverse fold is the fundamental mechanism behind some of the most well known origami kinematic structures, including the Miura Ori, Yoshimura, and waterbomb patterns. However, the reverse fold only has one parameter to control its behavior: the starting fold angle. In this paper I introduce an alternative to the traditional reverse fold, based on the spring into action pattern, called the spring joint. This novel rigidly foldable mechanism is able to couple multiple reverse folds into a compact space to amplify the kinematic output of a traditional reverse fold by up to ten times, and to add one parameter for each reverse fold, giving more programmatic control of origami structures. Methods of parameterizing both the starting angle, the path of travel, and the axis of motion are also introduced. Unfortunately, this versatility comes at the cost of a large buildup of layers, making the spring joint impractical for thick origami mechanisms. To solve this problem, I also introduce a modular alternative to the spring joint that has no additional layers, with the same kinematic properties. Both of these mechanisms are tested as replacements for the reverse fold in both traditional and custom origami structures.
- [38] arXiv:2407.17518 [pdf, html, other]
-
Title: Driving pattern interpretation based on action phases clusteringSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Applications (stat.AP); Machine Learning (stat.ML)
Current approaches to identifying driving heterogeneity face challenges in comprehending fundamental patterns from the perspective of underlying driving behavior mechanisms. The concept of Action phases was proposed in our previous work, capturing the diversity of driving characteristics with physical meanings. This study presents a novel framework to further interpret driving patterns by classifying Action phases in an unsupervised manner. In this framework, a Resampling and Downsampling Method (RDM) is first applied to standardize the length of Action phases. Then the clustering calibration procedure including ''Feature Selection'', ''Clustering Analysis'', ''Difference/Similarity Evaluation'', and ''Action phases Re-extraction'' is iteratively applied until all differences among clusters and similarities within clusters reach the pre-determined criteria. Application of the framework using real-world datasets revealed six driving patterns in the I80 dataset, labeled as ''Catch up'', ''Keep away'', and ''Maintain distance'', with both ''Stable'' and ''Unstable'' states. Notably, Unstable patterns are more numerous than Stable ones. ''Maintain distance'' is the most common among Stable patterns. These observations align with the dynamic nature of driving. Two patterns ''Stable keep away'' and ''Unstable catch up'' are missing in the US101 dataset, which is in line with our expectations as this dataset was previously shown to have less heterogeneity. This demonstrates the potential of driving patterns in describing driving heterogeneity. The proposed framework promises advantages in addressing label scarcity in supervised learning and enhancing tasks such as driving behavior modeling and driving trajectory prediction.
- [39] arXiv:2407.17521 [pdf, html, other]
-
Title: CORT: Class-Oriented Real-time Tracking for Embedded SystemsSubjects: Computer Vision and Pattern Recognition (cs.CV)
The ever-increasing use of artificial intelligence in autonomous systems has significantly contributed to advance the research on multi-object tracking, adopted in several real-time applications (e.g., autonomous driving, surveillance drones, robotics) to localize and follow the trajectory of multiple objects moving in front of a camera. Current tracking algorithms can be divided into two main categories: some approaches introduce complex heuristics and re-identification models to improve the tracking accuracy and reduce the number of identification switches, without particular attention to the timing performance, whereas other approaches are aimed at reducing response times by removing the re-identification phase, thus penalizing the tracking accuracy. This work proposes a new approach to multi-class object tracking that allows achieving smaller and more predictable execution times, without penalizing the tracking performance. The idea is to reduce the problem of matching predictions with detections into smaller sub-problems by splitting the Hungarian matrix by class and invoking the second re-identification stage only when strictly necessary for a smaller number of elements. The proposed solution was evaluated in complex urban scenarios with several objects of different types (as cars, trucks, bikes, and pedestrians), showing the effectiveness of the multi-class approach with respect to state of the art trackers.
- [40] arXiv:2407.17522 [pdf, html, other]
-
Title: Mapping the Technological Future: A Topic, Sentiment, and Emotion Analysis in Social Media DiscourseSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Applications (stat.AP)
People worldwide are currently confronted with a number of technological challenges, which act as a potent source of uncertainty. The uncertainty arising from the volatility and unpredictability of technology (such as AI) and its potential consequences is widely discussed on social media. This study uses BERTopic modelling along with sentiment and emotion analysis on 1.5 million tweets from 2021 to 2023 to identify anticipated tech-driven futures and capture the emotions communicated by 400 key opinion leaders (KOLs). Findings indicate positive sentiment significantly outweighs negative, with a prevailing dominance of positive anticipatory emotions. Specifically, the 'Hope' score is approximately 10.33\% higher than the median 'Anxiety' score. KOLs emphasize 'Optimism' and benefits over 'Pessimism' and challenges. The study emphasizes the important role KOLs play in shaping future visions through anticipatory discourse and emotional tone during times of technological uncertainty.
- [41] arXiv:2407.17524 [pdf, html, other]
-
Title: StreamTinyNet: video streaming analysis with spatial-temporal TinyMLComments: this paper has been accepted and presented at the WCCI24 conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Tiny Machine Learning (TinyML) is a branch of Machine Learning (ML) that constitutes a bridge between the ML world and the embedded system ecosystem (i.e., Internet of Things devices, embedded devices, and edge computing units), enabling the execution of ML algorithms on devices constrained in terms of memory, computational capabilities, and power consumption. Video Streaming Analysis (VSA), one of the most interesting tasks of TinyML, consists in scanning a sequence of frames in a streaming manner, with the goal of identifying interesting patterns. Given the strict constraints of these tiny devices, all the current solutions rely on performing a frame-by-frame analysis, hence not exploiting the temporal component in the stream of data. In this paper, we present StreamTinyNet, the first TinyML architecture to perform multiple-frame VSA, enabling a variety of use cases that requires spatial-temporal analysis that were previously impossible to be carried out at a TinyML level. Experimental results on public-available datasets show the effectiveness and efficiency of the proposed solution. Finally, StreamTinyNet has been ported and tested on the Arduino Nicla Vision, showing the feasibility of what proposed.
- [42] arXiv:2407.17530 [pdf, html, other]
-
Title: Learning Instance-Specific Parameters of Black-Box Models Using Differentiable SurrogatesComments: 10 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Tuning parameters of a non-differentiable or black-box compute is challenging. Existing methods rely mostly on random sampling or grid sampling from the parameter space. Further, with all the current methods, it is not possible to supply any input specific parameters to the black-box. To the best of our knowledge, for the first time, we are able to learn input-specific parameters for a black box in this work. As a test application we choose a popular image denoising method BM3D as our black-box compute. Then, we use a differentiable surrogate model (a neural network) to approximate the black-box behaviour. Next, another neural network is used in an end-to-end fashion to learn input instance-specific parameters for the black-box. Drawing inspiration from the work of Tseng et al. [1] , we applied our method to the Smartphone Image Denoising Dataset (SIDD) for image denoising. The results are compelling, demonstrating a significant increase in PSNR and a notable improvement in SSIM nearing 0.93. Experimental results underscore the effectiveness of our approach in achieving substantial improvements in both model performance and optimization efficiency. For code and implementation details, please refer to our GitHub repository.
[1] Ethan Tseng, Felix Yu, Yuting Yang, Fahim Mannan, Karl St. Arnaud, Derek Nowrouzezahrai, Jean-Francois Lalonde, and Felix Heide. Hyperparameter optimization in black-box image processing using differentiable proxies. ACM Transactions on Graphics (TOG), 38(4), 7 2019. - [43] arXiv:2407.17532 [pdf, html, other]
-
Title: Generative artificial intelligence in dentistry: Current approaches and future challengesSubjects: Computation and Language (cs.CL)
Artificial intelligence (AI) has become a commodity for people because of the advent of generative AI (GenAI) models that bridge the usability gap of AI by providing a natural language interface to interact with complex models. These GenAI models range from text generation - such as two-way chat systems - to the generation of image or video from textual descriptions input by a user. These advancements in AI have impacted Dentistry in multiple aspects. In dental education, the student now has the opportunity to solve a plethora of questions by only prompting a GenAI model and have the answer in a matter of seconds. GenAI models can help us deliver better patient healthcare by helping practitioners gather knowledge quickly and efficiently. Finally, GenAI can also be used in dental research, where the applications range from new drug discovery to assistance in academic writing. In this review, we first define GenAI models and describe their multiple generation modalities; then, we explain and discuss their current and potential applications in Dentistry; and finally, we describe the challenges these new technologies impose in our area.
- [44] arXiv:2407.17533 [pdf, html, other]
-
Title: SFPrompt: Communication-Efficient Split Federated Fine-Tuning for Large Pre-Trained Models over Resource-Limited DevicesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Large pre-trained models have exhibited remarkable achievements across various domains. The substantial training costs associated with these models have led to wide studies of fine-tuning for effectively harnessing their capabilities in solving downstream tasks. Yet, conventional fine-tuning approaches become infeasible when the model lacks access to downstream data due to privacy concerns. Naively integrating fine-tuning approaches with the emerging federated learning frameworks incurs substantial communication overhead and exerts high demand on local computing resources, making it impractical for common resource-limited devices. In this paper, we introduce SFPrompt, an innovative privacy-preserving fine-tuning method tailored for the federated setting where direct uploading of raw data is prohibited and local devices are resource-constrained to run a complete pre-trained model. In essence, SFPrompt judiciously combines split learning with federated learning to handle these challenges. Specifically, the pre-trained model is first partitioned into client and server components, thereby streamlining the client-side model and substantially alleviating computational demands on local resources. SFPrompt then introduces soft prompts into the federated model to enhance the fine-tuning performance. To further reduce communication costs, a novel dataset pruning algorithm and a local-loss update strategy are devised during the fine-tuning process. Extensive experiments demonstrate that SFPrompt delivers competitive performance as the federated full fine-tuning approach while consuming a mere 0.46% of local computing resources and incurring 53% less communication cost.
- [45] arXiv:2407.17535 [pdf, other]
-
Title: LAMBDA: A Large Model Based Data AgentComments: 30 pages, 21 figures and 5 tablesSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
We introduce ``LAMBDA," a novel open-source, code-free multi-agent data analysis system that that harnesses the power of large models. LAMBDA is designed to address data analysis challenges in complex data-driven applications through the use of innovatively designed data agents that operate iteratively and generatively using natural language. At the core of LAMBDA are two key agent roles: the programmer and the inspector, which are engineered to work together seamlessly. Specifically, the programmer generates code based on the user's instructions and domain-specific knowledge, enhanced by advanced models. Meanwhile, the inspector debugs the code when necessary. To ensure robustness and handle adverse scenarios, LAMBDA features a user interface that allows direct user intervention in the operational loop. Additionally, LAMBDA can flexibly integrate external models and algorithms through our knowledge integration mechanism, catering to the needs of customized data analysis. LAMBDA has demonstrated strong performance on various machine learning datasets. It has the potential to enhance data science practice and analysis paradigm by seamlessly integrating human and artificial intelligence, making it more accessible, effective, and efficient for individuals from diverse backgrounds. The strong performance of LAMBDA in solving data science problems is demonstrated in several case studies, which are presented at \url{this https URL}.
- [46] arXiv:2407.17536 [pdf, other]
-
Title: Improved symbolic drum style classification with grammar-based hierarchical representationsLéo Géré (CNAM Paris, CEDRIC - VERTIGO), Philippe Rigaux (CEDRIC - VERTIGO, CNAM Paris), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, IGN, LaSTIG)Comments: International Society for Music Information Retrieval Conference 2024, Nov 2024, San Francisco, United StatesSubjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Deep learning models have become a critical tool for analysis and classification of musical data. These models operate either on the audio signal, e.g. waveform or spectrogram, or on a symbolic representation, such as MIDI. In the latter, musical information is often reduced to basic features, i.e. durations, pitches and velocities. Most existing works then rely on generic tokenization strategies from classical natural language processing, or matrix representations, e.g. piano roll. In this work, we evaluate how enriched representations of symbolic data can impact deep models, i.e. Transformers and RNN, for music style classification. In particular, we examine representations that explicitly incorporate musical information implicitly present in MIDI-like encodings, such as rhythmic organization, and show that they outperform generic tokenization strategies. We introduce a new tree-based representation of MIDI data built upon a context-free musical grammar. We show that this grammar representation accurately encodes high-level rhythmic information and outperforms existing encodings on the GrooveMIDI Dataset for drumming style classification, while being more compact and parameter-efficient.
- [47] arXiv:2407.17537 [pdf, html, other]
-
Title: A process algebraic framework for multi-agent dynamic epistemic systemsSubjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
This paper combines the classical model of labeled transition systems with the epistemic model for reasoning about knowledge. The result is a unifying framework for modeling and analyzing multi-agent, knowledge-based, dynamic systems. On the modeling side, we propose a process algebraic, agent-oriented specification language that makes such a framework easy to use for practical purposes. On the verification side, we define a modal logic encompassing temporal and epistemic operators.
- [48] arXiv:2407.17539 [pdf, html, other]
-
Title: Automated transport separation using the neural shifted proper orthogonal decompositionComments: Proceedings not peer-reviewed yet. Code available: this https URLSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
This paper presents a neural network-based methodology for the decomposition of transport-dominated fields using the shifted proper orthogonal decomposition (sPOD). Classical sPOD methods typically require an a priori knowledge of the transport operators to determine the co-moving fields. However, in many real-life problems, such knowledge is difficult or even impossible to obtain, limiting the applicability and benefits of the sPOD. To address this issue, our approach estimates both the transport and co-moving fields simultaneously using neural networks. This is achieved by training two sub-networks dedicated to learning the transports and the co-moving fields, respectively. Applications to synthetic data and a wildland fire model illustrate the capabilities and efficiency of this neural sPOD approach, demonstrating its ability to separate the different fields effectively.
- [49] arXiv:2407.17543 [pdf, html, other]
-
Title: Dataset Distribution Impacts Model Fairness: Single vs. Multi-Task LearningComments: Submitted to MICCAI 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The influence of bias in datasets on the fairness of model predictions is a topic of ongoing research in various fields. We evaluate the performance of skin lesion classification using ResNet-based CNNs, focusing on patient sex variations in training data and three different learning strategies. We present a linear programming method for generating datasets with varying patient sex and class labels, taking into account the correlations between these variables. We evaluated the model performance using three different learning strategies: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our observations include: 1) sex-specific training data yields better results, 2) single-task models exhibit sex bias, 3) the reinforcement approach does not remove sex bias, 4) the adversarial model eliminates sex bias in cases involving only female patients, and 5) datasets that include male patients enhance model performance for the male subgroup, even when female patients are the majority. To generalise these findings, in future research, we will examine more demographic attributes, like age, and other possibly confounding factors, such as skin colour and artefacts in the skin lesions. We make all data and models available on GitHub.
- [50] arXiv:2407.17544 [pdf, html, other]
-
Title: MathViz-E: A Case-study in Domain-Specialized Tool-Using AgentsSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
There has been significant recent interest in harnessing LLMs to control software systems through multi-step reasoning, planning and tool-usage. While some promising results have been obtained, application to specific domains raises several general issues including the control of specialized domain tools, the lack of existing datasets for training and evaluation, and the non-triviality of automated system evaluation and improvement. In this paper, we present a case-study where we examine these issues in the context of a specific domain. Specifically, we present an automated math visualizer and solver system for mathematical pedagogy. The system orchestrates mathematical solvers and math graphing tools to produce accurate visualizations from simple natural language commands. We describe the creation of specialized data-sets, and also develop an auto-evaluator to easily evaluate the outputs of our system by comparing them to ground-truth expressions. We have open sourced the data-sets and code for the proposed system.
- [51] arXiv:2407.17545 [pdf, html, other]
-
Title: Large Language Models for Anomaly Detection in Computational Workflows: from Supervised Fine-Tuning to In-Context LearningHongwei Jin, George Papadimitriou, Krishnan Raghavan, Pawel Zuk, Prasanna Balaprakash, Cong Wang, Anirban Mandal, Ewa DeelmanComments: 12 pages, 14 figures, paper is accepted by SC'24, source code, see: this https URLSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Anomaly detection in computational workflows is critical for ensuring system reliability and security. However, traditional rule-based methods struggle to detect novel anomalies. This paper leverages large language models (LLMs) for workflow anomaly detection by exploiting their ability to learn complex data patterns. Two approaches are investigated: 1) supervised fine-tuning (SFT), where pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies, and 2) in-context learning (ICL) where prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning. The paper evaluates the performance, efficiency, generalization of SFT models, and explores zero-shot and few-shot ICL prompts and interpretability enhancement via chain-of-thought prompting. Experiments across multiple workflow datasets demonstrate the promising potential of LLMs for effective anomaly detection in complex executions.
- [52] arXiv:2407.17546 [pdf, html, other]
-
Title: Exploring Domain Robust Lightweight Reward Models based on Router MechanismComments: This paper is accepted for ACL 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Recent advancements in large language models have heavily relied on the large reward model from reinforcement learning from human feedback for fine-tuning. However, the use of a single reward model across various domains may not always be optimal, often requiring retraining from scratch when new domain data is introduced. To address these challenges, we explore the utilization of small language models operating in a domain-specific manner based on router mechanisms. Our three approaches are: 1) utilize mixture of experts to form a single reward model by modularizing an internal router and experts, 2) employing external router to select the appropriate reward model from multiple domain-specific models, and 3) the framework reduces parameter size by loading reward models and router adapters onto a single small language model using adapters. Experimental validation underscores the effectiveness of our approach, demonstrating performance comparable to baseline methods while also reducing the total parameter size.
- [53] arXiv:2407.17569 [pdf, html, other]
-
Title: On Approximately Strategy-Proof Tournament Rules for Collusions of Size at Least ThreeSubjects: Computer Science and Game Theory (cs.GT)
A tournament organizer must select one of $n$ possible teams as the winner of a competition after observing all $\binom{n}{2}$ matches between them. The organizer would like to find a tournament rule that simultaneously satisfies the following desiderata. It must be Condorcet-consistent (henceforth, CC), meaning it selects as the winner the unique team that beats all other teams (if one exists). It must also be strongly non-manipulable for groups of size $k$ at probability $\alpha$ (henceforth, k-SNM-$\alpha$), meaning that no subset of $\leq k$ teams can fix the matches among themselves in order to increase the chances any of it's members being selected by more than $\alpha$. Our contributions are threefold. First, wee consider a natural generalization of the Randomized Single Elimination Bracket rule from [Schneider et al. 2017] to $d$-ary trees and provide upper bounds to its manipulability. Then, we propose a novel tournament rule that is CC and 3-SNM-1/2, a strict improvement upon the concurrent work of [Dinev and Weinberg, 2022] who proposed a CC and 3-SNM-31/60 rule. Finally, we initiate the study of reductions among tournament rules.
- [54] arXiv:2407.17571 [pdf, html, other]
-
Title: Diffusion Models for Multi-Task Generative ModelingChangyou Chen, Han Ding, Bunyamin Sisman, Yi Xu, Ouye Xie, Benjamin Z. Yao, Son Dinh Tran, Belinda ZengComments: Published as a conference paper at ICLR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to a single-generation modeling. Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by an information aggregation from multiple types of task-data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling, which we believe is an important research direction worthy of more future explorations.
- [55] arXiv:2407.17572 [pdf, html, other]
-
Title: CityX: Controllable Procedural Content Generation for Unbounded 3D CitiesShougao Zhang, Mengqi Zhou, Yuxi Wang, Chuanchen Luo, Rongyu Wang, Yiwei Li, Xucheng Yin, Zhaoxiang Zhang, Junran PengComments: 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Generating a realistic, large-scale 3D virtual city remains a complex challenge due to the involvement of numerous 3D assets, various city styles, and strict layout constraints. Existing approaches provide promising attempts at procedural content generation to create large-scale scenes using Blender agents. However, they face crucial issues such as difficulties in scaling up generation capability and achieving fine-grained control at the semantic layout level. To address these problems, we propose a novel multi-modal controllable procedural content generation method, named CityX, which enhances realistic, unbounded 3D city generation guided by multiple layout conditions, including OSM, semantic maps, and satellite images. Specifically, the proposed method contains a general protocol for integrating various PCG plugins and a multi-agent framework for transforming instructions into executable Blender actions. Through this effective framework, CityX shows the potential to build an innovative ecosystem for 3D scene generation by bridging the gap between the quality of generated assets and industrial requirements. Extensive experiments have demonstrated the effectiveness of our method in creating high-quality, diverse, and unbounded cities guided by multi-modal conditions. Our project page: this https URL.
- [56] arXiv:2407.17576 [pdf, html, other]
-
Title: Time-Shifted Alternating Gelfand-Pinsker Coding for Broadcast ChannelsComments: 6 pages, 3 figures; as presented at ISIT 2024Subjects: Information Theory (cs.IT)
A coding scheme for broadcast channels (BCs) is proposed that shifts the users' code blocks by different amounts of time and applies alternating Gelfand-Pinsker encoding. The scheme achieves all rate tuples in Marton's region for two receiver BCs without time-sharing or rate-splitting. Simulations with short polar codes show that the method reduces the gap to capacity as compared to time-sharing.
- [57] arXiv:2407.17579 [pdf, html, other]
-
Title: Envisioning New Futures of Positive Social Technology: Beyond Paradigms of Fixing, Protecting, and PreventingJaeWon Kim, Lindsay Popowski, Anna Fang, Cassidy Pyle, Guo Freeman, Ryan M. Kelly, Angela Y. Lee, Fannie Liu, Angela D. R. Smith, Alexandra To, Amy X. ZhangSubjects: Human-Computer Interaction (cs.HC)
Social technology research today largely focuses on mitigating the negative impacts of technology and, therefore, often misses the potential of technology to enhance human connections and well-being. However, we see a potential to shift towards a holistic view of social technology's impact on human flourishing. We introduce Positive Social Technology (Positech), a framework that shifts emphasis toward leveraging social technologies to support and augment human flourishing. This workshop is organized around three themes relevant to Positech: 1) "Exploring Relevant and Adjacent Research" to define and widen the Positech scope with insights from related fields, 2) "Projecting the Landscape of Positech" for participants to outline the domain's key aspects and 3) "Envisioning the Future of Positech," anchored around strategic planning towards a sustainable research community. Ultimately, this workshop will serve as a platform to shift the narrative of social technology research towards a more positive, human-centric approach. It will foster research that goes beyond fixing technologies to protect humans from harm, to also pursue enriching human experiences and connections through technology.
- [58] arXiv:2407.17582 [pdf, html, other]
-
Title: Authenticated partial correction over AV-MACs: toward characterization and codingComments: 8 pages, submitted for publicationSubjects: Information Theory (cs.IT); Combinatorics (math.CO)
In this paper we study $\gamma$ partial correction over a $t$-user arbitrarily varying multiple-access channel (AV-MAC). We first present necessary channel conditions for the $\gamma$ partially correcting authentication capacity region to have nonempty interior. We then give a block length extension scheme which preserves positive rate tuples from a short code with zero probability of $\gamma$ partial correction error, noting that the flexibility of $\gamma$ partial correction prevents pure codeword concatenation from being successful. Finally, we offer a case study of a particular AV-MAC satisfying the necessary conditions for partial correction.
- [59] arXiv:2407.17584 [pdf, other]
-
Title: Theorizing neuro-induced relationships between cognitive diversity, motivation, grit and academic performance in multidisciplinary engineering education contextComments: 10 pages, 2 figuresSubjects: Computers and Society (cs.CY)
Nowadays, engineers need to tackle many unprecedented challenges that are often complex, and, most importantly, cannot be exhaustively compartmentalized into a single engineering discipline. In other words, most engineering problems need to be solved from a multidisciplinary approach. However, conventional engineering programs usually adopt pedagogical approaches specifically tailored to traditional, niched engineering disciplines, which become increasingly deviated from the industry needs as those programs are typically designed and taught by instructors with highly specialized engineering training and credentials. To reduce the gap, more multidisciplinary engineering programs emerge by systematically stretching across all engineering fibers, and challenge the sub-optimal traditional pedagogy crowded in engineering classrooms. To further advance future-oriented pedagogy, in this work, we hypothesized neuro-induced linkages between how cognitively different learners are and how the linkages would affect learners in the knowledge acquisition process. We situate the neuro-induced linkages in the context of multidisciplinary engineering education and propose possible pedagogical approaches to actualize the implications of this conceptual framework. Our study, based on the innovative concept of brain fingerprint, would serve as a pioneer model to theorize key components of learner-centered multidisciplinary engineering pedagogy which centers on the key question: how do we motivate engineering students of different backgrounds from a neuro-inspired perspective?
- [60] arXiv:2407.17585 [pdf, other]
-
Title: Quelle {\'e}thique pour quelle IA ?David Doat (ETHICS EA 7446)Comments: in French language. Workshop Ethique et Morale de la Chaire IA Responsable, Nathalie Nevejans, May 2021, Distanciel, FranceSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
This study proposes an analysis of the different types of ethical approaches involved in the ethics of AI, and situates their interests and limits. First, the author introduces to the contemporary need for and meaning of ethics. He distinguishes it from other registers of normativities and underlines its inadequacy to formalization. He then presents a cartography of the landscape of ethical theories covered by moral philosophy, taking care to distinguish meta-ethics, normative ethics and applied ethics. In drawing up this overview, the author questions the relationship between ethics and artificial intelligence. The analysis focuses in particular on the main ethical currents that have imposed themselves in the ways of doing digital ethics and AI in our Western democracies. The author asks whether these practices of ethics, as they seem to crystallize today in a precise pattern, constitute a sufficient and sufficiently satisfactory response to our needs for ethics in AI. The study concludes with a reflection on the reasons why a human ethics of AI based on a pragmatic practice of contextual ethics remains necessary and irreducible to any formalization or automated treatment of the ethical questions that arise for humans.
- [61] arXiv:2407.17586 [pdf, other]
-
Title: Big5PersonalityEssays: Introducing a Novel Synthetic Generated Dataset Consisting of Short State-of-Consciousness Essays Annotated Based on the Five Factor Model of PersonalityComments: 13 pages, 1 table, 2 figuresSubjects: Other Computer Science (cs.OH); Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Given the high advances of large language models (LLM) it is of vital importance to study their behaviors and apply their utility in all kinds of scientific fields. Psychology has been, in recent years, poorly approached using novel computational tools. One of the reasons is the high complexity of the data required for a proper analysis. Moreover, psychology, with a focus on psychometry, has few datasets available for analysis and artificial intelligence usage. Because of these facts, this study introduces a synthethic database of short essays labeled based on the five factor model (FFM) of personality traits.
- [62] arXiv:2407.17587 [pdf, html, other]
-
Title: S-E Pipeline: A Vision Transformer (ViT) based Resilient Classification Pipeline for Medical Imaging Against Adversarial AttacksSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Vision Transformer (ViT) is becoming widely popular in automating accurate disease diagnosis in medical imaging owing to its robust self-attention mechanism. However, ViTs remain vulnerable to adversarial attacks that may thwart the diagnosis process by leading it to intentional misclassification of critical disease. In this paper, we propose a novel image classification pipeline, namely, S-E Pipeline, that performs multiple pre-processing steps that allow ViT to be trained on critical features so as to reduce the impact of input perturbations by adversaries. Our method uses a combination of segmentation and image enhancement techniques such as Contrast Limited Adaptive Histogram Equalization (CLAHE), Unsharp Masking (UM), and High-Frequency Emphasis filtering (HFE) as preprocessing steps to identify critical features that remain intact even after adversarial perturbations. The experimental study demonstrates that our novel pipeline helps in reducing the effect of adversarial attacks by 72.22% for the ViT-b32 model and 86.58% for the ViT-l32 model. Furthermore, we have shown an end-to-end deployment of our proposed method on the NVIDIA Jetson Orin Nano board to demonstrate its practical use case in modern hand-held devices that are usually resource-constrained.
- [63] arXiv:2407.17588 [pdf, other]
-
Title: Development of Autonomous Artificial Intelligence Systems for Corporate ManagementComments: in Russian languageJournal-ref: Artificial societies. 2023. V. 18. Issue 2Subjects: Computers and Society (cs.CY)
The article discusses development of autonomous artificial intelligence systems for corporate management. The function of a corporate director is still one of the few that are legislated for execution by a "natural" rather than an "artificial" person. The main prerequisites for development of systems for full automation of management decisions made at the level of a board of directors are formed in the field of corporate law, machine learning, and compliance with the rules of non-discrimination, transparency, and accountability of decisions made and algorithms applied. The basic methodological approaches in terms of corporate law for the "autonomous director" have already been developed and do not get rejection among representatives of the legal sciences. However, there is an undeniable need for further extensive research in order to amend corporate law to effectively introduce "autonomous directors". In practice, there are two main options of management decisions automation at the level of top management and a board of directors: digital command centers or automation of separate functions. Artificial intelligence systems will be subject to the same strict requirements for non-discrimination, transparency, and accountability as "natural" directors. At a certain stage, autonomous systems can be an effective tool for countries, regions, and companies with a shortage of human capital, equalizing or providing additional chances for such countries and companies to compete on the global market.
- [64] arXiv:2407.17590 [pdf, html, other]
-
Title: Is computational creativity flourishing on the dead internet?Comments: 6 pagesSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
The dead internet theory is a conspiracy theory that states that all interactions and posts on social media are no longer being made by real people, but rather by autonomous bots. While the theory is obviously not true, an increasing amount of posts on social media have been made by bots optimised to gain followers and drive engagement on social media platforms. This paper looks at the recent phenomenon of these bots, analysing their behaviour through the lens of computational creativity to investigate the question: is computational creativity flourishing on the dead internet?
- [65] arXiv:2407.17591 [pdf, other]
-
Title: Unified Prediction Model for Employability in Indian Higher Education SystemComments: 9 pagesSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Educational Data Mining has become extremely popular among researchers in last decade. Prior effort in this area was only directed towards prediction of academic performance of a student. Very less number of researches are directed towards predicting employability of a student i.e. prediction of students performance in campus placements at an early stage of enrollment. Furthermore, existing researches on students employability prediction are not universal in approach and is either based upon only one type of course or University/Institute. Henceforth, is not scalable from one context to another. With the necessity of unification, data of professional technical courses namely Bachelor in Engineering/Technology and Masters in Computer Applications students have been collected from 17 states of India. To deal with such a data, a unified predictive model has been developed and applied on 17 states datasets. The research done in this paper proves that model has universal application and can be applied to various states and institutes pan India with different cultural background and course structure. This paper also explores and proves statistically that there is no significant difference in Indian Education System with respect to states as far as prediction of employability of students is concerned. Model provides a generalized solution for student employability prediction in Indian Scenario.
- [66] arXiv:2407.17596 [pdf, html, other]
-
Title: Quality Assured: Rethinking Annotation Strategies in Imaging AITim Rädsch, Annika Reinke, Vivienn Weru, Minu D. Tizabi, Nicholas Heller, Fabian Isensee, Annette Kopp-Schneider, Lena Maier-HeinComments: Accepted at ECCV 2024, preprint, Computer Vision, Data AnnotationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper does not describe a novel method. Instead, it studies an essential foundation for reliable benchmarking and ultimately real-world application of AI-based image analysis: generating high-quality reference annotations. Previous research has focused on crowdsourcing as a means of outsourcing annotations. However, little attention has so far been given to annotation companies, specifically regarding their internal quality assurance (QA) processes. Therefore, our aim is to evaluate the influence of QA employed by annotation companies on annotation quality and devise methodologies for maximizing data annotation efficacy. Based on a total of 57,648 instance segmented images obtained from a total of 924 annotators and 34 QA workers from four annotation companies and Amazon Mechanical Turk (MTurk), we derived the following insights: (1) Annotation companies perform better both in terms of quantity and quality compared to the widely used platform MTurk. (2) Annotation companies' internal QA only provides marginal improvements, if any. However, improving labeling instructions instead of investing in QA can substantially boost annotation performance. (3) The benefit of internal QA depends on specific image characteristics. Our work could enable researchers to derive substantially more value from a fixed annotation budget and change the way annotation companies conduct internal QA.
- [67] arXiv:2407.17605 [pdf, html, other]
-
Title: Coupling Speech Encoders with Downstream Text ModelsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task. Our novel contribution is the use of an ``exporter'' layer that is trained under L2-loss to ensure a strong match between ASR embeddings and the MT token embeddings for the 1-best sequence. The ``exporter'' output embeddings are fed directly to the MT model in lieu of 1-best token embeddings, thus guaranteeing that the resulting model performs no worse than the 1-best cascade baseline, while allowing back-propagation gradient to flow from the MT model into the ASR components. The matched-embeddings cascade architecture provide a significant improvement over its 1-best counterpart in scenarios where incremental training of the MT model is not an option and yet we seek to improve quality by leveraging (speech, transcription, translated transcription) data provided with the AST task. The gain disappears when the MT model is incrementally trained on the parallel text data available with the AST task. The approach holds promise for other scenarios that seek to couple ASR encoders and immutable text models, such at large language models (LLM).
- [68] arXiv:2407.17611 [pdf, html, other]
-
Title: Adaptive Training of Grid-Dependent Physics-Informed Kolmogorov-Arnold NetworksSpyros Rigas, Michalis Papachristou, Theofilos Papadopoulos, Fotios Anagnostopoulos, Georgios AlexandridisSubjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Physics-Informed Neural Networks (PINNs) have emerged as a robust framework for solving Partial Differential Equations (PDEs) by approximating their solutions via neural networks and imposing physics-based constraints on the loss function. Traditionally, Multilayer Perceptrons (MLPs) are the neural network of choice, and significant progress has been made in optimizing their training. Recently, Kolmogorov-Arnold Networks (KANs) were introduced as a viable alternative, with the potential of offering better interpretability and efficiency while requiring fewer parameters. In this paper, we present a fast JAX-based implementation of grid-dependent Physics-Informed Kolmogorov-Arnold Networks (PIKANs) for solving PDEs. We propose an adaptive training scheme for PIKANs, incorporating known MLP-based PINN techniques, introducing an adaptive state transition scheme to avoid loss function peaks between grid updates, and proposing a methodology for designing PIKANs with alternative basis functions. Through comparative experiments we demonstrate that these adaptive features significantly enhance training efficiency and solution accuracy. Our results illustrate the effectiveness of PIKANs in improving performance for PDE solutions, highlighting their potential as a superior alternative in scientific and engineering applications.
- [69] arXiv:2407.17616 [pdf, html, other]
-
Title: Pretraining a Neural Operator in Lower DimensionsSubjects: Machine Learning (cs.LG)
There has recently been increasing attention towards developing foundational neural Partial Differential Equation (PDE) solvers and neural operators through large-scale pretraining. However, unlike vision and language models that make use of abundant and inexpensive (unlabeled) data for pretraining, these neural solvers usually rely on simulated PDE data, which can be costly to obtain, especially for high-dimensional PDEs. In this work, we aim to Pretrain neural PDE solvers on Lower Dimensional PDEs (PreLowD) where data collection is the least expensive. We evaluated the effectiveness of this pretraining strategy in similar PDEs in higher dimensions. We use the Factorized Fourier Neural Operator (FFNO) due to having the necessary flexibility to be applied to PDE data of arbitrary spatial dimensions and reuse trained parameters in lower dimensions. In addition, our work sheds light on the effect of the fine-tuning configuration to make the most of this pretraining strategy.
- [70] arXiv:2407.17617 [pdf, html, other]
-
Title: Adaptive Robot Detumbling of a Non-Rigid SatelliteComments: This paper has been accepted by the 63rd IEEE Conference on Decision and Control(CDC2024) as a regular paperSubjects: Robotics (cs.RO)
The challenge of satellite stabilization, particularly those with uncertain flexible dynamics, has become a pressing concern in control and robotics. These uncertainties, especially the dynamics of a third-party client satellite, significantly complicate the stabilization task. This paper introduces a novel adaptive detumbling method to handle non-rigid satellites with unknown motion dynamics (translation and rotation). The distinctive feature of our approach is that we model the non-rigid tumbling satellite as a two-link serial chain with unknown stiffness and damping in contrast to previous detumbling research works which consider the satellite a rigid body. We develop a novel adaptive robotics approach to detumble the satellite by using two space tugs as servicer despite the uncertain dynamics in the post-capture case. Notably, the stiffness properties and other physical parameters, including the mass and inertia of the two links, remain unknown to the servicer. Our proposed method addresses the challenges in detumbling tasks and paves the way for advanced manipulation of non-rigid satellites with uncertain dynamics.
- [71] arXiv:2407.17618 [pdf, other]
-
Title: Productive self/vulnerable body: self-tracking, overworking culture, and conflicted data practicesSubjects: Computers and Society (cs.CY)
Self-tracking, the collection, analysis, and interpretation of personal data, signifies an individualized way of health governance as people are demanded to build a responsible self by internalizing norms. However, the technological promises often bear conflicts with various social factors such as a strenuous schedule, a lack of motivation, stress, and anxieties, which fail to deliver health outcomes. To re-problematize the phenomenon, this paper situates self-tracking in an overworking culture in China and draws on semi structured and in depth interviews with overworking individuals to reveal the patterns in users interactions and interpretations with self-tracking data. It builds on the current literature of self-tracking and engages with theories from Science and Technology Studies, especially sociomaterial assemblages (Lupton 2016) and technological mediation (Verbeek 2005), to study self-tracking in a contextualized way which connects the micro (data reading, visualization, and affective elements in design) with the macro (work and workplaces, socioeconomic and political background) contexts of self-tracking. Drawing on investigation of the social context that users of self-tracking technologies internalize, reflect, or resist, the paper argues that the productivity and value oriented assumptions and workplace culture shape the imaginary of intensive (and sometimes impossible) self-care and health, an involution of competence embedded in the technological design and users affective experiences. Users respond by enacting different design elements and social contexts to frame two distinctive data practices of self-tracking.
- [72] arXiv:2407.17619 [pdf, html, other]
-
Title: The Power of Graph Sparsification in the Continual Release ModelSubjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR)
The graph continual release model of differential privacy seeks to produce differentially private solutions to graph problems under a stream of updates where new private solutions are released after each update. Streaming graph algorithms in the non-private literature also produce (approximately) accurate solutions when provided updates in a stream, but they additionally try to achieve two other goals: 1) output vertex or edge subsets as approximate solutions to the problem (not just real-valued estimates) and 2) use space that is sublinear in the number of edges or the number of vertices. Thus far, all previously known edge-differentially private algorithms for graph problems in the continual release setting do not meet the above benchmarks. Instead, they require computing exact graph statistics on the input [SLMVC18, FHO21, JSW24]. In this paper, we leverage sparsification to address the above shortcomings. Our edge-differentially private algorithms use sublinear space with respect to the number of edges in the graph while some also achieve sublinear space in the number of vertices in the graph. In addition, for most of our problems, we also output differentially private vertex subsets.
We make novel use of assorted sparsification techniques from the non-private streaming and static graph algorithms literature and achieve new results in the sublinear space, continual release setting for a variety of problems including densest subgraph, $k$-core decomposition, maximum matching, and vertex cover. In addition to our edge-differential privacy results, we use graph sparsification based on arboricity to obtain a set of results in the node-differential privacy setting, illustrating a new connection between sparsification and privacy beyond minimizing space. We conclude with polynomial additive error lower bounds for edge-privacy in the fully dynamic setting. - [73] arXiv:2407.17620 [pdf, html, other]
-
Title: CoMoTo: Unpaired Cross-Modal Lesion Distillation Improves Breast Lesion Detection in TomosynthesisMuhammad Alberb, Marawan Elbatel, Aya Elgebaly, Ricardo Montoya-del-Angel, Xiaomeng Li, Robert MartíComments: ADSMI @ MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Digital Breast Tomosynthesis (DBT) is an advanced breast imaging modality that offers superior lesion detection accuracy compared to conventional mammography, albeit at the trade-off of longer reading time. Accelerating lesion detection from DBT using deep learning is hindered by limited data availability and huge annotation costs. A possible solution to this issue could be to leverage the information provided by a more widely available modality, such as mammography, to enhance DBT lesion detection. In this paper, we present a novel framework, CoMoTo, for improving lesion detection in DBT. Our framework leverages unpaired mammography data to enhance the training of a DBT model, improving practicality by eliminating the need for mammography during inference. Specifically, we propose two novel components, Lesion-specific Knowledge Distillation (LsKD) and Intra-modal Point Alignment (ImPA). LsKD selectively distills lesion features from a mammography teacher model to a DBT student model, disregarding background features. ImPA further enriches LsKD by ensuring the alignment of lesion features within the teacher before distilling knowledge to the student. Our comprehensive evaluation shows that CoMoTo is superior to traditional pretraining and image-level KD, improving performance by 7% Mean Sensitivity under low-data setting. Our code is available at this https URL .
- [74] arXiv:2407.17622 [pdf, other]
-
Title: Towards Neural Network based Cognitive Models of Dynamic Decision-Making by HumansChangyu Chen, Shashank Reddy Chirra, Maria José Ferreira, Cleotilde Gonzalez, Arunesh Sinha, Pradeep VarakanthamSubjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Modelling human cognitive processes in dynamic decision-making tasks has been an endeavor in AI for a long time. Some initial works have attempted to utilize neural networks (and large language models) but often assume one common model for all humans and aim to emulate human behavior in aggregate. However, behavior of each human is distinct, heterogeneous and relies on specific past experiences in specific tasks. To that end, we build on a well known model of cognition, namely Instance Based Learning (IBL), that posits that decisions are made based on similar situations encountered in the past. We propose two new attention based neural network models to model human decision-making in dynamic settings. We experiment with two distinct datasets gathered from human subject experiment data, one focusing on detection of phishing email by humans and another where humans act as attackers in a cybersecurity setting and decide on an attack option. We conduct extensive experiments with our two neural network models, IBL, and GPT3.5, and demonstrate that one of our neural network models achieves the best performance in representing human decision-making. We find an interesting trend that all models predict a human's decision better if that human is better at the task. We also explore explanation of human decisions based on what our model considers important in prediction. Overall, our work yields promising results for further use of neural networks in cognitive modelling of human decision making. Our code is available at this https URL.
- [75] arXiv:2407.17623 [pdf, html, other]
-
Title: SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network AcceleratorSubjects: Machine Learning (cs.LG); Performance (cs.PF)
The design of energy-efficient, high-performance, and reliable Convolutional Neural Network (CNN) accelerators involves significant challenges due to complex power and thermal management issues. This paper introduces SAfEPaTh, a novel system-level approach for accurately estimating power and temperature in tile-based CNN accelerators. By addressing both steady-state and transient-state scenarios, SAfEPaTh effectively captures the dynamic effects of pipeline bubbles in interlayer pipelines, utilizing real CNN workloads for comprehensive evaluation. Unlike traditional methods, it eliminates the need for circuit-level simulations or on-chip measurements. Our methodology leverages TANIA, a cutting-edge hybrid digital-analog tile-based accelerator featuring analog-in-memory computing cores alongside digital cores. Through rigorous simulation results using the ResNet18 model, we demonstrate SAfEPaTh's capability to accurately estimate power and temperature within 500 seconds, encompassing CNN model accelerator mapping exploration and detailed power and thermal estimations. This efficiency and accuracy make SAfEPaTh an invaluable tool for designers, enabling them to optimize performance while adhering to stringent power and thermal constraints. Furthermore, SAfEPaTh's adaptability extends its utility across various CNN models and accelerator architectures, underscoring its broad applicability in the field. This study contributes significantly to the advancement of energy-efficient and reliable CNN accelerator designs, addressing critical challenges in dynamic power and thermal management.
- [76] arXiv:2407.17626 [pdf, html, other]
-
Title: Competitive Perimeter Defense in Tree EnvironmentsComments: 8 pages, 5 figuresSubjects: Systems and Control (eess.SY)
We consider a perimeter defense problem in a rooted full tree graph environment in which a single defending vehicle seeks to defend a set of specified vertices, termed as the perimeter from mobile intruders that enter the environment through the tree's leaves. We adopt the technique of competitive analysis to characterize the performance of an online algorithm for the defending vehicle. We first derive fundamental limits on the performance of any online algorithm relative to that of an optimal offline algorithm. Specifically, we give three fundamental conditions for finite, 2, and 3/2 competitive ratios in terms of the environment parameters. We then design and analyze three classes of online algorithms that have provably finite competitiveness under varying environmental parameter regimes. Finally, we give a numerical visualization of these regimes to better show the comparative strengths and weaknesses of each algorithm.
- [77] arXiv:2407.17628 [pdf, html, other]
-
Title: PEEKABOO: Hiding parts of an image for unsupervised object localizationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual information such as the appearance, type and number of objects, as well as the lack of labeled object classes typically available in supervised settings. While recent approaches to unsupervised object localization have demonstrated significant progress by leveraging self-supervised visual representations, they often require computationally intensive training processes, resulting in high resource demands in terms of computation, learnable parameters, and data. They also lack explicit modeling of visual context, potentially limiting their accuracy in object localization. To tackle these challenges, we propose a single-stage learning framework, dubbed PEEKABOO, for unsupervised object localization by learning context-based representations at both the pixel- and shape-level of the localized objects through image masking. The key idea is to selectively hide parts of an image and leverage the remaining image information to infer the location of objects without explicit supervision. The experimental results, both quantitative and qualitative, across various benchmark datasets, demonstrate the simplicity, effectiveness and competitive performance of our approach compared to state-of-the-art methods in both single object discovery and unsupervised salient object detection tasks. Code and pre-trained models are available at: this https URL
- [78] arXiv:2407.17629 [pdf, html, other]
-
Title: Papilusion at DAGPap24: Paper or Illusion? Detecting AI-generated Scientific PapersComments: to appear in DAGPAP 2024 proceedingsSubjects: Computation and Language (cs.CL)
This paper presents Papilusion, an AI-generated scientific text detector developed within the DAGPap24 shared task on detecting automatically generated scientific papers. We propose an ensemble-based approach and conduct ablation studies to analyze the effect of the detector configurations on the performance. Papilusion is ranked 6th on the leaderboard, and we improve our performance after the competition ended, achieving 99.46 (+9.63) of the F1-score on the official test set.
- [79] arXiv:2407.17630 [pdf, html, other]
-
Title: Revising the Problem of Partial Labels from the Perspective of CNNs' RobustnessSubjects: Computer Vision and Pattern Recognition (cs.CV)
Convolutional neural networks (CNNs) have gained increasing popularity and versatility in recent decades, finding applications in diverse domains. These remarkable achievements are greatly attributed to the support of extensive datasets with precise labels. However, annotating image datasets is intricate and complex, particularly in the case of multi-label datasets. Hence, the concept of partial-label setting has been proposed to reduce annotation costs, and numerous corresponding solutions have been introduced. The evaluation methods for these existing solutions have been primarily based on accuracy. That is, their performance is assessed by their predictive accuracy on the test set. However, we insist that such an evaluation is insufficient and one-sided. On one hand, since the quality of the test set has not been evaluated, the assessment results are unreliable. On the other hand, the partial-label problem may also be raised by undergoing adversarial attacks. Therefore, incorporating robustness into the evaluation system is crucial. For this purpose, we first propose two attack models to generate multiple partial-label datasets with varying degrees of label missing rates. Subsequently, we introduce a lightweight partial-label solution using pseudo-labeling techniques and a designed loss function. Then, we employ D-Score to analyze both the proposed and existing methods to determine whether they can enhance robustness while improving accuracy. Extensive experimental results demonstrate that while certain methods may improve accuracy, the enhancement in robustness is not significant, and in some cases, it even diminishes.
- [80] arXiv:2407.17631 [pdf, html, other]
-
Title: BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example LearningSubjects: Software Engineering (cs.SE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Software bugs require developers to exert significant effort to identify and resolve them, often consuming about one-third of their time. Bug localization, the process of pinpointing the exact source code files that need modification, is crucial in reducing this effort. Existing bug localization tools, typically reliant on deep learning techniques, face limitations in cross-project applicability and effectiveness in multi-language environments. Recent advancements with Large Language Models (LLMs) offer detailed representations for bug localization. However, they encounter challenges with limited context windows and mapping accuracy. To address these issues, we propose BLAZE, an approach that employs dynamic chunking and hard example learning. First, BLAZE dynamically segments source code to minimize continuity loss. Then, BLAZE fine-tunes a GPT-based model using challenging bug cases, in order to enhance cross-project and cross-language bug localization. To support the capability of BLAZE, we create the BEETLEBOX dataset, which comprises 26,321 bugs from 29 large and thriving open-source projects across five different programming languages (Java, C++, Python, Go, and JavaScript). Our evaluations of BLAZE on three benchmark datasets BEETLEBOX, SWE-Bench, and Ye et al. demonstrate substantial improvements compared to six state-of-the-art baselines. Specifically, BLAZE achieves up to an increase of 120% in Top 1 accuracy, 144% in Mean Average Precision (MAP), and 100% in Mean Reciprocal Rank (MRR). An extensive ablation study confirms the contributions of our pipeline components to the overall performance enhancement.
- [81] arXiv:2407.17633 [pdf, html, other]
-
Title: PICA: A Data-driven Synthesis of Peer Instruction and Continuous AssessmentComments: Accepted to RKDE Workshop at ECML-PKDD 2023Subjects: Computers and Society (cs.CY)
Peer Instruction (PI) and Continuous Assessment(CA) are two distinct educational techniques with extensive research demonstrating their effectiveness. The work herein combines PI and CA in a deliberate and novel manner to pair students together for a PI session in which they collaborate on a CA task. The data used to inform the pairing method is restricted to the most previous CA task students completed independently. The motivation for this data-driven collaborative learning is to improve student learning, communication, and engagement. Quantitative results from an investigation of the method show improved assessment scores on the PI CA tasks, although evidence of a positive effect on subsequent individual CA tasks was not statistically significant as anticipated. However, student perceptions were positive, engagement was high, and students interacted with a broader set of peers than is typical. These qualitative observations, together with extant research on the general benefits of improving student engagement and communication (e.g. improved sense of belonging, increased social capital, etc.), render the method worthy for further research into building and evaluating small student learning communities using student assessment data.
- [82] arXiv:2407.17636 [pdf, html, other]
-
Title: IgnitionInnovators at "Discharge Me!": Chain-of-Thought Instruction Finetuning Large Language Models for Discharge SummariesComments: Accepted by BioNLP2024 WorkshopSubjects: Computation and Language (cs.CL)
This paper presents our proposed approach to the Discharge Me! shared task, collocated with the 23th Workshop on Biomedical Natural Language Processing (BioNLP). In this work, we develop an LLM-based framework for solving the Discharge Summary Documentation (DSD) task, i.e., generating the two critical target sections `Brief Hospital Course' and `Discharge Instructions' in the discharge summary. By streamlining the recent instruction-finetuning process on LLMs, we explore several prompting strategies for optimally adapting LLMs to specific generation task of DSD. Experimental results show that providing a clear output structure, complimented by a set of comprehensive Chain-of-Thoughts (CoT) questions, effectively improves the model's reasoning capability, and thereby, enhancing the structural correctness and faithfulness of clinical information in the generated text. Source code is available at: this https URL
- [83] arXiv:2407.17638 [pdf, other]
-
Title: Time Matters: Examine Temporal Effects on Biomedical Language ModelsComments: Accept to AMIA 2024 Annual SymposiumSubjects: Computation and Language (cs.CL)
Time roots in applying language models for biomedical applications: models are trained on historical data and will be deployed for new or future data, which may vary from training data. While increasing biomedical tasks have employed state-of-the-art language models, there are very few studies have examined temporal effects on biomedical models when data usually shifts across development and deployment. This study fills the gap by statistically probing relations between language model performance and data shifts across three biomedical tasks. We deploy diverse metrics to evaluate model performance, distance methods to measure data drifts, and statistical methods to quantify temporal effects on biomedical language models. Our study shows that time matters for deploying biomedical language models, while the degree of performance degradation varies by biomedical tasks and statistical quantification approaches. We believe this study can establish a solid benchmark to evaluate and assess temporal effects on deploying biomedical language models.
- [84] arXiv:2407.17642 [pdf, html, other]
-
Title: SMA-Hyper: Spatiotemporal Multi-View Fusion Hypergraph Learning for Traffic Accident PredictionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Predicting traffic accidents is the key to sustainable city management, which requires effective address of the dynamic and complex spatiotemporal characteristics of cities. Current data-driven models often struggle with data sparsity and typically overlook the integration of diverse urban data sources and the high-order dependencies within them. Additionally, they frequently rely on predefined topologies or weights, limiting their adaptability in spatiotemporal predictions. To address these issues, we introduce the Spatiotemporal Multiview Adaptive HyperGraph Learning (SMA-Hyper) model, a dynamic deep learning framework designed for traffic accident prediction. Building on previous research, this innovative model incorporates dual adaptive spatiotemporal graph learning mechanisms that enable high-order cross-regional learning through hypergraphs and dynamic adaptation to evolving urban data. It also utilises contrastive learning to enhance global and local data representations in sparse datasets and employs an advance attention mechanism to fuse multiple views of accident data and urban functional features, thereby enriching the contextual understanding of risk factors. Extensive testing on the London traffic accident dataset demonstrates that the SMA-Hyper model significantly outperforms baseline models across various temporal horizons and multistep outputs, affirming the effectiveness of its multiview fusion and adaptive learning strategies. The interpretability of the results further underscores its potential to improve urban traffic management and safety by leveraging complex spatiotemporal urban data, offering a scalable framework adaptable to diverse urban environments.
- [85] arXiv:2407.17643 [pdf, html, other]
-
Title: Robust Iterative Learning for Collaborative Road Profile Estimation and Active Suspension Control in Connected VehiclesSubjects: Systems and Control (eess.SY)
This paper presents the development of a new collaborative road profile estimation and active suspension control framework in connected vehicles, where participating vehicles iteratively refine the road profile estimation and enhance suspension control performance through an iterative learning scheme. Specifically, we develop a robust iterative learning approach to tackle the heterogeneity and model uncertainties in participating vehicles, which are important for practical implementations. In addition, the framework can be adopted as an add-on system to augment existing suspension control schemes. Comprehensive numerical studies are performed to evaluate and validate the proposed framework.
- [86] arXiv:2407.17645 [pdf, html, other]
-
Title: Hopfield Networks for Asset AllocationComments: 12 pages, 4 figuresSubjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM)
We present the first application of modern Hopfield networks to the problem of portfolio optimization. We performed an extensive study based on combinatorial purged cross-validation over several datasets and compared our results to both traditional and deep-learning-based methods for portfolio selection. Compared to state-of-the-art deep-learning methods such as Long-Short Term Memory networks and Transformers, we find that the proposed approach performs on par or better, while providing faster training times and better stability. Our results show that Modern Hopfield Networks represent a promising approach to portfolio optimization, allowing for an efficient, scalable, and robust solution for asset allocation, risk management, and dynamic rebalancing.
- [87] arXiv:2407.17647 [pdf, html, other]
-
Title: An Energy-Efficient Artefact Detection Accelerator on FPGAs for Hyper-Spectral Satellite ImageryJournal-ref: 27th Euromicro Conference on Digital System Design (DSD), 2024Subjects: Hardware Architecture (cs.AR)
Hyper-Spectral Imaging (HSI) is a crucial technique for analysing remote sensing data acquired from Earth observation satellites. The rich spatial and spectral information obtained through HSI allows for better characterisation and exploration of the Earth's surface over traditional techniques like RGB and Multi-Spectral imaging on the downlinked image data at ground stations. Sometimes, these images do not contain meaningful information due to the presence of clouds or other artefacts, limiting their usefulness. Transmission of such artefact HSI images leads to wasteful use of already scarce energy and time costs required for communication. While detecting such artefacts before transmitting the HSI image is desirable, the computational complexity of these algorithms and the limited power budget on satellites (especially CubeSats) are key constraints. This paper presents an unsupervised learning-based convolutional autoencoder (CAE) model for artefact identification of acquired HSI images at the satellite and a deployment architecture on AMD's Zynq Ultrascale FPGAs. The model is trained and tested on widely used HSI image datasets: Indian Pines, Salinas Valley, the University of Pavia and the Kennedy Space Center. For deployment, the model is quantised to 8-bit precision, fine-tuned using the Vitis-AI framework and integrated as a subordinate accelerator using AMD's Deep-Learning Processing Units (DPU) instance on the Zynq device. Our tests show that the model can process each spectral band in an HSI image in 4 ms, 2.6x better than INT8 inference on Nvidia's Jetson platform & 1.27x better than SOTA artefact detectors. Our model also achieves an f1-score of 92.8% and FPR of 0% across the dataset, while consuming 21.52 mJ per HSI image, 3.6x better than INT8 Jetson inference & 7.5x better than SOTA artefact detectors, making it a viable architecture for deployment in CubeSats.
- [88] arXiv:2407.17651 [pdf, html, other]
-
Title: PARS3: Parallel Sparse Skew-Symmetric Matrix-Vector Multiplication with Reverse Cuthill-McKee ReorderingSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Sparse matrices, as prevalent primitive of various scientific computing algorithms, persist as a bottleneck in processing. A skew-symmetric matrix flips signs of symmetric pairs in a symmetric matrix. Our work, Parallel 3-Way Banded Skew-Symmetric Sparse Matrix-Vector Multiplication, equally improves parallel symmetric SpMV kernels with a different perspective than the common literature trends, by manipulating the form of matrix in a preprocessing step to accelerate the repeated computations of iterative solvers. We effectively use Reverse Cuthill-McKee (RCM) reordering algorithm to transform a sparse skew-symmetrix matrix into a band matrix, then efficiently parallelize it by splitting the band structure into 3 different parts by considering its local sparsity. Our proposed method with RCM is novel in the sense that it is the first implementation of parallel skew-symmetric SpMV kernels. Our enhancements in SpMV and findings are valuable with significant strong scalings of up to 19x over the serial compressed SpMV implementation. We overperform a heuristic-based graph-coloring approach with synchronization phases in implementing parallel symmetric SpMVs. Our approach also naturally applies to parallel sparse symmetric SpMVs, that can inspire widespread SpMV solutions to adapt presented optimizations in this paper.
- [89] arXiv:2407.17654 [pdf, html, other]
-
Title: Generative Learning for Simulation of US Army Vehicle FaultsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We develop a novel generative model to simulate vehicle health and forecast faults, conditioned on practical operational considerations. The model, trained on data from the US Army's Predictive Logistics program, aims to support predictive maintenance. It forecasts faults far enough in advance to execute a maintenance intervention before a breakdown occurs. The model incorporates real-world factors that affect vehicle health. It also allows us to understand the vehicle's condition by analyzing operating data, and characterizing each vehicle into discrete states. Importantly, the model predicts the time to first fault with high accuracy. We compare its performance to other models and demonstrate its successful training.
- [90] arXiv:2407.17657 [pdf, other]
-
Title: My Ontologist: Evaluating BFO-Based AI for Definition SupportSubjects: Databases (cs.DB)
Generative artificial intelligence (AI), exemplified by the release of GPT-3.5 in 2022, has significantly advanced the potential applications of large language models (LLMs), including in the realms of ontology development and knowledge graph creation. Ontologies, which are structured frameworks for organizing information, and knowledge graphs, which combine ontologies with actual data, are essential for enabling interoperability and automated reasoning. However, current research has largely overlooked the generation of ontologies extending from established upper-level frameworks like the Basic Formal Ontology (BFO), risking the creation of non-integrable ontology silos. This study explores the extent to which LLMs, particularly GPT-4, can support ontologists trained in BFO. Through iterative development of a specialized GPT model named "My Ontologist," we aimed to generate BFO-conformant ontologies. Initial versions faced challenges in maintaining definition conventions and leveraging foundational texts effectively. My Ontologist 3.0 showed promise by adhering to structured rules and modular ontology suites, yet the release of GPT-4o disrupted this progress by altering the model's behavior. Our findings underscore the importance of aligning LLM-generated ontologies with top-level standards and highlight the complexities of integrating evolving AI capabilities in ontology engineering.
- [91] arXiv:2407.17663 [pdf, html, other]
-
Title: Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership InferenceComments: ICML 2024 Workshop on the Next Generation of AI SafetySubjects: Cryptography and Security (cs.CR)
Predictive machine learning models are becoming increasingly deployed in high-stakes contexts involving sensitive personal data; in these contexts, there is a trade-off between model explainability and data privacy. In this work, we push the boundaries of this trade-off: with a focus on foundation models for image classification fine-tuning, we reveal unforeseen privacy risks of post-hoc model explanations and subsequently offer mitigation strategies for such risks. First, we construct VAR-LRT and L1/L2-LRT, two new membership inference attacks based on feature attribution explanations that are significantly more successful than existing explanation-leveraging attacks, particularly in the low false-positive rate regime that allows an adversary to identify specific training set members with confidence. Second, we find empirically that optimized differentially private fine-tuning substantially diminishes the success of the aforementioned attacks, while maintaining high model accuracy. We carry out a systematic empirical investigation of our 2 new attacks with 5 vision transformer architectures, 5 benchmark datasets, 4 state-of-the-art post-hoc explanation methods, and 4 privacy strength settings.
- [92] arXiv:2407.17664 [pdf, other]
-
Title: SDLNet: Statistical Deep Learning Network for Co-Occurring Object Detection and IdentificationComments: 8 pages, 3 figures, ICMLT-2024. arXiv admin note: text overlap with arXiv:2403.17223Journal-ref: (ICMLT 2024) 2024 9th International Conference on Machine Learning TechnologiesSubjects: Computer Vision and Pattern Recognition (cs.CV)
With the growing advances in deep learning based technologies the detection and identification of co-occurring objects is a challenging task which has many applications in areas such as, security and surveillance. In this paper, we propose a novel framework called SDLNet- Statistical analysis with Deep Learning Network that identifies co-occurring objects in conjunction with base objects in multilabel object categories. The pipeline of proposed work is implemented in two stages: in the first stage of SDLNet we deal with multilabel detectors for discovering labels, and in the second stage we perform co-occurrence matrix analysis. In co-occurrence matrix analysis, we learn co-occurrence statistics by setting base classes and frequently occurring classes, following this we build association rules and generate frequent patterns. The crucial part of SDLNet is recognizing base classes and making consideration for co-occurring classes. Finally, the generated co-occurrence matrix based on frequent patterns will show base classes and their corresponding co-occurring classes. SDLNet is evaluated on two publicly available datasets: Pascal VOC and MS-COCO. The experimental results on these benchmark datasets are reported in Sec 4.
- [93] arXiv:2407.17671 [pdf, html, other]
-
Title: Unsqueeze [CLS] Bottleneck to Learn Rich RepresentationsSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Distillation-based self-supervised learning typically leads to more compressed representations due to its radical clustering process and the implementation of a sharper target distribution. To overcome this limitation and preserve more information from input, we introduce UDI, conceptualized as Unsqueezed Distillation-based self-supervised learning (SSL). UDI enriches the learned representation by encouraging multimodal prediction distilled from a consolidated profile of local predictions that are derived via stratified sampling. Our evaluations show that UDI not only promotes semantically meaningful representations at instance level, delivering superior or competitive results to state-of-the-art SSL methods in image classification, but also effectively preserves the nuisance of input, which yields significant improvement in dense prediction tasks, including object detection and segmentation. Additionally, UDI performs competitively in low-shot image classification, improving the scalability of joint-embedding pipelines. Various visualizations and ablation studies are presented to further elucidate the mechanisms behind UDI. Our source code is available at this https URL.
- [94] arXiv:2407.17672 [pdf, html, other]
-
Title: Spiking Neural Networks in Vertical Federated Learning: Performance Trade-offsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Federated machine learning enables model training across multiple clients while maintaining data privacy. Vertical Federated Learning (VFL) specifically deals with instances where the clients have different feature sets of the same samples. As federated learning models aim to improve efficiency and adaptability, innovative neural network architectures like Spiking Neural Networks (SNNs) are being leveraged to enable fast and accurate processing at the edge. SNNs, known for their efficiency over Artificial Neural Networks (ANNs), have not been analyzed for their applicability in VFL, thus far. In this paper, we investigate the benefits and trade-offs of using SNN models in a vertical federated learning setting. We implement two different federated learning architectures -- with model splitting and without model splitting -- that have different privacy and performance implications. We evaluate the setup using CIFAR-10 and CIFAR-100 benchmark datasets along with SNN implementations of VGG9 and ResNET classification models. Comparative evaluations demonstrate that the accuracy of SNN models is comparable to that of traditional ANNs for VFL applications, albeit significantly more energy efficient.
- [95] arXiv:2407.17673 [pdf, html, other]
-
Title: CRASAR-U-DROIDs: A Large Scale Benchmark Dataset for Building Alignment and Damage Assessment in Georectified sUAS ImageryComments: 16 Pages, 7 Figures, 6 TablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
This document presents the Center for Robot Assisted Search And Rescue - Uncrewed Aerial Systems - Disaster Response Overhead Inspection Dataset (CRASAR-U-DROIDs) for building damage assessment and spatial alignment collected from small uncrewed aerial systems (sUAS) geospatial imagery. This dataset is motivated by the increasing use of sUAS in disaster response and the lack of previous work in utilizing high-resolution geospatial sUAS imagery for machine learning and computer vision models, the lack of alignment with operational use cases, and with hopes of enabling further investigations between sUAS and satellite imagery. The CRASAR-U-DRIODs dataset consists of fifty-two (52) orthomosaics from ten (10) federally declared disasters (Hurricane Ian, Hurricane Ida, Hurricane Harvey, Hurricane Idalia, Hurricane Laura, Hurricane Michael, Musset Bayou Fire, Mayfield Tornado, Kilauea Eruption, and Champlain Towers Collapse) spanning 67.98 square kilometers (26.245 square miles), containing 21,716 building polygons and damage labels, and 7,880 adjustment annotations. The imagery was tiled and presented in conjunction with overlaid building polygons to a pool of 130 annotators who provided human judgments of damage according to the Joint Damage Scale. These annotations were then reviewed via a two-stage review process in which building polygon damage labels were first reviewed individually and then again by committee. Additionally, the building polygons have been aligned spatially to precisely overlap with the imagery to enable more performant machine learning models to be trained. It appears that CRASAR-U-DRIODs is the largest labeled dataset of sUAS orthomosaic imagery.
- [96] arXiv:2407.17674 [pdf, html, other]
-
Title: Synthetic High-resolution Cryo-EM Density Maps with Generative Adversarial NetworksSubjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Generating synthetic cryogenic electron microscopy (cryo-EM) 3D density maps from molecular structures has potential important applications in structural biology. Yet existing simulation-based methods cannot mimic all the complex features present in experimental maps, such as secondary structure elements. As an alternative, we propose struc2mapGAN, a novel data-driven method that employs a generative adversarial network (GAN) to produce high-resolution experimental-like density maps from molecular structures. More specifically, struc2mapGAN uses a U-Net++ architecture as the generator, with an additional L1 loss term and further processing of raw experimental maps to enhance learning efficiency. While struc2mapGAN can promptly generate maps after training, we demonstrate that it outperforms existing simulation-based methods for a wide array of tested maps and across various evaluation metrics. Our code is available at this https URL.
- [97] arXiv:2407.17675 [pdf, html, other]
-
Title: Drawing ellipses and elliptical arcs with piecewise cubic B\'ezier curve approximationsComments: 8 figuresSubjects: Graphics (cs.GR)
This tutorial describes how to use piecewise cubic Bézier curves to draw arbitrarily oriented ellipses and elliptical arcs. The geometric principles discussed here result in strikingly simple interfaces for graphics functions that can draw (approximate) circles, ellipses, and arcs of circles and ellipses. C++ source code listings are included for these functions. Their code size can be relatively small because they are designed to be used with a graphics library or platform that draws Bézier curves, and the library or platform is tasked with the actual rendering of the curves.
- [98] arXiv:2407.17676 [pdf, html, other]
-
Title: Empowering the Quantum Cloud User with QRIOSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Quantum computing is moving swiftly from theoretical to practical applications, making it crucial to establish a significant quantum advantage. Despite substantial investments, access to quantum devices is still limited, with users facing issues like long wait times and inefficient resource management. Unlike the mature cloud solutions for classical computing, quantum computing lacks effective infrastructure for resource optimization. We propose a Quantum Resource Infrastructure Orchestrator (QRIO), a state-of-the-art cloud resource manager built on Kubernetes that is tailored to quantum computing. QRIO seeks to democratize access to quantum devices by providing customizable, user-friendly, open-source resource management. QRIO's design aims to ensure equitable access, optimize resource utilization, and support diverse applications, thereby speeding up innovation and making quantum computing more accessible and efficient to a broader user base. In this paper, we discuss QRIO's various features and evaluate its capability in several representative usecases.
- [99] arXiv:2407.17677 [pdf, other]
-
Title: Women's Participation in Computing: Evolving Research MethodsComments: 13 pages, 6 figuresSubjects: Computers and Society (cs.CY)
A 2022 keynote for the ACM History Committee on "Why SIG History Matters: New Data on Gender Bias in ACM's Founding SIGs 1970-2000" presented new data describing women's participation as research-article authors in 13 early ACM Special Interest Groups, finding significant growth in women's participation across 1970-2000 and, additionally, remarkable differences in women's participation between the SIGs. That presentation built on several earlier publications that developed a research method for assessing the number of women computer scientists that [a] are chronologically prior to the availability of the Bureau of Labor Statistics (BLS) data on women in the IT workforce; and [b] permit focused investigation of varied sub-fields within computing. This present report expands on these earlier articles, and their evolving research method, connecting them to the ACM SIG Heritage presentation. It also outlines some of the choices and considerations made in developing and refining "mixed methods" research (using both quantitative and qualitative approaches) as well as extensions of the research being currently explored.
- [100] arXiv:2407.17678 [pdf, html, other]
-
Title: Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention HeadsComments: 10 pagesSubjects: Computation and Language (cs.CL)
Existing LLM training and inference frameworks struggle in boosting efficiency with sparsity while maintaining the integrity of context and model architecture. Inspired by the sharding concept in database and the fact that attention parallelizes over heads on accelerators, we propose Sparsely-Sharded (S2) Attention, an attention algorithm that allocates heterogeneous context partitions for different attention heads to divide and conquer. S2-Attention enforces each attention head to only attend to a partition of contexts following a strided sparsity pattern, while the full context is preserved as the union of all the shards. As attention heads are processed in separate thread blocks, the context reduction for each head can thus produce end-to-end speed-up and memory reduction. At inference, LLMs trained with S2-Attention can then take the KV cache reduction as free meals with guaranteed model quality preserve. In experiments, we show S2-Attentioncan provide as much as (1) 25.3X wall-clock attention speed-up over FlashAttention-2, resulting in 6X reduction in end-to-end training time and 10X inference latency, (2) on-par model training quality compared to default attention, (3)perfect needle retrieval accuracy over 32K context window. On top of the algorithm, we build DKernel, an LLM training and inference kernel library that allows users to customize sparsity patterns for their own models. We open-sourced DKerneland make it compatible with Megatron, Pytorch, and vLLM.
- [101] arXiv:2407.17679 [pdf, html, other]
-
Title: Instagram versus women of color: Why are women of color protesting Instagram's algorithmic changes?Comments: 5 pages, Workshop ACM CHI 2023Subjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Instagram has been appropriated by communities for several contemporary social struggles, often translating into real world action. Likewise, women of color (WOC) have used it to protest, share information and support one another through its various affordances. However, Instagram is known to have frequent updates, and recently the updates have been more drastic. The newest update changed the recommendation algorithm such that it showed video-oriented content (reels) from unknown accounts over static media from a user's own network. Several marginalized communities, and especially WOC resisted this change and others that led to it. Due to the backlash, Instagram rolled back its changes. Drawing from past HCI work on digital platforms for marginalised communities, I propose a qualitative study informed by the open research strategy to understand why WOC are resisting these changes, and eventually provide implications for design that can help implement changes in a more inclusive manner.
- [102] arXiv:2407.17681 [pdf, html, other]
-
Title: DesignChecker: Visual Design Support for Blind and Low Vision Web DevelopersComments: Conditionally Accepted to UIST 2024Subjects: Human-Computer Interaction (cs.HC)
Blind and low vision (BLV) developers create websites to share knowledge and showcase their work. A well-designed website can engage audiences and deliver information effectively, yet it remains challenging for BLV developers to review their web designs. We conducted interviews with BLV developers (N=9) and analyzed 20 websites created by BLV developers. BLV developers created highly accessible websites but wanted to assess the usability of their websites for sighted users and follow the design standards of other websites. They also encountered challenges using screen readers to identify illegible text, misaligned elements, and inharmonious colors. We present DesignChecker, a browser extension that helps BLV developers improve their web designs. With DesignChecker, users can assess their current design by comparing it to visual design guidelines, a reference website of their choice, or a set of similar websites. DesignChecker also identifies the specific HTML elements that violate design guidelines and suggests CSS changes for improvements. Our user study participants (N=8) recognized more visual design errors than using their typical workflow and expressed enthusiasm about using DesignChecker in the future.
- [103] arXiv:2407.17683 [pdf, html, other]
-
Title: RL-augmented MPC Framework for Agile and Robust Bipedal Footstep Locomotion Planning and ControlComments: 8 pages, 7 figuresSubjects: Robotics (cs.RO)
This paper proposes an online bipedal footstep planning strategy that combines model predictive control (MPC) and reinforcement learning (RL) to achieve agile and robust bipedal maneuvers. While MPC-based foot placement controllers have demonstrated their effectiveness in achieving dynamic locomotion, their performance is often limited by the use of simplified models and assumptions. To address this challenge, we develop a novel foot placement controller that leverages a learned policy to bridge the gap between the use of a simplified model and the more complex full-order robot system. Specifically, our approach employs a unique combination of an ALIP-based MPC foot placement controller for sub-optimal footstep planning and the learned policy for refining footstep adjustments, enabling the resulting footstep policy to capture the robot's whole-body dynamics effectively. This integration synergizes the predictive capability of MPC with the flexibility and adaptability of RL. We validate the effectiveness of our framework through a series of experiments using the full-body humanoid robot DRACO 3. The results demonstrate significant improvements in dynamic locomotion performance, including better tracking of a wide range of walking speeds, enabling reliable turning and traversing challenging terrains while preserving the robustness and stability of the walking gaits compared to the baseline ALIP-based MPC approach.
- [104] arXiv:2407.17684 [pdf, html, other]
-
Title: Semi-Compressed CRYSTALS-KyberComments: 17 pages, 1 figureSubjects: Cryptography and Security (cs.CR)
In this paper, we investigate the communication overhead of the Kyber, which has recently been standardized by the National Institute of Standards and Technology (NIST). Given the same decryption failure rate (DFR) and security argument, we show it is feasible to reduce the communication overhead of the Kyber by 54%. The improvement is based on two technologies: ciphertext quantization and plaintext encoding. First, we prove that the Lloyd-Max quantization is optimal to minimize the decryption decoding noise. The original Kyber compression function is not optimal. Second, we propose an encoding scheme, which combines Pulse-Amplitude Modulation (PAM), Gray mapping, and a binary error correcting code. An explicit expression for the DFR is derived. The minimum possible communication overhead is also derived. Finally, we demonstrate that with the Lloyd-Max quantization, 8-PAM, Gray mapping, and a shortened binary BCH(768,638,13) code, the proposed scheme encapsulates 638 bits (e.g., 2.5 AES keys) in a single ciphertext.
- [105] arXiv:2407.17686 [pdf, html, other]
-
Title: Transformers on Markov Data: Constant Depth SufficesComments: 29 pages, 10 figuresSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (stat.ML)
Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from \kth Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from \kth Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context conditional empirical distribution for \kth Markov sources, concurring with our empirical observations. Along the way, we prove that \textit{attention-only} transformers with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. These results provide more insight into our current understanding of the mechanisms by which transformers learn to capture context, by understanding their behavior on Markov sources.
- [106] arXiv:2407.17687 [pdf, html, other]
-
Title: Overcome the Difficulties of NSGA-II via Truthful Crowding Distance with Theoretical GuaranteesSubjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
The NSGA-II is proven to encounter difficulties for more than two objectives, and the deduced reason is the crowding distance computed by regarding the different objectives independently. The recent theoretical efficiency of the NSGA-III and the SMS-EMOA also supports the deduced reason as both algorithms consider the dependencies of objectives in the second criterion after the non-dominated sorting but with complicated structure or difficult computation. However, there is still a question of whether a simple modification of the original crowding distance can help.
This paper proposes such a variant, called truthful crowding distance. This variant inherits the simple structure of summing the component for each objective. For each objective, it first sorts the set of solutions in order of descending objective values, and uses the smallest normalized L1 distance between the current solution and solutions in the earlier positions of the sorted list as the component. Summing up all components gives the value of truthful crowding distance. We call this NSGA-II variant by NSGA-II-T that replaces the original crowding distance with the truthful one, and that sequentially updates the crowding distance value after each removal.
We prove that the NSGA-II-T can efficiently cover the full Pareto front for many-objective mOneMinMax and mOJZJ, in contrast to the exponential runtime of the original NSGA-II. Besides, we also prove that it theoretically achieves a slightly better approximation of the Pareto front for OneMinMax than the original NSGA-II with sequential survival selection. Besides, it is the first NSGA-II variant with a simple structure that performs well for many objectives with theoretical guarantees. - [107] arXiv:2407.17688 [pdf, html, other]
-
Title: Examining the Influence of Political Bias on Large Language Model Performance in Stance ClassificationComments: Accepted at ICWSM 2025Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated remarkable capabilities in executing tasks based on natural language queries. However, these models, trained on curated datasets, inherently embody biases ranging from racial to national and gender biases. It remains uncertain whether these biases impact the performance of LLMs for certain tasks. In this study, we investigate the political biases of LLMs within the stance classification task, specifically examining whether these models exhibit a tendency to more accurately classify politically-charged stances. Utilizing three datasets, seven LLMs, and four distinct prompting schemes, we analyze the performance of LLMs on politically oriented statements and targets. Our findings reveal a statistically significant difference in the performance of LLMs across various politically oriented stance classification tasks. Furthermore, we observe that this difference primarily manifests at the dataset level, with models and prompting schemes showing statistically similar performances across different stance classification datasets. Lastly, we observe that when there is greater ambiguity in the target the statement is directed towards, LLMs have poorer stance classification accuracy.
- [108] arXiv:2407.17689 [pdf, html, other]
-
Title: SAM-MIL: A Spatial Contextual Aware Multiple Instance Learning Approach for Whole Slide Image ClassificationComments: accepted by ACM Multimedia 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Multiple Instance Learning (MIL) represents the predominant framework in Whole Slide Image (WSI) classification, covering aspects such as sub-typing, diagnosis, and beyond. Current MIL models predominantly rely on instance-level features derived from pretrained models such as ResNet. These models segment each WSI into independent patches and extract features from these local patches, leading to a significant loss of global spatial context and restricting the model's focus to merely local features. To address this issue, we propose a novel MIL framework, named SAM-MIL, that emphasizes spatial contextual awareness and explicitly incorporates spatial context by extracting comprehensive, image-level information. The Segment Anything Model (SAM) represents a pioneering visual segmentation foundational model that can capture segmentation features without the need for additional fine-tuning, rendering it an outstanding tool for extracting spatial context directly from raw WSIs. Our approach includes the design of group feature extraction based on spatial context and a SAM-Guided Group Masking strategy to mitigate class imbalance issues. We implement a dynamic mask ratio for different segmentation categories and supplement these with representative group features of categories. Moreover, SAM-MIL divides instances to generate additional pseudo-bags, thereby augmenting the training set, and introduces consistency of spatial context across pseudo-bags to further enhance the model's performance. Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets demonstrate that our proposed SAM-MIL model outperforms existing mainstream methods in WSIs classification. Our open-source implementation code is is available at this https URL.
- [109] arXiv:2407.17691 [pdf, html, other]
-
Title: Design, Key Techniques and System-Level Simulation for NB-IoT NetworksShutao Zhang, Peiran Wu, Hongqing Huang, Liya Zhu, Yijia Guo, Wenkun Wen, Tingting Yang, Minghua XiaSubjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Narrowband Internet of Things (NB-IoT) is a promising technology designated specially by the 3rd Generation Partnership Project (3GPP) to meet the growing demand of massive machine-type communications (mMTC). More and more industrial companies choose NB-IoT network as the solution to mMTC due to its unique design and technical specification released by 3GPP. In order to evaluate the performance of NB-IoT network, we design a system-level simulation for NB-IoT network in this paper. In particular, the structure of system-level simulator are divided into four parts, i.e., initialization, pre-generation, main simulation loop and post-processing. Moreover, three key techniques are developed in the implementation of NB-IoT network by accounting for enhanced coverage, massive connection and low-power consumption. Simulation results demonstrate the cumulative distribution function curves of signal-to-interference-and-noise ratio are fully compliant with industrial standard, and the performance of throughput explains how NB-IoT network realize massive connection at the cost of data rate.
- [110] arXiv:2407.17695 [pdf, html, other]
-
Title: Enhancing Agent Learning through World Dynamics ModelingSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
While large language models (LLMs) have been increasingly deployed across tasks in language understanding and interactive decision-making, their impressive performance is largely due to the comprehensive and in-depth domain knowledge embedded within them. However, the extent of this knowledge can vary across different domains. Existing methods often assume that LLMs already possess such comprehensive and in-depth knowledge of their environment, overlooking potential gaps in their understanding of actual world dynamics. To address this gap, we introduce Discover, Verify, and Evolve (DiVE), a framework that discovers world dynamics from a small number of demonstrations, verifies the correctness of these dynamics, and evolves new, advanced dynamics tailored to the current situation. Through extensive evaluations, we analyze the impact of each component on performance and compare the automatically generated dynamics from DiVE with human-annotated world dynamics. Our results demonstrate that LLMs guided by DiVE can make better decisions, achieving rewards comparable to human players in the Crafter environment.
- [111] arXiv:2407.17696 [pdf, html, other]
-
Title: Satellite Internet of Things Research ReportSubjects: Systems and Control (eess.SY)
Satellite IoT is an emerging technology that combines the advantages of satellite communication and IoT, providing global coverage, high reliability, and flexible networking. It has a wide range of applications in various fields, including smart agriculture, smart transportation, smart cities, environmental monitoring, and emergency response. With the continuous development of satellite communication, IoT, edge computing, cloud computing, AI, and ML technology, satellite IoT will play an increasingly important role in the future, supporting the digital transformation and sustainable development of society.
- [112] arXiv:2407.17697 [pdf, html, other]
-
Title: Superior Scoring Rules for Probabilistic Evaluation of Single-Label Multi-Class Classification TasksComments: 21 Pages, 3 Figures, 3 TablesSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This study introduces novel superior scoring rules called Penalized Brier Score (PBS) and Penalized Logarithmic Loss (PLL) to improve model evaluation for probabilistic classification. Traditional scoring rules like Brier Score and Logarithmic Loss sometimes assign better scores to misclassifications in comparison with correct classifications. This discrepancy from the actual preference for rewarding correct classifications can lead to suboptimal model selection. By integrating penalties for misclassifications, PBS and PLL modify traditional proper scoring rules to consistently assign better scores to correct predictions. Formal proofs demonstrate that PBS and PLL satisfy strictly proper scoring rule properties while also preferentially rewarding accurate classifications. Experiments showcase the benefits of using PBS and PLL for model selection, model checkpointing, and early stopping. PBS exhibits a higher negative correlation with the F1 score compared to the Brier Score during training. Thus, PBS more effectively identifies optimal checkpoints and early stopping points, leading to improved F1 scores. Comparative analysis verifies models selected by PBS and PLL achieve superior F1 scores. Therefore, PBS and PLL address the gap between uncertainty quantification and accuracy maximization by encapsulating both proper scoring principles and explicit preference for true classifications. The proposed metrics can enhance model evaluation and selection for reliable probabilistic classification.
- [113] arXiv:2407.17699 [pdf, other]
-
Title: SOK: Blockchain for ProvenanceSubjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Provenance, which traces data from its creation to manipulation, is crucial for ensuring data integrity, reliability, and trustworthiness. It is valuable for single-user applications, collaboration within organizations, and across organizations. Blockchain technology has become a popular choice for implementing provenance due to its distributed, transparent, and immutable nature. Numerous studies on blockchain designs are specifically dedicated to provenance, and specialize in this area. Our goal is to provide a new perspective in blockchain based provenance field by identifying the challenges faced and suggesting future research directions. In this paper, we categorize the problem statement into three main research questions to investigate key issues comprehensively and propose a new outlook on the use of blockchains. The first focuses on challenges in non-collaborative, single-source environments, the second examines implications in collaborative environments and different domains such as supply chain, scientific collaboration and digital forensic, and the last one analyzes communication and data exchange challenges between organizations using different blockchains. The interconnected nature of these research questions ensures a thorough exploration of provenance requirements, leading to more effective and secure systems. After analyzing the requirements of provenance in different environments, we provide future design considerations for provenance-based blockchains, including blockchain type, query mechanisms, provenance capture methods, and domain-specific considerations. We also discuss future work and possible extensions in this field.
- [114] arXiv:2407.17703 [pdf, html, other]
-
Title: Context-aware knowledge graph framework for traffic speed forecasting using graph neural networkComments: 13 pages, 4 figuresSubjects: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Human mobility is intricately influenced by urban contexts spatially and temporally, constituting essential domain knowledge in understanding traffic systems. While existing traffic forecasting models primarily rely on raw traffic data and advanced deep learning techniques, incorporating contextual information remains underexplored due to the lack of effective integration frameworks and the complexity of urban contexts. This study proposes a novel context-aware knowledge graph (CKG) framework to enhance traffic speed forecasting by effectively modeling spatial and temporal contexts. Employing a relation-dependent integration strategy, the framework generates context-aware representations from the spatial and temporal units of CKG to capture spatio-temporal dependencies of urban contexts. A CKG-GNN model, combining the CKG, dual-view multi-head self-attention (MHSA), and graph neural network (GNN), is then designed to predict traffic speed using these context-aware representations. Our experiments demonstrate that CKG's configuration significantly influences embedding performance, with ComplEx and KG2E emerging as optimal for embedding spatial and temporal units, respectively. The CKG-GNN model surpasses benchmark models, achieving an average MAE of $3.46\pm0.01$ and a MAPE of $14.76\pm0.09\%$ for traffic speed predictions from 10 to 120 minutes. The dual-view MHSA analysis reveals the crucial role of relation-dependent features from the context-based view and the model's ability to prioritize recent time slots in prediction from the sequence-based view. The CKG framework's model-agnostic nature suggests its potential applicability in various applications of intelligent transportation systems. Overall, this study underscores the importance of incorporating domain-specific contexts into traffic forecasting and merging context-aware knowledge graphs with neural networks to enhance accuracy.
- [115] arXiv:2407.17705 [pdf, other]
-
Title: ALMRR: Anomaly Localization Mamba on Industrial Textured Surface with Feature Reconstruction and RefinementSubjects: Computer Vision and Pattern Recognition (cs.CV)
Unsupervised anomaly localization on industrial textured images has achieved remarkable results through reconstruction-based methods, yet existing approaches based on image reconstruction and feature reconstruc-tion each have their own shortcomings. Firstly, image-based methods tend to reconstruct both normal and anomalous regions well, which lead to over-generalization. Feature-based methods contain a large amount of distin-guishable semantic information, however, its feature structure is redundant and lacks anomalous information, which leads to significant reconstruction errors. In this paper, we propose an Anomaly Localization method based on Mamba with Feature Reconstruction and Refinement(ALMRR) which re-constructs semantic features based on Mamba and then refines them through a feature refinement module. To equip the model with prior knowledge of anomalies, we enhance it by adding artificially simulated anomalies to the original images. Unlike image reconstruction or repair, the features of synthesized defects are repaired along with those of normal areas. Finally, the aligned features containing rich semantic information are fed in-to the refinement module to obtain the anomaly map. Extensive experiments have been conducted on the MVTec-AD-Textured dataset and other real-world industrial dataset, which has demonstrated superior performance com-pared to state-of-the-art (SOTA) methods.
- [116] arXiv:2407.17709 [pdf, html, other]
-
Title: PGD-VIO: An Accurate Plane-Aided Visual-Inertial Odometry with Graph-Based Drift SuppressionSubjects: Robotics (cs.RO)
Generally, high-level features provide more geometrical information compared to point features, which can be exploited to further constrain motions. Planes are commonplace in man-made environments, offering an active means to reduce drift, due to their extensive spatial and temporal observability. To make full use of planar information, we propose a novel visual-inertial odometry (VIO) using an RGBD camera and an inertial measurement unit (IMU), effectively integrating point and plane features in an extended Kalman filter (EKF) framework. Depth information of point features is leveraged to improve the accuracy of point triangulation, while plane features serve as direct observations added into the state vector. Notably, to benefit long-term navigation,a novel graph-based drift detection strategy is proposed to search overlapping and identical structures in the plane map so that the cumulative drift is suppressed subsequently. The experimental results on two public datasets demonstrate that our system outperforms state-of-the-art methods in localization accuracy and meanwhile generates a compact and consistent plane map, free of expensive global bundle adjustment and loop closing techniques.
- [117] arXiv:2407.17710 [pdf, html, other]
-
Title: Revisiting Machine Unlearning with Dimensional AlignmentSubjects: Machine Learning (cs.LG)
Machine unlearning, an emerging research topic focusing on compliance with data privacy regulations, enables trained models to remove the information learned from specific data. While many existing methods indirectly address this issue by intentionally injecting incorrect supervisions, they can drastically and unpredictably alter the decision boundaries and feature spaces, leading to training instability and undesired side effects. To fundamentally approach this task, we first analyze the changes in latent feature spaces between original and retrained models, and observe that the feature representations of samples not involved in training are closely aligned with the feature manifolds of previously seen samples in training. Based on these findings, we introduce a novel evaluation metric for machine unlearning, coined dimensional alignment, which measures the alignment between the eigenspaces of the forget and retain set samples. We employ this metric as a regularizer loss to build a robust and stable unlearning framework, which is further enhanced by integrating a self-distillation loss and an alternating training scheme. Our framework effectively eliminates information from the forget set and preserves knowledge from the retain set. Lastly, we identify critical flaws in established evaluation metrics for machine unlearning, and introduce new evaluation tools that more accurately reflect the fundamental goals of machine unlearning.
- [118] arXiv:2407.17712 [pdf, html, other]
-
Title: Improving Online Algorithms via ML PredictionsComments: Conference version appeared in Neurips 2018Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
In this work we study the problem of using machine-learned predictions to improve the performance of online algorithms. We consider two classical problems, ski rental and non-clairvoyant job scheduling, and obtain new online algorithms that use predictions to make their decisions. These algorithms are oblivious to the performance of the predictor, improve with better predictions, but do not degrade much if the predictions are poor.
- [119] arXiv:2407.17716 [pdf, html, other]
-
Title: Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the EnvironmentSubjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions. To address this task, we propose a text-guided, environment-aware training where an SER model is trained with contaminated speech samples and their paired noise description. We use a pre-trained text encoder to extract the text-based environment embedding and then fuse it to a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through our experiment with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound repository. Our experiment indicates that the text-based environment descriptions processed by a large language model (LLM) produce representations that improve the noise-robustness of the SER system. In addition, our proposed approach with an LLM yields better performance than our environment-agnostic baselines, especially in low signal-to-noise ratio (SNR) conditions. When testing at -5dB SNR level, our proposed method shows better performance than our best baseline model by 31.8 % (arousal), 23.5% (dominance), and 9.5% (valence).
- [120] arXiv:2407.17721 [pdf, html, other]
-
Title: A Two-Stage Imaging Framework Combining CNN and Physics-Informed Neural Networks for Full-Inverse Tomography: A Case Study in Electrical Impedance Tomography (EIT)Xuanxuan Yang (1 and 2), Yangming Zhang (1), Haofeng Chen (1 and 2), Gang Ma (2), Xiaojie Wang (1) ((1) the Institute of Intelligent Machines, Chinese Academy of Sciences, (2) University of Science and Technology of China)Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Physics-Informed Neural Networks (PINNs) are a machine learning technique for solving partial differential equations (PDEs) by incorporating PDEs as loss terms in neural networks and minimizing the loss function during training. Tomographic imaging, a method to reconstruct internal properties from external measurement data, is highly complex and ill-posed, making it an inverse problem. Recently, PINNs have shown significant potential in computational fluid dynamics (CFD) and have advantages in solving inverse problems. However, existing research has primarily focused on semi-inverse Electrical Impedance Tomography (EIT), where internal electric potentials are accessible. The practical full inverse EIT problem, where only boundary voltage measurements are available, remains challenging. To address this, we propose a two-stage hybrid learning framework combining Convolutional Neural Networks (CNNs) and PINNs to solve the full inverse EIT problem. This framework integrates data-driven and model-driven approaches, combines supervised and unsupervised learning, and decouples the forward and inverse problems within the PINN framework in EIT. Stage I: a U-Net constructs an end-to-end mapping from boundary voltage measurements to the internal potential distribution using supervised learning. Stage II: a Multilayer Perceptron (MLP)-based PINN takes the predicted internal potentials as input to solve for the conductivity distribution through unsupervised learning.
- [121] arXiv:2407.17722 [pdf, html, other]
-
Title: Text-Driven Neural Collaborative Filtering Model for Paper Source TracingComments: KDD CUP 2024 OAG-Challenges, Paper Source Tracing, Technical Report of Team AoboSama @ KDD CUP 2024. August 25--29, 2024. Barcelona, SpainSubjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Identifying significant references within the complex interrelations of a citation knowledge graph is challenging, which encompasses connections through citations, authorship, keywords, and other relational attributes. The Paper Source Tracing (PST) task seeks to automate the identification of pivotal references for given scholarly articles utilizing advanced data mining techniques. In the KDD CUP 2024, we design a recommendation-based framework tailored for the PST task. This framework employs the Neural Collaborative Filtering (NCF) model to generate final predictions. To process the textual attributes of the papers and extract input features for the model, we utilize SciBERT, a pre-trained language model. According to the experimental results, our method achieved a score of 0.37814 on the Mean Average Precision (MAP) metric, outperforming baseline models and ranking 11th among all participating teams. The source code is publicly available at this https URL.
- [122] arXiv:2407.17723 [pdf, html, other]
-
Title: Your Graph Recommender is Provably a Single-view Graph Contrastive LearningSubjects: Machine Learning (cs.LG)
Graph recommender (GR) is a type of graph neural network (GNNs) encoder that is customized for extracting information from the user-item interaction graph. Due to its strong performance on the recommendation task, GR has gained significant attention recently. Graph contrastive learning (GCL) is also a popular research direction that aims to learn, often unsupervised, GNNs with certain contrastive objectives. As a general graph representation learning method, GCLs have been widely adopted with the supervised recommendation loss for joint training of GRs. Despite the intersection of GR and GCL research, theoretical understanding of the relationship between the two fields is surprisingly sparse. This vacancy inevitably leads to inefficient scientific research.
In this paper, we aim to bridge the gap between the field of GR and GCL from the perspective of encoders and loss functions. With mild assumptions, we theoretically show an astonishing fact that graph recommender is equivalent to a commonly-used single-view graph contrastive model. Specifically, we find that (1) the classic encoder in GR is essentially a linear graph convolutional network with one-hot inputs, and (2) the loss function in GR is well bounded by a single-view GCL loss with certain hyperparameters. The first observation enables us to explain crucial designs of GR models, e.g., the removal of self-loop and nonlinearity. And the second finding can easily prompt many cross-field research directions. We empirically show a remarkable result that the recommendation loss and the GCL loss can be used interchangeably. The fact that we can train GR models solely with the GCL loss is particularly insightful, since before this work, GCLs were typically viewed as unsupervised methods that need fine-tuning. We also discuss some potential future works inspired by our theory. - [123] arXiv:2407.17726 [pdf, html, other]
-
Title: Multi-modal Data Binding for Survival Analysis Modeling with Incomplete Data and AnnotationsComments: Accepted by MICCAI 2024Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Survival analysis stands as a pivotal process in cancer treatment research, crucial for predicting patient survival rates accurately. Recent advancements in data collection techniques have paved the way for enhancing survival predictions by integrating information from multiple modalities. However, real-world scenarios often present challenges with incomplete data, particularly when dealing with censored survival labels. Prior works have addressed missing modalities but have overlooked incomplete labels, which can introduce bias and limit model efficacy. To bridge this gap, we introduce a novel framework that simultaneously handles incomplete data across modalities and censored survival labels. Our approach employs advanced foundation models to encode individual modalities and align them into a universal representation space for seamless fusion. By generating pseudo labels and incorporating uncertainty, we significantly enhance predictive accuracy. The proposed method demonstrates outstanding prediction accuracy in two survival analysis tasks on both employed datasets. This innovative approach overcomes limitations associated with disparate modalities and improves the feasibility of comprehensive survival analysis using multiple large foundation models.
- [124] arXiv:2407.17730 [pdf, html, other]
-
Title: Are Large Language Models Possible to Conduct Cognitive Behavioral Therapy?Hao Shen, Zihan Li, Minqiang Yang, Minghui Ni, Yongfeng Tao, Zhengyang Yu, Weihao Zheng, Chen Xu, Bin HuSubjects: Computation and Language (cs.CL)
In contemporary society, the issue of psychological health has become increasingly prominent, characterized by the diversification, complexity, and universality of mental disorders. Cognitive Behavioral Therapy (CBT), currently the most influential and clinically effective psychological treatment method with no side effects, has limited coverage and poor quality in most countries. In recent years, researches on the recognition and intervention of emotional disorders using large language models (LLMs) have been validated, providing new possibilities for psychological assistance therapy. However, are LLMs truly possible to conduct cognitive behavioral therapy? Many concerns have been raised by mental health experts regarding the use of LLMs for therapy. Seeking to answer this question, we collected real CBT corpus from online video websites, designed and conducted a targeted automatic evaluation framework involving the evaluation of emotion tendency of generated text, structured dialogue pattern and proactive inquiry ability. For emotion tendency, we calculate the emotion tendency score of the CBT dialogue text generated by each model. For structured dialogue pattern, we use a diverse range of automatic evaluation metrics to compare speaking style, the ability to maintain consistency of topic and the use of technology in CBT between different models . As for inquiring to guide the patient, we utilize PQA (Proactive Questioning Ability) metric. We also evaluated the CBT ability of the LLM after integrating a CBT knowledge base to explore the help of introducing additional knowledge to enhance the model's CBT counseling ability. Four LLM variants with excellent performance on natural language processing are evaluated, and the experimental result shows the great potential of LLMs in psychological counseling realm, especially after combining with other technological means.
- [125] arXiv:2407.17734 [pdf, html, other]
-
Title: Cost-effective Instruction Learning for Pathology Vision and Language AnalysisKaitao Chen, Mianxin Liu, Fang Yan, Lei Ma, Xiaoming Shi, Lilong Wang, Xiaosong Wang, Lifeng Zhu, Zhe Wang, Mu Zhou, Shaoting ZhangSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
The advent of vision-language models fosters the interactive conversations between AI-enabled models and humans. Yet applying these models into clinics must deal with daunting challenges around large-scale training data, financial, and computational resources. Here we propose a cost-effective instruction learning framework for conversational pathology named as CLOVER. CLOVER only trains a lightweight module and uses instruction tuning while freezing the parameters of the large language model. Instead of using costly GPT-4, we propose well-designed prompts on GPT-3.5 for building generation-based instructions, emphasizing the utility of pathological knowledge derived from the Internet source. To augment the use of instructions, we construct a high-quality set of template-based instructions in the context of digital pathology. From two benchmark datasets, our findings reveal the strength of hybrid-form instructions in the visual question-answer in pathology. Extensive results show the cost-effectiveness of CLOVER in answering both open-ended and closed-ended questions, where CLOVER outperforms strong baselines that possess 37 times more training parameters and use instruction data generated from GPT-4. Through the instruction tuning, CLOVER exhibits robustness of few-shot learning in the external clinical dataset. These findings demonstrate that cost-effective modeling of CLOVER could accelerate the adoption of rapid conversational applications in the landscape of digital pathology.
- [126] arXiv:2407.17737 [pdf, other]
-
Title: Control Informed Design of the IAC Autonomous Racecar for Operation at the Dynamic EnvelopeComments: 6 pages, 6 figures, 2021 IEEE International Conference on Robotics and Automation, Workshop on Opportunities and Challenges with Autonomous RacingSubjects: Systems and Control (eess.SY)
This article introduces the hardware-software co-design of the control system for an autonomy-enabled formula-style high-speed racecar that will be utilized as the deployment platform for high-level autonomy in the first ever head-to-head driverless race called the Indy Autonomous Challenge. The embedded control system needs to facilitate autonomous functionality, including perception, localization, and by-wire actuation, at high speeds and dynamic limits of the vehicle. Rapid maneuvering during the race, however, excites transient dynamics of the vehicle and the actuators. Compared to current autonomous driving focused on highway cruising and urban traffic, transient vehicle control imposes new challenges to the algorithm and system design. The presented work introduces the cascaded control structure employed by the IAC prototype to fully exploit the time scale separation between different control tasks. It is demonstrated by example way how the model-based control strategies and simulation are utilized to inform the decisions in the actuation, computation, perception, and software pipeline design decisions for the first-of-its-kind IAC racecar.
- [127] arXiv:2407.17738 [pdf, html, other]
-
Title: Enhancing Fine-grained Object Detection in Aerial Images via Orthogonal MappingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Fine-Grained Object Detection (FGOD) is a critical task in high-resolution aerial image analysis. This letter introduces Orthogonal Mapping (OM), a simple yet effective method aimed at addressing the challenge of semantic confusion inherent in FGOD. OM introduces orthogonal constraints in the feature space by decoupling features from the last layer of the classification branch with a class-wise orthogonal vector basis. This effectively mitigates semantic confusion and enhances classification accuracy. Moreover, OM can be seamlessly integrated into mainstream object detectors. Extensive experiments conducted on three FGOD datasets (FAIR1M, ShipRSImageNet, and MAR20) demonstrate the effectiveness and superiority of the proposed approach. Notably, with just one line of code, OM achieves a 4.08% improvement in mean Average Precision (mAP) over FCOS on the ShipRSImageNet dataset. Codes are released at this https URL.
- [128] arXiv:2407.17742 [pdf, html, other]
-
Title: A high-order, high-efficiency adaptive time filter algorithm for shale reservoir model based on coupled fluid flow with porous media flowSubjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
In this paper, a third-order time adaptive algorithm with less computation, low complexity is provided for shale reservoir model based on coupled fluid flow with porous media flow. The algorithm combines the three-step linear time filters method for simple post-processing and the second-order backward differential formula (BDF2), is third-order accurate and provides, at no extra computational complexity. At the same time, the time filter method can also be used to damp non-physical oscillations inherent in the BDF2 method, ensuring stability. We proves the variable time stepsize second-order backward differential formula plus time filter (BDF2-TF) algorithm's stability and the convergence properties of the fluid velocity u and hydraulic head $\phi$ in the $L^2$ norm with an order of $O(k_{n+1}^3 + h^3)$. In the experiments, the adaptive algorithm automatically adjusts the time step in response to the varying characteristics of different models, ensuring that errors are maintained within acceptable limits. This algorithm addresses the issue that high-order algorithms may select inappropriate time steps, resulting in instability or reduced precision of the numerical solution, thereby enhancing calculation accuracy and efficiency. We perform three-dimensional numerical experiments to verify the BDF2-TF algorithm's effectiveness, stability, and third-order convergence. Simultaneously, a simplified model is employed to simulate the process of shale oil extraction from reservoirs, further demonstrating the algorithm's practical applicability.
- [129] arXiv:2407.17743 [pdf, html, other]
-
Title: A Proposal for a Debugging Learning Support Environment for Undergraduate Students Majoring in Computer ScienceSubjects: Software Engineering (cs.SE)
In software development, encountering bugs is inevitable. However, opportunities to learn more about bug removal are limited. When students perform debugging tasks, they often use print statements because students do not know how to use a debugger or have never used this http URL this study, among various debugging methods, we focused on debugging using breakpoints. We implemented a function in Scratch, a visual programming language, that allows for self-learning of correct breakpoint placement and systematic debugging this http URL this paper, we discuss experimental results that clarify the changes that occur in subjects when they learn debugging in Scratch.
- [130] arXiv:2407.17744 [pdf, html, other]
-
Title: Balancing Complementarity and Consistency via Delayed Activation in Incomplete Multi-view ClusteringSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper study one challenging issue in incomplete multi-view clustering, where valuable complementary information from other views is always ignored. To be specific, we propose a framework that effectively balances Complementarity and Consistency information in Incomplete Multi-view Clustering (CoCo-IMC). Specifically, we design a dual network of delayed activation, which achieves a balance of complementarity and consistency among different views. The delayed activation could enriches the complementarity information that was ignored during consistency learning. Then, we recover the incomplete information and enhance the consistency learning by minimizing the conditional entropy and maximizing the mutual information across different views. This could be the first theoretical attempt to incorporate delayed activation into incomplete data recovery and the balance of complementarity and consistency. We have proved the effectiveness of CoCo-IMC in extensive comparative experiments with 12 state-of-the-art baselines on four publicly available datasets.
- [131] arXiv:2407.17745 [pdf, html, other]
-
Title: Beyond Entity Alignment: Towards Complete Knowledge Graph Alignment via Entity-Relation SynergySubjects: Computation and Language (cs.CL)
Knowledge Graph Alignment (KGA) aims to integrate knowledge from multiple sources to address the limitations of individual Knowledge Graphs (KGs) in terms of coverage and depth. However, current KGA models fall short in achieving a ``complete'' knowledge graph alignment. Existing models primarily emphasize the linkage of cross-graph entities but overlook aligning relations across KGs, thereby providing only a partial solution to KGA. The semantic correlations embedded in relations are largely overlooked, potentially restricting a comprehensive understanding of cross-KG signals. In this paper, we propose to conceptualize relation alignment as an independent task and conduct KGA by decomposing it into two distinct but highly correlated sub-tasks: entity alignment and relation alignment. To capture the mutually reinforcing correlations between these objectives, we propose a novel Expectation-Maximization-based model, EREM, which iteratively optimizes both sub-tasks. Experimental results on real-world datasets demonstrate that EREM consistently outperforms state-of-the-art models in both entity alignment and relation alignment tasks.
- [132] arXiv:2407.17754 [pdf, html, other]
-
Title: DualFed: Enjoying both Generalization and Personalization in Federated Learning via Hierachical RepresentationsComments: Accepted by ACM MutltiMedia 2024Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
In personalized federated learning (PFL), it is widely recognized that achieving both high model generalization and effective personalization poses a significant challenge due to their conflicting nature. As a result, existing PFL methods can only manage a trade-off between these two objectives. This raises an interesting question: Is it feasible to develop a model capable of achieving both objectives simultaneously? Our paper presents an affirmative answer, and the key lies in the observation that deep models inherently exhibit hierarchical architectures, which produce representations with various levels of generalization and personalization at different stages. A straightforward approach stemming from this observation is to select multiple representations from these layers and combine them to concurrently achieve generalization and personalization. However, the number of candidate representations is commonly huge, which makes this method infeasible due to high computational this http URL address this problem, we propose DualFed, a new method that can directly yield dual representations correspond to generalization and personalization respectively, thereby simplifying the optimization task. Specifically, DualFed inserts a personalized projection network between the encoder and classifier. The pre-projection representations are able to capture generalized information shareable across clients, and the post-projection representations are effective to capture task-specific information on local clients. This design minimizes the mutual interference between generalization and personalization, thereby achieving a win-win situation. Extensive experiments show that DualFed can outperform other FL methods. Code is available at this https URL.
- [133] arXiv:2407.17755 [pdf, html, other]
-
Title: Enhancing Eye Disease Diagnosis with Deep Learning and Synthetic Data AugmentationComments: 18 pages, 7 figures, 2 TablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, the focus is on improving the diagnosis of diabetic retinopathy (DR) using machine learning and deep learning technologies. Researchers have explored various approaches, including the use of high-definition medical imaging, AI-driven algorithms such as convolutional neural networks (CNNs) and generative adversarial networks (GANs). Among all the available tools, CNNs have emerged as a preferred tool due to their superior classification accuracy and efficiency. Although the accuracy of CNNs is comparatively better but it can be improved by introducing some hybrid models by combining various machine learning and deep learning models. Therefore, in this paper, an ensemble learning technique is proposed for early detection and management of DR with higher accuracy. The proposed model is tested on the APTOS dataset and it is showing supremacy on the validation accuracy ($99\%)$ in comparison to the previous models. Hence, the model can be helpful for early detection and treatment of the DR, thereby enhancing the overall quality of care for affected individuals.
- [134] arXiv:2407.17756 [pdf, other]
-
Title: Preliminary Results of Neuromorphic Controller Design and a Parkinson's Disease Dataset Building for Closed-Loop Deep Brain StimulationSubjects: Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Parkinson's Disease afflicts millions of individuals globally. Emerging as a promising brain rehabilitation therapy for Parkinson's Disease, Closed-loop Deep Brain Stimulation (CL-DBS) aims to alleviate motor symptoms. The CL-DBS system comprises an implanted battery-powered medical device in the chest that sends stimulation signals to the brains of patients. These electrical stimulation signals are delivered to targeted brain regions via electrodes, with the magnitude of stimuli adjustable. However, current CL-DBS systems utilize energy-inefficient approaches, including reinforcement learning, fuzzy interface, and field-programmable gate array (FPGA), among others. These approaches make the traditional CL-DBS system impractical for implanted and wearable medical devices. This research proposes a novel neuromorphic approach that builds upon Leaky Integrate and Fire neuron (LIF) controllers to adjust the magnitude of DBS electric signals according to the various severities of PD patients. Our neuromorphic controllers, on-off LIF controller, and dual LIF controller, successfully reduced the power consumption of CL-DBS systems by 19% and 56%, respectively. Meanwhile, the suppression efficiency increased by 4.7% and 6.77%. Additionally, to address the data scarcity of Parkinson's Disease symptoms, we built Parkinson's Disease datasets that include the raw neural activities from the subthalamic nucleus at beta oscillations, which are typical physiological biomarkers for Parkinson's Disease.
- [135] arXiv:2407.17757 [pdf, html, other]
-
Title: CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus AttentionsHaicheng Liao, Haoyu Sun, Huanming Shen, Chengyue Wang, Kahou Tam, Chunlin Tian, Li Li, Chengzhong Xu, Zhenning LiSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs). This task presents substantial challenges stemming from the unpredictable nature of traffic accidents, their long-tail distribution, the intricacies of traffic scene dynamics, and the inherently constrained field of vision of onboard cameras. To address these challenges, this study introduces a novel accident anticipation framework for AVs, termed CRASH. It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion. Specifically, we develop the object-aware module to prioritize high-risk objects in complex and ambiguous environments by calculating the spatial-temporal relationships between traffic agents. In parallel, the context-aware is also devised to extend global visual information from the temporal to the frequency domain using the Fast Fourier Transform (FFT) and capture fine-grained visual features of potential objects and broader context cues within traffic scenes. To capture a wider range of visual cues, we further propose a multi-layer fusion that dynamically computes the temporal dependencies between different scenes and iteratively updates the correlations between different visual features for accurate and timely accident prediction. Evaluated on real-world datasets--Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), and AnAn Accident Detection (A3D) datasets--our model surpasses existing top baselines in critical evaluation metrics like Average Precision (AP) and mean Time-To-Accident (mTTA). Importantly, its robustness and adaptability are particularly evident in challenging driving scenarios with missing or limited training data, demonstrating significant potential for application in real-world autonomous driving systems.
- [136] arXiv:2407.17760 [pdf, html, other]
-
Title: TwIPS: A Large Language Model Powered Texting Application to Simplify Conversational Nuances for Autistic UsersSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Autistic individuals often experience difficulties in conveying and interpreting emotional tone and non-literal nuances. Many also mask their communication style to avoid being misconstrued by others, spending considerable time and mental effort in the process. To address these challenges in text-based communication, we present TwIPS, a prototype texting application powered by a large language model (LLM), which can assist users with: a) deciphering tone and meaning of incoming messages, b) ensuring the emotional tone of their message is in line with their intent, and c) coming up with alternate phrasing for messages that could be misconstrued and received negatively by others. We leverage an AI-based simulation and a conversational script to evaluate TwIPS with 8 autistic participants in an in-lab setting. Our findings show TwIPS enables a convenient way for participants to seek clarifications, provides a better alternative to tone indicators, and facilitates constructive reflection on writing technique and style. We also examine how autistic users utilize language for self-expression and interpretation in instant messaging, and gather feedback for enhancing our prototype. We conclude with a discussion around balancing user-autonomy with AI-mediation, establishing appropriate trust levels in AI systems, and customization needs if autistic users in the context of AI-assisted communication
- [137] arXiv:2407.17761 [pdf, html, other]
-
Title: Towards the Blockchain Massive Adoption with Permissionless StorageSubjects: Cryptography and Security (cs.CR)
Blockchain technology emerged with the advent of Bitcoin and rapidly developed over the past few decades, becoming widely accepted and known by the public. However, in the past decades, the massive adoption of blockchain technology has yet to come. Rather than the scalability issue, the blockchain application is challenged by its expensive usage cost. However, the high cost of blockchain usage is deeply connected with the blockchain consensus and security mechanism. The permissionless blockchain must maintain its high cost for security against the 51% Attack. Chain users indirectly cover the cost as coins are appointed for blockchain usage fees. This conflict prevents the massive adoption of blockchain. Thus, blockchain must be improved to solve those problems: 1. The cost of blockchain usage should be low enough. 2. The blockchain should remain decentralized. 3. The scalability of blockchain must meet the demand.
In my thesis, new approaches are applied to solve the issues above. The key contribution is the discovery of the useful PoW. It extends the Nakamoto PoW with another usage of file data encoding during the same Nakamoto Consensus computation to prove honest data preservation. Based on this theory, a permissionless storage network is proposed as the new security engine for the blockchain. It bridges the high blockchain security cost to the storage users with real demands who are willing to pay for the storage resource. On the other hand, the chain users can benefit from the low transaction fee. Meanwhile, we also provide a scalability solution to shard the blockchain. It enables high TPS and keeps decentralization. The solutions in this thesis provide the answers to all the dependencies of the massive adoption. - [138] arXiv:2407.17762 [pdf, other]
-
Title: Mpox Detection Advanced: Rapid Epidemic Response Through Synthetic DataYudara Kularathne, Prathapa Janitha, Sithira Ambepitiya, Prarththanan Sothyrajah, Thanveer Ahamed, Dinuka WijesundaraComments: 8 pages, 4 figures, 1 tableSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Rapid development of disease detection models using computer vision is crucial in responding to medical emergencies, such as epidemics or bioterrorism events. Traditional data collection methods are often too slow in these scenarios, requiring innovative approaches for quick, reliable model generation from minimal data. Our study introduces a novel approach by constructing a comprehensive computer vision model to detect Mpox lesions using only synthetic data. Initially, these models generated a diverse set of synthetic images representing Mpox lesions on various body parts (face, back, chest, leg, neck, arm) across different skin tones as defined by the Fitzpatrick scale (fair, brown, dark skin). Subsequently, we trained and tested a vision model with this synthetic dataset to evaluate the diffusion models' efficacy in producing high-quality training data and its impact on the vision model's medical image recognition performance. The results were promising; the vision model achieved a 97% accuracy rate, with 96% precision and recall for Mpox cases, and similarly high metrics for normal and other skin disorder cases, demonstrating its ability to correctly identify true positives and minimize false positives. The model achieved an F1-Score of 96% for Mpox cases and 98% for normal and other skin disorders, reflecting a balanced precision-recall relationship, thus ensuring reliability and robustness in its predictions. Our proposed SynthVision methodology indicates the potential to develop accurate computer vision models with minimal data input for future medical emergencies.
- [139] arXiv:2407.17763 [pdf, html, other]
-
Title: Randomized greedy algorithms for neural network optimizationComments: 26 pages, 12 figuresSubjects: Numerical Analysis (math.NA)
Greedy algorithms have been successfully analyzed and applied in training neural networks for solving variational problems, ensuring guaranteed convergence orders. However, their practical applicability is limited due to the subproblems, which often require an exhaustive search over a discrete dictionary and incur significant computational costs. This limitation becomes critical, especially in high-dimensional problems. In this paper, we propose a more practical approach of randomly discretizing the dictionary at each iteration of the greedy algorithm. We quantify the required size of the randomized discrete dictionary and prove that, with high probability, the proposed algorithm realizes a weak greedy algorithm, achieving optimal convergence orders. Through numerous numerical experiments, we demonstrate the advantage of using randomized discrete dictionaries over a deterministic one by showing orders of magnitude reductions in the size of the discrete dictionary, particularly in higher dimensions.
- [140] arXiv:2407.17765 [pdf, html, other]
-
Title: Utilizing Blockchain and Smart Contracts for Enhanced Fraud Prevention and Minimization in Health Insurance through Multi-Signature Claim ProcessingComments: 2024 IEEE 4th International Conference on Emerging Trends in Networks and Computer Communications (ETNCC 2024Subjects: Cryptography and Security (cs.CR)
Healthcare insurance provides financial support to access medical services for patients while ensuring timely and guaranteed payment for providers. Insurance fraud poses a significant challenge to insurance companies and policyholders, leading to increased costs and compromised healthcare treatment and service delivery. Most frauds, like phantom billing, upcoding, and unbundling, happen due to the lack of required entity participation. Also, claim activities are not transparent and accountable. Fraud can be prevented and minimized by involving every entity and making actions transparent and accountable. This paper proposes a blockchain-powered smart contract-based insurance claim processing mechanism to prevent and minimize fraud in response to this prevailing issue. All entities patients, providers, and insurance companies actively participate in the claim submission, approval, and acknowledgment process through a multi-signature technique. Also, every activity is captured and recorded in the blockchain using smart contracts to make every action transparent and accountable so that no entity can deny its actions and responsibilities. Blockchains' immutable storage property and strong integrity guarantee that recorded activities are not modified. As healthcare systems and insurance companies continue to deal with fraud challenges, this proposed approach holds the potential to significantly reduce fraudulent activities, ultimately benefiting both insurers and policyholders.
- [141] arXiv:2407.17766 [pdf, html, other]
-
Title: Strategic Pseudo-Goal Perturbation for Deadlock-Free Multi-Agent Navigation in Social Mini-GamesSubjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
This work introduces a Strategic Pseudo-Goal Perturbation (SPGP) technique, a novel approach to resolve deadlock situations in multi-agent navigation scenarios. Leveraging the robust framework of Safety Barrier Certificates, our method integrates a strategic perturbation mechanism that guides agents through social mini-games where deadlock and collision occur frequently. The method adopts a strategic calculation process where agents, upon encountering a deadlock select a pseudo goal within a predefined radius around the current position to resolve the deadlock among agents. The calculation is based on controlled strategic algorithm, ensuring that deviation towards pseudo-goal is both purposeful and effective in resolution of deadlock. Once the agent reaches the pseudo goal, it resumes the path towards the original goal, thereby enhancing navigational efficiency and safety. Experimental results demonstrates SPGP's efficacy in reducing deadlock instances and improving overall system throughput in variety of multi-agent navigation scenarios.
- [142] arXiv:2407.17767 [pdf, html, other]
-
Title: Online Learning for Autonomous Management of Intent-based 6G NetworksErciyes Karakaya, Ozgur Ercetin, Huseyin Ozkan, Mehmet Karaca, Elham Dehghan Biyar, Alexandros PalaiosSubjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
The growing complexity of networks and the variety of future scenarios with diverse and often stringent performance requirements call for a higher level of automation. Intent-based management emerges as a solution to attain high level of automation, enabling human operators to solely communicate with the network through high-level intents. The intents consist of the targets in the form of expectations (i.e., latency expectation) from a service and based on the expectations the required network configurations should be done accordingly. It is almost inevitable that when a network action is taken to fulfill one intent, it can cause negative impacts on the performance of another intent, which results in a conflict. In this paper, we aim to address the conflict issue and autonomous management of intent-based networking, and propose an online learning method based on the hierarchical multi-armed bandits approach for an effective management. Thanks to this hierarchical structure, it performs an efficient exploration and exploitation of network configurations with respect to the dynamic network conditions. We show that our algorithm is an effective approach regarding resource allocation and satisfaction of intent expectations.
- [143] arXiv:2407.17770 [pdf, html, other]
-
Title: BotEval: Facilitating Interactive Human EvaluationComments: ACL 2024 SDT, 10 pagesSubjects: Computation and Language (cs.CL)
Following the rapid progress in natural language processing (NLP) models, language models are applied to increasingly more complex interactive tasks such as negotiations and conversation moderations. Having human evaluators directly interact with these NLP models is essential for adequately evaluating the performance on such interactive tasks. We develop BotEval, an easily customizable, open-source, evaluation toolkit that focuses on enabling human-bot interactions as part of the evaluation process, as opposed to human evaluators making judgements for a static input. BotEval balances flexibility for customization and user-friendliness by providing templates for common use cases that span various degrees of complexity and built-in compatibility with popular crowdsourcing platforms. We showcase the numerous useful features of BotEval through a study that evaluates the performance of various chatbots on their effectiveness for conversational moderation and discuss how BotEval differs from other annotation tools.
- [144] arXiv:2407.17771 [pdf, html, other]
-
Title: Banyan: Improved Representation Learning with Explicit StructureComments: First DraftSubjects: Computation and Language (cs.CL)
We present Banyan, an improved model to learn semantic representations by inducing explicit structure over data. In contrast to prior approaches using structure spanning single sentences, Banyan learns by resolving multiple constituent structures into a shared one explicitly incorporating global context. Combined with an improved message-passing scheme inspired by Griffin, Banyan learns significantly better representations, avoids spurious false negatives with contrastive learning, and drastically improves memory efficiency in such explicit-structured models. Using the Self-StrAE framework, we show that Banyan (a) outperforms baselines using sentential structure across various settings (b) matches or outperforms unstructured baselines like GloVe (+augmentations) and a RoBERTa medium (+simcse) pre-trained on 100M tokens, despite having just a handful of (non-embedding) parameters, and (c) also learns effective representations across several low resource (Asian and African) languages as measured on SemRel tasks.
- [145] arXiv:2407.17772 [pdf, html, other]
-
Title: ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion EvaluationSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
ERIT is a novel multimodal dataset designed to facilitate research in a lightweight multimodal fusion. It contains text and image data collected from videos of elderly individuals reacting to various situations, as well as seven emotion labels for each data sample. Because of the use of labeled images of elderly users reacting emotionally, it is also facilitating research on emotion recognition in an underrepresented age group in machine learning visual emotion recognition. The dataset is validated through comprehensive experiments indicating its importance in neural multimodal fusion research.
- [146] arXiv:2407.17773 [pdf, html, other]
-
Title: KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal ModelsEunice Yiu, Maan Qraitem, Charlie Wong, Anisa Noor Majhi, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate SaenkoComments: 9 pages. For the KiVA benchmark, see this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 1,400 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children and adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while models like GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-4V, performs better in tasks involving simple visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of the 3D physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.
- [147] arXiv:2407.17779 [pdf, html, other]
-
Title: DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and CorrectionComments: accepted by ACM MM 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
With the recent burst of 2D and 3D data, cross-modal retrieval has attracted increasing attention recently. However, manual labeling by non-experts will inevitably introduce corrupted annotations given ambiguous 2D/3D content. Though previous works have addressed this issue by designing a naive division strategy with hand-crafted thresholds, their performance generally exhibits great sensitivity to the threshold value. Besides, they fail to fully utilize the valuable supervisory signals within each divided subset. To tackle this problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD) and Adaptive Alignment and Correction (AAC). Specifically, the former performs accurate sample division by adaptive credibility modeling for each sample based on the compensation information within multimodal loss distribution. Then in AAC, samples in distinct subsets are exploited with different alignment strategies to fully enhance the semantic compactness and meanwhile alleviate over-fitting to noisy labels, where a self-correction strategy is introduced to improve the quality of representation. Moreover. To evaluate the effectiveness in real-world scenarios, we introduce a challenging noisy benchmark, namely Objaverse-N200, which comprises 200k-level samples annotated with 1156 realistic noisy labels. Extensive experiments on both traditional and the newly proposed benchmarks demonstrate the generality and superiority of our DAC, where DAC outperforms state-of-the-art models by a large margin. (i.e., with +5.9% gain on ModelNet40 and +5.8% on Objaverse-N200).
- [148] arXiv:2407.17781 [pdf, other]
-
Title: Integrating Ensemble Kalman Filter with AI-based Weather Prediction Model ClimaXSubjects: Machine Learning (cs.LG); Applications (stat.AP)
Artificial intelligence (AI)-based weather prediction research is growing rapidly and has shown to be competitive with the advanced dynamic numerical weather prediction models. However, research combining AI-based weather prediction models with data assimilation remains limited partially because long-term sequential data assimilation cycles are required to evaluate data assimilation systems. This study explores integrating the local ensemble transform Kalman filter (LETKF) with an AI-based weather prediction model ClimaX. Our experiments demonstrated that the ensemble data assimilation cycled stably for the AI-based weather prediction model using covariance inflation and localization techniques inside the LETKF. While ClimaX showed some limitations in capturing flow-dependent error covariance compared to dynamical models, the AI-based ensemble forecasts provided reasonable and beneficial error covariance in sparsely observed regions. These findings highlight the potential of AI models in weather forecasting and the importance of physical consistency and accurate error growth representation in improving ensemble data assimilation.
- [149] arXiv:2407.17783 [pdf, html, other]
-
Title: How Lightweight Can A Vision Transformer BeSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In this paper, we explore a strategy that uses Mixture-of-Experts (MoE) to streamline, rather than augment, vision transformers. Each expert in an MoE layer is a SwiGLU feedforward network, where V and W2 are shared across the layer. No complex attention or convolutional mechanisms are employed. Depth-wise scaling is applied to progressively reduce the size of the hidden layer and the number of experts is increased in stages. Grouped query attention is used. We studied the proposed approach with and without pre-training on small datasets and investigated whether transfer learning works at this scale. We found that the architecture is competitive even at a size of 0.67M parameters.
- [150] arXiv:2407.17786 [pdf, html, other]
-
Title: Topology-Preserving Downsampling of Binary ImagesComments: Accepted to The 18th European Conference on Computer Vision (ECCV) 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We present a novel discrete optimization-based approach to generate downsampled versions of binary images that are guaranteed to have the same topology as the original, measured by the zeroth and first Betti numbers of the black regions, while having good similarity to the original image as measured by IoU and Dice scores. To our best knowledge, all existing binary image downsampling methods do not have such topology-preserving guarantees. We also implemented a baseline morphological operation (dilation)-based approach that always generates topologically correct results. However, we found the similarity scores to be much worse. We demonstrate several applications of our approach. First, generating smaller versions of medical image segmentation masks for easier human inspection. Second, improving the efficiency of binary image operations, including persistent homology computation and shortest path computation, by substituting the original images with smaller ones. In particular, the latter is a novel application that is made feasible only by the full topology-preservation guarantee of our method.
- [151] arXiv:2407.17787 [pdf, html, other]
-
Title: HC-GST: Heterophily-aware Distribution Consistency based Graph Self-trainingComments: accepted by CIKM 2024Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Graph self-training (GST), which selects and assigns pseudo-labels to unlabeled nodes, is popular for tackling label sparsity in graphs. However, recent study on homophily graphs show that GST methods could introduce and amplify distribution shift between training and test nodes as they tend to assign pseudo-labels to nodes they are good at. As GNNs typically perform better on homophilic nodes, there could be potential shifts towards homophilic pseudo-nodes, which is underexplored. Our preliminary experiments on heterophilic graphs verify that these methods can cause shifts in homophily ratio distributions, leading to \textit{training bias} that improves performance on homophilic nodes while degrading it on heterophilic ones. Therefore, we study a novel problem of reducing homophily ratio distribution shifts during self-training on heterophilic graphs. A key challenge is the accurate calculation of homophily ratios and their distributions without extensive labeled data. To tackle them, we propose a novel Heterophily-aware Distribution Consistency-based Graph Self-Training (HC-GST) framework, which estimates homophily ratios using soft labels and optimizes a selection vector to align pseudo-nodes with the global homophily ratio distribution. Extensive experiments on both homophilic and heterophilic graphs show that HC-GST effectively reduces training bias and enhances self-training performance.
- [152] arXiv:2407.17788 [pdf, html, other]
-
Title: PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal RemediationSubjects: Cryptography and Security (cs.CR)
Recent advances in Large Language Models (LLMs) have shown significant potential in enhancing cybersecurity defenses against sophisticated threats. LLM-based penetration testing is an essential step in automating system security evaluations by identifying vulnerabilities. Remediation, the subsequent crucial step, addresses these discovered vulnerabilities. Since details about vulnerabilities, exploitation methods, and software versions offer crucial insights into system weaknesses, integrating penetration testing with vulnerability remediation into a cohesive system has become both intuitive and necessary.
This paper introduces PenHeal, a two-stage LLM-based framework designed to autonomously identify and mitigate security vulnerabilities. The framework integrates two LLM-enabled components: the Pentest Module, which detects multiple vulnerabilities within a system, and the Remediation Module, which recommends optimal remediation strategies. The integration is facilitated through Counterfactual Prompting and an Instructor module that guides the LLMs using external knowledge to explore multiple potential attack paths effectively. Our experimental results demonstrate that PenHeal not only automates the identification and remediation of vulnerabilities but also significantly improves vulnerability coverage by 31%, increases the effectiveness of remediation strategies by 32%, and reduces the associated costs by 46% compared to baseline models. These outcomes highlight the transformative potential of LLMs in reshaping cybersecurity practices, offering an innovative solution to defend against cyber threats. - [153] arXiv:2407.17789 [pdf, html, other]
-
Title: Very Large-Scale Multi-Agent Simulation in AgentScopeComments: We have released code on this https URLSubjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Recent advances in large language models (LLMs) have opened new avenues for applying multi-agent systems in very large-scale simulations. However, there remain several challenges when conducting multi-agent simulations with existing platforms, such as limited scalability and low efficiency, unsatisfied agent diversity, and effort-intensive management processes. To address these challenges, we develop several new features and components for AgentScope, a user-friendly multi-agent platform, enhancing its convenience and flexibility for supporting very large-scale multi-agent simulations. Specifically, we propose an actor-based distributed mechanism as the underlying technological infrastructure towards great scalability and high efficiency, and provide flexible environment support for simulating various real-world scenarios, which enables parallel execution of multiple agents, centralized workflow orchestration, and both inter-agent and agent-environment interactions among agents. Moreover, we integrate an easy-to-use configurable tool and an automatic background generation pipeline in AgentScope, simplifying the process of creating agents with diverse yet detailed background settings. Last but not least, we provide a web-based interface for conveniently monitoring and managing a large number of agents that might deploy across multiple devices. We conduct a comprehensive simulation to demonstrate the effectiveness of the proposed enhancements in AgentScope, and provide detailed observations and discussions to highlight the great potential of applying multi-agent systems in large-scale simulations. The source code is released on GitHub at this https URL to inspire further research and development in large-scale multi-agent simulations.
- [154] arXiv:2407.17790 [pdf, html, other]
-
Title: Exploring the Limitations of Kolmogorov-Arnold Networks in Classification: Insights to Software Training and Hardware Implementationan Duy Tran, Tran Xuan Hieu Le, Thi Diem Tran, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Tinh Nguyen, Yasuhiko NakashimaComments: 6 pages, 3 figures, 2 tablesSubjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Kolmogorov-Arnold Networks (KANs), a novel type of neural network, have recently gained popularity and attention due to the ability to substitute multi-layer perceptions (MLPs) in artificial intelligence (AI) with higher accuracy and interoperability. However, KAN assessment is still limited and cannot provide an in-depth analysis of a specific domain. Furthermore, no study has been conducted on the implementation of KANs in hardware design, which would directly demonstrate whether KANs are truly superior to MLPs in practical applications. As a result, in this paper, we focus on verifying KANs for classification issues, which are a common but significant topic in AI using four different types of datasets. Furthermore, the corresponding hardware implementation is considered using the Vitis high-level synthesis (HLS) tool. To the best of our knowledge, this is the first article to implement hardware for KAN. The results indicate that KANs cannot achieve more accuracy than MLPs in high complex datasets while utilizing substantially higher hardware resources. Therefore, MLP remains an effective approach for achieving accuracy and efficiency in software and hardware implementation.
- [155] arXiv:2407.17791 [pdf, html, other]
-
Title: Investigating learning-independent abstract reasoning in artificial neural networksSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Humans are capable of solving complex abstract reasoning tests. Whether this ability reflects a learning-independent inference mechanism applicable to any novel unlearned problem or whether it is a manifestation of extensive training throughout life is an open question. Addressing this question in humans is challenging because it is impossible to control their prior training. However, assuming a similarity between the cognitive processing of Artificial Neural Networks (ANNs) and humans, the extent to which training is required for ANNs' abstract reasoning is informative about this question in humans. Previous studies demonstrated that ANNs can solve abstract reasoning tests. However, this success required extensive training. In this study, we examined the learning-independent abstract reasoning of ANNs. Specifically, we evaluated their performance without any pretraining, with the ANNs' weights being randomly-initialized, and only change in the process of problem solving. We found that naive ANN models can solve non-trivial visual reasoning tests, similar to those used to evaluate human learning-independent reasoning. We further studied the mechanisms that support this ability. Our results suggest the possibility of learning-independent abstract reasoning that does not require extensive training.
- [156] arXiv:2407.17792 [pdf, html, other]
-
Title: Harnessing Temporal Causality for Advanced Temporal Action DetectionComments: 1st in Moment Queries track at the Ego4D Challenge 2024; 1st in Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
As a fundamental task in long-form video understanding, temporal action detection (TAD) aims to capture inherent temporal relations in untrimmed videos and identify candidate actions with precise boundaries. Over the years, various networks, including convolutions, graphs, and transformers, have been explored for effective temporal modeling for TAD. However, these modules typically treat past and future information equally, overlooking the crucial fact that changes in action boundaries are essentially causal events. Inspired by this insight, we propose leveraging the temporal causality of actions to enhance TAD representation by restricting the model's access to only past or future context. We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on multiple benchmarks. Notably, with CausalTAD, we ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, as well as 1st in the Moment Queries track at the Ego4D Challenge 2024. Our code is available at this https URL.
- [157] arXiv:2407.17795 [pdf, html, other]
-
Title: Enhancing Diversity in Multi-objective Feature SelectionSevil Zanjani Miyandoab, Shahryar Rahnamayan, Azam Asilian Bidgoli, Sevda Ebrahimi, Masoud MakrehchiComments: 8 pages, 3 figures, accepted to be published in IEEE WCCI 2024 conferenceSubjects: Machine Learning (cs.LG)
Feature selection plays a pivotal role in the data preprocessing and model-building pipeline, significantly enhancing model performance, interpretability, and resource efficiency across diverse domains. In population-based optimization methods, the generation of diverse individuals holds utmost importance for adequately exploring the problem landscape, particularly in highly multi-modal multi-objective optimization problems. Our study reveals that, in line with findings from several prior research papers, commonly employed crossover and mutation operations lack the capability to generate high-quality diverse individuals and tend to become confined to limited areas around various local optima. This paper introduces an augmentation to the diversity of the population in the well-established multi-objective scheme of the genetic algorithm, NSGA-II. This enhancement is achieved through two key components: the genuine initialization method and the substitution of the worst individuals with new randomly generated individuals as a re-initialization approach in each generation. The proposed multi-objective feature selection method undergoes testing on twelve real-world classification problems, with the number of features ranging from 2,400 to nearly 50,000. The results demonstrate that replacing the last front of the population with an equivalent number of new random individuals generated using the genuine initialization method and featuring a limited number of features substantially improves the population's quality and, consequently, enhances the performance of the multi-objective algorithm.
- [158] arXiv:2407.17797 [pdf, html, other]
-
Title: A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training ModelsComments: 14 pages, 9 figures, published in ACMMM2024(oral)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain challenging. We note that in CV models, the understanding of images comes from annotated information, while VLP models are designed to learn image representations directly from raw text. Motivated by this discrepancy, we developed the Feature Guidance Attack (FGA), a novel method that uses text representations to direct the perturbation of clean images, resulting in the generation of adversarial images. FGA is orthogonal to many advanced attack strategies in the unimodal domain, facilitating the direct application of rich research findings from the unimodal to the multimodal scenario. By appropriately introducing text attack into FGA, we construct Feature Guidance with Text Attack (FGA-T). Through the interaction of attacking two modalities, FGA-T achieves superior attack effects against VLP models. Moreover, incorporating data augmentation and momentum mechanisms significantly improves the black-box transferability of FGA-T. Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings, offering a unified baseline for exploring the robustness of VLP models.
- [159] arXiv:2407.17801 [pdf, html, other]
-
Title: EEG-SSM: Leveraging State-Space Model for Dementia DetectionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
State-space models (SSMs) have garnered attention for effectively processing long data sequences, reducing the need to segment time series into shorter intervals for model training and inference. Traditionally, SSMs capture only the temporal dynamics of time series data, omitting the equally critical spectral features. This study introduces EEG-SSM, a novel state-space model-based approach for dementia classification using EEG data. Our model features two primary innovations: EEG-SSM temporal and EEG-SSM spectral components. The temporal component is designed to efficiently process EEG sequences of varying lengths, while the spectral component enhances the model by integrating frequency-domain information from EEG signals. The synergy of these components allows EEG-SSM to adeptly manage the complexities of multivariate EEG data, significantly improving accuracy and stability across different temporal resolutions. Demonstrating a remarkable 91.0 percent accuracy in classifying Healthy Control (HC), Frontotemporal Dementia (FTD), and Alzheimer's Disease (AD) groups, EEG-SSM outperforms existing models on the same dataset. The development of EEG-SSM represents an improvement in the use of state-space models for screening dementia, offering more precise and cost-effective tools for clinical neuroscience.
- [160] arXiv:2407.17802 [pdf, html, other]
-
Title: Sample Enrichment via Temporary Operations on Subsequences for Sequential RecommendationComments: 12 pages, 6 figuresSubjects: Information Retrieval (cs.IR)
Sequential recommendation leverages interaction sequences to predict forthcoming user behaviors, crucial for crafting personalized recommendations. However, the true preferences of a user are inherently complex and high-dimensional, while the observed data is merely a simplified and low-dimensional projection of the rich preferences, which often leads to prevalent issues like data sparsity and inaccurate model training. To learn true preferences from the sparse data, most existing works endeavor to introduce some extra information or design some ingenious models. Although they have shown to be effective, extra information usually increases the cost of data collection, and complex models may result in difficulty in deployment. Innovatively, we avoid the use of extra information or alterations to the model; instead, we fill the transformation space between the observed data and the underlying preferences with randomness. Specifically, we propose a novel model-agnostic and highly generic framework for sequential recommendation called sample enrichment via temporary operations on subsequences (SETO), which temporarily and separately enriches the transformation space via sequence enhancement operations with rationality constraints in training. The transformation space not only exists in the process from input samples to preferences but also in preferences to target samples. We highlight our SETO's effectiveness and versatility over multiple representative and state-of-the-art sequential recommendation models (including six single-domain sequential models and two cross-domain sequential models) across multiple real-world datasets (including three single-domain datasets, three cross-domain datasets and a large-scale industry dataset).
- [161] arXiv:2407.17803 [pdf, html, other]
-
Title: Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?Comments: Accepted as a full paper in the technical track at The International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Background: Software Vulnerability (SV) prediction needs large-sized and high-quality data to perform well. Current SV datasets mostly require expensive labeling efforts by experts (human-labeled) and thus are limited in size. Meanwhile, there are growing efforts in automatic SV labeling at scale. However, the fitness of auto-labeled data for SV prediction is still largely unknown. Aims: We quantitatively and qualitatively study the quality and use of the state-of-the-art auto-labeled SV data, D2A, for SV prediction. Method: Using multiple sources and manual validation, we curate clean SV data from human-labeled SV-fixing commits in two well-known projects for investigating the auto-labeled counterparts. Results: We discover that 50+% of the auto-labeled SVs are noisy (incorrectly labeled), and they hardly overlap with the publicly reported ones. Yet, SV prediction models utilizing the noisy auto-labeled SVs can perform up to 22% and 90% better in Matthews Correlation Coefficient and Recall, respectively, than the original models. We also reveal the promises and difficulties of applying noise-reduction methods for automatically addressing the noise in auto-labeled SV data to maximize the data utilization for SV prediction. Conclusions: Our study informs the benefits and challenges of using auto-labeled SVs, paving the way for large-scale SV prediction.
- [162] arXiv:2407.17813 [pdf, html, other]
-
Title: Enhancing Model Performance: Another Approach to Vision-Language Instruction TuningSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The integration of large language models (LLMs) with vision-language (VL) tasks has been a transformative development in the realm of artificial intelligence, highlighting the potential of LLMs as a versatile general-purpose chatbot. However, the current trend in this evolution focuses on the integration of vision and language to create models that can operate in more diverse and real-world contexts. We present a novel approach, termed Bottleneck Adapter, specifically crafted for enhancing the multimodal functionalities of these complex models, enabling joint optimization of the entire multimodal LLM framework through a process known as Multimodal Model Tuning (MMT). Our approach utilizes lightweight adapters to connect the image encoder and LLM without the need for large, complex neural networks. Unlike the conventional modular training schemes, our approach adopts an end-to-end optimization regime, which, when combined with the adapters, facilitates the joint optimization using a significantly smaller parameter set. Our method exhibits robust performance with 90.12\% accuracy, outperforming both human-level performance (88.4\%) and LaVIN-7B (89.41\%).
- [163] arXiv:2407.17814 [pdf, html, other]
-
Title: All-Pairs Suffix-Prefix on Dynamic Set of StringsComments: Accepted for SPIRE 2024Subjects: Data Structures and Algorithms (cs.DS)
The all-pairs suffix-prefix (APSP) problem is a classical problem in string processing which has important applications in bioinformatics. Given a set $\mathcal{S} = \{S_1, \ldots, S_k\}$ of $k$ strings, the APSP problem asks one to compute the longest suffix of $S_i$ that is a prefix of $S_j$ for all $k^2$ ordered pairs $\langle S_i, S_j \rangle$ of strings in $\mathcal{S}$. In this paper, we consider the dynamic version of the APSP problem that allows for insertions of new strings to the set of strings. Our objective is, each time a new string $S_i$ arrives to the current set $\mathcal{S}_{i-1} = \{S_1, \ldots, S_{i-1}\}$ of $i-1$ strings, to compute (1) the longest suffix of $S_i$ that is a prefix of $S_j$ and (2) the longest prefix of $S_i$ that is a suffix of $S_j$ for all $1 \leq j \leq i$. We propose an $O(n)$-space data structure which computes (1) and (2) in $O(|S_i| \log \sigma + i)$ time for each new given string $S_i$, where $n$ is the total length of the strings.
- [164] arXiv:2407.17815 [pdf, html, other]
-
Title: Nested replicator dynamics, nested logit choice, and similarity-based learningComments: 37 pages, 9 figuresSubjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
We consider a model of learning and evolution in games whose action sets are endowed with a partition-based similarity structure intended to capture exogenous similarities between strategies. In this model, revising agents have a higher probability of comparing their current strategy with other strategies that they deem similar, and they switch to the observed strategy with probability proportional to its payoff excess. Because of this implicit bias toward similar strategies, the resulting dynamics - which we call the nested replicator dynamics - do not satisfy any of the standard monotonicity postulates for imitative game dynamics; nonetheless, we show that they retain the main long-run rationality properties of the replicator dynamics, albeit at quantitatively different rates. We also show that the induced dynamics can be viewed as a stimulus-response model in the spirit of Erev & Roth (1998), with choice probabilities given by the nested logit choice rule of Ben-Akiva (1973) and McFadden (1978). This result generalizes an existing relation between the replicator dynamics and the exponential weights algorithm in online learning, and provides an additional layer of interpretation to our analysis and results.
- [165] arXiv:2407.17816 [pdf, html, other]
-
Title: NC-NCD: Novel Class Discovery for Node ClassificationComments: Accepted by CIKM'24Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Novel Class Discovery (NCD) involves identifying new categories within unlabeled data by utilizing knowledge acquired from previously established categories. However, existing NCD methods often struggle to maintain a balance between the performance of old and new categories. Discovering unlabeled new categories in a class-incremental way is more practical but also more challenging, as it is frequently hindered by either catastrophic forgetting of old categories or an inability to learn new ones. Furthermore, the implementation of NCD on continuously scalable graph-structured data remains an under-explored area. In response to these challenges, we introduce for the first time a more practical NCD scenario for node classification (i.e., NC-NCD), and propose a novel self-training framework with prototype replay and distillation called SWORD, adopted to our NC-NCD setting. Our approach enables the model to cluster unlabeled new category nodes after learning labeled nodes while preserving performance on old categories without reliance on old category nodes. SWORD achieves this by employing a self-training strategy to learn new categories and preventing the forgetting of old categories through the joint use of feature prototypes and knowledge distillation. Extensive experiments on four common benchmarks demonstrate the superiority of SWORD over other state-of-the-art methods.
- [166] arXiv:2407.17817 [pdf, html, other]
-
Title: Demystifying Verbatim Memorization in Large Language ModelsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications. Much prior work has studied such verbatim memorization using observational data. To complement such work, we develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences. We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to verbatim memorize sequences, even for out-of-distribution sequences; (3) the generation of memorized sequences is triggered by distributed model states that encode high-level features and makes important use of general language modeling capabilities. Guided by these insights, we develop stress tests to evaluate unlearning methods and find they often fail to remove the verbatim memorized information, while also degrading the LM. Overall, these findings challenge the hypothesis that verbatim memorization stems from specific model weights or mechanisms. Rather, verbatim memorization is intertwined with the LM's general capabilities and thus will be very difficult to isolate and suppress without degrading model quality.
- [167] arXiv:2407.17822 [pdf, other]
-
Title: Advanced deep-reinforcement-learning methods for flow control: group-invariant and positional-encoding networks improve learning speed and qualitySubjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Flow control is key to maximize energy efficiency in a wide range of applications. However, traditional flow-control methods face significant challenges in addressing non-linear systems and high-dimensional data, limiting their application in realistic energy systems. This study advances deep-reinforcement-learning (DRL) methods for flow control, particularly focusing on integrating group-invariant networks and positional encoding into DRL architectures. Our methods leverage multi-agent reinforcement learning (MARL) to exploit policy invariance in space, in combination with group-invariant networks to ensure local symmetry invariance. Additionally, a positional encoding inspired by the transformer architecture is incorporated to provide location information to the agents, mitigating action constraints from strict invariance. The proposed methods are verified using a case study of Rayleigh-Bénard convection, where the goal is to minimize the Nusselt number Nu. The group-invariant neural networks (GI-NNs) show faster convergence compared to the base MARL, achieving better average policy performance. The GI-NNs not only cut DRL training time in half but also notably enhance learning reproducibility. Positional encoding further enhances these results, effectively reducing the minimum Nu and stabilizing convergence. Interestingly, group invariant networks specialize in improving learning speed and positional encoding specializes in improving learning quality. These results demonstrate that choosing a suitable feature-representation method according to the purpose as well as the characteristics of each control problem is essential. We believe that the results of this study will not only inspire novel DRL methods with invariant and unique representations, but also provide useful insights for industrial applications.
- [168] arXiv:2407.17825 [pdf, html, other]
-
Title: Blockchain Takeovers in Web 3.0: An Empirical Study on the TRON-Steem IncidentSubjects: Social and Information Networks (cs.SI); Cryptography and Security (cs.CR)
A fundamental goal of Web 3.0 is to establish a decentralized network and application ecosystem, thereby enabling users to retain control over their data while promoting value exchange. However, the recent Tron-Steem takeover incident poses a significant threat to this vision. In this paper, we present a thorough empirical analysis of the Tron-Steem takeover incident. By conducting a fine-grained reconstruction of the stake and election snapshots within the Steem blockchain, one of the most prominent social-oriented blockchains, we quantify the marked shifts in decentralization pre and post the takeover incident, highlighting the severe threat that blockchain network takeovers pose to the decentralization principle of Web 3.0. Moreover, by employing heuristic methods to identify anomalous voters and conducting clustering analyses on voter behaviors, we unveil the underlying mechanics of takeover strategies employed in the Tron-Steem incident and suggest potential mitigation strategies, which contribute to the enhanced resistance of Web 3.0 networks against similar threats in the future. We believe the insights gleaned from this research help illuminate the challenges imposed by blockchain network takeovers in the Web 3.0 era, suggest ways to foster the development of decentralized technologies and governance, as well as to enhance the protection of Web 3.0 user rights.
- [169] arXiv:2407.17827 [pdf, html, other]
-
Title: Unified Lexical Representation for Interpretable Visual-Language AlignmentSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words. However, lexical representations is difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA.
- [170] arXiv:2407.17829 [pdf, html, other]
-
Title: Image Segmentation via Divisive Normalization: dealing with environmental diversityPablo Hernández-Cámara, Jorge Vila-Tomás, Paula Dauden-Oliver, Nuria Alabau-Bosque, Valero Laparra, Jesús MaloSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Autonomous driving is a challenging scenario for image segmentation due to the presence of uncontrolled environmental conditions and the eventually catastrophic consequences of failures. Previous work suggested that a biologically motivated computation, the so-called Divisive Normalization, could be useful to deal with image variability, but its effects have not been systematically studied over different data sources and environmental factors. Here we put segmentation U-nets augmented with Divisive Normalization to work far from training conditions to find where this adaptation is more critical. We categorize the scenes according to their radiance level and dynamic range (day/night), and according to their achromatic/chromatic contrasts. We also consider video game (synthetic) images to broaden the range of environments. We check the performance in the extreme percentiles of such categorization. Then, we push the limits further by artificially modifying the images in perceptually/environmentally relevant dimensions: luminance, contrasts and spectral radiance. Results show that neural networks with Divisive Normalization get better results in all the scenarios and their performance remains more stable with regard to the considered environmental factors and nature of the source. Finally, we explain the improvements in segmentation performance in two ways: (1) by quantifying the invariance of the responses that incorporate Divisive Normalization, and (2) by illustrating the adaptive nonlinearity of the different layers that depends on the local activity.
- [171] arXiv:2407.17834 [pdf, html, other]
-
Title: Towards the Spectral bias Alleviation by Normalizations in Coordinate NetworksSubjects: Computer Vision and Pattern Recognition (cs.CV)
Representing signals using coordinate networks dominates the area of inverse problems recently, and is widely applied in various scientific computing tasks. Still, there exists an issue of spectral bias in coordinate networks, limiting the capacity to learn high-frequency components. This problem is caused by the pathological distribution of the neural tangent kernel's (NTK's) eigenvalues of coordinate networks. We find that, this pathological distribution could be improved using classical normalization techniques (batch normalization and layer normalization), which are commonly used in convolutional neural networks but rarely used in coordinate networks. We prove that normalization techniques greatly reduces the maximum and variance of NTK's eigenvalues while slightly modifies the mean value, considering the max eigenvalue is much larger than the most, this variance change results in a shift of eigenvalues' distribution from a lower one to a higher one, therefore the spectral bias could be alleviated. Furthermore, we propose two new normalization techniques by combining these two techniques in different ways. The efficacy of these normalization techniques is substantiated by the significant improvements and new state-of-the-arts achieved by applying normalization-based coordinate networks to various tasks, including the image compression, computed tomography reconstruction, shape representation, magnetic resonance imaging, novel view synthesis and multi-view stereo reconstruction.
- [172] arXiv:2407.17835 [pdf, html, other]
-
Title: IsUMap: Manifold Learning and Data Visualization leveraging Vietoris-Rips filtrationsSubjects: Machine Learning (cs.LG); Category Theory (math.CT); Differential Geometry (math.DG); Metric Geometry (math.MG)
This work introduces IsUMap, a novel manifold learning technique that enhances data representation by integrating aspects of UMAP and Isomap with Vietoris-Rips filtrations. We present a systematic and detailed construction of a metric representation for locally distorted metric spaces that captures complex data structures more accurately than the previous schemes. Our approach addresses limitations in existing methods by accommodating non-uniform data distributions and intricate local geometries. We validate its performance through extensive experiments on examples of various geometric objects and benchmark real-world datasets, demonstrating significant improvements in representation quality.
- [173] arXiv:2407.17838 [pdf, other]
-
Title: UMono: Physical Model Informed Hybrid CNN-Transformer Framework for Underwater Monocular Depth EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Underwater monocular depth estimation serves as the foundation for tasks such as 3D reconstruction of underwater scenes. However, due to the influence of light and medium, the underwater environment undergoes a distinctive imaging process, which presents challenges in accurately estimating depth from a single image. The existing methods fail to consider the unique characteristics of underwater environments, leading to inadequate estimation results and limited generalization performance. Furthermore, underwater depth estimation requires extracting and fusing both local and global features, which is not fully explored in existing methods. In this paper, an end-to-end learning framework for underwater monocular depth estimation called UMono is presented, which incorporates underwater image formation model characteristics into network architecture, and effectively utilize both local and global features of underwater image. Experimental results demonstrate that the proposed method is effective for underwater monocular depth estimation and outperforms the existing methods in both quantitative and qualitative analyses.
- [174] arXiv:2407.17839 [pdf, html, other]
-
Title: Long-term Fairness in Ride-Hailing PlatformComments: Accepted by ECML PKDD 2024Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Matching in two-sided markets such as ride-hailing has recently received significant attention. However, existing studies on ride-hailing mainly focus on optimising efficiency, and fairness issues in ride-hailing have been neglected. Fairness issues in ride-hailing, including significant earning differences between drivers and variance of passenger waiting times among different locations, have potential impacts on economic and ethical aspects. The recent studies that focus on fairness in ride-hailing exploit traditional optimisation methods and the Markov Decision Process to balance efficiency and fairness. However, there are several issues in these existing studies, such as myopic short-term decision-making from traditional optimisation and instability of fairness in a comparably longer horizon from both traditional optimisation and Markov Decision Process-based methods. To address these issues, we propose a dynamic Markov Decision Process model to alleviate fairness issues currently faced by ride-hailing, and seek a balance between efficiency and fairness, with two distinct characteristics: (i) a prediction module to predict the number of requests that will be raised in the future from different locations to allow the proposed method to consider long-term fairness based on the whole timeline instead of consider fairness only based on historical and current data patterns; (ii) a customised scalarisation function for multi-objective multi-agent Q Learning that aims to balance efficiency and fairness. Extensive experiments on a publicly available real-world dataset demonstrate that our proposed method outperforms existing state-of-the-art methods.
- [175] arXiv:2407.17840 [pdf, other]
-
Title: Complex picking via entanglement of granular mechanical metamaterialsSubjects: Robotics (cs.RO); Applied Physics (physics.app-ph)
When objects are packed in a cluster, physical interactions are unavoidable. Such interactions emerge because of the objects geometric features; some of these features promote entanglement, while others create repulsion. When entanglement occurs, the cluster exhibits a global, complex behaviour, which arises from the stochastic interactions between objects. We hereby refer to such a cluster as an entangled granular metamaterial. We investigate the geometrical features of the objects which make up the cluster, henceforth referred to as grains, that maximise entanglement. We hypothesise that a cluster composed from grains with high propensity to tangle, will also show propensity to interact with a second cluster of tangled objects. To demonstrate this, we use the entangled granular metamaterials to perform complex robotic picking tasks, where conventional grippers struggle. We employ an electromagnet to attract the metamaterial (ferromagnetic) and drop it onto a second cluster of objects (targets, non-ferromagnetic). When the electromagnet is re-activated, the entanglement ensures that both the metamaterial and the targets are picked, with varying degrees of physical engagement that strongly depend on geometric features. Interestingly, although the metamaterials structural arrangement is random, it creates repeatable and consistent interactions with a second tangled media, enabling robust picking of the latter.
- [176] arXiv:2407.17841 [pdf, html, other]
-
Title: Two-Timescale Design for Movable Antenna Array-Enabled Multiuser Uplink CommunicationsSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Movable antenna (MA) technology can flexibly reconfigure wireless channels by adjusting antenna positions in a local region, thus owing great potential for enhancing communication performance. This letter investigates MA technology enabled multiuser uplink communications over general Rician fading channels, which consist of a base station (BS) equipped with the MA array and multiple single-antenna users. Since it is practically challenging to collect all instantaneous channel state information (CSI) by traversing all possible antenna positions at the BS, we instead propose a two-timescale scheme for maximizing the ergodic sum rate. Specifically, antenna positions at the BS are first optimized using only the statistical CSI. Subsequently, the receiving beamforming at the BS (for which we consider the three typical zero-forcing (ZF), minimum mean-square error (MMSE) and MMSE with successive interference cancellation (MMSE-SIC) receivers) is designed based on the instantaneous CSI with optimized antenna positions, thus significantly reducing practical implementation complexities. The formulated problems are highly non-convex and we develop projected gradient ascent (PGA) algorithms to effectively handle them. Simulation results illustrate that compared to conventional fixed-position antenna (FPA) array, the MA array can achieve significant performance gains by reaping an additional spatial degree of freedom.
- [177] arXiv:2407.17842 [pdf, html, other]
-
Title: On the Opportunities of (Re)-Exploring Atmospheric Science by Foundation Models: A Case StudyComments: 28 pages, 12 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Most state-of-the-art AI applications in atmospheric science are based on classic deep learning approaches. However, such approaches cannot automatically integrate multiple complicated procedures to construct an intelligent agent, since each functionality is enabled by a separate model learned from independent climate datasets. The emergence of foundation models, especially multimodal foundation models, with their ability to process heterogeneous input data and execute complex tasks, offers a substantial opportunity to overcome this challenge. In this report, we want to explore a central question - how the state-of-the-art foundation model, i.e., GPT-4o, performs various atmospheric scientific tasks. Toward this end, we conduct a case study by categorizing the tasks into four main classes, including climate data processing, physical diagnosis, forecast and prediction, and adaptation and mitigation. For each task, we comprehensively evaluate the GPT-4o's performance along with a concrete discussion. We hope that this report may shed new light on future AI applications and research in atmospheric science.
- [178] arXiv:2407.17843 [pdf, html, other]
-
Title: DragText: Rethinking Text Embedding in Point-based Image EditingComments: 22 pages, 18 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Point-based image editing enables accurate and flexible control through content dragging. However, the role of text embedding in the editing process has not been thoroughly investigated. A significant aspect that remains unexplored is the interaction between text and image embeddings. In this study, we show that during the progressive editing of an input image in a diffusion model, the text embedding remains constant. As the image embedding increasingly diverges from its initial state, the discrepancy between the image and text embeddings presents a significant challenge. Moreover, we found that the text prompt significantly influences the dragging process, particularly in maintaining content integrity and achieving the desired manipulation. To utilize these insights, we propose DragText, which optimizes text embedding in conjunction with the dragging process to pair with the modified image embedding. Simultaneously, we regularize the text optimization process to preserve the integrity of the original text prompt. Our approach can be seamlessly integrated with existing diffusion-based drag methods with only a few lines of code.
- [179] arXiv:2407.17844 [pdf, other]
-
Title: Innovative Speech-Based Deep Learning Approaches for Parkinson's Disease Classification: A Systematic ReviewComments: Submitted in Applied Sciences - peer reviewed Open Access journal. This research was funded by the NWO research programme AiNed Fellowship Grants under the project Responsible AI for Voice Diagnostics (RAIVD) - grant number NGF.1607.22.013Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Parkinson's disease (PD), the second most prevalent neurodegenerative disorder worldwide, frequently presents with early-stage speech impairments. Recent advancements in Artificial Intelligence (AI), particularly deep learning (DL), have significantly enhanced PD diagnosis through the analysis of speech data. Nevertheless, the progress of research is restricted by the limited availability of publicly accessible speech-based PD datasets, primarily due to privacy and ethical concerns. This review covers the latest DL-based AI approaches for speech-based PD classification, focusing on performance, available resources and associated challenges of 33 scientific works published between 2020 and March 2024. These DL approaches are categorized into end-to-end (E2E) learning, transfer learning (TL) and deep acoustic features (DAF) extraction. Among E2E approaches, Convolutional Neural Networks (CNNs) are prevalent, though Transformers are increasingly popular. E2E approaches face challenges such as limited data and computational resources, especially with Transformers. TL addresses these issues by providing more robust PD diagnosis and better generalizability across languages. DAF extraction aims to improve the explainability and interpretability of results by examining the specific effects of deep features on both other DL approaches and more traditional machine learning (ML) methods. However, it often underperforms compared to E2E and TL approaches. This review also discusses unresolved issues related to bias, explainability and privacy, highlighting the need for future research.
- [180] arXiv:2407.17847 [pdf, html, other]
-
Title: Move and Act: Enhanced Object Manipulation and Background Integrity for Image EditingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Current methods commonly utilize three-branch structures of inversion, reconstruction, and editing, to tackle consistent image editing task. However, these methods lack control over the generation position of the edited object and have issues with background preservation. To overcome these limitations, we propose a tuning-free method with only two branches: inversion and editing. This approach allows users to simultaneously edit the object's action and control the generation position of the edited object. Additionally, it achieves improved background preservation. Specifically, we transfer the edited object information to the target area and repair or preserve the background of other areas during the inversion process at a specific time step. In the editing stage, we use the image features in self-attention to query the key and value of the corresponding time step in the inversion to achieve consistent image editing. Impressive image editing results and quantitative evaluation demonstrate the effectiveness of our method. The code is available at this https URL.
- [181] arXiv:2407.17850 [pdf, html, other]
-
Title: FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid EditingComments: ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Current image editing methods primarily utilize DDIM Inversion, employing a two-branch diffusion approach to preserve the attributes and layout of the original image. However, these methods encounter challenges with non-rigid edits, which involve altering the image's layout or structure. Our comprehensive analysis reveals that the high-frequency components of DDIM latent, crucial for retaining the original image's key features and layout, significantly contribute to these limitations. Addressing this, we introduce FlexiEdit, which enhances fidelity to input text prompts by refining DDIM latent, by reducing high-frequency components in targeted editing areas. FlexiEdit comprises two key components: (1) Latent Refinement, which modifies DDIM latent to better accommodate layout adjustments, and (2) Edit Fidelity Enhancement via Re-inversion, aimed at ensuring the edits more accurately reflect the input text prompts. Our approach represents notable progress in image editing, particularly in performing complex non-rigid edits, showcasing its enhanced capability through comparative experiments.
- [182] arXiv:2407.17852 [pdf, html, other]
-
Title: Scaling A Simple Approach to Zero-Shot Speech RecognitionComments: 9 pagesSubjects: Computation and Language (cs.CL)
Despite rapid progress in increasing the language coverage of automatic speech recognition, the field is still far from covering all languages with a known writing script. Recent work showed promising results with a zero-shot approach requiring only a small amount of text data, however, accuracy heavily depends on the quality of the used phonemizer which is often weak for unseen languages. In this paper, we present MMS Zero-shot a conceptually simpler approach based on romanization and an acoustic model trained on data in 1,078 different languages or three orders of magnitude more than prior art. MMS Zero-shot reduces the average character error rate by a relative 46% over 100 unseen languages compared to the best previous work. Moreover, the error rate of our approach is only 2.5x higher compared to in-domain supervised baselines, while our approach uses no labeled data for the evaluation languages at all.
- [183] arXiv:2407.17853 [pdf, html, other]
-
Title: Compilation of Commit Changes within Java Source Code RepositoriesComments: To be published in: ICSME 2024 ProceedingsSubjects: Software Engineering (cs.SE); Programming Languages (cs.PL)
Java applications include third-party dependencies as bytecode. To keep these applications secure, researchers have proposed tools to re-identify dependencies that contain known vulnerabilities. Yet, to allow such re-identification, one must obtain, for each vulnerability patch, the bytecode fixing the respective vulnerability at first. Such patches for dependencies are curated in databases in the form of fix-commits. But fixcommits are in source code, and automatically compiling whole Java projects to bytecode is notoriously hard, particularly for non-current versions of the code. In this paper, we thus propose JESS, an approach that largely avoids this problem by compiling solely the relevant code that was modified within a given commit. JESS reduces the code, retaining only those parts that the committed change references. To avoid name-resolution errors, JESS automatically infers stubs for references to entities that are unavailable to the compiler. A challenge is here that, to facilitate the above mentioned reidentification, JESS must seek to produce bytecode that is almost identical to the bytecode which one would obtain by a successful compilation of the full project. An evaluation on 347 GitHub projects shows that JESS is able to compile, in isolation, 72% of methods and constructors, of which 89% have bytecode equal to the original one. Furthermore, on the Project KB database of fix-commits, in which only 8% of files modified within the commits can be compiled with the provided build scripts, JESS is able to compile 73% of all files that these commits modify.
- [184] arXiv:2407.17854 [pdf, html, other]
-
Title: Shapley Value-based Contrastive Alignment for Multimodal Information ExtractionComments: Accepted at ACM Multimedia 2024Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
The rise of social media and the exponential growth of multimodal communication necessitates advanced techniques for Multimodal Information Extraction (MIE). However, existing methodologies primarily rely on direct Image-Text interactions, a paradigm that often faces significant challenges due to semantic and modality gaps between images and text. In this paper, we introduce a new paradigm of Image-Context-Text interaction, where large multimodal models (LMMs) are utilized to generate descriptive textual context to bridge these gaps. In line with this paradigm, we propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method, which aligns both context-text and context-image pairs. Shap-CA initially applies the Shapley value concept from cooperative game theory to assess the individual contribution of each element in the set of contexts, texts and images towards total semantic and modality overlaps. Following this quantitative evaluation, a contrastive learning strategy is employed to enhance the interactive contribution within context-text/image pairs, while minimizing the influence across these pairs. Furthermore, we design an adaptive fusion module for selective cross-modal fusion. Extensive experiments across four MIE datasets demonstrate that our method significantly outperforms existing state-of-the-art methods.
- [185] arXiv:2407.17856 [pdf, html, other]
-
Title: MDS-ED: Multimodal Decision Support in the Emergency Department -- a Benchmark Dataset for Diagnoses and Deterioration Prediction in Emergency MedicineComments: 14 pages, 1 figure, code available under this https URLSubjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Background: Benchmarking medical decision support algorithms often struggles due to limited access to datasets, narrow prediction tasks, and restricted input modalities. These limitations affect their clinical relevance and performance in high-stakes areas like emergency care, complicating replication, validation, and improvement of benchmarks.
Methods: We introduce a dataset based on MIMIC-IV, benchmarking protocol, and initial results for evaluating multimodal decision support in the emergency department (ED). We use diverse data modalities from the first 1.5 hours of patient arrival, including demographics, biometrics, vital signs, lab values, and electrocardiogram waveforms. We analyze 1443 clinical labels across two contexts: predicting diagnoses with ICD-10 codes and forecasting patient deterioration.
Results: Our multimodal diagnostic model achieves an AUROC score over 0.8 in a statistically significant manner for 357 out of 1428 conditions, including cardiac issues like myocardial infarction and non-cardiac conditions such as renal disease and diabetes. The deterioration model scores above 0.8 in a statistically significant manner for 13 out of 15 targets, including critical events like cardiac arrest and mechanical ventilation, ICU admission as well as short- and long-term mortality. Incorporating raw waveform data significantly improves model performance, which represents one of the first robust demonstrations of this effect.
Conclusions: This study highlights the uniqueness of our dataset, which encompasses a wide range of clinical tasks and utilizes a comprehensive set of features collected early during the emergency after arriving at the ED. The strong performance, as evidenced by high AUROC scores across diagnostic and deterioration targets, underscores the potential of our approach to revolutionize decision-making in acute and emergency medicine. - [186] arXiv:2407.17857 [pdf, html, other]
-
Title: Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex NetworkComments: ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent advancements in graph-based approaches for multiplexed immunofluorescence (mIF) images have significantly propelled the field forward, offering deeper insights into patient-level phenotyping. However, current graph-based methodologies encounter two primary challenges: (1) Cellular Heterogeneity, where existing approaches fail to adequately address the inductive biases inherent in graphs, particularly the homophily characteristic observed in cellular connectivity and; (2) Scalability, where handling cellular graphs from high-dimensional images faces difficulties in managing a high number of cells. To overcome these limitations, we introduce Mew, a novel framework designed to efficiently process mIF images through the lens of multiplex network. Mew innovatively constructs a multiplex network comprising two distinct layers: a Voronoi network for geometric information and a Cell-type network for capturing cell-wise homogeneity. This framework equips a scalable and efficient Graph Neural Network (GNN), capable of processing the entire graph during training. Furthermore, Mew integrates an interpretable attention module that autonomously identifies relevant layers for image classification. Extensive experiments on a real-world patient dataset from various institutions highlight Mew's remarkable efficacy and efficiency, marking a significant advancement in mIF image analysis. The source code of Mew can be found here: \url{this https URL}
- [187] arXiv:2407.17858 [pdf, html, other]
-
Title: 3D Adaptive VEM with stabilization-free a posteriori error boundsSubjects: Numerical Analysis (math.NA)
The present paper extends the theory of Adaptive Virtual Element Methods (AVEMs) to the three-dimensional meshes showing the possibility to bound the stabilization term by the residual-type error estimator. This new bound enables a stabilization-free a posteriori control for the energy error. Following the recent studies for the bi-dimensional case, we investigate the case of tetrahedral elements with aligned edges and faces. We believe that the AVEMs can be an efficient strategy to address the mesh conforming requirements of standard three-dimensional Adaptive Finite Element Methods (AFEMs), which typically extend the refinement procedure to non-marked mesh cells. Indeed, numerical tests on the Fichera corner shape domain show that this method can reduce the number of three-dimensional cells generated in the refinement process by about 30% with compared to standard AFEMs, for a given error threshold.
- [188] arXiv:2407.17862 [pdf, html, other]
-
Title: Exploring Description-Augmented Dataless Intent ClassificationComments: Accepted to the 6th NLP for Conversational AI Workshop at ACL 2024(NLP4ConvAI)Subjects: Computation and Language (cs.CL)
In this work, we introduce several schemes to leverage description-augmented embedding similarity for dataless intent classification using current state-of-the-art (SOTA) text embedding models. We report results of our methods on four commonly used intent classification datasets and compare against previous works of a similar nature. Our work shows promising results for dataless classification scaling to a large number of unseen intents. We show competitive results and significant improvements (+6.12\% Avg.) over strong zero-shot baselines, all without training on labelled or task-specific data. Furthermore, we provide qualitative error analysis of the shortfalls of this methodology to help guide future research in this area.
- [189] arXiv:2407.17863 [pdf, html, other]
-
Title: factgenie: A Framework for Span-based Evaluation of Generated TextsComments: Accepted to INLG 2024 (System Demonstrations)Subjects: Computation and Language (cs.CL)
We present factgenie: a framework for annotating and visualizing word spans in textual model outputs. Annotations can capture various span-based phenomena such as semantic inaccuracies or irrelevant text. With factgenie, the annotations can be collected both from human crowdworkers and large language models. Our framework consists of a web interface for data visualization and gathering text annotations, powered by an easily extensible codebase.
- [190] arXiv:2407.17869 [pdf, html, other]
-
Title: EllipBench: A Large-scale Benchmark for Machine-learning based Ellipsometry ModelingSubjects: Machine Learning (cs.LG)
Ellipsometry is used to indirectly measure the optical properties and thickness of thin films. However, solving the inverse problem of ellipsometry is time-consuming since it involves human expertise to apply the data fitting techniques. Many studies use traditional machine learning-based methods to model the complex mathematical fitting process. In our work, we approach this problem from a deep learning perspective. First, we introduce a large-scale benchmark dataset to facilitate deep learning methods. The proposed dataset encompasses 98 types of thin film materials and 4 types of substrate materials, including metals, alloys, compounds, and polymers, among others. Additionally, we propose a deep learning framework that leverages residual connections and self-attention mechanisms to learn the massive data points. We also introduce a reconstruction loss to address the common challenge of multiple solutions in thin film thickness prediction. Compared to traditional machine learning methods, our framework achieves state-of-the-art (SOTA) performance on our proposed dataset. The dataset and code will be available upon acceptance.
- [191] arXiv:2407.17870 [pdf, html, other]
-
Title: Is the Digital Forensics and Incident Response Pipeline Ready for Text-Based Threats in LLM Era?Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
In the era of generative AI, the widespread adoption of Neural Text Generators (NTGs) presents new cybersecurity challenges, particularly within the realms of Digital Forensics and Incident Response (DFIR). These challenges primarily involve the detection and attribution of sources behind advanced attacks like spearphishing and disinformation campaigns. As NTGs evolve, the task of distinguishing between human and NTG-authored texts becomes critically complex. This paper rigorously evaluates the DFIR pipeline tailored for text-based security systems, specifically focusing on the challenges of detecting and attributing authorship of NTG-authored texts. By introducing a novel human-NTG co-authorship text attack, termed CS-ACT, our study uncovers significant vulnerabilities in traditional DFIR methodologies, highlighting discrepancies between ideal scenarios and real-world conditions. Utilizing 14 diverse datasets and 43 unique NTGs, up to the latest GPT-4, our research identifies substantial vulnerabilities in the forensic profiling phase, particularly in attributing authorship to NTGs. Our comprehensive evaluation points to factors such as model sophistication and the lack of distinctive style within NTGs as significant contributors for these vulnerabilities. Our findings underscore the necessity for more sophisticated and adaptable strategies, such as incorporating adversarial learning, stylizing NTGs, and implementing hierarchical attribution through the mapping of NTG lineages to enhance source attribution. This sets the stage for future research and the development of more resilient text-based security systems.
- [192] arXiv:2407.17874 [pdf, html, other]
-
Title: Improving Domain-Specific ASR with LLM-Generated Contextual DescriptionsComments: Accepted to INTERSPEECH 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
End-to-end automatic speech recognition (E2E ASR) systems have significantly improved speech recognition through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. Moreover, we propose two additional training techniques to improve the domain specific ASR: decoder fine-tuning, and context perturbation. We also propose a method to use a Large Language Model (LLM) to generate descriptions with simple metadata, when descriptions are unavailable. Our experiments demonstrate that proposed methods notably enhance domain-specific ASR accuracy on real-life datasets, with LLM-generated descriptions outperforming human-crafted ones in effectiveness.
- [193] arXiv:2407.17875 [pdf, html, other]
-
Title: Overcoming Binary Adversarial Optimisation with Competitive CoevolutionComments: 38 pages, Accepted at the 18th International Conference on Parallel Problem Solving From Nature (PPSN 2024)Subjects: Neural and Evolutionary Computing (cs.NE)
Co-evolutionary algorithms (CoEAs), which pair candidate designs with test cases, are frequently used in adversarial optimisation, particularly for binary test-based problems where designs and tests yield binary outcomes. The effectiveness of designs is determined by their performance against tests, and the value of tests is based on their ability to identify failing designs, often leading to more sophisticated tests and improved designs. However, CoEAs can exhibit complex, sometimes pathological behaviours like disengagement. Through runtime analysis, we aim to rigorously analyse whether CoEAs can efficiently solve test-based adversarial optimisation problems in an expected polynomial runtime.
This paper carries out the first rigorous runtime analysis of $(1,\lambda)$ CoEA for binary test-based adversarial optimisation problems. In particular, we introduce a binary test-based benchmark problem called \Diagonal problem and initiate the first runtime analysis of competitive CoEA on this problem. The mathematical analysis shows that the $(1,\lambda)$-CoEA can efficiently find an $\varepsilon$ approximation to the optimal solution of the \Diagonal problem, i.e. in expected polynomial runtime assuming sufficiently low mutation rates and large offspring population size. On the other hand, the standard $(1,\lambda)$-EA fails to find an $\varepsilon$ approximation to the optimal solution of the \Diagonal problem in polynomial runtime. This suggests the promising potential of coevolution for solving binary adversarial optimisation problems. - [194] arXiv:2407.17876 [pdf, html, other]
-
Title: A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text SpatializationsComments: To be published at IEEE VIS 2024 conferenceSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
The semantic similarity between documents of a text corpus can be visualized using map-like metaphors based on two-dimensional scatterplot layouts. These layouts result from a dimensionality reduction on the document-term matrix or a representation within a latent embedding, including topic models. Thereby, the resulting layout depends on the input data and hyperparameters of the dimensionality reduction and is therefore affected by changes in them. Furthermore, the resulting layout is affected by changes in the input data and hyperparameters of the dimensionality reduction. However, such changes to the layout require additional cognitive efforts from the user. In this work, we present a sensitivity study that analyzes the stability of these layouts concerning (1) changes in the text corpora, (2) changes in the hyperparameter, and (3) randomness in the initialization. Our approach has two stages: data measurement and data analysis. First, we derived layouts for the combination of three text corpora and six text embeddings and a grid-search-inspired hyperparameter selection of the dimensionality reductions. Afterward, we quantified the similarity of the layouts through ten metrics, concerning local and global structures and class separation. Second, we analyzed the resulting 42817 tabular data points in a descriptive statistical analysis. From this, we derived guidelines for informed decisions on the layout algorithm and highlight specific hyperparameter settings. We provide our implementation as a Git repository at this https URL and results as Zenodo archive at this https URL.
- [195] arXiv:2407.17877 [pdf, html, other]
-
Title: Advancing 3D Point Cloud Understanding through Deep Transfer Learning: A Comprehensive SurveyShahab Saquib Sohail, Yassine Himeur, Hamza Kheddar, Abbes Amira, Fodil Fadli, Shadi Atalla, Abigail Copiaco, Wathiq MansoorComments: 55 pages, 9 tables, and 15 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
The 3D point cloud (3DPC) has significantly evolved and benefited from the advance of deep learning (DL). However, the latter faces various issues, including the lack of data or annotated data, the existence of a significant gap between training data and test data, and the requirement for high computational resources. To that end, deep transfer learning (DTL), which decreases dependency and costs by utilizing knowledge gained from a source data/task in training a target data/task, has been widely investigated. Numerous DTL frameworks have been suggested for aligning point clouds obtained from several scans of the same scene. Additionally, DA, which is a subset of DTL, has been modified to enhance the point cloud data's quality by dealing with noise and missing points. Ultimately, fine-tuning and DA approaches have demonstrated their effectiveness in addressing the distinct difficulties inherent in point cloud data. This paper presents the first review shedding light on this aspect. it provides a comprehensive overview of the latest techniques for understanding 3DPC using DTL and domain adaptation (DA). Accordingly, DTL's background is first presented along with the datasets and evaluation metrics. A well-defined taxonomy is introduced, and detailed comparisons are presented, considering different aspects such as different knowledge transfer strategies, and performance. The paper covers various applications, such as 3DPC object detection, semantic labeling, segmentation, classification, registration, downsampling/upsampling, and denoising. Furthermore, the article discusses the advantages and limitations of the presented frameworks, identifies open challenges, and suggests potential research directions.
- [196] arXiv:2407.17879 [pdf, html, other]
-
Title: HG-PIPE: Vision Transformer Acceleration with Hybrid-Grained PipelineComments: Accepted by ICCAD 2024Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Vision Transformer (ViT) acceleration with field programmable gate array (FPGA) is promising but challenging. Existing FPGA-based ViT accelerators mainly rely on temporal architectures, which process different operators by reusing the same hardware blocks and suffer from extensive memory access overhead. Pipelined architectures, either coarse-grained or fine-grained, unroll the ViT computation spatially for memory access efficiency. However, they usually suffer from significant hardware resource constraints and pipeline bubbles induced by the global computation dependency of ViT. In this paper, we introduce HG-PIPE, a pipelined FPGA accelerator for high-throughput and low-latency ViT processing. HG-PIPE features a hybrid-grained pipeline architecture to reduce on-chip buffer cost and couples the computation dataflow and parallelism design to eliminate the pipeline bubbles. HG-PIPE further introduces careful approximations to implement both linear and non-linear operators with abundant Lookup Tables (LUTs), thus alleviating resource constraints. On a ZCU102 FPGA, HG-PIPE achieves 2.78 times better throughput and 2.52 times better resource efficiency than the prior-art accelerators, e.g., AutoViTAcc. With a VCK190 FPGA, HG-PIPE realizes end-to-end ViT acceleration on a single device and achieves 7118 images/s, which is 2.81 times faster than a V100 GPU.
- [197] arXiv:2407.17880 [pdf, other]
-
Title: DAM: Towards A Foundation Model for Time Series ForecastingLuke Darlow, Qiwen Deng, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Artjom Joosen, Adam Barker, Amos StorkeySubjects: Machine Learning (cs.LG)
It is challenging to scale time series forecasting models such that they forecast accurately for multiple distinct domains and datasets, all with potentially different underlying collection procedures (e.g., sample resolution), patterns (e.g., periodicity), and prediction requirements (e.g., reconstruction vs. forecasting). We call this general task universal forecasting. Existing methods usually assume that input data is regularly sampled, and they forecast to pre-determined horizons, resulting in failure to generalise outside of the scope of their training. We propose the DAM - a neural model that takes randomly sampled histories and outputs an adjustable basis composition as a continuous function of time for forecasting to non-fixed horizons. It involves three key components: (1) a flexible approach for using randomly sampled histories from a long-tail distribution, that enables an efficient global perspective of the underlying temporal dynamics while retaining focus on the recent history; (2) a transformer backbone that is trained on these actively sampled histories to produce, as representational output, (3) the basis coefficients of a continuous function of time. We show that a single univariate DAM, trained on 25 time series datasets, either outperformed or closely matched existing SoTA models at multivariate long-term forecasting across 18 datasets, including 8 held-out for zero-shot transfer, even though these models were trained to specialise for each dataset-horizon combination. This single DAM excels at zero-shot transfer and very-long-term forecasting, performs well at imputation, is interpretable via basis function composition and attention, can be tuned for different inference-cost requirements, is robust to missing and irregularly sampled data {by design}.
- [198] arXiv:2407.17881 [pdf, html, other]
-
Title: Unraveling the Never-Ending Story of Lifecycles and Vitalizing ProcessesSubjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Business process management (BPM) has been widely used to discover, model, analyze, and optimize organizational processes. BPM looks at these processes with analysis techniques that assume a clearly defined start and end. However, not all processes adhere to this logic, with the consequence that their behavior cannot be appropriately captured by BPM analysis techniques. This paper addresses this research problem at a conceptual level. More specifically, we introduce the notion of vitalizing business processes that target the lifecycle process of one or more entities. We show the existence of lifecycle processes in many industries and that their appropriate conceptualizations pave the way for the definition of suitable modeling and analysis techniques. This paper provides a set of requirements for their analysis, and a conceptualization of lifecycle and vitalizing processes.
- [199] arXiv:2407.17889 [pdf, other]
-
Title: An Error Discovery and Correction for the Family of V-Shaped BPSO AlgorithmsComments: 25 pages, 11 figuresSubjects: Neural and Evolutionary Computing (cs.NE)
BPSO algorithm is a swarm intelligence optimization algorithm, which has the characteristics of good optimization effect, high efficiency and easy to implement. In recent years, it has been used to optimize a variety of machine learning and deep learning models, such as CNN, LSTM, SVM, etc. But it is easy to fall into local optimum for the lack of exploitation ability. It is found that in the article, which is different from previous studies, The reason for the poor performance is an error existing in their velocity update function, which leads to abnormal and chaotic behavior of particles. This not only makes the algorithm difficult to converge, but also often searches the repeated space. So, traditionally, it has to rely on a low w value in the later stage to force these algorithms to converge, but also makes them quickly lose their search ability and prone to getting trapped in local optima. This article proposes a velocity legacy term correction method for all V-shaped BPSOs. Experimentals based on 0/1 knapsack problems show that it has a significant effect on accuracy and efficiency for all of the 4 commonly used V-Shaped BPSOs. Therefore it is an significant breakthrough in the field of swarm intelligence.
- [200] arXiv:2407.17892 [pdf, html, other]
-
Title: An Iterative Approach to Topic ModellingAlbert Wong, Florence Wing Yau Cheng, Ashley Keung, Yamileth Hercules, Mary Alexandra Garcia, Yew-Wei Lim, Lien PhamSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Topic modelling has become increasingly popular for summarizing text data, such as social media posts and articles. However, topic modelling is usually completed in one shot. Assessing the quality of resulting topics is challenging. No effective methods or measures have been developed for assessing the results or for making further enhancements to the topics. In this research, we propose we propose to use an iterative process to perform topic modelling that gives rise to a sense of completeness of the resulting topics when the process is complete. Using the BERTopic package, a popular method in topic modelling, we demonstrate how the modelling process can be applied iteratively to arrive at a set of topics that could not be further improved upon using one of the three selected measures for clustering comparison as the decision criteria. This demonstration is conducted using a subset of the COVIDSenti-A dataset. The early success leads us to believe that further research using in using this approach in conjunction with other topic modelling algorithms could be viable.
- [201] arXiv:2407.17893 [pdf, html, other]
-
Title: Micro Visualizations on a Smartwatch: Assessing Reading Performance While WalkingSubjects: Human-Computer Interaction (cs.HC)
With two studies, we assess how different walking trajectories (straight line, circular, and infinity) and speeds (2 km/h, 4 km/h, and 6 km/h) influence the accuracy and response time of participants reading micro visualizations on a smartwatch. We showed our participants common watch face micro visualizations including date, time, weather information, and four complications showing progress charts of fitness data. Our findings suggest that while walking trajectories did not significantly affect reading performance, overall walking activity, especially at high speeds, hurt reading accuracy and, to some extent, response time.
- [202] arXiv:2407.17896 [pdf, html, other]
-
Title: 3D Hole Filling using Deep Learning InpaintingComments: 20 pages, 12 figures, to be submitted to Computers & Graphics JournalSubjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
The current work presents a novel methodology for completing 3D surfaces produced from 3D digitization technologies in places where there is a scarcity of meaningful geometric data. Incomplete or missing data in these three-dimensional (3D) models can lead to erroneous or flawed renderings, limiting their usefulness in a variety of applications such as visualization, geometric computation, and 3D printing. Conventional surface estimation approaches often produce implausible results, especially when dealing with complex surfaces. To address this issue, we propose a technique that incorporates neural network-based 2D inpainting to effectively reconstruct 3D surfaces. Our customized neural networks were trained on a dataset containing over 1 million curvature images. These images show the curvature of vertices as planar representations in 2D. Furthermore, we used a coarse-to-fine surface deformation technique to improve the accuracy of the reconstructed pictures and assure surface adaptability. This strategy enables the system to learn and generalize patterns from input data, resulting in the development of precise and comprehensive three-dimensional surfaces. Our methodology excels in the shape completion process, effectively filling complex holes in three-dimensional surfaces with a remarkable level of realism and precision.
- [203] arXiv:2407.17900 [pdf, html, other]
-
Title: The Power of Combining Data and Knowledge: GPT-4o is an Effective Interpreter of Machine Learning Models in Predicting Lymph Node Metastasis of Lung CancerSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Lymph node metastasis (LNM) is a crucial factor in determining the initial treatment for patients with lung cancer, yet accurate preoperative diagnosis of LNM remains challenging. Recently, large language models (LLMs) have garnered significant attention due to their remarkable text generation capabilities. Leveraging the extensive medical knowledge learned from vast corpora, LLMs can estimate probabilities for clinical problems, though their performance has historically been inferior to data-driven machine learning models. In this paper, we propose a novel ensemble method that combines the medical knowledge acquired by LLMs with the latent patterns identified by machine learning models to enhance LNM prediction performance. Initially, we developed machine learning models using patient data. We then designed a prompt template to integrate the patient data with the predicted probability from the machine learning model. Subsequently, we instructed GPT-4o, the most advanced LLM developed by OpenAI, to estimate the likelihood of LNM based on patient data and then adjust the estimate using the machine learning output. Finally, we collected three outputs from the GPT-4o using the same prompt and ensembled these results as the final prediction. Using the proposed method, our models achieved an AUC value of 0.765 and an AP value of 0.415 for LNM prediction, significantly improving predictive performance compared to baseline machine learning models. The experimental results indicate that GPT-4o can effectively leverage its medical knowledge and the probabilities predicted by machine learning models to achieve more accurate LNM predictions. These findings demonstrate that LLMs can perform well in clinical risk prediction tasks, offering a new paradigm for integrating medical knowledge and patient data in clinical predictions.
- [204] arXiv:2407.17904 [pdf, html, other]
-
Title: Exploring the Effect of Dataset Diversity in Self-Supervised Learning for Surgical Computer VisionTim J.M. Jaspers, Ronald L.P.D. de Jong, Yasmina Al Khalil, Tijn Zeelenberg, Carolus H.J. Kusters, Yiping Li, Romy C. van Jaarsveld, Franciscus H.A. Bakker, Jelle P. Ruurda, Willem M. Brinkman, Peter H.N. De With, Fons van der SommenComments: accepted - Data Engineering in Medical Imaging (DEMI) Workshop @ MICCAI2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Over the past decade, computer vision applications in minimally invasive surgery have rapidly increased. Despite this growth, the impact of surgical computer vision remains limited compared to other medical fields like pathology and radiology, primarily due to the scarcity of representative annotated data. Whereas transfer learning from large annotated datasets such as ImageNet has been conventionally the norm to achieve high-performing models, recent advancements in self-supervised learning (SSL) have demonstrated superior performance. In medical image analysis, in-domain SSL pretraining has already been shown to outperform ImageNet-based initialization. Although unlabeled data in the field of surgical computer vision is abundant, the diversity within this data is limited. This study investigates the role of dataset diversity in SSL for surgical computer vision, comparing procedure-specific datasets against a more heterogeneous general surgical dataset across three different downstream surgical applications. The obtained results show that using solely procedure-specific data can lead to substantial improvements of 13.8%, 9.5%, and 36.8% compared to ImageNet pretraining. However, extending this data with more heterogeneous surgical data further increases performance by an additional 5.0%, 5.2%, and 2.5%, suggesting that increasing diversity within SSL data is beneficial for model performance. The code and pretrained model weights are made publicly available at this https URL.
- [205] arXiv:2407.17905 [pdf, html, other]
-
Title: StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span MemoryComments: 8 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Moving object segmentation based on LiDAR is a crucial and challenging task for autonomous driving and mobile robotics. Most approaches explore spatio-temporal information from LiDAR sequences to predict moving objects in the current frame. However, they often focus on transferring temporal cues in a single inference and regard every prediction as independent of others. This may cause inconsistent segmentation results for the same object in different frames. To overcome this issue, we propose a streaming network with a memory mechanism, called StreamMOS, to build the association of features and predictions among multiple inferences. Specifically, we utilize a short-term memory to convey historical features, which can be regarded as spatial prior of moving objects and adopted to enhance current inference by temporal fusion. Meanwhile, we build a long-term memory to store previous predictions and exploit them to refine the present forecast at voxel and instance levels through voting. Besides, we present multi-view encoder with cascade projection and asymmetric convolution to extract motion feature of objects in different representations. Extensive experiments validate that our algorithm gets competitive performance on SemanticKITTI and Sipailou Campus datasets. Code will be released at this https URL.
- [206] arXiv:2407.17906 [pdf, html, other]
-
Title: Hierarchical Object Detection and Recognition Framework for Practical Plant Disease DiagnosisComments: 6 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, object detection methods (OD; e.g., YOLO-based models) have been widely utilized in plant disease diagnosis. These methods demonstrate robustness to distance variations and excel at detecting small lesions compared to classification methods (CL; e.g., CNN models). However, there are issues such as low diagnostic performance for hard-to-detect diseases and high labeling costs. Additionally, since healthy cases cannot be explicitly trained, there is a risk of false positives. We propose the Hierarchical object detection and recognition framework (HODRF), a sophisticated and highly integrated two-stage system that combines the strengths of both OD and CL for plant disease diagnosis. In the first stage, HODRF uses OD to identify regions of interest (ROIs) without specifying the disease. In the second stage, CL diagnoses diseases surrounding the ROIs. HODRF offers several advantages: (1) Since OD detects only one type of ROI, HODRF can detect diseases with limited training images by leveraging its ability to identify other lesions. (2) While OD over-detects healthy cases, HODRF significantly reduces these errors by using CL in the second stage. (3) CL's accuracy improves in HODRF as it identifies diagnostic targets given as ROIs, making it less vulnerable to size changes. (4) HODRF benefits from CL's lower annotation costs, allowing it to learn from a larger number of images. We implemented HODRF using YOLOv7 for OD and EfficientNetV2 for CL and evaluated its performance on a large-scale dataset (4 crops, 20 diseased and healthy classes, 281K images). HODRF outperformed YOLOv7 alone by 5.8 to 21.5 points on healthy data and 0.6 to 7.5 points on macro F1 scores, and it improved macro F1 by 1.1 to 7.2 points over EfficientNetV2.
- [207] arXiv:2407.17907 [pdf, html, other]
-
Title: Amortized Posterior Sampling with Diffusion Prior DistillationSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
We propose a variational inference approach to sample from the posterior distribution for solving inverse problems. From a pre-trained diffusion model, our approach trains a conditional flow model to minimize the divergence between the proposal variational distribution and the posterior distribution implicitly defined through the diffusion model. Once trained, the flow model is capable of sampling from the posterior distribution with a single NFE, amortized with respect to the measurement. The proposed method paves a new path for distilling a diffusion prior for efficient posterior sampling. We show that our method is applicable to standard signals in Euclidean space, as well as signals on manifold.
- [208] arXiv:2407.17909 [pdf, html, other]
-
Title: Separating Novel Features for Logical Anomaly Detection: A Straightforward yet Effective ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Vision-based inspection algorithms have significantly contributed to quality control in industrial settings, particularly in addressing structural defects like dent and contamination which are prevalent in mass production. Extensive research efforts have led to the development of related benchmarks such as MVTec AD (Bergmann et al., 2019). However, in industrial settings, there can be instances of logical defects, where acceptable items are found in unsuitable locations or product pairs do not match as expected. Recent methods tackling logical defects effectively employ knowledge distillation to generate difference maps. Knowledge distillation (KD) is used to learn normal data distribution in unsupervised manner. Despite their effectiveness, these methods often overlook the potential false negatives. Excessive similarity between the teacher network and student network can hinder the generation of a suitable difference map for logical anomaly detection. This technical report provides insights on handling potential false negatives by utilizing a simple constraint in KD-based logical anomaly detection methods. We select EfficientAD as a state-of-the-art baseline and apply a margin-based constraint to its unsupervised learning scheme. Applying this constraint, we can improve the AUROC for MVTec LOCO AD by 1.3 %.
- [209] arXiv:2407.17911 [pdf, html, other]
-
Title: ReCorD: Reasoning and Correcting Diffusion for HOI GenerationJian-Yu Jiang-Lin, Kang-Yang Huang, Ling Lo, Yi-Ning Huang, Terence Lin, Jhih-Ciang Wu, Hong-Han Shuai, Wen-Huang ChengComments: Accepted by ACM MM 2024. Project website: this https URLSubjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Diffusion models revolutionize image generation by leveraging natural language to guide the creation of multimedia content. Despite significant advancements in such generative models, challenges persist in depicting detailed human-object interactions, especially regarding pose and object placement accuracy. We introduce a training-free method named Reasoning and Correcting Diffusion (ReCorD) to address these challenges. Our model couples Latent Diffusion Models with Visual Language Models to refine the generation process, ensuring precise depictions of HOIs. We propose an interaction-aware reasoning module to improve the interpretation of the interaction, along with an interaction correcting module to refine the output image for more precise HOI generation delicately. Through a meticulous process of pose selection and object positioning, ReCorD achieves superior fidelity in generated images while efficiently reducing computational requirements. We conduct comprehensive experiments on three benchmarks to demonstrate the significant progress in solving text-to-image generation tasks, showcasing ReCorD's ability to render complex interactions accurately by outperforming existing methods in HOI classification score, as well as FID and Verb CLIP-Score. Project website is available at this https URL .
- [210] arXiv:2407.17914 [pdf, html, other]
-
Title: Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language ModelsSubjects: Computation and Language (cs.CL)
Representations from deep neural networks (DNNs) have proven remarkably predictive of neural activity involved in both visual and linguistic processing. Despite these successes, most studies to date concern unimodal DNNs, encoding either visual or textual input but not both. Yet, there is growing evidence that human meaning representations integrate linguistic and sensory-motor information. Here we investigate whether the integration of multimodal information operated by current vision-and-language DNN models (VLMs) leads to representations that are more aligned with human brain activity than those obtained by language-only and vision-only DNNs. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or an accompanying picture. Our results reveal that VLM representations correlate more strongly than language- and vision-only DNNs with activations in brain areas functionally related to language processing. A comparison between different types of visuo-linguistic architectures shows that recent generative VLMs tend to be less brain-aligned than previous architectures with lower performance on downstream applications. Moreover, through an additional analysis comparing brain vs. behavioural alignment across multiple VLMs, we show that -- with one remarkable exception -- representations that strongly align with behavioural judgments do not correlate highly with brain responses. This indicates that brain similarity does not go hand in hand with behavioural similarity, and vice versa.
- [211] arXiv:2407.17915 [pdf, html, other]
-
Title: The Dark Side of Function Calling: Pathways to Jailbreaking Large Language ModelsSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Large language models (LLMs) have demonstrated remarkable capabilities, but their power comes with significant security considerations. While extensive research has been conducted on the safety of LLMs in chat mode, the security implications of their function calling feature have been largely overlooked. This paper uncovers a critical vulnerability in the function calling process of LLMs, introducing a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our empirical study, conducted on six state-of-the-art LLMs including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-pro, reveals an alarming average success rate of over 90\% for this attack. We provide a comprehensive analysis of why function calls are susceptible to such attacks and propose defensive strategies, including the use of defensive prompts. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs, contributing to the field of AI safety by identifying a previously unexplored risk, designing an effective attack method, and suggesting practical defensive measures. Our code is available at this https URL.
- [212] arXiv:2407.17927 [pdf, html, other]
-
Title: Invariance of deep image quality metrics to affine transformationsComments: 12 pages 13 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Deep architectures are the current state-of-the-art in predicting subjective image quality. Usually, these models are evaluated according to their ability to correlate with human opinion in databases with a range of distortions that may appear in digital media. However, these oversee affine transformations which may represent better the changes in the images actually happening in natural conditions. Humans can be particularly invariant to these natural transformations, as opposed to the digital ones. In this work, we evaluate state-of-the-art deep image quality metrics by assessing their invariance to affine transformations, specifically: rotation, translation, scaling, and changes in spectral illumination. We propose a methodology to assign invisibility thresholds for any perceptual metric. This methodology involves transforming the distance measured by an arbitrary metric to a common distance representation based on available subjectively rated databases. We psychophysically measure an absolute detection threshold in that common representation and express it in the physical units of each affine transform for each metric. By doing so, we allow the analyzed metrics to be directly comparable with actual human thresholds. We find that none of the state-of-the-art metrics shows human-like results under this strong test based on invisibility thresholds. This means that tuning the models exclusively to predict the visibility of generic distortions may disregard other properties of human vision as for instance invariances or invisibility thresholds.
- [213] arXiv:2407.17929 [pdf, html, other]
-
Title: Guided Latent Slot Diffusion for Object-Centric LearningComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Slot attention aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable various downstream tasks. Yet, these slots often bind to object parts, not objects themselves, especially for real-world datasets. To address this, we introduce Guided Latent Slot Diffusion - GLASS, an object-centric model that uses generated captions as a guiding signal to better align slots with objects. Our key insight is to learn the slot-attention module in the space of generated images. This allows us to repurpose the pre-trained diffusion decoder model, which reconstructs the images from the slots, as a semantic mask generator based on the generated captions. GLASS learns an object-level representation suitable for multiple tasks simultaneously, e.g., segmentation, image generation, and property prediction, outperforming previous methods. For object discovery, GLASS achieves approx. a +35% and +10% relative improvement for mIoU over the previous state-of-the-art (SOTA) method on the VOC and COCO datasets, respectively, and establishes a new SOTA FID score for conditional image generation amongst slot-attention-based methods. For the segmentation task, GLASS surpasses SOTA weakly-supervised and language-based segmentation models, which were specifically designed for the task.
- [214] arXiv:2407.17930 [pdf, other]
-
Title: Comparison of different Artificial Neural Networks for Bitcoin price forecastingComments: 9 pages, 8 figures, 2 tablesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
This study investigates the impact of varying sequence lengths on the accuracy of predicting cryptocurrency returns using Artificial Neural Networks (ANNs). Utilizing the Mean Absolute Error (MAE) as a threshold criterion, we aim to enhance prediction accuracy by excluding returns that are smaller than this threshold, thus mitigating errors associated with minor returns. The subsequent evaluation focuses on the accuracy of predicted returns that exceed this threshold. We compare four sequence lengths 168 hours (7 days), 72 hours (3 days), 24 hours, and 12 hours each with a return prediction interval of 2 hours. Our findings reveal the influence of sequence length on prediction accuracy and underscore the potential for optimized sequence configurations in financial forecasting models.
- [215] arXiv:2407.17933 [pdf, html, other]
-
Title: Segmentation by registration-enabled SAM prompt engineering using five reference imagesYaxi Chen, Aleksandra Ivanova, Shaheer U. Saeed, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng HuComments: Accepted to the 11th International Workshop on Biomedical Image Registration (WBIR 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV)
The recently proposed Segment Anything Model (SAM) is a general tool for image segmentation, but it requires additional adaptation and careful fine-tuning for medical image segmentation, especially for small, irregularly-shaped, and boundary-ambiguous anatomical structures such as the knee cartilage that is of interest in this work. Repaired cartilage, after certain surgical procedures, exhibits imaging patterns unseen to pre-training, posing further challenges for using models like SAM with or without general-purpose fine-tuning. To address this, we propose a novel registration-based prompt engineering framework for medical image segmentation using SAM. This approach utilises established image registration algorithms to align the new image (to-be-segmented) and a small number of reference images, without requiring segmentation labels. The spatial transformations generated by registration align either the new image or pre-defined point-based prompts, before using them as input to SAM. This strategy, requiring as few as five reference images with defined point prompts, effectively prompts SAM for inference on new images, without needing any segmentation labels. Evaluation of MR images from patients who received cartilage stem cell therapy yielded Dice scores of 0.89, 0.87, 0.53, and 0.52 for segmenting femur, tibia, femoral- and tibial cartilages, respectively. This outperforms atlas-based label fusion and is comparable to supervised nnUNet, an upper-bound fair baseline in this application, both of which require full segmentation labels for reference samples. The codes are available at: this https URL
- [216] arXiv:2407.17936 [pdf, html, other]
-
Title: Goal Estimation-based Adaptive Shared Control for Brain-Machine Interfaces Remote Robot NavigationSubjects: Robotics (cs.RO)
In this study, we propose a shared control method for teleoperated mobile robots using brain-machine interfaces (BMI). The control commands generated through BMI for robot operation face issues of low input frequency, discreteness, and uncertainty due to noise. To address these challenges, our method estimates the user's intended goal from their commands and uses this goal to generate auxiliary commands through the autonomous system that are both at a higher input frequency and more continuous. Furthermore, by defining the confidence level of the estimation, we adaptively calculated the weights for combining user and autonomous commands, thus achieving shared control.
- [217] arXiv:2407.17940 [pdf, html, other]
-
Title: Positive Text Reframing under Multi-strategy OptimizationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Differing from sentiment transfer, positive reframing seeks to substitute negative perspectives with positive expressions while preserving the original meaning. With the emergence of pre-trained language models (PLMs), it is possible to achieve acceptable results by fine-tuning PLMs. Nevertheless, generating fluent, diverse and task-constrained reframing text remains a significant challenge. To tackle this issue, a \textbf{m}ulti-\textbf{s}trategy \textbf{o}ptimization \textbf{f}ramework (MSOF) is proposed in this paper. Starting from the objective of positive reframing, we first design positive sentiment reward and content preservation reward to encourage the model to transform the negative expressions of the original text while ensuring the integrity and consistency of the semantics. Then, different decoding optimization approaches are introduced to improve the quality of text generation. Finally, based on the modeling formula of positive reframing, we propose a multi-dimensional re-ranking method that further selects candidate sentences from three dimensions: strategy consistency, text similarity and fluency. Extensive experiments on two Seq2Seq PLMs, BART and T5, demonstrate our framework achieves significant improvements on unconstrained and controlled positive reframing tasks.
- [218] arXiv:2407.17941 [pdf, html, other]
-
Title: RDFGraphGen: A Synthetic RDF Graph Generator based on SHACL ConstraintsComments: 19 pagesSubjects: Software Engineering (cs.SE); Databases (cs.DB)
This paper introduces RDFGraphGen, a general-purpose, domain-independent generator of synthetic RDF graphs based on SHACL constraints. The Shapes Constraint Language (SHACL) is a W3C standard which specifies ways to validate data in RDF graphs, by defining constraining shapes. However, even though the main purpose of SHACL is validation of existing RDF data, in order to solve the problem with the lack of available RDF datasets in multiple RDF-based application development processes, we envisioned and implemented a reverse role for SHACL: we use SHACL shape definitions as a starting point to generate synthetic data for an RDF graph. The generation process involves extracting the constraints from the SHACL shapes, converting the specified constraints into rules, and then generating artificial data for a predefined number of RDF entities, based on these rules. The purpose of RDFGraphGen is the generation of small, medium or large RDF knowledge graphs for the purpose of benchmarking, testing, quality control, training and other similar purposes for applications from the RDF, Linked Data and Semantic Web domain. RDFGraphGen is open-source and is available as a ready-to-use Python package.
- [219] arXiv:2407.17942 [pdf, html, other]
-
Title: A Novel Perception Entropy Metric for Optimizing Vehicle Perception with LiDAR DeploymentSubjects: Robotics (cs.RO); Information Theory (cs.IT)
Developing an effective evaluation metric is crucial for accurately and swiftly measuring LiDAR perception performance. One major issue is the lack of metrics that can simultaneously generate fast and accurate evaluations based on either object detection or point cloud data. In this study, we propose a novel LiDAR perception entropy metric based on the probability of vehicle grid occupancy. This metric reflects the influence of point cloud distribution on vehicle detection performance. Based on this, we also introduce a LiDAR deployment optimization model, which is solved using a differential evolution-based particle swarm optimization algorithm. A comparative experiment demonstrated that the proposed PE-VGOP offers a correlation of more than 0.98 with vehicle detection ground truth in evaluating LiDAR perception performance. Furthermore, compared to the base deployment, field experiments indicate that the proposed optimization model can significantly enhance the perception capabilities of various types of LiDARs, including RS-16, RS-32, and RS-80. Notably, it achieves a 25% increase in detection Recall for the RS-32 LiDAR.
- [220] arXiv:2407.17944 [pdf, html, other]
-
Title: Time-Optimal Planning for Long-Range Quadrotor Flights: An Automatic Optimal Synthesis ApproachComments: 19 pages, 19 figuresSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Time-critical tasks such as drone racing typically cover large operation areas. However, it is difficult and computationally intensive for current time-optimal motion planners to accommodate long flight distances since a large yet unknown number of knot points is required to represent the trajectory. We present a polynomial-based automatic optimal synthesis (AOS) approach that can address this challenge. Our method not only achieves superior time optimality but also maintains a consistently low computational cost across different ranges while considering the full quadrotor dynamics. First, we analyze the properties of time-optimal quadrotor maneuvers to determine the minimal number of polynomial pieces required to capture the dominant structure of time-optimal trajectories. This enables us to represent substantially long minimum-time trajectories with a minimal set of variables. Then, a robust optimization scheme is developed to handle arbitrary start and end conditions as well as intermediate waypoints. Extensive comparisons show that our approach is faster than the state-of-the-art approach by orders of magnitude with comparable time optimality. Real-world experiments further validate the quality of the resulting trajectories, demonstrating aggressive time-optimal maneuvers with a peak velocity of 8.86 m/s.
- [221] arXiv:2407.17946 [pdf, other]
-
Title: Quantum-Inspired Evolutionary Algorithms for Feature Subset Selection: A Comprehensive SurveyComments: 43 pages, 13 tables, 5 figuresSubjects: Neural and Evolutionary Computing (cs.NE)
The clever hybridization of quantum computing concepts and evolutionary algorithms (EAs) resulted in a new field called quantum-inspired evolutionary algorithms (QIEAs). Unlike traditional EAs, QIEAs employ quantum bits to adopt a probabilistic representation of the state of a feature in a given solution. This unprecedented feature enables them to achieve better diversity and perform global search, effectively yielding a tradeoff between exploration and exploitation. We conducted a comprehensive survey across various publishers and gathered 56 papers. We thoroughly analyzed these publications, focusing on the novelty elements and types of heuristics employed by the extant quantum-inspired evolutionary algorithms (QIEAs) proposed to solve the feature subset selection (FSS) problem. Importantly, we provided a detailed analysis of the different types of objective functions and popular quantum gates, i.e., rotation gates, employed throughout the literature. Additionally, we suggested several open research problems to attract the attention of the researchers.
- [222] arXiv:2407.17947 [pdf, other]
-
Title: Supercritical Size-Width Tree-Like Resolution Trade-Offs for Graph IsomorphismComments: 32 pages, 2 figuresSubjects: Logic in Computer Science (cs.LO); Computational Complexity (cs.CC)
We study the refutation complexity of graph isomorphism in the tree-like resolution calculus. Torán and Wörz (TOCL 2023) showed that there is a resolution refutation of narrow width $k$ for two graphs if and only if they can be distinguished in ($k+1$)-variable first-order logic (FO$^{k+1}$) and hence by a count-free variant of the $k$-dimensional Weisfeiler-Leman algorithm. While DAG-like narrow width $k$ resolution refutations have size at most $n^k$, tree-like refutations may be much larger. We show that there are graphs of order n, whose isomorphism can be refuted in narrow width $k$ but only in tree-like size $2^{\Omega(n^{k/2})}$. This is a supercritical trade-off where bounding one parameter (the narrow width) causes the other parameter (the size) to grow above its worst case. The size lower bound is super-exponential in the formula size and improves a related supercritical width versus tree-like size trade-off by Razborov (JACM 2016). To prove our result, we develop a new variant of the $k$-pebble EF-game for FO$^k$ to reason about tree-like refutation size in a similar way as the Prover-Delayer games in proof complexity. We analyze this game on a modified variant of the compressed CFI graphs introduced by Grohe, Lichter, Neuen, and Schweitzer (FOCS 2023). Using a recent improved robust compressed CFI construction of Janett, Nordström, and Pang (unpublished manuscript), we obtain a similar bound for width $k$ (instead of the stronger but less common narrow width) and make the result more robust.
- [223] arXiv:2407.17950 [pdf, other]
-
Title: Real Time American Sign Language Detection Using Yolo-v9Comments: 11 pages, 13 figures, 1 tableSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper focuses on real-time American Sign Language Detection. YOLO is a convolutional neural network (CNN) based model, which was first released in 2015. In recent years, it gained popularity for its real-time detection capabilities. Our study specifically targets YOLO-v9 model, released in 2024. As the model is newly introduced, not much work has been done on it, especially not in Sign Language Detection. Our paper provides deep insight on how YOLO- v9 works and better than previous model.
- [224] arXiv:2407.17951 [pdf, html, other]
-
Title: Pruning Boolean d-DNNF Circuits Through Tseitin-AwarenessComments: submitted to ICTAI 2024Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Boolean circuits in d-DNNF form enable tractable probabilistic inference. However, as a key insight of this work, we show that commonly used d-DNNF compilation approaches introduce irrelevant subcircuits. We call these subcircuits Tseitin artifacts, as they are introduced due to the Tseitin transformation step -- a well-established procedure to transform any circuit into the CNF format required by several d-DNNF knowledge compilers. We discuss how to detect and remove both Tseitin variables and Tseitin artifacts, leading to more succinct circuits. We empirically observe an average size reduction of 77.5% when removing both Tseitin variables and artifacts. The additional pruning of Tseitin artifacts reduces the size by 22.2% on average. This significantly improves downstream tasks that benefit from a more succinct circuit, e.g., probabilistic inference tasks.
- [225] arXiv:2407.17952 [pdf, other]
-
Title: BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth EstimationXiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, Christopher SchroersSubjects: Computer Vision and Pattern Recognition (cs.CV)
By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficiently precise details. Although recent diffusion-based MDE approaches exhibit appealing detail extraction ability, they still struggle in geometrically challenging scenes due to the difficulty of gaining robust geometric priors from diverse datasets. To leverage the complementary merits of both worlds, we propose BetterDepth to efficiently achieve geometrically correct affine-invariant MDE performance while capturing fine-grained details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning, in which the global depth context is well-captured, and iteratively refines details based on the input image. For the training of such a refiner, we propose global pre-alignment and local patch masking methods to ensure the faithfulness of BetterDepth to depth conditioning while learning to capture fine-grained scene details. By efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without additional re-training.
- [226] arXiv:2407.17954 [pdf, html, other]
-
Title: Scaling Training Data with Lossy Image CompressionComments: 21 pages, 27 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
Empirically-determined scaling laws have been broadly successful in predicting the evolution of large machine learning models with training data and number of parameters. As a consequence, they have been useful for optimizing the allocation of limited resources, most notably compute time.
In certain applications, storage space is an important constraint, and data format needs to be chosen carefully as a consequence. Computer vision is a prominent example: images are inherently analog, but are always stored in a digital format using a finite number of bits. Given a dataset of digital images, the number of bits $L$ to store each of them can be further reduced using lossy data compression. This, however, can degrade the quality of the model trained on such images, since each example has lower resolution.
In order to capture this trade-off and optimize storage of training data, we propose a `storage scaling law' that describes the joint evolution of test error with sample size and number of bits per image. We prove that this law holds within a stylized model for image compression, and verify it empirically on two computer vision tasks, extracting the relevant parameters. We then show that this law can be used to optimize the lossy compression level. At given storage, models trained on optimally compressed images present a significantly smaller test error with respect to models trained on the original data. Finally, we investigate the potential benefits of randomizing the compression level. - [227] arXiv:2407.17956 [pdf, html, other]
-
Title: SaccadeDet: A Novel Dual-Stage Architecture for Rapid and Accurate Detection in Gigapixel ImagesComments: This paper is accepted to ECML-PKDD 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
The advancement of deep learning in object detection has predominantly focused on megapixel images, leaving a critical gap in the efficient processing of gigapixel images. These super high-resolution images present unique challenges due to their immense size and computational demands. To address this, we introduce 'SaccadeDet', an innovative architecture for gigapixel-level object detection, inspired by the human eye saccadic movement. The cornerstone of SaccadeDet is its ability to strategically select and process image regions, dramatically reducing computational load. This is achieved through a two-stage process: the 'saccade' stage, which identifies regions of probable interest, and the 'gaze' stage, which refines detection in these targeted areas. Our approach, evaluated on the PANDA dataset, not only achieves an 8x speed increase over the state-of-the-art methods but also demonstrates significant potential in gigapixel-level pathology analysis through its application to Whole Slide Imaging.
- [228] arXiv:2407.17957 [pdf, html, other]
-
Title: Neural Networks for Generating Better Local Optima in Topology OptimizationSubjects: Machine Learning (cs.LG)
Neural networks have recently been employed as material discretizations within adjoint optimization frameworks for inverse problems and topology optimization. While advantageous regularization effects and better optima have been found for some inverse problems, the benefit for topology optimization has been limited -- where the focus of investigations has been the compliance problem. We demonstrate how neural network material discretizations can, under certain conditions, find better local optima in more challenging optimization problems, where we here specifically consider acoustic topology optimization. The chances of identifying a better optimum can significantly be improved by running multiple partial optimizations with different neural network initializations. Furthermore, we show that the neural network material discretization's advantage comes from the interplay with the Adam optimizer and emphasize its current limitations when competing with constrained and higher-order optimization techniques. At the moment, this discretization has only been shown to be beneficial for unconstrained first-order optimization.
- [229] arXiv:2407.17960 [pdf, html, other]
-
Title: The Curious Case of Representational Alignment: Unravelling Visio-Linguistic Tasks in Emergent CommunicationComments: Appeared at the 13th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2024)Subjects: Computation and Language (cs.CL)
Natural language has the universal properties of being compositional and grounded in reality. The emergence of linguistic properties is often investigated through simulations of emergent communication in referential games. However, these experiments have yielded mixed results compared to similar experiments addressing linguistic properties of human language. Here we address representational alignment as a potential contributing factor to these results. Specifically, we assess the representational alignment between agent image representations and between agent representations and input images. Doing so, we confirm that the emergent language does not appear to encode human-like conceptual visual features, since agent image representations drift away from inputs whilst inter-agent alignment increases. We moreover identify a strong relationship between inter-agent alignment and topographic similarity, a common metric for compositionality, and address its consequences. To address these issues, we introduce an alignment penalty that prevents representational drift but interestingly does not improve performance on a compositional discrimination task. Together, our findings emphasise the key role representational alignment plays in simulations of language emergence.
- [230] arXiv:2407.17963 [pdf, html, other]
-
Title: Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning TasksSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood. To investigate these behaviors, arithmetic tasks serve as important venues. In previous studies, seemingly unrelated mysteries still exist -- (1) models with appropriate positional embeddings can correctly perform longer unseen arithmetic operations such as addition, but their effectiveness varies in more complex tasks like multiplication; (2) models perform well for longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101), regardless of the positional encoding used. We believe previous studies have been treating the symptoms rather than addressing the root cause -- they have paid excessive attention to improving model components, while overlooking the differences in task properties that may be the real drivers. This is confirmed by our unified theoretical framework for different arithmetic scenarios. For example, unlike multiplication, the digital addition task has the property of translation invariance which naturally aligns with the relative positional encoding, and this combination leads to successful generalization of addition to unseen longer domains. The discrepancy in operations modulo 100 and 101 arises from the base. Modulo 100, unlike 101, is compatible with the decimal system (base 10), such that unseen information in digits beyond the units digit and the tens digit is actually not needed for the task. Extensive experiments with GPT-like models validate our theoretical predictions. These findings deepen our understanding of the generalization mechanisms, and facilitate more data-efficient model training and objective-oriented AI alignment.
- [231] arXiv:2407.17964 [pdf, other]
-
Title: A robust and time-parallel preconditioner for parabolic reconstruction problems using Isogeometric AnalysisComments: This work was supported by the Research Council of Norway through the grants 300305 and 301013. This work was supported by the Austrian Science Fund (FWF): https://doi.org/10.55776/P33956Subjects: Numerical Analysis (math.NA)
We consider a PDE-constrained optimization problem of tracking type with parabolic state equation. The solution to the problem is characterized by the Karush-Kuhn-Tucker (KKT) system, which we formulate using a strong variational formulation of the state equation and a super weak formulation of the adjoined state equation. This allows us to propose a preconditioner that is robust both in the regularization and the diffusion parameter. In order to discretize the problem, we use Isogeometric Analysis since it allows the construction of sufficiently smooth basis functions effortlessly. To realize the preconditioner, one has to solve a problem over the whole space time cylinder that is elliptic with respect to certain non-standard norms. Using a fast diagonalization approach in time, we reformulate the problem as a collection of elliptic problems in space only. These problems are not only smaller, but our approach also allows to solve them in a time-parallel way. We show the efficiency of the preconditioner by rigorous analysis and illustrate it with numerical experiments.
- [232] arXiv:2407.17967 [pdf, html, other]
-
Title: Lightweight Language-driven Grasp Detection using Conditional Consistency ModelComments: Accepted at IROS 2024Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Language-driven grasp detection is a fundamental yet challenging task in robotics with various industrial applications. In this work, we present a new approach for language-driven grasp detection that leverages the concept of lightweight diffusion models to achieve fast inference time. By integrating diffusion processes with grasping prompts in natural language, our method can effectively encode visual and textual information, enabling more accurate and versatile grasp positioning that aligns well with the text query. To overcome the long inference time problem in diffusion models, we leverage the image and text features as the condition in the consistency model to reduce the number of denoising timesteps during inference. The intensive experimental results show that our method outperforms other recent grasp detection methods and lightweight diffusion models by a clear margin. We further validate our method in real-world robotic experiments to demonstrate its fast inference time capability.
- [233] arXiv:2407.17973 [pdf, html, other]
-
Title: Limited Voting for Better Representation?Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Limited Voting (LV) is an approval-based method for multi-winner elections where all ballots are required to have a same fixed size. While it appears to be used as voting method in corporate governance and has some political applications, to the best of our knowledge, no formal analysis of the rule exists to date. We provide such an analysis here, prompted by a request for advice about this voting rule by a health insurance company in the Netherlands, which uses it to elect its work council. We study conditions under which LV would improve representation over standard approval voting and when it would not. We establish the extent of such an improvement, or lack thereof, both in terms of diversity and proportionality notions. These results help us understand if, and how, LV may be used as a low-effort fix of approval voting in order to enhance representation.
- [234] arXiv:2407.17974 [pdf, html, other]
-
Title: What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language modelsComments: Appeared at the 13th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2024)Subjects: Computation and Language (cs.CL)
Humans have clear cross-modal preferences when matching certain novel words to visual shapes. Evidence suggests that these preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings. With the rise of multimodal models in AI, such as vision- and-language (VLM) models, it becomes increasingly important to uncover the kinds of visio-linguistic associations these models encode and whether they align with human representations. Informed by experiments with humans, we probe and compare four VLMs for a well-known human cross-modal preference, the bouba-kiki effect. We do not find conclusive evidence for this effect but suggest that results may depend on features of the models, such as architecture design, model size, and training details. Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations.
- [235] arXiv:2407.17980 [pdf, html, other]
-
Title: Personalized and Context-aware Route Planning for Edge-assisted VehiclesSubjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Conventional route planning services typically offer the same routes to all drivers, focusing primarily on a few standardized factors such as travel distance or time, overlooking individual driver preferences. With the inception of autonomous vehicles expected in the coming years, where vehicles will rely on routes decided by such planners, there arises a need to incorporate the specific preferences of each driver, ensuring personalized navigation experiences. In this work, we propose a novel approach based on graph neural networks (GNNs) and deep reinforcement learning (DRL), aimed at customizing routes to suit individual preferences. By analyzing the historical trajectories of individual drivers, we classify their driving behavior and associate it with relevant road attributes as indicators of driver preferences. The GNN is capable of representing the road network as graph-structured data effectively, while DRL is capable of making decisions utilizing reward mechanisms to optimize route selection with factors such as travel costs, congestion level, and driver satisfaction. We evaluate our proposed GNN-based DRL framework using a real-world road network and demonstrate its ability to accommodate driver preferences, offering a range of route options tailored to individual drivers. The results indicate that our framework can select routes that accommodate driver's preferences with up to a 17% improvement compared to a generic route planner, and reduce the travel time by 33% (afternoon) and 46% (evening) relatively to the shortest distance-based approach.
- [236] arXiv:2407.17990 [pdf, html, other]
-
Title: Towards Living Software Architecture DiagramsFilipe F. Correia, Ricardo Ferreira, Paulo G. G Queiroz, Henrique Nunes, Matilde Barra, Duarte FigueiredoSubjects: Software Engineering (cs.SE)
Software architecture often consists of interconnected components dispersed across source code and other development artifacts, making visualization difficult without costly additional documentation. Although some tools can automatically generate architectural diagrams, these hardly fully reflect the architecture of the system. We propose the value of automatic architecture recovery from multiple software artifacts, combined with the ability to manually adjust recovered models and automate the recovery process. We present a general approach to achieve this and describe a tool that generates architectural diagrams for a software system by analyzing its software artifacts and unifying them into a comprehensive system representation. This representation can be manually modified while ensuring that changes are reintegrated into the diagram when it is regenerated. We argue that adopting a similar approach in other types of documentation tools is possible and can render similar benefits.
- [237] arXiv:2407.17992 [pdf, html, other]
-
Title: Amortized Active Learning for Nonparametric FunctionsSubjects: Machine Learning (cs.LG)
Active learning (AL) is a sequential learning scheme aiming to select the most informative data. AL reduces data consumption and avoids the cost of labeling large amounts of data. However, AL trains the model and solves an acquisition optimization for each selection. It becomes expensive when the model training or acquisition optimization is challenging. In this paper, we focus on active nonparametric function learning, where the gold standard Gaussian process (GP) approaches suffer from cubic time complexity. We propose an amortized AL method, where new data are suggested by a neural network which is trained up-front without any real data (Figure 1). Our method avoids repeated model training and requires no acquisition optimization during the AL deployment. We (i) utilize GPs as function priors to construct an AL simulator, (ii) train an AL policy that can zero-shot generalize from simulation to real learning problems of nonparametric functions and (iii) achieve real-time data selection and comparable learning performances to time-consuming baseline methods.
- [238] arXiv:2407.17994 [pdf, html, other]
-
Title: Discursive Patinas: Anchoring Discussions in Data VisualizationsSubjects: Human-Computer Interaction (cs.HC)
This paper presents discursive patinas, a technique to visualize discussions onto data visualizations, inspired by how people leave traces in the physical world. While data visualizations are widely discussed in online communities and social media, comments tend to be displayed separately from the visualization and we lack ways to relate these discussions back to the content of the visualization, e.g., to situate comments, explain visual patterns, or question assumptions. In our visualization annotation interface, users can designate areas within the visualization. Discursive patinas are made of overlaid visual marks (anchors), attached to textual comments with category labels, likes, and replies. By coloring and styling the anchors, a meta visualization emerges, showing what and where people comment and annotate the visualization. These patinas show regions of heavy discussions, recent commenting activity, and the distribution of questions, suggestions, or personal stories. We ran workshops with 90 students, domain experts, and visualization researchers to study how people use anchors to discuss visualizations and how patinas influence people's understanding of the discussion. Our results show that discursive patinas improve the ability to navigate discussions and guide people to comments that help understand, contextualize, or scrutinize the visualization. We discuss the potential of anchors and patinas to support discursive engagements, including critical readings of visualizations, design feedback, and feminist approaches to data visualization.
- [239] arXiv:2407.17995 [pdf, html, other]
-
Title: Notes on symmetries and reductions of algebraic equationsComments: 6 pagesSubjects: Numerical Analysis (math.NA)
Symmetries and reductions of some algebraic equations are considered. Transformations that preserve the form of several algebraic equations, as well as transformations that reduce the degree of these equations, are described. Illustrative examples are provided. The obtained results and solutions can be used as test problems for numerical methods of solving algebraic equations.
- [240] arXiv:2407.17996 [pdf, html, other]
-
Title: Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile PhotographySubjects: Computer Vision and Pattern Recognition (cs.CV)
The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spectral decomposition model guided enhancement framework, which consists of two steps: joint decomposition and prior-guided enhancement. Firstly, we leverage the complementarity between RGB and Low-resolution Multi-Spectral Images (Lr-MSI) to predict shading, reflectance, and material semantic priors. Subsequently, these priors are seamlessly integrated into the established HDRNet to promote dynamic range enhancement, color mapping, and grid expert learning, respectively. Additionally, we construct a high-quality Mobile-Spec dataset to support our research, and our experiments validate the effectiveness of Lr-MSI in the tone enhancement task. This work aims to establish a solid foundation for advancing spectral vision in mobile photography. The code is available at \url{this https URL}.
- [241] arXiv:2407.17997 [pdf, html, other]
-
Title: On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition ArchitecturesComments: Accepted at the SynData4GenAI 2024 workshopSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce the original training data, training ASR systems solely on synthetic data. For ASR, we use three different architectures, attention-based encoder-decoder, hybrid deep neural network hidden Markov model and a Gaussian mixture hidden Markov model, showing the different sensitivity of the models to synthetic data generation. In order to extend previous work, we present a number of ablation studies on the effectiveness of synthetic vs. real training data for ASR. In particular we focus on how the gap between training on synthetic and real data changes by varying the speaker embedding or by scaling the model size. For the latter we show that the TTS models generalize well, even when training scores indicate overfitting.
- [242] arXiv:2407.17998 [pdf, other]
-
Title: iNNspector: Visual, Interactive Deep Model DebuggingComments: 41 pages paper, 4 pages references, 3 pages appendix, 19 figures, 2 tablesSubjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Deep learning model design, development, and debugging is a process driven by best practices, guidelines, trial-and-error, and the personal experiences of model developers. At multiple stages of this process, performance and internal model data can be logged and made available. However, due to the sheer complexity and scale of this data and process, model developers often resort to evaluating their model performance based on abstract metrics like accuracy and loss. We argue that a structured analysis of data along the model's architecture and at multiple abstraction levels can considerably streamline the debugging process. Such a systematic analysis can further connect the developer's design choices to their impacts on the model behavior, facilitating the understanding, diagnosis, and refinement of deep learning models. Hence, in this paper, we (1) contribute a conceptual framework structuring the data space of deep learning experiments. Our framework, grounded in literature analysis and requirements interviews, captures design dimensions and proposes mechanisms to make this data explorable and tractable. To operationalize our framework in a ready-to-use application, we (2) present the iNNspector system. iNNspector enables tracking of deep learning experiments and provides interactive visualizations of the data on all levels of abstraction from multiple models to individual neurons. Finally, we (3) evaluate our approach with three real-world use-cases and a user study with deep learning developers and data analysts, proving its effectiveness and usability.
- [243] arXiv:2407.17999 [pdf, html, other]
-
Title: Lightweight Industrial Cohorted Federated Learning for Heterogeneous AssetsSubjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Federated Learning (FL) is the most widely adopted collaborative learning approach for training decentralized Machine Learning (ML) models by exchanging learning between clients without sharing the data and compromising privacy. However, since great data similarity or homogeneity is taken for granted in all FL tasks, FL is still not specifically designed for the industrial setting. Rarely this is the case in industrial data because there are differences in machine type, firmware version, operational conditions, environmental factors, and hence, data distribution. Albeit its popularity, it has been observed that FL performance degrades if the clients have heterogeneous data distributions. Therefore, we propose a Lightweight Industrial Cohorted FL (LICFL) algorithm that uses model parameters for cohorting without any additional on-edge (clientlevel) computations and communications than standard FL and mitigates the shortcomings from data heterogeneity in industrial applications. Our approach enhances client-level model performance by allowing them to collaborate with similar clients and train more specialized or personalized models. Also, we propose an adaptive aggregation algorithm that extends the LICFL to Adaptive LICFL (ALICFL) for further improving the global model performance and speeding up the convergence. Through numerical experiments on real-time data, we demonstrate the efficacy of the proposed algorithms and compare the performance with existing approaches.
- [244] arXiv:2407.18000 [pdf, html, other]
-
Title: Investigation to answer three key questions concerning plant pest identification and development of a practical identification frameworkComments: 40 pages, 10 figuresJournal-ref: Computers and Electronics in Agriculture, Volume 222, July 2024, 109021Subjects: Computer Vision and Pattern Recognition (cs.CV)
The development of practical and robust automated diagnostic systems for identifying plant pests is crucial for efficient agricultural production. In this paper, we first investigate three key research questions (RQs) that have not been addressed thus far in the field of image-based plant pest identification. Based on the knowledge gained, we then develop an accurate, robust, and fast plant pest identification framework using 334K images comprising 78 combinations of four plant portions (the leaf front, leaf back, fruit, and flower of cucumber, tomato, strawberry, and eggplant) and 20 pest species captured at 27 farms. The results reveal the following. (1) For an appropriate evaluation of the model, the test data should not include images of the field from which the training images were collected, or other considerations to increase the diversity of the test set should be taken into account. (2) Pre-extraction of ROIs, such as leaves and fruits, helps to improve identification accuracy. (3) Integration of closely related species using the same control methods and cross-crop training methods for the same pests, are effective. Our two-stage plant pest identification framework, enabling ROI detection and convolutional neural network (CNN)-based identification, achieved a highly practical performance of 91.0% and 88.5% in mean accuracy and macro F1 score, respectively, for 12,223 instances of test data of 21 classes collected from unseen fields, where 25 classes of images from 318,971 samples were used for training; the average identification time was 476 ms/image.
- [245] arXiv:2407.18002 [pdf, html, other]
-
Title: Network Inversion of Convolutional Neural NetsSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Neural networks have emerged as powerful tools across various applications, yet their decision-making process often remains opaque, leading to them being perceived as "black boxes." This opacity raises concerns about their interpretability and reliability, especially in safety-critical scenarios. Network inversion techniques offer a solution by allowing us to peek inside these black boxes, revealing the features and patterns learned by the networks behind their decision-making processes and thereby provide valuable insights into how neural networks arrive at their conclusions, making them more interpretable and trustworthy. This paper presents a simple yet effective approach to network inversion using a carefully conditioned generator that learns the data distribution in the input space of the trained neural network, enabling the reconstruction of inputs that would most likely lead to the desired outputs. To capture the diversity in the input space for a given output, instead of simply revealing the conditioning labels to the generator, we hideously encode the conditioning label information into vectors, further exemplified by heavy dropout in the generation process and minimisation of cosine similarity between the features corresponding to the generated images. The paper concludes with immediate applications of Network Inversion including in interpretability, explainability and generation of adversarial samples.
- [246] arXiv:2407.18003 [pdf, html, other]
-
Title: Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache ConsumptionComments: to be published in CoLM 2024Subjects: Computation and Language (cs.CL)
Large Language Models (LLMs), epitomized by ChatGPT' s release in late 2022, have revolutionized various industries with their advanced language comprehension. However, their efficiency is challenged by the Transformer architecture' s struggle with handling long texts. KV-Cache has emerged as a pivotal solution to this issue, converting the time complexity of token generation from quadratic to linear, albeit with increased GPU memory overhead proportional to conversation length. With the development of the LLM community and academia, various KV-Cache compression methods have been proposed. In this review, we dissect the various properties of KV-Cache and elaborate on various methods currently used to optimize the KV-Cache space usage of LLMs. These methods span the pre-training phase, deployment phase, and inference phase, and we summarize the commonalities and differences among these methods. Additionally, we list some metrics for evaluating the long-text capabilities of large language models, from both efficiency and capability perspectives. Our review thus sheds light on the evolving landscape of LLM optimization, offering insights into future advancements in this dynamic field.
- [247] arXiv:2407.18004 [pdf, html, other]
-
Title: Optimal Broadcast Schedules in Logarithmic Time with Applications to Broadcast, All-Broadcast, Reduction and All-ReductionComments: arXiv admin note: substantial text overlap with arXiv:2312.11236Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
We give optimally fast $O(\log p)$ time (per processor) algorithms for computing round-optimal broadcast schedules for message-passing parallel computing systems. This affirmatively answers difficult questions posed in a SPAA 2022 BA and a CLUSTER 2022 paper. We observe that the computed schedules and circulant communication graph can likewise be used for reduction, all-broadcast and all-reduction as well, leading to new, round-optimal algorithms for these problems. These observations affirmatively answer open questions posed in a CLUSTER 2023 paper.
The problem is to broadcast $n$ indivisible blocks of data from a given root processor to all other processors in a (subgraph of a) fully connected network of $p$ processors with fully bidirectional, one-ported communication capabilities. In this model, $n-1+\lceil\log_2 p\rceil$ communication rounds are required. Our new algorithms compute for each processor in the network receive and send schedules each of size $\lceil\log_2 p\rceil$ that determine uniquely in $O(1)$ time for each communication round the new block that the processor will receive, and the already received block it has to send. Schedule computations are done independently per processor without communication. The broadcast communication subgraph is an easily computable, directed, $\lceil\log_2 p\rceil$-regular circulant graph also used elsewhere. We show how the schedule computations can be done in optimal time and space of $O(\log p)$, improving significantly over previous results of $O(p\log^2 p)$ and $O(\log^3 p)$, respectively. The schedule computation and broadcast algorithms are simple to implement, but correctness and complexity are not obvious. The schedules are used for new implementations of the MPI (Message-Passing Interface) collectives MPI_Bcast, MPI_Allgatherv, MPI_Reduce and MPI_Reduce_scatter. Preliminary experimental results are given. - [248] arXiv:2407.18005 [pdf, other]
-
Title: An Exploration Study on Developing Blockchain Systems the Practitioners PerspectiveSubjects: Software Engineering (cs.SE)
Context: Blockchain-based software (BBS) exploits the concepts and technologies popularized by cryptocurrencies offering decentralized transaction ledgers with immutable content for security-critical and transaction critical systems. Recent research has explored the strategic benefits and technical limitations of BBS in various fields, including cybersecurity, healthcare, education, and financial technologies. Despite growing interest from academia and industry, there is a lack of empirical evidence, leading to an incomplete understanding of the processes, methods, and techniques necessary for systematic BBS development. Objectives: Existing research lacks a consolidated view, particularly empirically driven guidelines based on published evidence and development practices. This study aims to address the gap by consolidating empirical evidence and development practices to derive or leverage existing processes, patterns, and models for designing, implementing, and validating BBS systems. Method: Tied to this knowledge gap, we conducted a two-phase research project. First, a systematic literature review of 58 studies was performed to identify a development process comprising 23 tasks for BBS systems. Second, a survey of 102 blockchain practitioners from 35 countries across six continents was conducted to validate the BBS system development process. Results: Our results revealed a statistically significant difference (p-value <.001) in the importance ratings of 24 out of 26 BBS tasks by our participants. The only two tasks that were not statistically significant were incentive protocol design and granularity design. Conclusion: Our research is among the first to advance understanding on the aspect of development process for blockchain-based systems and helps researchers and practitioners in their quests on challenges and recommendations associated with the development of BBS systems
- [249] arXiv:2407.18006 [pdf, other]
-
Title: The Existential Theory of the Reals as a Complexity Class: A CompendiumComments: 126 pages, 12 figures, 6 tables, about 150 complete problems and about 50 open problemsSubjects: Computational Complexity (cs.CC); Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
We survey the complexity class $\exists \mathbb{R}$, which captures the complexity of deciding the existential theory of the reals. The class $\exists \mathbb{R}$ has roots in two different traditions, one based on the Blum-Shub-Smale model of real computation, and the other following work by Mnëv and Shor on the universality of realization spaces of oriented matroids. Over the years the number of problems for which $\exists \mathbb{R}$ rather than NP has turned out to be the proper way of measuring their complexity has grown, particularly in the fields of computational geometry, graph drawing, game theory, and some areas in logic and algebra. $\exists \mathbb{R}$ has also started appearing in the context of machine learning, Markov decision processes, and probabilistic reasoning.
We have aimed at collecting a comprehensive compendium of problems complete and hard for $\exists \mathbb{R}$, as well as a long list of open problems. The compendium is presented in the third part of our survey; a tour through the compendium and the areas it touches on makes up the second part. The first part introduces the reader to the existential theory of the reals as a complexity class, discussing its history, motivation and prospects as well as some technical aspects. - [250] arXiv:2407.18008 [pdf, html, other]
-
Title: GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and SycophancyComments: 12 pagesSubjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
LLMs are changing the way humans create and interact with content, potentially affecting citizens' political opinions and voting decisions. As LLMs increasingly shape our digital information ecosystems, auditing to evaluate biases, sycophancy, or steerability has emerged as an active field of research. In this paper, we evaluate and compare the alignment of six LLMs by OpenAI, Anthropic, and Cohere with German party positions and evaluate sycophancy based on a prompt experiment. We contribute to evaluating political bias and sycophancy in multi-party systems across major commercial LLMs. First, we develop the benchmark dataset GermanPartiesQA based on the Voting Advice Application Wahl-o-Mat covering 10 state and 1 national elections between 2021 and 2023. In our study, we find a left-green tendency across all examined LLMs. We then conduct our prompt experiment for which we use the benchmark and sociodemographic data of leading German parliamentarians to evaluate changes in LLMs responses. To differentiate between sycophancy and steerabilty, we use 'I am [politician X], ...' and 'You are [politician X], ...' prompts. Against our expectations, we do not observe notable differences between prompting 'I am' and 'You are'. While our findings underscore that LLM responses can be ideologically steered with political personas, they suggest that observed changes in LLM outputs could be better described as personalization to the given context rather than sycophancy.
- [251] arXiv:2407.18009 [pdf, html, other]
-
Title: Egocentric Robots in a Human-Centric World? Exploring Group-Robot-Interaction in Public SpacesComments: Accepted at the workshop on advancing Group Understanding and robots' adaptive behavior (GROUND), held at the Robotics Science and Systems (RSS) Conference, 2024Subjects: Robotics (cs.RO)
The deployment of social robots in real-world scenarios is increasing, supporting humans in various contexts. However, they still struggle to grasp social dynamics, especially in public spaces, sometimes resulting in violations of social norms, such as interrupting human conversations. This behavior, originating from a limited processing of social norms, might be perceived as robot-centered. Understanding social dynamics, particularly in group-robot-interactions (GRI), underscores the need for further research and development in human-robot-interaction (HRI). Enhancing the interaction abilities of social robots, especially in GRIs, can improve their effectiveness in real-world applications on a micro-level, as group interactions lead to increased motivation and comfort. In this study, we assessed the influence of the interaction condition (dyadic vs. triadic) on the perceived extraversion (ext.) of social robots in public spaces. The research involved 40 HRIs, including 24 dyadic (i.e., one human and one robot) interactions and 16 triadic interactions, which involve at least three entities, including the robot.
- [252] arXiv:2407.18010 [pdf, html, other]
-
Title: Stochastic Games with Minimally Bounded Action CostsComments: arXiv admin note: text overlap with arXiv:2205.15953Subjects: Multiagent Systems (cs.MA)
In many multi-player interactions, players incur strictly positive costs each time they execute actions e.g. 'menu costs' or transaction costs in financial systems. Since acting at each available opportunity would accumulate prohibitively large costs, the resulting decision problem is one in which players must make strategic decisions about when to execute actions in addition to their choice of action. This paper analyses a discrete-time stochastic game (SG) in which players face minimally bounded positive costs for each action and influence the system using impulse controls. We prove SGs of two-sided impulse control have a unique value and characterise the saddle point equilibrium in which the players execute actions at strategically chosen times in accordance with Markovian strategies. We prove the game respects a dynamic programming principle and that the Markov perfect equilibrium can be computed as a limit point of a sequence of Bellman operations. We then introduce a new Q-learning variant which we show converges almost surely to the value of the game enabling solutions to be extracted in unknown settings. Lastly, we extend our results to settings with budgetory constraints.
- [253] arXiv:2407.18011 [pdf, html, other]
-
Title: HANNA: Hard-constraint Neural Network for Consistent Activity Coefficient PredictionSubjects: Machine Learning (cs.LG)
We present the first hard-constraint neural network for predicting activity coefficients (HANNA), a thermodynamic mixture property that is the basis for many applications in science and engineering. Unlike traditional neural networks, which ignore physical laws and result in inconsistent predictions, our model is designed to strictly adhere to all thermodynamic consistency criteria. By leveraging deep-set neural networks, HANNA maintains symmetry under the permutation of the components. Furthermore, by hard-coding physical constraints in the network architecture, we ensure consistency with the Gibbs-Duhem equation and in modeling the pure components. The model was trained and evaluated on 317,421 data points for activity coefficients in binary mixtures from the Dortmund Data Bank, achieving significantly higher prediction accuracies than the current state-of-the-art model UNIFAC. Moreover, HANNA only requires the SMILES of the components as input, making it applicable to any binary mixture of interest. HANNA is fully open-source and available for free use.
- [254] arXiv:2407.18013 [pdf, html, other]
-
Title: Self-Supervision Improves Diffusion Models for Tabular Data ImputationComments: 10 pages, 5 figures. Accepted by CIKM 2024Subjects: Machine Learning (cs.LG)
The ubiquity of missing data has sparked considerable attention and focus on tabular data imputation methods. Diffusion models, recognized as the cutting-edge technique for data generation, demonstrate significant potential in tabular data imputation tasks. However, in pursuit of diversity, vanilla diffusion models often exhibit sensitivity to initialized noises, which hinders the models from generating stable and accurate imputation results. Additionally, the sparsity inherent in tabular data poses challenges for diffusion models in accurately modeling the data manifold, impacting the robustness of these models for data imputation. To tackle these challenges, this paper introduces an advanced diffusion model named Self-supervised imputation Diffusion Model (SimpDM for brevity), specifically tailored for tabular data imputation tasks. To mitigate sensitivity to noise, we introduce a self-supervised alignment mechanism that aims to regularize the model, ensuring consistent and stable imputation predictions. Furthermore, we introduce a carefully devised state-dependent data augmentation strategy within SimpDM, enhancing the robustness of the diffusion model when dealing with limited data. Extensive experiments demonstrate that SimpDM matches or outperforms state-of-the-art imputation methods across various scenarios.
- [255] arXiv:2407.18015 [pdf, html, other]
-
Title: Uncertainty Visualization of Critical Points of 2D Scalar Fields for Parametric and Nonparametric Probabilistic ModelsTushar M. Athawale, Zhe Wang, David Pugmire, Kenneth Moreland, Qian Gong, Scott Klasky, Chris R. Johnson, Paul RosenComments: 9 pages paper + 2 page references, 8 figures, IEEE VIS 2024 paper to be published as a special issue of IEEE Transactions on Visualization and Computer Graphics (TVCG)Subjects: Graphics (cs.GR)
This paper presents a novel end-to-end framework for closed-form computation and visualization of critical point uncertainty in 2D uncertain scalar fields. Critical points are fundamental topological descriptors used in the visualization and analysis of scalar fields. The uncertainty inherent in data (e.g., observational and experimental data, approximations in simulations, and compression), however, creates uncertainty regarding critical point positions. Uncertainty in critical point positions, therefore, cannot be ignored, given their impact on downstream data analysis tasks. In this work, we study uncertainty in critical points as a function of uncertainty in data modeled with probability distributions. Although Monte Carlo (MC) sampling techniques have been used in prior studies to quantify critical point uncertainty, they are often expensive and are infrequently used in production-quality visualization software. We, therefore, propose a new end-to-end framework to address these challenges that comprises a threefold contribution. First, we derive the critical point uncertainty in closed form, which is more accurate and efficient than the conventional MC sampling methods. Specifically, we provide the closed-form and semianalytical (a mix of closed-form and MC methods) solutions for parametric (e.g., uniform, Epanechnikov) and nonparametric models (e.g., histograms) with finite support. Second, we accelerate critical point probability computations using a parallel implementation with the VTK-m library, which is platform portable. Finally, we demonstrate the integration of our implementation with the ParaView software system to demonstrate near-real-time results for real datasets.
- [256] arXiv:2407.18022 [pdf, other]
-
Title: Learning mental states estimation through self-observation: a developmental synergy between intentions and beliefs representations in a deep-learning model of Theory of MindSubjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Theory of Mind (ToM), the ability to attribute beliefs, intentions, or mental states to others, is a crucial feature of human social interaction. In complex environments, where the human sensory system reaches its limits, behaviour is strongly driven by our beliefs about the state of the world around us. Accessing others' mental states, e.g., beliefs and intentions, allows for more effective social interactions in natural contexts. Yet, these variables are not directly observable, making understanding ToM a challenging quest of interest for different fields, including psychology, machine learning and robotics. In this paper, we contribute to this topic by showing a developmental synergy between learning to predict low-level mental states (e.g., intentions, goals) and attributing high-level ones (i.e., beliefs). Specifically, we assume that learning beliefs attribution can occur by observing one's own decision processes involving beliefs, e.g., in a partially observable environment. Using a simple feed-forward deep learning model, we show that, when learning to predict others' intentions and actions, more accurate predictions can be acquired earlier if beliefs attribution is learnt simultaneously. Furthermore, we show that the learning performance improves even when observed actors have a different embodiment than the observer and the gain is higher when observing beliefs-driven chunks of behaviour. We propose that our computational approach can inform the understanding of human social cognitive development and be relevant for the design of future adaptive social robots able to autonomously understand, assist, and learn from human interaction partners in novel natural environments and tasks.
- [257] arXiv:2407.18031 [pdf, html, other]
-
Title: $k$-Center Clustering in Distributed ModelsComments: Presented in SIROCCO'24 conferenceSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
The $k$-center problem is a central optimization problem with numerous applications for machine learning, data mining, and communication networks. Despite extensive study in various scenarios, it surprisingly has not been thoroughly explored in the traditional distributed setting, where the communication graph of a network also defines the distance metric.
We initiate the study of the $k$-center problem in a setting where the underlying metric is the graph's shortest path metric in three canonical distributed settings: the LOCAL, CONGEST, and CLIQUE models. Our results encompass constant-factor approximation algorithms and lower bounds in these models, as well as hardness results for the bi-criteria approximation setting. - [258] arXiv:2407.18033 [pdf, other]
-
Title: ECG Arrhythmia Detection Using Disease-specific Attention-based Deep Learning ModelComments: 11 pages, 5 figuresSubjects: Machine Learning (cs.LG)
The electrocardiogram (ECG) is one of the most commonly-used tools to diagnose cardiovascular disease in clinical practice. Although deep learning models have achieved very impressive success in the field of automatic ECG analysis, they often lack model interpretability that is significantly important in the healthcare applications. To this end, many schemes such as general-purpose attention mechanism, Grad-CAM technique and ECG knowledge graph were proposed to be integrated with deep learning models. However, they either result in decreased classification performance or do not consist with the one in cardiologists' mind when interpreting ECG. In this study, we propose a novel disease-specific attention-based deep learning model (DANet) for arrhythmia detection from short ECG recordings. The novel idea is to introduce a soft-coding or hard-coding waveform enhanced module into existing deep neural networks, which amends original ECG signals with the guidance of the rule for diagnosis of a given disease type before being fed into the classification module. For the soft-coding DANet, we also develop a learning framework combining self-supervised pre-training with two-stage supervised training. To verify the effectiveness of our proposed DANet, we applied it to the problem of atrial premature contraction detection and the experimental results shows that it demonstrates superior performance compared to the benchmark model. Moreover, it also provides the waveform regions that deserve special attention in the model's decision-making process, allowing it to be a medical diagnostic assistant for physicians.
- [259] arXiv:2407.18034 [pdf, html, other]
-
Title: AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the WildComments: Accepted by ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recently, there has been a significant amount of research conducted on 3D hand reconstruction to use various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to extreme lack of in-the-wild 3D hand datasets. Especially, when hands are in complex pose such as interacting hands, the problems like appearance similarity, self-handed occclusion and depth ambiguity make it more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate various and numerous in-the-wild hand images well-aligned with 3D hand label, we can acquire a new 3D hand dataset, and can relieve the domain gap between indoor and outdoor scenes. Our method needs easy-to-use four modalities (i.e, an RGB image, a hand mesh image from 3D label, a bounding box, and a text prompt). These modalities are embedded into the latent space by the encoding phase. Then, through the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. After the highlighted embedding is fed to the visual attention stage, hand-related regions in the embedding are attended by conditioning global and local hand mesh images with the diffusion-based pipeline. In the decoding phase, the final feature is decoded to new hand images, which are well-aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieved state-of-the-art among text-to-hand image generation models, and the performance of 3D hand mesh reconstruction was improved by additionally training with hand images generated by AttentionHand.
- [260] arXiv:2407.18035 [pdf, html, other]
-
Title: RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language ModelsHaoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei ZhuSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited range and often produce overly smooth, low-fidelity outcomes due to their broad data distribution fitting. To address these challenges, we first define a new pipeline for restoring images with multiple degradations, and then introduce RestoreAgent, an intelligent image restoration system leveraging multimodal large language models. RestoreAgent autonomously assesses the type and extent of degradation in input images and performs restoration through (1) determining the appropriate restoration tasks, (2) optimizing the task sequence, (3) selecting the most suitable models, and (4) executing the restoration. Experimental results demonstrate the superior performance of RestoreAgent in handling complex degradation, surpassing human experts. Furthermore, the system modular design facilitates the fast integration of new tasks and models, enhancing its flexibility and scalability for various applications.
- [261] arXiv:2407.18036 [pdf, html, other]
-
Title: Multi-View Structural Graph SummariesSubjects: Data Structures and Algorithms (cs.DS)
A structural graph summary is a small graph representation that preserves structural information necessary for a given task. The summary is used instead of the original graph to complete the task faster. We introduce multi-view structural graph summaries and propose an algorithm for merging two summaries. We conduct a theoretical analysis of our algorithm. We run experiments on three datasets, contributing two new ones. The datasets are of different domains (web graph, source code, and news) and sizes; the interpretation of multi-view depends on the domain and are pay-level domains on the web, control vs.\@ data flow of the code, and news broadcasters. We experiment with three graph summary models: attribute collection, class collection, and their combination. We observe that merging two structural summaries has an upper bound of quadratic complexity; but under reasonable assumptions, it has linear-time worst-case complexity. The running time of merging has a strong linear correlation with the number of edges in the two summaries. Therefore, the experiments support the assumption that the upper bound of quadratic complexity is not tight and that linear complexity is possible. Furthermore, our experiments show that always merging the two smallest summaries by the number of edges is the most efficient strategy for merging multiple structural summaries.
- [262] arXiv:2407.18038 [pdf, html, other]
-
Title: TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning FrameworkSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in our human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study introduces three novelties: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI and vKITTI2 datasets, along with qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function, and demonstrate its superior performance compared to prior arts, with a notable increase in mIoU by over 9%. Our source code will be publicly available at mias.group/TiCoSS upon publication.
- [263] arXiv:2407.18039 [pdf, html, other]
-
Title: Peak-Controlled Logits Poisoning Attack in Federated DistillationComments: arXiv admin note: text overlap with arXiv:2401.03685Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Federated Distillation (FD) offers an innovative approach to distributed machine learning, leveraging knowledge distillation for efficient and flexible cross-device knowledge transfer without necessitating the upload of extensive model parameters to a central server. While FD has gained popularity, its vulnerability to poisoning attacks remains underexplored. To address this gap, we previously introduced FDLA (Federated Distillation Logits Attack), a method that manipulates logits communication to mislead and degrade the performance of client models. However, the impact of FDLA on participants with different identities and the effects of malicious modifications at various stages of knowledge transfer remain unexplored. To this end, we present PCFDLA (Peak-Controlled Federated Distillation Logits Attack), an advanced and more stealthy logits poisoning attack method for FD. PCFDLA enhances the effectiveness of FDLA by carefully controlling the peak values of logits to create highly misleading yet inconspicuous modifications. Furthermore, we introduce a novel metric for better evaluating attack efficacy, demonstrating that PCFDLA maintains stealth while being significantly more disruptive to victim models compared to its predecessors. Experimental results across various datasets confirm the superior impact of PCFDLA on model accuracy, solidifying its potential threat in federated distillation systems.
- [264] arXiv:2407.18041 [pdf, html, other]
-
Title: How to Train the Teacher Model for Effective Knowledge DistillationComments: The paper was accepted at ECCV2024Subjects: Machine Learning (cs.LG)
Recently, it was shown that the role of the teacher in knowledge distillation (KD) is to provide the student with an estimate of the true Bayes conditional probability density (BCPD). Notably, the new findings propose that the student's error rate can be upper-bounded by the mean squared error (MSE) between the teacher's output and BCPD. Consequently, to enhance KD efficacy, the teacher should be trained such that its output is close to BCPD in MSE sense. This paper elucidates that training the teacher model with MSE loss equates to minimizing the MSE between its output and BCPD, aligning with its core responsibility of providing the student with a BCPD estimate closely resembling it in MSE terms. In this respect, through a comprehensive set of experiments, we demonstrate that substituting the conventional teacher trained with cross-entropy loss with one trained using MSE loss in state-of-the-art KD methods consistently boosts the student's accuracy, resulting in improvements of up to 2.6\%.
- [265] arXiv:2407.18042 [pdf, html, other]
-
Title: Lifelong Graph Summarization with Neural Networks: 2012, 2022, and a Time WarpSubjects: Machine Learning (cs.LG)
Summarizing web graphs is challenging due to the heterogeneity of the modeled information and its changes over time. We investigate the use of neural networks for lifelong graph summarization. Assuming we observe the web graph at a certain time, we train the networks to summarize graph vertices. We apply this trained network to summarize the vertices of the changed graph at the next point in time. Subsequently, we continue training and evaluating the network to perform lifelong graph summarization. We use the GNNs Graph-MLP and GraphSAINT, as well as an MLP baseline, to summarize the temporal graphs. We compare $1$-hop and $2$-hop summaries. We investigate the impact of reusing parameters from a previous snapshot by measuring the backward and forward transfer and the forgetting rate of the neural networks. Our extensive experiments on ten weekly snapshots of a web graph with over $100$M edges, sampled in 2012 and 2022, show that all networks predominantly use $1$-hop information to determine the summary, even when performing $2$-hop summarization. Due to the heterogeneity of web graphs, in some snapshots, the $2$-hop summary produces over ten times more vertex summaries than the $1$-hop summary. When using the network trained on the last snapshot from 2012 and applying it to the first snapshot of 2022, we observe a strong drop in accuracy. We attribute this drop over the ten-year time warp to the strongly increased heterogeneity of the web graph in 2022.
- [266] arXiv:2407.18043 [pdf, html, other]
-
Title: YOCO: You Only Calibrate Once for Accurate Extrinsic Parameter in LiDAR-Camera SystemsComments: IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENTSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
In a multi-sensor fusion system composed of cameras and LiDAR, precise extrinsic calibration contributes to the system's long-term stability and accurate perception of the environment. However, methods based on extracting and registering corresponding points still face challenges in terms of automation and precision. This paper proposes a novel fully automatic extrinsic calibration method for LiDAR-camera systems that circumvents the need for corresponding point registration. In our approach, a novel algorithm to extract required LiDAR correspondence point is proposed. This method can effectively filter out irrelevant points by computing the orientation of plane point clouds and extracting points by applying distance- and density-based thresholds. We avoid the need for corresponding point registration by introducing extrinsic parameters between the LiDAR and camera into the projection of extracted points and constructing co-planar constraints. These parameters are then optimized to solve for the extrinsic. We validated our method across multiple sets of LiDAR-camera systems. In synthetic experiments, our method demonstrates superior performance compared to current calibration techniques. Real-world data experiments further confirm the precision and robustness of the proposed algorithm, with average rotation and translation calibration errors between LiDAR and camera of less than 0.05 degree and 0.015m, respectively. This method enables automatic and accurate extrinsic calibration in a single one step, emphasizing the potential of calibration algorithms beyond using corresponding point registration to enhance the automation and precision of LiDAR-camera system calibration.
- [267] arXiv:2407.18044 [pdf, html, other]
-
Title: The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented GenerationComments: 22 pagesSubjects: Machine Learning (cs.LG)
Digital health chatbots powered by Large Language Models (LLMs) have the potential to significantly improve personal health management for chronic conditions by providing accessible and on-demand health coaching and question-answering. However, these chatbots risk providing unverified and inaccurate information because LLMs generate responses based on patterns learned from diverse internet data. Retrieval Augmented Generation (RAG) can help mitigate hallucinations and inaccuracies in LLM responses by grounding it on reliable content. However, efficiently and accurately retrieving most relevant set of content for real-time user questions remains a challenge. In this work, we introduce Query-Based Retrieval Augmented Generation (QB-RAG), a novel approach that pre-computes a database of potential queries from a content base using LLMs. For an incoming patient question, QB-RAG efficiently matches it against this pre-generated query database using vector search, improving alignment between user questions and the content. We establish a theoretical foundation for QB-RAG and provide a comparative analysis of existing retrieval enhancement techniques for RAG systems. Finally, our empirical evaluation demonstrates that QB-RAG significantly improves the accuracy of healthcare question answering, paving the way for robust and trustworthy LLM applications in digital health.
- [268] arXiv:2407.18046 [pdf, html, other]
-
Title: GaussianSR: High Fidelity 2D Gaussian Splatting for Arbitrary-Scale Image Super-ResolutionComments: 13 pages, 12 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Implicit neural representations (INRs) have significantly advanced the field of arbitrary-scale super-resolution (ASSR) of images. Most existing INR-based ASSR networks first extract features from the given low-resolution image using an encoder, and then render the super-resolved result via a multi-layer perceptron decoder. Although these approaches have shown promising results, their performance is constrained by the limited representation ability of discrete latent codes in the encoded features. In this paper, we propose a novel ASSR method named GaussianSR that overcomes this limitation through 2D Gaussian Splatting (2DGS). Unlike traditional methods that treat pixels as discrete points, GaussianSR represents each pixel as a continuous Gaussian field. The encoded features are simultaneously refined and upsampled by rendering the mutually stacked Gaussian fields. As a result, long-range dependencies are established to enhance representation ability. In addition, a classifier is developed to dynamically assign Gaussian kernels to all pixels to further improve flexibility. All components of GaussianSR (i.e., encoder, classifier, Gaussian kernels, and decoder) are jointly learned end-to-end. Experiments demonstrate that GaussianSR achieves superior ASSR performance with fewer parameters than existing methods while enjoying interpretable and content-aware feature aggregations.
- [269] arXiv:2407.18058 [pdf, html, other]
-
Title: I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognitionComments: Accepted to ISMIR 2024Subjects: Sound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Music two-tower multimodal systems integrate audio and text modalities into a joint audio-text space, enabling direct comparison between songs and their corresponding labels. These systems enable new approaches for classification and retrieval, leveraging both modalities. Despite the promising results they have shown for zero-shot classification and retrieval tasks, closer inspection of the embeddings is needed. This paper evaluates the inherent zero-shot properties of joint audio-text spaces for the case-study of instrument recognition. We present an evaluation and analysis of two-tower systems for zero-shot instrument recognition and a detailed analysis of the properties of the pre-joint and joint embeddings spaces. Our findings suggest that audio encoders alone demonstrate good quality, while challenges remain within the text encoder or joint space projection. Specifically, two-tower systems exhibit sensitivity towards specific words, favoring generic prompts over musically informed ones. Despite the large size of textual encoders, they do not yet leverage additional textual context or infer instruments accurately from their descriptions. Lastly, a novel approach for quantifying the semantic meaningfulness of the textual space leveraging an instrument ontology is proposed. This method reveals deficiencies in the systems' understanding of instruments and provides evidence of the need for fine-tuning text encoders on musical data.
- [270] arXiv:2407.18060 [pdf, html, other]
-
Title: Cross-Vendor Reproducibility of Radiomics-based Machine Learning Models for Computer-aided DiagnosisJatin Chaudhary, Ivan Jambor, Hannu Aronen, Otto Ettala, Jani Saunavaara, Peter Boström, Jukka Heikkonen, Rajeev Kanth, Harri MerisaariSubjects: Machine Learning (cs.LG)
Background: The reproducibility of machine-learning models in prostate cancer detection across different MRI vendors remains a significant challenge. Methods: This study investigates Support Vector Machines (SVM) and Random Forest (RF) models trained on radiomic features extracted from T2-weighted MRI images using Pyradiomics and MRCradiomics libraries. Feature selection was performed using the maximum relevance minimum redundancy (MRMR) technique. We aimed to enhance clinical decision support through multimodal learning and feature fusion. Results: Our SVM model, utilizing combined features from Pyradiomics and MRCradiomics, achieved an AUC of 0.74 on the Multi-Improd dataset (Siemens scanner) but decreased to 0.60 on the Philips test set. The RF model showed similar trends, with notable robustness for models using Pyradiomics features alone (AUC of 0.78 on Philips). Conclusions: These findings demonstrate the potential of multimodal feature integration to improve the robustness and generalizability of machine-learning models for clinical decision support in prostate cancer detection. This study marks a significant step towards developing reliable AI-driven diagnostic tools that maintain efficacy across various imaging platforms.
- [271] arXiv:2407.18061 [pdf, other]
-
Title: Difficulty Estimation and Simplification of French Text Using LLMsComments: 10 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
We leverage generative large language models for language learning applications, focusing on estimating the difficulty of foreign language texts and simplifying them to lower difficulty levels. We frame both tasks as prediction problems and develop a difficulty classification model using labeled examples, transfer learning, and large language models, demonstrating superior accuracy compared to previous approaches. For simplification, we evaluate the trade-off between simplification quality and meaning preservation, comparing zero-shot and fine-tuned performances of large language models. We show that meaningful text simplifications can be obtained with limited fine-tuning. Our experiments are conducted on French texts, but our methods are language-agnostic and directly applicable to other foreign languages.
- [272] arXiv:2407.18062 [pdf, html, other]
-
Title: Audio Entailment: Assessing Deductive Reasoning for Audio UnderstandingSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval, Captioning, and Question Answering. However, their ability to engage in more complex open-ended tasks, like Interactive Question-Answering, requires proficiency in logical reasoning -- a skill not yet benchmarked. We introduce the novel task of Audio Entailment to evaluate an ALM's deductive reasoning ability. This task assesses whether a text description (hypothesis) of audio content can be deduced from an audio recording (premise), with potential conclusions being entailment, neutral, or contradiction, depending on the sufficiency of the evidence. We create two datasets for this task with audio recordings sourced from two audio captioning datasets -- AudioCaps and Clotho -- and hypotheses generated using Large Language Models (LLMs). We benchmark state-of-the-art ALMs and find deficiencies in logical reasoning with both zero-shot and linear probe evaluations. Finally, we propose "caption-before-reason", an intermediate step of captioning that improves the zero-shot and linear-probe performance of ALMs by an absolute 6% and 3%, respectively.
- [273] arXiv:2407.18064 [pdf, html, other]
-
Title: ComPeer: A Generative Conversational Agent for Proactive Peer SupportComments: 22 pages (7 figures, 7 tables)Subjects: Human-Computer Interaction (cs.HC)
Conversational Agents (CAs) acting as peer supporters have been widely studied and demonstrated beneficial for people's mental health. However, previous peer support CAs either are user-initiated or follow predefined rules to initiate the conversations, which may discourage users to engage and build relationships with the CAs for long-term benefits. In this paper, we develop ComPeer, a generative CA that can proactively offer adaptive peer support to users. ComPeer leverages large language models to detect and reflect significant events in the dialogue, enabling it to strategically plan the timing and content of proactive care. In addition, ComPeer incorporates peer support strategies, conversation history, and its persona into the generative messages. Our one-week between-subjects study (N=24) demonstrates ComPeer's strength in providing peer support over time and boosting users' engagement compared to a baseline user-initiated CA.
- [274] arXiv:2407.18066 [pdf, html, other]
-
Title: Multi-Agent Deep Reinforcement Learning for Resilience Optimization in 5G RANSoumeya Kaada, Dinh-Hieu Tran, Nguyen Van Huynh, Marie-Line Alberi Morel, Sofiene Jelassi, Gerardo RubinoSubjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Resilience is defined as the ability of a network to resist, adapt, and quickly recover from disruptions, and to continue to maintain an acceptable level of services from users' perspective. With the advent of future radio networks, including advanced 5G and upcoming 6G, critical services become integral to future networks, requiring uninterrupted service delivery for end users. Unfortunately, with the growing network complexity, user mobility and diversity, it becomes challenging to scale current resilience management techniques that rely on local optimizations to large dense network deployments. This paper aims to address this problem by globally optimizing the resilience of a dense multi-cell network based on multi-agent deep reinforcement learning. Specifically, our proposed solution can dynamically tilt cell antennas and reconfigure transmit power to mitigate outages and increase both coverage and service availability. A multi-objective optimization problem is formulated to simultaneously satisfy resiliency constraints while maximizing the service quality in the network area in order to minimize the impact of outages on neighbouring cells. Extensive simulations then demonstrate that with our proposed solution, the average service availability in terms of user throughput can be increased by up to 50-60% on average, while reaching a coverage availability of 99% in best cases.
- [275] arXiv:2407.18067 [pdf, html, other]
-
Title: HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video dataComments: 10 pages, 5 figures, 1 table; code & models available from this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models in downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object representations compared to models pretrained with the image-based MAE algorithm on the same data, demonstrating the potential benefits of learning to predict temporal regularities in natural videos for learning better object representations.
- [276] arXiv:2407.18069 [pdf, html, other]
-
Title: C2P: Featuring Large Language Models with Causal ReasoningComments: arXiv admin note: text overlap with arXiv:2306.05836 by other authorsSubjects: Logic in Computer Science (cs.LO)
Causal reasoning is the primary bottleneck that Large Language Models (LLMs) must overcome to attain human-level intelligence. To address this, we introduce the Causal Chain of Prompting (C2P) as the first reasoning framework that equips current LLMs with causal reasoning capabilities. C2P operates autonomously, avoiding reliance on external tools or modules during both the causal learning and reasoning phases, and can be seamlessly implemented during the training or fine-tuning of LLMs. Experimental results across various benchmark datasets demonstrate a significant improvement in causal learning and subsequent reasoning accuracy of LLMs. We illustrate how C2P enhances LLMs' ability to causally reason in real-world scenarios, addressing complex problems in fields such as healthcare, medicine, economics, education, social sciences, environmental science, and marketing. With few-shot learning, GPT-4 Turbo using C2P with as few as six examples achieves significant performance improvements, boasting over a 33% increase in reasoning accuracy over the most state-of-the-art LLMs, which perform nearly randomly in similar circumstances. This demonstrates the transformative potential of integrating C2P into LLM training or fine-tuning processes, thereby empowering these models with advanced causal reasoning capabilities.
- [277] arXiv:2407.18074 [pdf, html, other]
-
Title: Principal-Agent Reinforcement LearningSubjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Contracts are the economic framework which allows a principal to delegate a task to an agent -- despite misaligned interests, and even without directly observing the agent's actions. In many modern reinforcement learning settings, self-interested agents learn to perform a multi-stage task delegated to them by a principal. We explore the significant potential of utilizing contracts to incentivize the agents. We model the delegated task as an MDP, and study a stochastic game between the principal and agent where the principal learns what contracts to use, and the agent learns an MDP policy in response. We present a learning-based algorithm for optimizing the principal's contracts, which provably converges to the subgame-perfect equilibrium of the principal-agent game. A deep RL implementation allows us to apply our method to very large MDPs with unknown transition dynamics. We extend our approach to multiple agents, and demonstrate its relevance to resolving a canonical sequential social dilemma with minimal intervention to agent rewards.
- [278] arXiv:2407.18078 [pdf, html, other]
-
Title: PEFT-U: Parameter-Efficient Fine-Tuning for User PersonalizationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The recent emergence of Large Language Models (LLMs) has heralded a new era of human-AI interaction. These sophisticated models, exemplified by Chat-GPT and its successors, have exhibited remarkable capabilities in language understanding. However, as these LLMs have undergone exponential growth, a crucial dimension that remains understudied is the personalization of these models. Large foundation models such as GPT-3 etc. focus on creating a universal model that serves a broad range of tasks and users. This approach emphasizes the model's generalization capabilities, treating users as a collective rather than as distinct individuals. While practical for many common applications, this one-size-fits-all approach often fails to address the rich tapestry of human diversity and individual needs. To explore this issue we introduce the PEFT-U Benchmark: a new dataset for building and evaluating NLP models for user personalization. \datasetname{} consists of a series of user-centered tasks containing diverse and individualized expressions where the preferences of users can potentially differ for the same input. Using PEFT-U, we explore the challenge of efficiently personalizing LLMs to accommodate user-specific preferences in the context of diverse user-centered tasks.
- [279] arXiv:2407.18085 [pdf, html, other]
-
Title: On the Design of Ethereum Data Availability Sampling: A Comprehensive Simulation StudyComments: 4 pages, 1 figureSubjects: Cryptography and Security (cs.CR)
This paper presents an in-depth exploration of Data Availability Sampling (DAS) and sharding mechanisms within decentralized systems through simulation-based analysis. DAS, a pivotal concept in blockchain technology and decentralized networks, is thoroughly examined to unravel its intricacies and assess its impact on system performance. Through the development of a simulator tailored explicitly for DAS, we embark on a comprehensive investigation into the parameters that influence system behavior and efficiency. A series of experiments are conducted within the simulated environment to validate theoretical formulations and dissect the interplay of DAS parameters. This includes an exploration of approaches such as custody by row, variations in validators per node, and malicious nodes. The outcomes of these experiments furnish insights into the efficacy of DAS protocols and pave the way for the formulation of optimization strategies geared towards enhancing decentralized network performance. Moreover, the findings serve as guidelines for future research endeavors, offering a nuanced understanding of the complexities inherent in decentralized systems. This study not only contributes to the theoretical understanding of DAS but also offers practical implications for the design, implementation, and optimization of decentralized systems.
- [280] arXiv:2407.18086 [pdf, html, other]
-
Title: Revealing urban area from mobile positioning dataSubjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Researchers face the trade-off between publishing mobility data along with their papers while simultaneously protecting the privacy of the individuals. In addition to the fundamental anonymization process, other techniques, such as spatial discretization and, in certain cases, location concealing or complete removal, are applied to achieve these dual objectives. The primary research question is whether concealing the observation area is an adequate form of protection or whether human mobility patterns in urban areas are inherently revealing of location. The characteristics of the mobility data, such as the number of activity records or the number of unique users in a given spatial unit, reveal the silhouette of the urban landscape, which can be used to infer the identity of the city in question. It was demonstrated that even without disclosing the exact location, the patterns of human mobility can still reveal the urban area from which the data was collected. The presented locating method was tested on other cities using different open data sets and against coarser spatial discretization units. While publishing mobility data is essential for research, it was demonstrated that concealing the observation area is insufficient to prevent the identification of the urban area. Furthermore, using larger discretization units alone is an ineffective solution to the problem of the observation area re-identification. Instead of obscuring the observation area, noise should be added to the trajectories to prevent user identification.
- [281] arXiv:2407.18090 [pdf, other]
-
Title: On the Minimisation of Deterministic and History-Deterministic Generalised (co)B\"uchi AutomataSubjects: Formal Languages and Automata Theory (cs.FL)
We present a polynomial-time algorithm minimising the number of states of history-deterministic generalised coBüchi automata, building on the work of Abu Radi and Kupferman on coBüchi automata. On the other hand, we establish that the minimisation problem for both deterministic and history-deterministic generalised Büchi automata is NP-complete, as well as the problem of minimising at the same time the number of states and colours of history-deterministic generalised coBüchi automata.
- [282] arXiv:2407.18092 [pdf, html, other]
-
Title: Strategic Cost Selection in Participatory BudgetingPiotr Faliszewski, Łukasz Janeczko, Andrzej Kaczmarczyk, Grzegorz Lisowski, Piotr Skowron, Stanisław SzufaComments: 31 pagesSubjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
We study strategic behavior of project proposers in the context of approval-based participatory budgeting (PB). In our model we assume that the votes are fixed and known and the proposers want to set as high project prices as possible, provided that their projects get selected and the prices are not below the minimum costs of their delivery. We study the existence of pure Nash equilibria (NE) in such games, focusing on the AV/Cost, Phragmén, and Method of Equal Shares rules. Furthermore, we report an experimental study of strategic cost selection on real-life PB election data.
- [283] arXiv:2407.18096 [pdf, html, other]
-
Title: Privacy Threats and Countermeasures in Federated Learning for Internet of Things: A Systematic ReviewSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Federated Learning (FL) in the Internet of Things (IoT) environments can enhance machine learning by utilising decentralised data, but at the same time, it might introduce significant privacy and security concerns due to the constrained nature of IoT devices. This represents a research challenge that we aim to address in this paper. We systematically analysed recent literature to identify privacy threats in FL within IoT environments, and evaluate the defensive measures that can be employed to mitigate these threats. Using a Systematic Literature Review (SLR) approach, we searched five publication databases (Scopus, IEEE Xplore, Wiley, ACM, and Science Direct), collating relevant papers published between 2017 and April 2024, a period which spans from the introduction of FL until now. Guided by the PRISMA protocol, we selected 49 papers to focus our systematic review on. We analysed these papers, paying special attention to the privacy threats and defensive measures -- specifically within the context of IoT -- using inclusion and exclusion criteria tailored to highlight recent advances and critical insights. We identified various privacy threats, including inference attacks, poisoning attacks, and eavesdropping, along with defensive measures such as Differential Privacy and Secure Multi-Party Computation. These defences were evaluated for their effectiveness in protecting privacy without compromising the functional integrity of FL in IoT settings. Our review underscores the necessity for robust and efficient privacy-preserving strategies tailored for IoT environments. Notably, there is a need for strategies against replay, evasion, and model stealing attacks. Exploring lightweight defensive measures and emerging technologies such as blockchain may help improve the privacy of FL in IoT, leading to the creation of FL models that can operate under variable network conditions.
- [284] arXiv:2407.18097 [pdf, html, other]
-
Title: SSTD: Stripe-Like Space Target Detection using Single-Point SupervisionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Stripe-like space target detection (SSTD) plays a key role in enhancing space situational awareness and assessing spacecraft behaviour. This domain faces three challenges: the lack of publicly available datasets, interference from stray light and stars, and the variability of stripe-like targets, which complicates pixel-level annotation. In response, we introduces `AstroStripeSet', a pioneering dataset designed for SSTD, aiming to bridge the gap in academic resources and advance research in SSTD. Furthermore, we propose a novel pseudo-label evolution teacher-student framework with single-point supervision. This framework starts with generating initial pseudo-labels using the zero-shot capabilities of the Segment Anything Model (SAM) in a single-point setting, and refines these labels iteratively. In our framework, the fine-tuned StripeSAM serves as the teacher and the newly developed StripeNet as the student, consistently improving segmentation performance by improving the quality of pseudo-labels. We also introduce `GeoDice', a new loss function customized for the linear characteristics of stripe-like targets. Extensive experiments show that the performance of our approach matches fully supervised methods on all evaluation metrics, establishing a new state-of-the-art (SOTA) benchmark. Our dataset and code will be made publicly available.
- [285] arXiv:2407.18098 [pdf, html, other]
-
Title: Unraveling the Web of Disinformation: Exploring the Larger Context of State-Sponsored Influence Campaigns on TwitterJournal-ref: International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2024)Subjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Social media platforms offer unprecedented opportunities for connectivity and exchange of ideas; however, they also serve as fertile grounds for the dissemination of disinformation. Over the years, there has been a rise in state-sponsored campaigns aiming to spread disinformation and sway public opinion on sensitive topics through designated accounts, known as troll accounts. Past works on detecting accounts belonging to state-backed operations focus on a single campaign. While campaign-specific detection techniques are easier to build, there is no work done on developing systems that are campaign-agnostic and offer generalized detection of troll accounts unaffected by the biases of the specific campaign they belong to. In this paper, we identify several strategies adopted across different state actors and present a system that leverages them to detect accounts from previously unseen campaigns. We study 19 state-sponsored disinformation campaigns that took place on Twitter, originating from various countries. The strategies include sending automated messages through popular scheduling services, retweeting and sharing selective content and using fake versions of verified applications for pushing content. By translating these traits into a feature set, we build a machine learning-based classifier that can correctly identify up to 94% of accounts from unseen campaigns. Additionally, we run our system in the wild and find more accounts that could potentially belong to state-backed operations. We also present case studies to highlight the similarity between the accounts found by our system and those identified by Twitter.
- [286] arXiv:2407.18099 [pdf, html, other]
-
Title: Pose, Velocity and Landmark Position Estimation Using IMU and Bearing MeasurementsComments: 8 pages, 3 figuresSubjects: Systems and Control (eess.SY); Robotics (cs.RO)
This paper investigates the estimation problem of the pose (orientation and position) and linear velocity of a rigid body, as well as the landmark positions, using an inertial measurement unit (IMU) and a monocular camera. First, we propose a globally exponentially stable (GES) linear time-varying (LTV) observer for the estimation of body-frame landmark positions and velocity, using IMU and monocular bearing measurements. Thereafter, using the gyro measurements, some landmarks known in the inertial frame and the estimates from the LTV observer, we propose a nonlinear pose observer on $\SO(3)\times \mathbb{R}^3$. The overall estimation system is shown to be almost globally asymptotically stable (AGAS) using the notion of almost global input-to-state stability (ISS). Interestingly, we show that with the knowledge (in the inertial frame) of a small number of landmarks, we can recover (under some conditions) the unknown positions (in the inertial frame) of a large number of landmarks. Numerical simulation results are presented to illustrate the performance of the proposed estimation scheme.
- [287] arXiv:2407.18100 [pdf, html, other]
-
Title: DINOv2 Rocks Geological Image Analysis: Classification, Segmentation, and InterpretabilitySubjects: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
This study investigates the interpretability, classification, and segmentation of CT-scan images of rock samples, with a particular focus on the application of DINOv2 within Geosciences. We compared various segmentation techniques to evaluate their efficacy, efficiency, and adaptability in geological image analysis. The methods assessed include the Otsu thresholding method, clustering techniques (K-means and fuzzy C-means), a supervised machine learning approach (Random Forest), and deep learning methods (UNet and DINOv2). We tested these methods using ten binary sandstone datasets and three multi-class calcite datasets. To begin, we provide a thorough interpretability analysis of DINOv2's features in the geoscientific context, discussing its suitability and inherent ability to process CT-scanned rock data. In terms of classification, the out-of-the-box DINOv2 demonstrates an impressive capability to perfectly classify rock images, even when the CT scans are out of its original training set. Regarding segmentation, thresholding and unsupervised methods, while fast, perform poorly despite image preprocessing, whereas supervised methods show better results. We underscore the computational demands of deep learning but highlight its minimal intervention, superior generalization, and performance without additional image preprocessing. Additionally, we observe a lack of correlation between a network's depth or the number of parameters and its performance. Our results show that a LoRA fine-tuned DINOv2 excels in out-of-distribution segmentation and significantly outperforms other methods in multi-class segmentation. By systematically comparing these methods, we identify the most efficient strategy for meticulous and laborious segmentation tasks. DINOv2 proves advantageous, achieving segmentations that could be described as "better than ground-truth" against relatively small training sets.
- [288] arXiv:2407.18108 [pdf, html, other]
-
Title: Graph Neural Ordinary Differential Equations for Coarse-Grained Socioeconomic DynamicsJames Koch, Pranab Roy Chowdhury, Heng Wan, Parin Bhaduri, Jim Yoon, Vivek Srikrishnan, W. Brent DanielSubjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
We present a data-driven machine-learning approach for modeling space-time socioeconomic dynamics. Through coarse-graining fine-scale observations, our modeling framework simplifies these complex systems to a set of tractable mechanistic relationships -- in the form of ordinary differential equations -- while preserving critical system behaviors. This approach allows for expedited 'what if' studies and sensitivity analyses, essential for informed policy-making. Our findings, from a case study of Baltimore, MD, indicate that this machine learning-augmented coarse-grained model serves as a powerful instrument for deciphering the complex interactions between social factors, geography, and exogenous stressors, offering a valuable asset for system forecasting and resilience planning.
- [289] arXiv:2407.18110 [pdf, html, other]
-
Title: MapTune: Advancing ASIC Technology Mapping via Reinforcement Learning Guided Library TuningComments: IEEE/ACM International Conference on Computer-Aided Design (ICCAD '24), October 27--31, 2024Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Technology mapping involves mapping logical circuits to a library of cells. Traditionally, the full technology library is used, leading to a large search space and potential overhead. Motivated by randomly sampled technology mapping case studies, we propose MapTune framework that addresses this challenge by utilizing reinforcement learning to make design-specific choices during cell selection. By learning from the environment, MapTune refines the cell selection process, resulting in a reduced search space and potentially improved mapping quality.
The effectiveness of MapTune is evaluated on a wide range of benchmarks, different technology libraries and technology mappers. The experimental results demonstrate that MapTune achieves higher mapping accuracy and reducing delay/area across diverse circuit designs, technology libraries and mappers. The paper also discusses the Pareto-Optimal exploration and confirms the perpetual delay-area trade-off. Conducted on benchmark suites ISCAS 85/89, ITC/ISCAS 99, VTR8.0 and EPFL benchmarks, the post-technology mapping and post-sizing quality-of-results (QoR) have been significantly improved, with average Area-Delay Product (ADP) improvement of 22.54\% among all different exploration settings in MapTune. The improvements are consistently remained for four different technologies (7nm, 45nm, 130nm, and 180 nm) and two different mappers. - [290] arXiv:2407.18112 [pdf, html, other]
-
Title: Keypoint Promptable Re-IdentificationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Occluded Person Re-Identification (ReID) is a metric learning task that involves matching occluded individuals based on their appearance. While many studies have tackled occlusions caused by objects, multi-person occlusions remain less explored. In this work, we identify and address a critical challenge overlooked by previous occluded ReID methods: the Multi-Person Ambiguity (MPA) arising when multiple individuals are visible in the same bounding box, making it impossible to determine the intended ReID target among the candidates. Inspired by recent work on prompting in vision, we introduce Keypoint Promptable ReID (KPR), a novel formulation of the ReID problem that explicitly complements the input bounding box with a set of semantic keypoints indicating the intended target. Since promptable re-identification is an unexplored paradigm, existing ReID datasets lack the pixel-level annotations necessary for prompting. To bridge this gap and foster further research on this topic, we introduce Occluded-PoseTrack ReID, a novel ReID dataset with keypoints labels, that features strong inter-person occlusions. Furthermore, we release custom keypoint labels for four popular ReID benchmarks. Experiments on person retrieval, but also on pose tracking, demonstrate that our method systematically surpasses previous state-of-the-art approaches on various occluded scenarios. Our code, dataset and annotations are available at this https URL.
- [291] arXiv:2407.18114 [pdf, html, other]
-
Title: Unsupervised Training of Neural Cellular Automata on Edge DevicesSubjects: Machine Learning (cs.LG)
The disparity in access to machine learning tools for medical imaging across different regions significantly limits the potential for universal healthcare innovation, particularly in remote areas. Our research addresses this issue by implementing Neural Cellular Automata (NCA) training directly on smartphones for accessible X-ray lung segmentation. We confirm the practicality and feasibility of deploying and training these advanced models on five Android devices, improving medical diagnostics accessibility and bridging the tech divide to extend machine learning benefits in medical imaging to low- and middle-income countries (LMICs). We further enhance this approach with an unsupervised adaptation method using the novel Variance-Weighted Segmentation Loss (VWSL), which efficiently learns from unlabeled data by minimizing the variance from multiple NCA predictions. This strategy notably improves model adaptability and performance across diverse medical imaging contexts without the need for extensive computational resources or labeled datasets, effectively lowering the participation threshold. Our methodology, tested on three multisite X-ray datasets -- Padchest, ChestX-ray8, and MIMIC-III -- demonstrates improvements in segmentation Dice accuracy by 0.7 to 2.8%, compared to the classic Med-NCA. Additionally, in extreme cases where no digital copy is available and images must be captured by a phone from an X-ray lightbox or monitor, VWSL enhances Dice accuracy by 5-20%, demonstrating the method's robustness even with suboptimal image sources.
- [292] arXiv:2407.18119 [pdf, html, other]
-
Title: Tracking linguistic information in transformer-based sentence embeddings through targeted sparsificationComments: 12 pages, 9 figures, 1 table, published in RepL4NLP 2024Subjects: Computation and Language (cs.CL)
Analyses of transformer-based models have shown that they encode a variety of linguistic information from their textual input. While these analyses have shed a light on the relation between linguistic information on one side, and internal architecture and parameters on the other, a question remains unanswered: how is this linguistic information reflected in sentence embeddings? Using datasets consisting of sentences with known structure, we test to what degree information about chunks (in particular noun, verb or prepositional phrases), such as grammatical number, or semantic role, can be localized in sentence embeddings. Our results show that such information is not distributed over the entire sentence embedding, but rather it is encoded in specific regions. Understanding how the information from an input text is compressed into sentence embeddings helps understand current transformer models and help build future explainable neural models.
- [293] arXiv:2407.18121 [pdf, html, other]
-
Title: Efficient Inference of Vision Instruction-Following Models with Elastic CacheZuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen LuComments: Accepted to ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
In the field of instruction-following large vision-language models (LVLMs), the efficient deployment of these models faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. Recognizing this gap, in this paper, we introduce Elastic Cache, a novel approach that benefits from applying distinct acceleration methods for instruction encoding and output generation stages. We investigate the metrics of importance in different stages and propose an importance-driven cache merging strategy to prune redundancy caches. Instead of discarding less important caches, our strategy identifies important key/value vectors as anchor points. Surrounding less important caches are then merged with these anchors, enhancing the preservation of contextual information in the KV caches while yielding an arbitrary acceleration ratio. For instruction encoding, we utilize the frequency to evaluate the importance of caches. Regarding output generation, we prioritize tokens based on their distance with an offset, by which both the initial and most recent tokens are retained. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation across various tasks. Code is available at this https URL
- [294] arXiv:2407.18122 [pdf, html, other]
-
Title: On de Bruijn Arrays Codes, Part I: Nonlinear CodesSubjects: Information Theory (cs.IT)
A de Bruijn arrays code is a set of $r \times s$ binary doubly-periodic arrays such that each binary $n \times m$ matrix is contained exactly once as a window in one of the arrays. Such a set of arrays can be viewed as a two-dimensional generalization of a perfect factor in the de Bruijn graph. Necessary conditions for the existence of such arrays are given. Several direct constructions and recursive constructions for such arrays are given. A framework for a theory of two-dimensional feedback shift register which is akin to (one-dimensional) feedback shift registers is suggested.
- [295] arXiv:2407.18124 [pdf, html, other]
-
Title: PIR Codes, Unequal-Data-Demand Codes, and the Griesmer BoundComments: 11 pagesJournal-ref: Proceedings WCC 2024Subjects: Information Theory (cs.IT); Combinatorics (math.CO)
Unequal Error-Protecting (UEP) codes are error-correcting (EC) codes designed to protect some parts of the encoded data better than other parts. Here, we introduce a similar generalization of PIR codes that we call Unequal-Data-Demand (UDD) PIR codes. These codes are PIR-type codes designed for the scenario where some parts of the encoded data are in higher demand than other parts. We generalize various results for PIR codes to UDD codes. Our main contribution is a new approach to the Griesmer bound for linear EC codes involving an Integer Linear Programming (ILP) problem that generalizes to linear UEP codes and linear UDD PIR codes.
- [296] arXiv:2407.18125 [pdf, html, other]
-
Title: Self-supervised pre-training with diffusion model for few-shot landmark detection in x-ray imagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In the last few years, deep neural networks have been extensively applied in the medical domain for different tasks, ranging from image classification and segmentation to landmark detection. However, the application of these technologies in the medical domain is often hindered by data scarcity, both in terms of available annotations and images. This study introduces a new self-supervised pre-training protocol based on diffusion models for landmark detection in x-ray images. Our results show that the proposed self-supervised framework can provide accurate landmark detection with a minimal number of available annotated training images (up to 50), outperforming ImageNet supervised pre-training and state-of-the-art self-supervised pre-trainings for three popular x-ray benchmark datasets. To our knowledge, this is the first exploration of diffusion models for self-supervised learning in landmark detection, which may offer a valuable pre-training approach in few-shot regimes, for mitigating data scarcity.
- [297] arXiv:2407.18128 [pdf, html, other]
-
Title: Estimating Earthquake Magnitude in Sentinel-1 Imagery via RankingSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Earthquakes are commonly estimated using physical seismic stations, however, due to the installation requirements and costs of these stations, global coverage quickly becomes impractical. An efficient and lower-cost alternative is to develop machine learning models to globally monitor earth observation data to pinpoint regions impacted by these natural disasters. However, due to the small amount of historically recorded earthquakes, this becomes a low-data regime problem requiring algorithmic improvements to achieve peak performance when learning to regress earthquake magnitude. In this paper, we propose to pose the estimation of earthquake magnitudes as a metric-learning problem, training models to not only estimate earthquake magnitude from Sentinel-1 satellite imagery but to additionally rank pairwise samples. Our experiments show at max a 30%+ improvement in MAE over prior regression-only based methods, particularly transformer-based architectures.
- [298] arXiv:2407.18129 [pdf, html, other]
-
Title: Dallah: A Dialect-Aware Multimodal Large Language Model for ArabicSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.
- [299] arXiv:2407.18131 [pdf, html, other]
-
Title: Reachability for Multi-Priced Timed Automata with Positive and Negative RatesSubjects: Formal Languages and Automata Theory (cs.FL)
Multi-priced timed automata (MPTA) are timed automata with observer
variables whose derivatives can change from one location to another.
Observers are write-only variables, that is, they do not affect the control
flow of the automaton; thus MPTA lie between timed and hybrid
automata in expressiveness. Previous work considered observers with
non-negative slope in every location. In this paper we treat
observers that have both positive and negative rates. Our
main result is an algorithm to decide a gap version of the
reachability problem for this variant of MPTA. We translate the
gap reachability problem into a gap satisfiability problem for mixed
integer-real systems of nonlinear constraints. Our main technical
contribution -- a result of independent interest -- is a procedure
to solve such contraints via a combination of branch-and-bound
and relaxation-and-rounding. - [300] arXiv:2407.18134 [pdf, html, other]
-
Title: $\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity GraphsVlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Diane Bouchacourt, Pietro Astolfi, Kyunghyun Cho, Yann LeCunSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Learning good representations involves capturing the diverse ways in which data samples relate. Contrastive loss - an objective matching related samples - underlies methods from self-supervised to multimodal learning. Contrastive losses, however, can be viewed more broadly as modifying a similarity graph to indicate how samples should relate in the embedding space. This view reveals a shortcoming in contrastive learning: the similarity graph is binary, as only one sample is the related positive sample. Crucially, similarities \textit{across} samples are ignored. Based on this observation, we revise the standard contrastive loss to explicitly encode how a sample relates to others. We experiment with this new objective, called $\mathbb{X}$-Sample Contrastive, to train vision models based on similarities in class or text caption descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks. When training on CC12M, we outperform CLIP by $0.6\%$ on both ImageNet and ImageNet Real. Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of $16.8\%$ on ImageNet and $18.1\%$ on ImageNet Real when training with CC3M. Finally, our objective seems to encourage the model to learn representations that separate objects from their attributes and backgrounds, with gains of $3.3$-$5.6$\% over CLIP on ImageNet9. We hope the proposed solution takes a small step towards developing richer learning objectives for understanding sample relations in foundation models.
- [301] arXiv:2407.18137 [pdf, html, other]
-
Title: XS-VID: An Extremely Small Video Object Detection DatasetSubjects: Computer Vision and Pattern Recognition (cs.CV)
Small Video Object Detection (SVOD) is a crucial subfield in modern computer vision, essential for early object discovery and detection. However, existing SVOD datasets are scarce and suffer from issues such as insufficiently small objects, limited object categories, and lack of scene diversity, leading to unitary application scenarios for corresponding methods. To address this gap, we develop the XS-VID dataset, which comprises aerial data from various periods and scenes, and annotates eight major object categories. To further evaluate existing methods for detecting extremely small objects, XS-VID extensively collects three types of objects with smaller pixel areas: extremely small (\textit{es}, $0\sim12^2$), relatively small (\textit{rs}, $12^2\sim20^2$), and generally small (\textit{gs}, $20^2\sim32^2$). XS-VID offers unprecedented breadth and depth in covering and quantifying minuscule objects, significantly enriching the scene and object diversity in the dataset. Extensive validations on XS-VID and the publicly available VisDrone2019VID dataset show that existing methods struggle with small object detection and significantly underperform compared to general object detectors. Leveraging the strengths of previous methods and addressing their weaknesses, we propose YOLOFT, which enhances local feature associations and integrates temporal motion features, significantly improving the accuracy and stability of SVOD. Our datasets and benchmarks are available at \url{this https URL}.
- [302] arXiv:2407.18140 [pdf, html, other]
-
Title: Influence Vectors Control for Robots Using Cellular-like Binary ActuatorsJournal-ref: IEEE Transactions on Robotics ( Volume: 30, Issue: 3, June 2014)Subjects: Robotics (cs.RO)
Robots using cellular-like redundant binary actuators could outmatch electric-gearmotor robotic systems in terms of reliability, force-to-weight ratio and cost. This paper presents a robust fault tolerant control scheme that is designed to meet the control challenges encountered by such robots, i.e., discrete actuator inputs, complex system modeling and cross-coupling between actuators. In the proposed scheme, a desired vectorial system output, such as a position or a force, is commanded by recruiting actuators based on their influence vectors on the output. No analytical model of the system is needed; influence vectors are identified experimentally by sequentially activating each actuator. For position control tasks, the controller uses a probabilistic approach and a genetic algorithm to determine an optimal combination of actuators to recruit. For motion control tasks, the controller uses a sliding mode approach and independent recruiting decision for each actuator. Experimental results on a four degrees of freedom binary manipulator with twenty actuators confirm the method's effectiveness, and its ability to tolerate massive perturbations and numerous actuator failures.
- [303] arXiv:2407.18141 [pdf, html, other]
-
Title: IRIS: Wireless Ring for Vision-based Smart Home InteractionMaruchi Kim, Antonio Glenn, Bandhav Veluri, Yunseo Lee, Eyoel Gebre, Aditya Bagaria, Shwetak Patel, Shyamnath GollakotaComments: 15 pages, 17 figures, 6 tables, to be published in UIST 2024Subjects: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Integrating cameras into wireless smart rings has been challenging due to size and power constraints. We introduce IRIS, the first wireless vision-enabled smart ring system for smart home interactions. Equipped with a camera, Bluetooth radio, inertial measurement unit (IMU), and an onboard battery, IRIS meets the small size, weight, and power (SWaP) requirements for ring devices. IRIS is context-aware, adapting its gesture set to the detected device, and can last for 16-24 hours on a single charge. IRIS leverages the scene semantics to achieve instance-level device recognition. In a study involving 23 participants, IRIS consistently outpaced voice commands, with a higher proportion of participants expressing a preference for IRIS over voice commands regarding toggling a device's state, granular control, and social acceptability. Our work pushes the boundary of what is possible with ring form-factor devices, addressing system challenges and opening up novel interaction capabilities.
- [304] arXiv:2407.18143 [pdf, html, other]
-
Title: Maximum Entropy On-Policy Actor-Critic via Entropy Advantage EstimationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Entropy Regularisation is a widely adopted technique that enhances policy optimisation performance and stability. A notable form of entropy regularisation is augmenting the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy. This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes. However, its practical application in straightforward on-policy actor-critic settings remains surprisingly underexplored. We hypothesise that this is due to the difficulty of managing the entropy reward in practice. This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings. Our empirical evaluations demonstrate that extending Proximal Policy Optimisation (PPO) and Trust Region Policy Optimisation (TRPO) within the MaxEnt framework improves policy optimisation performance in both MuJoCo and Procgen tasks. Additionally, our results highlight MaxEnt RL's capacity to enhance generalisation.
- [305] arXiv:2407.18145 [pdf, html, other]
-
Title: Taxonomy-Aware Continual Semantic Segmentation in Hyperbolic Spaces for Open-World PerceptionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Semantic segmentation models are typically trained on a fixed set of classes, limiting their applicability in open-world scenarios. Class-incremental semantic segmentation aims to update models with emerging new classes while preventing catastrophic forgetting of previously learned ones. However, existing methods impose strict rigidity on old classes, reducing their effectiveness in learning new incremental classes. In this work, we propose Taxonomy-Oriented Poincaré-regularized Incremental-Class Segmentation (TOPICS) that learns feature embeddings in hyperbolic space following explicit taxonomy-tree structures. This supervision provides plasticity for old classes, updating ancestors based on new classes while integrating new classes at fitting positions. Additionally, we maintain implicit class relational constraints on the geometric basis of the Poincaré ball. This ensures that the latent space can continuously adapt to new constraints while maintaining a robust structure to combat catastrophic forgetting. We also establish eight realistic incremental learning protocols for autonomous driving scenarios, where novel classes can originate from known classes or the background. Extensive evaluations of TOPICS on the Cityscapes and Mapillary Vistas 2.0 benchmarks demonstrate that it achieves state-of-the-art performance. We make the code and trained models publicly available at this http URL.
- [306] arXiv:2407.18146 [pdf, other]
-
Title: Adaptable Deep Joint Source-and-Channel Coding for Small Satellite ApplicationsSubjects: Networking and Internet Architecture (cs.NI)
Earth observation with small satellites serves a wide range of relevant applications. However, significant advances in sensor technology (e.g., higher resolution, multiple spectrums beyond visible light) in combination with challenging channel characteristics lead to a communication bottleneck when transmitting the collected data to Earth. Recently, joint source coding, channel coding, and modulation based on neuronal networks has been proposed to combine image compression and communication. Though this approach achieves promising results when applied to standard terrestrial channel models, it remains an open question whether it is suitable for the more complicated and quickly varying satellite communication channel. In this paper, we consider a detailed satellite channel model accounting for different shadowing conditions and train an encoder-decoder architecture with realistic Sentinel-2 satellite imagery. In addition, to reduce the overhead associated with applying multiple neural networks for various channel states, we leverage attention modules and train a single adaptable neural network that covers a wide range of different channel conditions. Our evaluation results show that the proposed approach achieves similar performance when compared to less space-efficient schemes that utilize separate neuronal networks for differing channel conditions.
- [307] arXiv:2407.18147 [pdf, html, other]
-
Title: The FIGNEWS Shared Task on News Media NarrativesWajdi Zaghouani (1), Mustafa Jarrar (2), Nizar Habash (3), Houda Bouamor (4), Imed Zitouni (5), Mona Diab (6), Samhaa R. El-Beltagy (7), Muhammed AbuOdeh (3) ((1) Northwestern University in Qatar, (2) Birzeit University, (3) New York University Abu Dhabi, (4) Carnegie Mellon University Qatar, (5) Google, (6) Carnegie Mellon University, (7) Newgiza University)Comments: 18 pages, 10 tables, 1 figure, accepted to ArabicNLP 2024 co-located with ACL 2024Subjects: Computation and Language (cs.CL)
We present an overview of the FIGNEWS shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. The shared task addresses bias and propaganda annotation in multilingual news posts. We focus on the early days of the Israel War on Gaza as a case study. The task aims to foster collaboration in developing annotation guidelines for subjective tasks by creating frameworks for analyzing diverse narratives highlighting potential bias and propaganda. In a spirit of fostering and encouraging diversity, we address the problem from a multilingual perspective, namely within five languages: English, French, Arabic, Hebrew, and Hindi. A total of 17 teams participated in two annotation subtasks: bias (16 teams) and propaganda (6 teams). The teams competed in four evaluation tracks: guidelines development, annotation quality, annotation quantity, and consistency. Collectively, the teams produced 129,800 data points. Key findings and implications for the field are discussed.
- [308] arXiv:2407.18148 [pdf, html, other]
-
Title: StraightLine: An End-to-End Resource-Aware Scheduler for Machine Learning Application RequestsComments: 6 pages, 8 figures, to appear in AIoTC'24Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
The life cycle of machine learning (ML) applications consists of two stages: model development and model deployment. However, traditional ML systems (e.g., training-specific or inference-specific systems) focus on one particular stage or phase of the life cycle of ML applications. These systems often aim at optimizing model training or accelerating model inference, and they frequently assume homogeneous infrastructure, which may not always reflect real-world scenarios that include cloud data centers, local servers, containers, and serverless platforms. We present StraightLine, an end-to-end resource-aware scheduler that schedules the optimal resources (e.g., container, virtual machine, or serverless) for different ML application requests in a hybrid infrastructure. The key innovation is an empirical dynamic placing algorithm that intelligently places requests based on their unique characteristics (e.g., request frequency, input data size, and data distribution). In contrast to existing ML systems, StraightLine offers end-to-end resource-aware placement, thereby it can significantly reduce response time and failure rate for model deployment when facing different computing resources in the hybrid infrastructure.
- [309] arXiv:2407.18154 [pdf, html, other]
-
Title: Identification of a time-varying SIR Model for Covid-19Comments: 16 pages, 10 figuresSubjects: Systems and Control (eess.SY); Physics and Society (physics.soc-ph); Populations and Evolution (q-bio.PE)
Throughout human history, epidemics have been a constant presence. Understanding their dynamics is essential to predict scenarios and make substantiated decisions. Mathematical models are powerful tools to describe an epidemic behavior. Among the most used, the compartmental ones stand out, dividing population into classes with well-defined characteristics. One of the most known is the $SIR$ model, based on a set of differential equations describing the rates of change of three categories over time. These equations take into account parameters such as the disease transmission rate and the recovery rate, which both change over time. However, classical models use constant parameters and can not describe the behavior of a disease over long periods. In this work, it is proposed a $SIR$ model with time-varying transmission rate parameter with a method to estimate this parameter based on an optimization problem, which minimizes the sum of the squares of the errors between the model and historical data. Additionally, based on the infection rates determined by the algorithm, the model's ability to predict disease activity in future scenarios was also investigated. Epidemic data released by the government of the State of Rio Grande do Sul in Brazil was used to evaluate the models, where the models shown a very good forecasting ability, resulting in errors for predicting the total number of accumulated infected persons of 0.13% for 7 days ahead and 0.6% for 14 days ahead.
- [310] arXiv:2407.18155 [pdf, html, other]
-
Title: Test2VA: Reusing GUI Test Cases for Voice Assistant Features Development in Mobile ApplicationsComments: 10 pagesSubjects: Software Engineering (cs.SE)
Voice Assistant (VA) in smartphones has become very popular with millions of users nowadays. A key trend is the rise of custom VA embedding, which enables users to perform the customized tasks of their favorite app through voice control. However, with such a great demand, little effort has been made to support app developers in VA development. Moreover, many user-oriented VA control approaches even increase the programming burden on developers. To reduce the workload and improve code efficiency, in this paper, we propose a novel approach, Test2VA, that reuses the test code of an application to support its VA development. Specifically, Test2VA extracts the task completion pattern from the GUI test code and then generates an execution method to perform the same task in general. To identify the pattern, Test2VA uses a mutation-based exploration to detect the mutable GUI event in the test case and later parameterize it in the VA method. We conducted an evaluation on 48 test cases from eight real-world applications. The results show that Test2VA correctly detects 75.68% of the mutable events from 48 original test cases and then generates 33 methods and have them successfully executed and manually examined.
- [311] arXiv:2407.18157 [pdf, html, other]
-
Title: Enhanced Privacy Bound for Shuffle Model with Personalized PrivacySubjects: Cryptography and Security (cs.CR); Databases (cs.DB)
The shuffle model of Differential Privacy (DP) is an enhanced privacy protocol which introduces an intermediate trusted server between local users and a central data curator. It significantly amplifies the central DP guarantee by anonymizing and shuffling the local randomized data. Yet, deriving a tight privacy bound is challenging due to its complicated randomization protocol. While most existing work are focused on unified local privacy settings, this work focuses on deriving the central privacy bound for a more practical setting where personalized local privacy is required by each user. To bound the privacy after shuffling, we first need to capture the probability of each user generating clones of the neighboring data points. Second, we need to quantify the indistinguishability between two distributions of the number of clones on neighboring datasets. Existing works either inaccurately capture the probability, or underestimate the indistinguishability between neighboring datasets. Motivated by this, we develop a more precise analysis, which yields a general and tighter bound for arbitrary DP mechanisms. Firstly, we derive the clone-generating probability by hypothesis testing %from a randomizer-specific perspective, which leads to a more accurate characterization of the probability. Secondly, we analyze the indistinguishability in the context of $f$-DP, where the convexity of the distributions is leveraged to achieve a tighter privacy bound. Theoretical and numerical results demonstrate that our bound remarkably outperforms the existing results in the literature.
- [312] arXiv:2407.18159 [pdf, html, other]
-
Title: Optimal Assignment and Motion Control in Two-Class Continuum SwarmsComments: 12 pages, 7 figuresSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
We consider optimal swarm control problems where two different classes of agents are present. Continuum idealizations of large-scale swarms are used where the dynamics describe the evolution of the spatially-distributed densities of each agent class. The problem formulation we adopt is motivated by applications where agents of one class are assigned to agents of the other class, which we refer to as demand and resource agents respectively. Assignments have costs related to the distances between mutually assigned agents, and the overall cost of an assignment is quantified by a Wasserstein distance between the densities of the two agent classes. When agents can move, the assignment cost can decrease at the expense of a physical motion cost, and this tradeoff sets up a nonlinear, infinite-dimensional optimal control problem. We show that in one spatial dimension, this problem can be converted to an infinite-dimensional, but decoupled, linear-quadratic (LQ) tracking problem when expressed in terms of the respective quantile functions. Solutions are given in the general one-dimensional case, as well as in the special cases of constant and periodically time-varying demands.
- [313] arXiv:2407.18169 [pdf, html, other]
-
Title: In Search of Metrics to Guide Developer-Based Refactoring RecommendationsSubjects: Software Engineering (cs.SE)
Context. Source code refactoring is a well-established approach to improving source code quality without compromising its external behavior. Motivation. The literature described the benefits of refactoring, yet its application in practice is threatened by the high cost of time, resource allocation, and effort required to perform it continuously. Providing refactoring recommendations closer to what developers perceive as relevant may support the broader application of refactoring in practice and drive prioritization efforts. Aim. In this paper, we aim to foster the design of a developer-based refactoring recommender, proposing an empirical study into the metrics that study the developer's willingness to apply refactoring operations. We build upon previous work describing the developer's motivations for refactoring and investigate how product and process metrics may grasp those motivations. Expected Results. We will quantify the value of product and process metrics in grasping developers' motivations to perform refactoring, thus providing a catalog of metrics for developer-based refactoring recommenders to use.
- [314] arXiv:2407.18170 [pdf, html, other]
-
Title: RIDA: A Robust Attack Framework on Incomplete GraphsSubjects: Machine Learning (cs.LG)
Graph Neural Networks (GNNs) are vital in data science but are increasingly susceptible to adversarial attacks. To help researchers develop more robust GNN models, it's essential to focus on designing strong attack models as foundational benchmarks and guiding references. Among adversarial attacks, gray-box poisoning attacks are noteworthy due to their effectiveness and fewer constraints. These attacks exploit GNNs' need for retraining on updated data, thereby impacting their performance by perturbing these datasets. However, current research overlooks the real-world scenario of incomplete this http URL address this gap, we introduce the Robust Incomplete Deep Attack Framework (RIDA). It is the first algorithm for robust gray-box poisoning attacks on incomplete graphs. The approach innovatively aggregates distant vertex information and ensures powerful data utilization.Extensive tests against 9 SOTA baselines on 3 real-world datasets demonstrate RIDA's superiority in handling incompleteness and high attack performance on the incomplete graph.
- [315] arXiv:2407.18175 [pdf, html, other]
-
Title: Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision TransformersZhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman FangComments: Accepted by ICS 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). However, ViT models are often computation-intensive for efficient deployment on resource-limited edge devices. This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs, to design efficient ViT models for hardware implementation while preserving the accuracy. First, Quasar-ViT trains a supernet using our row-wise flexible mixed-precision quantization scheme, mixed-precision weight entanglement, and supernet layer scaling techniques. Then, it applies an efficient hardware-oriented search algorithm, integrated with hardware latency and resource modeling, to determine a series of optimal subnets from supernet under different inference latency targets. Finally, we propose a series of model-adaptive designs on the FPGA platform to support the architecture search and mitigate the gap between the theoretical computation reduction and the practical inference speedup. Our searched models achieve 101.5, 159.6, and 251.6 frames-per-second (FPS) inference speed on the AMD/Xilinx ZCU102 FPGA with 80.4%, 78.6%, and 74.9% top-1 accuracy, respectively, for the ImageNet dataset, consistently outperforming prior works.
- [316] arXiv:2407.18178 [pdf, html, other]
-
Title: PianoMime: Learning a Generalist, Dexterous Piano Player from Internet DemonstrationsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
In this work, we introduce PianoMime, a framework for training a piano-playing agent using internet demonstrations. The internet is a promising source of large-scale demonstrations for training our robot agents. In particular, for the case of piano-playing, Youtube is full of videos of professional pianists playing a wide myriad of songs. In our work, we leverage these demonstrations to learn a generalist piano-playing agent capable of playing any arbitrary song. Our framework is divided into three parts: a data preparation phase to extract the informative features from the Youtube videos, a policy learning phase to train song-specific expert policies from the demonstrations and a policy distillation phase to distil the policies into a single generalist agent. We explore different policy designs to represent the agent and evaluate the influence of the amount of training data on the generalization capability of the agent to novel songs not available in the dataset. We show that we are able to learn a policy with up to 56\% F1 score on unseen songs.
- [317] arXiv:2407.18181 [pdf, html, other]
-
Title: Gene Regulatory Network Inference from Pre-trained Single-Cell Transcriptomics Transformer with Joint Graph LearningComments: Accepted into the ICML 2024 AI for Science workshopSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Inferring gene regulatory networks (GRNs) from single-cell RNA sequencing (scRNA-seq) data is a complex challenge that requires capturing the intricate relationships between genes and their regulatory interactions. In this study, we tackle this challenge by leveraging the single-cell BERT-based pre-trained transformer model (scBERT), trained on extensive unlabeled scRNA-seq data, to augment structured biological knowledge from existing GRNs. We introduce a novel joint graph learning approach that combines the rich contextual representations learned by pre-trained single-cell language models with the structured knowledge encoded in GRNs using graph neural networks (GNNs). By integrating these two modalities, our approach effectively reasons over boththe gene expression level constraints provided by the scRNA-seq data and the structured biological knowledge inherent in GRNs. We evaluate our method on human cell benchmark datasets from the BEELINE study with cell type-specific ground truth networks. The results demonstrate superior performance over current state-of-the-art baselines, offering a deeper understanding of cellular regulatory mechanisms.
- [318] arXiv:2407.18183 [pdf, html, other]
-
Title: Signaling Rate and Performance of RIS Reconfiguration and Handover Management in Next Generation Mobile NetworksComments: This paper is uploaded here for research community, thus it is for non-commercial purposesSubjects: Networking and Internet Architecture (cs.NI)
We consider the problem of signaling rate and performance for an efficient control and management of RIS reconfigurations and handover in next generation mobile networks. To this end, we first analytically determine the rates of RIS reconfigurations and handover using a stochastic geometry network model. We derive closed-form expressions of these rates while taking into account static obstacles (both known and unknown), self-blockage, RIS location density, and variations in the angle and direction of user mobility. Based on the rates derived, we analyze the signaling rates of a sample novel signaling protocol, which we propose as an extension of an handover signaling protocol standard in mobile networks. The results quantify the impact of known and unknown obstacles on the RIS and handover reconfiguration rate as function of device density and mobility. We use the proposed analysis to evaluate the signaling overhead due to RIS reconfigurations, as well as to dimension the related RIS control plane server capacity in the network management system. To the best of our knowledge, this is the first analytical model to derive the closed form expressions of RIS reconfiguration rates, along with handover rates, and relate its statistical properties to the signaling rate and performance in next generation mobile networks.
- [319] arXiv:2407.18184 [pdf, html, other]
-
Title: AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope PredictionSubjects: Machine Learning (cs.LG)
Epitope identification is vital for antibody design yet challenging due to the inherent variability in antibodies. While many deep learning methods have been developed for general protein binding site prediction tasks, whether they work for epitope prediction remains an understudied research question. The challenge is also heightened by the lack of a consistent evaluation pipeline with sufficient dataset size and epitope diversity. We introduce a filtered antibody-antigen complex structure dataset, AsEP (Antibody-specific Epitope Prediction). AsEP is the largest of its kind and provides clustered epitope groups, allowing the community to develop and test novel epitope prediction methods. AsEP comes with an easy-to-use interface in Python and pre-built graph representations of each antibody-antigen complex while also supporting customizable embedding methods. Based on this new dataset, we benchmarked various representative general protein-binding site prediction methods and find that their performances are not satisfactory as expected for epitope prediction. We thus propose a new method, WALLE, that leverages both protein language models and graph neural networks. WALLE demonstrate about 5X performance gain over existing methods. Our empirical findings evidence that epitope prediction benefits from combining sequential embeddings provided by language models and geometrical information from graph representations, providing a guideline for future method design. In addition, we reformulate the task as bipartite link prediction, allowing easy model performance attribution and interpretability. We open-source our data and code at this https URL.
- [320] arXiv:2407.18200 [pdf, html, other]
-
Title: Sparse Incremental Aggregation in Multi-Hop Federated LearningComments: This paper is accepted for the 25th IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) conferenceSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
This paper investigates federated learning (FL) in a multi-hop communication setup, such as in constellations with inter-satellite links. In this setup, part of the FL clients are responsible for forwarding other client's results to the parameter server. Instead of using conventional routing, the communication efficiency can be improved significantly by using in-network model aggregation at each intermediate hop, known as incremental aggregation (IA). Prior works [1] have indicated diminishing gains for IA under gradient sparsification. Here we study this issue and propose several novel correlated sparsification methods for IA. Numerical results show that, for some of these algorithms, the full potential of IA is still available under sparsification without impairing convergence. We demonstrate a 15x improvement in communication efficiency over conventional routing and a 11x improvement over state-of-the-art (SoA) sparse IA.
- [321] arXiv:2407.18201 [pdf, html, other]
-
Title: Semi-Classical Subspaces, The No Synchronization Law, and MoreComments: arXiv admin note: text overlap with arXiv:2402.13049Subjects: Computational Complexity (cs.CC); Quantum Physics (quant-ph)
This paper looks at the intersection of algorithmic information theory and physics, namely quantum mechanics, thermodynamics, and black holes. We discuss theorems which characterize the barrier between the quantum world and the classical realm. The notion of a "semi-classical subspace" is introduced. The No Synchronization Law is detailed, which says separate and isolated physical systems evolving over time cannot have thermodynamic algorithmic entropies that are in synch. We look at future work involving the Kolmogorov complexity of black holes.
- [322] arXiv:2407.18207 [pdf, html, other]
-
Title: Geometry Fidelity for Spherical ImagesAnders Christensen, Nooshin Mojab, Khushman Patel, Karan Ahuja, Zeynep Akata, Ole Winther, Mar Gonzalez-Franco, Andrea ColacoComments: Accepted at ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Spherical or omni-directional images offer an immersive visual format appealing to a wide range of computer vision applications. However, geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. Here, we show that direct application of Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score of continuity across borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that are undetected by FID.
- [323] arXiv:2407.18209 [pdf, html, other]
-
Title: SuperFlow: A Fully-Customized RTL-to-GDS Design Automation Flow for Adiabatic Quantum-Flux-Parametron Superconducting CircuitsYanyue Xie, Peiyan Dong, Geng Yuan, Zhengang Li, Masoud Zabihi, Chao Wu, Sung-En Chang, Xufeng Zhang, Xue Lin, Caiwen Ding, Nobuyuki Yoshikawa, Olivia Chen, Yanzhi WangComments: Accepted by DATE 2024Subjects: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR)
Superconducting circuits, like Adiabatic Quantum-Flux-Parametron (AQFP), offer exceptional energy efficiency but face challenges in physical design due to sophisticated spacing and timing constraints. Current design tools often neglect the importance of constraint adherence throughout the entire design flow. In this paper, we propose SuperFlow, a fully-customized RTL-to-GDS design flow tailored for AQFP devices. SuperFlow leverages a synthesis tool based on CMOS technology to transform any input RTL netlist to an AQFP-based netlist. Subsequently, we devise a novel place-and-route procedure that simultaneously considers wirelength, timing, and routability for AQFP circuits. The process culminates in the generation of the AQFP circuit layout, followed by a Design Rule Check (DRC) to identify and rectify any layout violations. Our experimental results demonstrate that SuperFlow achieves 12.8% wirelength improvement on average and 12.1% better timing quality compared with previous state-of-the-art placers for AQFP circuits.
- [324] arXiv:2407.18213 [pdf, html, other]
-
Title: Exploring Scaling Trends in LLM RobustnessNikolhaus Howe, Michał Zajac, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Pierre-Luc Bacon, Adam GleaveComments: 31 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Language model capabilities predictably improve from scaling a model's size and training data. Motivated by this, increasingly large language models have been trained, yielding an array of impressive capabilities. Yet these models are vulnerable to adversarial prompts, such as "jailbreaks" that hijack models to perform undesired behaviors, posing a significant risk of misuse. Prior work indicates that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically, finding that larger models respond substantially better to adversarial training, but there is little to no benefit from model scale in the absence of explicit defenses.
- [325] arXiv:2407.18215 [pdf, html, other]
-
Title: Tool-Assisted Learning of Computational ReductionsSubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Computational reductions are an important and powerful concept in computer science. However, they are difficult for many students to grasp. In this paper, we outline a concept for how the learning of reductions can be supported by educational support systems. We present an implementation of the concept within such a system, concrete web-based and interactive learning material for reductions, and report on our experiences using the material in a large introductory course on theoretical computer science.
- [326] arXiv:2407.18216 [pdf, html, other]
-
Title: Fast computation of the period and of the shortest cover of a string using its Character-Distance-Sampling representationSubjects: Data Structures and Algorithms (cs.DS)
Computing regularities in strings is essential for a better understanding of their structures. Among regularities, periods and covers are the easiest to compute and the more informative. Lately new interesting string matching results have been achieved using different sampling techniques. One of these technique, called Character-Distance-Sampling (\texttt{CDS}) consists of representing a string by storing the distance between the positions of selected characters called pivots. Here we select as pivots only the first character of the string and use its \texttt{CDS} representation for computing its period and its shortest cover. Experimental results show that the proposed methods are much faster than classical methods for computing these two features.
- [327] arXiv:2407.18218 [pdf, other]
-
Title: An NKCS Model of Bookchins CommunalismComments: 10 pages. arXiv admin note: text overlap with arXiv:2302.01694Subjects: Neural and Evolutionary Computing (cs.NE)
The NKCS model was introduced to explore coevolutionary systems, that is, systems in which multiple species are closely interconnected. The fitness landscapes of the species are coupled to a controllable amount, where the underlying properties of the individual landscapes are also controllable. No previous work has explored the use of hierarchical control within the model. This paper explores the effects of using a confederation, based on Bookchins communalism, and a single point of global control. Significant changes in behaviour from the traditional model are seen across the parameter space.
- [328] arXiv:2407.18219 [pdf, html, other]
-
Title: Recursive Introspection: Teaching Language Model Agents How to Self-ImproveSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
A central piece in enabling intelligent agentic behavior in foundation models is to make them capable of introspecting upon their behavior, reasoning, and correcting their mistakes as more computation or interaction is available. Even the strongest proprietary large language models (LLMs) do not quite exhibit the ability of continually improving their responses sequentially, even in scenarios where they are explicitly told that they are making a mistake. In this paper, we develop RISE: Recursive IntroSpEction, an approach for fine-tuning LLMs to introduce this capability, despite prior work hypothesizing that this capability may not be possible to attain. Our approach prescribes an iterative fine-tuning procedure, which attempts to teach the model how to alter its response after having executed previously unsuccessful attempts to solve a hard test-time problem, with optionally additional environment feedback. RISE poses fine-tuning for a single-turn prompt as solving a multi-turn Markov decision process (MDP), where the initial state is the prompt. Inspired by principles in online imitation learning and reinforcement learning, we propose strategies for multi-turn data collection and training so as to imbue an LLM with the capability to recursively detect and correct its previous mistakes in subsequent iterations. Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks, outperforming several single-turn strategies given an equal amount of inference-time computation. We also find that RISE scales well, often attaining larger benefits with more capable models. Our analysis shows that RISE makes meaningful improvements to responses to arrive at the correct solution for challenging prompts, without disrupting one-turn abilities as a result of expressing more complex distributions.
- [329] arXiv:2407.18220 [pdf, html, other]
-
Title: Detecting and explaining (in)equivalence of context-free grammarsSubjects: Formal Languages and Automata Theory (cs.FL); Computers and Society (cs.CY); Programming Languages (cs.PL)
We propose a scalable framework for deciding, proving, and explaining (in)equivalence of context-free grammars. We present an implementation of the framework and evaluate it on large data sets collected within educational support systems. Even though the equivalence problem for context-free languages is undecidable in general, the framework is able to handle a large portion of these datasets. It introduces and combines techniques from several areas, such as an abstract grammar transformation language to identify equivalent grammars as well as sufficiently similar inequivalent grammars, theory-based comparison algorithms for a large class of context-free languages, and a graph-theory-inspired grammar canonization that allows to efficiently identify isomorphic grammars.
- [330] arXiv:2407.18227 [pdf, html, other]
-
Title: Automated Ensemble Multimodal Machine Learning for HealthcareSubjects: Machine Learning (cs.LG)
The application of machine learning in medicine and healthcare has led to the creation of numerous diagnostic and prognostic models. However, despite their success, current approaches generally issue predictions using data from a single modality. This stands in stark contrast with clinician decision-making which employs diverse information from multiple sources. While several multimodal machine learning approaches exist, significant challenges in developing multimodal systems remain that are hindering clinical adoption. In this paper, we introduce a multimodal framework, AutoPrognosis-M, that enables the integration of structured clinical (tabular) data and medical imaging using automated machine learning. AutoPrognosis-M incorporates 17 imaging models, including convolutional neural networks and vision transformers, and three distinct multimodal fusion strategies. In an illustrative application using a multimodal skin lesion dataset, we highlight the importance of multimodal machine learning and the power of combining multiple fusion strategies using ensemble learning. We have open-sourced our framework as a tool for the community and hope it will accelerate the uptake of multimodal machine learning in healthcare and spur further innovation.
- [331] arXiv:2407.18228 [pdf, other]
-
Title: Parameterized Algorithms on Integer Sets with Small Doubling: Integer Programming, Subset Sum and k-SUMComments: 24 pages, 0 figuresSubjects: Data Structures and Algorithms (cs.DS)
We study the parameterized complexity of algorithmic problems whose input is an integer set $A$ in terms of the doubling constant $C := |A + A|/|A|$, a fundamental measure of additive structure. We present evidence that this new parameterization is algorithmically useful in the form of new results for two difficult, well-studied problems: Integer Programming and Subset Sum.
First, we show that determining the feasibility of bounded Integer Programs is a tractable problem when parameterized in the doubling constant. Specifically, we prove that the feasibility of an integer program $I$ with $n$ polynomially-bounded variables and $m$ constraints can be determined in time $n^{O_C(1)} poly(|I|)$ when the column set of the constraint matrix has doubling constant $C$.
Second, we show that the Subset Sum and Unbounded Subset Sum problems can be solved in time $n^{O_C(1)}$ and $n^{O_C(\log \log \log n)}$, respectively, where the $O_C$ notation hides functions that depend only on the doubling constant $C$. We also show the equivalence of achieving an FPT algorithm for Subset Sum with bounded doubling and achieving a milestone result for the parameterized complexity of Box ILP. Finally, we design near-linear time algorithms for $k$-SUM as well as tight lower bounds for 4-SUM and nearly tight lower bounds for $k$-SUM, under the $k$-SUM conjecture.
Several of our results rely on a new proof that Freiman's Theorem, a central result in additive combinatorics, can be made efficiently constructive. This result may be of independent interest. - [332] arXiv:2407.18232 [pdf, html, other]
-
Title: LION: Linear Group RNN for 3D Object Detection in Point CloudsComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
The benefit of transformers in large-scale 3D point cloud perception tasks, such as 3D object detection, is limited by their quadratic computation cost when modeling long-range relationships. In contrast, linear RNNs have low computational complexity and are suitable for long-range modeling. Toward this goal, we propose a simple and effective window-based framework built on LInear grOup RNN (i.e., perform linear RNN for grouped features) for accurate 3D object detection, called LION. The key property is to allow sufficient feature interaction in a much larger group than transformer-based methods. However, effectively applying linear group RNN to 3D object detection in highly sparse point clouds is not trivial due to its limitation in handling spatial modeling. To tackle this problem, we simply introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features rather than blindly increasing the number of scanning orders for voxel features. To further address the challenge in highly sparse point clouds, we propose a 3D voxel generation strategy to densify foreground features thanks to linear group RNN as a natural property of auto-regressive models. Extensive experiments verify the effectiveness of the proposed components and the generalization of our LION on different linear group RNN operators including Mamba, RWKV, and RetNet. Furthermore, it is worth mentioning that our LION-Mamba achieves state-of-the-art on Waymo, nuScenes, Argoverse V2, and ONCE dataset. Last but not least, our method supports kinds of advanced linear RNN operators (e.g., RetNet, RWKV, Mamba, xLSTM and TTT) on small but popular KITTI dataset for a quick experience with our linear RNN-based framework.
- [333] arXiv:2407.18240 [pdf, html, other]
-
Title: CodedVO: Coded Visual OdometryComments: 7 pages, 4 figures, IEEE ROBOTICS AND AUTOMATION LETTERSJournal-ref: IEEE ROBOTICS AND AUTOMATION LETTERS, 2024Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Autonomous robots often rely on monocular cameras for odometry estimation and navigation. However, the scale ambiguity problem presents a critical barrier to effective monocular visual odometry. In this paper, we present CodedVO, a novel monocular visual odometry method that overcomes the scale ambiguity problem by employing custom optics to physically encode metric depth information into imagery. By incorporating this information into our odometry pipeline, we achieve state-of-the-art performance in monocular visual odometry with a known scale. We evaluate our method in diverse indoor environments and demonstrate its robustness and adaptability. We achieve a 0.08m average trajectory error in odometry evaluation on the ICL-NUIM indoor odometry dataset.
- [334] arXiv:2407.18241 [pdf, html, other]
-
Title: Numerical Literals in Link Prediction: A Critical Examination of Models and DatasetsSubjects: Machine Learning (cs.LG); Databases (cs.DB)
Link Prediction(LP) is an essential task over Knowledge Graphs(KGs), traditionally focussed on using and predicting the relations between entities. Textual entity descriptions have already been shown to be valuable, but models that incorporate numerical literals have shown minor improvements on existing benchmark datasets. It is unclear whether a model is actually better in using numerical literals, or better capable of utilizing the graph structure. This raises doubts about the effectiveness of these methods and about the suitability of the existing benchmark datasets.
We propose a methodology to evaluate LP models that incorporate numerical literals. We propose i) a new synthetic dataset to better understand how well these models use numerical literals and ii) dataset ablations strategies to investigate potential difficulties with the existing datasets. We identify a prevalent trend: many models underutilize literal information and potentially rely on additional parameters for performance gains. Our investigation highlights the need for more extensive evaluations when releasing new models and datasets. - [335] arXiv:2407.18242 [pdf, html, other]
-
Title: LoRA-Pro: Are Low-Rank Adapters Properly Optimized?Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Low-Rank Adaptation, also known as LoRA, has emerged as a prominent method for parameter-efficient fine-tuning foundation models by re-parameterizing the original matrix into the product of two low-rank matrices. Despite its efficiency, LoRA often yields inferior performance compared to full fine-tuning. In this paper, we propose LoRA-Pro to bridge this performance gap. Firstly, we delve into the optimization processes in LoRA and full fine-tuning. We reveal that while LoRA employs low-rank approximation, it neglects to approximate the optimization process of full fine-tuning. To address this, we introduce a novel concept called the "equivalent gradient." This virtual gradient makes the optimization process on the re-parameterized matrix equivalent to LoRA, which can be used to quantify the differences between LoRA and full fine-tuning. The equivalent gradient is derived from the gradients of matrices $A$ and $B$. To narrow the performance gap, our approach minimizes the differences between the equivalent gradient and the gradient obtained from full fine-tuning during the optimization process. By solving this objective, we derive optimal closed-form solutions for updating matrices $A$ and $B$. Our method constrains the optimization process, shrinking the performance gap between LoRA and full fine-tuning. Extensive experiments on natural language processing tasks validate the effectiveness of our method.
- [336] arXiv:2407.18243 [pdf, html, other]
-
Title: BIV-Priv-Seg: Locating Private Content in Images Taken by People With Visual ImpairmentsYu-Yun Tseng, Tanusree Sharma, Lotus Zhang, Abigale Stangl, Leah Findlater, Yang Wang, Danna Gurari Yu-Yun Tseng, Tanusree Sharma, Lotus Zhang, Abigale Stangl, Leah Findlater, Yang Wang, Danna GurariSubjects: Computer Vision and Pattern Recognition (cs.CV)
Individuals who are blind or have low vision (BLV) are at a heightened risk of sharing private information if they share photographs they have taken. To facilitate developing technologies that can help preserve privacy, we introduce BIV-Priv-Seg, the first localization dataset originating from people with visual impairments that shows private content. It contains 1,028 images with segmentation annotations for 16 private object categories. We first characterize BIV-Priv-Seg and then evaluate modern models' performance for locating private content in the dataset. We find modern models struggle most with locating private objects that are not salient, small, and lack text as well as recognizing when private content is absent from an image. We facilitate future extensions by sharing our new dataset with the evaluation server at this https URL.
- [337] arXiv:2407.18244 [pdf, html, other]
-
Title: RefMask3D: Language-Guided Transformer for 3D Referring SegmentationComments: ACM MM 2024, Code: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D referring segmentation is an emerging and challenging vision-language task that aims to segment the object described by a natural language expression in a point cloud scene. The key challenge behind this task is vision-language feature fusion and alignment. In this work, we propose RefMask3D to explore the comprehensive multi-modal feature interaction and understanding. First, we propose a Geometry-Enhanced Group-Word Attention to integrate language with geometrically coherent sub-clouds through cross-modal group-word attention, which effectively addresses the challenges posed by the sparse and irregular nature of point clouds. Then, we introduce a Linguistic Primitives Construction to produce semantic primitives representing distinct semantic attributes, which greatly enhance the vision-language understanding at the decoding stage. Furthermore, we introduce an Object Cluster Module that analyzes the interrelationships among linguistic primitives to consolidate their insights and pinpoint common characteristics, helping to capture holistic information and enhance the precision of target identification. The proposed RefMask3D achieves new state-of-the-art performance on 3D referring segmentation, 3D visual grounding, and also 2D referring image segmentation. Especially, RefMask3D outperforms previous state-of-the-art method by a large margin of 3.16% mIoU} on the challenging ScanRefer dataset. Code is available at this https URL.
- [338] arXiv:2407.18245 [pdf, html, other]
-
Title: VGGHeads: A Large-Scale Synthetic Dataset for 3D Human HeadsSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Human head detection, keypoint estimation, and 3D head model fitting are important tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads -- a large scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset we introduce a new model architecture capable of simultaneous heads detection and head meshes reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads. Additionally, we provide detailed information about the synthetic data generation pipeline, enabling it to be re-used for other tasks and domains.
- [339] arXiv:2407.18247 [pdf, other]
-
Title: RegionDrag: Fast Region-Based Image Editing with Diffusion ModelsComments: ECCV 2024, Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Point-drag-based image editing methods, like DragDiffusion, have attracted significant attention. However, point-drag-based approaches suffer from computational overhead and misinterpretation of user intentions due to the sparsity of point-based editing instructions. In this paper, we propose a region-based copy-and-paste dragging method, RegionDrag, to overcome these limitations. RegionDrag allows users to express their editing instructions in the form of handle and target regions, enabling more precise control and alleviating ambiguity. In addition, region-based operations complete editing in one iteration and are much faster than point-drag-based methods. We also incorporate the attention-swapping technique for enhanced stability during editing. To validate our approach, we extend existing point-drag-based datasets with region-based dragging instructions. Experimental results demonstrate that RegionDrag outperforms existing point-drag-based approaches in terms of speed, accuracy, and alignment with user intentions. Remarkably, RegionDrag completes the edit on an image with a resolution of 512x512 in less than 2 seconds, which is more than 100x faster than DragDiffusion, while achieving better performance. Project page: this https URL.
- [340] arXiv:2407.18248 [pdf, html, other]
-
Title: Self-Training with Direct Preference Optimization Improves Chain-of-Thought ReasoningComments: ACL 2024. Code and data are available at this https URLSubjects: Computation and Language (cs.CL)
Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process where models learn from their own outputs. We also show that the conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We evaluate our method across various mathematical reasoning tasks using different base models. Our experiments show that this approach not only improves LMs' reasoning performance but also offers a more cost-effective and scalable solution compared to relying on large proprietary LMs.
- [341] arXiv:2407.18249 [pdf, html, other]
-
Title: Trajectory-aligned Space-time Tokens for Few-shot Action RecognitionComments: ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a simple yet effective approach for few-shot action recognition, emphasizing the disentanglement of motion and appearance representations. By harnessing recent progress in tracking, specifically point trajectories and self-supervised representation learning, we build trajectory-aligned tokens (TATs) that capture motion and appearance information. This approach significantly reduces the data requirements while retaining essential information. To process these representations, we use a Masked Space-time Transformer that effectively learns to aggregate information to facilitate few-shot action recognition. We demonstrate state-of-the-art results on few-shot action recognition across multiple datasets. Our project page is available at this https URL
- [342] arXiv:2407.18251 [pdf, html, other]
-
Title: Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical AnalysisSubjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Assessing the robustness of multimodal models against adversarial examples is an important aspect for the safety of its users. We craft L0-norm perturbation attacks on the preprocessed input images. We launch them in a black-box setup against four multimodal models and two unimodal DNNs, considering both targeted and untargeted misclassification. Our attacks target less than 0.04% of perturbed image area and integrate different spatial positioning of perturbed pixels: sparse positioning and pixels arranged in different contiguous shapes (row, column, diagonal, and patch). To the best of our knowledge, we are the first to assess the robustness of three state-of-the-art multimodal models (ALIGN, AltCLIP, GroupViT) against different sparse and contiguous pixel distribution perturbations. The obtained results indicate that unimodal DNNs are more robust than multimodal models. Furthermore, models using CNN-based Image Encoder are more vulnerable than models with ViT - for untargeted attacks, we obtain a 99% success rate by perturbing less than 0.02% of the image area.
New submissions for Friday, 26 July 2024 (showing 342 of 342 entries )
- [343] arXiv:2407.14335 (cross-list from econ.GN) [pdf, html, other]
-
Title: Quantifying the Blockchain Trilemma: A Comparative Analysis of Algorand, Ethereum 2.0, and BeyondSubjects: General Economics (econ.GN); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR); Computational Finance (q-fin.CP); Computation (stat.CO)
Blockchain technology is essential for the digital economy and metaverse, supporting applications from decentralized finance to virtual assets. However, its potential is constrained by the "Blockchain Trilemma," which necessitates balancing decentralization, security, and scalability. This study evaluates and compares two leading proof-of-stake (PoS) systems, Algorand and Ethereum 2.0, against these critical metrics. Our research interprets existing indices to measure decentralization, evaluates scalability through transactional data, and assesses security by identifying potential vulnerabilities. Utilizing real-world data, we analyze each platform's strategies in a structured manner to understand their effectiveness in addressing trilemma challenges. The findings highlight each platform's strengths and propose general methodologies for evaluating key blockchain characteristics applicable to other systems. This research advances the understanding of blockchain technologies and their implications for the future digital economy. Data and code are available on GitHub as open source.
- [344] arXiv:2407.16020 (cross-list from quant-ph) [pdf, other]
-
Title: Sparks of Quantum Advantage and Rapid Retraining in Machine LearningComments: Fixed figure 2 in v2Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Machine Learning (stat.ML)
The advent of quantum computing holds the potential to revolutionize various fields by solving complex problems more efficiently than classical computers. Despite this promise, practical quantum advantage is hindered by current hardware limitations, notably the small number of qubits and high noise levels. In this study, we leverage adiabatic quantum computers to optimize Kolmogorov-Arnold Networks, a powerful neural network architecture for representing complex functions with minimal parameters. By modifying the network to use Bezier curves as the basis functions and formulating the optimization problem into a Quadratic Unconstrained Binary Optimization problem, we create a fixed-sized solution space, independent of the number of training samples. Our approach demonstrates sparks of quantum advantage through faster training times compared to classical optimizers such as the Adam, Stochastic Gradient Descent, Adaptive Gradient, and simulated annealing. Additionally, we introduce a novel rapid retraining capability, enabling the network to be retrained with new data without reprocessing old samples, thus enhancing learning efficiency in dynamic environments. Experimental results on initial training of classification and regression tasks validate the efficacy of our approach, showcasing significant speedups and comparable performance to classical methods. While experiments on retraining demonstrate a sixty times speed up using adiabatic quantum computing based optimization compared to that of the gradient descent based optimizers, with theoretical models allowing this speed up to be even larger! Our findings suggest that with further advancements in quantum hardware and algorithm optimization, quantum-optimized machine learning models could have broad applications across various domains, with initial focus on rapid retraining.
- [345] arXiv:2407.17485 (cross-list from physics.chem-ph) [pdf, html, other]
-
Title: Application of the Digital Annealer Unit in Optimizing Chemical Reaction Conditions for Enhanced Production YieldsShih-Cheng Li, Pei-Hwa Wang, Jheng-Wei Su, Wei-Yin Chiang, Shih-Hsien Huang, Yen-Chu Lin, Chia-Ho Ou, Chih-Yu ChenSubjects: Chemical Physics (physics.chem-ph); Emerging Technologies (cs.ET); Data Analysis, Statistics and Probability (physics.data-an)
Finding appropriate reaction conditions that yield high product rates in chemical synthesis is crucial for the chemical and pharmaceutical industries. However, due to the vast chemical space, conducting experiments for each possible reaction condition is impractical. Consequently, models such as QSAR (Quantitative Structure-Activity Relationship) or ML (Machine Learning) have been developed to predict the outcomes of reactions and illustrate how reaction conditions affect product yield. Despite these advancements, inferring all possible combinations remains computationally prohibitive when using a conventional CPU. In this work, we explore using a Digital Annealing Unit (DAU) to tackle these large-scale optimization problems more efficiently by solving Quadratic Unconstrained Binary Optimization (QUBO). Two types of QUBO models are constructed in this work: one using quantum annealing and the other using ML. Both models are built and tested on four high-throughput experimentation (HTE) datasets and selected Reaxys datasets. Our results suggest that the performance of models is comparable to classical ML methods (i.e., Random Forest and Multilayer Perceptron (MLP)), while the inference time of our models requires only seconds with a DAU. Additionally, in campaigns involving active learning and autonomous design of reaction conditions to achieve higher reaction yield, our model demonstrates significant improvements by adding new data, showing promise of adopting our method in the iterative nature of such problem settings. Our method can also accelerate the screening of billions of reaction conditions, achieving speeds millions of times faster than traditional computing units in identifying superior conditions. Therefore, leveraging the DAU with our developed QUBO models has the potential to be a valuable tool for innovative chemical synthesis.
- [346] arXiv:2407.17492 (cross-list from physics.chem-ph) [pdf, html, other]
-
Title: Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for ChemistryComments: 14 pages, submited to conference, code available at: this https URLSubjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Spectroscopic techniques are essential tools for determining the structure of molecules. Different spectroscopic techniques, such as Nuclear magnetic resonance (NMR), Infrared spectroscopy, and Mass Spectrometry, provide insight into the molecular structure, including the presence or absence of functional groups. Chemists leverage the complementary nature of the different methods to their advantage. However, the lack of a comprehensive multimodal dataset, containing spectra from a variety of spectroscopic techniques, has limited machine-learning approaches mostly to single-modality tasks for predicting molecular structures from spectra. Here we introduce a dataset comprising simulated $^1$H-NMR, $^{13}$C-NMR, HSQC-NMR, Infrared, and Mass spectra (positive and negative ion modes) for 790k molecules extracted from chemical reactions in patent data. This dataset enables the development of foundation models for integrating information from multiple spectroscopic modalities, emulating the approach employed by human experts. Additionally, we provide benchmarks for evaluating single-modality tasks such as structure elucidation, predicting the spectra for a target molecule, and functional group predictions. This dataset has the potential automate structure elucidation, streamlining the molecular discovery pipeline from synthesis to structure determination. The dataset and code for the benchmarks can be found at this https URL.
- [347] arXiv:2407.17505 (cross-list from q-bio.NC) [pdf, other]
-
Title: Survey on biomarkers in human vocalizationsAki Härmä, Bert den Brinker, Ulf Grossekathofer, Okke Ouweltjes, Srikanth Nallanthighal, Sidharth Abrol, Vibhu SharmaSubjects: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Recent years has witnessed an increase in technologies that use speech for the sensing of the health of the talker. This survey paper proposes a general taxonomy of the technologies and a broad overview of current progress and challenges. Vocal biomarkers are often secondary measures that are approximating a signal of another sensor or identifying an underlying mental, cognitive, or physiological state. Their measurement involve disturbances and uncertainties that may be considered as noise sources and the biomarkers are coarsely qualified in terms of the various sources of noise involved in their determination. While in some proposed biomarkers the error levels seem high, there are vocal biomarkers where the errors are expected to be low and thus are more likely to qualify as candidates for adoption in healthcare applications.
- [348] arXiv:2407.17624 (cross-list from q-fin.RM) [pdf, html, other]
-
Title: Traditional Methods Outperform Generative LLMs at Forecasting Credit RatingsSubjects: Risk Management (q-fin.RM); Computation and Language (cs.CL); General Finance (q-fin.GN)
Large Language Models (LLMs) have been shown to perform well for many downstream tasks. Transfer learning can enable LLMs to acquire skills that were not targeted during pre-training. In financial contexts, LLMs can sometimes beat well-established benchmarks. This paper investigates how well LLMs perform in the task of forecasting corporate credit ratings. We show that while LLMs are very good at encoding textual information, traditional methods are still very competitive when it comes to encoding numeric and multimodal data. For our task, current LLMs perform worse than a more traditional XGBoost architecture that combines fundamental and macroeconomic data with high-density text-based embedding features.
- [349] arXiv:2407.17625 (cross-list from physics.flu-dyn) [pdf, other]
-
Title: Mixed Convection and Entropy Generation Analysis of CNT-Water Nanofluid in a Square Cavity with Cylinders and Flow DeflectorsSubjects: Fluid Dynamics (physics.flu-dyn); Numerical Analysis (math.NA)
This study explores the mixed convection of CNT-water nanofluid within a square cavity containing heated cylinders under the influence of a magnetic field, focusing on three geometric configurations: a single heated cylinder, two heated cylinders, and two heated cylinders with a flow deflector. The impact of various parameters, including Reynolds number (Re), Richardson number (Ri), Hartmann number (Ha), wavy wall peaks (n), nanoparticle volume fraction ({\phi}), Hartmann angle ({\gamma}), rotational speed ({\omega}), and inclination angle ({\alpha}), on thermal and fluid dynamic behaviors is analyzed. Results reveal that MWCNT nanofluids consistently achieve higher Nusselt numbers than SWCNT nanofluids, indicating superior heat transfer capabilities. Introducing a second cylinder and a flow deflector enhances thermal interactions, while increasing Ha stabilizes the flow, improving thermal performance. Wavy wall peaks further enhance fluid mixing and heat transfer efficiency. Additionally, SWCNT nanofluids exhibit higher Bejan numbers, indicating a greater dominance of thermal entropy generation over fluid friction. These findings provide valuable insights for optimizing thermal management systems in engineering applications, highlighting the importance of selecting appropriate nanofluids, geometric configurations, and magnetic field parameters to achieve optimal thermal performance and fluid stability.
- [350] arXiv:2407.17641 (cross-list from quant-ph) [pdf, html, other]
-
Title: Regular language quantum statesComments: 12 pages, 1 figureSubjects: Quantum Physics (quant-ph); Formal Languages and Automata Theory (cs.FL)
We introduce regular language states, a family of quantum many-body states. They are built from a special class of formal languages, called regular, which has been thoroughly studied in the field of computer science. They can be understood as the superposition of all the words in a regular language and encompass physically relevant states such as the GHZ-, W- or Dicke-states. By leveraging the theory of regular languages, we develop a theoretical framework to describe them. First, we express them in terms of matrix product states, providing efficient criteria to recognize them. We then develop a canonical form which allows us to formulate a fundamental theorem for the equivalence of regular language states, including under local unitary operations. We also exploit the theory of tensor networks to find an efficient criterion to determine when regular languages are shift-invariant.
- [351] arXiv:2407.17667 (cross-list from astro-ph.IM) [pdf, html, other]
-
Title: Tackling the Problem of Distributional Shifts: Correcting Misspecified, High-Dimensional Data-Driven Priors for Inverse ProblemsComments: 17 pages, 15 figures, Submitted to The Astrophysical JournalSubjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Bayesian inference for inverse problems hinges critically on the choice of priors. In the absence of specific prior information, population-level distributions can serve as effective priors for parameters of interest. With the advent of machine learning, the use of data-driven population-level distributions (encoded, e.g., in a trained deep neural network) as priors is emerging as an appealing alternative to simple parametric priors in a variety of inverse problems. However, in many astrophysical applications, it is often difficult or even impossible to acquire independent and identically distributed samples from the underlying data-generating process of interest to train these models. In these cases, corrupted data or a surrogate, e.g. a simulator, is often used to produce training samples, meaning that there is a risk of obtaining misspecified priors. This, in turn, can bias the inferred posteriors in ways that are difficult to quantify, which limits the potential applicability of these models in real-world scenarios. In this work, we propose addressing this issue by iteratively updating the population-level distributions by retraining the model with posterior samples from different sets of observations and showcase the potential of this method on the problem of background image reconstruction in strong gravitational lensing when score-based models are used as data-driven priors. We show that starting from a misspecified prior distribution, the updated distribution becomes progressively closer to the underlying population-level distribution, and the resulting posterior samples exhibit reduced bias after several updates.
- [352] arXiv:2407.17706 (cross-list from quant-ph) [pdf, html, other]
-
Title: Investigating and Mitigating Barren Plateaus in Variational Quantum Circuits: A SurveyComments: preprint, under review. Please feel free to reach out if your work fits within our scopeSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
In recent years, variational quantum circuits (VQCs) have been widely explored to advance quantum circuits against classic models on various domains, such as quantum chemistry and quantum machine learning. Similar to classic machine-learning models, VQCs can be optimized through gradient-based approaches. However, the gradient variance of VQCs may dramatically vanish as the number of qubits or layers increases. This issue, a.k.a. Barren Plateaus (BPs), seriously hinders the scaling of VQCs on large datasets. To mitigate the exponential gradient vanishing, extensive efforts have been devoted to tackling this issue through diverse strategies. In this survey, we conduct a systematic literature review of recent works from both investigation and mitigation perspectives. Besides, we propose a new taxonomy to categorize most existing mitigation strategies. At last, we provide insightful discussion for future directions of BPs.
- [353] arXiv:2407.17731 (cross-list from econ.GN) [pdf, html, other]
-
Title: Optimal Trade and Industrial Policies in the Global Economy: A Deep Learning FrameworkSubjects: General Economics (econ.GN); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
We propose a deep learning framework, DL-opt, designed to efficiently solve for optimal policies in quantifiable general equilibrium trade models. DL-opt integrates (i) a nested fixed point (NFXP) formulation of the optimization problem, (ii) automatic implicit differentiation to enhance gradient descent for solving unilateral optimal policies, and (iii) a best-response dynamics approach for finding Nash equilibria. Utilizing DL-opt, we solve for non-cooperative tariffs and industrial subsidies across 7 economies and 44 sectors, incorporating sectoral external economies of scale. Our quantitative analysis reveals significant sectoral heterogeneity in Nash policies: Nash industrial subsidies increase with scale elasticities, whereas Nash tariffs decrease with trade elasticities. Moreover, we show that global dual competition, involving both tariffs and industrial subsidies, results in lower tariffs and higher welfare outcomes compared to a global tariff war. These findings highlight the importance of considering sectoral heterogeneity and policy combinations in understanding global economic competition.
- [354] arXiv:2407.17777 (cross-list from eess.SP) [pdf, html, other]
-
Title: Advancing Multi-Modal Sensing Through Expandable Modality AlignmentSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Sensing technology is widely used for comprehending the physical world, with numerous modalities explored in past decades. While there has been considerable work on multi-modality learning, they all require data of all modalities be paired. How to leverage multi-modality data with partially pairings remains an open problem. To tackle this challenge, we introduce the Babel framework, encompassing the neural network architecture, data preparation and processing, as well as the training strategies. Babel serves as a scalable pre-trained multi-modal sensing neural network, currently aligning six sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. To overcome the scarcity of complete paired data, the key idea of Babel involves transforming the N-modality alignment into a series of two-modality alignments by devising the expandable network architecture. This concept is also realized via a series of novel techniques, including the pre-trained modality tower that capitalizes on available single-modal networks, and the adaptive training strategy balancing the contribution of the newly incorporated modality with the previously established modality alignment.
Evaluation demonstrates Babel's outstanding performance on eight human activity recognition datasets, compared to various baselines e.g., the top multi-modal sensing framework, single-modal sensing networks, and multi-modal large language models. Babel not only effectively fuses multiple available modalities (up to 22% accuracy increase), but also enhance the performance of individual modality (12% averaged accuracy improvement). Case studies also highlight exciting application scenarios empowered by Babel, including cross-modality retrieval (i.e., sensing imaging), and bridging LLM for sensing comprehension. - [355] arXiv:2407.17780 (cross-list from eess.IV) [pdf, html, other]
-
Title: HF-Fed: Hierarchical based customized Federated Learning Framework for X-Ray ImagingSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In clinical applications, X-ray technology is vital for noninvasive examinations like mammography, providing essential anatomical information. However, the radiation risk associated with X-ray procedures raises concerns. X-ray reconstruction is crucial in medical imaging for detailed visual representations of internal structures, aiding diagnosis and treatment without invasive procedures. Recent advancements in deep learning (DL) have shown promise in X-ray reconstruction, but conventional DL methods often require centralized aggregation of large datasets, leading to domain shifts and privacy issues. To address these challenges, we introduce the Hierarchical Framework-based Federated Learning method (HF-Fed) for customized X-ray imaging. HF-Fed tackles X-ray imaging optimization by decomposing the problem into local data adaptation and holistic X-ray imaging. It employs a hospital-specific hierarchical framework and a shared common imaging network called Network of Networks (NoN) to acquire stable features from diverse data distributions. The hierarchical hypernetwork extracts domain-specific hyperparameters, conditioning the NoN for customized X-ray reconstruction. Experimental results demonstrate HF-Fed's competitive performance, offering a promising solution for enhancing X-ray imaging without data sharing. This study significantly contributes to the literature on federated learning in healthcare, providing valuable insights for policymakers and healthcare providers. The source code and pre-trained HF-Fed model are available at \url{this https URL}.
- [356] arXiv:2407.17793 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Use-dependent Biases as Optimal Action under Information BottleneckSubjects: Neurons and Cognition (q-bio.NC); Information Theory (cs.IT)
Use-dependent bias is a phenomenon in human sensorimotor behavior whereby movements become biased towards previously repeated actions. Despite being well-documented, the reason why this phenomenon occurs is not year clearly understood. Here, we propose that use-dependent biases can be understood as a rational strategy for movement under limitations on the capacity to process sensory information to guide motor output. We adopt an information-theoretic approach to characterize sensorimotor information processing and determine how behavior should be optimized given limitations to this capacity. We show that this theory naturally predicts the existence of use-dependent biases. Our framework also generates two further predictions. The first prediction relates to handedness. The dominant hand is associated with enhanced dexterity and reduced movement variability compared to the non-dominant hand, which we propose relates to a greater capacity for information processing in regions that control movement of the dominant hand. Consequently, the dominant hand should exhibit smaller use-dependent biases compared to the non-dominant hand. The second prediction relates to how use-dependent biases are affected by movement speed. When moving faster, it is more challenging to correct for initial movement errors online during the movement. This should exacerbate costs associated with initial directional error and, according to our theory, reduce the extent of use-dependent biases compared to slower movements, and vice versa. We show that these two empirical predictions, the handedness effect and the speed-dependent effect, are confirmed by experimental data.
- [357] arXiv:2407.17823 (cross-list from math.OC) [pdf, html, other]
-
Title: Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel OptimizationComments: ICML 2024 (Oral). arXiv admin note: text overlap with arXiv:2311.04520Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Bilevel optimization is widely applied in many machine learning tasks such as hyper-parameter learning, meta learning and reinforcement learning. Although many algorithms recently have been developed to solve the bilevel optimization problems, they generally rely on the (strongly) convex lower-level problems. More recently, some methods have been proposed to solve the nonconvex-PL bilevel optimization problems, where their upper-level problems are possibly nonconvex, and their lower-level problems are also possibly nonconvex while satisfying Polyak-Łojasiewicz (PL) condition. However, these methods still have a high convergence complexity or a high computation complexity such as requiring compute expensive Hessian/Jacobian matrices and its inverses. In the paper, thus, we propose an efficient Hessian/Jacobian-free method (i.e., HJFBiO) with the optimal convergence complexity to solve the nonconvex-PL bilevel problems. Theoretically, under some mild conditions, we prove that our HJFBiO method obtains an optimal convergence rate of $O(\frac{1}{T})$, where $T$ denotes the number of iterations, and has an optimal gradient complexity of $O(\epsilon^{-1})$ in finding an $\epsilon$-stationary solution. We conduct some numerical experiments on the bilevel PL game and hyper-representation learning task to demonstrate efficiency of our proposed method.
- [358] arXiv:2407.17851 (cross-list from math.ST) [pdf, html, other]
-
Title: Bad local minima exist in the stochastic block modelSubjects: Statistics Theory (math.ST); Discrete Mathematics (cs.DM); Information Theory (cs.IT); Mathematical Physics (math-ph); Probability (math.PR)
We study the disassortative stochastic block model with three communities, a well-studied model of graph partitioning and Bayesian inference for which detailed predictions based on the cavity method exist [Decelle et al. (2011)]. We provide strong evidence that for a part of the phase where efficient algorithms exist that approximately reconstruct the communities, inference based on maximum a posteriori (MAP) fails. In other words, we show that there exist modes of the posterior distribution that have a vanishing agreement with the ground truth. The proof is based on the analysis of a graph colouring algorithm from [Achlioptas and Moore (2003)].
- [359] arXiv:2407.17866 (cross-list from q-fin.ST) [pdf, html, other]
-
Title: Financial Statement Analysis with Large Language ModelsComments: Previously posted on SSRN (May 21, 2024). See this http URLSubjects: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); General Finance (q-fin.GN); Portfolio Management (q-fin.PM)
We investigate whether an LLM can successfully perform financial statement analysis in a way similar to a professional human analyst. We provide standardized and anonymous financial statements to GPT4 and instruct the model to analyze them to determine the direction of future earnings. Even without any narrative or industry-specific information, the LLM outperforms financial analysts in its ability to predict earnings changes. The LLM exhibits a relative advantage over human analysts in situations when the analysts tend to struggle. Furthermore, we find that the prediction accuracy of the LLM is on par with the performance of a narrowly trained state-of-the-art ML model. LLM prediction does not stem from its training memory. Instead, we find that the LLM generates useful narrative insights about a company's future performance. Lastly, our trading strategies based on GPT's predictions yield a higher Sharpe ratio and alphas than strategies based on other models. Taken together, our results suggest that LLMs may take a central role in decision-making.
- [360] arXiv:2407.17910 (cross-list from stat.ML) [pdf, html, other]
-
Title: Causal Deepsets for Off-policy Evaluation under Spatial or Spatio-temporal InterferencesSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Off-policy evaluation (OPE) is widely applied in sectors such as pharmaceuticals and e-commerce to evaluate the efficacy of novel products or policies from offline datasets. This paper introduces a causal deepset framework that relaxes several key structural assumptions, primarily the mean-field assumption, prevalent in existing OPE methodologies that handle spatio-temporal interference. These traditional assumptions frequently prove inadequate in real-world settings, thereby restricting the capability of current OPE methods to effectively address complex interference effects. In response, we advocate for the implementation of the permutation invariance (PI) assumption. This innovative approach enables the data-driven, adaptive learning of the mean-field function, offering a more flexible estimation method beyond conventional averaging. Furthermore, we present novel algorithms that incorporate the PI assumption into OPE and thoroughly examine their theoretical foundations. Our numerical analyses demonstrate that this novel approach yields significantly more precise estimations than existing baseline algorithms, thereby substantially improving the practical applicability and effectiveness of OPE methodologies. A Python implementation of our proposed method is available at this https URL.
- [361] arXiv:2407.17938 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Analyzing Brain Tumor Connectomics using Graphs and Persistent HomologyComments: 15 Pages, 7 Figures, 2 Tables, TGI3-MICCAI WorkshopSubjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT)
Recent advances in molecular and genetic research have identified a diverse range of brain tumor sub-types, shedding light on differences in their molecular mechanisms, heterogeneity, and origins. The present study performs whole-brain connectome analysis using diffusionweighted images. To achieve this, both graph theory and persistent homology - a prominent approach in topological data analysis are employed in order to quantify changes in the structural connectivity of the wholebrain connectome in subjects with brain tumors. Probabilistic tractography is used to map the number of streamlines connecting 84 distinct brain regions, as delineated by the Desikan-Killiany atlas from FreeSurfer. These streamline mappings form the connectome matrix, on which persistent homology based analysis and graph theoretical analysis are executed to evaluate the discriminatory power between tumor sub-types that include meningioma and glioma. A detailed statistical analysis is conducted on persistent homology-derived topological features and graphical features to identify the brain regions where differences between study groups are statistically significant (p < 0.05). For classification purpose, graph-based local features are utilized, achieving a highest accuracy of 88%. In classifying tumor sub-types, an accuracy of 80% is attained. The findings obtained from this study underscore the potential of persistent homology and graph theoretical analysis of the whole-brain connectome in detecting alterations in structural connectivity patterns specific to different types of brain tumors.
- [362] arXiv:2407.17949 (cross-list from stat.ML) [pdf, html, other]
-
Title: Fast convergence of the Expectation Maximization algorithm under a logarithmic Sobolev inequalitySubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Computation (stat.CO)
By utilizing recently developed tools for constructing gradient flows on Wasserstein spaces, we extend an analysis technique commonly employed to understand alternating minimization algorithms on Euclidean space to the Expectation Maximization (EM) algorithm via its representation as coordinate-wise minimization on the product of a Euclidean space and a space of probability distributions due to Neal and Hinton (1998). In so doing we obtain finite sample error bounds and exponential convergence of the EM algorithm under a natural generalisation of a log-Sobolev inequality. We further demonstrate that the analysis technique is sufficiently flexible to allow also the analysis of several variants of the EM algorithm.
- [363] arXiv:2407.18017 (cross-list from nlin.CG) [pdf, html, other]
-
Title: A Sensitivity Analysis of Cellular Automata and Heterogeneous Topology Networks: Partially-Local Cellular Automata and Homogeneous Homogeneous Random Boolean NetworksSubjects: Cellular Automata and Lattice Gases (nlin.CG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Elementary Cellular Automata (ECA) are a well-studied computational universe that is, despite its simple configurations, capable of impressive computational variety. Harvesting this computation in a useful way has historically shown itself to be difficult, but if combined with reservoir computing (RC), this becomes much more feasible. Furthermore, RC and ECA enable energy-efficient AI, making the combination a promising concept for Edge AI. In this work, we contrast ECA to substrates of Partially-Local CA (PLCA) and Homogeneous Homogeneous Random Boolean Networks (HHRBN). They are, in comparison, the topological heterogeneous counterparts of ECA. This represents a step from ECA towards more biological-plausible substrates. We analyse these substrates by testing on an RC benchmark (5-bit memory), using Temporal Derrida plots to estimate the sensitivity and assess the defect collapse rate. We find that, counterintuitively, disordered topology does not necessarily mean disordered computation. There are countering computational "forces" of topology imperfections leading to a higher collapse rate (order) and yet, if accounted for, an increased sensitivity to the initial condition. These observations together suggest a shrinking critical range.
- [364] arXiv:2407.18021 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quadratic Advantage with Quantum Randomized Smoothing Applied to Time-Series AnalysisComments: Accepted at the IEEE International Conference on Quantum Computing and Engineering (QCE)Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
As quantum machine learning continues to develop at a rapid pace, the importance of ensuring the robustness and efficiency of quantum algorithms cannot be overstated. Our research presents an analysis of quantum randomized smoothing, how data encoding and perturbation modeling approaches can be matched to achieve meaningful robustness certificates. By utilizing an innovative approach integrating Grover's algorithm, a quadratic sampling advantage over classical randomized smoothing is achieved. This strategy necessitates a basis state encoding, thus restricting the space of meaningful perturbations. We show how constrained $k$-distant Hamming weight perturbations are a suitable noise distribution here, and elucidate how they can be constructed on a quantum computer. The efficacy of the proposed framework is demonstrated on a time series classification task employing a Bag-of-Words pre-processing solution. The advantage of quadratic sample reduction is recovered especially in the regime with large number of samples. This may allow quantum computers to efficiently scale randomized smoothing to more complex tasks beyond the reach of classical methods.
- [365] arXiv:2407.18026 (cross-list from eess.IV) [pdf, html, other]
-
Title: Segmentation-guided MRI reconstruction for meaningfully diverse reconstructionsComments: Accepted at DGM4MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Inverse problems, such as accelerated MRI reconstruction, are ill-posed and an infinite amount of possible and plausible solutions exist. This may not only lead to uncertainty in the reconstructed image but also in downstream tasks such as semantic segmentation. This uncertainty, however, is mostly not analyzed in the literature, even though probabilistic reconstruction models are commonly used. These models can be prone to ignore plausible but unlikely solutions like rare pathologies. Building on MRI reconstruction approaches based on diffusion models, we add guidance to the diffusion process during inference, generating two meaningfully diverse reconstructions corresponding to an upper and lower bound segmentation. The reconstruction uncertainty can then be quantified by the difference between these bounds, which we coin the 'uncertainty boundary'. We analyzed the behavior of the upper and lower bound segmentations for a wide range of acceleration factors and found the uncertainty boundary to be both more reliable and more accurate compared to repeated sampling. Code is available at this https URL
- [366] arXiv:2407.18054 (cross-list from eess.IV) [pdf, html, other]
-
Title: LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution KernelsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The segmentation of cell nuclei in tissue images stained with the blood dye hematoxylin and eosin (H$\&$E) is essential for various clinical applications and analyses. Due to the complex characteristics of cellular morphology, a large receptive field is considered crucial for generating high-quality segmentation. However, previous methods face challenges in achieving a balance between the receptive field and computational burden. To address this issue, we propose LKCell, a high-accuracy and efficient cell segmentation method. Its core insight lies in unleashing the potential of large convolution kernels to achieve computationally efficient large receptive fields. Specifically, (1) We transfer pre-trained large convolution kernel models to the medical domain for the first time, demonstrating their effectiveness in cell segmentation. (2) We analyze the redundancy of previous methods and design a new segmentation decoder based on large convolution kernels. It achieves higher performance while significantly reducing the number of parameters. We evaluate our method on the most challenging benchmark and achieve state-of-the-art results (0.5080 mPQ) in cell nuclei instance segmentation with only 21.6% FLOPs compared with the previous leading method. Our source code and models are available at this https URL.
- [367] arXiv:2407.18056 (cross-list from math.OC) [pdf, html, other]
-
Title: Computing an Aircraft's Gliding Range and Minimal Return Altitude in Presence of Obstacles and WindComments: 27 pages, 18 figuresSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Numerical Analysis (math.NA)
In the event of a total loss of thrust a pilot must identify a reachable landing site and subsequently execute a forced landing. To do so this, they must estimate which region on the ground can be reached safely in gliding flight. We call this the gliding reachable region (GRR). To compute the GRR, we employ an optimal control formulation aiming to reach a point in space while minimizing altitude loss. A simplified model of the aircraft's dynamics is used, where the effect of turns is neglected. The resulting equations are discretized on a grid and solved numerically. Our algorithm for computing the GRR is fast enough to run in real time during flight, it accounts for ground obstacles and wind, and for each point in the GRR it outputs the path to reach it with minimal loss of altitude. A related problem is estimating the minimal altitude an aircraft needs to glide to a given airfield in the presence of obstacles. This information enables pilots to plan routes that always have an airport within gliding distance. We formalize this problem using an optimal control formulation based on the same aircraft dynamics model. The resulting equations are solved with a second algorithm that outputs the minimal re-entry altitude and the paths to reach the airfield from any position while avoiding obstacles. The algorithms we develop are based on the Ordered Upwind Method and the Fast Marching Method.
- [368] arXiv:2407.18057 (cross-list from math.DS) [pdf, html, other]
-
Title: Physics-informed nonlinear vector autoregressive models for the prediction of dynamical systemsSubjects: Dynamical Systems (math.DS); Machine Learning (cs.LG)
Machine learning techniques have recently been of great interest for solving differential equations. Training these models is classically a data-fitting task, but knowledge of the expression of the differential equation can be used to supplement the training objective, leading to the development of physics-informed scientific machine learning. In this article, we focus on one class of models called nonlinear vector autoregression (NVAR) to solve ordinary differential equations (ODEs). Motivated by connections to numerical integration and physics-informed neural networks, we explicitly derive the physics-informed NVAR (piNVAR) which enforces the right-hand side of the underlying differential equation regardless of NVAR construction. Because NVAR and piNVAR completely share their learned parameters, we propose an augmented procedure to jointly train the two models. Then, using both data-driven and ODE-driven metrics, we evaluate the ability of the piNVAR model to predict solutions to various ODE systems, such as the undamped spring, a Lotka-Volterra predator-prey nonlinear model, and the chaotic Lorenz system.
- [369] arXiv:2407.18070 (cross-list from eess.IV) [pdf, html, other]
-
Title: CSWin-UNet: Transformer UNet with Cross-Shaped Windows for Medical Image SegmentationSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep learning, especially convolutional neural networks (CNNs) and Transformer architectures, have become the focus of extensive research in medical image segmentation, achieving impressive results. However, CNNs come with inductive biases that limit their effectiveness in more complex, varied segmentation scenarios. Conversely, while Transformer-based methods excel at capturing global and long-range semantic details, they suffer from high computational demands. In this study, we propose CSWin-UNet, a novel U-shaped segmentation method that incorporates the CSWin self-attention mechanism into the UNet to facilitate horizontal and vertical stripes self-attention. This method significantly enhances both computational efficiency and receptive field interactions. Additionally, our innovative decoder utilizes a content-aware reassembly operator that strategically reassembles features, guided by predicted kernels, for precise image resolution restoration. Our extensive empirical evaluations on diverse datasets, including synapse multi-organ CT, cardiac MRI, and skin lesions, demonstrate that CSWin-UNet maintains low model complexity while delivering high segmentation accuracy.
- [370] arXiv:2407.18081 (cross-list from math.OC) [pdf, html, other]
-
Title: Optimal Control using Composite Bernstein ApproximantsComments: This paper was accepted for publication at the 2024 63rd IEEE Conference on Decision and Control (CDC)Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Numerical Analysis (math.NA)
In this work, we present composite Bernstein polynomials as a direct collocation method for approximating optimal control problems. An analysis of the convergence properties of composite Bernstein polynomials is provided, and beneficial properties of composite Bernstein polynomials for the solution of optimal control problems are discussed. The efficacy of the proposed approximation method is demonstrated through a bang-bang example. Lastly, we apply this method to a motion planning problem, offering a practical solution that emphasizes the ability of this method to solve complex optimal control problems.
- [371] arXiv:2407.18103 (cross-list from q-fin.CP) [pdf, html, other]
-
Title: Fine-Tuning Large Language Models for Stock Return Prediction Using NewsflowSubjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
Large language models (LLMs) and their fine-tuning techniques have demonstrated superior performance in various language understanding and generation tasks. This paper explores fine-tuning LLMs for stock return forecasting with financial newsflow. In quantitative investing, return forecasting is fundamental for subsequent tasks like stock picking, portfolio optimization, etc. We formulate the model to include text representation and forecasting modules. We propose to compare the encoder-only and decoder-only LLMs, considering they generate text representations in distinct ways. The impact of these different representations on forecasting performance remains an open question. Meanwhile, we compare two simple methods of integrating LLMs' token-level representations into the forecasting module. The experiments on real news and investment universes reveal that: (1) aggregated representations from LLMs' token-level embeddings generally produce return predictions that enhance the performance of long-only and long-short portfolios; (2) in the relatively large investment universe, the decoder LLMs-based prediction model leads to stronger portfolios, whereas in the small universes, there are no consistent winners. Among the three LLMs studied (DeBERTa, Mistral, Llama), Mistral performs more robustly across different universes; (3) return predictions derived from LLMs' text representations are a strong signal for portfolio construction, outperforming conventional sentiment scores.
- [372] arXiv:2407.18105 (cross-list from eess.IV) [pdf, html, other]
-
Title: Multi-Resolution Histopathology Patch Graphs for Ovarian Cancer SubtypingComments: Initially submitted version of a paper which has been accepted in the GRAIL workshop at MICCAI 2024Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Computer vision models are increasingly capable of classifying ovarian epithelial cancer subtypes, but they differ from pathologists by processing small tissue patches at a single resolution. Multi-resolution graph models leverage the spatial relationships of patches at multiple magnifications, learning the context for each patch. In this study, we conduct the most thorough validation of a graph model for ovarian cancer subtyping to date. Seven models were tuned and trained using five-fold cross-validation on a set of 1864 whole slide images (WSIs) from 434 patients treated at Leeds Teaching Hospitals NHS Trust. The cross-validation models were ensembled and evaluated using a balanced hold-out test set of 100 WSIs from 30 patients, and an external validation set of 80 WSIs from 80 patients in the Transcanadian Study. The best-performing model, a graph model using 10x+20x magnification data, gave balanced accuracies of 73%, 88%, and 99% in cross-validation, hold-out testing, and external validation, respectively. However, this only exceeded the performance of attention-based multiple instance learning in external validation, with a 93% balanced accuracy. Graph models benefitted greatly from using the UNI foundation model rather than an ImageNet-pretrained ResNet50 for feature extraction, with this having a much greater effect on performance than changing the subsequent classification approach. The accuracy of the combined foundation model and multi-resolution graph network offers a step towards the clinical applicability of these models, with a new highest-reported performance for this task, though further validations are still required to ensure the robustness and usability of the models.
- [373] arXiv:2407.18113 (cross-list from math.CO) [pdf, html, other]
-
Title: Upper bounds on the average edit distance between two random stringsSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
We study the average edit distance between two random strings. More precisely, we adapt a technique introduced by Lueker in the context of the average longest common subsequence of two random strings to improve the known upper bound on the average edit distance. We improve all the known upper bounds for small alphabets. We also provide a new implementation of Lueker technique to improve the lower bound on the average length of the longest common subsequence of two random strings for all small alphabets of size other than $2$ and $4$.
- [374] arXiv:2407.18126 (cross-list from math.CO) [pdf, html, other]
-
Title: Proof of a conjecture on isolation of graphs dominated by a vertexSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
A copy of a graph $F$ is called an $F$-copy. For any graph $G$, the $F$-isolation number of $G$, denoted by $\iota(G,F)$, is the size of a smallest subset $D$ of the vertex set of $G$ such that the closed neighbourhood $N[D]$ of $D$ in $G$ intersects the vertex sets of the $F$-copies contained by $G$ (equivalently, $G-N[D]$ contains no $F$-copy). Thus, $\iota(G,K_1)$ is the domination number $\gamma(G)$ of $G$, and $\iota(G,K_2)$ is the vertex-edge domination number of $G$. We prove that if $F$ is a $k$-edge graph, $\gamma(F) = 1$ (that is, a vertex of $F$ is adjacent to all the other vertices of $F$), and $G$ is a connected $m$-edge graph, then $\iota(G,F) \leq \big\lfloor \frac{m+1}{k+2} \big\rfloor$ unless $G$ is an $F$-copy or $F$ is a $3$-path and $G$ is a $6$-cycle. This was recently posed as a conjecture by Zhang and Wu, who settled the case where $F$ is a star. The result for the case where $F$ is a clique had been obtained by Fenech, Kaemawichanurat and the present author. The bound is attainable for any $m \geq 0$ unless $m = k \leq 2$. New ideas, such as the consideration of divisibility, are introduced in the proof of the conjecture.
- [375] arXiv:2407.18158 (cross-list from stat.ML) [pdf, html, other]
-
Title: Unlocking Tokens as Data Points for Generalization Bounds on Larger Language ModelsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality text. Additionally, the tightness of these existing bounds depends on the number of IID documents in a training set rather than the much larger number of non-IID constituent tokens, leaving untapped potential for tighter bounds. In this work, we instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets. Since a dataset contains far more tokens than documents, our generalization bounds not only tolerate but actually benefit from far less restrictive compression schemes. With Monarch matrices, Kronecker factorizations, and post-training quantization, we achieve non-vacuous generalization bounds for LLMs as large as LLaMA2-70B. Unlike previous approaches, our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.
- [376] arXiv:2407.18180 (cross-list from physics.bio-ph) [pdf, other]
-
Title: Passive wing deployment and retraction in beetles and flapping microrobotsComments: 20 pages, 10 figuresSubjects: Biological Physics (physics.bio-ph); Robotics (cs.RO)
Birds, bats and many insects can tuck their wings against their bodies at rest and deploy them to power flight. Whereas birds and bats use well-developed pectoral and wing muscles and tendons, how insects control these movements remains unclear, as mechanisms of wing deployment and retraction vary among insect species. Beetles (Coleoptera) display one of the most complex wing mechanisms. For example, in rhinoceros beetles, the wing deployment initiates by fully opening the elytra and partially releasing the hindwings from the abdomen. Subsequently, the beetle starts flapping, elevates the hindwings at the bases, and unfolds the wingtips in an origami-like fashion. Whilst the origami-like fold have been extensively explored, limited attention has been given to the hindwing base deployment and retraction, which are believed to be driven by thoracic muscles. Using high-speed cameras and robotic flapping-wing models, here we demonstrate that rhinoceros beetles can effortlessly elevate the hindwings to flight position without the need for muscular activity. We show that opening the elytra triggers a spring-like partial release of the hindwings from the body, allowing the clearance needed for subsequent flapping motion that brings the hindwings into flight position. The results also show that after flight, beetles can leverage the elytra to push the hindwings back into the resting position, further strengthening the hypothesis of a passive deployment mechanism. Finally, we validate the hypothesis with a flapping microrobot that passively deploys its wings for stable controlled flight and retracts them neatly upon landing, which offers a simple yet effective approach to the design of insect-like flying micromachines.
- [377] arXiv:2407.18202 (cross-list from quant-ph) [pdf, html, other]
-
Title: Differentiable Quantum Architecture Search in Asynchronous Quantum Reinforcement LearningComments: Accepted by IEEE International Conference on Quantum Computing and Engineering - QCE 2024Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
The emergence of quantum reinforcement learning (QRL) is propelled by advancements in quantum computing (QC) and machine learning (ML), particularly through quantum neural networks (QNN) built on variational quantum circuits (VQC). These advancements have proven successful in addressing sequential decision-making tasks. However, constructing effective QRL models demands significant expertise due to challenges in designing quantum circuit architectures, including data encoding and parameterized circuits, which profoundly influence model performance. In this paper, we propose addressing this challenge with differentiable quantum architecture search (DiffQAS), enabling trainable circuit parameters and structure weights using gradient-based optimization. Furthermore, we enhance training efficiency through asynchronous reinforcement learning (RL) methods facilitating parallel training. Through numerical simulations, we demonstrate that our proposed DiffQAS-QRL approach achieves performance comparable to manually-crafted circuit architectures across considered environments, showcasing stability across diverse scenarios. This methodology offers a pathway for designing QRL models without extensive quantum knowledge, ensuring robust performance and fostering broader application of QRL.
Cross submissions for Friday, 26 July 2024 (showing 35 of 35 entries )
- [378] arXiv:1903.07328 (replaced) [pdf, other]
-
Title: Parametric Timed Pattern MatchingComments: This is the author version of the manuscript of the same name published in ACM Transactions on Software Engineering and Methodology (Volume 32, Issue 1, 2023). This manuscript is an extension of [ICECCS 2018, NFM 2019], with [ICECCS 2018] describing the first method of this manuscript (based on parametric timed model checking) while [NFM 2019] describes the second dedicated method. arXiv admin note: substantial text overlap with arXiv:1812.08940Journal-ref: ACM Transactions on Software Engineering and Methodology (Volume 32, Issue 1, 2023)Subjects: Formal Languages and Automata Theory (cs.FL)
Given a log and a specification, timed pattern matching aims at exhibiting for which start and end dates a specification holds on that log. For example, "a given action is always followed by another action before a given deadline". This problem has strong connections with monitoring real-time systems. We address here timed pattern matching in the presence of an uncertain specification, i.e., that may contain timing parameters (e.g., the deadline can be uncertain or unknown). We want to know for which start and end dates, and for what values of the timing parameters, a property holds. For instance, we look for the minimum or maximum deadline (together with the corresponding start and end dates) for which the property holds. We propose two frameworks for parametric timed pattern matching. The first one is based on parametric timed model checking. In contrast to most parametric timed problems, the solution is effectively computable. The second one is a dedicated method; not only we largely improve the efficiency compared to the first method, but we further propose optimizations with skipping. Our experiment results suggest that our algorithms, especially the second one, are efficient and practically relevant.
- [379] arXiv:1906.08619 (replaced) [pdf, html, other]
-
Title: Bayesian Modelling in Practice: Using Uncertainty to Improve Trustworthiness in Medical ApplicationsComments: Presented at AISG @ ICML2019: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The Intensive Care Unit (ICU) is a hospital department where machine learning has the potential to provide valuable assistance in clinical decision making. Classical machine learning models usually only provide point-estimates and no uncertainty of predictions. In practice, uncertain predictions should be presented to doctors with extra care in order to prevent potentially catastrophic treatment decisions. In this work we show how Bayesian modelling and the predictive uncertainty that it provides can be used to mitigate risk of misguided prediction and to detect out-of-domain examples in a medical setting. We derive analytically a bound on the prediction loss with respect to predictive uncertainty. The bound shows that uncertainty can mitigate loss. Furthermore, we apply a Bayesian Neural Network to the MIMIC-III dataset, predicting risk of mortality of ICU patients. Our empirical results show that uncertainty can indeed prevent potential errors and reliably identifies out-of-domain patients. These results suggest that Bayesian predictive uncertainty can greatly improve trustworthiness of machine learning models in high-risk settings such as the ICU.
- [380] arXiv:2007.02776 (replaced) [pdf, html, other]
-
Title: Reduction of a nonlinear system and its numerical solution using a fractional iterative methodComments: arXiv admin note: substantial text overlap with arXiv:2006.14963, arXiv:1804.08445; text overlap with arXiv:2004.10860Journal-ref: Journal of Mathematics and Statistical Science, 6:285-299, 2020Subjects: Numerical Analysis (math.NA); Functional Analysis (math.FA)
A nonlinear algebraic equation system of 5 variables is numerically solved, which is derived from the application of the Fourier transform to a differential equation system that allows modeling the behavior of the temperatures and the efficiencies of a hybrid solar receiver, which in simple terms is the combination of a photovoltaic system with a thermoelectric system. In addition, a way to reduce the previous system to a nonlinear system of only 2 variables is presented. Naturally, reducing algebraic equation systems of dimension N to systems of smaller dimensions has the main advantage of reducing the number of variables involved in a problem, but the analytical expressions of the systems become more complicated. However, to minimize this disadvantage, an iterative method that does not explicitly depend on the analytical complexity of the system to be solved is used. A fractional iterative method, valid for one and several variables, that uses the properties of fractional calculus, in particular the fact that the fractional derivatives of constants are not always zero, to find solutions of nonlinear systems is presented.
- [381] arXiv:2008.05412 (replaced) [pdf, html, other]
-
Title: A nonlinear system related to investment under uncertainty solved using the fractional pseudo-Newton methodJournal-ref: Journal of Mathematical Sciences: Advances and Applications, 63:41-53, 2020Subjects: Numerical Analysis (math.NA); Mathematical Physics (math-ph); Functional Analysis (math.FA); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
A nonlinear algebraic equation system of two variables is numerically solved, which is derived from a nonlinear algebraic equation system of four variables, that corresponds to a mathematical model related to investment under conditions of uncertainty. The theory of investment under uncertainty scenarios proposes a model to determine when a producer must expand or close, depending on his income. The system mentioned above is solved using a fractional iterative method, valid for one and several variables, that uses the properties of fractional calculus, in particular the fact that the fractional derivatives of constants are not always zero, to find solutions of nonlinear systems.
- [382] arXiv:2102.07401 (replaced) [pdf, other]
-
Title: Model-bounded monitoring of hybrid systemsComments: This is the author version of the manuscript of the same name published in the ACM Transactions on Cyber-Physical SystemsJournal-ref: ACM Transactions on Cyber-Physical Systems, Volume 6, Issue 4, Article No.: 30, Pages 1-26, 2022Subjects: Systems and Control (eess.SY); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Monitoring of hybrid systems attracts both scientific and practical attention. However, monitoring algorithms suffer from the methodological difficulty of only observing sampled discrete-time signals, while real behaviors are continuous-time signals. To mitigate this problem of sampling uncertainties, we introduce a model-bounded monitoring scheme, where we use prior knowledge about the target system to prune interpolation candidates. Technically, we express such prior knowledge by linear hybrid automata (LHAs) -- the LHAs are called bounding models. We introduce a novel notion of monitored language of LHAs, and we reduce the monitoring problem to the membership problem of the monitored language. We present two partial algorithms -- one is via reduction to reachability in LHAs and the other is a direct one using polyhedra -- and show that these methods, and thus the proposed model-bounded monitoring scheme, are efficient and practically relevant.
- [383] arXiv:2111.06971 (replaced) [pdf, other]
-
Title: Exploiting All Samples in Low-Resource Sentence Classification: Early Stopping and Initialization ParametersComments: 15 pages, 8 figures, published in IEEE AccessJournal-ref: IEEE Access, vol. 11 (2023), pp. 30768-30782;Subjects: Computation and Language (cs.CL)
To improve deep-learning performance in low-resource settings, many researchers have redesigned model architectures or applied additional data (e.g., external resources, unlabeled samples). However, there have been relatively few discussions on how to make good use of small amounts of labeled samples, although it is potentially beneficial and should be done before applying additional data or redesigning models. In this study, we assume a low-resource setting in which only a few labeled samples (i.e., 30-100 per class) are available, and we discuss how to exploit them without additional data or model redesigns. We explore possible approaches in the following three aspects: training-validation splitting, early stopping, and weight initialization. Extensive experiments are conducted on six public sentence classification datasets. Performance on various evaluation metrics (e.g., accuracy, loss, and calibration error) significantly varied depending on the approaches that were combined in the three aspects. Based on the results, we propose an integrated method, which is to initialize the model with a weight averaging method and use a non-validation stop method to train all samples. This simple integrated method consistently outperforms the competitive methods; e.g., the average accuracy of six datasets of this method was 1.8% higher than those of conventional validation-based methods. In addition, the integrated method further improves the performance when adapted to several state-of-the-art models that use additional data or redesign the network architecture (e.g., self-training and enhanced structural models). Our results highlight the importance of the training strategy and suggest that the integrated method can be the first step in the low-resource setting. This study provides empirical knowledge that will be helpful when dealing with low-resource data in future efforts.
- [384] arXiv:2209.02935 (replaced) [pdf, other]
-
Title: Normalised clustering accuracy: An asymmetric external cluster validity measureJournal-ref: Journal of Classification, 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
There is no, nor will there ever be, single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. However, their validity is questionable because the clusterings they endorse can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to fixed ground truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable properties. In particular, they do not identify worst-case scenarios correctly, nor are they easily interpretable. As a consequence, the evaluation of clustering algorithms on diverse benchmark datasets can be difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to some similarity relation, scale-invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
- [385] arXiv:2209.11604 (replaced) [pdf, html, other]
-
Title: Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network CalibrationComments: Transactions on Machine Learning ResearchJournal-ref: Transactions on Machine Learning Research, 2024, ISSN 2835-8856, https://openreview.net/forum?id=qSFToMqLcqSubjects: Machine Learning (cs.LG)
Neural network calibration is an essential task in deep learning to ensure consistency between the confidence of model prediction and the true correctness likelihood. In this paper, we propose a new post-processing calibration method called Neural Clamping, which employs a simple joint input-output transformation on a pre-trained classifier via a learnable universal input perturbation and an output temperature scaling parameter. Moreover, we provide theoretical explanations on why Neural Clamping is provably better than temperature scaling. Evaluated on BloodMNIST, CIFAR-100, and ImageNet image recognition datasets and a variety of deep neural network models, our empirical results show that Neural Clamping significantly outperforms state-of-the-art post-processing calibration methods. The code is available at this http URL, and the demo is available at this http URL.
- [386] arXiv:2210.09735 (replaced) [pdf, html, other]
-
Title: Policy Gradient Methods for Designing Dynamic Output Feedback ControllersComments: Accepted in European Journal of ControlSubjects: Systems and Control (eess.SY)
This paper proposes model-based and model-free policy gradient methods (PGMs) for designing dynamic output feedback controllers for discrete-time partially observable systems. To fulfill this objective, we first show that any dynamic output feedback controller design is equivalent to a state-feedback controller design for a newly introduced system whose internal state is a finite-length input-output history (IOH). Next, based on this equivalency, we propose a model-based PGM and show its global linear convergence by proving that the Polyak-Lojasiewicz inequality holds for a reachability-based lossless projection of the IOH dynamics. Moreover, we propose two model-free implementations of the PGM: the multi- and single-episodic PGM. The former is a Monte Carlo approximation of the model-based PGM, whereas the latter is a simplified version of the former for ease of use in real systems. A sample complexity analysis of both methods is also presented. Finally, the effectiveness of the model-based/model-free PGMs is investigated through a numerical simulation.
- [387] arXiv:2211.07978 (replaced) [pdf, other]
-
Title: Shellability is hard even for ballsComments: 66 pages, 51 figures (The figures take nontrivial portion of the space thus in reality the paper is a bit shorter.)Subjects: Computational Geometry (cs.CG); Combinatorics (math.CO); Geometric Topology (math.GT)
The main goal of this paper is to show that shellability is NP-hard for triangulated d-balls (this also gives hardness for triangulated d-manifolds/d-pseudomanifolds with boundary) as soon as d is at least 3. This extends our earlier work with Goaoc, Patáková and Wagner on hardness of shellability of 2-complexes and answers some questions implicitly raised by Danaraj and Klee in 1978 and explicitly mentioned by Santamaría-Galvis and Woodroofe. Together with the main goal, we also prove that collapsibility is NP-hard for 3-complexes embeddable in the 3-space, extending an earlier work of the second author and answering an open question mentioned by Cohen, Fasy, Miller, Nayyeri, Peng and Walkington; and that shellability is NP-hard for 2-complexes embeddable in the 3-space, answering another question of Santamaría-Galvis and Woodroofe (in a slightly stronger form than what is given by the main result).
- [388] arXiv:2211.10526 (replaced) [pdf, html, other]
-
Title: Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer InferenceHaoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine LinComments: CVPR 2023 Camera ReadySubjects: Computer Vision and Pattern Recognition (cs.CV)
Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost as compared to convolutional neural networks (CNNs), one reason is that ViTs' attention measures global similarities and thus has a quadratic complexity with the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifice ViTs' capabilities of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear angular attention during ViT inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. And we further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate high-order residuals: a depthwise convolution and an auxiliary masked softmax attention to help learn both global and local information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to a 1.8% higher accuracy or 40% MACs reduction on ImageNet classification and 1.2 higher mAP on COCO detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attentions.
- [389] arXiv:2212.03117 (replaced) [pdf, html, other]
-
Title: Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-SnapshotsComments: 20 pages, 15 figuresSubjects: Machine Learning (cs.LG)
Many real-world continuous control problems are in the dilemma of weighing the pros and cons, multi-objective reinforcement learning (MORL) serves as a generic framework of learning control policies for different preferences over objectives. However, the existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front and therefore are not sample-efficient, or utilizes a shared policy network for coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with convergence guarantee. To substantiate this concept, we propose the technique of Q replay buffer, which stores the learned Q-networks from the past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that with much fewer samples, the proposed algorithm can outperform the benchmark MORL methods on a variety of MORL benchmark tasks.
- [390] arXiv:2212.06543 (replaced) [pdf, html, other]
-
Title: Improving Stance Detection by Leveraging Measurement Knowledge from Social Sciences: A Case Study of Dutch Political Tweets and Traditional Gender Role DivisionSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Stance detection (SD) concerns automatically determining the viewpoint (i.e., in favour of, against, or neutral) of a text's author towards a target. SD has been applied to many research topics, among which the detection of stances behind political tweets is an important one. In this paper, we apply SD to a dataset of tweets from official party accounts in the Netherlands between 2017 and 2021, with a focus on stances towards traditional gender role division, a dividing issue between (some) Dutch political parties. To implement and improve SD of traditional gender role division, we propose to leverage an established survey instrument from social sciences, which has been validated for the purpose of measuring attitudes towards traditional gender role division. Based on our experiments, we show that using such a validated survey instrument helps to improve SD performance.
- [391] arXiv:2212.14729 (replaced) [pdf, html, other]
-
Title: Batchless Normalization: How to Normalize Activations Across Instances with Minimal Memory RequirementsBenjamin Berger (Leibniz Universität Hannover), Victor Uc Cetina (Universidad Autónoma de Yucatán)Comments: 17 pages (12 without appendices), 12 figures, 5 tablesSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
In training neural networks, batch normalization has many benefits, not all of them entirely understood. But it also has some drawbacks. Foremost is arguably memory consumption, as computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization it would be possible to process them one by one while accumulating the weight gradients. Another drawback is that that distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained using gradient descent but require special treatment, complicating implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, cause the minimization of the negative log likelihood of a Gaussian distribution that is used to normalize the activation. Among other benefits, this will hopefully contribute to the democratization of AI research by means of lowering the hardware requirements for training larger models.
- [392] arXiv:2301.08403 (replaced) [pdf, html, other]
-
Title: One-shot Generative Distribution Matching for Augmented RF-based UAV IdentificationComments: 31 pages, 7 figures, 4 tablesSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Applications (stat.AP); Machine Learning (stat.ML)
This work addresses the challenge of identifying Unmanned Aerial Vehicles (UAV) using radiofrequency (RF) fingerprinting in limited RF environments. The complexity and variability of RF signals, influenced by environmental interference and hardware imperfections, often render traditional RF-based identification methods ineffective. To address these complications, the study introduces the rigorous use of one-shot generative methods for augmenting transformed RF signals, offering a significant improvement in UAV identification. This approach shows promise in low-data regimes, outperforming deep generative methods like conditional generative adversarial networks (GANs) and variational auto-encoders (VAEs). The paper provides a theoretical guarantee for the effectiveness of one-shot generative models in augmenting limited data, setting a precedent for their application in limited RF environments. This research contributes to learning techniques in low-data regime scenarios, which may include atypical complex sequences beyond images and videos. The code and links to datasets used in this study are available at this https URL.
- [393] arXiv:2302.00390 (replaced) [pdf, html, other]
-
Title: Hierarchical Classification of Research Fields in the "Web of Science" Using Deep LearningComments: Under minor revision at QSSSubjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper presents a hierarchical classification system that automatically categorizes a scholarly publication using its abstract into a three-tier hierarchical label set (discipline, field, subfield) in a multi-class setting. This system enables a holistic categorization of research activities in the mentioned hierarchy in terms of knowledge production through articles and impact through citations, permitting those activities to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields and 1,485 subfields among 160 million abstract snippets in Microsoft Academic Graph (version 2018-05-17). We used batch training in a modularized and distributed fashion to address and allow for interdisciplinary and interfield classifications in single-label and multi-label settings. In total, we have conducted 3,140 experiments in all considered models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers). The classification accuracy is > 90% in 77.13% and 78.19% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, and to capture the degree of interdisciplinarity. The proposed system (a set of pre-trained models) can serve as a backbone to an interactive system for indexing scientific publications in the future.
- [394] arXiv:2302.00509 (replaced) [pdf, html, other]
-
Title: Exploring Semantic Perturbations on GroverSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
With news and information being as easy to access as they currently are, it is more important than ever to ensure that people are not mislead by what they read. Recently, the rise of neural fake news (AI-generated fake news) and its demonstrated effectiveness at fooling humans has prompted the development of models to detect it. One such model is the Grover model, which can both detect neural fake news to prevent it, and generate it to demonstrate how a model could be misused to fool human readers. In this work we explore the Grover model's fake news detection capabilities by performing targeted attacks through perturbations on input news articles. Through this we test Grover's resilience to these adversarial attacks and expose some potential vulnerabilities which should be addressed in further iterations to ensure it can detect all types of fake news accurately.
- [395] arXiv:2304.03183 (replaced) [pdf, html, other]
-
Title: History-deterministic Timed AutomataSubjects: Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
We explore the notion of history-determinism in the context of timed automata (TA) over infinite timed words. History-deterministic (HD) automata are those in which nondeterminism can be resolved on the fly, based on the run constructed thus far. History-determinism is a robust property that admits different game-based characterisations, and HD specifications allow for game-based verification without an expensive determinization step.
We show that the class of timed $\omega$-languages recognized by HD timed automata strictly extends that of deterministic ones, and is strictly included in those recognised by fully non-deterministic TA.
For non-deterministic timed automata it is known that universality is already undecidable for safety/reachability TA. For history-deterministic TA with arbitrary parity acceptance, we show that timed universality, inclusion, and synthesis all remain decidable and are EXPTIME-complete.
For the subclass of TA with safety or reachability acceptance, one can decide (in EXPTIME) whether such an automaton is history-deterministic. If so, it can effectively determinized without introducing new automata states. - [396] arXiv:2304.10858 (replaced) [pdf, other]
-
Title: A Physical-Layer Orchestration Framework for Open System Models of Autonomous RISsVictor Croisfelt, Francesco Devoti, Fabio Saggese, Vincenzo Sciancalepore, Xavier Costa-Pérez, Petar PopovskiComments: 16 pages, 9 figures, submitted to an IEEE Transactions on Wireless CommunicationsSubjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Systems and Control (eess.SY)
To obviate the control of reconfigurable intelligent surfaces (RISs) and related overhead, recent works envisioned the concept of autonomous RISs: Intelligent devices capable of autonomously deciding their reflection states. This paradigm is enabled by hybrid RIS (HRIS), a hardware solution that integrates sensing and channel estimation (CHEST) capabilities, enabling autonomous operation. Autonomous RISs operate independently alongside other network entities, promoting open communication system models for RISs. However, this autonomy introduces a significant challenge: How to schedule the operation of the autonomous RIS over time and frequency to minimize its impact on the network? This paper introduces a physical layer (PHY) orchestration framework within a massive multiple-input multiple-output (mMIMO) network to address this challenge. Our framework highlights an engineering trade-off termed the "autonomous RIS trade-off," which examines the performance implications of autonomy. Through mathematical analysis and numerical results, we describe the HRIS feasibility region, illustrating the conditions under which autonomous RISs are viable when properly deployed.
- [397] arXiv:2304.11125 (replaced) [pdf, html, other]
-
Title: Implementing and Evaluating Security in O-RAN: Interfaces, Intelligence, and PlatformsJoshua Groen, Salvatore DOro, Utku Demir, Leonardo Bonati, Michele Polese, Tommaso Melodia, Kaushik ChowdhuryComments: 8 pages, 5 figures, 1 table, submitted to IEEE Network MagazineJournal-ref: IEEE Network Magazine 2024Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Systems and Control (eess.SY)
The Open Radio Access Network (RAN) is a networking paradigm that builds on top of cloud-based, multi-vendor, open and intelligent architectures to shape the next generation of cellular networks for 5G and beyond. While this new paradigm comes with many advantages in terms of observatibility and reconfigurability of the network, it inevitably expands the threat surface of cellular systems and can potentially expose its components to several cyber attacks, thus making securing O-RAN networks a necessity. In this paper, we explore the security aspects of O-RAN systems by focusing on the specifications and architectures proposed by the O-RAN Alliance. We address the problem of securing O-RAN systems with a holistic perspective, including considerations on the open interfaces used to interconnect the different O-RAN components, on the overall platform, and on the intelligence used to monitor and control the network. For each focus area we identify threats, discuss relevant solutions to address these issues, and demonstrate experimentally how such solutions can effectively defend O-RAN systems against selected cyber attacks. This article is the first work in approaching the security aspect of O-RAN holistically and with experimental evidence obtained on a state-of-the-art programmable O-RAN platform, thus providing unique guideline for researchers in the field.
- [398] arXiv:2305.01648 (replaced) [pdf, html, other]
-
Title: Manipulator as a Tail: Promoting Dynamic Stability for Legged LocomotionSubjects: Robotics (cs.RO)
For locomotion, is an arm on a legged robot a liability or an asset for locomotion? Biological systems evolved additional limbs beyond legs that facilitates postural control. This work shows how a manipulator can be an asset for legged locomotion at high speeds or under external perturbations, where the arm serves beyond manipulation. Since the system has 15 degrees of freedom (twelve for the legged robot and three for the arm), off-the-shelf reinforcement learning (RL) algorithms struggle to learn effective locomotion policies. Inspired by Bernstein's neurophysiological theory of animal motor learning, we develop an incremental training procedure that initially freezes some degrees of freedom and gradually releases them, using behaviour cloning (BC) from an early learning procedure to guide optimization in later learning. Simulation experiments show that our policy increases the success rate by up to 61 percentage points over the baselines. Simulation and real robot experiments suggest that our policy learns to use the arm as a tail to initiate robot turning at high speeds and to stabilize the quadruped under external perturbations. Quantitatively, in simulation experiments, we cut the failure rate up to 43.6% during high-speed turning and up to 31.8% for quadruped under external forces compared to using a locked arm.
- [399] arXiv:2305.13067 (replaced) [pdf, html, other]
-
Title: Distilling Robustness into Natural Language Inference Models with Domain-Targeted AugmentationComments: Accepted at ACL Findings 2024Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Knowledge distillation optimises a smaller student model to behave similarly to a larger teacher model, retaining some of the performance benefits. While this method can improve results on in-distribution examples, it does not necessarily generalise to out-of-distribution (OOD) settings. We investigate two complementary methods for improving the robustness of the resulting student models on OOD domains. The first approach augments the distillation with generated unlabelled examples that match the target distribution. The second method upsamples data points among the training set that are similar to the target distribution. When applied on the task of natural language inference (NLI), our experiments on MNLI show that distillation with these modifications outperforms previous robustness solutions. We also find that these methods improve performance on OOD domains even beyond the target domain.
- [400] arXiv:2306.17103 (replaced) [pdf, html, other]
-
Title: LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPTLe Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi LI, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wei Xue, Yike GuoComments: 9 pages, 2 figures, 5 tables, accepted by ISMIR 2023Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
- [401] arXiv:2307.00146 (replaced) [pdf, html, other]
-
Title: Bluefish: A Relational Framework for Graphic RepresentationsComments: 27 pages, 14 figuresSubjects: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
Diagrams are essential tools for problem-solving and communication as they externalize conceptual structures using spatial relationships. But when picking a diagramming framework, users are faced with a dilemma. They can either use a highly expressive but low-level toolkit, whose API does not match their domain-specific concepts, or select a high-level typology, which offers a recognizable vocabulary but supports a limited range of diagrams. To address this gap, we introduce Bluefish: a diagramming framework inspired by component-based user interface (UI) libraries. Bluefish lets users create diagrams using relations: declarative, composable, and extensible diagram fragments that relax the concept of a UI component. Unlike a component, a relation does not have sole ownership over its children nor does it need to fully specify their layout. To render diagrams, Bluefish extends a traditional tree-based scenegraph to a compound graph that captures both hierarchical and adjacent relationships between nodes. To evaluate our system, we construct a diverse example gallery covering many domains including mathematics, physics, computer science, and even cooking. We show that Bluefish's relations are effective declarative primitives for diagrams. Bluefish is open source, and we aim to shape it into both a usable tool and a research platform.
- [402] arXiv:2307.06036 (replaced) [pdf, html, other]
-
Title: Information Rate-Harvested Power Tradeoff in THz SWIPT Systems Employing Resonant Tunnelling Diode-based EH CircuitsNikita Shanin, Simone Clochiatti, Kenneth M. Mayer, Laura Cottatellucci, Nils Weimann, Robert SchoberComments: 32 pages, 10 figures, submitted for a possible journal publicationSubjects: Information Theory (cs.IT)
In this paper, we study THz simultaneous wireless information and power transfer (SWIPT) systems. Since coherent information detection is challenging at THz frequencies and Schottky diodes may not be efficient for THz energy harvesting (EH) and information detection, we employ unipolar amplitude shift keying (ASK) modulation at the transmitter (TX) and a resonant tunnelling diode (RTD)-based EH circuit at the receiver (RX) to extract both information and power from the RX signal. We model the dependence of the instantaneous output power at the RX on the instantaneous received power by a non-linear piecewise function, whose parameters are adjusted to fit circuit simulation results. To determine the rate-power tradeoff in THz SWIPT systems, we derive the distribution of the TX signal that maximizes the mutual information between the TX and RX signals subject to constraints on the required average harvested power at the RX and the peak signal amplitude at the TX. Since the computational complexity of maximizing the mutual information may be too high for real-time THz SWIPT systems, for high and low required average harvested powers, we also obtain the suboptimal input signal distribution that maximizes the achievable information rate numerically and in closed form, respectively. Furthermore, based on the obtained results, we propose a suboptimal closed-form TX distribution which also achieves a desired harvested power at the RX. Our simulation results show that a lower reverse current flow and a higher breakdown voltage of the employed RTD are preferable when the input signal power at the RX is low and high, respectively. Finally, we demonstrate that for low and high received signal powers, the rate-power tradeoff of THz SWIPT systems is determined by the peak amplitude of the TX signal and the maximum instantaneous harvested power, respectively.
- [403] arXiv:2307.16734 (replaced) [pdf, html, other]
-
Title: Stochastic Filtering of Reaction Networks Partially Observed in Time SnapshotsJournal-ref: Journal of Computational Physics Volume 515, 15 October 2024, 113265Subjects: Numerical Analysis (math.NA); Probability (math.PR)
Stochastic reaction network models arise in intracellular chemical reactions, epidemiological models and other population process models, and are a class of continuous time Markov chains which have the nonnegative integer lattice as state space. We consider the problem of estimating the conditional probability distribution of a stochastic reaction network given exact partial state observations in time snapshots. We propose a particle filtering method called the targeting method. Our approach takes into account that the reaction counts in between two observation snapshots satisfy linear constraints and also uses inhomogeneous Poisson processes as proposals for the reaction counts to facilitate exact interpolation. We provide rigorous analysis as well as numerical examples to illustrate our method and compare it with other alternatives.
- [404] arXiv:2308.12112 (replaced) [pdf, html, other]
-
Title: Category Adaptation Meets Projected Distillation in Generalized Continual Category DiscoveryComments: Accepted for ECCV 2024Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Generalized Continual Category Discovery (GCCD) tackles learning from sequentially arriving, partially labeled datasets while uncovering new categories. Traditional methods depend on feature distillation to prevent forgetting the old knowledge. However, this strategy restricts the model's ability to adapt and effectively distinguish new categories. To address this, we introduce a novel technique integrating a learnable projector with feature distillation, thus enhancing model adaptability without sacrificing past knowledge. The resulting distribution shift of the previously learned categories is mitigated with the auxiliary category adaptation network. We demonstrate that while each component offers modest benefits individually, their combination - dubbed CAMP (Category Adaptation Meets Projected distillation) - significantly improves the balance between learning new information and retaining old. CAMP exhibits superior performance across several GCCD and Class Incremental Learning scenarios. The code is available at this https URL.
- [405] arXiv:2308.16884 (replaced) [pdf, html, other]
-
Title: The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language VariantsLucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian KhabsaComments: ACL 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.
- [406] arXiv:2309.08499 (replaced) [pdf, html, other]
-
Title: POCKET: Pruning Random Convolution Kernels for Time Series Classification from a Feature Selection PerspectiveJournal-ref: Knowledge-Based Systems,Volume 300,2024,112253Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In recent years, two competitive time series classification models, namely, ROCKET and MINIROCKET, have garnered considerable attention due to their low training cost and high accuracy. However, they rely on a large number of random 1-D convolutional kernels to comprehensively capture features, which is incompatible with resource-constrained devices. Despite the development of heuristic algorithms designed to recognize and prune redundant kernels, the inherent time-consuming nature of evolutionary algorithms hinders efficient evaluation. To efficiently prune models, this paper eliminates feature groups contributing minimally to the classifier, thereby discarding the associated random kernels without direct evaluation. To this end, we incorporate both group-level ($l_{2,1}$-norm) and element-level ($l_2$-norm) regularizations to the classifier, formulating the pruning challenge as a group elastic net classification problem. An ADMM-based algorithm is initially introduced to solve the problem, but it is computationally intensive. Building on the ADMM-based algorithm, we then propose our core algorithm, POCKET, which significantly speeds up the process by dividing the task into two sequential stages. In Stage 1, POCKET utilizes dynamically varying penalties to efficiently achieve group sparsity within the classifier, removing features associated with zero weights and their corresponding kernels. In Stage 2, the remaining kernels and features are used to refit a $l_2$-regularized classifier for enhanced performance. Experimental results on diverse time series datasets show that POCKET prunes up to 60% of kernels without a significant reduction in accuracy and performs 11$\times$ faster than its counterparts. Our code is publicly available at this https URL.
- [407] arXiv:2309.09530 (replaced) [pdf, html, other]
-
Title: Adapting Large Language Models to Domains via Reading ComprehensionComments: ICLR 2024 ConferenceSubjects: Computation and Language (cs.CL)
We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taken inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data are available at this https URL.
- [408] arXiv:2309.10527 (replaced) [pdf, html, other]
-
Title: SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D RepresentationsXiangchao Yan, Runjian Chen, Bo Zhang, Hancheng Ye, Renqiu Xia, Jiakang Yuan, Hongbin Zhou, Xinyu Cai, Botian Shi, Wenqi Shao, Ping Luo, Yu Qiao, Tao Chen, Junchi YanComments: 15 pages, 8 figures, Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Annotating 3D LiDAR point clouds for perception tasks is fundamental for many applications e.g., autonomous driving, yet it still remains notoriously labor-intensive. Pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets as well as tasks. In this paper, we propose SPOT, namely Scalable Pre-training via Occupancy prediction for learning Transferable 3D representations under such a label-efficient fine-tuning paradigm. SPOT achieves effectiveness on various public datasets with different downstream tasks, showcasing its general representation power, cross-domain robustness and data scalability which are three key factors for real-world application. Specifically, we both theoretically and empirically show, for the first time, that general representations learning can be achieved through the task of occupancy prediction. Then, to address the domain gap caused by different LiDAR sensors and annotation methods, we develop a beam re-sampling technique for point cloud augmentation combined with class-balancing strategy. Furthermore, scalable pre-training is observed, that is, the downstream performance across all the experiments gets better with more pre-training data. Additionally, such pre-training strategy also remains compatible with unlabeled data. The hope is that our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
- [409] arXiv:2309.12865 (replaced) [pdf, html, other]
-
Title: Bridging Sensor Gaps via Attention Gated Tuning for Hyperspectral Image ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Data-hungry HSI classification methods require high-quality labeled HSIs, which are often costly to obtain. This characteristic limits the performance potential of data-driven methods when dealing with limited annotated samples. Bridging the domain gap between data acquired from different sensors allows us to utilize abundant labeled data across sensors to break this bottleneck. In this paper, we propose a novel Attention-Gated Tuning (AGT) strategy and a triplet-structured transformer model, Tri-Former, to address this issue. The AGT strategy serves as a bridge, allowing us to leverage existing labeled HSI datasets, even RGB datasets to enhance the performance on new HSI datasets with limited samples. Instead of inserting additional parameters inside the basic model, we train a lightweight auxiliary branch that takes intermediate features as input from the basic model and makes predictions. The proposed AGT resolves conflicts between heterogeneous and even cross-modal data by suppressing the disturbing information and enhances the useful information through a soft gate. Additionally, we introduce Tri-Former, a triplet-structured transformer with a spectral-spatial separation design that enhances parameter utilization and computational efficiency, enabling easier and flexible fine-tuning. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate the proposed Tri-Former achieves better performance compared to several state-of-the-art methods. Homologous, heterologous and cross-modal tuning experiments verified the effectiveness of the proposed AGT. Code has been released at: \href{this https URL}{this https URL}.
- [410] arXiv:2309.16228 (replaced) [pdf, other]
-
Title: Brand Network Booster: A new system for improving brand connectivityJournal-ref: Computers & Industrial Engineering 194, 110389 (2024)Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Software Engineering (cs.SE); Physics and Society (physics.soc-ph)
This paper presents a new decision support system offered for an in-depth analysis of semantic networks, which can provide insights for a better exploration of a brand's image and the improvement of its connectivity. In terms of network analysis, we show that this goal is achieved by solving an extended version of the Maximum Betweenness Improvement problem, which includes the possibility of considering adversarial nodes, constrained budgets, and weighted networks - where connectivity improvement can be obtained by adding links or increasing the weight of existing connections. Our contribution includes a new algorithmic framework and the integration of this framework into a software system called Brand Network Booster (BNB), which supports brand connectivity evaluation and improvement. We present this new system together with three case studies, and we also discuss its performance. Our tool and approach are valuable to both network scholars and in facilitating strategic decision-making processes for marketing and communication managers across various sectors, be it public or private.
- [411] arXiv:2309.16571 (replaced) [pdf, other]
-
Title: Review of Machine Learning Methods for Additive Manufacturing of Functionally Graded MaterialsMohammad Karimzadeh, Deekshith Basvoju, Aleksandar Vakanski, Indrajit Charit, Fei Xu, Xinchang ZhangComments: 23 pagesJournal-ref: Materials. 2024; 17(15):3673Subjects: Machine Learning (cs.LG)
Additive Manufacturing (AM) is a transformative manufacturing technology enabling direct fabrication of complex parts layer-be-layer from 3D modeling data. Among AM applications, the fabrication of Functionally Graded Materials (FGMs) has significant importance due to the potential to enhance component performance across several industries. FGMs are manufactured with a gradient composition transition between dissimilar materials, enabling the design of new materials with location-dependent mechanical and physical properties. This study presents a comprehensive review of published literature pertaining to the implementation of Machine Learning (ML) techniques in AM, with an emphasis on ML-based methods for optimizing FGMs fabrication processes. Through an extensive survey of the literature, this review article explores the role of ML in addressing the inherent challenges in FGMs fabrication and encompasses parameter optimization, defect detection, and real-time monitoring. The article also provides a discussion of future research directions and challenges in employing ML-based methods in AM fabrication of FGMs.
- [412] arXiv:2310.05483 (replaced) [pdf, html, other]
-
Title: HarmonicNeRF: Geometry-Informed Synthetic View Augmentation for 3D Scene Reconstruction in Driving ScenariosComments: Accepted by ACM MM 2024, project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
In the realm of autonomous driving, achieving precise 3D reconstruction of the driving environment is critical for ensuring safety and effective navigation. Neural Radiance Fields (NeRF) have shown promise in creating highly detailed and accurate models of complex environments. However, the application of NeRF in autonomous driving scenarios encounters several challenges, primarily due to the sparsity of viewpoints inherent in camera trajectories and the constraints on data collection in unbounded outdoor scenes, which typically occur along predetermined paths. This limitation not only reduces the available scene information but also poses significant challenges for NeRF training, as the sparse and path-distributed observational data leads to under-representation of the scene's geometry. In this paper, we introduce HarmonicNeRF, a novel approach for outdoor self-supervised monocular scene reconstruction. HarmonicNeRF capitalizes on the strengths of NeRF and enhances surface reconstruction accuracy by augmenting the input space with geometry-informed synthetic views. This is achieved through the application of spherical harmonics to generate novel radiance values, taking into careful consideration the color observations from the limited available real-world views. Additionally, our method incorporates proxy geometry to effectively manage occlusion, generating radiance pseudo-labels that circumvent the limitations of traditional image-warping techniques, which often fail in sparse data conditions typical of autonomous driving environments. Extensive experiments conducted on the KITTI, Argoverse, and NuScenes datasets demonstrate our approach establishes new benchmarks in synthesizing novel depth views and reconstructing scenes, significantly outperforming existing methods. Project page: this https URL
- [413] arXiv:2310.05989 (replaced) [pdf, html, other]
-
Title: QE-BEV: Query Evolution for Bird's Eye View Object Detection in Varied ContextsComments: Accepted by ACM MM 2024, project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D object detection plays a pivotal role in autonomous driving and robotics, demanding precise interpretation of Bird's Eye View (BEV) images. The dynamic nature of real-world environments necessitates the use of dynamic query mechanisms in 3D object detection to adaptively capture and process the complex spatio-temporal relationships present in these scenes. However, prior implementations of dynamic queries have often faced difficulties in effectively leveraging these relationships, particularly when it comes to integrating temporal information in a computationally efficient manner. Addressing this limitation, we introduce a framework utilizing dynamic query evolution strategy, harnesses K-means clustering and Top-K attention mechanisms for refined spatio-temporal data processing. By dynamically segmenting the BEV space and prioritizing key features through Top-K attention, our model achieves a real-time, focused analysis of pertinent scene elements. Our extensive evaluation on the nuScenes and Waymo dataset showcases a marked improvement in detection accuracy, setting a new benchmark in the domain of query-based BEV object detection. Our dynamic query evolution strategy has the potential to push the boundaries of current BEV methods with enhanced adaptability and computational efficiency. Project page: this https URL
- [414] arXiv:2310.06491 (replaced) [pdf, html, other]
-
Title: Bridging Items and Language: A Transition Paradigm for Large Language Model-Based RecommendationComments: Accepted by KDD 2024Subjects: Information Retrieval (cs.IR)
Harnessing Large Language Models (LLMs) for recommendation is rapidly emerging, which relies on two fundamental steps to bridge the recommendation item space and the language space: 1) item indexing utilizes identifiers to represent items in the language space, and 2) generation grounding associates LLMs' generated token sequences to in-corpus items. However, previous methods exhibit inherent limitations in the two steps. Existing ID-based identifiers (e.g., numeric IDs) and description-based identifiers (e.g., titles) either lose semantics or lack adequate distinctiveness. Moreover, prior generation grounding methods might generate invalid identifiers, thus misaligning with in-corpus items. To address these issues, we propose a novel Transition paradigm for LLM-based Recommender (named TransRec) to bridge items and language. Specifically, TransRec presents multi-facet identifiers, which simultaneously incorporate ID, title, and attribute for item indexing to pursue both distinctiveness and semantics. Additionally, we introduce a specialized data structure for TransRec to ensure generating valid identifiers only and utilize substring indexing to encourage LLMs to generate from any position of identifiers. Lastly, TransRec presents an aggregated grounding module to leverage generated multi-facet identifiers to rank in-corpus items efficiently. We instantiate TransRec on two backbone models, BART-large and LLaMA-7B. Extensive results on three real-world datasets under diverse settings validate the superiority of TransRec.
- [415] arXiv:2310.10985 (replaced) [pdf, other]
-
Title: Computational synthesis of locomotive soft robots by topology optimizationHiroki Kobayashi, Farzad Gholami, S. Macrae Montgomery, Masato Tanaka, Liang Yue, Changyoung Yuhn, Yuki Sato, Atsushi Kawamoto, H. Jerry Qi, Tsuyoshi NomuraComments: 36 total pages (27 pages, 9 supplementary pages), 5 Figures, 9 Supplementary figures. 1 Supplementary tableJournal-ref: Sci. Adv. 10, eadn6129 (2024)Subjects: Computational Engineering, Finance, and Science (cs.CE)
Locomotive soft robots (SoRos) have gained prominence due to their adaptability. Traditional locomotive SoRo design is based on limb structures inspired by biological organisms and requires human intervention. Evolutionary robotics, designed using evolutionary algorithms (EAs), have shown potential for automatic design. However, EA-based methods face the challenge of high computational cost when considering multiphysics in locomotion, including materials, actuations, and interactions with environments. Here, we present a design approach for pneumatic SoRos that integrates gradient-based topology optimization with multiphysics material point method (MPM) simulations. This approach starts with a simple initial shape (a cube with a central cavity). The topology optimization with MPM then automatically and iteratively designs the SoRo shape. We design two SoRos, one for walking and one for climbing. These SoRos are 3D printed and exhibit the same locomotion features as in the simulations. This study presents an efficient strategy for designing SoRos, demonstrating that a purely mathematical process can produce limb-like structures seen in biological organisms.
- [416] arXiv:2310.20498 (replaced) [pdf, html, other]
-
Title: Generative Learning of Continuous Data by Tensor NetworksAlex Meiburg, Jing Chen, Jacob Miller, Raphaëlle Tihon, Guillaume Rabusseau, Alejandro Perdomo-OrtizComments: 21 pages, 15 figuresSubjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Beyond their origin in modeling many-body quantum systems, tensor networks have emerged as a promising class of models for solving machine learning problems, notably in unsupervised generative learning. While possessing many desirable features arising from their quantum-inspired nature, tensor network generative models have previously been largely restricted to binary or categorical data, limiting their utility in real-world modeling problems. We overcome this by introducing a new family of tensor network generative models for continuous data, which are capable of learning from distributions containing continuous random variables. We develop our method in the setting of matrix product states, first deriving a universal expressivity theorem proving the ability of this model family to approximate any reasonably smooth probability density function with arbitrary precision. We then benchmark the performance of this model on several synthetic and real-world datasets, finding that the model learns and generalizes well on distributions of continuous and discrete variables. We develop methods for modeling different data domains, and introduce a trainable compression layer which is found to increase model performance given limited memory or computational resources. Overall, our methods give important theoretical and empirical evidence of the efficacy of quantum-inspired methods for the rapidly growing field of generative learning.
- [417] arXiv:2311.02879 (replaced) [pdf, other]
-
Title: Exploring Active Learning in Meta-Learning: Enhancing Context Set LabelingComments: Accepted to ECCV2024Subjects: Machine Learning (cs.LG)
Most meta-learning methods assume that the (very small) context set used to establish a new task at test time is passively provided. In some settings, however, it is feasible to actively select which points to label; the potential gain from a careful choice is substantial, but the setting requires major differences from typical active learning setups. We clarify the ways in which active meta-learning can be used to label a context set, depending on which parts of the meta-learning process use active learning. Within this framework, we propose a natural algorithm based on fitting Gaussian mixtures for selecting which points to label; though simple, the algorithm also has theoretical motivation. The proposed algorithm outperforms state-of-the-art active learning methods when used with various meta-learning algorithms across several benchmark datasets.
- [418] arXiv:2311.05869 (replaced) [pdf, other]
-
Title: Practical one-shot data-driven design of fractional-order PID controller: Fictitious reference signal approachComments: Published in ISA Transactions (this https URL)Subjects: Systems and Control (eess.SY)
This study proposes a one-shot data-driven tuning method for a fractional-order proportional-integral-derivative (FOPID) controller. The proposed method tunes the FOPID controller in the model-reference control formulation. A loss function is defined to evaluate the match between a given reference model and the closed-loop response while explicitly considering the closed-loop stability. A loss function value is based on the fictitious reference signal computed using the input/output data. Model matching is achieved via loss function minimization. The proposed method is simple and practical: it needs only one-shot input/output data of a plant (no plant model required), considers the bounded-input bounded-output stability of the closed-loop system from a bounded reference input to a bounded output, and automatically determines the appropriate parameter value via optimization. Numerical simulations show that the proposed approach facilitates good control performance, and destabilization can be avoided even if perfect model matching is unachievable.
- [419] arXiv:2311.06851 (replaced) [pdf, html, other]
-
Title: Automatic Textual Normalization for Hate Speech DetectionComments: 2023 International Conference on Intelligent Systems Design and Applications (ISDA2023)Journal-ref: Intelligent Systems Design and Applications. Lecture Notes in Networks and Systems, vol 1049 (ISDA 2023) 1-12Subjects: Computation and Language (cs.CL)
Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization, involving the creation of manual rules or the implementation of multi-staged deep learning frameworks, which necessitate extensive efforts to craft intricate rules. In contrast, our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization, comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for textual normalization, our results reveal that the accuracy achieved falls slightly short of 70%. Nevertheless, textual normalization enhances the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve the performance of complex NLP tasks. Our dataset is accessible for research purposes.
- [420] arXiv:2311.07052 (replaced) [pdf, html, other]
-
Title: Towards the Law of Capacity Gap in Distilling Language ModelsComments: 32 pages, 10 figures, 15 tables, work in progress. Code and checkpoints are available at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM to a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the \textit{curse} of capacity gap, suggests that a larger teacher does not necessarily result in a superior student compared to one distilled from a smaller teacher. In other words, there is likely an optimal teacher yielding the best student along the scaling course of the teacher. However, the curse of capacity gap can not be tackled without notable compute overhead, as indicated in previous studies. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, as it is an impossible triangle to distill an expected student from an optimal teacher student with small compute overhead. Fortunately, the impossible triangle can fortunately be possible provided an inducted \textit{law} of capacity gap. In this paper, we take the spirits of scaling law and reveal that the optimal teacher scale almost consistently follows a linear scaling with the student scale across different model architectures and data scales. The law later guides us to distil a 3B student LM (termed \textsc{MiniMA}) from LLaMA2-7B. \textsc{MiniMA} is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.
- [421] arXiv:2311.13976 (replaced) [pdf, html, other]
-
Title: Low Latency Instance Segmentation by Continuous Clustering for LiDAR SensorsComments: Accompanying Video: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Low-latency instance segmentation of LiDAR point clouds is crucial in real-world applications because it serves as an initial and frequently-used building block in a robot's perception pipeline, where every task adds further delay. Particularly in dynamic environments, this total delay can result in significant positional offsets of dynamic objects, as seen in highway scenarios. To address this issue, we employ a new technique, which we call continuous clustering. Unlike most existing clustering approaches, which use a full revolution of the LiDAR sensor, we process the data stream in a continuous and seamless fashion. Our approach does not rely on the concept of complete or partial sensor rotations with multiple discrete range images; instead, it views the range image as a single and infinitely horizontally growing entity. Each new column of this continuous range image is processed as soon it is available. Obstacle points are clustered to existing instances in real-time and it is checked at a high-frequency which instances are completed in order to publish them without waiting for the completion of the revolution or some other integration period. In the case of rotating sensors, no problematic discontinuities between the points of the end and the start of a scan are observed. In this work we describe the two-layered data structure and the corresponding algorithm for continuous clustering. It is able to achieve an average latency of just 5 ms with respect to the latest timestamp of all points in the cluster. We are publishing the source code at this https URL.
- [422] arXiv:2312.02878 (replaced) [pdf, html, other]
-
Title: Towards More Practical Group Activity Detection: A New Benchmark and ModelComments: Accepted to ECCV 2024, Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Café. Unlike existing datasets, Café is constructed primarily for GAD and presents more practical scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Café, where it outperformed previous work in terms of both accuracy and inference speed.
- [423] arXiv:2312.03853 (replaced) [pdf, html, other]
-
Title: Dr. Jekyll and Mr. Hyde: Two Faces of LLMsSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Gemini (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then follow a role-play style to elicit prohibited responses. Using personas, we show that prohibited responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Gemini. We also introduce several ways of activating such adversarial personas, which show that both chatbots are vulnerable to this kind of attack. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.
- [424] arXiv:2312.07423 (replaced) [pdf, html, other]
-
Title: Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB CamerasComments: Project page: this https URL 8 pages, 2 tables and 8 figures; presented at Computer Vision and Pattern Recognition (CVPR) 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present the first approach to render highly realistic free-viewpoint videos of a human actor in general apparel, from sparse multi-view recording to display, in real-time at an unprecedented 4K resolution. At inference, our method only requires four camera views of the moving actor and the respective 3D skeletal pose. It handles actors in wide clothing, and reproduces even fine-scale dynamic detail, e.g. clothing wrinkles, face expressions, and hand gestures. At training time, our learning-based approach expects dense multi-view video and a rigged static surface scan of the actor. Our method comprises three main stages. Stage 1 is a skeleton-driven neural approach for high-quality capture of the detailed dynamic mesh geometry. Stage 2 is a novel solution to create a view-dependent texture using four test-time camera views as input. Finally, stage 3 comprises a new image-based refinement network rendering the final 4K image given the output from the previous stages. Our approach establishes a new benchmark for real-time rendering resolution and quality using sparse input camera views, unlocking possibilities for immersive telepresence.
- [425] arXiv:2312.07537 (replaced) [pdf, html, other]
-
Title: FreeInit: Bridging Initialization Gap in Video Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that attributes to the unsatisfactory inference quality.Our key findings are: 1) the spatial-temporal frequency distribution of the initial noise at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves temporal consistency of videos generated by diffusion models. Through iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
- [426] arXiv:2312.12111 (replaced) [pdf, html, other]
-
Title: General-Purpose User Modeling with Behavioral Logs: A Snapchat Case StudyQixiang Fang, Zhihan Zhou, Francesco Barbieri, Yozen Liu, Leonardo Neves, Dong Nguyen, Daniel L. Oberski, Maarten W. Bos, Ron DotschComments: SIGIR 2024Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Learning general-purpose user representations based on user behavioral logs is an increasingly popular user modeling approach. It benefits from easily available, privacy-friendly yet expressive data, and does not require extensive re-tuning of the upstream user model for different downstream tasks. While this approach has shown promise in search engines and e-commerce applications, its fit for instant messaging platforms, a cornerstone of modern digital communication, remains largely uncharted. We explore this research gap using Snapchat data as a case study. Specifically, we implement a Transformer-based user model with customized training objectives and show that the model can produce high-quality user representations across a broad range of evaluation tasks, among which we introduce three new downstream tasks that concern pivotal topics in user research: user safety, engagement and churn. We also tackle the challenge of efficient extrapolation of long sequences at inference time, by applying a novel positional encoding method.
- [427] arXiv:2312.13317 (replaced) [pdf, html, other]
-
Title: Deep Hybrid Camera Deblurring for Smartphone CamerasComments: SIGGRAPH 2024, Project page: this http URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Mobile cameras, despite their significant advancements, still have difficulty in low-light imaging due to compact sensors and lenses, leading to longer exposures and motion blur. Traditional blind deconvolution methods and learning-based deblurring methods can be potential solutions to remove blur. However, achieving practical performance still remains a challenge. To address this, we propose a learning-based deblurring framework for smartphones, utilizing wide and ultra-wide cameras as a hybrid camera system. We simultaneously capture a long-exposure wide image and short-exposure burst ultra-wide images, and utilize the burst images to deblur the wide image. To fully exploit burst ultra-wide images, we present HCDeblur, a practical deblurring framework that includes novel deblurring networks, HC-DNet and HC-FNet. HC-DNet utilizes motion information extracted from burst images to deblur a wide image, and HC-FNet leverages burst images as reference images to further enhance a deblurred output. For training and evaluating the proposed method, we introduce the HCBlur dataset, which consists of synthetic and real-world datasets. Our experiments demonstrate that HCDeblur achieves state-of-the-art deblurring quality. Code and datasets are available at this https URL.
- [428] arXiv:2312.14428 (replaced) [pdf, html, other]
-
Title: A Unified Industrial Large Knowledge Model Framework in Industry 4.0 and Smart ManufacturingComments: The paper submitted to the International Journal of AI for Materials and Design (IJAMD) has been accepted and is now available onlineSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The recent emergence of large language models (LLMs) demonstrates the potential for artificial general intelligence, revealing new opportunities in Industry 4.0 and smart manufacturing. However, a notable gap exists in applying these LLMs in industry, primarily due to their training on general knowledge rather than domain-specific knowledge. Such specialized domain knowledge is vital for effectively addressing the complex needs of industrial applications. To bridge this gap, this paper proposes a unified industrial large knowledge model (ILKM) framework, emphasizing its potential to revolutionize future industries. In addition, ILKMs and LLMs are compared from eight perspectives. Finally, the "6S Principle" is proposed as the guideline for ILKM development, and several potential opportunities are highlighted for ILKM deployment in Industry 4.0 and smart manufacturing.
- [429] arXiv:2401.00659 (replaced) [pdf, html, other]
-
Title: Cost-effective Datasets Discovery: When Distinctiveness MattersSubjects: Databases (cs.DB)
In this paper, we aim to find a set of datasets that can enrich a base dataset by introducing the maximum number of distinct tuples (i.e., maximizing distinctiveness), driven by a user's query set and a budget limit. We prove this problem to be NP-hard and, subsequently, we develop a greedy algorithm that attains an approximation ratio of (1-1/e)/2. However, this algorithm lacks efficiency and scalability due to its frequent computation of the exact distinctiveness marginal gain of any candidate dataset for selection, which requires scanning through every tuple in candidate datasets and thus is unaffordable in practice. To overcome this limitation, we propose an efficient machine learning (ML)-based method for estimating the distinctiveness marginal gain of any candidate dataset that effectively eliminates the need to test each tuple individually. Estimating the distinctiveness marginal gain of a dataset involves estimating the number of distinct tuples in the tuple sets returned by each query in a query set across multiple datasets. This can be viewed as the cardinality estimation for a query set on a set of datasets, and the proposed method is the first to tackle this cardinality estimation problem. This is a significant advancement over prior methods limited to single-query cardinality estimation on a single dataset that fall short in identifying overlaps among tuple sets returned by each query in a query set across multiple datasets. Extensive experiments using five real-world data pools demonstrate that our algorithm utilizing ML-based distinctiveness estimation outperforms all relevant baselines in terms of both effectiveness and efficiency.
- [430] arXiv:2401.00773 (replaced) [pdf, html, other]
-
Title: Unsupervised Outlier Detection using Random Subspace and Subsampling Ensembles of Dirichlet Process MixturesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Probabilistic mixture models are recognized as effective tools for unsupervised outlier detection owing to their interpretability and global characteristics. Among these, Dirichlet process mixture models stand out as a strong alternative to conventional finite mixture models for both clustering and outlier detection tasks. Unlike finite mixture models, Dirichlet process mixtures are infinite mixture models that automatically determine the number of mixture components based on the data. Despite their advantages, the adoption of Dirichlet process mixture models for unsupervised outlier detection has been limited by challenges related to computational inefficiency and sensitivity to outliers in the construction of outlier detectors. Additionally, Dirichlet process Gaussian mixtures struggle to effectively model non-Gaussian data with discrete or binary features. To address these challenges, we propose a novel outlier detection method that utilizes ensembles of Dirichlet process Gaussian mixtures. This unsupervised algorithm employs random subspace and subsampling ensembles to ensure efficient computation and improve the robustness of the outlier detector. The ensemble approach further improves the suitability of the proposed method for detecting outliers in non-Gaussian data. Furthermore, our method uses variational inference for Dirichlet process mixtures, which ensures both efficient and rapid computation. Empirical analyses using benchmark datasets demonstrate that our method outperforms existing approaches in unsupervised outlier detection.
- [431] arXiv:2401.00947 (replaced) [pdf, html, other]
-
Title: On SAT information content, its polynomial-time solvability and fixed code algorithmsComments: 16 pages, 1 table, 0 figures, new content, rewriting arguments, corrected typosSubjects: Computational Complexity (cs.CC); Information Theory (cs.IT)
The amount of information in satisfiability problem (SAT) is considered. SAT can be polynomial-time solvable when the solving algorithm holds an exponential amount of information. It is also established that SAT Kolmogorov complexity is constant. It is argued that the amount of information in SAT grows at least exponentially with the size of the input instance. The amount of information in SAT is compared with the amount of information in the fixed code algorithms and generated over runtime.
- [432] arXiv:2401.04280 (replaced) [pdf, html, other]
-
Title: Predicting the structure of dynamic graphsSubjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Many aspects of graphs have been studied in depth. However, forecasting the structure of a graph at future time steps incorporating unseen, new nodes and edges has not gained much attention. In this paper, we present such an approach. Using a time series of graphs, we forecast graphs at future time steps. We use time series forecasting methods to predict the node degree at future time points and combine these forecasts with flux balance analysis -- a linear programming method used in biochemistry -- to obtain the structure of future graphs. We evaluate this approach using synthetic and real-world datasets and demonstrate its utility and applicability.
- [433] arXiv:2401.07575 (replaced) [pdf, html, other]
-
Title: Cascaded Cross-Modal Transformer for Audio-Textual ClassificationComments: Accepted for publication in Artificial Intelligence ReviewSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech classification tasks often require powerful language understanding models to grasp useful features, which becomes problematic when limited training data is available. To attain superior classification performance, we propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models and translating the transcripts into different languages via pretrained translation models. We thus obtain an audio-textual (multimodal) representation for each data sample. Subsequently, we combine language-specific Bidirectional Encoder Representations from Transformers (BERT) with Wav2Vec2.0 audio features via a novel cascaded cross-modal transformer (CCMT). Our model is based on two cascaded transformer blocks. The first one combines text-specific features from distinct languages, while the second one combines acoustic features with multilingual features previously learned by the first transformer block. We employed our system in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge. CCMT was declared the winning solution, obtaining an unweighted average recall (UAR) of 65.41% and 85.87% for complaint and request detection, respectively. Moreover, we applied our framework on the Speech Commands v2 and HarperValleyBank dialog data sets, surpassing previous studies reporting results on these benchmarks. Our code is freely available for download at: this https URL.
- [434] arXiv:2401.13429 (replaced) [pdf, html, other]
-
Title: Detection of Correlated Random VectorsComments: 42 pagesSubjects: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
In this paper, we investigate the problem of deciding whether two standard normal random vectors $\mathsf{X}\in\mathbb{R}^{n}$ and $\mathsf{Y}\in\mathbb{R}^{n}$ are correlated or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these vectors are statistically independent, while under the alternative, $\mathsf{X}$ and a randomly and uniformly permuted version of $\mathsf{Y}$, are correlated with correlation $\rho$. We analyze the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$ and $\rho$. To derive our information-theoretic lower bounds, we develop a novel technique for evaluating the second moment of the likelihood ratio using an orthogonal polynomials expansion, which among other things, reveals a surprising connection to integer partition functions. We also study a multi-dimensional generalization of the above setting, where rather than two vectors we observe two databases/matrices, and furthermore allow for partial correlations between these two.
- [435] arXiv:2401.14351 (replaced) [pdf, html, other]
-
Title: ServerlessLLM: Low-Latency Serverless Inference for Large Language ModelsComments: 18th USENIX Symposium on Operating Systems Design and ImplementationSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) \emph{fast multi-tier checkpoint loading}, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) \emph{efficient live migration of LLM inference}, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) \emph{startup-time-optimized model scheduling}, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10 - 200X across various LLM inference workloads.
- [436] arXiv:2401.15867 (replaced) [pdf, html, other]
-
Title: An Information Aggregation OperatorSubjects: Information Theory (cs.IT)
This study explores a new mathematical operator, symbolized as $\cupplus$, for information aggregation, aimed at enhancing traditional methods by directly amalgamating probability distributions. This operator facilitates the combination of probability densities, contributing a nuanced approach to probabilistic analysis. We apply this operator to a personalized incentive scenario, illustrating its potential in a practical context. The paper's primary contribution lies in introducing this operator and elucidating its elegant mathematical properties. This exploratory work marks a step forward in the field of information fusion and probabilistic reasoning.
- [437] arXiv:2402.00300 (replaced) [pdf, html, other]
-
Title: Self-supervised learning of video representations from a child's perspectiveComments: Published as a conference paper at CogSci 2024; code & models available from this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Children learn powerful internal models of the world around them from a few years of egocentric visual experience. Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases? Recent advances in collecting large-scale, longitudinal, developmentally realistic video datasets and generic self-supervised learning (SSL) algorithms are allowing us to begin to tackle this nature vs. nurture question. However, existing work typically focuses on image-based SSL algorithms and visual capabilities that can be learned from static images (e.g. object recognition), thus ignoring temporal aspects of the world. To close this gap, here we train self-supervised video models on longitudinal, egocentric headcam recordings collected from a child over a two year period in their early development (6-31 months). The resulting models are highly effective at facilitating the learning of action concepts from a small number of labeled examples; they have favorable data size scaling properties; and they display emergent video interpolation capabilities. Video models also learn more robust object representations than image-based models trained with the exact same data. These results suggest that important temporal aspects of a child's internal model of the world may be learnable from their visual experience using highly generic learning algorithms and without strong inductive biases.
- [438] arXiv:2402.01279 (replaced) [pdf, html, other]
-
Title: An Improved Viterbi Algorithm for a Class of Optimal Binary Convolutional CodesComments: accepted for ISIT 2024Subjects: Information Theory (cs.IT)
The most famous error-decoding algorithm for convolutional codes is the Viterbi algorithm. In this paper, we present a new reduced complexity version of this algorithm which can be applied to a class of binary convolutional codes with optimum column distances called k-partial simplex convolutional codes.
- [439] arXiv:2402.02750 (replaced) [pdf, html, other]
-
Title: KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV CacheZirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia HuComments: ICML2024Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. Additionally, the loading of the KV cache causes the computational core to be idle, which limits the inference speed. A straightforward and effective solution to reduce KV cache size is quantization, which decreases the total bytes taken by KV cache. However, there is a lack of in-depth studies that explore the element distribution of KV cache to understand the hardness and limitation of KV cache quantization. To fill the gap, we conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm named KIVI. With hardware-friendly implementation, KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (including model weight). This reduction in memory usage enables up to $\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim 3.47\times}$ throughput on real LLM inference workload. The source code is available at this https URL.
- [440] arXiv:2402.05981 (replaced) [pdf, html, other]
-
Title: Anatomizing Deep Learning Inference in Web BrowsersQipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, Xuanzhe LiuComments: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM)Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference performs directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., mainly focusing on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to such latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology
- [441] arXiv:2402.07386 (replaced) [pdf, html, other]
-
Title: Chain-of-Layer: Iteratively Prompting Large Language Models for Taxonomy Induction from Limited ExamplesJournal-ref: Published in CIKM 2024Subjects: Computation and Language (cs.CL)
Automatic taxonomy induction is crucial for web search, recommendation systems, and question answering. Manual curation of taxonomies is expensive in terms of human effort, making automatic taxonomy construction highly desirable. In this work, we introduce Chain-of-Layer which is an in-context learning framework designed to induct taxonomies from a given set of entities. Chain-of-Layer breaks down the task into selecting relevant candidate entities in each layer and gradually building the taxonomy from top to bottom. To minimize errors, we introduce the Ensemble-based Ranking Filter to reduce the hallucinated content generated at each iteration. Through extensive experiments, we demonstrate that Chain-of-Layer achieves state-of-the-art performance on four real-world benchmarks.
- [442] arXiv:2402.09835 (replaced) [pdf, html, other]
-
Title: Parameterized Algorithms for Steiner Forest in Bounded Width GraphsSubjects: Data Structures and Algorithms (cs.DS)
In this paper we reassess the parameterized complexity and approximability of the well-studied Steiner Forest problem in several graph classes of bounded width. The problem takes an edge-weighted graph and pairs of vertices as input, and the aim is to find a minimum cost subgraph in which each given vertex pair lies in the same connected component. It is known that this problem is APX-hard in general, and NP-hard on graphs of treewidth 3, treedepth 4, and feedback vertex set size 2. However, Bateni, Hajiaghayi and Marx [JACM, 2011] gave an approximation scheme with a runtime of $n^{O(\frac{k^2}{\varepsilon})}$ on graphs of treewidth $k$. Our main result is a much faster efficient parameterized approximation scheme (EPAS) with a runtime of $2^{O(\frac{k^2}{\varepsilon} \log \frac{k^2}{\varepsilon})} \cdot n^{O(1)}$. If $k$ instead is the vertex cover number of the input graph, we show how to compute the optimum solution in $2^{O(k \log k)} \cdot n^{O(1)}$ time, and we also prove that this runtime dependence on $k$ is asymptotically best possible, under ETH. Furthermore, if $k$ is the size of a feedback edge set, then we obtain a faster $2^{O(k)} \cdot n^{O(1)}$ time algorithm, which again cannot be improved under ETH.
- [443] arXiv:2402.10626 (replaced) [pdf, html, other]
-
Title: Robust Beamforming for RIS-aided Communications: Gradient-based Manifold Meta LearningFenghao Zhu, Xinquan Wang, Chongwen Huang, Zhaohui Yang, Xiaoming Chen, Ahmed Alhammadi, Zhaoyang Zhang, Chau Yuen, Mérouane DebbahComments: 12 pages, 13 figures. Accepted by IEEE Transactions on Wireless Communications 2024Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Reconfigurable intelligent surface (RIS) has become a promising technology to realize the programmable wireless environment via steering the incident signal in fully customizable ways. However, a major challenge in RIS-aided communication systems is the simultaneous design of the precoding matrix at the base station (BS) and the phase shifting matrix of the RIS elements. This is mainly attributed to the highly non-convex optimization space of variables at both the BS and the RIS, and the diversity of communication environments. Generally, traditional optimization methods for this problem suffer from the high complexity, while existing deep learning based methods are lack of robustness in various scenarios. To address these issues, we introduce a gradient-based manifold meta learning method (GMML), which works without pre-training and has strong robustness for RIS-aided communications. Specifically, the proposed method fuses meta learning and manifold learning to improve the overall spectral efficiency, and reduce the overhead of the high-dimensional signal process. Unlike traditional deep learning based methods which directly take channel state information as input, GMML feeds the gradients of the precoding matrix and phase shifting matrix into neural networks. Coherently, we design a differential regulator to constrain the phase shifting matrix of the RIS. Numerical results show that the proposed GMML can improve the spectral efficiency by up to 7.31\%, and speed up the convergence by 23 times faster compared to traditional approaches. Moreover, they also demonstrate remarkable robustness and adaptability in dynamic settings.
- [444] arXiv:2402.10885 (replaced) [pdf, html, other]
-
Title: 3D Diffuser Actor: Policy Diffusion with 3D Scene RepresentationsComments: First two authors contributed equallySubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene, a language instruction and proprioception to predict the noise in noised 3D robot pose trajectories. 3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it improves over the current SOTA by a 9% relative increase. It also learns to control a robot manipulator in the real world from a handful of demonstrations. Through thorough comparisons with the current SOTA policies and ablations of our model, we show 3D Diffuser Actor's design choices dramatically outperform 2D representations, regression and classification objectives, absolute attentions, and holistic non-tokenized 3D scene embeddings.
- [445] arXiv:2402.11933 (replaced) [pdf, html, other]
-
Title: SLADE: Detecting Dynamic Anomalies in Edge Streams without Labels via Self-Supervised LearningComments: 12 pages, 4 figures, To Appear in KDD 2024Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
To detect anomalies in real-world graphs, such as social, email, and financial networks, various approaches have been developed. While they typically assume static input graphs, most real-world graphs grow over time, naturally represented as edge streams. In this context, we aim to achieve three goals: (a) instantly detecting anomalies as they occur, (b) adapting to dynamically changing states, and (c) handling the scarcity of dynamic anomaly labels. In this paper, we propose SLADE (Self-supervised Learning for Anomaly Detection in Edge Streams) for rapid detection of dynamic anomalies in edge streams, without relying on labels. SLADE detects the shifts of nodes into abnormal states by observing deviations in their interaction patterns over time. To this end, it trains a deep neural network to perform two self-supervised tasks: (a) minimizing drift in node representations and (b) generating long-term interaction patterns from short-term ones. Failure in these tasks for a node signals its deviation from the norm. Notably, the neural network and tasks are carefully designed so that all required operations can be performed in constant time (w.r.t. the graph size) in response to each new edge in the input stream. In dynamic anomaly detection across four real-world datasets, SLADE outperforms nine competing methods, even those leveraging label supervision.
- [446] arXiv:2402.12656 (replaced) [pdf, html, other]
-
Title: HyperMoE: Towards Better Mixture of Experts via Transferring Among ExpertsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The Mixture of Experts (MoE) for language models has been proven effective in augmenting the capacity of models by dynamically routing each input token to a specific subset of experts for processing. Despite the success, most existing methods face a challenge for balance between sparsity and the availability of expert knowledge: enhancing performance through increased use of expert knowledge often results in diminishing sparsity during expert selection. To mitigate this contradiction, we propose HyperMoE, a novel MoE framework built upon Hypernetworks. This framework integrates the computational processes of MoE with the concept of knowledge transferring in multi-task learning. Specific modules generated based on the information of unselected experts serve as supplementary information, which allows the knowledge of experts not selected to be used while maintaining selection sparsity. Our comprehensive empirical evaluations across multiple datasets and backbones establish that HyperMoE significantly outperforms existing MoE methods under identical conditions concerning the number of experts.
- [447] arXiv:2402.12784 (replaced) [pdf, html, other]
-
Title: Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval SystemsSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
The emergence of Vec2Text -- a method for text embedding inversion -- has raised serious privacy concerns for dense retrieval systems which use text embeddings, such as those offered by OpenAI and Cohere. This threat comes from the ability for a malicious attacker with access to embeddings to reconstruct the original text. In this paper, we investigate various factors related to embedding models that may impact text recoverability via Vec2Text. We explore factors such as distance metrics, pooling functions, bottleneck pre-training, training with noise addition, embedding quantization, and embedding dimensions, which were not considered in the original Vec2Text paper. Through a comprehensive analysis of these factors, our objective is to gain a deeper understanding of the key elements that affect the trade-offs between the text recoverability and retrieval effectiveness of dense retrieval systems, offering insights for practitioners designing privacy-aware dense retrieval systems. We also propose a simple embedding transformation fix that guarantees equal ranking effectiveness while mitigating the recoverability risk. Overall, this study reveals that Vec2Text could pose a threat to current dense retrieval systems, but there are some effective methods to patch such systems.
- [448] arXiv:2402.13055 (replaced) [pdf, html, other]
-
Title: Identifying Semantic Induction Heads to Understand In-Context LearningSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Although large language models (LLMs) have demonstrated remarkable performance, the lack of transparency in their inference logic raises concerns about their trustworthiness. To gain a better understanding of LLMs, we conduct a detailed analysis of the operations of attention heads and aim to better understand the in-context learning of LLMs. Specifically, we investigate whether attention heads encode two types of relationships between tokens present in natural languages: the syntactic dependency parsed from sentences and the relation within knowledge graphs. We find that certain attention heads exhibit a pattern where, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens. More crucially, the formulation of such semantic induction heads has a close correlation with the emergence of the in-context learning ability of language models. The study of semantic attention heads advances our understanding of the intricate operations of attention heads in transformers, and further provides new insights into the in-context learning of LLMs.
- [449] arXiv:2402.14566 (replaced) [pdf, html, other]
-
Title: Self-supervised Visualisation of Medical Image DatasetsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Self-supervised learning methods based on data augmentations, such as SimCLR, BYOL, or DINO, allow obtaining semantically meaningful representations of image datasets and are widely used prior to supervised fine-tuning. A recent self-supervised learning method, $t$-SimCNE, uses contrastive learning to directly train a 2D representation suitable for visualisation. When applied to natural image datasets, $t$-SimCNE yields 2D visualisations with semantically meaningful clusters. In this work, we used $t$-SimCNE to visualise medical image datasets, including examples from dermatology, histology, and blood microscopy. We found that increasing the set of data augmentations to include arbitrary rotations improved the results in terms of class separability, compared to data augmentations used for natural images. Our 2D representations show medically relevant structures and can be used to aid data exploration and annotation, improving on common approaches for data visualisation.
- [450] arXiv:2402.16598 (replaced) [pdf, html, other]
-
Title: PCR-99: A Practical Method for Point Cloud Registration with 99% OutliersSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
We propose a robust method for point cloud registration that can handle both unknown scales and extreme outlier ratios. Our method, dubbed PCR-99, uses a deterministic 3-point sampling approach with two novel mechanisms that significantly boost the speed: (1) an improved ordering of the samples based on pairwise scale consistency, prioritizing the point correspondences that are more likely to be inliers, and (2) an efficient outlier rejection scheme based on triplet scale consistency, prescreening bad samples and reducing the number of hypotheses to be tested. Our evaluation shows that, up to 98% outlier ratio, the proposed method achieves comparable performance to the state of the art. At 99% outlier ratio, however, it outperforms the state of the art for both known-scale and unknown-scale problems. Especially for the latter, we observe a clear superiority in terms of robustness and speed.
- [451] arXiv:2402.17228 (replaced) [pdf, html, other]
-
Title: Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational PathologyComments: Accepted by CVPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Multiple instance learning (MIL) is the most widely used framework in computational pathology, encompassing sub-typing, diagnosis, prognosis, and more. However, the existing MIL paradigm typically requires an offline instance feature extractor, such as a pre-trained ResNet or a foundation model. This approach lacks the capability for feature fine-tuning within the specific downstream tasks, limiting its adaptability and performance. To address this issue, we propose a Re-embedded Regional Transformer (R$^2$T) for re-embedding the instance features online, which captures fine-grained local features and establishes connections across different regions. Unlike existing works that focus on pre-training powerful feature extractor or designing sophisticated instance aggregator, R$^2$T is tailored to re-embed instance features online. It serves as a portable module that can seamlessly integrate into mainstream MIL models. Extensive experimental results on common computational pathology tasks validate that: 1) feature re-embedding improves the performance of MIL models based on ResNet-50 features to the level of foundation model features, and further enhances the performance of foundation model features; 2) the R$^2$T can introduce more significant performance improvements to various MIL models; 3) R$^2$T-MIL, as an R$^2$T-enhanced AB-MIL, outperforms other latest methods by a large margin.The code is available at: this https URL.
- [452] arXiv:2403.01614 (replaced) [pdf, other]
-
Title: Real-time optimization of thermoelectric coolers' performance based on energy and exergy analysisComments: 14 pages, 7 figuresSubjects: Systems and Control (eess.SY)
New strategy is presented to optimize the performance of Thermoelectric (TE) coolers. This approach breaks optimizing TE coolers free from traditional methods of controlling temperature or engineering materials and the structural properties of the junctions. We introduced a dimensionless figure, {\gamma}, that shows the ratio of the unavailable cooling capacity to the available cooling capacity. This parameter relates the TE coolers' coefficient of performance (COP) to the COP of the reversible cycle (second law of thermodynamics efficiency) for a given electrical current. The theoretical description of the model is presented, and it is shown that controlling {\gamma} during the TE performance minimizes entropy generation and energy loss, which leads to the maximum pumped heat. We validated this model against a designed TE cooler. In this cooler, contrary to conventional TE coolers, where the temperature of the cold space is generally controlled at a specific temperature, and the performance of the cooler overlooked, the entropy generation and heat loss are engineered, and the electrical current is tuned to minimize {\gamma} by the controller so that the TE cooler works near to its optimum performance at any time.
- [453] arXiv:2403.02255 (replaced) [pdf, html, other]
-
Title: Astronomy in Colombia: a bibliometric perspectiveSofía Guevara-Montoya, Felipe Ortiz-Ferreira, María Paula Silva-Arévalo, Paola A. Niño-Muñoz, Jaime E. Forero-RomeroComments: 33 pages, in Spanish language, 6 figures, 7 tables. Accepted for publication in Revista de la Academia Colombiana de Ciencias Exactas, Físicas y Naturales (RACCEFYN). Texto en españolSubjects: Digital Libraries (cs.DL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
In Colombia, astronomical research is experiencing accelerated growth. To better understand its evolution and current state, we conducted a bibliometric study using data from the Astrophysics Data System (ADS) and the Web of Science (WoS). In the ADS, we identified 422 peer-reviewed publications from 1980, the year of the first publication, until 2023, the cut-off year of the study. Of the 25 Colombian institutions participating in at least one publication, 14 are private and 11 are state institutions. More than half of these institutions are concentrated in two main cities: Bogotá with 11 institutions, followed by Medellín with 3 institutions. The number of contributions from four universities stands out: Universidad de los Andes, Universidad Nacional de Colombia, Universidad Industrial de Santander, and Universidad de Antioquia with 104, 78, 68, and 67 publications, respectively. By cross-referencing the information from the ADS and the WoS, we found that the areas in which publications with the highest impact are found are three: high energies and fundamental physics, stars and stellar physics, and galaxies and cosmology. At the global level, according to the WoS, Colombia ranks 52nd in the number of peer-reviewed publications between 2019 and 2023 and fifth in Latin America. Additionally, we identified three highly cited publications (top 1% worldwide) belonging to the field of observational cosmology.
- [454] arXiv:2403.02737 (replaced) [pdf, html, other]
-
Title: Neural Fractional Differential EquationsSubjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
Fractional Differential Equations (FDEs) are essential tools for modelling complex systems in science and engineering. They extend the traditional concepts of differentiation and integration to non-integer orders, enabling a more precise representation of processes characterised by non-local and memory-dependent behaviours.
This property is useful in systems where variables do not respond to changes instantaneously, but instead exhibit a strong memory of past interactions.
Having this in mind, and drawing inspiration from Neural Ordinary Differential Equations (Neural ODEs), we propose the Neural FDE, a novel deep neural network architecture that adjusts a FDE to the dynamics of data.
This work provides a comprehensive overview of the numerical method employed in Neural FDEs and the Neural FDE architecture. The numerical outcomes suggest that, despite being more computationally demanding, the Neural FDE may outperform the Neural ODE in modelling systems with memory or dependencies on past states, and it can effectively be applied to learn more intricate dynamical systems. - [455] arXiv:2403.05007 (replaced) [pdf, html, other]
-
Title: Age of Computing: A Metric of Computation Freshness in Communication and Computation Cooperative NetworksSubjects: Systems and Control (eess.SY)
In communication and computation cooperative networks (3CNs), timely computation is crucial but not always guaranteed. There is a strong demand for a computational task to be completed within a given deadline. The time taken involves both processing time, communication time, and the impact of the deadline. However, a measure of such timeliness in 3CNs is lacking. In this paper, we introduce the novel concept, Age of Computing (AoC), to capture computation freshness in 3CNs. We analyze AoC in a line topology consisting of a source, a transmitter, a receiver, and a computational node. Tasks generated by the source are immediately available at the transmitter, where they enter a communication queue. These tasks then pass to the receiver and subsequently to a computation queue at the computational node for processing. Each task has a deadline, requiring completion within this timeframe. AoC is evaluated under two types of deadlines: (i) soft deadline, tasks can be fed back to the source if delayed beyond the deadline, but with additional latency; (ii) hard deadline, tasks delayed beyond the deadline are discarded. Under both deadlines, we derive the AoC formula and a general expression for the time-average AoC. For the first-come, first-serve discipline, we obtain a closed-form expression for the average AoC under the soft deadline and an approximation for the hard deadline. In addition to freshness, we define computation throughput, providing a general expression and an approximation. To explore the relationship between freshness and throughput, we construct an optimization problem and prove that the objective pair is a weakly Pareto-optimal point. Numerical results validate all the theoretical findings. Additionally, they reveal that under the hard deadline, the computation throughput serves as a reliable proxy for the average AoC.
- [456] arXiv:2403.06461 (replaced) [pdf, html, other]
-
Title: Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time AdaptationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multi-modal test-time adaptation (MM-TTA) is proposed to adapt models to an unlabeled target domain by leveraging the complementary multi-modal inputs in an online manner. Previous MM-TTA methods for 3D segmentation rely on predictions of cross-modal information in each input frame, while they ignore the fact that predictions of geometric neighborhoods within consecutive frames are highly correlated, leading to unstable predictions across time. To fulfill this gap, we propose ReLiable Spatial-temporal Voxels (Latte), an MM-TTA method that leverages reliable cross-modal spatial-temporal correspondences for multi-modal 3D segmentation. Motivated by the fact that reliable predictions should be consistent with their spatial-temporal correspondences, Latte aggregates consecutive frames in a slide window manner and constructs Spatial-Temopral (ST) voxels to capture temporally local prediction consistency for each modality. After filtering out ST voxels with high ST entropy, Latte conducts cross-modal learning for each point and pixel by attending to those with reliable and consistent predictions among both spatial and temporal neighborhoods. Experimental results show that Latte achieves state-of-the-art performance on three different MM-TTA benchmarks compared to previous MM-TTA or TTA methods. Visit our project site this https URL.
- [457] arXiv:2403.08651 (replaced) [pdf, html, other]
-
Title: HAIFIT: Human-Centered AI for Fashion Image TranslationComments: 10 pages,7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
In the realm of fashion design, sketches serve as the canvas for expressing an artist's distinctive drawing style and creative vision, capturing intricate details like stroke variations and texture nuances. The advent of sketch-to-image cross-modal translation technology has notably aided designers. However, existing methods often compromise these sketch details during image generation, resulting in images that deviate from the designer's intended concept. This limitation hampers the ability to offer designers a precise preview of the final output. To overcome this challenge, we introduce HAIFIT, a novel approach that transforms sketches into high-fidelity, lifelike clothing images by integrating multi-scale features and capturing extensive feature map dependencies from diverse perspectives. Through extensive qualitative and quantitative evaluations conducted on our self-collected dataset, our method demonstrates superior performance compared to existing methods in generating photorealistic clothing images. Our method excels in preserving the distinctive style and intricate details essential for fashion design applications. In addition, our method also has obvious advantages in model training and inference speed, contributing to reducing designers' time costs and improving design efficiency.
- [458] arXiv:2403.10307 (replaced) [pdf, html, other]
-
Title: Chernoff Information as a Privacy Constraint for Adversarial ClassificationSubjects: Information Theory (cs.IT)
This work inspects a privacy metric based on Chernoff information, \textit{Chernoff differential privacy}, due to its significance in characterization of the optimal classifier's performance. Adversarial classification, as any other classification problem is built around minimization of the (average or correct detection) probability of error in deciding on either of the classes in the case of binary classification. Unlike the classical hypothesis testing problem, where the false alarm and mis-detection probabilities are handled separately resulting in an asymmetric behavior of the best error exponent, in this work, we focus on the Bayesian setting and characterize the relationship between the best error exponent of the average error probability and $\varepsilon\textrm{-}$differential privacy \cite{D06}. Accordingly, we re-derive Chernoff differential privacy in terms of $\varepsilon\textrm{-}$differential privacy using the Radon-Nikodym derivative and show that it satisfies the composition property for sequential composition. Subsequently, we present numerical evaluation results, which demonstrates that Chernoff information outperforms Kullback-Leibler divergence as a function of the privacy parameter $\varepsilon$, the impact of the adversary's attack and global sensitivity for the problem of adversarial classification in Laplace mechanisms.
- [459] arXiv:2403.10444 (replaced) [pdf, html, other]
-
Title: Block Verification Accelerates Speculative DecodingZiteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha SureshSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT)
Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups over the standard token verification algorithm of 5%-8% in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.
- [460] arXiv:2403.11761 (replaced) [pdf, html, other]
-
Title: BEVCar: Camera-Radar Fusion for BEV Map and Object SegmentationJonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, Abhinav ValadaComments: Accepted for the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Semantic scene segmentation from a bird's-eye-view (BEV) perspective plays a crucial role in facilitating planning and decision-making for mobile robots. Although recent vision-only methods have demonstrated notable advancements in performance, they often struggle under adverse illumination conditions such as rain or nighttime. While active sensors offer a solution to this challenge, the prohibitively high cost of LiDARs remains a limiting factor. Fusing camera data with automotive radars poses a more inexpensive alternative but has received less attention in prior research. In this work, we aim to advance this promising avenue by introducing BEVCar, a novel approach for joint BEV object and map segmentation. The core novelty of our approach lies in first learning a point-based encoding of raw radar data, which is then leveraged to efficiently initialize the lifting of image features into the BEV space. We perform extensive experiments on the nuScenes dataset and demonstrate that BEVCar outperforms the current state of the art. Moreover, we show that incorporating radar information significantly enhances robustness in challenging environmental conditions and improves segmentation performance for distant objects. To foster future research, we provide the weather split of the nuScenes dataset used in our experiments, along with our code and trained models at this http URL.
- [461] arXiv:2403.12856 (replaced) [pdf, html, other]
-
Title: Equivariant Ensembles and Regularization for Reinforcement Learning in Map-based Path PlanningComments: Accepted at IROS 2024. A video can be found here: this https URL. The code is available at this https URLSubjects: Machine Learning (cs.LG); Robotics (cs.RO)
In reinforcement learning (RL), exploiting environmental symmetries can significantly enhance efficiency, robustness, and performance. However, ensuring that the deep RL policy and value networks are respectively equivariant and invariant to exploit these symmetries is a substantial challenge. Related works try to design networks that are equivariant and invariant by construction, limiting them to a very restricted library of components, which in turn hampers the expressiveness of the networks. This paper proposes a method to construct equivariant policies and invariant value functions without specialized neural network components, which we term equivariant ensembles. We further add a regularization term for adding inductive bias during training. In a map-based path planning case study, we show how equivariant ensembles and regularization benefit sample efficiency and performance.
- [462] arXiv:2403.13129 (replaced) [pdf, html, other]
-
Title: Better Call SAL: Towards Learning to Segment Anything in LidarComments: Accepted to ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
We propose the SAL (Segment Anything in Lidar) method consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a handful of object classes defined a priori, we utilize 2D vision foundation models to generate 3D supervision ``for free''. Our pseudo-labels consist of instance masks and corresponding CLIP tokens, which we lift to Lidar using calibrated multi-modal data. By training our model on these labels, we distill the 2D foundation models into our Lidar SAL model. Even without manual labels, our model achieves $91\%$ in terms of class-agnostic segmentation and $54\%$ in terms of zero-shot Lidar Panoptic Segmentation of the fully supervised state-of-the-art. Furthermore, we outperform several baselines that do not distill but only lift image features to 3D. More importantly, we demonstrate that SAL supports arbitrary class prompts, can be easily extended to new datasets, and shows significant potential to improve with increasing amounts of self-labeled data. Code and models are available at this $\href{this https URL}{URL}$.
- [463] arXiv:2403.14236 (replaced) [pdf, html, other]
-
Title: A Unified Framework for Model EditingComments: Under review. To appear as poster at KnowledgeableLM Workshop co-located with ACL 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
ROME and MEMIT are largely believed to be two different model editing algorithms, with the major difference between them being the ability to perform batched edits. In this paper, we unify these two algorithms under a single conceptual umbrella, optimizing for the same goal, which we call the preservation-memorization objective. ROME uses an equality constraint to optimize this objective to perform one edit at a time, whereas MEMIT employs a more flexible least-square constraint that allows for batched edits. We generalize ROME and enable batched editing with equality constraint in the form of EMMET - an Equality-constrained Mass Model Editing algorithm for Transformers, a new batched memory-editing algorithm. EMMET can perform batched-edits up to a batch-size of 10,000, with very similar performance to MEMIT across multiple dimensions. With the introduction of EMMET, we truly unify ROME and MEMIT and show that both algorithms are equivalent in terms of their optimization objective, their abilities (singular and batched editing), their model editing performance and their limitations.
- [464] arXiv:2403.14888 (replaced) [pdf, html, other]
-
Title: AutoRE: Document-Level Relation Extraction with Large Language ModelsComments: 11 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated exceptional abilities in comprehending and generating text, motivating numerous researchers to utilize them for Information Extraction (IE) purposes, including Relation Extraction (RE). Nonetheless, most existing methods are predominantly designed for Sentence-level Relation Extraction (SentRE) tasks, which typically encompass a restricted set of relations and triplet facts within a single sentence. Furthermore, certain approaches resort to treating relations as candidate choices integrated into prompt templates, leading to inefficient processing and suboptimal performance when tackling Document-Level Relation Extraction (DocRE) tasks, which entail handling multiple relations and triplet facts distributed across a given document, posing distinct challenges. To overcome these limitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel RE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing approaches, AutoRE does not rely on the assumption of known relation options, making it more reflective of real-world scenarios. Additionally, we have developed an easily extensible RE framework using a Parameters Efficient Fine Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset showcase AutoRE's best performance, achieving state-of-the-art results, surpassing TAG by 10.03\% and 9.03\% respectively on the dev and test set. The code is available\url{this https URL} and the demonstration video is provided this https URL
- [465] arXiv:2403.15377 (replaced) [pdf, html, other]
-
Title: InternVideo2: Scaling Foundation Models for Multimodal Video UnderstandingYi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin WangComments: a technical report about video understanding (accepted to ECCV2024)Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies the masked video modeling, crossmodal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at this https URL.
- [466] arXiv:2403.15634 (replaced) [pdf, html, other]
-
Title: An Interactive Decision-Support Dashboard for Optimal Hospital Capacity ManagementComments: 25 pages, 11 figuresSubjects: Computers and Society (cs.CY)
Data-driven optimization models have the potential to significantly improve hospital capacity management, particularly during demand surges, when effective allocation of capacity is most critical and challenging. However, integrating models into existing processes in a way that provides value requires recognizing that hospital administrators are ultimately responsible for making capacity management decisions, and carefully building trustworthy and accessible tools for them. In this study, we develop an interactive, user-friendly, electronic dashboard for informing hospital capacity management decisions during surge periods. The dashboard integrates real-time hospital data, predictive analytics, and optimization models. It allows hospital administrators to interactively customize parameters, enabling them to explore a range of scenarios, and provides real-time updates on recommended optimal decisions. The dashboard was created through a participatory design process, involving hospital administrators in the development team to ensure practical utility, trustworthiness, transparency, explainability, and usability. We successfully deployed our dashboard within the Johns Hopkins Health System during the height of the COVID-19 pandemic, addressing the increased need for tools to inform hospital capacity management. It was used on a daily basis, with results regularly communicated to hospital leadership. This study demonstrates the practical application of a prospective, data-driven, interactive decision-support tool for hospital system capacity management.
- [467] arXiv:2403.17839 (replaced) [pdf, html, other]
-
Title: ReMamber: Referring Image Segmentation with Mamba TwisterComments: ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Referring Image Segmentation~(RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic computation cost makes it resource-consuming in capturing long-range visual-language dependencies. Fortunately, Mamba addresses this with efficient linear complexity in processing. However, directly applying Mamba to multi-modal interactions presents challenges, primarily due to inadequate channel interactions for the effective fusion of multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve competitive results on three challenging benchmarks with a simple and efficient architecture. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba. These provide valuable perspectives for future research. The code has been released at: this https URL.
- [468] arXiv:2403.19919 (replaced) [pdf, html, other]
-
Title: Diff-Reg v1: Diffusion Matching Model for Registration ProblemComments: arXiv admin note: text overlap with arXiv:2401.00436Subjects: Computer Vision and Pattern Recognition (cs.CV)
Establishing reliable correspondences is essential for registration tasks such as 3D and 2D3D registration. Existing methods commonly leverage geometric or semantic point features to generate potential correspondences. However, these features may face challenges such as large deformation, scale inconsistency, and ambiguous matching problems (e.g., symmetry). Additionally, many previous methods, which rely on single-pass prediction, may struggle with local minima in complex scenarios. To mitigate these challenges, we introduce a diffusion matching model for robust correspondence construction. Our approach treats correspondence estimation as a denoising diffusion process within the doubly stochastic matrix space, which gradually denoises (refines) a doubly stochastic matching matrix to the ground-truth one for high-quality correspondence estimation. It involves a forward diffusion process that gradually introduces Gaussian noise into the ground truth matching matrix and a reverse denoising process that iteratively refines the noisy matching matrix. In particular, the feature extraction from the backbone occurs only once during the inference phase. Our lightweight denoising module utilizes the same feature at each reverse sampling step. Evaluation of our method on both 3D and 2D3D registration tasks confirms its effectiveness. The code is available at this https URL.
- [469] arXiv:2404.00725 (replaced) [pdf, html, other]
-
Title: The Larger the Better? Improved LLM Code-Generation via Budget ReallocationComments: COLM 2024Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget? (e.g., compute, run-time). To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model. We consider a standard unit-test setup, which can be used to select the correct output from the smaller model. Our findings reveal that the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit-tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from larger ones. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.
- [470] arXiv:2404.01039 (replaced) [pdf, html, other]
-
Title: A Survey on Hypergraph Neural Networks: An In-Depth and Step-By-Step GuideComments: To appear in KDD 2024 (survey paper). The typo in Equation (5) has been fixedSubjects: Machine Learning (cs.LG)
Higher-order interactions (HOIs) are ubiquitous in real-world complex systems and applications. Investigation of deep learning for HOIs, thus, has become a valuable agenda for the data mining and machine learning communities. As networks of HOIs are expressed mathematically as hypergraphs, hypergraph neural networks (HNNs) have emerged as a powerful tool for representation learning on hypergraphs. Given the emerging trend, we present the first survey dedicated to HNNs, with an in-depth and step-by-step guide. Broadly, the present survey overviews HNN architectures, training strategies, and applications. First, we break existing HNNs down into four design components: (i) input features, (ii) input structures, (iii) message-passing schemes, and (iv) training strategies. Second, we examine how HNNs address and learn HOIs with each of their components. Third, we overview the recent applications of HNNs in recommendation, bioinformatics and medical science, time series analysis, and computer vision. Lastly, we conclude with a discussion on limitations and future directions.
- [471] arXiv:2404.01407 (replaced) [pdf, html, other]
-
Title: Convergence Acceleration of Favre-Averaged Non-Linear Harmonic MethodSubjects: Numerical Analysis (math.NA); Mathematical Physics (math-ph); Fluid Dynamics (physics.flu-dyn)
This paper develops a numerical procedure to accelerate the convergence of the Favre-averaged Non-Linear Harmonic (FNLH) method. The scheme provides a unified mathematical framework for solving the sparse linear systems formed by the mean flow and the time-linearized harmonic flows of FNLH in an explicit or implicit fashion. The approach explores the similarity of the sparse linear systems of FNLH and leads to a memory efficient procedure, so that its memory consumption does not depend on the number of harmonics to compute. The proposed method has been implemented in the industrial CFD solver HYDRA. Two test cases are used to conduct a comparative study of explicit and implicit schemes in terms of convergence, computational efficiency, and memory consumption. Comparisons show that the implicit scheme yields better convergence than the explicit scheme and is also roughly 7 to 10 times more computationally efficient than the explicit scheme with 4 levels of multigrid. Furthermore, the implicit scheme consumes only approximately $50\%$ of the explicit scheme with four levels of multigrid. Compared with the full annulus unsteady Reynolds averaged Navier-Stokes (URANS) simulations, the implicit scheme produces comparable results to URANS with computational time and memory consumption that are two orders of magnitude smaller.
- [472] arXiv:2404.01488 (replaced) [pdf, html, other]
-
Title: DeLVE into Earth's Past: A Visualization-Based Exhibit Deployed Across Multiple Museum ContextsSubjects: Human-Computer Interaction (cs.HC)
While previous work has found success in deploying visualizations as museum exhibits, it has not investigated whether museum context impacts visitor behaviour with these exhibits. We present an interactive Deep-time Literacy Visualization Exhibit (DeLVE) to help museum visitors understand deep time (lengths of extremely long geological processes) by improving proportional reasoning skills through comparison of different time periods. DeLVE uses a new visualization idiom, Connected Multi-Tier Ranges, to visualize curated datasets of past events across multiple scales of time, relating extreme scales with concrete scales that have more familiar magnitudes and units. Museum staff at three separate museums approved the deployment of DeLVE as a digital kiosk, and devoted time to curating a unique dataset in each of them. We collect data from two sources, an observational study and system trace logs. We discuss the importance of context: similar museum exhibits in different contexts were received very differently by visitors. We additionally discuss differences in our process from Sedlmair et al.'s design study methodology which is focused on design studies triggered by connection with collaborators rather than the discovery of a concept to communicate. Supplemental materials are available at: this https URL
- [473] arXiv:2404.01537 (replaced) [pdf, html, other]
-
Title: Are Doppler Velocity Measurements Useful for Spinning Radar Odometry?Comments: 8 pages, 7 figures, 2 tables, submitted to Robotics and Automation Letters (RA-L)Subjects: Robotics (cs.RO)
Spinning, frequency-modulated continuous-wave (FMCW) radars with 360 degree coverage have been gaining popularity for autonomous-vehicle navigation. However, unlike 'fixed' automotive radar, commercially available spinning radar systems typically do not produce radial velocities due to the lack of repeated measurements in the same direction and the fundamental hardware setup. To make these radial velocities observable, we modified the firmware of a commercial spinning radar to use triangular frequency modulation. In this paper, we develop a novel way to use this modulation to extract radial Doppler velocity measurements from single raw radar intensity scans without any required data association. We show that these noisy, error-prone measurements contain enough information to provide good ego-velocity estimates, and incorporate these estimates into different modern odometry pipelines. We extensively evaluate the pipelines on over 110 km of driving data in progressively more geometrically challenging autonomous-driving environments. We show that Doppler velocity measurements improve odometry in well-defined geometric conditions and enable it to continue functioning even in severely geometrically degenerate environments, such as long tunnels.
- [474] arXiv:2404.01683 (replaced) [pdf, html, other]
-
Title: Guided-Mutation Genetic Algorithm for Mobile IoT Network RelayComments: 15 pages, 8 figures; this work has been accepted for publication in IEEE Access. Accepted 24 July 2024Subjects: Networking and Internet Architecture (cs.NI)
The Internet of Things (IoT) is a communication scheme which allows various objects to exchange several types of information, enabling functions such as home automation, production management, healthcare, etc. In addition, energy-harvesting (EH) technology is considered for IoT environment in order to reduce the need for management and enhance maintainability. Moreover, since environments considering outdoor elements such as pedestrians, vehicles and drones have been on the rise recently, it is important to consider mobility when designing an IoT network management scheme. However, calculating the optimal relaying topology is considered as an NP-hard problem, and finishing computation for mobility environment before the channel status changes is important to prevent delayed calculation results. In this article, our objective is to calculate a sub-optimal relaying topology for stationary and mobile system within reasonable computation time. To achieve our objective, we validate an iterative balancing time slot allocation algorithm introduced in the previous study, and propose a guided-mutation genetic algorithm (GMGA) that modulates the mutation rate based on the channel status for rational exploration. Additionally, we propose a mobility-aware iterative relaying topology algorithm, which calculates relaying topology in a mobility environment using an inheritance of the sub-optimal relaying topology calculations. Simulation results verify that our proposed scheme effectively solves formulated IoT network problems compared to other conventional schemes, and also effectively handles IoT environments including mobility in terms of minimum rate budget and computation time.
- [475] arXiv:2404.01799 (replaced) [pdf, html, other]
-
Title: PATCH! Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Proficiency in 8th Grade MathematicsSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Many existing benchmarks of large (multimodal) language models (LLMs) focus on measuring LLMs' academic proficiency, often with also an interest in comparing model performance with human test takers. While these benchmarks have proven key to the development of LLMs, they suffer from several limitations, including questionable measurement quality (e.g., Do they measure what they are supposed to in a reliable way?), lack of quality assessment on the item level (e.g., Are some items more important or difficult than others?) and unclear human population reference (e.g., To whom can the model be compared?). In response to these challenges, we propose leveraging knowledge from psychometrics - a field dedicated to the measurement of latent variables like academic proficiency - into LLM benchmarking. We make three primary contributions. First, we introduce PATCH: a novel framework for {P}sychometrics-{A}ssis{T}ed ben{CH}marking of LLMs. PATCH addresses the aforementioned limitations, presenting a new direction for LLM benchmark research. Second, we implement PATCH by measuring GPT-4 and Gemini-Pro-Vision's proficiency in 8th grade mathematics against 56 human populations. We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on existing benchmarking practices. Third, we release 4 high-quality datasets to support measuring and comparing LLM proficiency in grade school mathematics and science against human populations.
- [476] arXiv:2404.02831 (replaced) [pdf, html, other]
-
Title: Empowering Biomedical Discovery with AI AgentsShanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, Marinka ZitnikSubjects: Artificial Intelligence (cs.AI)
We envision "AI scientists" as systems capable of skeptical learning and reasoning that empower biomedical research through collaborative agents that integrate AI models and biomedical tools with experimental platforms. Rather than taking humans out of the discovery process, biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets, navigate hypothesis spaces, and execute repetitive tasks. AI agents are poised to be proficient in various tasks, planning discovery workflows and performing self-assessment to identify and mitigate gaps in their knowledge. These agents use large language models and generative models to feature structured memory for continual learning and use machine learning tools to incorporate scientific knowledge, biological principles, and theories. AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to developing new therapies.
- [477] arXiv:2404.03613 (replaced) [pdf, html, other]
-
Title: Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian SplattingComments: ECCV 2024. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
As 3D Gaussian Splatting (3DGS) provides fast and high-quality novel view synthesis, it is a natural extension to deform a canonical 3DGS to multiple frames for representing a dynamic scene. However, previous works fail to accurately reconstruct complex dynamic scenes. We attribute the failure to the design of the deformation field, which is built as a coordinate-based function. This approach is problematic because 3DGS is a mixture of multiple fields centered at the Gaussians, not just a single coordinate-based framework. To resolve this problem, we define the deformation as a function of per-Gaussian embeddings and temporal embeddings. Moreover, we decompose deformations as coarse and fine deformations to model slow and fast movements, respectively. Also, we introduce a local smoothness regularization for per-Gaussian embedding to improve the details in dynamic regions. Project page: this https URL
- [478] arXiv:2404.03632 (replaced) [pdf, other]
-
Title: Reference-Based 3D-Aware Image Editing with TriplanesComments: 20 pages, including supplementary materialSubjects: Computer Vision and Pattern Recognition (cs.CV)
Generative Adversarial Networks (GANs) have emerged as powerful tools for high-quality image generation and real image editing by manipulating their latent spaces. Recent advancements in GANs include 3D-aware models such as EG3D, which feature efficient triplane-based architectures capable of reconstructing 3D geometry from single images. However, limited attention has been given to providing an integrated framework for 3D-aware, high-quality, reference-based image editing. This study addresses this gap by exploring and demonstrating the effectiveness of the triplane space for advanced reference-based edits. Our novel approach integrates encoding, automatic localization, spatial disentanglement of triplane features, and fusion learning to achieve the desired edits. Additionally, our framework demonstrates versatility and robustness across various domains, extending its effectiveness to animal face edits, partially stylized edits like cartoon faces, full-body clothing edits, and 360-degree head edits. Our method shows state-of-the-art performance over relevant latent direction, text, and image-guided 2D and 3D-aware diffusion and GAN methods, both qualitatively and quantitatively.
- [479] arXiv:2404.04603 (replaced) [pdf, html, other]
-
Title: Analyzing LLM Usage in an Advanced Computing Class in IndiaAnupam Garg, Aryaman Raina, Aryan Gupta, Jaskaran Singh, Manav Saini, Prachi Iiitd, Ronit Mehta, Rupin Oberoi, Sachin Sharma, Samyak Jain, Sarthak Tyagi, Utkarsh Arora, Dhruv KumarComments: Under review: 8 pagesSubjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
This study examines the use of large language models (LLMs) by undergraduate and graduate students for programming assignments in advanced computing classes. Unlike existing research, which primarily focuses on introductory classes and lacks in-depth analysis of actual student-LLM interactions, our work fills this gap. We conducted a comprehensive analysis involving 411 students from a Distributed Systems class at an Indian university, where they completed three programming assignments and shared their experiences through Google Form surveys.
Our findings reveal that students leveraged LLMs for a variety of tasks, including code generation, debugging, conceptual inquiries, and test case creation. They employed a spectrum of prompting strategies, ranging from basic contextual prompts to advanced techniques like chain-of-thought prompting and iterative refinement. While students generally viewed LLMs as beneficial for enhancing productivity and learning, we noted a concerning trend of over-reliance, with many students submitting entire assignment descriptions to obtain complete solutions. Given the increasing use of LLMs in the software industry, our study highlights the need to update undergraduate curricula to include training on effective prompting strategies and to raise awareness about the benefits and potential drawbacks of LLM usage in academic settings. - [480] arXiv:2404.05427 (replaced) [pdf, html, other]
-
Title: AutoCodeRover: Autonomous Program ImprovementComments: To appear in ISSTA 2024Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves the process of program improvement apart from coding, specifically to enable software maintenance (e.g. bug fixing) and software evolution (e.g. feature additions). In this paper, we propose an automated approach for solving GitHub issues to autonomously achieve program improvement. In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance LLM's understanding of the issue's root cause, and effectively retrieve a context via iterative search. The use of spectrum-based fault localization using tests, further sharpens the context, as long as a test-suite is available. Experiments on SWE-bench-lite (300 real-life GitHub issues) show increased efficacy in solving GitHub issues (19% on SWE-bench-lite), which is higher than the efficacy of the recently reported SWE-agent. In addition, AutoCodeRover achieved this efficacy with significantly lower cost (on average, $0.43 USD), compared to other baselines. We posit that our workflow enables autonomous software engineering, where, in future, auto-generated code from LLMs can be autonomously improved.
- [481] arXiv:2404.06903 (replaced) [pdf, html, other]
-
Title: DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian SplattingShijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Tejas Bharadwaj, Suya You, Zhangyang Wang, Achuta KadambiSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: this http URL
- [482] arXiv:2404.08061 (replaced) [pdf, html, other]
-
Title: Physics-Enhanced Graph Neural Networks For Soft Sensing in Industrial Internet of ThingsComments: 14 pages, 10 figures. Accepted to IEEE Internet of Things JournalSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
The Industrial Internet of Things (IIoT) is reshaping manufacturing, industrial processes, and infrastructure management. By fostering new levels of automation, efficiency, and predictive maintenance, IIoT is transforming traditional industries into intelligent, seamlessly interconnected ecosystems. However, achieving highly reliable IIoT can be hindered by factors such as the cost of installing large numbers of sensors, limitations in retrofitting existing systems with sensors, or harsh environmental conditions that may make sensor installation impractical. Soft (virtual) sensing leverages mathematical models to estimate variables from physical sensor data, offering a solution to these challenges. Data-driven and physics-based modeling are the two main methodologies widely used for soft sensing. The choice between these strategies depends on the complexity of the underlying system, with the data-driven approach often being preferred when the physics-based inference models are intricate and present challenges for state estimation. However, conventional deep learning models are typically hindered by their inability to explicitly represent the complex interactions among various sensors. To address this limitation, we adopt Graph Neural Networks (GNNs), renowned for their ability to effectively capture the complex relationships between sensor measurements. In this research, we propose physics-enhanced GNNs, which integrate principles of physics into graph-based methodologies. This is achieved by augmenting additional nodes in the input graph derived from the underlying characteristics of the physical processes. Our evaluation of the proposed methodology on the case study of district heating networks reveals significant improvements over purely data-driven GNNs, even in the presence of noise and parameter inaccuracies.
- [483] arXiv:2404.09302 (replaced) [pdf, html, other]
-
Title: High Significant Fault Detection in Azure Core Workload InsightsComments: Published in IAAI 2024, which is the Industrial track of AAAI 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Azure Core workload insights have time-series data with different metric units. Faults or Anomalies are observed in these time-series data owing to faults observed with respect to metric name, resources region, dimensions, and its dimension value associated with the data. For Azure Core, an important task is to highlight faults or anomalies to the user on a dashboard that they can perceive easily. The number of anomalies reported should be highly significant and in a limited number, e.g., 5-20 anomalies reported per hour. The reported anomalies will have significant user perception and high reconstruction error in any time-series forecasting model. Hence, our task is to automatically identify 'high significant anomalies' and their associated information for user perception.
- [484] arXiv:2404.09556 (replaced) [pdf, html, other]
-
Title: nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image SegmentationFabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus Maier-Hein, Paul F. JaegerComments: Accepted at MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
The release of nnU-Net marked a paradigm shift in 3D medical image segmentation, demonstrating that a properly configured U-Net architecture could still achieve state-of-the-art results. Despite this, the pursuit of novel architectures, and the respective claims of superior performance over the U-Net baseline, continued. In this study, we demonstrate that many of these recent claims fail to hold up when scrutinized for common validation shortcomings, such as the use of inadequate baselines, insufficient datasets, and neglected computational resources. By meticulously avoiding these pitfalls, we conduct a thorough and comprehensive benchmarking of current segmentation methods including CNN-based, Transformer-based, and Mamba-based approaches. In contrast to current beliefs, we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models, including ResNet and ConvNeXt variants, 2) using the nnU-Net framework, and 3) scaling models to modern hardware resources. These results indicate an ongoing innovation bias towards novel architectures in the field and underscore the need for more stringent validation standards in the quest for scientific progress.
- [485] arXiv:2404.10616 (replaced) [pdf, html, other]
-
Title: One is all you need: Second-order Unification without First-order VariablesComments: Under reviewSubjects: Logic in Computer Science (cs.LO)
We introduce a fragment of second-order unification, referred to as \emph{Second-Order Ground Unification (SOGU)}, with the following properties: (i) only one second-order variable is allowed, and (ii) first-order variables do not occur. We study an equational variant of SOGU where the signature contains \textit{associative} binary function symbols (ASOGU) and show that Hilbert's 10$^{th}$ problem is reducible to ASOGU unifiability, thus proving undecidability. Our reduction provides a new lower bound for the undecidability of second-order unification, as previous results required first-order variable occurrences, multiple second-order variables, and/or equational theories involving \textit{length-reducing} rewrite systems. Furthermore, our reduction holds even in the case when associativity of the binary function symbol is restricted to \emph{power associative}, i.e. f(f(x,x),x)= f(x,f(x,x)), as our construction requires a single constant.
- [486] arXiv:2404.10658 (replaced) [pdf, other]
-
Title: Trajectory Planning Using Reinforcement Learning for Interactive Overtaking Maneuvers in Autonomous Racing ScenariosComments: 8 pages, accepted to be published at the 27th IEEE International Conference on Intelligent Transportation Systems, September 24 - 27, 2024, Edmonton, CanadaSubjects: Robotics (cs.RO)
Conventional trajectory planning approaches for autonomous racing are based on the sequential execution of prediction of the opposing vehicles and subsequent trajectory planning for the ego vehicle. If the opposing vehicles do not react to the ego vehicle, they can be predicted accurately. However, if there is interaction between the vehicles, the prediction loses its validity. For high interaction, instead of a planning approach that reacts exclusively to the fixed prediction, a trajectory planning approach is required that incorporates the interaction with the opposing vehicles. This paper demonstrates the limitations of a widely used conventional sampling-based approach within a highly interactive blocking scenario. We show that high success rates are achieved for less aggressive blocking behavior but that the collision rate increases with more significant interaction. We further propose a novel Reinforcement Learning (RL)-based trajectory planning approach for racing that explicitly exploits the interaction with the opposing vehicle without requiring a prediction. In contrast to the conventional approach, the RL-based approach achieves high success rates even for aggressive blocking behavior. Furthermore, we propose a novel safety layer (SL) that intervenes when the trajectory generated by the RL-based approach is infeasible. In that event, the SL generates a sub-optimal but feasible trajectory, avoiding termination of the scenario due to a not found valid solution.
- [487] arXiv:2404.11483 (replaced) [pdf, other]
-
Title: AgentKit: Structured LLM Reasoning with Dynamic GraphsYue Wu, Yewen Fan, So Yeon Min, Shrimai Prabhumoye, Stephen McAleer, Yonatan Bisk, Ruslan Salakhutdinov, Yuanzhi Li, Tom MitchellSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts. The basic building block in AgentKit is a node, containing a natural language prompt for a specific subtask. The user then puts together chains of nodes, like stacking LEGO pieces. The chains of nodes can be designed to explicitly enforce a naturally structured "thought process". For example, for the task of writing a paper, one may start with the thought process of 1) identify a core message, 2) identify prior research gaps, etc. The nodes in AgentKit can be designed and combined in different ways to implement multiple advanced capabilities including on-the-fly hierarchical planning, reflection, and learning from interactions. In addition, due to the modular nature and the intuitive design to simulate explicit human thought process, a basic agent could be implemented as simple as a list of prompts for the subtasks and therefore could be designed and tuned by someone without any programming experience. Quantitatively, we show that agents designed through AgentKit achieve SOTA performance on WebShop and Crafter. These advances underscore AgentKit's potential in making LLM agents effective and accessible for a wider range of applications. this https URL
- [488] arXiv:2404.11869 (replaced) [pdf, html, other]
-
Title: Node-like as a Whole: Structure-aware Searching and Coarsening for Graph ClassificationSubjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Graph Transformers (GTs) have made remarkable achievements in graph-level tasks. However, most existing works regard graph structures as a form of guidance or bias for enhancing node representations, which focuses on node-central perspectives and lacks explicit representations of edges and structures. One natural question is, can we treat graph structures node-like as a whole to learn high-level features? Through experimental analysis, we explore the feasibility of this assumption. Based on our findings, we propose a novel multi-view graph representation learning model via structure-aware searching and coarsening (GRLsc) on GT architecture for graph classification. Specifically, we build three unique views, original, coarsening, and conversion, to learn a thorough structural representation. We compress loops and cliques via hierarchical heuristic graph coarsening and restrict them with well-designed constraints, which builds the coarsening view to learn high-level interactions between structures. We also introduce line graphs for edge embeddings and switch to edge-central perspective to construct the conversion view. Experiments on eight real-world datasets demonstrate the improvements of GRLsc over 28 baselines from various architectures.
- [489] arXiv:2404.12810 (replaced) [pdf, html, other]
-
Title: Enhancing Counterfactual Explanation Search with Diffusion Distance and Directional CoherenceComments: This work has been accepted to be presented to The 2nd World Conference on eXplainable Artificial Intelligence (xAI 2024), July 17-19, 2024 - Valletta, MaltaJournal-ref: In: Longo, L., Lapuschkin, S., Seifert, C. (eds) Explainable Artificial Intelligence. xAI 2024. Communications in Computer and Information Science, vol 2155. Springer, ChamSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
A pressing issue in the adoption of AI models is the increasing demand for more human-centric explanations of their predictions. To advance towards more human-centric explanations, understanding how humans produce and select explanations has been beneficial. In this work, inspired by insights of human cognition we propose and test the incorporation of two novel biases to enhance the search for effective counterfactual explanations. Central to our methodology is the application of diffusion distance, which emphasizes data connectivity and actionability in the search for feasible counterfactual explanations. In particular, diffusion distance effectively weights more those points that are more interconnected by numerous short-length paths. This approach brings closely connected points nearer to each other, identifying a feasible path between them. We also introduce a directional coherence term that allows the expression of a preference for the alignment between the joint and marginal directional changes in feature space to reach a counterfactual. This term enables the generation of counterfactual explanations that align with a set of marginal predictions based on expectations of how the outcome of the model varies by changing one feature at a time. We evaluate our method, named Coherent Directional Counterfactual Explainer (CoDiCE), and the impact of the two novel biases against existing methods such as DiCE, FACE, Prototypes, and Growing Spheres. Through a series of ablation experiments on both synthetic and real datasets with continuous and mixed-type features, we demonstrate the effectiveness of our method.
- [490] arXiv:2404.12832 (replaced) [pdf, html, other]
-
Title: COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical imagesComments: This work has been accepted to be presented to The 2nd World Conference on eXplainable Artificial Intelligence (xAI 2024), July 17-19, 2024 - Valletta, MaltaJournal-ref: In: Longo, L., Lapuschkin, S., Seifert, C. (eds) Explainable Artificial Intelligence. xAI 2024. Communications in Computer and Information Science, vol 2155. Springer, ChamSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal by using a generative model. For instance, if the classifier deems an input medical image X as abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction label. The approach enables us to produce precise segmentations for pathologies without depending on pre-existing segmentation masks. Crucially, image-level labels are utilized, which are substantially easier to acquire than creating detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images, and presents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
- [491] arXiv:2404.14082 (replaced) [pdf, html, other]
-
Title: Mechanistic Interpretability for AI Safety -- A ReviewComments: HTML version: this https URLSubjects: Artificial Intelligence (cs.AI)
Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors and expand to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable.
- [492] arXiv:2404.18550 (replaced) [pdf, html, other]
-
Title: IncidentResponseGPT: Generating Traffic Incident Response Plans with Generative Artificial IntelligenceSubjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
The proposed IncidentResponseGPT framework - a novel system that applies generative artificial intelligence (AI) to potentially enhance the efficiency and effectiveness of traffic incident response. This model allows for synthesis of region-specific incident response guidelines and generates incident response plans adapted to specific area, aiming to expedite decision-making for traffic management authorities. This approach aims to accelerate incident resolution times by suggesting various recommendations (e.g. optimal rerouting strategies, estimating resource needs) to minimize the overall impact on the urban traffic network. The system suggests specific actions, including dynamic lane closures, optimized rerouting and dispatching appropriate emergency resources. IncidentResponseGPT employs the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to rank generated response plans based on criteria like impact minimization and resource efficiency based on their proximity to an human-proposed solution.
- [493] arXiv:2404.19708 (replaced) [pdf, html, other]
-
Title: Harmonic LLMs are TrustworthyComments: 15 pages, 2 figures, 16 tables; added Claude-3.0, GPT-4o, Mistral-7B, Mixtral-8x7B, and more annotation for other modelsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real-time via its local deviation from harmoniticity, denoted as $\gamma$. To the best of our knowledge this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given response from an LLM, based upon the model itself conforming to a purely mathematical standard. To show general application and immediacy of results, we measure $\gamma$ in 10 popular LLMs (ChatGPT, Claude-2.1, Claude3.0, GPT-4, GPT-4o, Smaug-72B, Mixtral-8x7B, Llama2-7B, Mistral-7B and MPT-7B) across thousands of queries in three objective domains: WebQA, ProgrammingQA, and TruthfulQA. Across all models and domains tested, human annotation confirms that $\gamma \to 0$ indicates trustworthiness, and conversely searching higher values of $\gamma$ easily exposes examples of hallucination, a fact that enables efficient adversarial prompt generation through stochastic gradient ascent in $\gamma$. The low-$\gamma$ leaders among the models in the respective domains are GPT-4o, GPT-4, and Smaug-72B, providing evidence that mid-size open-source models can win out against large commercial models.
- [494] arXiv:2405.00078 (replaced) [pdf, html, other]
-
Title: VeriFence: Lightweight and Precise Spectre Defenses for Untrusted Linux Kernel ExtensionsComments: RAID'24Subjects: Cryptography and Security (cs.CR); Operating Systems (cs.OS)
High-performance IO demands low-overhead communication between user- and kernel space. This demand can no longer be fulfilled by traditional system calls. Linux's extended Berkeley Packet Filter (BPF) avoids user-/kernel transitions by just-in-time compiling user-provided bytecode and executing it in kernel mode with near-native speed. To still isolate BPF programs from the kernel, they are statically analyzed for memory- and type-safety, which imposes some restrictions but allows for good expressiveness and high performance. However, to mitigate the Spectre vulnerabilities disclosed in 2018, defenses which reject potentially-dangerous programs had to be deployed. We find that this affects 31% to 54% of programs in a dataset with 844 real-world BPF programs from popular open-source projects. To solve this, users are forced to disable the defenses to continue using the programs, which puts the entire system at risk.
To enable secure and expressive untrusted Linux kernel extensions, we propose VeriFence, an enhancement to the kernel's Spectre defenses that reduces the number of BPF application programs rejected from 54% to zero. We measure VeriFence's overhead for all mainstream performance-sensitive applications of BPF (i.e., event tracing, profiling, and packet processing) and find that it improves significantly upon the status-quo where affected BPF programs are either unusable or enable transient execution attacks on the kernel. - [495] arXiv:2405.00223 (replaced) [pdf, html, other]
-
Title: Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and ExplorationSubjects: Human-Computer Interaction (cs.HC)
Confidence scores of automatic speech recognition (ASR) outputs are often inadequately communicated, preventing its seamless integration into analytical workflows. In this paper, we introduce ConFides, a visual analytic system developed in collaboration with intelligence analysts to address this issue. ConFides aims to aid exploration and post-AI-transcription editing by visually representing the confidence associated with the transcription. We demonstrate how our tool can assist intelligence analysts who use ASR outputs in their analytical and exploratory tasks and how it can help mitigate misinterpretation of crucial information. We also discuss opportunities for improving textual data cleaning and model transparency for human-machine collaboration.
- [496] arXiv:2405.00662 (replaced) [pdf, html, other]
-
Title: No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPOComments: ICML ARLET workshop version. Code and run histories are available at this https URLSubjects: Machine Learning (cs.LG)
Reinforcement learning (RL) is inherently rife with non-stationarity since the states and rewards the agent observes during training depend on its changing policy. Therefore, networks in deep RL must be capable of adapting to new observations and fitting new targets. However, previous works have observed that networks in off-policy deep value-based methods exhibit a decrease in representation rank, often correlated with an inability to continue learning or a collapse in performance. Although this phenomenon has generally been attributed to neural network learning under non-stationarity, it has been overlooked in on-policy policy optimization methods which are often thought capable of training indefinitely. In this work, we empirically study representation dynamics in Proximal Policy Optimization (PPO) on the Atari and MuJoCo environments, revealing that PPO agents are also affected by feature rank deterioration and loss of plasticity. We show that this is aggravated with stronger non-stationarity, ultimately driving the actor's performance to collapse, regardless of the performance of the critic. We ask why the trust region, specific to methods like PPO, cannot alleviate or prevent the collapse. We find that there is a connection between representation collapse and the degradation of the trust region, one exacerbating the other, and present Proximal Feature Optimization (PFO), a novel auxiliary loss that, along with other interventions, shows that regularizing the representation dynamics improves the performance of PPO agents.
- [497] arXiv:2405.01686 (replaced) [pdf, html, other]
-
Title: Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language ModelsComments: 25 pages, 7 figures, 6 tables, MLHC 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs -- including ones trained on biomedical texts -- perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.
- [498] arXiv:2405.07406 (replaced) [pdf, html, other]
-
Title: Machine Unlearning: A Comprehensive SurveySubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
As the right to be forgotten has been legislated worldwide, many studies attempt to design unlearning mechanisms to protect users' privacy when they want to leave machine learning service platforms. Specifically, machine unlearning is to make a trained model to remove the contribution of an erased subset of the training dataset. This survey aims to systematically classify a wide range of machine unlearning and discuss their differences, connections and open problems. We categorize current unlearning methods into four scenarios: centralized unlearning, distributed and irregular data unlearning, unlearning verification, and privacy and security issues in unlearning. Since centralized unlearning is the primary domain, we use two parts to introduce: firstly, we classify centralized unlearning into exact unlearning and approximate unlearning; secondly, we offer a detailed introduction to the techniques of these methods. Besides the centralized unlearning, we notice some studies about distributed and irregular data unlearning and introduce federated unlearning and graph unlearning as the two representative directions. After introducing unlearning methods, we review studies about unlearning verification. Moreover, we consider the privacy and security issues essential in machine unlearning and organize the latest related literature. Finally, we discuss the challenges of various unlearning scenarios and address the potential research directions.
- [499] arXiv:2405.07987 (replaced) [pdf, html, other]
-
Title: The Platonic Representation HypothesisSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
- [500] arXiv:2405.09597 (replaced) [pdf, other]
-
Title: When AI Eats Itself: On the Caveats of Data Pollution in the Era of Generative AIXiaodan Xing, Fadong Shi, Jiahao Huang, Yinzhe Wu, Yang Nan, Sheng Zhang, Yingying Fang, Mike Roberts, Carola-Bibiane Schönlieb, Javier Del Ser, Guang YangSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Generative artificial intelligence (AI) technologies and large models are producing realistic outputs across various domains, such as images, text, speech, and music. Creating these advanced generative models requires significant resources, particularly large and high-quality datasets. To minimize training expenses, many algorithm developers use data created by the models themselves as a cost-effective training solution. However, not all synthetic data effectively improve model performance, necessitating a strategic balance in the use of real versus synthetic data to optimize outcomes.
Currently, the previously well-controlled integration of real and synthetic data is becoming uncontrollable. The widespread and unregulated dissemination of synthetic data online leads to the contamination of datasets traditionally compiled through web scraping, now mixed with unlabeled synthetic data. This trend portends a future where generative AI systems may increasingly rely blindly on consuming self-generated data, raising concerns about model performance and ethical issues. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects?
There is a significant gap in the scientific literature regarding the impact of synthetic data use in generative AI, particularly in terms of the fusion of multimodal information. To address this research gap, this review investigates the consequences of integrating synthetic data blindly on training generative AI on both image and text modalities and explores strategies to mitigate these effects. The goal is to offer a comprehensive view of synthetic data's role, advocating for a balanced approach to its use and exploring practices that promote the sustainable development of generative AI technologies in the era of large models. - [501] arXiv:2405.14413 (replaced) [pdf, html, other]
-
Title: GeoFaaS: An Edge-to-Cloud FaaS PlatformComments: Accepted for publication in 12th IEEE International Conference on Cloud Engineering (IC2E 2024)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The massive growth of mobile and IoT devices demands geographically distributed computing systems for optimal performance, privacy, and scalability. However, existing edge-to-cloud serverless platforms lack location awareness, resulting in inefficient network usage and increased latency.
In this paper, we propose GeoFaaS, a novel edge-to-cloud Function-as-a-Service (FaaS) platform that leverages real-time client location information for transparent request execution on the nearest available FaaS node. If needed, GeoFaaS transparently offloads requests to the cloud when edge resources are overloaded, thus, ensuring consistent execution without user intervention. GeoFaaS has a modular and decentralized architecture: building on the single-node FaaS system tinyFaaS, GeoFaaS works as a stand-alone edge-to-cloud FaaS platform but can also integrate and act as a routing layer for existing FaaS services, e.g., in the cloud. To evaluate our approach, we implemented an open-source proof-of-concept prototype and studied performance and fault-tolerance behavior in experiments. - [502] arXiv:2405.15461 (replaced) [pdf, html, other]
-
Title: Optimal market-neutral currency trading on the cryptocurrency platformComments: 27 pages, 8 figures, 10 tablesSubjects: Computational Engineering, Finance, and Science (cs.CE); Mathematical Finance (q-fin.MF)
This research proposes a novel arbitrage approach with respect to multivariate pair trading called Optimal Trading Technique (OTT). We introduce the method to selectively form a "bucket" of fiat currencies anchored to cryptocurrency for simultaneously monitoring and exploiting trading opportunities. To handle the quantitative conflicts that arise when receiving multiple trading signals, a bi-objective convex optimization process is designed to cater to the investor's preference between profitability and risk tolerance. This process includes tunable parameters such as volatility punishment, action thresholds. During our experiments in the cryptocurrency market from 2020 to 2022 when the market was experiencing a vigorous bull-run immediately followed by a bear-run, the OTT realized an annualized profit of 15.49%. We further carried out the experiments in bull, bear, and full-cycle market conditions separately, and found that OTT is capable of achieving stable profit under various market conditions. Apart from the profitability side of the OTT, the arbitrage operation provides a new perspective of trading, which requires no external shorting and never hold intermediate cryptocurrency during the arbitrage period.
- [503] arXiv:2405.16639 (replaced) [pdf, html, other]
-
Title: A unified law of robustness for Bregman divergence lossesComments: 18 pagesSubjects: Machine Learning (cs.LG)
In contemporary deep learning practice, models are often trained to near zero loss i.e. to nearly interpolate the training data. However, the number of parameters in the model is usually far more than the number of data points $n$, the theoretical minimum needed for interpolation: a phenomenon referred to as overparameterization. In an interesting piece of work that contributes to the considerable research that has been devoted to understand overparameterization, Bubeck and Sellke showed that for a broad class of covariate distributions (specifically those satisfying a natural notion of concentration of measure), overparameterization is necessary for robust interpolation i.e. if the interpolating function is required to be Lipschitz. However, their robustness results were proved only in the setting of regression with square loss. In practice, however many other kinds of losses are used, e.g. cross entropy loss for classification. In this work, we generalize Bubeck and Selke's result to Bregman divergence losses, which form a common generalization of square loss and cross-entropy loss. Our generalization relies on identifying a bias-variance type decomposition that lies at the heart of the proof and Bubeck and Sellke.
- [504] arXiv:2405.16813 (replaced) [pdf, html, other]
-
Title: SiNGR: Brain Tumor Segmentation via Signed Normalized Geodesic Transform RegressionComments: Accepted as a conference paper at MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
One of the primary challenges in brain tumor segmentation arises from the uncertainty of voxels close to tumor boundaries. However, the conventional process of generating ground truth segmentation masks fails to treat such uncertainties properly. Those "hard labels" with 0s and 1s conceptually influenced the majority of prior studies on brain image segmentation. As a result, tumor segmentation is often solved through voxel classification. In this work, we instead view this problem as a voxel-level regression, where the ground truth represents a certainty mapping from any pixel to the border of the tumor. We propose a novel ground truth label transformation, which is based on a signed geodesic transform, to capture the uncertainty in brain tumors' vicinity. We combine this idea with a Focal-like regression L1-loss that enables effective regression learning in high-dimensional output space by appropriately weighting voxels according to their difficulty. We thoroughly conduct an experimental evaluation to validate the components of our proposed method, compare it to a diverse array of state-of-the-art segmentation models, and show that it is architecture-agnostic. The code of our method is made publicly available (\url{this https URL}).
- [505] arXiv:2405.16815 (replaced) [pdf, html, other]
-
Title: Image-level Regression for Uncertainty-aware Retinal Image SegmentationComments: 13 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate retinal vessel (RV) segmentation is a crucial step in the quantitative assessment of retinal vasculature, which is needed for the early detection of retinal diseases and other conditions. Numerous studies have been conducted to tackle the problem of segmenting vessels automatically using a pixel-wise classification approach. The common practice of creating ground truth labels is to categorize pixels as foreground and background. This approach is, however, biased, and it ignores the uncertainty of a human annotator when it comes to annotating e.g. thin vessels. In this work, we propose a simple and effective method that casts the RV segmentation task as an image-level regression. For this purpose, we first introduce a novel Segmentation Annotation Uncertainty-Aware (SAUNA) transform, which adds pixel uncertainty to the ground truth using the pixel's closeness to the annotation boundary and vessel thickness. To train our model with soft labels, we generalize the earlier proposed Jaccard metric loss to arbitrary hypercubes for soft Jaccard index (Intersection-over-Union) optimization. Additionally, we employ a stable version of the Focal-L1 loss for pixel-wise regression. We conduct thorough experiments and compare our method to a diverse set of baselines across 5 retinal image datasets. Our empirical results indicate that the integration of the SAUNA transform and these segmentation losses led to significant performance boosts for different segmentation models. Particularly, our methodology enables UNet-like architectures to substantially outperform computational-intensive baselines. Our implementation is available at \url{this https URL}.
- [506] arXiv:2405.20303 (replaced) [pdf, other]
-
Title: Truthful Budget Aggregation: Beyond Moving-Phantom MechanismsSubjects: Computer Science and Game Theory (cs.GT)
We study a budget-aggregation setting in which a number of voters report their ideal distribution of a budget over a set of alternatives, and a mechanism aggregates these reports into an allocation. Ideally, such mechanisms are truthful, i.e., voters should not be incentivized to misreport their preferences. For the case of two alternatives, the set of mechanisms that are truthful and additionally meet a range of basic desiderata (anonymity, neutrality, and continuity) exactly coincides with the so-called moving-phantom mechanisms, but whether this space is richer for more alternatives was repeatedly stated as an open question. We answer this question in the affirmative by presenting a class of truthful mechanisms that are not moving-phantoms but satisfy the three properties. Since moving-phantom mechanisms can only provide limited fairness guarantees (measured as the worst-case distance to a fair share solution), one motivation for broadening the class of truthful mechanisms is the hope for improved fairness guarantees. We dispel this hope by showing that lower bounds holding for the class of moving-phantom mechanisms extend to all truthful, anonymous, neutral, and continuous mechanisms.
- [507] arXiv:2406.03253 (replaced) [pdf, html, other]
-
Title: Generating Explanations for Cellular Neural NetworksSubjects: Machine Learning (cs.LG)
Recent advancements in graph learning contributed to explaining predictions generated by Graph Neural Networks. However, existing methodologies often fall short when applied to real-world datasets. We introduce HOGE, a framework to capture higher-order structures using cell complexes, which excel at modeling higher-order relationships. In the real world, higher-order structures are ubiquitous like in molecules or social networks, thus our work significantly enhances the practical applicability of graph explanations. HOGE produces clearer and more accurate explanations compared to prior methods. Our method can be integrated with all existing graph explainers, ensuring seamless integration into current frameworks. We evaluate on GraphXAI benchmark datasets, HOGE achieves improved or comparable performance with minimal computational overhead. Ablation studies show that the performance gain observed can be attributed to the higher-order structures that come from introducing cell complexes.
- [508] arXiv:2406.03411 (replaced) [pdf, html, other]
-
Title: Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play ApproachComments: ACL 2024 OralSubjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we primarily address the issue of dialogue-form context query within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on the information of retrieval candidate images in the current context. This approach mitigates the issues of noisiness and redundancy in the generated questions. Beyond our methodology, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of the interactive retrieval system. PlugIR demonstrates superior performance compared to both zero-shot and fine-tuned baselines in various benchmarks. Additionally, the two methodologies comprising PlugIR can be flexibly applied together or separately in various situations. Our codes are available at this https URL.
- [509] arXiv:2406.03556 (replaced) [pdf, html, other]
-
Title: Npix2Cpix: A GAN-based Image-to-Image Translation Network with Retrieval-Classification Integration for Watermark Retrieval from Historical Document ImagesJournal-ref: IEEE Access, vol. 12, pp. 95857-95870, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
The identification and restoration of ancient watermarks have long been a major topic in codicology and history. Classifying historical documents based on watermarks is challenging due to their diversity, noisy samples, multiple representation modes, and minor distinctions between classes and intra-class variations. This paper proposes a modified U-net-based conditional generative adversarial network (GAN) named Npix2Cpix to translate noisy raw historical watermarked images into clean, handwriting-free watermarked images by performing image translation from degraded (noisy) pixels to clean pixels. Using image-to-image translation and adversarial learning, the network creates clutter-free images for watermark restoration and categorization. The generator and discriminator of the proposed GAN are trained using two separate loss functions, each based on the distance between images, to learn the mapping from the input noisy image to the output clean image. After using the proposed GAN to pre-process noisy watermarked images, Siamese-based one-shot learning is employed for watermark classification. Experimental results on a large-scale historical watermark dataset demonstrate that cleaning the noisy watermarked images can help to achieve high one-shot classification accuracy. The qualitative and quantitative evaluation of the retrieved watermarked image highlights the effectiveness of the proposed approach.
- [510] arXiv:2406.05981 (replaced) [pdf, html, other]
-
Title: ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less ReparameterizationHaoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine LinSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at this https URL.
- [511] arXiv:2406.07368 (replaced) [pdf, html, other]
-
Title: When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language ModelsComments: Accepted by ICML 2024; 17 pages; 10 figures; 16 tablesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2$\times$ speedup during generation compared to prior linear attention methods. Codes and models are available at this https URL.
- [512] arXiv:2406.07709 (replaced) [pdf, html, other]
-
Title: Diagnosing and fixing common problems in Bayesian optimization for molecule designComments: 8 pages, 4 figures. ICML 2024 AI for science workshop (this https URL). Code at: this https URLSubjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Machine Learning (stat.ML)
Bayesian optimization (BO) is a principled approach to molecular design tasks. In this paper we explain three pitfalls of BO which can cause poor empirical performance: an incorrect prior width, over-smoothing, and inadequate acquisition function maximization. We show that with these issues addressed, even a basic BO setup is able to achieve the highest overall performance on the PMO benchmark for molecule design (Gao et al 2022). These results suggest that BO may benefit from more attention in the machine learning for molecules community.
- [513] arXiv:2406.08210 (replaced) [pdf, html, other]
-
Title: Expressivity and Generalization: Fragment-Biases for Molecular GNNsJournal-ref: International Conference on Machine Learning. 2024. OralSubjects: Machine Learning (cs.LG)
Although recent advances in higher-order Graph Neural Networks (GNNs) improve the theoretical expressiveness and molecular property predictive performance, they often fall short of the empirical performance of models that explicitly use fragment information as inductive bias. However, for these approaches, there exists no theoretic expressivity study. In this work, we propose the Fragment-WL test, an extension to the well-known Weisfeiler & Leman (WL) test, which enables the theoretic analysis of these fragment-biased GNNs. Building on the insights gained from the Fragment-WL test, we develop a new GNN architecture and a fragmentation with infinite vocabulary that significantly boosts expressiveness. We show the effectiveness of our model on synthetic and real-world data where we outperform all GNNs on Peptides and have 12% lower error than all GNNs on ZINC and 34% lower error than other fragment-biased models. Furthermore, we show that our model exhibits superior generalization capabilities compared to the latest transformer-based architectures, positioning it as a robust solution for a range of molecular modeling tasks.
- [514] arXiv:2406.09126 (replaced) [pdf, html, other]
-
Title: Auto-Vocabulary Segmentation for LiDAR PointsComments: Accepted by CVPR 2024 OpenSun3D WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV)
Existing perception methods for autonomous driving fall short of recognizing unknown entities not covered in the training data. Open-vocabulary methods offer promising capabilities in detecting any object but are limited by user-specified queries representing target classes. We propose AutoVoc3D, a framework for automatic object class recognition and open-ended segmentation. Evaluation on nuScenes showcases AutoVoc3D's ability to generate precise semantic classes and accurate point-wise segmentation. Moreover, we introduce Text-Point Semantic Similarity, a new metric to assess the semantic similarity between text and point cloud without eliminating novel classes.
- [515] arXiv:2406.09272 (replaced) [pdf, html, other]
-
Title: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric VideosComments: Project page: this https URL. ECCV 2024 camera-ready versionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds -- 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
- [516] arXiv:2406.12839 (replaced) [pdf, html, other]
-
Title: Evaluating the design space of diffusion-based generative modelsComments: Comments are welcome. Out of admiration we titled our paper after EDM, and hoped theorists' humor is not too cornySubjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Most existing theoretical investigations of the accuracy of diffusion models, albeit significant, assume the score function has been approximated to a certain accuracy, and then use this a priori bound to control the error of generation. This article instead provides a first quantitative understanding of the whole generation process, i.e., both training and sampling. More precisely, it conducts a non-asymptotic convergence analysis of denoising score matching under gradient descent. In addition, a refined sampling error analysis for variance exploding models is also provided. The combination of these two results yields a full error analysis, which elucidates (again, but this time theoretically) how to design the training and sampling processes for effective generation. For instance, our theory implies a preference toward noise distribution and loss weighting in training that qualitatively agree with the ones used in [Karras et al. 2022]. It also provides perspectives on the choices of time and variance schedules in sampling: when the score is well trained, the design in [Song et al. 2020] is more preferable, but when it is less trained, the design in [Karras et al. 2022] becomes more preferable.
- [517] arXiv:2406.14064 (replaced) [pdf, html, other]
-
Title: PAPR Reduction with Pre-chirp Selection for Affine Frequency Division MultiplexingSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Affine frequency division multiplexing (AFDM) is a promising new multicarrier technique for high-mobility communications based on discrete affine Fourier transform (DAFT). By properly tuning the pre-chirp parameter and the post-chirp parameter in the DAFT, the effective channel in the DAFT domain can completely circumvent path overlap, thereby constituting a full representation of delay-Doppler profile. However, AFDM has a crucial problem of high peak-to-average power ratio (PAPR), stemming from randomness of modulated symbols. In this letter, a novel algorithm named grouped pre-chirp selection (GPS) is proposed to reduce PAPR by strategically varying the pre-chirp parameter across subcarrier groups. Initially, it is established that key AFDM properties are maintained when implementing GPS. Next, we proceed to detail the operational procedures of the GPS algorithm, elucidating its principle for PAPR reduction and emphasizing its computational efficiency advantages. Finally, simulation results employing the complementary cumulative distribution function (CCDF) validate the effectiveness of the proposed GPS in reducing PAPR.
- [518] arXiv:2406.14549 (replaced) [pdf, html, other]
-
Title: Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Frontier AI systems are making transformative impacts across society, but such benefits are not without costs: models trained on web-scale datasets containing personal and private data raise profound concerns about data privacy and security. Language models are trained on extensive corpora including potentially sensitive or proprietary information, and the risk of data leakage - where the model response reveals pieces of such information - remains inadequately understood. Prior work has investigated what factors drive memorization and have identified that sequence complexity and the number of repetitions drive memorization. Here, we focus on the evolution of memorization over training. We begin by reproducing findings that the probability of memorizing a sequence scales logarithmically with the number of times it is present in the data. We next show that sequences which are apparently not memorized after the first encounter can be "uncovered" throughout the course of training even without subsequent encounters, a phenomenon we term "latent memorization". The presence of latent memorization presents a challenge for data privacy as memorized sequences may be hidden at the final checkpoint of the model but remain easily recoverable. To this end, we develop a diagnostic test relying on the cross entropy loss to uncover latent memorized sequences with high accuracy.
- [519] arXiv:2406.16087 (replaced) [pdf, html, other]
-
Title: Imperative Learning: A Self-supervised Neural-Symbolic Learning Framework for Robot AutonomyChen Wang, Kaiyi Ji, Junyi Geng, Zhongqiang Ren, Taimeng Fu, Fan Yang, Yifan Guo, Haonan He, Xiangyu Chen, Zitong Zhan, Qiwei Du, Shaoshu Su, Bowen Li, Yuheng Qiu, Yi Du, Qihang Li, Yifan Yang, Xiao Lin, Zhipeng ZhaoSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Data-driven methods such as reinforcement and imitation learning have achieved remarkable success in robot autonomy. However, their data-centric nature still hinders them from generalizing well to ever-changing environments. Moreover, collecting large datasets for robotic tasks is often impractical and expensive. To overcome these challenges, we introduce a new self-supervised neural-symbolic (NeSy) computational framework, imperative learning (IL), for robot autonomy, leveraging the generalization abilities of symbolic reasoning. The framework of IL consists of three primary components: a neural module, a reasoning engine, and a memory system. We formulate IL as a special bilevel optimization (BLO), which enables reciprocal learning over the three modules. This overcomes the label-intensive obstacles associated with data-driven approaches and takes advantage of symbolic reasoning concerning logical reasoning, physical principles, geometric analysis, etc. We discuss several optimization techniques for IL and verify their effectiveness in five distinct robot autonomy tasks including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. Through various experiments, we show that IL can significantly enhance robot autonomy capabilities and we anticipate that it will catalyze further research across diverse domains.
- [520] arXiv:2406.17249 (replaced) [pdf, html, other]
-
Title: SlideSLAM: Sparse, Lightweight, Decentralized Metric-Semantic SLAM for Multi-Robot NavigationXu Liu, Jiuzhou Lei, Ankit Prabhu, Yuezhan Tao, Igor Spasojevic, Pratik Chaudhari, Nikolay Atanasov, Vijay KumarComments: Xu Liu, Jiuzhou Lei, and Ankit Prabhu contributed equally to this work. This is a preliminary release and is subject to improvementSubjects: Robotics (cs.RO)
This paper develops a real-time decentralized metric-semantic Simultaneous Localization and Mapping (SLAM) approach that leverages a sparse and lightweight object-based representation to enable a heterogeneous robot team to autonomously explore 3D environments featuring indoor, urban, and forested areas without relying on GPS. We use a hierarchical metric-semantic representation of the environment, including high-level sparse semantic maps of object models and low-level voxel maps. We leverage the informativeness and viewpoint invariance of the high-level semantic map to obtain an effective semantics-driven place-recognition algorithm for inter-robot loop closure detection across aerial and ground robots with different sensing modalities. A communication module is designed to track each robot's own observations and those of other robots whenever communication links are available. Such observations are then used to construct a merged map. Our framework enables real-time decentralized operations onboard robots, allowing them to opportunistically leverage communication. We integrate and deploy our proposed framework on three types of aerial and ground robots. Extensive experimental results show an average inter-robot localization error of approximately 20 cm in position and 0.2 degrees in orientation, an object mapping F1 score consistently over 0.9, and a communication packet size of merely 2-3 megabytes per kilometer trajectory with as many as 1,000 landmarks. The project website can be found at this https URL.
- [521] arXiv:2406.17987 (replaced) [pdf, html, other]
-
Title: Multi-step Inference over Unstructured DataAditya Kalyanpur, Kailash Karthik Saravanakumar, Victor Barres, CJ McFate, Lori Moon, Nati Seifu, Maksim Eremeev, Jose Barrera, Abraham Bautista-Castillo, Eric Brown, David FerrucciSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The advent of Large Language Models (LLMs) and Generative AI has revolutionized natural language applications across various domains. However, high-stakes decision-making tasks in fields such as medical, legal and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or Retrieval-Augmented-Generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to tackle these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform, that is designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and demonstrates how Cora's neuro-symbolic approach effectively addresses these issues. We provide an overview of the system architecture, key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora's superior performance compared to well-known LLM and RAG baselines.
- [522] arXiv:2406.19094 (replaced) [pdf, html, other]
-
Title: Understanding the Security Benefits and Overheads of Emerging Industry Solutions to DRAM Read DisturbanceComments: To appear in DRAMSec 2024Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
We present the first rigorous security, performance, energy, and cost analyses of the state-of-the-art on-DRAM-die read disturbance mitigation method, Per Row Activation Counting (PRAC), described in JEDEC DDR5 specification's April 2024 update. Unlike prior state-of-the-art that advises the memory controller to periodically issue refresh management (RFM) commands, which provides the DRAM chip with time to perform refreshes, PRAC introduces a new back-off signal. PRAC's back-off signal propagates from the DRAM chip to the memory controller and forces the memory controller to 1) stop serving requests and 2) issue RFM commands. As a result, RFM commands are issued when needed as opposed to periodically, reducing RFM's overheads. We analyze PRAC in four steps. First, we define an adversarial access pattern that represents the worst-case for PRAC's security. Second, we investigate PRAC's configurations and security implications. Our analyses show that PRAC can be configured for secure operation as long as no bitflip occurs before accessing a memory location 10 times. Third, we evaluate the performance impact of PRAC and compare it against prior works using Ramulator 2.0. Our analysis shows that while PRAC incurs less than 13% performance overhead for today's DRAM chips, its performance overheads can reach up to 94% for future DRAM chips that are more vulnerable to read disturbance bitflips. Fourth, we define an availability adversarial access pattern that exacerbates PRAC's performance overhead to perform a memory performance attack, demonstrating that such an adversarial pattern can hog up to 94% of DRAM throughput and degrade system throughput by up to 95%. We discuss PRAC's implications on future systems and foreshadow future research directions. To aid future research, we open-source our implementations and scripts at this https URL.
- [523] arXiv:2406.19146 (replaced) [pdf, html, other]
-
Title: Resolving Discrepancies in Compute-Optimal Scaling of Language ModelsComments: Fixing bug in small models with tuned LRSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
- [524] arXiv:2406.19407 (replaced) [pdf, html, other]
-
Title: YOLOv10 to Its Genesis: A Decadal and Comprehensive Review of The You Only Look Once (YOLO) SeriesRanjan Sapkota, Rizwan Qureshi, Marco Flores Calero, Chetan Badjugar, Upesh Nepal, Alwin Poulose, Peter Zeno, Uday Bhanu Prakash Vaddevolu, Sheheryar Khan, Maged Shoman, Hong Yan, Manoj KarkeeComments: 11 Figures, 7 TablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
This review systematically examines the progression of the You Only Look Once (YOLO) object detection algorithms from YOLOv1 to the recently unveiled YOLOv10. Employing a reverse chronological analysis, this study examines the advancements introduced by YOLO algorithms, beginning with YOLOv10 and progressing through YOLOv9, YOLOv8, and subsequent versions to explore each version's contributions to enhancing speed, accuracy, and computational efficiency in real-time object detection. The study highlights the transformative impact of YOLO across five critical application areas: automotive safety, healthcare, industrial manufacturing, surveillance, and agriculture. By detailing the incremental technological advancements in subsequent YOLO versions, this review chronicles the evolution of YOLO, and discusses the challenges and limitations in each earlier versions. The evolution signifies a path towards integrating YOLO with multimodal, context-aware, and General Artificial Intelligence (AGI) systems for the next YOLO decade, promising significant implications for future developments in AI-driven applications.
- [525] arXiv:2407.01599 (replaced) [pdf, html, other]
-
Title: JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language ModelsComments: 45 pagesSubjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignment. This survey provides an extensive review of the emerging field of jailbreaking--deliberately circumventing the ethical and operational boundaries of LLMs and VLMs--and the consequent development of defense mechanisms. Our study categorizes jailbreaks into seven distinct types and elaborates on defense strategies that address these vulnerabilities. Through this comprehensive examination, we identify research gaps and propose directions for future studies to enhance the security frameworks of LLMs and VLMs. Our findings underscore the necessity for a unified perspective that integrates both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models. More details can be found on our website: \url{this https URL}.
- [526] arXiv:2407.03543 (replaced) [pdf, html, other]
-
Title: Asymmetric Mempool DoS Security: Formal Definitions and Provable Secure DesignsSubjects: Cryptography and Security (cs.CR)
The mempool plays a crucial role in blockchain systems as a buffer zone for pending transactions before they are executed and included in a block. However, existing works primarily focus on mitigating defenses against already identified real-world attacks. This paper introduces secure blockchain-mempool designs capable of defending against any form of asymmetric eviction DoS attacks. We establish formal security definitions for mempools under the eviction-based attack vector. Our proposed secure transaction admission algorithm, named \textsc{saferAd-CP}, ensures eviction-security by providing a provable lower bound on the cost of executing eviction DoS attacks. Through evaluation with real transaction trace replays, \textsc{saferAd-CP} demonstrates negligible latency and significantly high lower bounds against any eviction attack, highlighting its effectiveness and robustness in securing blockchain mempools.
- [527] arXiv:2407.05623 (replaced) [pdf, html, other]
-
Title: Momentum Auxiliary Network for Supervised Local LearningComments: Accepted by ECCV2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Deep neural networks conventionally employ end-to-end backpropagation for their training process, which lacks biological credibility and triggers a locking dilemma during network parameter updates, leading to significant GPU memory use. Supervised local learning, which segments the network into multiple local blocks updated by independent auxiliary networks. However, these methods cannot replace end-to-end training due to lower accuracy, as gradients only propagate within their local block, creating a lack of information exchange between blocks. To address this issue and establish information transfer across blocks, we propose a Momentum Auxiliary Network (MAN) that establishes a dynamic interaction mechanism. The MAN leverages an exponential moving average (EMA) of the parameters from adjacent local blocks to enhance information flow. This auxiliary network, updated through EMA, helps bridge the informational gap between blocks. Nevertheless, we observe that directly applying EMA parameters has certain limitations due to feature discrepancies among local blocks. To overcome this, we introduce learnable biases, further boosting performance. We have validated our method on four image classification datasets (CIFAR-10, STL-10, SVHN, ImageNet), attaining superior performance and substantial memory savings. Notably, our method can reduce GPU memory usage by more than 45\% on the ImageNet dataset compared to end-to-end training, while achieving higher performance. The Momentum Auxiliary Network thus offers a new perspective for supervised local learning. Our code is available at: this https URL.
- [528] arXiv:2407.05750 (replaced) [pdf, html, other]
-
Title: Large Language Models Understand LayoutSubjects: Computation and Language (cs.CL)
Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perceiving and reasoning, while a drastic performance drop is observed when the spatial markers from the original data are excluded. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data for pretraining, which is further enhanced at the instruction-tuning stage. In addition, layout understanding can be enhanced by integrating low-cost, auto-generated data approached by a novel text game. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
- [529] arXiv:2407.06023 (replaced) [pdf, html, other]
-
Title: Distilling System 2 into System 1Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps to produce better final responses. Since Chain-of-Thought (Wei et al., 2022), many such System 2 techniques have been proposed such as Rephrase and Respond (Deng et al., 2023a), System 2 Attention (Weston and Sukhbaatar, 2023) and Branch-Solve-Merge (Saha et al., 2023). In this work we investigate self-supervised methods to ``compile'' (distill) higher quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1. We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance, and with less inference cost than System 2. We posit that such System 2 distillation will be an important feature of future continually learning AI systems, enabling them to focus System 2 capabilities on the reasoning tasks that they cannot yet do well.
- [530] arXiv:2407.06581 (replaced) [pdf, html, other]
-
Title: Vision language models are blindSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they are surprisingly still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Claude 3.5 Sonnet performs the best at 74.01% accuracy, but this is still far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and recognizing geometric primitives that overlap or are close together. Code and data are available at: this https URL
- [531] arXiv:2407.06863 (replaced) [pdf, html, other]
-
Title: Beyond Aesthetics: Cultural Competence in Text-to-Image ModelsNithish Kannen, Arif Ahmad, Marco Andreetto, Vinodkumar Prabhakaran, Utsav Prabhu, Adji Bousso Dieng, Pushpak Bhattacharyya, Shachi DaveComments: 30 pages, 10 figures, preprintSubjects: Computer Vision and Pattern Recognition (cs.CV)
Text-to-Image (T2I) models are being increasingly adopted in diverse global communities where they create visual representations of their unique cultures. Current T2I benchmarks primarily focus on faithfulness, aesthetics, and realism of generated images, overlooking the critical dimension of cultural competence. In this work, we introduce a framework to evaluate cultural competence of T2I models along two crucial dimensions: cultural awareness and cultural diversity, and present a scalable approach using a combination of structured knowledge bases and large language models to build a large dataset of cultural artifacts to enable this evaluation. In particular, we apply this approach to build CUBE (CUltural BEnchmark for Text-to-Image models), a first-of-its-kind benchmark to evaluate cultural competence of T2I models. CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions and along 3 concepts: cuisine, landmarks, and art. CUBE consists of 1) CUBE-1K, a set of high-quality prompts that enable the evaluation of cultural awareness, and 2) CUBE-CSpace, a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity. We also introduce cultural diversity as a novel T2I evaluation component, leveraging quality-weighted Vendi score. Our evaluations reveal significant gaps in the cultural awareness of existing models across countries and provide valuable insights into the cultural diversity of T2I outputs for under-specified prompts. Our methodology is extendable to other cultural regions and concepts, and can facilitate the development of T2I models that better cater to the global population.
- [532] arXiv:2407.06966 (replaced) [pdf, other]
-
Title: Creating Centered Trochoids and Co-Centered Ellipses Through the Uniform Combinations of Rolling and Sliding Motions by Using Virtual Rotating Circles Technique (VRCT)Comments: This paper is subjected in the field of mathematical physics and includes 27 pages, 6 figures and 11 pictures. Also a number of videos are available to show the function of our novel mechanical oscilloscope for creating centered trochoids family and co-centered ellipses (changing eccentricity continuously)Subjects: Graphics (cs.GR)
In this article we present an innovative mental vision for creating uniform rolling and sliding motions with a definite combination for a circle that is moving along another coplanar circle. Also, we introduce two different methods for combining rolling and sliding motions through the VRCT in order to make a simple practical situation for creating centered trochoids and co-centered ellipses. In this article the traditional mathematical perspective for creating centered trochoids (through a solid rule which is based on the pure rolling a circle along another coplanar circle) is violated and changed to a novel mathematical perspective which is based on the combination of uniform rolling and sliding motions of a circle that is moving along another stationary circle! In this new vision we have not to define a centered trochoid as a swept path by an attached point to a pure rolling circle along another circle. Instead, a centered trochoid can be defined as a traced path by a certain point on the circumference of a pure rolling or rolling and sliding circle (with specific combination of rolling and sliding motions) along another coplanar circle. In our mathematical vision the physical concept of polarization plays an important role. Also, through this new mathematical perspective an ellipse can be visualized as a closed plane curve that can be generated through the pure rolling, pure sliding or rolling and sliding motions due to definite combinations of two co-polarized rotational motions with different commensurable angular frequencies! All of the above subjects can be implemented (and are observable) with the help of an innovative device that we have named it mechanical oscilloscope. The function of our invented instrument is independent from any other electronic devices such as computer and does not require programming to plot centered trochoids and co-centered ellipses.
- [533] arXiv:2407.07673 (replaced) [pdf, html, other]
-
Title: Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action LocalizationComments: Accepted to ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict conditions, but they typically assess classification and localization quality separately, leading to suboptimal pseudo-label ranking and selection. In particular, there might be inaccurate pseudo labels within selected positives, alongside reliable counterparts erroneously assigned to negatives. To tackle these problems, we propose a novel Adaptive Pseudo-label Learning (APL) framework to facilitate better pseudo-label selection. Specifically, to improve the ranking quality, Adaptive Label Quality Assessment (ALQA) is proposed to jointly learn classification confidence and localization reliability, followed by dynamically selecting pseudo labels based on the joint score. Additionally, we propose an Instance-level Consistency Discriminator (ICD) for eliminating ambiguous positives and mining potential positives simultaneously based on inter-instance intrinsic consistency, thereby leading to a more precise selection. We further introduce a general unsupervised Action-aware Contrastive Pre-training (ACP) to enhance the discrimination both within actions and between actions and backgrounds, which benefits SS-TAL. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate that our method achieves state-of-the-art performance under various semi-supervised settings.
- [534] arXiv:2407.09503 (replaced) [pdf, html, other]
-
Title: PARSE-Ego4D: Personal Action Recommendation Suggestions for Egocentric VideosSteven Abreu, Tiffany D. Do, Karan Ahuja, Eric J. Gonzalez, Lee Payne, Daniel McDuff, Mar Gonzalez-FrancoSubjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
Intelligent assistance involves not only understanding but also action. Existing ego-centric video datasets contain rich annotations of the videos, but not of actions that an intelligent assistant could perform in the moment. To address this gap, we release PARSE-Ego4D, a new set of personal action recommendation annotations for the Ego4D dataset. We take a multi-stage approach to generating and evaluating these annotations. First, we used a prompt-engineered large language model (LLM) to generate context-aware action suggestions and identified over 18,000 action suggestions. While these synthetic action suggestions are valuable, the inherent limitations of LLMs necessitate human evaluation. To ensure high-quality and user-centered recommendations, we conducted a large-scale human annotation study that provides grounding in human preferences for all of PARSE-Ego4D. We analyze the inter-rater agreement and evaluate subjective preferences of participants. Based on our synthetic dataset and complete human annotations, we propose several new tasks for action suggestions based on ego-centric videos. We encourage novel solutions that improve latency and energy requirements. The annotations in PARSE-Ego4D will support researchers and developers who are working on building action recommendation systems for augmented and virtual reality systems.
- [535] arXiv:2407.10172 (replaced) [pdf, html, other]
-
Title: Restoring Images in Adverse Weather Conditions via Histogram TransformerComments: 19 pages, 7 figures, 10MBSubjects: Computer Vision and Pattern Recognition (cs.CV)
Transformer-based image restoration methods in adverse weather have achieved significant progress. Most of them use self-attention along the channel dimension or within spatially fixed-range blocks to reduce computational load. However, such a compromise results in limitations in capturing long-range spatial features. Inspired by the observation that the weather-induced degradation factors mainly cause similar occlusion and brightness, in this work, we propose an efficient Histogram Transformer (Histoformer) for restoring images affected by adverse weather. It is powered by a mechanism dubbed histogram self-attention, which sorts and segments spatial features into intensity-based bins. Self-attention is then applied across bins or within each bin to selectively focus on spatial features of dynamic range and process similar degraded pixels of the long range together. To boost histogram self-attention, we present a dynamic-range convolution enabling conventional convolution to conduct operation over similar pixels rather than neighbor pixels. We also observe that the common pixel-wise losses neglect linear association and correlation between output and ground-truth. Thus, we propose to leverage the Pearson correlation coefficient as a loss function to enforce the recovered pixels following the identical order as ground-truth. Extensive experiments demonstrate the efficacy and superiority of our proposed method. We have released the codes in Github.
- [536] arXiv:2407.10499 (replaced) [pdf, html, other]
-
Title: CIBench: Evaluating Your LLMs with a Code Interpreter PluginSongyang Zhang, Chuyu Zhang, Yingfan Hu, Haowen Shen, Kuikun Liu, Zerun Ma, Fengzhe Zhou, Wenwei Zhang, Xuming He, Dahua Lin, Kai ChenComments: Under review. The first three authors contribute equally, and Songyang Zhang is the project leaderSubjects: Computation and Language (cs.CL)
While LLM-Based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability is challenging, thereby hindering a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks. Our evaluation framework includes an evaluation dataset and two evaluation modes. The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions. The two evaluation modes assess LLMs' ability with and without human assistance. We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
- [537] arXiv:2407.10959 (replaced) [pdf, other]
-
Title: A unified theory and statistical learning approach for traffic conflict detectionComments: 21 pages, 9 figures, prepared for submissionSubjects: Robotics (cs.RO); Machine Learning (stat.ML)
This study proposes a unified theory and statistical learning approach for traffic conflict detection, addressing the long-existing call for a consistent and comprehensive methodology to evaluate the collision risk emerging in road user interactions. The proposed theory assumes context-dependent probabilistic collision risk and frames conflict detection as assessing this risk by statistical learning of extreme events in daily interactions. Experiments using real-world trajectory data are conducted in this study, where a unified metric of conflict is trained with lane-changing interactions on German highways and applied to near-crash events from the 100-Car Naturalistic Driving Study in the U.S. Results of the experiments demonstrate that the trained metric provides effective collision warnings, generalises across distinct datasets and traffic environments, covers a broad range of conflicts, and delivers a long-tailed distribution of conflict intensity. Reflecting on these results, the unified theory ensures consistent evaluation by a generic formulation that encompasses varying assumptions of traffic conflicts; the statistical learning approach then enables a comprehensive consideration of influencing factors such as motion states of road users, environment conditions, and participant characteristics. Therefore, the theory and learning approach jointly provide an explainable and adaptable methodology for conflict detection among different road users and across various interaction scenarios. This promises to reduce accidents and improve overall traffic safety, by enhanced safety assessment of traffic infrastructures, more effective collision warning systems for autonomous driving, and a deeper understanding of road user behaviour in different traffic conditions.
- [538] arXiv:2407.10979 (replaced) [pdf, html, other]
-
Title: Diffusion Model-based Incentive Mechanism with Prospect Theory for Edge AIGC Services in 6G IoTJinbo Wen, Jiangtian Nie, Yue Zhong, Changyan Yi, Xiaohuan Li, Jiangming Jin, Yang Zhang, Dusit NiyatoSubjects: Networking and Internet Architecture (cs.NI)
The fusion of the Internet of Things (IoT) with Sixth-Generation (6G) technology has significant potential to revolutionize the IoT landscape. With the ultra-reliable and low-latency communication capabilities of 6G, 6G-IoT networks can transmit high-quality and diverse data to enhance edge learning. Artificial Intelligence-Generated Content (AIGC) harnesses advanced AI algorithms to automatically generate various types of content. The emergence of edge AIGC integrates with edge networks, facilitating real-time provision of customized AIGC services by deploying AIGC models on edge devices. However, the current practice of edge devices as AIGC Service Providers (ASPs) lacks incentives, hindering the sustainable provision of high-quality edge AIGC services amidst information asymmetry. In this paper, we develop a user-centric incentive mechanism framework for edge AIGC services in 6G-IoT networks. Specifically, we first propose a contract theory model for incentivizing ASPs to provide AIGC services to clients. Recognizing the irrationality of clients towards personalized AIGC services, we utilize Prospect Theory (PT) to capture their subjective utility better. Furthermore, we adopt the diffusion-based soft actor-critic algorithm to generate the optimal contract design under PT, outperforming traditional deep reinforcement learning algorithms. Our numerical results demonstrate the effectiveness of the proposed scheme.
- [539] arXiv:2407.11055 (replaced) [pdf, html, other]
-
Title: Knowledge boosting during low-latency inferenceComments: Accepted by Interspeech 2024Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running on-device. However, this incurs a communication delay that breaks real-time requirements and does not guarantee that both models will operate on the same data at the same time. We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference, while still boosting small model performance. Using a streaming neural network that processes 8 ms chunks, we evaluate different speech separation and enhancement tasks with communication delays of up to six chunks or 48 ms. Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration for low-latency applications. Code, dataset, and audio samples available at this https URL.
- [540] arXiv:2407.11358 (replaced) [pdf, html, other]
-
Title: SES: Bridging the Gap Between Explainability and Prediction of Graph Neural NetworksComments: Accepted as a conference paper at ICDE 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Despite the Graph Neural Networks' (GNNs) proficiency in analyzing graph data, achieving high-accuracy and interpretable predictions remains challenging. Existing GNN interpreters typically provide post-hoc explanations disjointed from GNNs' predictions, resulting in misrepresentations. Self-explainable GNNs offer built-in explanations during the training process. However, they cannot exploit the explanatory outcomes to augment prediction performance, and they fail to provide high-quality explanations of node features and require additional processes to generate explainable subgraphs, which is costly. To address the aforementioned limitations, we propose a self-explained and self-supervised graph neural network (SES) to bridge the gap between explainability and prediction. SES comprises two processes: explainable training and enhanced predictive learning. During explainable training, SES employs a global mask generator co-trained with a graph encoder and directly produces crucial structure and feature masks, reducing time consumption and providing node feature and subgraph explanations. In the enhanced predictive learning phase, mask-based positive-negative pairs are constructed utilizing the explanations to compute a triplet loss and enhance the node representations by contrastive learning.
- [541] arXiv:2407.11467 (replaced) [pdf, html, other]
-
Title: TEXasGAN: Tactile Texture Exploration and Synthesis System Using Generative Adversarial NetworkSubjects: Human-Computer Interaction (cs.HC)
To create more realistic experiences in human-virtual object interactions, texture rendering has become a research hotspot in recent years. Different frequency components of designed vibrations can activate texture-related sensations due to similar receptors. However, designing specific vibrations for numerous real-world materials is impractical. Therefore, this study proposes a human-in-the-loop vibration generation model based on user preferences. To enable users to easily control the generation of vibration samples with large parameter spaces, we introduce an optimization model based on Differential Subspace Search (DSS) and Generative Adversarial Network (GAN). With DSS, users can use a one-dimensional slider to easily modify the high-dimensional latent space so that the GAN can generate desired vibrations. We trained the generative model using a open dataset of tactile vibration data and selected five types of vibrations as target samples for the generation experiment. Extensive user experiments were conducted using the generated and real samples. The results indicate that our system can generate distinguishable samples that match the target characteristics. Moreover, the results also reveal a correlation between subjects' ability to distinguish real samples and their ability to distinguish generated samples.
- [542] arXiv:2407.11652 (replaced) [pdf, other]
-
Title: CCVA-FL: Cross-Client Variations Adaptive Federated Learning for Medical ImagingComments: 10 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Federated Learning (FL) offers a privacy-preserving approach to train models on decentralized data. Its potential in healthcare is significant, but challenges arise due to cross-client variations in medical image data, exacerbated by limited annotations. This paper introduces Cross-Client Variations Adaptive Federated Learning (CCVA-FL) to address these issues. CCVA-FL aims to minimize cross-client variations by transforming images into a common feature space. It involves expert annotation of a subset of images from each client, followed by the selection of a client with the least data complexity as the target. Synthetic medical images are then generated using Scalable Diffusion Models with Transformers (DiT) based on the target client's annotated images. These synthetic images, capturing diversity and representing the original data, are shared with other clients. Each client then translates its local images into the target image space using image-to-image translation. The translated images are subsequently used in a federated learning setting to develop a server model. Our results demonstrate that CCVA-FL outperforms Vanilla Federated Averaging by effectively addressing data distribution differences across clients without compromising privacy.
- [543] arXiv:2407.11686 (replaced) [pdf, html, other]
-
Title: CCoE: A Compact LLM with Collaboration of ExpertsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
In the domain of Large Language Model (LLM), LLMs demonstrate significant capabilities in natural language understanding and generation. With the growing needs of applying LLMs on various domains, it is a research question that how to efficiently train and build a model that has expertise in different domains but with a low training cost. We propose CCoE architecture, a framework of easily coupling multiple strong domain experts together to fuse into a big LLM, provides a collective way of utilizing the different domain expert LLMs. Besides, training a large collaborative of multiple expert LLMs requires a high requirements on training sources. CCoE bypasses this problem through isolating other experts and train each expert separately. The design of CCoE assembles multiple expert LLMs through the CoE (Collaboration of Experts) layer. Each CoE layer could have one or more expert LLMs. Expert LLMs have different number of layers and have been well-trained for different domain tasks. Each expert is fine-tuned to be able to achieve the comparable results with SOTA domain LLMs. We start from 5 experts in the domain of Code, Math, Law, text-to-SQL and Medical. The results indicate that our CCoE framework can easily and efficiently boost nearly 10%-20% performance on original base model in different domains but using less resources on training, as well as inference.
- [544] arXiv:2407.12066 (replaced) [pdf, html, other]
-
Title: Temporally Grounding Instructional Diagrams in Unconstrained VideosJiahao Zhang, Frederic Z. Zhang, Cristian Rodriguez, Yizhak Ben-Shabat, Anoop Cherian, Stephen GouldSubjects: Computer Vision and Pattern Recognition (cs.CV)
We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries such as the general mutual exclusiveness and the temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, thus harming the accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing up the visual content features of the step diagrams and a fixed number of learnable positional embeddings. Our insight is that self-attention among composite queries carrying different content features suppress each other to reduce timespan overlaps in predictions, while the cross-attention corrects the temporal misalignment via content and position joint guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while simultaneously grounding multiple queries.
- [545] arXiv:2407.12197 (replaced) [pdf, html, other]
-
Title: Towards Interpretable Visuo-Tactile Predictive Models for Soft Robot InteractionsComments: IEEE RAS EMBS 10th International Conference on Biomedical Robotics and Biomechatronics (BioRob 2024)Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Autonomous systems face the intricate challenge of navigating unpredictable environments and interacting with external objects. The successful integration of robotic agents into real-world situations hinges on their perception capabilities, which involve amalgamating world models and predictive skills. Effective perception models build upon the fusion of various sensory modalities to probe the surroundings. Deep learning applied to raw sensory modalities offers a viable option. However, learning-based perceptive representations become difficult to interpret. This challenge is particularly pronounced in soft robots, where the compliance of structures and materials makes prediction even harder. Our work addresses this complexity by harnessing a generative model to construct a multi-modal perception model for soft robots and to leverage proprioceptive and visual information to anticipate and interpret contact interactions with external objects. A suite of tools to interpret the perception model is furnished, shedding light on the fusion and prediction processes across multiple sensory inputs after the learning phase. We will delve into the outlooks of the perception model and its implications for control purposes.
- [546] arXiv:2407.12212 (replaced) [pdf, html, other]
-
Title: Generalized Coverage for More Robust Low-Budget Active LearningComments: Accepted to ECCV2024Subjects: Machine Learning (cs.LG)
The ProbCover method of Yehuda et al. is a well-motivated algorithm for active learning in low-budget regimes, which attempts to "cover" the data distribution with balls of a given radius at selected data points. We demonstrate, however, that the performance of this algorithm is extremely sensitive to the choice of this radius hyper-parameter, and that tuning it is quite difficult, with the original heuristic frequently failing. We thus introduce (and theoretically motivate) a generalized notion of "coverage," including ProbCover's objective as a special case, but also allowing smoother notions that are far more robust to hyper-parameter choice. We propose an efficient greedy method to optimize this coverage, generalizing ProbCover's algorithm; due to its close connection to kernel herding, we call it "MaxHerding." The objective can also be optimized non-greedily through a variant of $k$-medoids, clarifying the relationship to other low-budget active learning methods. In comprehensive experiments, MaxHerding surpasses existing active learning methods across multiple low-budget image classification benchmarks, and does so with less computational cost than most competitive methods.
- [547] arXiv:2407.12736 (replaced) [pdf, html, other]
-
Title: CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer InferenceSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), introduces considerable challenges. These challenges stem primarily from the non-linear calculations and high computational and memory demands of ViTs. This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on the FPGAs in order to maximize performance. Our framework is built upon three fundamental contributions: multi-kernel design to maximize the bandwidth, mainly targeting benefits of multi DDR memory banks, approximate non-linear functions that exhibit minimal accuracy degradation, and efficient use of available logic blocks on the FPGA, and efficient compiler to maximize the performance and memory-efficiency of the computing kernels by presenting a novel algorithm for design space exploration to find optimal hardware configuration that achieves optimal throughput and latency. Compared to the state-of-the-art ViT accelerators, CHOSEN achieves a 1.5x and 1.42x improvement in the throughput on the DeiT-S and DeiT-B models.
- [548] arXiv:2407.12835 (replaced) [pdf, html, other]
-
Title: Regurgitative Training: The Value of Real Data in Training Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. First, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered training process where high-quality data are added before low-quality ones. Second, we combine data generated by multiple different LLMs (as an attempt to increase lexical diversity). Third, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in the order of resemblance to human-generated data. All three strategies can improve the performance of regurgitative training to some extent but are not always able to fully close the gap from training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data.
- [549] arXiv:2407.13675 (replaced) [pdf, html, other]
-
Title: MeshSegmenter: Zero-Shot Mesh Semantic Segmentation via Texture SynthesisComments: The paper was accepted by ECCV2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present MeshSegmenter, a simple yet effective framework designed for zero-shot 3D semantic segmentation. This model successfully extends the powerful capabilities of 2D segmentation models to 3D meshes, delivering accurate 3D segmentation across diverse meshes and segment descriptions. Specifically, our model leverages the Segment Anything Model (SAM) model to segment the target regions from images rendered from the 3D shape. In light of the importance of the texture for segmentation, we also leverage the pretrained stable diffusion model to generate images with textures from 3D shape, and leverage SAM to segment the target regions from images with textures. Textures supplement the shape for segmentation and facilitate accurate 3D segmentation even in geometrically non-prominent areas, such as segmenting a car door within a car mesh. To achieve the 3D segments, we render 2D images from different views and conduct segmentation for both textured and untextured images. Lastly, we develop a multi-view revoting scheme that integrates 2D segmentation results and confidence scores from various views onto the 3D mesh, ensuring the 3D consistency of segmentation results and eliminating inaccuracies from specific perspectives. Through these innovations, MeshSegmenter offers stable and reliable 3D segmentation results both quantitatively and qualitatively, highlighting its potential as a transformative tool in the field of 3D zero-shot segmentation. The code is available at \url{this https URL}.
- [550] arXiv:2407.13759 (replaced) [pdf, html, other]
-
Title: Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video DiffusionComments: *Equal Contributions; Fixed few duplicated references from 1st upload; Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We present a method for generating Streetscapes-long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data-posed imagery from Google Street View, along with contextual map data-which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at this https URL.
- [551] arXiv:2407.13842 (replaced) [pdf, html, other]
-
Title: Language-Driven 6-DoF Grasp Detection Using Negative Prompt GuidanceComments: Accepted at ECCV 2024Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
6-DoF grasp detection has been a fundamental and challenging problem in robotic vision. While previous works have focused on ensuring grasp stability, they often do not consider human intention conveyed through natural language, hindering effective collaboration between robots and users in complex 3D environments. In this paper, we present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds. We first introduce Grasp-Anything-6D, a large-scale dataset for the language-driven 6-DoF grasp detection task with 1M point cloud scenes and more than 200M language-associated 3D grasp poses. We further introduce a novel diffusion model that incorporates a new negative prompt guidance learning strategy. The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones given the language input. Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language. Intensive experimental results show the effectiveness of our method in both benchmarking experiments and real-world scenarios, surpassing other baselines. In addition, we demonstrate the practicality of our approach in real-world robotic applications. Our project is available at this https URL.
- [552] arXiv:2407.14207 (replaced) [pdf, html, other]
-
Title: Longhorn: State Space Models are Amortized Online LearnersSubjects: Machine Learning (cs.LG)
The most fundamental capability of modern AI methods such as Large Language Models (LLMs) is the ability to predict the next token in a long sequence of tokens, known as ``sequence modeling." Although the Transformers model is the current dominant approach to sequence modeling, its quadratic computational cost with respect to sequence length is a significant drawback. State-space models (SSMs) offer a promising alternative due to their linear decoding efficiency and high parallelizability during training. However, existing SSMs often rely on seemingly ad hoc linear recurrence designs. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from optimizing these objectives. Based on this insight, we introduce a novel deep SSM architecture based on the implicit update for optimizing an online regression objective. Our experimental results show that our models outperform state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks and language modeling tasks.
- [553] arXiv:2407.14242 (replaced) [pdf, html, other]
-
Title: Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing ImagesComments: Accepted in ACMMM 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Continual learning (CL) breaks off the one-way training manner and enables a model to adapt to new data, semantics and tasks continuously. However, current CL methods mainly focus on single tasks. Besides, CL models are plagued by catastrophic forgetting and semantic drift since the lack of old data, which often occurs in remote-sensing interpretation due to the intricate fine-grained semantics. In this paper, we propose Continual Panoptic Perception (CPP), a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception for universal interpretation in remote sensing images. Concretely, we propose a collaborative cross-modal encoder (CCE) to extract the input image features, which supports pixel classification and caption generation synchronously. To inherit the knowledge from the old model without exemplar memory, we propose a task-interactive knowledge distillation (TKD) method, which leverages cross-modal optimization and task-asymmetric pseudo-labeling (TPL) to alleviate catastrophic forgetting. Furthermore, we also propose a joint optimization mechanism to achieve end-to-end multi-modal panoptic perception. Experimental results on the fine-grained panoptic perception dataset validate the effectiveness of the proposed model, and also prove that joint optimization can boost sub-task CL efficiency with over 13\% relative improvement on panoptic quality.
- [554] arXiv:2407.15267 (replaced) [pdf, html, other]
-
Title: A Learning-Based Attack Framework to Break SOTA Poisoning Defenses in Federated LearningYuxin Yang (1 and 2), Qiang Li (1), Chenfei Nie (1), Yuan Hong (3), Meng Pang (4), Binghui Wang (2) ((1) College of Computer Science and Technology, Jilin University, (2) Illinois Institute of Technology, (3) University of Connecticut, (4) Nanchang University)Comments: This is an extended version of our CIKM 2024 paperSubjects: Cryptography and Security (cs.CR)
Federated Learning (FL) is a novel client-server distributed learning framework that can protect data privacy. However, recent works show that FL is vulnerable to poisoning attacks. Many defenses with robust aggregators (AGRs) are proposed to mitigate the issue, but they are all broken by advanced attacks. Very recently, some renewed robust AGRs are designed, typically with novel clipping or/and filtering strate-gies, and they show promising defense performance against the advanced poisoning attacks. In this paper, we show that these novel robust AGRs are also vulnerable to carefully designed poisoning attacks. Specifically, we observe that breaking these robust AGRs reduces to bypassing the clipping or/and filtering of malicious clients, and propose an optimization-based attack framework to leverage this observation. Under the framework, we then design the customized attack against each robust AGR. Extensive experiments on multiple datasets and threat models verify our proposed optimization-based attack can break the SOTA AGRs. We hence call for novel defenses against poisoning attacks to FL. Code is available at: this https URL BreakSTOAPoisoningDefenses.
- [555] arXiv:2407.15396 (replaced) [pdf, html, other]
-
Title: Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph GenerationComments: ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The scene graph generation (SGG) task involves detecting objects within an image and predicting predicates that represent the relationships between the objects. However, in SGG benchmark datasets, each subject-object pair is annotated with a single predicate even though a single predicate may exhibit diverse semantics (i.e., semantic diversity), existing SGG models are trained to predict the one and only predicate for each pair. This in turn results in the SGG models to overlook the semantic diversity that may exist in a predicate, thus leading to biased predictions. In this paper, we propose a novel model-agnostic Semantic Diversity-aware Prototype-based Learning (DPL) framework that enables unbiased predictions based on the understanding of the semantic diversity of predicates. Specifically, DPL learns the regions in the semantic space covered by each predicate to distinguish among the various different semantics that a single predicate can represent. Extensive experiments demonstrate that our proposed model-agnostic DPL framework brings significant performance improvement on existing SGG models, and also effectively understands the semantic diversity of predicates.
- [556] arXiv:2407.15612 (replaced) [pdf, other]
-
Title: Can GPT-4 learn to analyze moves in research article abstracts?Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
One of the most powerful and enduring ideas in written discourse analysis is that genres can be described in terms of the moves which structure a writer's purpose. Considerable research has sought to identify these distinct communicative acts, but analyses have been beset by problems of subjectivity, reliability and the time-consuming need for multiple coders to confirm analyses. In this paper we employ the affordances of GPT-4 to automate the annotation process by using natural language prompts. Focusing on abstracts from articles in four applied linguistics journals, we devise prompts which enable the model to identify moves effectively. The annotated outputs of these prompts were evaluated by two assessors with a third addressing disagreements. The results show that an 8-shot prompt was more effective than one using two, confirming that the inclusion of examples illustrating areas of variability can enhance GPT-4's ability to recognize multiple moves in a single sentence and reduce bias related to textual position. We suggest that GPT-4 offers considerable potential in automating this annotation process, when human actors with domain specific linguistic expertise inform the prompting process.
- [557] arXiv:2407.15706 (replaced) [pdf, html, other]
-
Title: Multi-Modality Co-Learning for Efficient Skeleton-based Action RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: this https URL.
- [558] arXiv:2407.15894 (replaced) [pdf, html, other]
-
Title: Craft: Cross-modal Aligned Features Improve Robustness of Prompt TuningComments: 15pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Prompt Tuning has emerged as a prominent research paradigm for adapting vision-language models to various downstream tasks. However, recent research indicates that prompt tuning methods often lead to overfitting due to limited training samples. In this paper, we propose a Cross-modal Aligned Feature Tuning (Craft) method to address this issue. Cross-modal alignment is conducted by first selecting anchors from the alternative domain and deriving relative representations of the embeddings for the selected anchors. Optimizing for a feature alignment loss over anchor-aligned text and image modalities creates a more unified text-image common space. Overfitting in prompt tuning also deteriorates model performance on out-of-distribution samples. To further improve the prompt model's robustness, we propose minimizing Maximum Mean Discrepancy (MMD) over the anchor-aligned feature spaces to mitigate domain shift. The experiment on four different prompt tuning structures consistently shows the improvement of our method, with increases of up to $6.1\%$ in the Base-to-Novel generalization task, $5.8\%$ in the group robustness task, and $2.7\%$ in the out-of-distribution tasks. The code will be available at this https URL
- [559] arXiv:2407.15899 (replaced) [pdf, html, other]
-
Title: Spatial-Temporal Cross-View Contrastive Pre-training for Check-in Sequence Representation LearningLetian Gong, Huaiyu Wan, Shengnan Guo, Xiucheng Li, Yan Lin, Erwen Zheng, Tianyi Wang, Zeyu Zhou, Youfang LinComments: This paper has been accepted as a regular paper at IEEE TKDESubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The rapid growth of location-based services (LBS) has yielded massive amounts of data on human mobility. Effectively extracting meaningful representations for user-generated check-in sequences is pivotal for facilitating various downstream services. However, the user-generated check-in data are simultaneously influenced by the surrounding objective circumstances and the user's subjective intention. Specifically, the temporal uncertainty and spatial diversity exhibited in check-in data make it difficult to capture the macroscopic spatial-temporal patterns of users and to understand the semantics of user mobility activities. Furthermore, the distinct characteristics of the temporal and spatial information in check-in sequences call for an effective fusion method to incorporate these two types of information. In this paper, we propose a novel Spatial-Temporal Cross-view Contrastive Representation (STCCR) framework for check-in sequence representation learning. Specifically, STCCR addresses the above challenges by employing self-supervision from "spatial topic" and "temporal intention" views, facilitating effective fusion of spatial and temporal information at the semantic level. Besides, STCCR leverages contrastive clustering to uncover users' shared spatial topics from diverse mobility activities, while employing angular momentum contrast to mitigate the impact of temporal uncertainty and noise. We extensively evaluate STCCR on three real-world datasets and demonstrate its superior performance across three downstream tasks.
- [560] arXiv:2407.15900 (replaced) [pdf, html, other]
-
Title: Improving probabilistic forecasts of extreme wind speeds by training statistical post-processing models with weighted scoring rulesSubjects: Machine Learning (cs.LG); Applications (stat.AP)
Accurate forecasts of extreme wind speeds are of high importance for many applications. Such forecasts are usually generated by ensembles of numerical weather prediction (NWP) models, which however can be biased and have errors in dispersion, thus necessitating the application of statistical post-processing techniques. In this work we aim to improve statistical post-processing models for probabilistic predictions of extreme wind speeds. We do this by adjusting the training procedure used to fit ensemble model output statistics (EMOS) models - a commonly applied post-processing technique - and propose estimating parameters using the so-called threshold-weighted continuous ranked probability score (twCRPS), a proper scoring rule that places special emphasis on predictions over a threshold. We show that training using the twCRPS leads to improved extreme event performance of post-processing models for a variety of thresholds. We find a distribution body-tail trade-off where improved performance for probabilistic predictions of extreme events comes with worse performance for predictions of the distribution body. However, we introduce strategies to mitigate this trade-off based on weighted training and linear pooling. Finally, we consider some synthetic experiments to explain the training impact of the twCRPS and derive closed-form expressions of the twCRPS for a number of distributions, giving the first such collection in the literature. The results will enable researchers and practitioners alike to improve the performance of probabilistic forecasting models for extremes and other events of interest.
- [561] arXiv:2407.16445 (replaced) [pdf, html, other]
-
Title: Can time series forecasting be automated? A benchmark and analysisSubjects: Machine Learning (cs.LG)
In the field of machine learning and artificial intelligence, time series forecasting plays a pivotal role across various domains such as finance, healthcare, and weather. However, the task of selecting the most suitable forecasting method for a given dataset is a complex task due to the diversity of data patterns and characteristics. This research aims to address this challenge by proposing a comprehensive benchmark for evaluating and ranking time series forecasting methods across a wide range of datasets. This study investigates the comparative performance of many methods from two prominent time series forecasting frameworks, AutoGluon-Timeseries, and sktime to shed light on their applicability in different real-world scenarios. This research contributes to the field of time series forecasting by providing a robust benchmarking methodology and facilitating informed decision-making when choosing forecasting methods for achieving optimal prediction.
- [562] arXiv:2407.16470 (replaced) [pdf, html, other]
-
Title: Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language ModelsKenza Benkirane, Laura Gongas, Shahar Pelles, Naomi Fuchs, Joshua Darmon, Pontus Stenetorp, David Ifeoluwa Adelani, Eduardo SánchezComments: Authors Kenza Benkirane and Laura Gongas contributed equally to this workSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs, LRLs, with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable or even better than previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
- [563] arXiv:2407.16607 (replaced) [pdf, html, other]
-
Title: Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?Comments: 19 pages, 5 figuresSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent to which tokenizer training data is representative of the pretraining data, we indirectly learn about pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
- [564] arXiv:2407.16911 (replaced) [pdf, other]
-
Title: Certified simultaneous isotopic approximation of curves via subdivisionComments: This paper was intended to be a new version of the paper arXiv:2302.04908Subjects: Computational Geometry (cs.CG); Algebraic Geometry (math.AG)
We present a certified algorithm based on subdivision for computing an isotopic approximation to any number of curves in the plane. Our algorithm is based on the certified curve approximation algorithm of Plantinga and Vegter. The main challenge in this algorithm is to correctly and efficiently identify and isolate all intersections between the curves. To overcome this challenge, we introduce a new and simple test that guarantees the global correctness of our output. A main step in our algorithm for approximating any number of curves is to correctly approximate a pair of curves. In addition to developing the details of this special case, we provide complexity analyses for both the number of steps and the bit-complexity of this algorithm using both worst-case bounds as well as those based on continuous amortization.
- [565] arXiv:2407.16958 (replaced) [pdf, html, other]
-
Title: Cheems: Wonderful Matrices More Efficient and More Effective ArchitectureSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Recent studies have shown that, relative position encoding performs well in selective state space model scanning algorithms, and the architecture that balances SSM and Attention enhances the efficiency and effectiveness of the algorithm, while the sparse activation of the mixture of experts reduces the training cost. I studied the effectiveness of using different position encodings in structured state space dual algorithms, and the more effective SSD-Attn internal and external function mixing method, and designed a more efficient cross domain mixture of experts. I found that the same matrix is very wonderful in different algorithms, which allows us to establish a new hybrid sparse architecture: Cheems. Compared with other hybrid architectures, it is more efficient and more effective in language modeling tasks.
- [566] arXiv:2407.17028 (replaced) [pdf, other]
-
Title: Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside WasteSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Environmental monitoring of lakeside green areas is crucial for environmental protection. Compared to manual inspections, computer vision technologies offer a more efficient solution when deployed on-site. Multispectral imaging provides diverse information about objects under different spectrums, aiding in the differentiation between waste and lakeside lawn environments. This study introduces WasteMS, the first multispectral dataset established for the semantic segmentation of lakeside waste. WasteMS includes a diverse range of waste types in lawn environments, captured under various lighting conditions. We implemented a rigorous annotation process to label waste in images. Representative semantic segmentation frameworks were used to evaluate segmentation accuracy using WasteMS. Challenges encountered when using WasteMS for segmenting waste on lakeside lawns were discussed. The WasteMS dataset is available at this https URL.
- [567] arXiv:2407.17053 (replaced) [pdf, html, other]
-
Title: Automated Code-centric Software Vulnerability Assessment: How Far Are We? An Empirical Study in C/C++Comments: Accepted as a full paper in the technical track at The International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024Subjects: Software Engineering (cs.SE)
Background: The C and C++ languages hold significant importance in Software Engineering research because of their widespread use in practice. Numerous studies have utilized Machine Learning (ML) and Deep Learning (DL) techniques to detect software vulnerabilities (SVs) in the source code written in these languages. However, the application of these techniques in function-level SV assessment has been largely unexplored. SV assessment is increasingly crucial as it provides detailed information on the exploitability, impacts, and severity of security defects, thereby aiding in their prioritization and remediation. Aims: We conduct the first empirical study to investigate and compare the performance of ML and DL models, many of which have been used for SV detection, for function-level SV assessment in C/C++. Method: Using 9,993 vulnerable C/C++ functions, we evaluated the performance of six multi-class ML models and five multi-class DL models for the SV assessment at the function level based on the Common Vulnerability Scoring System (CVSS). We further explore multi-task learning, which can leverage common vulnerable code to predict all SV assessment outputs simultaneously in a single model, and compare the effectiveness and efficiency of this model type with those of the original multi-class models. Results: We show that ML has matching or even better performance compared to the multi-class DL models for function-level SV assessment with significantly less training time. Employing multi-task learning allows the DL models to perform significantly better, with an average of 8-22% increase in Matthews Correlation Coefficient (MCC). Conclusions: We distill the practices of using data-driven techniques for function-level SV assessment in C/C++, including the use of multi-task DL to balance efficiency and effectiveness. This can establish a strong foundation for future work in this area.
- [568] arXiv:2407.17075 (replaced) [pdf, html, other]
-
Title: SAFETY-J: Evaluating Safety with CritiqueSubjects: Computation and Language (cs.CL)
The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an iterative preference learning technique to dynamically refine safety assessments based on meta-evaluations and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we open-source SAFETY-J's training protocols, datasets, and code at \url{this https URL}.
- [569] arXiv:2407.17092 (replaced) [pdf, html, other]
-
Title: Universal Approximation of Dynamical Systems by Semi-Autonomous Neural ODEs and ApplicationsComments: 22 pages, 16 pagesSubjects: Numerical Analysis (math.NA)
In this paper, we introduce semi-autonomous neural ordinary differential equations (SA-NODEs), a variation of the vanilla NODEs, employing fewer parameters. We investigate the universal approximation properties of SA-NODEs for dynamical systems from both a theoretical and a numerical perspective. Within the assumption of a finite-time horizon, under general hypotheses we establish an asymptotic approximation result, demonstrating that the error vanishes as the number of parameters goes to infinity. Under additional regularity assumptions, we further specify this convergence rate in relation to the number of parameters, utilizing quantitative approximation results in the Barron space. Based on the previous result, we prove an approximation rate for transport equations by their neural counterparts. Our numerical experiments validate the effectiveness of SA-NODEs in capturing the dynamics of various ODE systems and transport equations. Additionally, we compare SA-NODEs with vanilla NODEs, highlighting the superior performance and reduced complexity of our approach.
- [570] arXiv:2407.17125 (replaced) [pdf, html, other]
-
Title: Behavioral Testing: Can Large Language Models Implicitly Resolve Ambiguous Entities?Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. In this paper, we focus on entity type ambiguity and analyze current state-of-the-art LLMs for their proficiency and consistency in applying their factual knowledge when prompted for entities under ambiguity. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 entities. Our experiments reveal that LLMs perform poorly with ambiguous prompts, achieving only 80% accuracy. Our results further demonstrate systematic discrepancies in LLM behavior and their failure to consistently apply information, indicating that the models can exhibit knowledge without being able to utilize it, significant biases for preferred readings, as well as self inconsistencies. Our study highlights the importance of handling entity ambiguity in future for more trustworthy LLMs
- [571] arXiv:2407.17170 (replaced) [pdf, html, other]
-
Title: Domain Generalized Recaptured Screen Image Identification Using SWIN TransformerComments: 11 pages, 10 figures, 9 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
An increasing number of classification approaches have been developed to address the issue of image rebroadcast and recapturing, a standard attack strategy in insurance frauds, face spoofing, and video piracy. However, most of them neglected scale variations and domain generalization scenarios, performing poorly in instances involving domain shifts, typically made worse by inter-domain and cross-domain scale variances. To overcome these issues, we propose a cascaded data augmentation and SWIN transformer domain generalization framework (DAST-DG) in the current research work Initially, we examine the disparity in dataset representation. A feature generator is trained to make authentic images from various domains indistinguishable. This process is then applied to recaptured images, creating a dual adversarial learning setup. Extensive experiments demonstrate that our approach is practical and surpasses state-of-the-art methods across different databases. Our model achieves an accuracy of approximately 82\% with a precision of 95\% on high-variance datasets.
- [572] arXiv:2407.17229 (replaced) [pdf, html, other]
-
Title: LPGen: Enhancing High-Fidelity Landscape Painting Generation through Diffusion ModelSubjects: Computer Vision and Pattern Recognition (cs.CV)
Generating landscape paintings expands the possibilities of artistic creativity and imagination. Traditional landscape painting methods involve using ink or colored ink on rice paper, which requires substantial time and effort. These methods are susceptible to errors and inconsistencies and lack precise control over lines and colors. This paper presents LPGen, a high-fidelity, controllable model for landscape painting generation, introducing a novel multi-modal framework that integrates image prompts into the diffusion model. We extract its edges and contours by computing canny edges from the target landscape image. These, along with natural language text prompts and drawing style references, are fed into the latent diffusion model as conditions. We implement a decoupled cross-attention strategy to ensure compatibility between image and text prompts, facilitating multi-modal image generation. A decoder generates the final image. Quantitative and qualitative analyses demonstrate that our method outperforms existing approaches in landscape painting generation and exceeds the current state-of-the-art. The LPGen network effectively controls the composition and color of landscape paintings, generates more accurate images, and supports further research in deep learning-based landscape painting generation.
- [573] arXiv:2407.17258 (replaced) [pdf, html, other]
-
Title: A New Scalar Auxiliary Variable Approach for Gradient FlowsSubjects: Numerical Analysis (math.NA)
The scalar auxiliary variable (SAV) approach is a highly efficient method widely used for solving gradient flow systems. This approach offers several advantages, including linearity, unconditional energy stability, and ease of implementation. By introducing scalar auxiliary variables, a modified system that is equivalent to the original system is constructed at the continuous level. However, during temporal discretization, computational errors can lead to a loss of equivalence and accuracy. In this paper, we introduce a new Constant Scalar Auxiliary Variable (CSAV) approach in which we derive an Ordinary Differential Equation (ODE) for the constant scalar auxiliary variable r. We also introduce a stabilization parameter ({\alpha}) to improve the stability of the scheme by slowing down the dynamics of r. The CSAV approach provides additional benefits as well. We explicitly discretize the auxiliary variable in combination with the nonlinear term, enabling the solution of a single linear system with constant coefficients at each time step. This new approach also eliminates the need for assumptions about the free energy potential, removing the bounded-from-below restriction imposed by the nonlinear free energy potential in the original SAV approach. Finally, we validate the proposed method through extensive numerical simulations, demonstrating its effectiveness and accuracy.
- [574] arXiv:2407.17276 (replaced) [pdf, html, other]
-
Title: SoK: Bridging Trust into the Blockchain. A Systematic Review on On-Chain IdentitySubjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
The ongoing regulation of blockchain-based services and applications requires the identification of users who are issuing transactions on the blockchain. This systematic review explores the current status, identifies research gaps, and outlines future research directions for establishing trusted and privacy-compliant identities on the blockchain (on-chain identity). A systematic search term was applied across various scientific databases, collecting 2232 potentially relevant research papers. These papers were narrowed down in two methodologically executed steps to 98 and finally to 13 relevant sources. The relevant articles were then systematically analyzed based on a set of screening questions. The results of the selected studies have provided insightful findings on the mechanisms of on-chain identities. On-chain identities are established using zero-knowledge proofs, public key infrastructure/certificates, and web of trust approaches. The technologies and architectures used by the authors are also highlighted. Trust has emerged as a key research gap, manifesting in two ways: firstly, a gap in how to trust the digital identity representation of a physical human; secondly, a gap in how to trust identity providers that issue identity confirmations on-chain. Potential future research avenues are suggested to help fill the current gaps in establishing trust and on-chain identities.
- [575] arXiv:2407.17310 (replaced) [pdf, html, other]
-
Title: LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume RenderingSubjects: Computer Vision and Pattern Recognition (cs.CV)
The 3D occupancy estimation task has become an important challenge in the area of vision-based autonomous driving recently. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability. Moreover, most methods are tied to a predefined set of classes which they can detect. In this work we present a novel approach for open vocabulary occupancy estimation called LangOcc, that is trained only via camera images, and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, allowing for a straight-forward and powerful training method without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy by a large margin, solely relying on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, thus demonstrating the effectiveness of our proposed vision-language training.
- [576] arXiv:2407.17449 (replaced) [pdf, html, other]
-
Title: Looking at Model Debiasing through the Lens of Anomaly DetectionComments: 15 pages, 7 figuresSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
It is widely recognized that deep neural networks are sensitive to bias in the data. This means that during training these models are likely to learn spurious correlations between data and labels, resulting in limited generalization abilities and low performance. In this context, model debiasing approaches can be devised aiming at reducing the model's dependency on such unwanted correlations, either leveraging the knowledge of bias information or not. In this work, we focus on the latter and more realistic scenario, showing the importance of accurately predicting the bias-conflicting and bias-aligned samples to obtain compelling performance in bias mitigation. On this ground, we propose to conceive the problem of model bias from an out-of-distribution perspective, introducing a new bias identification method based on anomaly detection. We claim that when data is mostly biased, bias-conflicting samples can be regarded as outliers with respect to the bias-aligned distribution in the feature space of a biased model, thus allowing for precisely detecting them with an anomaly detection method. Coupling the proposed bias identification approach with bias-conflicting data upsampling and augmentation in a two-step strategy, we reach state-of-the-art performance on synthetic and real benchmark datasets. Ultimately, our proposed approach shows that the data bias issue does not necessarily require complex debiasing methods, given that an accurate bias identification procedure is defined.
- [577] arXiv:1810.13431 (replaced) [pdf, html, other]
-
Title: Targeted stochastic gradient Markov chain Monte Carlo for hidden Markov models with rare latent statesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Markov chain Monte Carlo (MCMC) algorithms for hidden Markov models often rely on the forward-backward sampler. This makes them computationally slow as the length of the time series increases, motivating the development of sub-sampling-based approaches. These approximate the full posterior by using small random subsequences of the data at each MCMC iteration within stochastic gradient MCMC. In the presence of imbalanced data resulting from rare latent states, subsequences often exclude rare latent state data, leading to inaccurate inference and prediction/detection of rare events. We propose a targeted sub-sampling (TASS) approach that over-samples observations corresponding to rare latent states when calculating the stochastic gradient of parameters associated with them. TASS uses an initial clustering of the data to construct subsequence weights that reduce the variance in gradient estimation. This leads to improved sampling efficiency, in particular in settings where the rare latent states correspond to extreme observations. We demonstrate substantial gains in predictive and inferential accuracy on real and synthetic examples.
- [578] arXiv:2108.08158 (replaced) [pdf, html, other]
-
Title: Practical X-ray Gastric Cancer Screening Using Refined Stochastic Data Augmentation and Hard Boundary Box TrainingComments: 19 pages, 6 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Endoscopy is widely used to diagnose gastric cancer and has a high diagnostic performance, but because it must be performed by a physician, the number of people who can be diagnosed is limited. Gastric X-ray, on the other hand, can be performed by technicians and can screen a much larger number of patients than endoscopy, but its correct diagnosis requires experience. We propose an unprecedented and practical gastric cancer diagnosis support system for gastric X-ray images, which will enable more people to be screened. The system is based on a general deep learning-based object detection model and includes two novel technical proposals: refined probabilistic stomach image augmentation (R-sGAIA) and hard boundary box learning (HBBT). R-sGAIA is a probabilistic gastric fold region enhancement method that provides more learning patterns for cancer detection models. HBBT is an efficient training method for object detection models that allows the use of unannotated negative (i.e., healthy control) samples that cannot be used for training in conventional detection models, thereby improving model performance. The sensitivity (SE) of the proposed system for gastric cancer (90.2%) is higher than that of the expert (85.5%), and two out of five candidates detected box are cancerous, achieving a high precision while maintaining a high processing speed of 0.51 seconds/image. The proposed system showed 5.9 points higher on the F1 score compared to methods using the same object detection model and state-of-the-art data augmentation. In short, the system quickly and efficiently shows the radiologist where to look, greatly reducing the radiologist's workload.
- [579] arXiv:2201.07794 (replaced) [pdf, html, other]
-
Title: A Non-Expert's Introduction to Data Ethics for MathematiciansComments: A few more small tweaks. This is a book chapter. It is associated with my data-ethics lecture at the 2021 AMS Short Course on Mathematical and Computational Methods for Complex Social SystemsSubjects: History and Overview (math.HO); Computers and Society (cs.CY); Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Machine Learning (stat.ML)
I give a short introduction to data ethics. I begin with some background information and societal context for data ethics. I then discuss data ethics in mathematical-science education and indicate some available course material. I briefly highlight a few efforts -- at my home institution and elsewhere -- on data ethics, society, and social good. I then discuss open data in research, research replicability and some other ethical issues in research, and the tension between privacy and open data and code, and a few controversial studies and reactions to studies. I then discuss ethical principles, institutional review boards, and a few other considerations in the scientific use of human data. I then briefly survey a variety of research and lay articles that are relevant to data ethics and data privacy. I conclude with a brief summary and some closing remarks.
My focal audience is mathematicians, but I hope that this chapter will also be useful to others. I am not an expert about data ethics, and this chapter provides only a starting point on this wide-ranging topic. I encourage you to examine the resources that I discuss and to reflect carefully on data ethics, its role in mathematics education, and the societal implications of data and data analysis. As data and technology continue to evolve, I hope that such careful reflection will continue throughout your life. - [580] arXiv:2202.08728 (replaced) [pdf, html, other]
-
Title: Nonparametric extensions of randomized response for private confidence setsComments: 50 pages, 7 figures, to appear in the 2023 International Conference on Machine Learning with an Oral PresentationSubjects: Methodology (stat.ME); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
This work derives methods for performing nonparametric, nonasymptotic statistical inference for population means under the constraint of local differential privacy (LDP). Given bounded observations $(X_1, \dots, X_n)$ with mean $\mu^\star$ that are privatized into $(Z_1, \dots, Z_n)$, we present confidence intervals (CI) and time-uniform confidence sequences (CS) for $\mu^\star$ when only given access to the privatized data. To achieve this, we study a nonparametric and sequentially interactive generalization of Warner's famous ``randomized response'' mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. For example, our results yield private analogues of Hoeffding's inequality in both fixed-time and time-uniform regimes. We extend these Hoeffding-type CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests.
- [581] arXiv:2210.10729 (replaced) [pdf, html, other]
-
Title: Ruminations on Matrix Convexity and the Strong Subadditivity of Quantum EntropyComments: The published version of this paper includes a local mistake in Eq. (5.7). Posted here are: i) the corresponding Erratum & Addendum (LMP 2024), and ii) a corrected version of the paper with the minor adjustments spelled in (i)Subjects: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Mathematical Physics (math-ph)
The familiar second derivative test for convexity, combined with resolvent calculus, is shown to yield a useful tool for the study of convex matrix-valued functions. We demonstrate the applicability of this approach on a number of theorems in this field. These include convexity principles which play an essential role in the Lieb-Ruskai proof of the strong subadditivity of quantum entropy.
- [582] arXiv:2302.13536 (replaced) [pdf, html, other]
-
Title: Natural Gradient Hybrid Variational Inference with Application to Deep Mixed ModelsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Stochastic models with global parameters and latent variables are common, and for which variational inference (VI) is popular. However, existing methods are often either slow or inaccurate in high dimensions. We suggest a fast and accurate VI method for this case that employs a well-defined natural gradient variational optimization that targets the joint posterior of the global parameters and latent variables. It is a hybrid method, where at each step the global parameters are updated using the natural gradient and the latent variables are generated from their conditional posterior. A fast to compute expression for the Tikhonov damped Fisher information matrix is used, along with the re-parameterization trick, to provide a stable natural gradient. We apply the approach to deep mixed models, which are an emerging class of Bayesian neural networks with random output layer coefficients to allow for heterogeneity. A range of simulations show that using the natural gradient is substantially more efficient than using the ordinary gradient, and that the approach is faster and more accurate than two cutting-edge natural gradient VI methods. In a financial application we show that accounting for industry level heterogeneity using the deep mixed model improves the accuracy of asset pricing models. MATLAB code to implement the method can be found at: this https URL.
- [583] arXiv:2303.05679 (replaced) [pdf, other]
-
Title: Clustering with minimum spanning trees: How good can it be?Journal-ref: Journal of Classification, 2024Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.
- [584] arXiv:2305.14275 (replaced) [pdf, html, other]
-
Title: Variational Inference with Coverage Guarantees in Simulation-Based InferenceSubjects: Methodology (stat.ME); Machine Learning (cs.LG)
Amortized variational inference is an often employed framework in simulation-based inference that produces a posterior approximation that can be rapidly computed given any new observation. Unfortunately, there are few guarantees about the quality of these approximate posteriors. We propose Conformalized Amortized Neural Variational Inference (CANVI), a procedure that is scalable, easily implemented, and provides guaranteed marginal coverage. Given a collection of candidate amortized posterior approximators, CANVI constructs conformalized predictors based on each candidate, compares the predictors using a metric known as predictive efficiency, and returns the most efficient predictor. CANVI ensures that the resulting predictor constructs regions that contain the truth with a user-specified level of probability. CANVI is agnostic to design decisions in formulating the candidate approximators and only requires access to samples from the forward model, permitting its use in likelihood-free settings. We prove lower bounds on the predictive efficiency of the regions produced by CANVI and explore how the quality of a posterior approximation relates to the predictive efficiency of prediction regions based on that approximation. Finally, we demonstrate the accurate calibration and high predictive efficiency of CANVI on a suite of simulation-based inference benchmark tasks and an important scientific task: analyzing galaxy emission spectra.
- [585] arXiv:2306.14596 (replaced) [pdf, html, other]
-
Title: Deep Learning for Cancer Prognosis Prediction Using Portrait Photos by StyleGAN EmbeddingAmr Hagag, Ahmed Gomaa, Dominik Kornek, Andreas Maier, Rainer Fietkau, Christoph Bert, Florian Putz, Yixing HuangComments: MICCAI 2024 Early AcceptSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Survival prediction for cancer patients is critical for optimal treatment selection and patient management. Current patient survival prediction methods typically extract survival information from patients' clinical record data or biological and imaging data. In practice, experienced clinicians can have a preliminary assessment of patients' health status based on patients' observable physical appearances, which are mainly facial features. However, such assessment is highly subjective. In this work, the efficacy of objectively capturing and using prognostic information contained in conventional portrait photographs using deep learning for survival predication purposes is investigated for the first time. A pre-trained StyleGAN2 model is fine-tuned on a custom dataset of our cancer patients' photos to empower its generator with generative ability suitable for patients' photos. The StyleGAN2 is then used to embed the photographs to its highly expressive latent space. Utilizing the state-of-the-art survival analysis models and based on StyleGAN's latent space photo embeddings, this approach achieved a C-index of 0.677, which is notably higher than chance and evidencing the prognostic value embedded in simple 2D facial images. In addition, thanks to StyleGAN's interpretable latent space, our survival prediction model can be validated for relying on essential facial features, eliminating any biases from extraneous information like clothing or background. Moreover, a health attribute is obtained from regression coefficients, which has important potential value for patient care.
- [586] arXiv:2306.17557 (replaced) [pdf, html, other]
-
Title: The most likely common causeComments: added some clarifications, removed typosSubjects: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The common cause principle for two random variables $A$ and $B$ is examined in the case of causal insufficiency, when their common cause $C$ is known to exist, but only the joint probability of $A$ and $B$ is observed. As a result, $C$ cannot be uniquely identified (the latent confounder problem). We show that the generalized maximum likelihood method can be applied to this situation and allows identification of $C$ that is consistent with the common cause principle. It closely relates to the maximum entropy principle. Investigation of the two binary symmetric variables reveals a non-analytic behavior of conditional probabilities reminiscent of a second-order phase transition. This occurs during the transition from correlation to anti-correlation in the observed probability distribution. The relation between the generalized likelihood approach and alternative methods, such as predictive likelihood and the minimum common cause entropy, is discussed. The consideration of the common cause for three observed variables (and one hidden cause) uncovers causal structures that defy representation through directed acyclic graphs with the Markov condition.
- [587] arXiv:2308.01582 (replaced) [pdf, html, other]
-
Title: Quantum speedups for stochastic optimizationSubjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
We consider the problem of minimizing a continuous function given quantum access to a stochastic gradient oracle. We provide two new methods for the special case of minimizing a Lipschitz convex function. Each method obtains a dimension versus accuracy trade-off which is provably unachievable classically and we prove that one method is asymptotically optimal in low-dimensional settings. Additionally, we provide quantum algorithms for computing a critical point of a smooth non-convex function at rates not known to be achievable classically. To obtain these results we build upon the quantum multivariate mean estimation result of Cornelissen et al. 2022 and provide a general quantum-variance reduction technique of independent interest.
- [588] arXiv:2309.06679 (replaced) [pdf, html, other]
-
Title: Robust experimental data assimilation for the Spalart-Allmaras turbulence modelSubjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
This study presents a methodology focusing on the use of computational model and experimental data fusion to improve the Spalart-Allmaras (SA) closure model for Reynolds-averaged Navier-Stokes solutions. In particular, our goal is to develop a technique that not only assimilates sparse experimental data to improve turbulence model performance, but also preserves generalization for unseen cases by recovering classical SA behavior. We achieve our goals using data assimilation, namely the Ensemble Kalman filtering approach (EnKF), to calibrate the coefficients of the SA model for separated flows. A holistic calibration strategy is implemented via the parameterization of the production, diffusion, and destruction terms. This calibration relies on the assimilation of experimental data collected in the form of velocity profiles, skin friction, and pressure coefficients. Despite using observational data from a single flow condition around a backward-facing step (BFS), the recalibrated SA model demonstrates generalization to other separated flows, including cases such as the 2D NASA wall mounted hump (2D-WMH) and modified BFS. Significant improvement is observed in the quantities of interest, i.e., skin friction coefficient ($C_f$) and pressure coefficient ($C_p$) for each flow tested. Finally, it is also demonstrated that the newly proposed model recovers SA proficiency for flows, such as a NACA-0012 airfoil and axisymmetric jet (ASJ), and that the individually calibrated terms in the SA model target specific flow-physics wherein the calibrated production term improves the re-circulation zone while destruction improves the recovery zone.
- [589] arXiv:2310.09149 (replaced) [pdf, html, other]
-
Title: Wasserstein approximation schemes based on Voronoi partitionsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
We consider structured approximation of measures in Wasserstein space $\mathrm{W}_p(\mathbb{R}^d)$ for $p\in[1,\infty)$ using general measure approximants compactly supported on Voronoi regions derived from a scaled Voronoi partition of $\mathbb{R}^d$. We show that if a full rank lattice $\Lambda$ is scaled by a factor of $h\in(0,1]$, then approximation of a measure based on the Voronoi partition of $h\Lambda$ is $O(h)$ regardless of $d$ or $p$. We then use a covering argument to show that $N$-term approximations of compactly supported measures is $O(N^{-\frac1d})$ which matches known rates for optimal quantizers and empirical measure approximation in most instances. Additionally, we generalize our construction to nonuniform Voronoi partitions, highlighting the flexibility and robustness of our approach for various measure approximation scenarios. Finally, we extend these results to noncompactly supported measures with sufficient decay. Our findings are pertinent to applications in computer vision and machine learning where measures are used to represent structured data such as images.
- [590] arXiv:2310.10619 (replaced) [pdf, html, other]
-
Title: Shortest-path recovery from signature with an optimal control approachComments: 34 pages, 5 figuresSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Numerical Analysis (math.NA)
In this paper, we consider the signature-to-path reconstruction problem from the control theoretic perspective. Namely, we design an optimal control problem whose solution leads to the minimal-length path that generates a given signature. In order to do that, we minimize a cost functional consisting of two competing terms, i.e., a weighted final-time cost combined with the $L^2$-norm squared of the controls. Moreover, we can show that, by taking the limit to infinity of the parameter that tunes the final-time cost, the problem $\Gamma$-converges to the problem of finding a sub-Riemannian geodesic connecting two signatures. Finally, we provide an alternative reformulation of the latter problem, which is particularly suitable for the numerical implementation.
- [591] arXiv:2311.01940 (replaced) [pdf, html, other]
-
Title: Balanced independent sets and colorings of hypergraphsComments: 14 pagesSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
A $k$-uniform hypergraph $H = (V, E)$ is $k$-partite if $V$ can be partitioned into $k$ sets $V_1, \ldots, V_k$ such that every edge in $E$ contains precisely one vertex from each $V_i$. We call such a graph $n$-balanced if $|V_i| = n$ for each $i$. An independent set $I$ in $H$ is balanced if $|I\cap V_i| = |I\cap V_j|$ for each $1 \leq i, j \leq k$, and a coloring is balanced if each color class induces a balanced independent set in $H$. In this paper, we provide a lower bound on the balanced independence number $\alpha_b(H)$ in terms of the average degree $D = |E|/n$, and an upper bound on the balanced chromatic number $\chi_b(H)$ in terms of the maximum degree $\Delta$. Our results recover those of recent work of Chakraborti for $k = 2$.
- [592] arXiv:2402.01089 (replaced) [pdf, html, other]
-
Title: No Free Prune: Information-Theoretic Barriers to Pruning at InitializationSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The existence of "lottery tickets" arXiv:1803.03635 at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model ("pruning at initialization") have been broadly unsuccessful arXiv:2009.08576. We put forward a theoretical explanation for this, based on the model's effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show the Law of Robustness of arXiv:2105.12806 extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsities, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist, but cannot be found fast (i.e. without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.
- [593] arXiv:2402.02290 (replaced) [pdf, html, other]
-
Title: Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and PythonComments: 36 pages, 9 figuresSubjects: Computation (stat.CO); Machine Learning (cs.LG); Mathematical Software (cs.MS); Applications (stat.AP); Machine Learning (stat.ML)
We introduce the QuadratiK package that incorporates innovative data analysis methodologies. The presented software, implemented in both R and Python, offers a comprehensive set of goodness-of-fit tests and clustering techniques using kernel-based quadratic distances, thereby bridging the gap between the statistical and machine learning literatures. Our software implements one, two and k-sample tests for goodness of fit, providing an efficient and mathematically sound way to assess the fit of probability distributions. Expanded capabilities of our software include supporting tests for uniformity on the d-dimensional Sphere based on Poisson kernel densities. Particularly noteworthy is the incorporation of a unique clustering algorithm specifically tailored for spherical data that leverages a mixture of Poisson kernel-based densities on the sphere. Alongside this, our software includes additional graphical functions, aiding the users in validating, as well as visualizing and representing clustering results. This enhances interpretability and usability of the analysis. In summary, our R and Python packages serve as a powerful suite of tools, offering researchers and practitioners the means to delve deeper into their data, draw robust inference, and conduct potentially impactful analyses and inference across a wide array of disciplines.
- [594] arXiv:2402.15534 (replaced) [pdf, html, other]
-
Title: DiCoM -- Diverse Concept Modeling towards Enhancing Generalizability in Chest X-Ray StudiesAbhijeet Parida, Daniel Capellan-Martin, Sara Atito, Muhammad Awais, Maria J. Ledesma-Carbayo, Marius G. Linguraru, Syed Muhammad AnwarSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Chest X-Ray (CXR) is a widely used clinical imaging modality and has a pivotal role in the diagnosis and prognosis of various lung and heart related conditions. Conventional automated clinical diagnostic tool design strategies relying on radiology reads and supervised learning, entail the cumbersome requirement of high quality annotated training data. To address this challenge, self-supervised pre-training has proven to outperform supervised pre-training in numerous downstream vision tasks, representing a significant breakthrough in the field. However, medical imaging pre-training significantly differs from pre-training with natural images (e.g., ImageNet) due to unique attributes of clinical images. In this context, we introduce Diverse Concept Modeling (DiCoM), a novel self-supervised training paradigm that leverages a student teacher framework for learning diverse concepts and hence effective representation of the CXR data. Hence, expanding beyond merely modeling a single primary label within an image, instead, effectively harnessing the information from all the concepts inherent in the CXR. The pre-trained model is subsequently fine-tuned to address diverse domain-specific tasks. Our proposed paradigm consistently demonstrates robust performance across multiple downstream tasks on multiple datasets, highlighting the success and generalizability of the pre-training strategy. To establish the efficacy of our methods we analyze both the power of learned representations and the speed of convergence (SoC) of our models. For diverse data and tasks, DiCoM is able to achieve in most cases better results compared to other state-of-the-art pre-training strategies. This when combined with the higher SoC and generalization capabilities positions DiCoM to be established as a foundation model for CXRs, a widely used imaging modality.
- [595] arXiv:2402.18729 (replaced) [pdf, html, other]
-
Title: A Priori Uncertainty Quantification of Reacting Turbulence Closure Models using Bayesian Neural NetworksSubjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
While many physics-based closure model forms have been posited for the sub-filter scale (SFS) in large eddy simulation (LES), vast amounts of data available from direct numerical simulation (DNS) create opportunities to leverage data-driven modeling techniques. Albeit flexible, data-driven models still depend on the dataset and the functional form of the model chosen. Increased adoption of such models requires reliable uncertainty estimates both in the data-informed and out-of-distribution regimes. In this work, we employ Bayesian neural networks (BNNs) to capture both epistemic and aleatoric uncertainties in a reacting flow model. In particular, we model the filtered progress variable scalar dissipation rate which plays a key role in the dynamics of turbulent premixed flames. We demonstrate that BNN models can provide unique insights about the structure of uncertainty of the data-driven closure models. We also propose a method for the incorporation of out-of-distribution information in a BNN. The efficacy of the model is demonstrated by a priori evaluation on a dataset consisting of a variety of flame conditions and fuels.
- [596] arXiv:2403.02307 (replaced) [pdf, html, other]
-
Title: Harnessing Intra-group Variations Via a Population-Level Context for Pathology DetectionP. Bilha Githinji, Xi Yuan, Zhenglin Chen, Ijaz Gul, Dingqi Shang, Wen Liang, Jianming Deng, Dan Zeng, Dongmei yu, Chenggang Yan, Peiwu QinSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Realizing sufficient separability between the distributions of healthy and pathological samples is a critical obstacle for pathology detection convolutional models. Moreover, these models exhibit a bias for contrast-based images, with diminished performance on texture-based medical images. This study introduces the notion of a population-level context for pathology detection and employs a graph theoretic approach to model and incorporate it into the latent code of an autoencoder via a refinement module we term PopuSense. PopuSense seeks to capture additional intra-group variations inherent in biomedical data that a local or global context of the convolutional model might miss or smooth out. Proof-of-concept experiments on contrast-based and texture-based images, with minimal adaptation, encounter the existing preference for intensity-based input. Nevertheless, PopuSense demonstrates improved separability in contrast-based images, presenting an additional avenue for refining representations learned by a model.
- [597] arXiv:2403.08757 (replaced) [pdf, html, other]
-
Title: Efficient Combinatorial Optimization via Heat DiffusionComments: Code is available in this https URLSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Combinatorics (math.CO); Applied Physics (physics.app-ph)
Combinatorial optimization problems are widespread but inherently challenging due to their discrete nature. The primary limitation of existing methods is that they can only access a small fraction of the solution space at each iteration, resulting in limited efficiency for searching the global this http URL overcome this challenge, diverging from conventional efforts of expanding the solver's search scope, we focus on enabling information to actively propagate to the solver through heat diffusion. By transforming the target function while preserving its optima, heat diffusion facilitates information flow from distant regions to the solver, providing more efficient navigation. Utilizing heat diffusion, we propose a framework for solving general combinatorial optimization problems.The proposed methodology demonstrates superior performance across a range of the most challenging and widely encountered combinatorial optimizations. Echoing recent advancements in harnessing thermodynamics for generative artificial intelligence, our study further reveals its significant potential in advancing combinatorial optimization.
- [598] arXiv:2403.12120 (replaced) [pdf, html, other]
-
Title: Light Curve Classification with DistClassiPy: a new distance-based classifierComments: Accepted for publication in Astronomy and Computing (2024). 24 pages, 19 figuresSubjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
The rise of synoptic sky surveys has ushered in an era of big data in time-domain astronomy, making data science and machine learning essential tools for studying celestial objects. While tree-based models (e.g. Random Forests) and deep learning models dominate the field, we explore the use of different distance metrics to aid in the classification of astrophysical objects. We developed DistClassiPy, a new distance metric based classifier. The direct use of distance metrics is unexplored in time-domain astronomy, but distance-based methods can help make classification more interpretable and decrease computational costs. In particular, we applied DistClassiPy to classify light curves of variable stars, comparing the distances between objects of different classes. Using 18 distance metrics on a catalog of 6,000 variable stars across 10 classes, we demonstrate classification and dimensionality reduction. Our classifier meets state-of-the-art performance but has lower computational requirements and improved interpretability. Additionally, DistClassiPy can be tailored to specific objects by identifying the most effective distance metric for that classification. To facilitate broader applications within and beyond astronomy, we have made DistClassiPy open-source and available at this https URL.
- [599] arXiv:2403.17436 (replaced) [pdf, html, other]
-
Title: Particle identification with machine learning from incomplete data in the ALICE experimentMaja Karwowska, Łukasz Graczykowski, Kamil Deja, Miłosz Kasak, Małgorzata Janik (for the ALICE collaboration)Comments: Proceedings of 3rd Artificial Intelligence for the Electron-Ion Collider workshop -- AI4EIC2023, 28.11-1.12.2023Journal-ref: JINST 19, C07013 (2024)Subjects: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
The ALICE experiment at the LHC measures properties of the strongly interacting matter formed in ultrarelativistic heavy-ion collisions. Such studies require accurate particle identification (PID). ALICE provides PID information via several detectors for particles with momentum from about 100 MeV/c up to 20 GeV/c. Traditionally, particles are selected with rectangular cuts. A much better performance can be achieved with machine learning (ML) methods. Our solution uses multiple neural networks (NN) serving as binary classifiers. Moreover, we extended our particle classifier with Feature Set Embedding and attention in order to train on data with incomplete samples. We also present the integration of the ML project with the ALICE analysis software, and we discuss domain adaptation, the ML technique needed to transfer the knowledge between simulated and real experimental data.
- [600] arXiv:2405.10870 (replaced) [pdf, html, other]
-
Title: Multicenter Privacy-Preserving Model Training for Deep Learning Brain Metastases AutosegmentationYixing Huang, Zahra Khodabakhshi, Ahmed Gomaa, Manuel Schmidt, Rainer Fietkau, Matthias Guckenberger, Nicolaus Andratschke, Christoph Bert, Stephanie Tanadini-Lang, Florian PutzComments: Official published version in the Green Journal: this https URLJournal-ref: Radiotherapy & Oncology. 2024, 198, 110419, 1-8Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Objectives: This work aims to explore the impact of multicenter data heterogeneity on deep learning brain metastases (BM) autosegmentation performance, and assess the efficacy of an incremental transfer learning technique, namely learning without forgetting (LWF), to improve model generalizability without sharing raw data.
Materials and methods: A total of six BM datasets from University Hospital Erlangen (UKER), University Hospital Zurich (USZ), Stanford, UCSF, NYU and BraTS Challenge 2023 on BM segmentation were used for this evaluation. First, the multicenter performance of a convolutional neural network (DeepMedic) for BM autosegmentation was established for exclusive single-center training and for training on pooled data, respectively. Subsequently bilateral collaboration was evaluated, where a UKER pretrained model is shared to another center for further training using transfer learning (TL) either with or without LWF.
Results: For single-center training, average F1 scores of BM detection range from 0.625 (NYU) to 0.876 (UKER) on respective single-center test data. Mixed multicenter training notably improves F1 scores at Stanford and NYU, with negligible improvement at other centers. When the UKER pretrained model is applied to USZ, LWF achieves a higher average F1 score (0.839) than naive TL (0.570) and single-center training (0.688) on combined UKER and USZ test data. Naive TL improves sensitivity and contouring accuracy, but compromises precision. Conversely, LWF demonstrates commendable sensitivity, precision and contouring accuracy. When applied to Stanford, similar performance was observed.
Conclusion: Data heterogeneity results in varying performance in BM autosegmentation, posing challenges to model generalizability. LWF is a promising approach to peer-to-peer privacy-preserving model training. - [601] arXiv:2406.08401 (replaced) [pdf, html, other]
-
Title: Nystr\"om Kernel Stein DiscrepancyComments: Update proof of Lemma B.3, milder Assumption 1, more experimentsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the construction of powerful goodness-of-fit tests where it is sufficient to know the target distribution up to a multiplicative constant. However, the typical U- and V-statistic-based KSD estimators suffer from a quadratic runtime complexity, which hinders their application in large-scale settings. In this work, we propose a Nyström-based KSD acceleration -- with runtime $\mathcal O\!\left(mn+m^3\right)$ for $n$ samples and $m\ll n$ Nyström points -- , show its $\sqrt{n}$-consistency under the null with a classical sub-Gaussian assumption, and demonstrate its applicability for goodness-of-fit testing on a suite of benchmarks.
- [602] arXiv:2407.01926 (replaced) [pdf, other]
-
Title: Chemical Shift Encoding based Double Bonds Quantification in Triglycerides using Deep Image PriorChaoxing Huang, Ziqiang Yu, Zijian Gao, Qiuyi Shen, Queenie Chan, Vincent Wai-Sun Wong, Winnie Chiu-Wing Chu, Weitian ChenSubjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
This study evaluated a deep learning-based method using Deep Image Prior (DIP) to quantify triglyceride double bonds from chemical-shift encoded multi-echo gradient echo images without network training. We employed a cost function based on signal constraints to iteratively update the neural network on a single dataset. The method was validated using phantom experiments and in vivo scans. Results showed close alignment between measured and reference double bond values, with phantom experiments yielding a Pearson correlation coefficient of 0.96 (p = .0005). In vivo results demonstrated good agreement in subcutaneous fat. We conclude that Deep Image Prior shows feasibility for quantifying double bonds and fatty acid content from chemical-shift encoded multi-echo MRI.
- [603] arXiv:2407.05433 (replaced) [pdf, html, other]
-
Title: An efficient algorithm for solving linear equality-constrained LQR problemsComments: 6 pagesSubjects: Optimization and Control (math.OC); Robotics (cs.RO)
We present a new algorithm for solving linear-quadratic regulator (LQR) problems with linear equality constraints, also known as constrained LQR (CLQR) problems. Our method's sequential runtime is linear in the number of stages and constraints, and its parallel runtime is logarithmic in the number of stages. The main technical contribution of this paper is the derivation of parallelizable techniques for eliminating the linear equality constraints while preserving the standard positive (semi-)definiteness requirements of LQR problems.
- [604] arXiv:2407.06100 (replaced) [pdf, html, other]
-
Title: Leveraging data-driven weather models for improving numerical weather prediction skill through large-scale spectral nudgingSyed Zahid Husain, Leo Separovic, Jean-François Caron, Rabah Aider, Mark Buehner, Stéphane Chamberland, Ervig Lapalme, Ron McTaggart-Cowan, Christopher Subich, Paul A. Vaillancourt, Jing Yang, Ayrton ZadraSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Operational meteorological forecasting has long relied on physics-based numerical weather prediction (NWP) models. Recently, this landscape is facing disruption by the advent of data-driven artificial intelligence (AI)-based weather models, which offer tremendous computational performance and competitive forecasting skill. However, data-driven models for medium-range forecasting generally suffer from major limitations, including low effective resolution and a narrow range of predicted variables. This study illustrates the relative strengths and weaknesses of these competing paradigms using the GEM (Global Environmental Multiscale) and GraphCast models to represent physics-based and AI-based approaches, respectively. By analyzing global predictions from these two models against observations and analyses in both physical and spectral spaces, this study demonstrates that GraphCast-predicted large scales outperform GEM, particularly for longer lead times. Building on this insight, a hybrid NWP-AI system is proposed, wherein GEM-predicted large-scale state variables are spectrally nudged toward GraphCast predictions, while allowing GEM to freely generate fine-scale details critical for weather extremes. Results indicate that this hybrid approach is capable of leveraging the strengths of GraphCast to enhance the prediction skill of the GEM model. Importantly, trajectories of tropical cyclones are predicted with enhanced accuracy without significant changes in intensity. Furthermore, this new hybrid system ensures that meteorologists have access to a complete set of forecast variables, including those relevant for high-impact weather events.
- [605] arXiv:2407.07689 (replaced) [pdf, html, other]
-
Title: Orthogonal projectors of binary LCD codesSubjects: Combinatorics (math.CO); Information Theory (cs.IT)
We prove that binary even LCD code and some graphs are in one-to-one correspondence in a certain way. Furthermore, we show that adjacency matrices of non-isomorphic simple graphs give inequivalent binary LCD codes, and vice versa.
- [606] arXiv:2407.07720 (replaced) [pdf, html, other]
-
Title: SvANet: A Scale-variant Attention-based Network for Small Medical Object SegmentationComments: 14 pages, 9 figures, under reviewSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Early detection and accurate diagnosis can predict the risk of malignant disease transformation, thereby increasing the probability of effective treatment. A mild syndrome with small infected regions is an ominous warning and is foremost in the early diagnosis of diseases. Deep learning algorithms, such as convolutional neural networks (CNNs), have been used to segment natural or medical objects, showing promising results. However, analyzing medical objects of small areas in images remains a challenge due to information losses and compression defects caused by convolution and pooling operations in CNNs. These losses and defects become increasingly significant as the network deepens, particularly for small medical objects. To address these challenges, we propose a novel scale-variant attention-based network (SvANet) for accurate small-scale object segmentation in medical images. The SvANet consists of Monte Carlo attention, scale-variant attention, and vision transformer, which incorporates cross-scale features and alleviates compression artifacts for enhancing the discrimination of small medical objects. Quantitative experimental results demonstrate the superior performance of SvANet, achieving 96.12%, 96.11%, 89.79%, 84.15%, 80.25%, 73.05%, and 72.58% in mean Dice coefficient for segmenting kidney tumors, skin lesions, hepatic tumors, polyps, surgical excision cells, retinal vasculatures, and sperms, which occupy less than 1% of the image areas in KiTS23, ISIC 2018, ATLAS, PolypGen, TissueNet, FIVES, and SpermHealth datasets, respectively.
- [607] arXiv:2407.10107 (replaced) [pdf, html, other]
-
Title: Two-Player Zero-Sum Hybrid GamesComments: 21 pagesSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
In this paper, we formulate a two-player zero-sum game under dynamic constraints defined by hybrid dynamical equations. The game consists of a min-max problem involving a cost functional that depends on the actions and resulting solutions to the hybrid system, defined as functions of hybrid time and, hence, can flow or jump. A terminal set conviniently defined allows to recast both finite and infinite horizon problems. We present sufficient conditions given in terms of Hamilton-Jacobi-Bellman-Isaacs-like equations to guarantee to attain a solution to the game. It is shown that when the players select the optimal strategy, the value function can be evaluated without computing solutions to the hybrid system. Under additional conditions, we show that the optimal state-feedback laws render a set of interest asymptotically stable for the resulting hybrid closed-loop system. Applications of this problem, as presented here, include a disturbance rejection scenario for which the effect of the perturbation is upper bounded, and a security scenario in which we formulate an optimal control problem under the presence of a maximizing attack.
- [608] arXiv:2407.17146 (replaced) [pdf, html, other]
-
Title: Quantifying variabilities in cardiac digital twin models of the electrocardiogramComments: 21 pages, 7 figures, 2 tablesSubjects: Medical Physics (physics.med-ph); Numerical Analysis (math.NA); Tissues and Organs (q-bio.TO)
Cardiac digital twins (CDTs) of human cardiac electrophysiology (EP) are digital replicas of patient hearts that match like-for-like clinical observations. The electrocardiogram (ECG), as the most prevalent non-invasive observation of cardiac electrophysiology, is considered an ideal target for CDT calibration. Recent advanced CDT calibration methods have demonstrated their ability to minimize discrepancies between simulated and measured ECG signals, effectively replicating all key morphological features relevant to diagnostics. However, due to the inherent nature of clinical data acquisition and CDT model generation pipelines, discrepancies inevitably arise between the real physical electrophysiology in a patient and the simulated virtual electrophysiology in a CDT. In this study, we aim to qualitatively and quantitatively analyze the impact of these uncertainties on ECG morphology and diagnostic markers. We analyze residual beat-to-beat variability in ECG recordings obtained from healthy subjects and patients. Using a biophysically detailed and anatomically accurate computational model of whole-heart electrophysiology combined with a detailed torso model calibrated to closely replicate measured ECG signals, we vary anatomical factors (heart location, orientation, size), heterogeneity in electrical conductivities in the heart and torso, and electrode placements across ECG leads to assess their qualitative impact on ECG morphology. Our study demonstrates that diagnostically relevant ECG features and overall morphology appear relatively robust against the investigated uncertainties. This resilience is consistent with the narrow distribution of ECG due to residual beat-to-beat variability observed in both healthy subjects and patients.
- [609] arXiv:2407.17236 (replaced) [pdf, html, other]
-
Title: Statistical Batch-Based Bearing Fault DetectionSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In the domain of rotating machinery, bearings are vulnerable to different mechanical faults, including ball, inner, and outer race faults. Various techniques can be used in condition-based monitoring, from classical signal analysis to deep learning methods. Based on the complex working conditions of rotary machines, multivariate statistical process control charts such as Hotelling's $T^2$ and Squared Prediction Error are useful for providing early warnings. However, these methods are rarely applied to condition monitoring of rotating machinery due to the univariate nature of the datasets. In the present paper, we propose a multivariate statistical process control-based fault detection method that utilizes multivariate data composed of Fourier transform features extracted for fixed-time batches. Our approach makes use of the multidimensional nature of Fourier transform characteristics, which record more detailed information about the machine's status, in an effort to enhance early defect detection and diagnosis. Experiments with varying vibration measurement locations (Fan End, Drive End), fault types (ball, inner, and outer race faults), and motor loads (0-3 horsepower) are used to validate the suggested approach. The outcomes illustrate our method's effectiveness in fault detection and point to possible broader uses in industrial maintenance.
- [610] arXiv:2407.17324 (replaced) [pdf, html, other]
-
Title: Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly PopulationNikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis SarigianndisSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce a novel methodology for the classification of demented and non-demented elderly patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and Dem3D EfficientNet. These models work synergistically to enhance decision-making accuracy, leveraging their collective strengths. Tested on the Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
- [611] arXiv:2407.17413 (replaced) [pdf, html, other]
-
Title: $A^*$ for Graphs of Convex SetsSubjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
We present a novel algorithm that fuses the existing convex-programming based approach with heuristic information to find optimality guarantees and near-optimal paths for the Shortest Path Problem in the Graph of Convex Sets (SPP-GCS). Our method, inspired by $A^*$, initiates a best-first-like procedure from a designated subset of vertices and iteratively expands it until further growth is neither possible nor beneficial. Traditionally, obtaining solutions with bounds for an optimization problem involves solving a relaxation, modifying the relaxed solution to a feasible one, and then comparing the two solutions to establish bounds. However, for SPP-GCS, we demonstrate that reversing this process can be more advantageous, especially with Euclidean travel costs. In other words, we initially employ $A^*$ to find a feasible solution for SPP-GCS, then solve a convex relaxation restricted to the vertices explored by $A^*$ to obtain a relaxed solution, and finally, compare the solutions to derive bounds. We present numerical results to highlight the advantages of our algorithm over the existing approach in terms of the sizes of the convex programs solved and computation time.