We gratefully acknowledge support from
the Simons Foundation and member institutions.

Computer Vision and Pattern Recognition

New submissions

[ total of 146 entries: 1-146 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Fri, 3 May 24

[1]  arXiv:2405.00740 [pdf, other]
Title: Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Comments: 14 pages, 8 figures, 7 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

There are a thousand ways to caption an image. Contrastive Language Pretraining (CLIP) on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet outperforming a similarly sized CLIP by 1.4%. We also demonstrate improvement on zero-shot retrieval on MS-COCO by 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.

[2]  arXiv:2405.00749 [pdf, other]
Title: More is Better: Deep Domain Adaptation with Multiple Sources
Comments: Accepted by IJCAI 2024. arXiv admin note: text overlap with arXiv:2002.12169
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In many practical applications, it is often difficult and expensive to obtain large-scale labeled data to train state-of-the-art deep neural networks. Therefore, transferring the learned knowledge from a separate, labeled source domain to an unlabeled or sparsely labeled target domain becomes an appealing alternative. However, direct transfer often results in significant performance decay due to domain shift. Domain adaptation (DA) aims to address this problem by aligning the distributions between the source and target domains. Multi-source domain adaptation (MDA) is a powerful and practical extension in which the labeled data may be collected from multiple sources with different distributions. In this survey, we first define various MDA strategies. Then we systematically summarize and compare modern MDA methods in the deep learning era from different perspectives, followed by commonly used datasets and a brief benchmark. Finally, we discuss future research directions for MDA that are worth investigating.

[3]  arXiv:2405.00754 [pdf, other]
Title: CLIPArTT: Light-weight Adaptation of CLIP to New Domains at Test Time
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Pre-trained vision-language models (VLMs), exemplified by CLIP, demonstrate remarkable adaptability across zero-shot classification tasks without additional training. However, their performance diminishes in the presence of domain shifts. In this study, we introduce CLIP Adaptation duRing Test-Time (CLIPArTT), a fully test-time adaptation (TTA) approach for CLIP, which involves automatic text prompts construction during inference for their use as text supervision. Our method employs a unique, minimally invasive text prompt tuning process, wherein multiple predicted classes are aggregated into a single new text prompt, used as pseudo label to re-classify inputs in a transductive manner. Additionally, we pioneer the standardization of TTA benchmarks (e.g., TENT) in the realm of VLMs. Our findings demonstrate that, without requiring additional transformations nor new trainable modules, CLIPArTT enhances performance dynamically across non-corrupted datasets such as CIFAR-10, corrupted datasets like CIFAR-10-C and CIFAR-10.1, alongside synthetic datasets such as VisDA-C. This research underscores the potential for improving VLMs' adaptability through novel test-time strategies, offering insights for robust performance across varied datasets and environments. The code can be found at: https://github.com/dosowiechi/CLIPArTT.git

[4]  arXiv:2405.00760 [pdf, other]
Title: Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
Comments: N/A
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. We find that training earlier steps in the sampling process is crucial for low-level rewards, and deep supervision can be achieved efficiently and effectively by stopping the gradient of the denoising network input. DRTune is extensively evaluated on various reward models. It consistently outperforms other algorithms, particularly for low-level control signals, where all shallow supervision methods fail. Additionally, we fine-tune Stable Diffusion XL 1.0 (SDXL 1.0) model via DRTune to optimize Human Preference Score v2.1, resulting in the Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances image quality compared to SDXL 1.0 and reaches comparable quality compared with Midjourney v5.2.

[5]  arXiv:2405.00791 [pdf, other]
Title: Obtaining Favorable Layouts for Multiple Object Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Large-scale text-to-image models that can generate high-quality and diverse images based on textual prompts have shown remarkable success. These models aim ultimately to create complex scenes, and addressing the challenge of multi-subject generation is a critical step towards this goal. However, the existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects. When presented with a prompt containing more than one subject, these models may omit some subjects or merge them together. To address this challenge, we propose a novel approach based on a guiding principle. We allow the diffusion model to initially propose a layout, and then we rearrange the layout grid. This is achieved by enforcing cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels from latent maps to new locations determined by us. We introduce new loss terms aimed at reducing XAM entropy for clearer spatial definition of subjects, reduce the overlap between XAMs, and ensure that XAMs align with their respective masks. We contrast our approach with several alternative methods and show that it more faithfully captures the desired concepts across a variety of text prompts.

[6]  arXiv:2405.00794 [pdf, other]
Title: Coherent 3D Portrait Video Reconstruction via Triplane Fusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearances. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.

[7]  arXiv:2405.00857 [pdf, other]
Title: Brighteye: Glaucoma Screening with Color Fundus Photographs based on Vision Transformer
Comments: ISBI 2024, JustRAIGS challenge, glaucoma detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Differences in image quality, lighting conditions, and patient demographics pose challenges to automated glaucoma detection from color fundus photography. Brighteye, a method based on Vision Transformer, is proposed for glaucoma detection and glaucomatous feature classification. Brighteye learns long-range relationships among pixels within large fundus images using a self-attention mechanism. Prior to being input into Brighteye, the optic disc is localized using YOLOv8, and the region of interest (ROI) around the disc center is cropped to ensure alignment with clinical practice. Optic disc detection improves the sensitivity at 95% specificity from 79.20% to 85.70% for glaucoma detection and the Hamming distance from 0.2470 to 0.1250 for glaucomatous feature classification. In the developmental stage of the Justified Referral in AI Glaucoma Screening (JustRAIGS) challenge, the overall outcome secured the fifth position out of 226 entries.

[8]  arXiv:2405.00858 [pdf, other]
Title: Guided Conditional Diffusion Classifier (ConDiff) for Enhanced Prediction of Infection in Diabetic Foot Ulcers
Subjects: Computer Vision and Pattern Recognition (cs.CV)

To detect infected wounds in Diabetic Foot Ulcers (DFUs) from photographs, preventing severe complications and amputations. Methods: This paper proposes the Guided Conditional Diffusion Classifier (ConDiff), a novel deep-learning infection detection model that combines guided image synthesis with a denoising diffusion model and distance-based classification. The process involves (1) generating guided conditional synthetic images by injecting Gaussian noise to a guide image, followed by denoising the noise-perturbed image through a reverse diffusion process, conditioned on infection status and (2) classifying infections based on the minimum Euclidean distance between synthesized images and the original guide image in embedding space. Results: ConDiff demonstrated superior performance with an accuracy of 83% and an F1-score of 0.858, outperforming state-of-the-art models by at least 3%. The use of a triplet loss function reduces overfitting in the distance-based classifier. Conclusions: ConDiff not only enhances diagnostic accuracy for DFU infections but also pioneers the use of generative discriminative models for detailed medical image analysis, offering a promising approach for improving patient outcomes.

[9]  arXiv:2405.00876 [pdf, other]
Title: Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, Segment Anything Model (SAM) shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM, on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert - the models are readily encumbered by the introduction of impurities, defects, artefact overlaps and diversity present in the images.

[10]  arXiv:2405.00878 [pdf, other]
Title: SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross attention layers which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio conditioned image generation, our method can also be utilized in conjuction with diffusion based editing methods to enable audio conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.

[11]  arXiv:2405.00892 [pdf, other]
Title: Wake Vision: A Large-scale, Diverse Dataset and Benchmark Suite for TinyML Person Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Machine learning applications on extremely low-power devices, commonly referred to as tiny machine learning (TinyML), promises a smarter and more connected world. However, the advancement of current TinyML research is hindered by the limited size and quality of pertinent datasets. To address this challenge, we introduce Wake Vision, a large-scale, diverse dataset tailored for person detection -- the canonical task for TinyML visual sensing. Wake Vision comprises over 6 million images, which is a hundredfold increase compared to the previous standard, and has undergone thorough quality filtering. Using Wake Vision for training results in a 2.41\% increase in accuracy compared to the established benchmark. Alongside the dataset, we provide a collection of five detailed benchmark sets that assess model performance on specific segments of the test data, such as varying lighting conditions, distances from the camera, and demographic characteristics of subjects. These novel fine-grained benchmarks facilitate the evaluation of model quality in challenging real-world scenarios that are often ignored when focusing solely on overall accuracy. Through an evaluation of a MobileNetV2 TinyML model on the benchmarks, we show that the input resolution plays a more crucial role than the model width in detecting distant subjects and that the impact of quantization on model robustness is minimal, thanks to the dataset quality. These findings underscore the importance of a detailed evaluation to identify essential factors for model development. The dataset, benchmark suite, code, and models are publicly available under the CC-BY 4.0 license, enabling their use for commercial use cases.

[12]  arXiv:2405.00900 [pdf, other]
Title: DiL-NeRF: Delving into Lidar for Neural Radiance Field on Street Scenes
Comments: CVPR2024 Highlights
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow a better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying stronger geometric information offered by explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.

[13]  arXiv:2405.00906 [pdf, other]
Title: LOTUS: Improving Transformer Efficiency with Sparsity Pruning and Data Lottery Tickets
Authors: Ojasw Upadhyay
Comments: 3 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Vision transformers have revolutionized computer vision, but their computational demands present challenges for training and deployment. This paper introduces LOTUS (LOttery Transformers with Ultra Sparsity), a novel method that leverages data lottery ticket selection and sparsity pruning to accelerate vision transformer training while maintaining accuracy. Our approach focuses on identifying and utilizing the most informative data subsets and eliminating redundant model parameters to optimize the training process. Through extensive experiments, we demonstrate the effectiveness of LOTUS in achieving rapid convergence and high accuracy with significantly reduced computational requirements. This work highlights the potential of combining data selection and sparsity techniques for efficient vision transformer training, opening doors for further research and development in this area.

[14]  arXiv:2405.00908 [pdf, ps, other]
Title: Transformer-Based Self-Supervised Learning for Histopathological Classification of Ischemic Stroke Clot Origin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Background and Purpose: Identifying the thromboembolism source in ischemic stroke is crucial for treatment and secondary prevention yet is often undetermined. This study describes a self-supervised deep learning approach in digital pathology of emboli for classifying ischemic stroke clot origin from histopathological images. Methods: The dataset included whole slide images (WSI) from the STRIP AI Kaggle challenge, consisting of retrieved clots from ischemic stroke patients following mechanical thrombectomy. Transformer-based deep learning models were developed using transfer learning and self-supervised pretraining for classifying WSI. Customizations included an attention pooling layer, weighted loss function, and threshold optimization. Various model architectures were tested and compared, and model performances were primarily evaluated using weighted logarithmic loss. Results: The model achieved a logloss score of 0.662 in cross-validation and 0.659 on the test set. Different model backbones were compared, with the swin_large_patch4_window12_384 showed higher performance. Thresholding techniques for clot origin classification were employed to balance false positives and negatives. Conclusion: The study demonstrates the extent of efficacy of transformer-based deep learning models in identifying ischemic stroke clot origins from histopathological images and emphasizes the need for refined modeling techniques specifically adapted to thrombi WSI. Further research is needed to improve model performance, interpretability, validate its effectiveness. Future enhancement could include integrating larger patient cohorts, advanced preprocessing strategies, and exploring ensemble multimodal methods for enhanced diagnostic accuracy.

[15]  arXiv:2405.00915 [pdf, other]
Title: EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
Comments: 25 pages. 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.

[16]  arXiv:2405.00942 [pdf, other]
Title: LLaVA Finds Free Lunch: Teaching Human Behavior Improves Content Understanding Abilities Of LLMs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Communication is defined as ``Who says what to whom with what effect.'' A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Even after carrying signals about the message, the behavior data is often ignored while training large language models. We show that training LLMs on receiver behavior can actually help improve their content-understanding abilities. Specifically, we show that training LLMs to predict the receiver behavior of likes and comments improves the LLM's performance on a wide variety of downstream content understanding tasks. We show this performance increase over 40 video and image understanding tasks over 23 benchmark datasets across both 0-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior, such as likes and comments, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially free-lunch. We release the receiver behavior cleaned comments and likes of 750k images and videos collected from multiple platforms along with our instruction-tuning data.

[17]  arXiv:2405.00951 [pdf, other]
Title: Hyperspectral Band Selection based on Generalized 3DTV and Tensor CUR Decomposition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Optimization and Control (math.OC)

Hyperspectral Imaging (HSI) serves as an important technique in remote sensing. However, high dimensionality and data volume typically pose significant computational challenges. Band selection is essential for reducing spectral redundancy in hyperspectral imagery while retaining intrinsic critical information. In this work, we propose a novel hyperspectral band selection model by decomposing the data into a low-rank and smooth component and a sparse one. In particular, we develop a generalized 3D total variation (G3DTV) by applying the $\ell_1^p$-norm to derivatives to preserve spatial-spectral smoothness. By employing the alternating direction method of multipliers (ADMM), we derive an efficient algorithm, where the tensor low-rankness is implied by the tensor CUR decomposition. We demonstrate the effectiveness of the proposed approach through comparisons with various other state-of-the-art band selection techniques using two benchmark real-world datasets. In addition, we provide practical guidelines for parameter selection in both noise-free and noisy scenarios.

[18]  arXiv:2405.00954 [pdf, other]
Title: X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation
Comments: ICML2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advancements in automatic 3D avatar generation guided by text have made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), representing avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into rendered images for improved generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: https://xmu-xiaoma666.github.io/Projects/X-Oscar/.

[19]  arXiv:2405.00962 [pdf, other]
Title: FITA: Fine-grained Image-Text Aligner for Radiology Report Generation
Comments: 11 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Radiology report generation aims to automatically generate detailed and coherent descriptive reports alongside radiology images. Previous work mainly focused on refining fine-grained image features or leveraging external knowledge. However, the precise alignment of fine-grained image features with corresponding text descriptions has not been considered. This paper presents a novel method called Fine-grained Image-Text Aligner (FITA) to construct fine-grained alignment for image and text features. It has three novel designs: Image Feature Refiner (IFR), Text Feature Refiner (TFR) and Contrastive Aligner (CA). IFR and TFR aim to learn fine-grained image and text features, respectively. We achieve this by leveraging saliency maps to effectively fuse symptoms with corresponding abnormal visual regions, and by utilizing a meticulously constructed triplet set for training. Finally, CA module aligns fine-grained image and text features using contrastive loss for precise alignment. Results show that our method surpasses existing methods on the widely used benchmark

[20]  arXiv:2405.00983 [pdf, other]
Title: LLM-AD: Large Language Model based Audio Description System
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor, while existing automated approaches still necessitate extensive training to integrate multimodal inputs and tailor the output from a captioning style to an AD style. In this paper, we introduce an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision). Notably, our methodology employs readily available components, eliminating the need for additional training. It produces ADs that not only comply with established natural language AD production standards but also maintain contextually consistent character information across frames, courtesy of a tracking-based character recognition module. A thorough analysis on the MAD dataset reveals that our approach achieves a performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.

[21]  arXiv:2405.00989 [pdf, ps, other]
Title: Estimate the building height at a 10-meter resolution based on Sentinel data
Authors: Xin Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Building height is an important indicator for scientific research and practical application. However, building height products with a high spatial resolution (10m) are still very scarce. To meet the needs of high-resolution building height estimation models, this study established a set of spatial-spectral-temporal feature databases, combining SAR data provided by Sentinel-1, optical data provided by Sentinel-2, and shape data provided by building footprints. The statistical indicators on the time scale are extracted to form a rich database of 160 features. This study combined with permutation feature importance, Shapley Additive Explanations, and Random Forest variable importance, and the final stable features are obtained through an expert scoring system. This study took 12 large, medium, and small cities in the United States as the training data. It used moving windows to aggregate the pixels to solve the impact of SAR image displacement and building shadows. This study built a building height model based on a random forest model and compared three model ensemble methods of bagging, boosting, and stacking. To evaluate the accuracy of the prediction results, this study collected Lidar data in the test area, and the evaluation results showed that its R-Square reached 0.78, which can prove that the building height can be obtained effectively. The fast production of high-resolution building height data can support large-scale scientific research and application in many fields.

[22]  arXiv:2405.00998 [pdf, other]
Title: Part-aware Shape Generation with Latent 3D Diffusion of Neural Voxel Fields
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a novel latent 3D diffusion model for the generation of neural voxel fields, aiming to achieve accurate part-aware structures. Compared to existing methods, there are two key designs to ensure high-quality and accurate part-aware generation. On one hand, we introduce a latent 3D diffusion process for neural voxel fields, enabling generation at significantly higher resolutions that can accurately capture rich textural and geometric details. On the other hand, a part-aware shape decoder is introduced to integrate the part codes into the neural voxel fields, guiding the accurate part decomposition and producing high-quality rendering results. Through extensive experimentation and comparisons with state-of-the-art methods, we evaluate our approach across four different classes of data. The results demonstrate the superior generative capabilities of our proposed method in part-aware shape generation, outperforming existing state-of-the-art methods.

[23]  arXiv:2405.01002 [pdf, other]
Title: Spider: A Unified Framework for Context-dependent Concept Understanding
Comments: Accepted by ICML 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Different from the context-independent (CI) concepts such as human, car, and airplane, context-dependent (CD) concepts require higher visual understanding ability, such as camouflaged object and medical lesion. Despite the rapid advance of many CD understanding tasks in respective branches, the isolated evolution leads to their limited cross-domain generalisation and repetitive technique innovation. Since there is a strong coupling relationship between foreground and background context in CD tasks, existing methods require to train separate models in their focused domains. This restricts their real-world CD concept understanding towards artificial general intelligence (AGI). We propose a unified model with a single set of parameters, Spider, which only needs to be trained once. With the help of the proposed concept filter driven by the image-mask group prompt, Spider is able to understand and distinguish diverse strong context-dependent concepts to accurately capture the Prompter's intention. Without bells and whistles, Spider significantly outperforms the state-of-the-art specialized models in 8 different context-dependent segmentation tasks, including 4 natural scenes (salient, camouflaged, and transparent objects and shadow) and 4 medical lesions (COVID-19, polyp, breast, and skin lesion with color colonoscopy, CT, ultrasound, and dermoscopy modalities). Besides, Spider shows obvious advantages in continuous learning. It can easily complete the training of new tasks by fine-tuning parameters less than 1\% and bring a tolerable performance degradation of less than 5\% for all old tasks. The source code will be publicly available at \href{https://github.com/Xiaoqi-Zhao-DLUT/Spider-UniCDSeg}{Spider-UniCDSeg}.

[24]  arXiv:2405.01008 [pdf, other]
Title: On Mechanistic Knowledge Localization in Text-to-Image Generative Models
Comments: Appearing in ICML 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Identifying layers within text-to-image models which control visual attributes can facilitate efficient model editing through closed-form updates. Recent work, leveraging causal tracing show that early Stable-Diffusion variants confine knowledge primarily to the first layer of the CLIP text-encoder, while it diffuses throughout the UNet.Extending this framework, we observe that for recent models (e.g., SD-XL, DeepFloyd), causal tracing fails in pinpointing localized knowledge, highlighting challenges in model editing. To address this issue, we introduce the concept of Mechanistic Localization in text-to-image models, where knowledge about various visual attributes (e.g., ``style", ``objects", ``facts") can be mechanistically localized to a small fraction of layers in the UNet, thus facilitating efficient model editing. We localize knowledge using our method LocoGen which measures the direct effect of intermediate layers to output generation by performing interventions in the cross-attention layers of the UNet. We then employ LocoEdit, a fast closed-form editing method across popular open-source text-to-image models (including the latest SD-XL)and explore the possibilities of neuron-level model editing. Using Mechanistic Localization, our work offers a better view of successes and failures in localization-based text-to-image model editing. Code will be available at \href{https://github.com/samyadeepbasu/LocoGen}{https://github.com/samyadeepbasu/LocoGen}.

[25]  arXiv:2405.01016 [pdf, other]
Title: Addressing Diverging Training Costs using Local Restoration for Precise Bird's Eye View Map Construction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Recent advancements in Bird's Eye View (BEV) fusion for map construction have demonstrated remarkable mapping of urban environments. However, their deep and bulky architecture incurs substantial amounts of backpropagation memory and computing latency. Consequently, the problem poses an unavoidable bottleneck in constructing high-resolution (HR) BEV maps, as their large-sized features cause significant increases in costs including GPU memory consumption and computing latency, named diverging training costs issue. Affected by the problem, most existing methods adopt low-resolution (LR) BEV and struggle to estimate the precise locations of urban scene components like road lanes, and sidewalks. As the imprecision leads to risky self-driving, the diverging training costs issue has to be resolved. In this paper, we address the issue with our novel Trumpet Neural Network (TNN) mechanism. The framework utilizes LR BEV space and outputs an up-sampled semantic BEV map to create a memory-efficient pipeline. To this end, we introduce Local Restoration of BEV representation. Specifically, the up-sampled BEV representation has severely aliased, blocky signals, and thick semantic labels. Our proposed Local Restoration restores the signals and thins (or narrows down) the width of the labels. Our extensive experiments show that the TNN mechanism provides a plug-and-play memory-efficient pipeline, thereby enabling the effective estimation of real-sized (or precise) semantic labels for BEV map construction.

[26]  arXiv:2405.01028 [pdf, other]
Title: Technical Report of NICE Challenge at CVPR 2024: Caption Re-ranking Evaluation Using Ensembled CLIP and Consensus Scores
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This report presents the ECO (Ensembled Clip score and cOnsensus score) pipeline from team DSBA LAB, which is a new framework used to evaluate and rank captions for a given image. ECO selects the most accurate caption describing image. It is made possible by combining an Ensembled CLIP score, which considers the semantic alignment between the image and captions, with a Consensus score that accounts for the essentialness of the captions. Using this framework, we achieved notable success in the CVPR 2024 Workshop Challenge on Caption Re-ranking Evaluation at the New Frontiers for Zero-Shot Image Captioning Evaluation (NICE). Specifically, we secured third place based on the CIDEr metric, second in both the SPICE and METEOR metrics, and first in the ROUGE-L and all BLEU Score metrics. The code and configuration for the ECO framework are available at https://github.com/ DSBA-Lab/ECO .

[27]  arXiv:2405.01040 [pdf, other]
Title: Few Shot Class Incremental Learning using Vision-Language models
Comments: under review at Pattern Recognition Letters
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Image and Video Processing (eess.IV)

Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The challenge emerges in seamlessly integrating new classes with few samples into the training data, demanding the model to adeptly accommodate these additions without compromising its performance on base classes. To address this exigency, the research community has introduced several solutions under the realm of few-shot class incremental learning (FSCIL).
In this study, we introduce an innovative FSCIL framework that utilizes language regularizer and subspace regularizer. During base training, the language regularizer helps incorporate semantic information extracted from a Vision-Language model. The subspace regularizer helps in facilitating the model's acquisition of nuanced connections between image and text semantics inherent to base classes during incremental training. Our proposed framework not only empowers the model to embrace novel classes with limited data, but also ensures the preservation of performance on base classes. To substantiate the efficacy of our approach, we conduct comprehensive experiments on three distinct FSCIL benchmarks, where our framework attains state-of-the-art performance.

[28]  arXiv:2405.01065 [pdf, other]
Title: MFDS-Net: Multi-Scale Feature Depth-Supervised Network for Remote Sensing Change Detection with Global Semantic and Detail Information
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Change detection as an interdisciplinary discipline in the field of computer vision and remote sensing at present has been receiving extensive attention and research. Due to the rapid development of society, the geographic information captured by remote sensing satellites is changing faster and more complex, which undoubtedly poses a higher challenge and highlights the value of change detection tasks. We propose MFDS-Net: Multi-Scale Feature Depth-Supervised Network for Remote Sensing Change Detection with Global Semantic and Detail Information (MFDS-Net) with the aim of achieving a more refined description of changing buildings as well as geographic information, enhancing the localisation of changing targets and the acquisition of weak features. To achieve the research objectives, we use a modified ResNet_34 as backbone network to perform feature extraction and DO-Conv as an alternative to traditional convolution to better focus on the association between feature information and to obtain better training results. We propose the Global Semantic Enhancement Module (GSEM) to enhance the processing of high-level semantic information from a global perspective. The Differential Feature Integration Module (DFIM) is proposed to strengthen the fusion of different depth feature information, achieving learning and extraction of differential features. The entire network is trained and optimized using a deep supervision mechanism.
The experimental outcomes of MFDS-Net surpass those of current mainstream change detection networks. On the LEVIR dataset, it achieved an F1 score of 91.589 and IoU of 84.483, on the WHU dataset, the scores were F1: 92.384 and IoU: 86.807, and on the GZ-CD dataset, the scores were F1: 86.377 and IoU: 76.021. The code is available at https://github.com/AOZAKIiii/MFDS-Net

[29]  arXiv:2405.01066 [pdf, ps, other]
Title: HandSSCA: 3D Hand Mesh Reconstruction with State Space Channel Attention from RGB images
Comments: 13 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Reconstructing a hand mesh from a single RGB image is a challenging task because hands are often occluded by objects. Most previous works attempted to introduce more additional information and adopt attention mechanisms to improve 3D reconstruction results, but it would increased computational complexity. This observation prompts us to propose a new and concise architecture while improving computational efficiency. In this work, we propose a simple and effective 3D hand mesh reconstruction network HandSSCA, which is the first to incorporate state space modeling into the field of hand pose estimation. In the network, we have designed a novel state space channel attention module that extends the effective sensory field, extracts hand features in the spatial dimension, and enhances hand regional features in the channel dimension. This design helps to reconstruct a complete and detailed hand mesh. Extensive experiments conducted on well-known datasets featuring challenging hand-object occlusions (such as FREIHAND, DEXYCB, and HO3D) demonstrate that our proposed HandSSCA achieves state-of-the-art performance while maintaining a minimal parameter count.

[30]  arXiv:2405.01071 [pdf, other]
Title: Callico: a Versatile Open-Source Document Image Annotation Platform
Comments: Accepted to ICDAR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

This paper presents Callico, a web-based open source platform designed to simplify the annotation process in document recognition projects. The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data, and the need for specialised tools that increase the efficiency and effectiveness of generating such data. For document image annotation, Callico offers dual-display annotation for digitised documents, enabling simultaneous visualisation and annotation of scanned images and text. This capability is critical for OCR and HTR model training, document layout analysis, named entity recognition, form-based key value annotation or hierarchical structure annotation with element grouping. The platform supports collaborative annotation with versatile features backed by a commitment to open source development, high-quality code standards and easy deployment via Docker. Illustrative use cases - including the transcription of the Belfort municipal registers, the indexing of French World War II prisoners for the ICRC, and the extraction of personal information from the Socface project's census lists - demonstrate Callico's applicability and utility.

[31]  arXiv:2405.01083 [pdf, other]
Title: MCMS: Multi-Category Information and Multi-Scale Stripe Attention for Blind Motion Deblurring
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning-based motion deblurring techniques have advanced significantly in recent years. This class of techniques, however, does not carefully examine the inherent flaws in blurry images. For instance, low edge and structural information are traits of blurry images. The high-frequency component of blurry images is edge information, and the low-frequency component is structure information. A blind motion deblurring network (MCMS) based on multi-category information and multi-scale stripe attention mechanism is proposed. Given the respective characteristics of the high-frequency and low-frequency components, a three-stage encoder-decoder model is designed. Specifically, the first stage focuses on extracting the features of the high-frequency component, the second stage concentrates on extracting the features of the low-frequency component, and the third stage integrates the extracted low-frequency component features, the extracted high-frequency component features, and the original blurred image in order to recover the final clear image. As a result, the model effectively improves motion deblurring by fusing the edge information of the high-frequency component and the structural information of the low-frequency component. In addition, a grouped feature fusion technique is developed so as to achieve richer, more three-dimensional and comprehensive utilization of various types of features at a deep level. Next, a multi-scale stripe attention mechanism (MSSA) is designed, which effectively combines the anisotropy and multi-scale information of the image, a move that significantly enhances the capability of the deep model in feature representation. Large-scale comparative studies on various datasets show that the strategy in this paper works better than the recently published measures.

[32]  arXiv:2405.01085 [pdf, other]
Title: Single Image Super-Resolution Based on Global-Local Information Synergy
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Although several image super-resolution solutions exist, they still face many challenges. CNN-based algorithms, despite the reduction in computational complexity, still need to improve their accuracy. While Transformer-based algorithms have higher accuracy, their ultra-high computational complexity makes them difficult to be accepted in practical applications. To overcome the existing challenges, a novel super-resolution reconstruction algorithm is proposed in this paper. The algorithm achieves a significant increase in accuracy through a unique design while maintaining a low complexity. The core of the algorithm lies in its cleverly designed Global-Local Information Extraction Module and Basic Block Module. By combining global and local information, the Global-Local Information Extraction Module aims to understand the image content more comprehensively so as to recover the global structure and local details in the image more accurately, which provides rich information support for the subsequent reconstruction process. Experimental results show that the comprehensive performance of the algorithm proposed in this paper is optimal, providing an efficient and practical new solution in the field of super-resolution reconstruction.

[33]  arXiv:2405.01088 [pdf, other]
Title: Type2Branch: Keystroke Biometrics based on a Dual-branch Architecture with Attention Mechanisms and Set2set Loss
Comments: 13 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In 2021, the pioneering work on TypeNet showed that keystroke dynamics verification could scale to hundreds of thousands of users with minimal performance degradation. Recently, the KVC-onGoing competition has provided an open and robust experimental protocol for evaluating keystroke dynamics verification systems of such scale, including considerations of algorithmic fairness. This article describes Type2Branch, the model and techniques that achieved the lowest error rates at the KVC-onGoing, in both desktop and mobile scenarios. The novelty aspects of the proposed Type2Branch include: i) synthesized timing features emphasizing user behavior deviation from the general population, ii) a dual-branch architecture combining recurrent and convolutional paths with various attention mechanisms, iii) a new loss function named Set2set that captures the global structure of the embedding space, and iv) a training curriculum of increasing difficulty. Considering five enrollment samples per subject of approximately 50 characters typed, the proposed Type2Branch achieves state-of-the-art performance with mean per-subject EERs of 0.77% and 1.03% on evaluation sets of respectively 15,000 and 5,000 subjects for desktop and mobile scenarios. With a uniform global threshold for all subjects, the EERs are 3.25% for desktop and 3.61% for mobile, outperforming previous approaches by a significant margin.

[34]  arXiv:2405.01090 [pdf, other]
Title: Learning Object States from Actions via Large Language Models
Comments: 19 pages of main content, 24 pages of supplementary material
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Temporally localizing the presence of object states in videos is crucial in understanding human activities beyond actions and objects. This task has suffered from a lack of training data due to object states' inherent ambiguity and variety. To avoid exhaustive annotation, learning from transcribed narrations in instructional videos would be intriguing. However, object states are less described in narrations compared to actions, making them less effective. In this work, we propose to extract the object state information from action information included in narrations, using large language models (LLMs). Our observation is that LLMs include world knowledge on the relationship between actions and their resulting object states, and can infer the presence of object states from past action sequences. The proposed LLM-based framework offers flexibility to generate plausible pseudo-object state labels against arbitrary categories. We evaluate our method with our newly collected Multiple Object States Transition (MOST) dataset including dense temporal annotation of 60 object state categories. Our model trained by the generated pseudo-labels demonstrates significant improvement of over 29% in mAP against strong zero-shot vision-language models, showing the effectiveness of explicitly extracting object state information from actions through LLMs.

[35]  arXiv:2405.01095 [pdf, other]
Title: Transformers Fusion across Disjoint Samples for Hyperspectral Image Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

3D Swin Transformer (3D-ST) known for its hierarchical attention and window-based processing, excels in capturing intricate spatial relationships within images. Spatial-spectral Transformer (SST), meanwhile, specializes in modeling long-range dependencies through self-attention mechanisms. Therefore, this paper introduces a novel method: an attentional fusion of these two transformers to significantly enhance the classification performance of Hyperspectral Images (HSIs). What sets this approach apart is its emphasis on the integration of attentional mechanisms from both architectures. This integration not only refines the modeling of spatial and spectral information but also contributes to achieving more precise and accurate classification results. The experimentation and evaluation of benchmark HSI datasets underscore the importance of employing disjoint training, validation, and test samples. The results demonstrate the effectiveness of the fusion approach, showcasing its superiority over traditional methods and individual transformers. Incorporating disjoint samples enhances the robustness and reliability of the proposed methodology, emphasizing its potential for advancing hyperspectral image classification.

[36]  arXiv:2405.01101 [pdf, other]
Title: Enhancing Person Re-Identification via Uncertainty Feature Fusion and Wise Distance Aggregation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The quest for robust Person re-identification (Re-ID) systems capable of accurately identifying subjects across diverse scenarios remains a formidable challenge in surveillance and security applications. This study presents a novel methodology that significantly enhances Person Re-Identification (Re-ID) by integrating Uncertainty Feature Fusion (UFFM) with Wise Distance Aggregation (WDA). Tested on benchmark datasets - Market-1501, DukeMTMC-ReID, and MSMT17 - our approach demonstrates substantial improvements in Rank-1 accuracy and mean Average Precision (mAP). Specifically, UFFM capitalizes on the power of feature synthesis from multiple images to overcome the limitations imposed by the variability of subject appearances across different views. WDA further refines the process by intelligently aggregating similarity metrics, thereby enhancing the system's ability to discern subtle but critical differences between subjects. The empirical results affirm the superiority of our method over existing approaches, achieving new performance benchmarks across all evaluated datasets. Code is available on Github.

[37]  arXiv:2405.01105 [pdf, other]
Title: Image segmentation of treated and untreated tumor spheroids by Fully Convolutional Networks
Comments: 28 pages, 21 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)

Multicellular tumor spheroids (MCTS) are advanced cell culture systems for assessing the impact of combinatorial radio(chemo)therapy. They exhibit therapeutically relevant in-vivo-like characteristics from 3D cell-cell and cell-matrix interactions to radial pathophysiological gradients related to proliferative activity and nutrient/oxygen supply, altering cellular radioresponse. State-of-the-art assays quantify long-term curative endpoints based on collected brightfield image time series from large treated spheroid populations per irradiation dose and treatment arm. Here, spheroid control probabilities are documented analogous to in-vivo tumor control probabilities based on Kaplan-Meier curves. This analyses require laborious spheroid segmentation of up to 100.000 images per treatment arm to extract relevant structural information from the images, e.g., diameter, area, volume and circularity. While several image analysis algorithms are available for spheroid segmentation, they all focus on compact MCTS with clearly distinguishable outer rim throughout growth. However, treated MCTS may partly be detached and destroyed and are usually obscured by dead cell debris. We successfully train two Fully Convolutional Networks, UNet and HRNet, and optimize their hyperparameters to develop an automatic segmentation for both untreated and treated MCTS. We systematically validate the automatic segmentation on larger, independent data sets of spheroids derived from two human head-and-neck cancer cell lines. We find an excellent overlap between manual and automatic segmentation for most images, quantified by Jaccard indices at around 90%. For images with smaller overlap of the segmentations, we demonstrate that this error is comparable to the variations across segmentations from different biological experts, suggesting that these images represent biologically unclear or ambiguous cases.

[38]  arXiv:2405.01108 [pdf, other]
Title: Federated Learning with Heterogeneous Data Handling for Robust Vehicular Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In the pursuit of refining precise perception models for fully autonomous driving, continual online model training becomes essential. Federated Learning (FL) within vehicular networks offers an efficient mechanism for model training while preserving raw sensory data integrity. Yet, FL struggles with non-identically distributed data (e.g., quantity skew), leading to suboptimal convergence rates during model training. In previous work, we introduced FedLA, an innovative Label-Aware aggregation method addressing data heterogeneity in FL for generic scenarios.
In this paper, we introduce FedProx+LA, a novel FL method building upon the state-of-the-art FedProx and FedLA to tackle data heterogeneity, which is specifically tailored for vehicular networks. We evaluate the efficacy of FedProx+LA in continuous online object detection model training. Through a comparative analysis against conventional and state-of-the-art methods, our findings reveal the superior convergence rate of FedProx+LA. Notably, if the label distribution is very heterogeneous, our FedProx+LA approach shows substantial improvements in detection performance compared to baseline methods, also outperforming our previous FedLA approach. Moreover, both FedLA and FedProx+LA increase convergence speed by 30% compared to baseline methods.

[39]  arXiv:2405.01112 [pdf, other]
Title: Sports Analysis and VR Viewing System Based on Player Tracking and Pose Estimation with Multimodal and Multiview Sensors
Comments: arXiv admin note: text overlap with arXiv:2312.06409
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Sports analysis and viewing play a pivotal role in the current sports domain, offering significant value not only to coaches and athletes but also to fans and the media. In recent years, the rapid development of virtual reality (VR) and augmented reality (AR) technologies have introduced a new platform for watching games. Visualization of sports competitions in VR/AR represents a revolutionary technology, providing audiences with a novel immersive viewing experience. However, there is still a lack of related research in this area. In this work, we present for the first time a comprehensive system for sports competition analysis and real-time visualization on VR/AR platforms. First, we utilize multiview LiDARs and cameras to collect multimodal game data. Subsequently, we propose a framework for multi-player tracking and pose estimation based on a limited amount of supervised data, which extracts precise player positions and movements from point clouds and images. Moreover, we perform avatar modeling of players to obtain their 3D models. Ultimately, using these 3D player data, we conduct competition analysis and real-time visualization on VR/AR. Extensive quantitative experiments demonstrate the accuracy and robustness of our multi-player tracking and pose estimation framework. The visualization results showcase the immense potential of our sports visualization system on the domain of watching games on VR/AR devices. The multimodal competition dataset we collected and all related code will be released soon.

[40]  arXiv:2405.01113 [pdf, other]
Title: Domain-Transferred Synthetic Data Generation for Improving Monocular Depth Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

A major obstacle to the development of effective monocular depth estimation algorithms is the difficulty in obtaining high-quality depth data that corresponds to collected RGB images. Collecting this data is time-consuming and costly, and even data collected by modern sensors has limited range or resolution, and is subject to inconsistencies and noise. To combat this, we propose a method of data generation in simulation using 3D synthetic environments and CycleGAN domain transfer. We compare this method of data generation to the popular NYUDepth V2 dataset by training a depth estimation model based on the DenseDepth structure using different training sets of real and simulated data. We evaluate the performance of the models on newly collected images and LiDAR depth data from a Husky robot to verify the generalizability of the approach and show that GAN-transformed data can serve as an effective alternative to real-world data, particularly in depth estimation.

[41]  arXiv:2405.01126 [pdf, other]
Title: Detecting and clustering swallow events in esophageal long-term high-resolution manometry
Subjects: Computer Vision and Pattern Recognition (cs.CV)

High-resolution manometry (HRM) is the gold standard in diagnosing esophageal motility disorders. As HRM is typically conducted under short-term laboratory settings, intermittently occurring disorders are likely to be missed. Therefore, long-term (up to 24h) HRM (LTHRM) is used to gain detailed insights into the swallowing behavior. However, analyzing the extensive data from LTHRM is challenging and time consuming as medical experts have to analyze the data manually, which is slow and prone to errors. To address this challenge, we propose a Deep Learning based swallowing detection method to accurately identify swallowing events and secondary non-deglutitive-induced esophageal motility disorders in LTHRM data. We then proceed with clustering the identified swallows into distinct classes, which are analyzed by highly experienced clinicians to validate the different swallowing patterns. We evaluate our computational pipeline on a total of 25 LTHRMs, which were meticulously annotated by medical experts. By detecting more than 94% of all relevant swallow events and providing all relevant clusters for a more reliable diagnostic process among experienced clinicians, we are able to demonstrate the effectiveness as well as positive clinical impact of our approach to make LTHRM feasible in clinical care.

[42]  arXiv:2405.01130 [pdf, other]
Title: Automated Virtual Product Placement and Assessment in Images using Diffusion Models
Comments: Accepted at the 6th AI for Content Creation (AI4CC) workshop at CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In Virtual Product Placement (VPP) applications, the discrete integration of specific brand products into images or videos has emerged as a challenging yet important task. This paper introduces a novel three-stage fully automated VPP system. In the first stage, a language-guided image segmentation model identifies optimal regions within images for product inpainting. In the second stage, Stable Diffusion (SD), fine-tuned with a few example product images, is used to inpaint the product into the previously identified candidate regions. The final stage introduces an "Alignment Module", which is designed to effectively sieve out low-quality images. Comprehensive experiments demonstrate that the Alignment Module ensures the presence of the intended product in every generated image and enhances the average quality of images by 35%. The results presented in this paper demonstrate the effectiveness of the proposed VPP system, which holds significant potential for transforming the landscape of virtual advertising and marketing strategies.

[43]  arXiv:2405.01156 [pdf, other]
Title: Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

An accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness no failures during tracking. To achieve that, one needs to efficiently tackle challenges, such as: device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, as well as the continuous movement due to cardiac and respiratory motion. To overcome the aforementioned challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame interpolation based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and in particular robustness compared to ultra optimized reference solutions (that use multi-stage feature fusion, multi-task and flow regularization). The experiments show that our method achieves 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used); achieving a success score of 97.95% at a 3x faster inference speed of 42 frames-per-second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.

[44]  arXiv:2405.01170 [pdf, other]
Title: GroupedMixer: An Entropy Model with Group-wise Token-Mixers for Learned Image Compression
Comments: Accepted by IEEE TCSVT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Transformer-based entropy models have gained prominence in recent years due to their superior ability to capture long-range dependencies in probability distribution estimation compared to convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we propose a novel transformer-based entropy model called GroupedMixer, which enjoys both faster coding speed and better compression performance than previous transformer-based methods. Specifically, our approach builds upon group-wise autoregression by first partitioning the latent variables into groups along spatial-channel dimensions, and then entropy coding the groups with the proposed transformer-based entropy model. The global causal self-attention is decomposed into more efficient group-wise interactions, implemented using inner-group and cross-group token-mixers. The inner-group token-mixer incorporates contextual elements within a group while the cross-group token-mixer interacts with previously decoded groups. Alternate arrangement of two token-mixers enables global contextual reference. To further expedite the network inference, we introduce context cache optimization to GroupedMixer, which caches attention activation values in cross-group token-mixers and avoids complex and duplicated computation. Experimental results demonstrate that the proposed GroupedMixer yields the state-of-the-art rate-distortion performance with fast compression speed.

[45]  arXiv:2405.01175 [pdf, other]
Title: Uncertainty-aware self-training with expectation maximization basis transformation
Journal-ref: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Self-training is a powerful approach to deep learning. The key process is to find a pseudo-label for modeling. However, previous self-training algorithms suffer from the over-confidence issue brought by the hard labels, even some confidence-related regularizers cannot comprehensively catch the uncertainty. Therefore, we propose a new self-training framework to combine uncertainty information of both model and dataset. Specifically, we propose to use Expectation-Maximization (EM) to smooth the labels and comprehensively estimate the uncertainty information. We further design a basis extraction network to estimate the initial basis from the dataset. The obtained basis with uncertainty can be filtered based on uncertainty information. It can then be transformed into the real hard label to iteratively update the model and basis in the retraining process. Experiments on image classification and semantic segmentation show the advantages of our methods among confidence-aware self-training algorithms with 1-3 percentage improvement on different datasets.

[46]  arXiv:2405.01199 [pdf, other]
Title: Latent Fingerprint Matching via Dense Minutia Descriptor
Comments: 10 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Latent fingerprint matching is a daunting task, primarily due to the poor quality of latent fingerprints. In this study, we propose a deep-learning based dense minutia descriptor (DMD) for latent fingerprint matching. A DMD is obtained by extracting the fingerprint patch aligned by its central minutia, capturing detailed minutia information and texture information. Our dense descriptor takes the form of a three-dimensional representation, with two dimensions associated with the original image plane and the other dimension representing the abstract features. Additionally, the extraction process outputs the fingerprint segmentation map, ensuring that the descriptor is only valid in the foreground region. The matching between two descriptors occurs in their overlapping regions, with a score normalization strategy to reduce the impact brought by the differences outside the valid area. Our descriptor achieves state-of-the-art performance on several latent fingerprint datasets. Overall, our DMD is more representative and interpretable compared to previous methods.

[47]  arXiv:2405.01204 [pdf, other]
Title: Towards Cross-Scale Attention and Surface Supervision for Fractured Bone Segmentation in CT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Bone segmentation is an essential step for the preoperative planning of fracture trauma surgery. The automated segmentation of fractured bone from computed tomography (CT) scans remains challenging, due to the large differences of fractures in position and morphology, and also the inherent anatomical characteristics of different bone structures. To alleviate these issues, we propose a cross-scale attention mechanism as well as a surface supervision strategy for fractured bone segmentation in CT. Specifically, a cross-scale attention mechanism is introduced to effectively aggregate the features among different scales to provide more powerful fracture representation. Moreover, a surface supervision strategy is employed, which explicitly constrains the network to pay more attention to the bone boundary. The efficacy of the proposed method is evaluated on a public dataset containing CT scans with hip fractures. The evaluation metrics are Dice similarity coefficient (DSC), average symmetric surface distance (ASSD), and Hausdorff distance (95HD). The proposed method achieves an average DSC of 93.36%, ASSD of 0.85mm, 95HD of 7.51mm. Our method offers an effective fracture segmentation approach for the pelvic CT examinations, and has the potential to be used for improving the segmentation performance of other types of fractures.

[48]  arXiv:2405.01217 [pdf, other]
Title: CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation
Comments: Accepted as an oral presentation by ICLR 2024 ML4RS workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We study the potential of noisy labels y to pretrain semantic segmentation models in a multi-modal learning framework for geospatial applications. Specifically, we propose a novel Cross-modal Sample Selection method (CromSS) that utilizes the class distributions P^{(d)}(x,c) over pixels x and classes c modelled by multiple sensors/modalities d of a given geospatial scene. Consistency of predictions across sensors $d$ is jointly informed by the entropy of P^{(d)}(x,c). Noisy label sampling we determine by the confidence of each sensor d in the noisy class label, P^{(d)}(x,c=y(x)). To verify the performance of our approach, we conduct experiments with Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery from the globally-sampled SSL4EO-S12 dataset. We pair those scenes with 9-class noisy labels sourced from the Google Dynamic World project for pretraining. Transfer learning evaluations (downstream task) on the DFC2020 dataset confirm the effectiveness of the proposed method for remote sensing image segmentation.

[49]  arXiv:2405.01228 [pdf, other]
Title: RaffeSDG: Random Frequency Filtering enabled Single-source Domain Generalization for Medical Image Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning models often encounter challenges in making accurate inferences when there are domain shifts between the source and target data. This issue is particularly pronounced in clinical settings due to the scarcity of annotated data resulting from the professional and private nature of medical data. Despite the existence of decent solutions, many of them are hindered in clinical settings due to limitations in data collection and computational complexity. To tackle domain shifts in data-scarce medical scenarios, we propose a Random frequency filtering enabled Single-source Domain Generalization algorithm (RaffeSDG), which promises robust out-of-domain inference with segmentation models trained on a single-source domain. A filter-based data augmentation strategy is first proposed to promote domain variability within a single-source domain by introducing variations in frequency space and blending homologous samples. Then Gaussian filter-based structural saliency is also leveraged to learn robust representations across augmented samples, further facilitating the training of generalizable segmentation models. To validate the effectiveness of RaffeSDG, we conducted extensive experiments involving out-of-domain inference on segmentation tasks for three human tissues imaged by four diverse modalities. Through thorough investigations and comparisons, compelling evidence was observed in these experiments, demonstrating the potential and generalizability of RaffeSDG. The code is available at https://github.com/liamheng/Non-IID_Medical_Image_Segmentation.

[50]  arXiv:2405.01230 [pdf, other]
Title: Evaluation of Video-Based rPPG in Challenging Environments: Artifact Mitigation and Network Resilience
Comments: 22 main article pages with 3 supplementary pages, journal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Video-based remote photoplethysmography (rPPG) has emerged as a promising technology for non-contact vital sign monitoring, especially under controlled conditions. However, the accurate measurement of vital signs in real-world scenarios faces several challenges, including artifacts induced by videocodecs, low-light noise, degradation, low dynamic range, occlusions, and hardware and network constraints. In this article, we systematically investigate comprehensive investigate these issues, measuring their detrimental effects on the quality of rPPG measurements. Additionally, we propose practical strategies for mitigating these challenges to improve the dependability and resilience of video-based rPPG systems. We detail methods for effective biosignal recovery in the presence of network limitations and present denoising and inpainting techniques aimed at preserving video frame integrity. Through extensive evaluations and direct comparisons, we demonstrate the effectiveness of the approaches in enhancing rPPG measurements under challenging environments, contributing to the development of more reliable and effective remote vital sign monitoring technologies.

[51]  arXiv:2405.01258 [pdf, other]
Title: Towards Consistent Object Detection via LiDAR-Camera Synergy
Comments: The source code will be made publicly available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

As human-machine interaction continues to evolve, the capacity for environmental perception is becoming increasingly crucial. Integrating the two most common types of sensory data, images, and point clouds, can enhance detection accuracy. However, currently, no model exists that can simultaneously detect an object's position in both point clouds and images and ascertain their corresponding relationship. This information is invaluable for human-machine interactions, offering new possibilities for their enhancement. In light of this, this paper introduces an end-to-end Consistency Object Detection (COD) algorithm framework that requires only a single forward inference to simultaneously obtain an object's position in both point clouds and images and establish their correlation. Furthermore, to assess the accuracy of the object correlation between point clouds and images, this paper proposes a new evaluation metric, Consistency Precision (CP). To verify the effectiveness of the proposed framework, an extensive set of experiments has been conducted on the KITTI and DAIR-V2X datasets. The study also explored how the proposed consistency detection method performs on images when the calibration parameters between images and point clouds are disturbed, compared to existing post-processing methods. The experimental results demonstrate that the proposed method exhibits excellent detection performance and robustness, achieving end-to-end consistency detection. The source code will be made publicly available at https://github.com/xifen523/COD.

[52]  arXiv:2405.01273 [pdf, other]
Title: Towards Inclusive Face Recognition Through Synthetic Ethnicity Alteration
Comments: 8 Pages
Journal-ref: Automatic Face and Gesture Recognition 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Numerous studies have shown that existing Face Recognition Systems (FRS), including commercial ones, often exhibit biases toward certain ethnicities due to under-represented data. In this work, we explore ethnicity alteration and skin tone modification using synthetic face image generation methods to increase the diversity of datasets. We conduct a detailed analysis by first constructing a balanced face image dataset representing three ethnicities: Asian, Black, and Indian. We then make use of existing Generative Adversarial Network-based (GAN) image-to-image translation and manifold learning models to alter the ethnicity from one to another. A systematic analysis is further conducted to assess the suitability of such datasets for FRS by studying the realistic skin-tone representation using Individual Typology Angle (ITA). Further, we also analyze the quality characteristics using existing Face image quality assessment (FIQA) approaches. We then provide a holistic FRS performance analysis using four different systems. Our findings pave the way for future research works in (i) developing both specific ethnicity and general (any to any) ethnicity alteration models, (ii) expanding such approaches to create databases with diverse skin tones, (iii) creating datasets representing various ethnicities which further can help in mitigating bias while addressing privacy concerns.

[53]  arXiv:2405.01311 [pdf, other]
Title: Imagine the Unseen: Occluded Pedestrian Detection via Adversarial Feature Completion
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Pedestrian detection has significantly progressed in recent years, thanks to the development of DNNs. However, detection performance at occluded scenes is still far from satisfactory, as occlusion increases the intra-class variance of pedestrians, hindering the model from finding an accurate classification boundary between pedestrians and background clutters. From the perspective of reducing intra-class variance, we propose to complete features for occluded regions so as to align the features of pedestrians across different occlusion patterns. An important premise for feature completion is to locate occluded regions. From our analysis, channel features of different pedestrian proposals only show high correlation values at visible parts and thus feature correlations can be used to model occlusion patterns. In order to narrow down the gap between completed features and real fully visible ones, we propose an adversarial learning method, which completes occluded features with a generator such that they can hardly be distinguished by the discriminator from real fully visible features. We report experimental results on the CityPersons, Caltech and CrowdHuman datasets. On CityPersons, we show significant improvements over five different baseline detectors, especially on the heavy occlusion subset. Furthermore, we show that our proposed method FeatComp++ achieves state-of-the-art results on all the above three datasets without relying on extra cues.

[54]  arXiv:2405.01326 [pdf, other]
Title: Multi-modal Learnable Queries for Image Aesthetics Assessment
Comments: Accepted by ICME2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image aesthetics assessment (IAA) is attracting wide interest with the prevalence of social media. The problem is challenging due to its subjective and ambiguous nature. Instead of directly extracting aesthetic features solely from the image, user comments associated with an image could potentially provide complementary knowledge that is useful for IAA. With existing large-scale pre-trained models demonstrating strong capabilities in extracting high-quality transferable visual and textual features, learnable queries are shown to be effective in extracting useful features from the pre-trained visual features. Therefore, in this paper, we propose MMLQ, which utilizes multi-modal learnable queries to extract aesthetics-related features from multi-modal pre-trained features. Extensive experimental results demonstrate that MMLQ achieves new state-of-the-art performance on multi-modal IAA, beating previous methods by 7.7% and 8.3% in terms of SRCC and PLCC, respectively.

[55]  arXiv:2405.01337 [pdf, other]
Title: Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Action recognition has become one of the popular research topics in computer vision. There are various methods based on Convolutional Networks and self-attention mechanisms as Transformers to solve both spatial and temporal dimensions problems of action recognition tasks that achieve competitive performances. However, these methods lack a guarantee of the correctness of the action subject that the models give attention to, i.e., how to ensure an action recognition model focuses on the proper action subject to make a reasonable action prediction. In this paper, we propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos using Directed Gromov-Wasserstein Discrepancy. Furthermore, our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets. Therefore, the contributions in this work are three-fold. Firstly, we introduce the multi-view attention consistency to solve the problem of reasonable prediction in action recognition. Secondly, we define a new metric for multi-view consistent attention using Directed Gromov-Wasserstein Discrepancy. Thirdly, we built an action recognition model based on Video Transformers and Neural Radiance Fields. Compared to the recent action recognition methods, the proposed approach achieves state-of-the-art results on three large-scale datasets, i.e., Jester, Something-Something V2, and Kinetics-400.

[56]  arXiv:2405.01353 [pdf, other]
Title: Sparse multi-view hand-object reconstruction for unseen environments
Comments: Camera-ready version. Paper accepted to CVPRW 2024. 8 pages, 7 figures, 1 table
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real world recorded hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.

[57]  arXiv:2405.01356 [pdf, other]
Title: Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
Comments: Accepted to CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt. In this work, we propose Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance, one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally, we demonstrate its applicability in second-order customization methods, where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements, as evidenced by our evaluations and user studies.

[58]  arXiv:2405.01373 [pdf, other]
Title: ATOM: Attention Mixer for Efficient Dataset Distillation
Comments: Accepted for an oral presentation in CVPR-DD 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent works in dataset distillation seek to minimize training expenses by generating a condensed synthetic dataset that encapsulates the information present in a larger real dataset. These approaches ultimately aim to attain test accuracy levels akin to those achieved by models trained on the entirety of the original dataset. Previous studies in feature and distribution matching have achieved significant results without incurring the costs of bi-level optimization in the distillation process. Despite their convincing efficiency, many of these methods suffer from marginal downstream performance improvements, limited distillation of contextual information, and subpar cross-architecture generalization. To address these challenges in dataset distillation, we propose the ATtentiOn Mixer (ATOM) module to efficiently distill large datasets using a mixture of channel and spatial-wise attention in the feature matching process. Spatial-wise attention helps guide the learning process based on consistent localization of classes in their respective images, allowing for distillation from a broader receptive field. Meanwhile, channel-wise attention captures the contextual information associated with the class itself, thus making the synthetic image more informative for training. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets, including CIFAR10/100 and TinyImagenet. Notably, our method significantly improves performance in scenarios with a low number of images per class, thereby enhancing its potential. Furthermore, we maintain the improvement in cross-architectures and applications such as neural architecture search.

[59]  arXiv:2405.01409 [pdf, other]
Title: Goal-conditioned reinforcement learning for ultrasound navigation guidance
Comments: 11 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Transesophageal echocardiography (TEE) plays a pivotal role in cardiology for diagnostic and interventional procedures. However, using it effectively requires extensive training due to the intricate nature of image acquisition and interpretation. To enhance the efficiency of novice sonographers and reduce variability in scan acquisitions, we propose a novel ultrasound (US) navigation assistance method based on contrastive learning as goal-conditioned reinforcement learning (GCRL). We augment the previous framework using a novel contrastive patient batching method (CPB) and a data-augmented contrastive loss, both of which we demonstrate are essential to ensure generalization to anatomical variations across patients. The proposed framework enables navigation to both standard diagnostic as well as intricate interventional views with a single model. Our method was developed with a large dataset of 789 patients and obtained an average error of 6.56 mm in position and 9.36 degrees in angle on a testing dataset of 140 patients, which is competitive or superior to models trained on individual views. Furthermore, we quantitatively validate our method's ability to navigate to interventional views such as the Left Atrial Appendage (LAA) view used in LAA closure. Our approach holds promise in providing valuable guidance during transesophageal ultrasound examinations, contributing to the advancement of skill acquisition for cardiac ultrasound practitioners.

[60]  arXiv:2405.01413 [pdf, other]
Title: MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors
Comments: 17 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D.

[61]  arXiv:2405.01434 [pdf, other]
Title: StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.

[62]  arXiv:2405.01439 [pdf, other]
Title: Improving Domain Generalization on Gaze Estimation via Branch-out Auxiliary Regularization
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes. Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications. This letter introduces Branch-out Auxiliary Regularization (BAR), an innovative method designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data. Specifically, BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features. These auxiliary pathways strengthen the core network and are integrated in a smooth, plug-and-play manner, facilitating easy adaptation to various other models. Comprehensive experimental evaluations on four cross-dataset tasks demonstrate the superiority of our approach.

[63]  arXiv:2405.01461 [pdf, other]
Title: SATO: Stable Text-to-Motion Framework
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness trade-off. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduced a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance.

[64]  arXiv:2405.01469 [pdf, other]
Title: Advancing human-centric AI for robust X-ray analysis through holistic self-supervised learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

AI Foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We present RayDINO, a large visual encoder trained by self-supervision on 873k chest X-rays. We compare RayDINO to previous state-of-the-art models across nine radiology tasks, from classification and dense segmentation to text generation, and provide an in depth analysis of population, age and sex biases of our model. Our findings suggest that self-supervision allows patient-centric AI proving useful in clinical workflows and interpreting X-rays holistically. With RayDINO and small task-specific adapters, we reach state-of-the-art results and improve generalization to unseen populations while mitigating bias, illustrating the true promise of foundation models: versatility and robustness.

[65]  arXiv:2405.01483 [pdf, other]
Title: MANTIS: Interleaved Multi-Image Instruction Tuning
Comments: 9 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The recent years have witnessed a great array of large multimodal models (LMMs) to effectively solve single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existing multi-image LMMs (e.g. OpenFlamingo, Emu, Idefics, etc) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from web, which is neither efficient nor effective. In this paper, we aim at building strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K instances from 14 multi-image datasets. We design Mantis-Instruct to cover different multi-image skills like co-reference, reasoning, comparing, temporal understanding. We combine Mantis-Instruct with several single-image visual-language datasets to train our model Mantis to handle any interleaved image-text inputs. We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks. Though only requiring academic-level resources (i.e. 36 hours on 16xA100-40G), Mantis-8B can achieve state-of-the-art performance on all the multi-image benchmarks and beats the existing best multi-image LMM Idefics2-8B by an average of 9 absolute points. We observe that Mantis performs equivalently well on the held-in and held-out evaluation benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis can maintain a strong single-image performance on par with CogVLM and Emu2. Our results are particularly encouraging as it shows that low-cost instruction tuning is indeed much more effective than intensive pre-training in terms of building multi-image LMMs.

[66]  arXiv:2405.01494 [pdf, other]
Title: Navigating Heterogeneity and Privacy in One-Shot Federated Learning with Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Federated learning (FL) enables multiple clients to train models collectively while preserving data privacy. However, FL faces challenges in terms of communication cost and data heterogeneity. One-shot federated learning has emerged as a solution by reducing communication rounds, improving efficiency, and providing better security against eavesdropping attacks. Nevertheless, data heterogeneity remains a significant challenge, impacting performance. This work explores the effectiveness of diffusion models in one-shot FL, demonstrating their applicability in addressing data heterogeneity and improving FL performance. Additionally, we investigate the utility of our diffusion model approach, FedDiff, compared to other one-shot FL methods under differential privacy (DP). Furthermore, to improve generated sample quality under DP settings, we propose a pragmatic Fourier Magnitude Filtering (FMF) method, enhancing the effectiveness of generated data for global model training.

[67]  arXiv:2405.01496 [pdf, other]
Title: LocInv: Localization-aware Inversion for Text-Guided Image Editing
Comments: Accepted by CVPR 2024 Workshop AI4CC
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively.The code will be released at https://github.com/wangkai930418/DPL

[68]  arXiv:2405.01521 [pdf, other]
Title: Transformer-Aided Semantic Communications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)

The transformer structure employed in large language models (LLMs), as a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for their ability to identify and highlight the most relevant aspects of input data. Such a capability is particularly beneficial in addressing a variety of communication challenges, notably in the realm of semantic communication where proper encoding of the relevant data is critical especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the purpose of compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Through the use of the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data in accordance with their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework using the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that our framework successfully preserves semantic information, even when only a fraction of the encoded data is transmitted, according to the intended compression rates.

[69]  arXiv:2405.01533 [pdf, other]
Title: OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The advances in multimodal large language models (MLLMs) have led to growing interests in LLM-based autonomous driving agents to leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.

[70]  arXiv:2405.01536 [pdf, other]
Title: Customizing Text-to-Image Models with a Single Image Pair
Comments: project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.

[71]  arXiv:2405.01538 [pdf, other]
Title: Multi-Space Alignments Towards Universal LiDAR Segmentation
Comments: CVPR 2024; 33 pages, 14 figures, 14 tables; Code at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

A unified and versatile LiDAR segmentation model with strong robustness and generalizability is desirable for safe autonomous driving perception. This work presents M3Net, a one-of-a-kind framework for fulfilling multi-task, multi-dataset, multi-modality LiDAR segmentation in a universal manner using just a single set of parameters. To better exploit data volume and diversity, we first combine large-scale driving datasets acquired by different types of sensors from diverse scenes and then conduct alignments in three spaces, namely data, feature, and label spaces, during the training. As a result, M3Net is capable of taming heterogeneous data for training state-of-the-art LiDAR segmentation models. Extensive experiments on twelve LiDAR segmentation datasets verify our effectiveness. Notably, using a shared set of parameters, M3Net achieves 75.1%, 83.1%, and 72.4% mIoU scores, respectively, on the official benchmarks of SemanticKITTI, nuScenes, and Waymo Open.

Cross-lists for Fri, 3 May 24

[72]  arXiv:2405.00682 (cross-list from eess.SP) [pdf, ps, other]
Title: SynthBrainGrow: Synthetic Diffusion Brain Aging for Longitudinal MRI Data Generation in Young People
Comments: 8 pages, 4 figures
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Synthetic longitudinal brain MRI simulates brain aging and would enable more efficient research on neurodevelopmental and neurodegenerative conditions. Synthetically generated, age-adjusted brain images could serve as valuable alternatives to costly longitudinal imaging acquisitions, serve as internal controls for studies looking at the effects of environmental or therapeutic modifiers on brain development, and allow data augmentation for diverse populations. In this paper, we present a diffusion-based approach called SynthBrainGrow for synthetic brain aging with a two-year step. To validate the feasibility of using synthetically-generated data on downstream tasks, we compared structural volumetrics of two-year-aged brains against synthetically-aged brain MRI. Results show that SynthBrainGrow can accurately capture substructure volumetrics and simulate structural changes such as ventricle enlargement and cortical thinning. Our approach provides a novel way to generate longitudinal brain datasets from cross-sectional data to enable augmented training and benchmarking of computational tools for analyzing lifespan trajectories. This work signifies an important advance in generative modeling to synthesize realistic longitudinal data with limited lifelong MRI scans. The code is available at XXX.

[73]  arXiv:2405.00685 (cross-list from cs.RO) [pdf, ps, other]
Title: The active visual sensing methods for robotic welding: review, tutorial and prospect
Authors: ZhenZhou Wang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

The visual sensing system is one of the most important parts of the welding robots to realize intelligent and autonomous welding. The active visual sensing methods have been widely adopted in robotic welding because of their higher accuracies compared to the passive visual sensing methods. In this paper, we give a comprehensive review of the active visual sensing methods for robotic welding. According to their uses, we divide the state-of-the-art active visual sensing methods into four categories: seam tracking, weld bead defect detection, 3D weld pool geometry measurement and welding path planning. Firstly, we review the principles of these active visual sensing methods. Then, we give a tutorial of the 3D calibration methods for the active visual sensing systems used in intelligent welding robots to fill the gaps in the related fields. At last, we compare the reviewed active visual sensing methods and give the prospects based on their advantages and disadvantages.

[74]  arXiv:2405.00686 (cross-list from cs.NE) [pdf, ps, other]
Title: Technical Report on BaumEvA Evolutionary Optimization Python-Library Testing
Comments: The paper consists of 30 pages, 37 figures, 5 tables
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

This report presents the test results Python library BaumEvA, which implements evolutionary algorithms for optimizing various types of problems, including computer vision tasks accompanied by the search for optimal model architectures. Testing was carried out to evaluate the effectiveness and reliability of the pro-posed methods, as well as to determine their applicability in various fields. Dur-ing testing, various test functions and parameters of evolutionary algorithms were used, which made it possible to evaluate their performance in a wide range of conditions. Test results showed that the library provides effective and reliable methods for solving optimization problems. However, some limitations were identified related to computational resources and execution time of algorithms on problems with large dimensions. The report includes a detailed description of the tests performed, the results obtained and conclusions about the applicability of the genetic algorithm in various tasks. Recommendations for choosing algorithm pa-rameters and using the library to achieve the best results are also provided. The report may be useful to developers involved in the optimization of complex com-puting systems, as well as to researchers studying the possibilities of using evo-lutionary algorithms in various fields of science and technology.

[75]  arXiv:2405.00739 (cross-list from cs.LG) [pdf, other]
Title: Why does Knowledge Distillation Work? Rethink its Attention and Fidelity Mechanism
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Does Knowledge Distillation (KD) really work? Conventional wisdom viewed it as a knowledge transfer procedure where a perfect mimicry of the student to its teacher is desired. However, paradoxical studies indicate that closely replicating the teacher's behavior does not consistently improve student generalization, posing questions on its possible causes. Confronted with this gap, we hypothesize that diverse attentions in teachers contribute to better student generalization at the expense of reduced fidelity in ensemble KD setups. By increasing data augmentation strengths, our key findings reveal a decrease in the Intersection over Union (IoU) of attentions between teacher models, leading to reduced student overfitting and decreased fidelity. We propose this low-fidelity phenomenon as an underlying characteristic rather than a pathology when training KD. This suggests that stronger data augmentation fosters a broader perspective provided by the divergent teacher ensemble and lower student-teacher mutual information, benefiting generalization performance. These insights clarify the mechanism on low-fidelity phenomenon in KD. Thus, we offer new perspectives on optimizing student model performance, by emphasizing increased diversity in teacher attentions and reduced mimicry behavior between teachers and student.

[76]  arXiv:2405.00797 (cross-list from cs.RO) [pdf, other]
Title: ADM: Accelerated Diffusion Model via Estimated Priors for Robust Motion Prediction under Uncertainties
Comments: 7 pages, 4 figures
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Motion prediction is a challenging problem in autonomous driving as it demands the system to comprehend stochastic dynamics and the multi-modal nature of real-world agent interactions. Diffusion models have recently risen to prominence, and have proven particularly effective in pedestrian motion prediction tasks. However, the significant time consumption and sensitivity to noise have limited the real-time predictive capability of diffusion models. In response to these impediments, we propose a novel diffusion-based, acceleratable framework that adeptly predicts future trajectories of agents with enhanced resistance to noise. The core idea of our model is to learn a coarse-grained prior distribution of trajectory, which can skip a large number of denoise steps. This advancement not only boosts sampling efficiency but also maintains the fidelity of prediction accuracy. Our method meets the rigorous real-time operational standards essential for autonomous vehicles, enabling prompt trajectory generation that is vital for secure and efficient navigation. Through extensive experiments, our method speeds up the inference time to 136ms compared to standard diffusion model, and achieves significant improvement in multi-agent motion prediction on the Argoverse 1 motion forecasting dataset.

[77]  arXiv:2405.00956 (cross-list from cs.RO) [pdf, other]
Title: Efficient Data-driven Scene Simulation using Robotic Surgery Videos via Physics-embedded 3D Gaussians
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Surgical scene simulation plays a crucial role in surgical education and simulator-based robot learning. Traditional approaches for creating these environments with surgical scene involve a labor-intensive process where designers hand-craft tissues models with textures and geometries for soft body simulations. This manual approach is not only time-consuming but also limited in the scalability and realism. In contrast, data-driven simulation offers a compelling alternative. It has the potential to automatically reconstruct 3D surgical scenes from real-world surgical video data, followed by the application of soft body physics. This area, however, is relatively uncharted. In our research, we introduce 3D Gaussian as a learnable representation for surgical scene, which is learned from stereo endoscopic video. To prevent over-fitting and ensure the geometrical correctness of these scenes, we incorporate depth supervision and anisotropy regularization into the Gaussian learning process. Furthermore, we apply the Material Point Method, which is integrated with physical properties, to the 3D Gaussians to achieve realistic scene deformations. Our method was evaluated on our collected in-house and public surgical videos datasets. Results show that it can reconstruct and simulate surgical scenes from endoscopic videos efficiently-taking only a few minutes to reconstruct the surgical scene-and produce both visually and physically plausible deformations at a speed approaching real-time. The results demonstrate great potential of our proposed method to enhance the efficiency and variety of simulations available for surgical education and robot learning.

[78]  arXiv:2405.00980 (cross-list from cs.CL) [pdf, other]
Title: A Hong Kong Sign Language Corpus Collected from Sign-interpreted TV News
Comments: Accepted by LREC-COLING 2024
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

This paper introduces TVB-HKSL-News, a new Hong Kong sign language (HKSL) dataset collected from a TV news program over a period of 7 months. The dataset is collected to enrich resources for HKSL and support research in large-vocabulary continuous sign language recognition (SLR) and translation (SLT). It consists of 16.07 hours of sign videos of two signers with a vocabulary of 6,515 glosses (for SLR) and 2,850 Chinese characters or 18K Chinese words (for SLT). One signer has 11.66 hours of sign videos and the other has 4.41 hours. One objective in building the dataset is to support the investigation of how well large-vocabulary continuous sign language recognition/translation can be done for a single signer given a (relatively) large amount of his/her training data, which could potentially lead to the development of new modeling methods. Besides, most parts of the data collection pipeline are automated with little human intervention; we believe that our collection method can be scaled up to collect more sign language data easily for SLT in the future for any sign languages if such sign-interpreted videos are available. We also run a SOTA SLR/SLT model on the dataset and get a baseline SLR word error rate of 34.08% and a baseline SLT BLEU-4 score of 23.58 for benchmarking future research on the dataset.

[79]  arXiv:2405.00984 (cross-list from cs.LG) [pdf, other]
Title: FREE: Faster and Better Data-Free Meta-Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data, presenting practical benefits in contexts constrained by data privacy concerns. Current DFML methods primarily focus on the data recovery from these pre-trained models. However, they suffer from slow recovery speed and overlook gaps inherent in heterogeneous pre-trained models. In response to these challenges, we introduce the Faster and Better Data-Free Meta-Learning (FREE) framework, which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks. Specifically, within the module Faster Inversion via Meta-Generator, each pre-trained model is perceived as a distinct task. The meta-generator can rapidly adapt to a specific task in just five steps, significantly accelerating the data recovery. Furthermore, we propose Better Generalization via Meta-Learner and introduce an implicit gradient alignment algorithm to optimize the meta-learner. This is achieved as aligned gradient directions alleviate potential conflicts among tasks from heterogeneous pre-trained models. Empirical experiments on multiple benchmarks affirm the superiority of our approach, marking a notable speed-up (20$\times$) and performance enhancement (1.42\% $\sim$ 4.78\%) in comparison to the state-of-the-art.

[80]  arXiv:2405.01004 (cross-list from cs.SD) [pdf, ps, other]
Title: Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Recent transformer-based ASR models have achieved word-error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources, contributing to significant carbon footprints. The traditional server-based architecture of ASR also presents privacy concerns, alongside reliability and latency issues due to network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by effectively balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the performance of various ASR model inference on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speeds, quantization, energy efficiency, and memory needs. We found that changing precision from fp32 to fp16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and number of parameters neither guarantees better resilience to noise, nor predicts the energy consumption for a given transcription load. These, along with several other findings offer novel insights for optimizing ASR systems within energy- and memory-limited environments, crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open sourced are available on [https://github.com/zzadiues3338/ASR-energy-jetson].

[81]  arXiv:2405.01012 (cross-list from q-bio.NC) [pdf, other]
Title: Correcting Biased Centered Kernel Alignment Measures in Biological and Artificial Neural Networks
Comments: ICLR 2024 Re-Align Workshop
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)

Centred Kernel Alignment (CKA) has recently emerged as a popular metric to compare activations from biological and artificial neural networks (ANNs) in order to quantify the alignment between internal representations derived from stimuli sets (e.g. images, text, video) that are presented to both systems. In this paper we highlight issues that the community should take into account if using CKA as an alignment metric with neural data. Neural data are in the low-data high-dimensionality domain, which is one of the cases where (biased) CKA results in high similarity scores even for pairs of random matrices. Using fMRI and MEG data from the THINGS project, we show that if biased CKA is applied to representations of different sizes in the low-data high-dimensionality domain, they are not directly comparable due to biased CKA's sensitivity to differing feature-sample ratios and not stimuli-driven responses. This situation can arise both when comparing a pre-selected area of interest (e.g. ROI) to multiple ANN layers, as well as when determining to which ANN layer multiple regions of interest (ROIs) / sensor groups of different dimensionality are most similar. We show that biased CKA can be artificially driven to its maximum value when using independent random data of different sample-feature ratios. We further show that shuffling sample-feature pairs of real neural data does not drastically alter biased CKA similarity in comparison to unshuffled data, indicating an undesirable lack of sensitivity to stimuli-driven neural responses. Positive alignment of true stimuli-driven responses is only achieved by using debiased CKA. Lastly, we report findings that suggest biased CKA is sensitive to the inherent structure of neural data, only differing from shuffled data when debiased CKA detects stimuli-driven alignment.

[82]  arXiv:2405.01054 (cross-list from cs.RO) [pdf, other]
Title: Continual Learning for Robust Gate Detection under Dynamic Lighting in Autonomous Drone Racing
Comments: 8 pages, 6 figures, in 2024 International Joint Conference on Neural Networks (IJCNN)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In autonomous and mobile robotics, a principal challenge is resilient real-time environmental perception, particularly in situations characterized by unknown and dynamic elements, as exemplified in the context of autonomous drone racing. This study introduces a perception technique for detecting drone racing gates under illumination variations, which is common during high-speed drone flights. The proposed technique relies upon a lightweight neural network backbone augmented with capabilities for continual learning. The envisaged approach amalgamates predictions of the gates' positional coordinates, distance, and orientation, encapsulating them into a cohesive pose tuple. A comprehensive number of tests serve to underscore the efficacy of this approach in confronting diverse and challenging scenarios, specifically those involving variable lighting conditions. The proposed methodology exhibits notable robustness in the face of illumination variations, thereby substantiating its effectiveness.

[83]  arXiv:2405.01060 (cross-list from cs.LG) [pdf, other]
Title: A text-based, generative deep learning model for soil reflectance spectrum simulation in the VIS-NIR (400-2499 nm) bands
Comments: The paper has been submitted to Remote sensing of Environment and revised
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Simulating soil reflectance spectra is invaluable for soil-plant radiative modeling and training machine learning models, yet it is difficult as the intricate relationships between soil structure and its constituents. To address this, a fully data-driven soil optics generative model (SOGM) for simulation of soil reflectance spectra based on soil property inputs was developed. The model is trained on an extensive dataset comprising nearly 180,000 soil spectra-property pairs from 17 datasets. It generates soil reflectance spectra from text-based inputs describing soil properties and their values rather than only numerical values and labels in binary vector format. The generative model can simulate output spectra based on an incomplete set of input properties. SOGM is based on the denoising diffusion probabilistic model (DDPM). Two additional sub-models were also built to complement the SOGM: a spectral padding model that can fill in the gaps for spectra shorter than the full visible-near-infrared range (VIS-NIR; 400 to 2499 nm), and a wet soil spectra model that can estimate the effects of water content on soil reflectance spectra given the dry spectrum predicted by the SOGM. The SOGM was up-scaled by coupling with the Helios 3D plant modeling software, which allowed for generation of synthetic aerial images of simulated soil and plant scenes. It can also be easily integrated with soil-plant radiation model used for remote sensin research like PROSAIL. The testing results of the SOGM on new datasets that not included in model training proved that the model can generate reasonable soil reflectance spectra based on available property inputs. The presented models are openly accessible on: https://github.com/GEMINI-Breeding/SOGM_soil_spectra_simulation.

[84]  arXiv:2405.01073 (cross-list from cs.LG) [pdf, other]
Title: Poisoning Attacks on Federated Learning for Autonomous Driving
Comments: Accepted to SCAI2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Federated Learning (FL) is a decentralized learning paradigm, enabling parties to collaboratively train models while keeping their data confidential. Within autonomous driving, it brings the potential of reducing data storage costs, reducing bandwidth requirements, and to accelerate the learning. FL is, however, susceptible to poisoning attacks. In this paper, we introduce two novel poisoning attacks on FL tailored to regression tasks within autonomous driving: FLStealth and Off-Track Attack (OTA). FLStealth, an untargeted attack, aims at providing model updates that deteriorate the global model performance while appearing benign. OTA, on the other hand, is a targeted attack with the objective to change the global model's behavior when exposed to a certain trigger. We demonstrate the effectiveness of our attacks by conducting comprehensive experiments pertaining to the task of vehicle trajectory prediction. In particular, we show that, among five different untargeted attacks, FLStealth is the most successful at bypassing the considered defenses employed by the server. For OTA, we demonstrate the inability of common defense strategies to mitigate the attack, highlighting the critical need for new defensive mechanisms against targeted attacks within FL for autonomous driving.

[85]  arXiv:2405.01124 (cross-list from stat.ML) [pdf, other]
Title: Investigating Self-Supervised Image Denoising with Denaturation
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Statistics Theory (math.ST)

Self-supervised learning for image denoising problems in the presence of denaturation for noisy data is a crucial approach in machine learning. However, theoretical understanding of the performance of the approach that uses denatured data is lacking. To provide better understanding of the approach, in this paper, we analyze a self-supervised denoising algorithm that uses denatured data in depth through theoretical analysis and numerical experiments. Through the theoretical analysis, we discuss that the algorithm finds desired solutions to the optimization problem with the population risk, while the guarantee for the empirical risk depends on the hardness of the denoising task in terms of denaturation levels. We also conduct several experiments to investigate the performance of an extended algorithm in practice. The results indicate that the algorithm training with denatured images works, and the empirical performance aligns with the theoretical results. These results suggest several insights for further improvement of self-supervised image denoising that uses denatured data in future directions.

[86]  arXiv:2405.01192 (cross-list from cs.RO) [pdf, other]
Title: Imagine2touch: Predictive Tactile Sensing for Robotic Manipulation using Efficient Low-Dimensional Signals
Comments: 3 pages, 3 figures, 2 tables, accepted at ViTac2024 ICRA2024 Workshop. arXiv admin note: substantial text overlap with arXiv:2403.15107
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Humans seemingly incorporate potential touch signals in their perception. Our goal is to equip robots with a similar capability, which we term Imagine2touch. Imagine2touch aims to predict the expected touch signal based on a visual patch representing the area to be touched. We use ReSkin, an inexpensive and compact touch sensor to collect the required dataset through random touching of five basic geometric shapes, and one tool. We train Imagine2touch on two out of those shapes and validate it on the ood. tool. We demonstrate the efficacy of Imagine2touch through its application to the downstream task of object recognition. In this task, we evaluate Imagine2touch performance in two experiments, together comprising 5 out of training distribution objects. Imagine2touch achieves an object recognition accuracy of 58% after ten touches per object, surpassing a proprioception baseline.

[87]  arXiv:2405.01205 (cross-list from cs.LG) [pdf, other]
Title: Error-Driven Uncertainty Aware Training
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Neural networks are often overconfident about their predictions, which undermines their reliability and trustworthiness. In this work, we present a novel technique, named Error-Driven Uncertainty Aware Training (EUAT), which aims to enhance the ability of neural models to estimate their uncertainty correctly, namely to be highly uncertain when they output inaccurate predictions and low uncertain when their output is accurate. The EUAT approach operates during the model's training phase by selectively employing two loss functions depending on whether the training examples are correctly or incorrectly predicted by the model. This allows for pursuing the twofold goal of i) minimizing model uncertainty for correctly predicted inputs and ii) maximizing uncertainty for mispredicted inputs, while preserving the model's misprediction rate. We evaluate EUAT using diverse neural models and datasets in the image recognition domains considering both non-adversarial and adversarial settings. The results show that EUAT outperforms existing approaches for uncertainty estimation (including other uncertainty-aware training techniques, calibration, ensembles, and DEUP) by providing uncertainty estimates that not only have higher quality when evaluated via statistical metrics (e.g., correlation with residuals) but also when employed to build binary classifiers that decide whether the model's output can be trusted or not and under distributional data shifts.

[88]  arXiv:2405.01333 (cross-list from cs.RO) [pdf, other]
Title: NeRF in Robotics: A Survey
Comments: 21 pages, 19 figures
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Meticulous 3D environment representations have been a longstanding goal in computer vision and robotics fields. The recent emergence of neural implicit representations has introduced radical innovation to this field as implicit representations enable numerous capabilities. Among these, the Neural Radiance Field (NeRF) has sparked a trend because of the huge representational advantages, such as simplified mathematical models, compact environment storage, and continuous scene representations. Apart from computer vision, NeRF has also shown tremendous potential in the field of robotics. Thus, we create this survey to provide a comprehensive understanding of NeRF in the field of robotics. By exploring the advantages and limitations of NeRF, as well as its current applications and future potential, we hope to shed light on this promising area of research. Our survey is divided into two main sections: \textit{The Application of NeRF in Robotics} and \textit{The Advance of NeRF in Robotics}, from the perspective of how NeRF enters the field of robotics. In the first section, we introduce and analyze some works that have been or could be used in the field of robotics from the perception and interaction perspectives. In the second section, we show some works related to improving NeRF's own properties, which are essential for deploying NeRF in the field of robotics. In the discussion section of the review, we summarize the existing challenges and provide some valuable future research directions for reference.

[89]  arXiv:2405.01460 (cross-list from cs.CR) [pdf, other]
Title: Purify Unlearnable Examples via Rate-Constrained Variational Autoencoders
Comments: Accepted by ICML 2024
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Unlearnable examples (UEs) seek to maximize testing error by making subtle modifications to training examples that are correctly labeled. Defenses against these poisoning attacks can be categorized based on whether specific interventions are adopted during training. The first approach is training-time defense, such as adversarial training, which can mitigate poisoning effects but is computationally intensive. The other approach is pre-training purification, e.g., image short squeezing, which consists of several simple compressions but often encounters challenges in dealing with various UEs. Our work provides a novel disentanglement mechanism to build an efficient pre-training purification method. Firstly, we uncover rate-constrained variational autoencoders (VAEs), demonstrating a clear tendency to suppress the perturbations in UEs. We subsequently conduct a theoretical analysis for this phenomenon. Building upon these insights, we introduce a disentangle variational autoencoder (D-VAE), capable of disentangling the perturbations with learnable class-wise embeddings. Based on this network, a two-stage purification approach is naturally developed. The first stage focuses on roughly eliminating perturbations, while the second stage produces refined, poison-free results, ensuring effectiveness and robustness across various scenarios. Extensive experiments demonstrate the remarkable performance of our method across CIFAR-10, CIFAR-100, and a 100-class ImageNet-subset. Code is available at https://github.com/yuyi-sd/D-VAE.

[90]  arXiv:2405.01468 (cross-list from cs.LG) [pdf, other]
Title: Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
Authors: Yifei Ming, Yixuan Li
Comments: The paper is accepted at ICML 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Pre-trained contrastive vision-language models have demonstrated remarkable performance across a wide range of tasks. However, they often struggle on fine-trained datasets with categories not adequately represented during pre-training, which makes adaptation necessary. Recent works have shown promising results by utilizing samples from web-scale databases for retrieval-augmented adaptation, especially in low-data regimes. Despite the empirical success, understanding how retrieval impacts the adaptation of vision-language models remains an open research question. In this work, we adopt a reflective perspective by presenting a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations.

[91]  arXiv:2405.01474 (cross-list from cs.CL) [pdf, other]
Title: V-FLUTE: Visual Figurative Language Understanding with Textual Explanations
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.

[92]  arXiv:2405.01503 (cross-list from eess.IV) [pdf, other]
Title: PAM-UNet: Shifting Attention on Region of Interest in Medical Images
Comments: Accepted at 2024 IEEE EMBC
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Computer-aided segmentation methods can assist medical personnel in improving diagnostic outcomes. While recent advancements like UNet and its variants have shown promise, they face a critical challenge: balancing accuracy with computational efficiency. Shallow encoder architectures in UNets often struggle to capture crucial spatial features, leading in inaccurate and sparse segmentation. To address this limitation, we propose a novel \underline{P}rogressive \underline{A}ttention based \underline{M}obile \underline{UNet} (\underline{PAM-UNet}) architecture. The inverted residual (IR) blocks in PAM-UNet help maintain a lightweight framework, while layerwise \textit{Progressive Luong Attention} ($\mathcal{PLA}$) promotes precise segmentation by directing attention toward regions of interest during synthesis. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a dice score of 82.87, while requiring only 1.32 floating-point operations per second (FLOPS) on the Liver Tumor Segmentation Benchmark (LiTS) 2017 dataset. These results highlight the importance of developing efficient segmentation models to accelerate the adoption of AI in clinical practice.

[93]  arXiv:2405.01524 (cross-list from cs.LG) [pdf, other]
Title: A separability-based approach to quantifying generalization: which layer is best?
Comments: 6, pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Generalization to unseen data remains poorly understood for deep learning classification and foundation models. How can one assess the ability of networks to adapt to new or extended versions of their input space in the spirit of few-shot learning, out-of-distribution generalization, and domain adaptation? Which layers of a network are likely to generalize best? We provide a new method for evaluating the capacity of networks to represent a sampled domain, regardless of whether the network has been trained on all classes in the domain. Our approach is the following: after fine-tuning state-of-the-art pre-trained models for visual classification on a particular domain, we assess their performance on data from related but distinct variations in that domain. Generalization power is quantified as a function of the latent embeddings of unseen data from intermediate layers for both unsupervised and supervised settings. Working throughout all stages of the network, we find that (i) high classification accuracy does not imply high generalizability; and (ii) deeper layers in a model do not always generalize the best, which has implications for pruning. Since the trends observed across datasets are largely consistent, we conclude that our approach reveals (a function of) the intrinsic capacity of the different layers of a model to generalize.

[94]  arXiv:2405.01527 (cross-list from cs.RO) [pdf, other]
Title: Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation
Comments: preprint
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework,Track2Act predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables zero-shot robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. https://homangab.github.io/track2act/

[95]  arXiv:2405.01531 (cross-list from cs.LG) [pdf, other]
Title: Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Concept Bottleneck Models (CBMs) ground image classification on human-understandable concepts to allow for interpretable model decisions. Crucially, the CBM design inherently allows for human interventions, in which expert users are given the ability to modify potentially misaligned concept choices to influence the decision behavior of the model in an interpretable fashion. However, existing approaches often require numerous human interventions per image to achieve strong performances, posing practical challenges in scenarios where obtaining human feedback is expensive. In this paper, we find that this is noticeably driven by an independent treatment of concepts during intervention, wherein a change of one concept does not influence the use of other ones in the model's final decision. To address this issue, we introduce a trainable concept intervention realignment module, which leverages concept relations to realign concept assignments post-intervention. Across standard, real-world benchmarks, we find that concept realignment can significantly improve intervention efficacy; significantly reducing the number of interventions needed to reach a target classification performance or concept prediction accuracy. In addition, it easily integrates into existing concept-based architectures without requiring changes to the models themselves. This reduced cost of human-model collaboration is crucial to enhancing the feasibility of CBMs in resource-constrained environments.

[96]  arXiv:2405.01534 (cross-list from cs.LG) [pdf, other]
Title: Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks
Comments: Published at ICLR 2024. Website at this https URL 9 pages, 3 figures, 3 tables; 14 pages appendix (7 additional figures)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, out-performing language-based, classical, and end-to-end approaches. Video results and code at https://mihdalal.github.io/planseqlearn/

Replacements for Fri, 3 May 24

[97]  arXiv:2009.08618 (replaced) [pdf, other]
Title: 6-DoF Grasp Planning using Fast 3D Reconstruction and Grasp Quality CNN
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[98]  arXiv:2209.10368 (replaced) [pdf, other]
Title: USC: Uncompromising Spatial Constraints for Safety-Oriented 3D Object Detectors in Autonomous Driving
Comments: 8 pages (IEEE double column format), 7 figures, 2 tables, submitted to ITSC 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[99]  arXiv:2303.01917 (replaced) [pdf, other]
Title: Pyramid Pixel Context Adaption Network for Medical Image Classification with Supervised Contrastive Learning
Comments: 15 pages
Journal-ref: TNNLS 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[100]  arXiv:2303.06464 (replaced) [pdf, other]
Title: PARASOL: Parametric Style Control for Diffusion Image Synthesis
Comments: Camera-ready version
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[101]  arXiv:2304.06027 (replaced) [pdf, other]
Title: Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA
Comments: Transactions on Machine Learning Research (TMLR) 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[102]  arXiv:2305.00194 (replaced) [pdf, other]
Title: Searching from Area to Point: A Hierarchical Framework for Semantic-Geometric Combined Feature Matching
Comments: v3
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[103]  arXiv:2305.10110 (replaced) [pdf, ps, other]
Title: Adaptive aggregation of Monte Carlo augmented decomposed filters for efficient group-equivariant convolutional neural network
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[104]  arXiv:2305.14014 (replaced) [pdf, other]
Title: CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Comments: Preprint. A PyTorch re-implementation is at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[105]  arXiv:2305.16965 (replaced) [pdf, other]
Title: Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling
Comments: full version; IJCAI 2024 accepted (main track)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[106]  arXiv:2307.06065 (replaced) [pdf, other]
Title: Operational Support Estimator Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[107]  arXiv:2307.12900 (replaced) [pdf, other]
Title: Automotive Object Detection via Learning Sparse Events by Spiking Neurons
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[108]  arXiv:2308.12605 (replaced) [pdf, other]
Title: APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[109]  arXiv:2308.15692 (replaced) [pdf, other]
Title: Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
[110]  arXiv:2309.00733 (replaced) [pdf, other]
Title: TExplain: Explaining Learned Visual Features via Pre-trained (Frozen) Language Models
Comments: Accepted to ICLR 2024, Reliable and Responsible Foundation Models workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[111]  arXiv:2309.08152 (replaced) [pdf, other]
Title: DA-RAW: Domain Adaptive Object Detection for Real-World Adverse Weather Conditions
Comments: Accepted to ICRA 2024. Our project website can be found at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[112]  arXiv:2309.17105 (replaced) [pdf, other]
Title: Continual Action Assessment via Task-Consistent Score-Discriminative Feature Distribution Modeling
Comments: 16 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[113]  arXiv:2310.02067 (replaced) [pdf, other]
Title: Content Bias in Deep Learning Image Age Approximation: A new Approach Towards better Explainability
Comments: This is a preprint, the paper is currently under consideration at Pattern Recognition Letters
Journal-ref: Pattern Recognition Letters 182 (2024) 90-96
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[114]  arXiv:2310.17626 (replaced) [pdf, ps, other]
Title: A Survey on Transferability of Adversarial Examples across Deep Neural Networks
Comments: Accepted to Transactions on Machine Learning Research (TMLR)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[115]  arXiv:2311.10543 (replaced) [pdf, other]
Title: Joint covariance properties under geometric image transformations for spatio-temporal receptive fields according to the generalized Gaussian derivative model for visual receptive fields
Authors: Tony Lindeberg
Comments: 38 pages, 13 figures. Note: From version 4, this paper considers a different form of joint composition of the geometric image transformations than in the earlier versions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
[116]  arXiv:2311.18576 (replaced) [pdf, other]
Title: Fingerprint Matching with Localized Deep Representation
Comments: 18 pages, 20 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[117]  arXiv:2312.13299 (replaced) [pdf, other]
Title: Compact 3D Scene Representation via Self-Organizing Gaussian Grids
Comments: Added compression of spherical harmonics, updated compression method with improved results (all attributes compressed with JPEG XL now), added qualitative comparison of additional scenes, moved compression explanation and comparison to main paper, added comparison with "Making Gaussian Splats smaller"
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[118]  arXiv:2401.00254 (replaced) [pdf, other]
Title: Morphing Tokens Draw Strong Masked Image Models
Comments: 27 pages, 17 tables, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[119]  arXiv:2401.17728 (replaced) [pdf, other]
Title: COMET: Contrastive Mean Teacher for Online Source-Free Universal Domain Adaptation
Comments: Accepted at the International Joint Conference on Neural Networks (IJCNN) 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[120]  arXiv:2402.02972 (replaced) [pdf, other]
Title: Retrieval-Augmented Score Distillation for Text-to-3D Generation
Comments: Accepted to ICML 2024 / Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[121]  arXiv:2402.04930 (replaced) [pdf, other]
Title: Blue noise for diffusion models
Comments: SIGGRAPH 2024 Conference Proceedings; Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
[122]  arXiv:2402.05374 (replaced) [pdf, other]
Title: CIC: A framework for Culturally-aware Image Captioning
Comments: Accepted in IJCAI 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[123]  arXiv:2402.14095 (replaced) [pdf, other]
Title: Zero-shot generalization across architectures for visual classification
Comments: Accepted as a Tiny Paper at ICLR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[124]  arXiv:2402.19473 (replaced) [pdf, other]
Title: Retrieval-Augmented Generation for AI-Generated Content: A Survey
Comments: Citing 334 papers, 21 pages, 1 table, 12 figures. Project: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[125]  arXiv:2403.07741 (replaced) [pdf, other]
Title: Uncertainty Quantification with Deep Ensembles for 6D Object Pose Estimation
Comments: 8 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[126]  arXiv:2403.20183 (replaced) [pdf, other]
Title: HARMamba: Efficient Wearable Sensor Human Activity Recognition Based on Bidirectional Selective SSM
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[127]  arXiv:2404.01030 (replaced) [pdf, ps, other]
Title: Survey of Bias In Text-to-Image Generation: Definition, Evaluation, and Mitigation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
[128]  arXiv:2404.10966 (replaced) [pdf, other]
Title: Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation
Comments: Accepted at CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[129]  arXiv:2404.12388 (replaced) [pdf, other]
Title: VideoGigaGAN: Towards Detail-rich Video Super-Resolution
Comments: project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[130]  arXiv:2404.13320 (replaced) [pdf, other]
Title: Pixel is a Barrier: Diffusion Models Are More Adversarially Robust Than We Think
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[131]  arXiv:2404.17230 (replaced) [pdf, other]
Title: ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
Comments: 12 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[132]  arXiv:2404.17883 (replaced) [pdf, other]
Title: Underwater Variable Zoom: Depth-Guided Perception Network for Underwater Image Enhancement
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[133]  arXiv:2404.18253 (replaced) [pdf, other]
Title: Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment
Authors: Tengjun Huang
Comments: Accepted by the Twelfth International Conference on Learning Representations (ICLR) Workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[134]  arXiv:2404.18895 (replaced) [pdf, other]
Title: RSCaMa: Remote Sensing Image Change Captioning with State Space Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[135]  arXiv:2404.19227 (replaced) [pdf, other]
Title: Espresso: Robust Concept Filtering in Text-to-Image Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
[136]  arXiv:2404.19444 (replaced) [pdf, other]
Title: AnomalyXFusion: Multi-modal Anomaly Synthesis with Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[137]  arXiv:2211.09325 (replaced) [pdf, other]
Title: TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation
Comments: Conference on Robot Learning (CoRL), 2022. Supplementary material is available at this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[138]  arXiv:2311.09253 (replaced) [pdf, other]
Title: The Perception-Robustness Tradeoff in Deterministic Image Restoration
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
[139]  arXiv:2312.17183 (replaced) [pdf, other]
Title: One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts
Comments: 53 pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[140]  arXiv:2402.09466 (replaced) [pdf, other]
Title: Few-Shot Learning with Uncertainty-based Quadruplet Selection for Interference Classification in GNSS Data
Journal-ref: 979-8-3503-8078-1/24/$31.00 \c{opyright}2024 IEEE
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[141]  arXiv:2402.13505 (replaced) [pdf, other]
Title: SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning
Comments: ICML2024
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[142]  arXiv:2403.16071 (replaced) [pdf, other]
Title: Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
Comments: To appear in LREC-COLING 2024
Journal-ref: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[143]  arXiv:2404.04916 (replaced) [pdf, other]
Title: Correcting Diffusion-Based Perceptual Image Compression with Privileged End-to-End Decoder
Comments: Accepted by ICML 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[144]  arXiv:2404.15918 (replaced) [pdf, other]
Title: Perception and Localization of Macular Degeneration Applying Convolutional Neural Network, ResNet and Grad-CAM
Comments: 12 pages, 5 figures, 2 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[145]  arXiv:2404.19398 (replaced) [pdf, other]
Title: 3D Gaussian Blendshapes for Head Avatar Animation
Comments: ACM SIGGRAPH Conference Proceedings 2024
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
[146]  arXiv:2405.00142 (replaced) [pdf, other]
Title: Utilizing Machine Learning and 3D Neuroimaging to Predict Hearing Loss: A Comparative Analysis of Dimensionality Reduction and Regression Techniques
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[ total of 146 entries: 1-146 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2405, contact, help  (Access key information)