Image and Video Processing
- [1] arXiv:2408.11965 [pdf, html, other]
- Title: CT-AGRG: Automated Abnormality-Guided Report Generation from 3D Chest CT Volumes
  Comments: 15 pages, 9 figures, submitted to ISBI 2025
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The rapid increase in computed tomography (CT) scans and their time-consuming manual analysis have created an urgent need for robust automated analysis techniques in clinical settings. These aim to assist radiologists and help them manage their growing workload. Existing methods typically generate entire reports directly from 3D CT images, without explicitly focusing on observed abnormalities. This unguided approach often results in repetitive content or incomplete reports, failing to prioritize anomaly-specific descriptions. We propose a new anomaly-guided report generation model, which first predicts abnormalities and then generates targeted descriptions for each. Evaluation on a public dataset demonstrates significant improvements in report quality and clinical relevance. We extend our work by conducting an ablation study to demonstrate its effectiveness.
- [2] arXiv:2408.11982 [pdf, html, other]
- Title: AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results
  Authors: Maksim Smirnov, Aleksandr Gushchin, Anastasia Antsiferova, Dmitry Vatolin, Radu Timofte, Ziheng Jia, Zicheng Zhang, Wei Sun, Jiaying Qian, Yuqin Cao, Yinan Sun, Yuxin Zhu, Xiongkuo Min, Guangtao Zhai, Kanjar De, Qing Luo, Ao-Xiang Zhang, Peng Zhang, Haibo Lei, Linyan Jiang, Yaqing Li, Wenhui Meng, Xiaoheng Tan, Haiqiang Wang, Xiaozhong Xu, Shan Liu, Zhenzhong Chen, Zhengxue Cheng, Jiahao Xiao, Jun Xu, Chenlong He, Qi Zheng, Ruoxi Zhu, Min Li, Yibo Fan, Zhengzhong Tu
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dataset of 459 videos, encoded with 14 codecs of various compression standards (AVC/H.264, HEVC/H.265, AV1, and VVC/H.266) and containing a comprehensive collection of compression artifacts. To measure the methods' performance, we employed traditional correlation coefficients between their predictions and subjective scores, which were collected via large-scale crowdsourced pairwise human comparisons. For training purposes, participants were provided with the Compressed Video Quality Assessment Dataset (CVQAD), a previously developed dataset of 1022 videos. Up to 30 participating teams registered for the challenge, while we report the results of 6 teams, which submitted valid final solutions and code for reproducing the results. Moreover, we calculate and present the performance of state-of-the-art VQA methods on the developed dataset, providing a comprehensive benchmark for future research. The dataset, results, and online leaderboard are publicly available at this https URL.
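The abstract does not say which correlation coefficients were used. As a minimal sketch, assuming the usual pair in VQA benchmarks (Pearson for linearity and Spearman for rank agreement), the comparison between predicted and subjective scores could be computed as follows. The function names and sample values are illustrative only and are not taken from the challenge code.

```python
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical metric outputs and subjective scores for four videos.
predicted = [0.10, 0.40, 0.35, 0.80]
subjective = [1.0, 2.5, 2.0, 4.0]
print(pearson(predicted, subjective), spearman(predicted, subjective))
```

Spearman correlation only checks that a metric orders the videos the same way viewers do, which is why it is often reported alongside Pearson correlation.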
- [3] arXiv:2408.11992 [pdf, html, other]
- Title: MBSS-T1: Model-Based Self-Supervised Motion Correction for Robust Cardiac T1 Mapping
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
T1 mapping is a valuable quantitative MRI technique for diagnosing diffuse myocardial diseases. Traditional methods, relying on breath-hold sequences and echo triggering, face challenges with patient compliance and arrhythmias, limiting their effectiveness. Image registration can enable motion-robust T1 mapping, but inherent intensity differences between time points pose a challenge. We introduce MBSS-T1, a self-supervised model for motion correction in cardiac T1 mapping, constrained by physical and anatomical principles. The physical constraints ensure expected signal decay behavior, while the anatomical constraints maintain realistic deformations. The unique combination of these constraints ensures accurate T1 mapping along the longitudinal relaxation axis. MBSS-T1 outperformed baseline deep-learning-based image registration approaches in a 5-fold experiment on a public dataset of 210 patients (STONE sequence) and an internal dataset of 19 patients (MOLLI sequence). MBSS-T1 excelled in model fitting quality (R²: 0.974 vs. 0.941, 0.946), anatomical alignment (Dice score: 0.921 vs. 0.984, 0.988), and expert visual quality assessment for the presence of visible motion artifacts (4.33 vs. 3.34, 3.62). MBSS-T1 has the potential to enable motion-robust T1 mapping for a broader range of patients, overcoming challenges such as arrhythmias and suboptimal compliance, and allowing for free-breathing T1 mapping without requiring large training datasets.
- [4] arXiv:2408.12013 [pdf, html, other]
- Title: Detection of Under-represented Samples Using Dynamic Batch Training for Brain Tumor Segmentation from MR Images
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Manual segmentation of brain tumors in magnetic resonance (MR) images is difficult, time-consuming, and prone to human error. These challenges can be addressed by developing automatic methods for brain tumor segmentation from MR images. Various deep-learning models based on the U-Net have been proposed for the task. These models are trained on a dataset of tumor images and then used to predict segmentation masks. Mini-batch training is widely used in deep learning. However, a significant challenge with this approach is that if the training dataset has under-represented samples or samples with complex latent representations, the model may not generalize well to these samples. This leads to skewed learning, where the model fits the majority representations while neglecting the under-represented samples. The proposed dynamic batch training method addresses the challenges posed by under-represented data points, data points with complex latent representations, and imbalances within a class, where some samples may be harder to learn than others. Poor performance on such samples can be identified only after training is complete, which wastes computational resources. Also, training on easy samples in every epoch is an inefficient use of computational resources. To overcome these challenges, the proposed method identifies hard samples and trains on them for more iterations than on easier samples, using the BraTS2020 dataset. Additionally, the samples trained multiple times are recorded, which provides a way to identify hard samples in the BraTS2020 dataset. A comparison of the proposed training approach with U-Net and other models in the literature highlights its capabilities.
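The abstract does not give the rule used to pick hard samples, so the following is only a toy sketch of the general idea: after a normal mini-batch update, samples whose loss stays above the batch mean receive extra updates, and a counter records how often each sample was retrained. The scalar model `y = w * x`, the mean-loss threshold, and all function names are assumptions made for illustration. They are not the authors' method.

```python
def per_sample_losses(w, batch):
    """Squared error of the toy model y = w * x for each (x, y) sample."""
    return [(w * x - y) ** 2 for x, y in batch]

def sgd_step(w, batch, lr):
    """One gradient-descent step on the mean squared error of the batch."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

def dynamic_batch_step(w, batch, lr=0.05, extra_steps=3):
    """Update on the full batch, then retrain on samples that remain hard.

    A sample counts as hard (an assumption of this sketch) when its loss
    is above the current mean loss of the batch.
    """
    w = sgd_step(w, batch, lr)
    retrain_counts = [0] * len(batch)
    for _ in range(extra_steps):
        losses = per_sample_losses(w, batch)
        mean_loss = sum(losses) / len(losses)
        hard = [i for i, loss in enumerate(losses) if loss > mean_loss]
        if not hard:
            break
        for i in hard:
            retrain_counts[i] += 1
        w = sgd_step(w, [batch[i] for i in hard], lr)
    return w, retrain_counts

# Two samples follow y = 2x; the third (y = 3x) is the atypical one.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)]
w, counts = dynamic_batch_step(0.0, batch)
print(w, counts)
```

The `retrain_counts` list plays the role of the bookkeeping the abstract mentions: samples that are retrained most often are the candidates for being hard or under-represented.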
- [5] arXiv:2408.12150 [pdf, html, other]
- Title: DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding
  Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that is the first to use quantization step sizes learned for each quantization layer. We also incorporate selective compression, with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches, with decreased decoding time and reduced model size.
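The nesting of quantization intervals described above can be illustrated with a small sketch. It uses a handcrafted hierarchy in which each layer divides the step size by 3, so every coarse interval splits into exactly three finer ones and each layer only needs to send a refinement symbol in {-1, 0, 1}. This is the kind of fixed hierarchy the abstract contrasts with learned step sizes; it is not the DeepHQ quantizer, and the function names and the ratio of 3 are assumptions for illustration.

```python
from math import floor

def quantize(x, step):
    """Index of the uniform quantization interval (width `step`) containing x."""
    return floor(x / step + 0.5)

def encode_progressive(x, base_step, ratio=3, layers=3):
    """Return the coarse index followed by one refinement symbol per extra layer."""
    step = base_step
    previous = quantize(x, step)
    symbols = [previous]
    for _ in range(layers - 1):
        step /= ratio
        current = quantize(x, step)
        # Only the position inside the parent interval is new information.
        symbols.append(current - ratio * previous)
        previous = current
    return symbols

def decode_progressive(symbols, base_step, ratio=3):
    """Reconstruct x after each layer; later entries are finer approximations."""
    step = base_step
    index = symbols[0]
    reconstructions = [index * step]
    for refinement in symbols[1:]:
        step /= ratio
        index = ratio * index + refinement
        reconstructions.append(index * step)
    return reconstructions

symbols = encode_progressive(4.3, base_step=9.0)
print(symbols)                                   # coarse index, then refinements
print(decode_progressive(symbols, base_step=9.0))  # one reconstruction per layer
```

A decoder that stops after the first symbol gets the coarsest reconstruction, and each additional symbol tightens it, which is the progressive property the abstract describes.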
- [6] arXiv:2408.12275 [pdf, html, other]
- Title: Whole Slide Image Classification of Salivary Gland Tumours
  Comments: 5 pages, 2 figures, 28th UK Conference on Medical Image Understanding and Analysis - clinical abstract
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This work shows promising results using multiple instance learning on salivary gland tumours in classifying cancers on whole slide images. Utilising CTransPath as a patch-level feature extractor and CLAM as a feature aggregator, an F1 score of over 0.88 and AUROC of 0.92 are obtained for detecting cancer in whole slide images.
- [7] arXiv:2408.12323 [pdf, html, other]
- Title: EUIS-Net: A Convolutional Neural Network for Efficient Ultrasound Image Segmentation
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Segmenting ultrasound images is critical for various medical applications, but it poses significant challenges due to the inherent noise and unpredictability of ultrasound images. To address these challenges, we propose EUIS-Net, a CNN designed to segment ultrasound images efficiently and precisely. EUIS-Net utilises four encoder-decoder blocks, resulting in a notable decrease in computational complexity while achieving excellent performance. It integrates both channel and spatial attention mechanisms into the bottleneck to improve feature representation and collect significant contextual information. In addition, EUIS-Net incorporates a region-aware attention module in skip connections, which enhances the ability to concentrate on the region of the injury. To enable thorough information exchange across various network blocks, skip connection aggregation is employed from the network's lowermost to the uppermost block. Comprehensive evaluations are conducted on two publicly available ultrasound image segmentation datasets. EUIS-Net achieved mean IoU and Dice scores of 78.12% and 85.42% on the BUSI dataset and 84.73% and 89.01% on the DDTI dataset, respectively. The findings of our study showcase the substantial capabilities of EUIS-Net for immediate use in clinical settings and its versatility in various ultrasound imaging tasks.
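For readers unfamiliar with the two reported metrics, the following minimal sketch computes IoU and the Dice score for a pair of binary masks given as flat lists of 0/1 values. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this sketch, not details from the paper.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equal-length binary masks.

    IoU  = |P ∩ T| / |P ∪ T|
    Dice = 2 |P ∩ T| / (|P| + |T|)
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    return intersection / union, 2 * intersection / (pred_area + target_area)

# Toy 4-pixel masks: one overlapping pixel, three pixels in the union.
iou, dice = iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
print(iou, dice)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU, which matches the ordering of the reported scores.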
- [8] arXiv:2408.12534 [pdf, html, other]
- Title: Automatic Organ and Pan-cancer Segmentation in Abdomen CT: the FLARE 2023 Challenge
  Authors: Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Ershuai Wang, Qin Zhou, Ziyan Huang, Pengju Lyu, Jian He, Bo Wang
  Comments: MICCAI 2024 FLARE Challenge Summary
  Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Organ and cancer segmentation in abdomen Computed Tomography (CT) scans is a prerequisite for precise cancer diagnosis and treatment. Most existing benchmarks and algorithms are tailored to specific cancer types, limiting their ability to provide comprehensive cancer analysis. This work presents the first international competition on abdominal organ and pan-cancer segmentation by providing a large-scale and diverse dataset, including 4650 CT scans with various cancer types from over 40 medical centers. The winning team established a new state-of-the-art with a deep learning-based cascaded framework, achieving average Dice Similarity Coefficient scores of 92.3% for organs and 64.9% for lesions on the hidden multi-national testing set. The dataset and code of top teams are publicly available, offering a benchmark platform to drive further innovations: this https URL.
New submissions for Friday, 23 August 2024 (showing 8 of 8 entries)
- [9] arXiv:2408.11829 (cross-list from cs.CV) [pdf, other]
- Title: FAKER: Full-body Anonymization with Human Keypoint Extraction for Real-time Video Deidentification
  Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
In the contemporary digital era, protection of personal information has become a paramount issue. The exponential growth of the media industry has heightened concerns regarding the anonymization of individuals captured in video footage. Traditional methods, such as blurring or pixelation, are commonly employed, while recent advancements have introduced generative adversarial networks (GANs) to redraw faces in videos. In this study, we propose a novel approach that employs a significantly smaller model to achieve real-time full-body anonymization of individuals in videos. Unlike conventional techniques, which often fail to effectively remove personal identification information such as skin color, clothing, accessories, and body shape, our method successfully eradicates all such details. Furthermore, by leveraging pose estimation algorithms, our approach accurately represents information regarding individuals' positions, movements, and postures. This algorithm can be seamlessly integrated into CCTV or IP camera systems installed in various industrial settings, functioning in real-time and thus facilitating the widespread adoption of full-body anonymization technology.
- [10] arXiv:2408.11885 (cross-list from physics.med-ph) [pdf, html, other]
- Title: HDN: Hybrid Deep-learning and Non-line-of-sight Reconstruction Framework for Photoacoustic Brain Imaging
  Comments: 8 pages, 8 figures
  Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV); Optics (physics.optics)
Photoacoustic imaging (PAI) combines the high contrast of optical imaging with the deep penetration depth of ultrasonic imaging, showing great potential in cerebrovascular disease detection. However, the ultrasonic wave suffers strong attenuation and multi-scattering when it passes through the skull tissue, resulting in the distortion of the collected photoacoustic (PA) signal. In this paper, inspired by the principles of deep learning and non-line-of-sight (NLOS) imaging, we propose an image reconstruction framework named HDN (Hybrid Deep-learning and Non-line-of-sight), which consists of the signal extraction part and difference utilization part. The signal extraction part is used to correct the distorted signal and reconstruct an initial image. The difference utilization part is used to make further use of the signal difference between the distorted signal and corrected signal, reconstructing the residual image between the initial image and the target image. The test results on a PA digital brain simulation dataset show that compared with the traditional delay-and-sum (DAS) method and deep-learning-based method, HDN achieved superior performance in both signal correction and image reconstruction. Specifically for the SSIM index, the HDN reached 0.606 in imaging results, compared to 0.154 for the DAS method and 0.307 for the deep-learning-based method.
- [11] arXiv:2408.12048 (cross-list from cs.CV) [pdf, html, other]
- Title: ISETHDR: A Physics-based Synthetic Radiance Dataset for High Dynamic Range Driving Scenes
  Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
This paper describes a physics-based end-to-end software simulation for image systems. We use the software to explore sensors designed to enhance performance in high dynamic range (HDR) environments, such as driving through daytime tunnels and under nighttime conditions. We synthesize physically realistic HDR spectral radiance images and use them as the input to digital twins that model the optics and sensors of different systems. This paper makes three main contributions: (a) We create a labeled (instance segmentation and depth), synthetic radiance dataset of HDR driving scenes. (b) We describe the development and validation of the end-to-end simulation framework. (c) We present a comparative analysis of two single-shot sensors designed for HDR. We open-source both the dataset and the software.
Cross submissions for Friday, 23 August 2024 (showing 3 of 3 entries)
- [12] arXiv:2312.07137 (replaced) [pdf, html, other]
- Title: The AIRI plug-and-play algorithm for image reconstruction in radio-interferometry: variations and robustness
  Subjects: Image and Video Processing (eess.IV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Plug-and-Play (PnP) algorithms are appealing alternatives to proximal algorithms when solving inverse imaging problems. By learning a Deep Neural Network (DNN) denoiser behaving as a proximal operator, one waives the computational complexity of optimisation algorithms induced by sophisticated image priors, and the sub-optimality of handcrafted priors compared to DNNs. Such features are highly desirable in radio-interferometric (RI) imaging, where precision and scalability of the image reconstruction process are key. In previous work, we introduced AIRI, a PnP counterpart to the unconstrained variant of the SARA optimisation algorithm, relying on a forward-backward algorithmic backbone. Here, we introduce variations of AIRI towards a more general and robust PnP paradigm in RI imaging. Firstly, we show that the AIRI denoisers can be used without any alteration to instantiate a PnP counterpart to the constrained SARA optimisation algorithm itself, relying on a primal-dual forward-backward algorithmic backbone, thus extending the remit of the AIRI paradigm. Secondly, we show that AIRI algorithms are robust to strong variations in the nature of the training dataset, with denoisers trained on medical images yielding similar reconstruction quality to those trained on astronomical images. Thirdly, we develop a functionality to quantify the model uncertainty introduced by the randomness in the training process. We validate the image reconstruction and uncertainty quantification functionality of AIRI algorithms against the SARA family and CLEAN, both in simulation and on real data of the ESO 137-006 galaxy acquired with the MeerKAT telescope. AIRI code is available in the BASPLib code library on GitHub.
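As a reminder of what a forward-backward PnP backbone looks like, the generic iteration can be written as below. This is the textbook form for a least-squares data-fidelity term, not a transcription of the AIRI implementation. The symbols are assumptions of this sketch: $\Phi$ is the measurement operator, $y$ the observed data, $\gamma$ a step size, and $\mathrm{D}$ the learned denoiser standing in for the proximal operator of the regulariser.

```latex
% Forward (gradient) step on f(x) = (1/2) * || \Phi x - y ||_2^2,
% followed by a backward step in which the denoiser D replaces prox_{\gamma r}.
\begin{aligned}
  u^{(k)}   &= x^{(k)} - \gamma\, \Phi^{\dagger}\!\left(\Phi x^{(k)} - y\right), \\
  x^{(k+1)} &= \mathrm{D}\!\left(u^{(k)}\right).
\end{aligned}
```

In a proximal algorithm the second line would read $x^{(k+1)} = \operatorname{prox}_{\gamma r}(u^{(k)})$ for a handcrafted regulariser $r$; swapping in a trained denoiser is what makes the scheme plug-and-play.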
- [13] arXiv:2404.10506 (replaced) [pdf, html, other]
- Title: Restoring Connectivity in Vascular Segmentation using a Learned Post-Processing Model
  Journal-ref: The First Workshop on Topology and Graph-Informed Imaging Informatics (TGI3), MICCAI 2024 Workshop
  Subjects: Image and Video Processing (eess.IV)
Accurate segmentation of vascular networks is essential for computer-aided tools designed to address cardiovascular diseases. Despite more than thirty years of research, it remains a challenge to obtain vascular segmentation results that preserve the connectivity of the underlying vascular network. Yet connectivity is one of the key features of these tools. In this work, we propose a post-processing algorithm aiming to reconnect vascular structures that have been disconnected by a segmentation algorithm. Connectivity being a complex property to model explicitly, we propose to learn this geometric feature either through synthetic data or annotations of the application of interest. The resulting post-processing model can be used on the output of any supervised or unsupervised vascular segmentation algorithm. We show that this post-processing effectively restores the connectivity of vascular networks both in 2D and 3D images, leading to improved overall segmentation results.
- [14] arXiv:2406.02918 (replaced) [pdf, html, other]
- Title: U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
U-Net has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks remain limited to modeling patterns linearly and suffer from deficient interpretability. To address these challenges, our intuition is inspired by the impressive results of the Kolmogorov-Arnold Networks (KANs) in terms of accuracy and interpretability, which reshape neural network learning via a stack of non-linear learnable activation functions derived from the Kolmogorov-Arnold representation theorem. Specifically, in this paper, we explore the untapped potential of KANs in improving backbones for vision tasks. We investigate, modify and re-design the established U-Net pipeline by integrating the dedicated KAN layers on the tokenized intermediate representation, termed U-KAN. Rigorous medical image segmentation benchmarks verify the superiority of U-KAN, which achieves higher accuracy even at lower computation cost. We further delve into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures. These endeavours unveil valuable insights and shed light on the prospect that, with U-KAN, one can build a strong backbone for medical image segmentation and generation. Project page: this https URL.
- [15] arXiv:2408.05892 (replaced) [pdf, html, other]
- Title: Polyp SAM 2: Advancing Zero shot Polyp Segmentation in Colorectal Cancer Detection
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Polyp segmentation plays a crucial role in the early detection and diagnosis of colorectal cancer. However, obtaining accurate segmentations often requires labor-intensive annotations and specialized models. Recently, Meta AI Research released a general Segment Anything Model 2 (SAM 2), which has demonstrated promising performance in several segmentation tasks. In this work, we evaluate the performance of SAM 2 in segmenting polyps under various prompted settings. We hope this report will provide insights to advance the field of polyp segmentation and promote more interesting work in the future. This project is publicly available at this https URL sajjad-sh33/Polyp-SAM-2.
- [16] arXiv:2408.09218 (replaced) [pdf, other]
- Title: FQGA-single: Towards Fewer Training Epochs and Fewer Model Parameters for Image-to-Image Translation Tasks
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
CycleGAN was trained on the SynthRAD Grand Challenge Dataset using the single-epoch modification (SEM) method proposed in this paper (referred to as CycleGAN-single) and compared with the usual method of training CycleGAN for around 200 epochs (CycleGAN-multi). Model performance was evaluated qualitatively and quantitatively, with quantitative metrics including PSNR, SSIM, MAE and MSE. The consideration of both quantitative and qualitative performance when evaluating a model is unique to certain image-to-image translation tasks, such as medical imaging of patient data, as detailed in this paper. This paper also shows that good quantitative performance does not always imply good qualitative performance, and that the converse does not always hold either (i.e. good qualitative performance does not always imply good quantitative performance). This paper further proposes a lightweight model called FQGA (Fast Paired Image-to-Image Translation Quarter-Generator Adversary), which has 1/4 the number of parameters of CycleGAN (when comparing their generator models). FQGA outperforms CycleGAN qualitatively and quantitatively even after training for only 20 epochs. Finally, using the SEM method on FQGA allowed it to again outperform CycleGAN both quantitatively and qualitatively. These performance gains, achieved with fewer model parameters and fewer epochs (which result in time and computational savings), may also be applicable to image-to-image translation tasks in machine learning other than the medical image-translation task discussed in this paper, between Cone Beam Computed Tomography (CBCT) and Computed Tomography (CT) images.
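Three of the quantitative metrics named above have simple closed forms. The sketch below computes MSE, MAE and PSNR for two images given as flat lists of pixel intensities, using the standard definition PSNR = 10·log10(MAX² / MSE). The function names and the default peak value of 255 are assumptions for illustration, and SSIM is omitted because it needs windowed statistics. CT intensities would normally use a different peak value.

```python
from math import inf, log10

def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mae(a, b):
    """Mean absolute error between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(a, b)
    if error == 0:
        return inf
    return 10 * log10(max_value ** 2 / error)

# Toy 3-pixel "images": a reference and a translated output.
reference = [0, 128, 255]
generated = [0, 130, 250]
print(mse(reference, generated), mae(reference, generated), psnr(reference, generated))
```

All three are pixel-wise averages, which is one reason a model can score well on them while still producing images that look wrong to a reader, the mismatch between quantitative and qualitative performance that the abstract highlights.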
- [17] arXiv:2404.06493 (replaced) [pdf, html, other]
- Title: Flying with Photons: Rendering Novel Views of Propagating Light
  Comments: ECCV 2024, Project page: this https URL
  Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
We present an imaging and neural rendering technique that seeks to synthesize videos of light propagating through a scene from novel, moving camera viewpoints. Our approach relies on a new ultrafast imaging setup to capture a first-of-its-kind, multi-viewpoint video dataset with picosecond-level temporal resolution. Combined with this dataset, we introduce an efficient neural volume rendering framework based on the transient field. This field is defined as a mapping from a 3D point and 2D direction to a high-dimensional, discrete-time signal that represents time-varying radiance at ultrafast timescales. Rendering with transient fields naturally accounts for effects due to the finite speed of light, including viewpoint-dependent appearance changes caused by light propagation delays to the camera. We render a range of complex effects, including scattering, specular reflection, refraction, and diffraction. Additionally, we demonstrate removing viewpoint-dependent propagation delays using a time warping procedure, rendering of relativistic effects, and video synthesis of direct and global components of light transport.
- [18] arXiv:2408.10287 (replaced) [pdf, other]
- Title: Recognizing Beam Profiles from Silicon Photonics Gratings using Transformer Model
  Subjects: Optics (physics.optics); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Over the past decade, there has been extensive work in the ion trap quantum computing community on developing integrated silicon photonics (SiPh) gratings for the optical addressing of trapped ion qubits. However, when viewing beam profiles from infrared (IR) cameras, it is often difficult to determine the corresponding heights where the beam profiles are located. In this work, we developed transformer models to recognize the corresponding height categories of beam profiles of light from SiPh gratings. The model is trained using two techniques: (1) input patches, and (2) input sequence. The model trained with input patches achieved a recognition accuracy of 0.938, while the model trained with input sequence showed a lower accuracy of 0.895. However, when the model training was repeated for 150 cycles, the model trained with input patches showed inconsistent accuracy, ranging from 0.445 to 0.959, while the model trained with input sequence exhibited a narrower and more consistent accuracy range of 0.789 to 0.936. The obtained outcomes can be extended to various applications, including auto-focusing of the light beam and auto-adjustment of the z-axis stage to acquire desired beam profiles.