Abstract

In medical image segmentation, both Convolutional Neural Networks (CNNs) and self-attention mechanisms have achieved notable success, yet each has limitations: CNNs excel at capturing local features but struggle to model long-range dependencies, while self-attention captures global context effectively but is computationally expensive and may lose fine-grained local detail. To address these challenges, we propose CSSWin-UNet, a novel U-shaped architecture that integrates two complementary self-attention mechanisms (Swin and CSWin Transformer attention) to achieve a balanced representation of both local features and global dependencies while substantially reducing computational overhead. Experimental results on multiple public benchmark datasets demonstrate that CSSWin-UNet delivers superior segmentation performance with significantly lower model complexity, highlighting its potential for practical deployment in medical imaging tasks.