SSMT-Net: A Semi-Supervised Multitask Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

Scene 1 (0s)

[Virtual Presenter] Hello everyone, I am Muhammad Umar Farooq, a PhD student at Hanyang University, and it’s a pleasure to present our work, done in collaboration with several collaborators, including Abd Ur Rehman from The University of Alabama, Azka Rehamn from Seoul National University, Seoul, and Muhammad Usman from Stanford University, CA. Today, I present SSMT-Net: A Semi-Supervised Multitask Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images. This presentation is based on our paper accepted at WACV 2026.

Scene 2 (44s)

[Audio] A thyroid nodule is a focal lesion within the thyroid gland that appears visually distinct from the surrounding tissue. Accurate segmentation of these nodules is clinically important because it directly impacts diagnosis, treatment planning, and follow-up decisions. However, thyroid ultrasound segmentation is a very challenging task. There are several intrinsic challenges. First, the nodule textures are often heterogeneous: nodules can have mixed echogenicity, making them visually inconsistent. Second, boundaries are frequently ambiguous. The transition between nodule and normal tissue is not always clearly defined. Additionally, nodules often exhibit irregular shapes, which makes shape-based assumptions unreliable. Finally, acoustic shadows caused by ultrasound physics can obscure parts of the lesion, leading to incomplete or misleading visual cues. In the examples shown below, the green arrows highlight visually complex nodules in the top row, while the red contours in the bottom row show the corresponding expert annotations. You can see that even for specialists, these cases are non-trivial.

Scene 3 (1m 58s)

[Audio] Let's examine the limitations of existing methods. First, most prior work focuses on single-task nodule segmentation. These models typically segment the nodule alone, without incorporating gland-aware context or clinically relevant information such as size or structural relationships. This limits their ability to leverage anatomical cues that are important in real clinical scenarios. Second, many approaches are CNN-dominant. While CNNs are effective at capturing local spatial features, they struggle with long-range dependencies and ambiguous boundaries, which are very common in thyroid ultrasound images. On the other hand, Transformer-based models improve global context modeling. However, they often come with higher computational cost and typically require larger datasets to generalize well. In medical imaging, where labeled data is limited, this can be a significant constraint. Finally, there is the issue of generalizability. Many models show performance drops across different hospitals or scanners due to domain shifts. This limits real-world deployment. As summarized in Table 1, most previous methods rely either on CNNs or Transformers, and very few integrate multitask learning within a unified framework. Our approach is designed to address these gaps by combining local and global modeling with multitask supervision.

Scene 4 (3m 32s)

[Audio] To address the limitations we just discussed, we propose SSMT-Net, a semi-supervised multitask Transformer–CNN framework. Our design is built around three key ideas. First, semi-supervised learning. In clinical practice, labeled ultrasound data is limited, but unlabeled data is abundant. We leverage this unlabeled data to improve representation learning and robustness. Second, a multitask framework. Instead of treating nodule segmentation as an isolated problem, we jointly learn related tasks. This encourages the network to capture richer anatomical and structural information. Third, a Transformer–CNN hybrid architecture. The CNN component focuses on precise local boundary extraction, while the Transformer captures global shape and long-range dependencies. By combining both, we aim to balance fine details with contextual understanding. As shown on the right, our model jointly performs nodule segmentation as the main task, nodule size prediction, gland segmentation, and image reconstruction as an unsupervised auxiliary task. The outputs include the nodule mask, gland mask, restored image, and size estimation. These tasks share representations, which improves generalization and boundary precision. In summary, SSMT-Net integrates semi-supervision, multitask learning, and hybrid modeling into a unified framework designed specifically for the challenges of thyroid ultrasound segmentation.

Scene 5 (5m 19s)

[Audio] Here I present the overall architecture of SSMT-Net. The model begins with the input ultrasound images for both the nodule and the gland. These are passed into a hybrid CNN–Transformer encoder. The CNN encoder extracts rich local spatial features, which are crucial for capturing fine boundary details. In parallel, the Transformer encoder models long-range dependencies and global structural information. Skip connections are used to preserve multi-scale features and stabilize training. From this shared feature representation, the network branches into three task-specific components. First, the Reconstruction Branch. This CNN-based decoder reconstructs the input image. The goal here is to enforce structural consistency and leverage unlabeled data through reconstruction supervision. This strengthens the encoder's feature learning capability. Second, the Dual Segmentation Decoders. We use separate decoders for gland segmentation and nodule segmentation. The CNN decoder focuses on boundary precision, while the Transformer decoder enhances global shape consistency. This dual design allows us to capture both detailed edges and overall anatomical structure. Third, the Nodule Size Estimation Branch. A fully connected regression head predicts the nodule size from the shared feature representation. This task introduces clinically relevant supervision and encourages the network to learn size-aware representations. Importantly, all three tasks share the same encoder features. This shared representation improves robustness, reduces overfitting, and enhances generalization. In summary, SSMT-Net integrates hybrid feature extraction with multitask learning, allowing the model to jointly perform reconstruction, dual segmentation, and size estimation within a unified framework.
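
The shared-encoder idea described in this scene can be illustrated with a toy NumPy sketch. It is not the SSMT-Net implementation: the hybrid CNN–Transformer encoder is replaced by a stand-in per-pixel projection, every head is a single linear layer, and all names (`shared_encoder`, `multitask_forward`, the `params` keys, the channel count `C`) are hypothetical. It only shows how one feature map can feed two mask heads, a reconstruction head, and a size-regression head.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_encoder(image, w_enc):
    """Stand-in for the hybrid encoder: per-pixel linear projection + ReLU."""
    h, w = image.shape
    feats = np.maximum(image.reshape(h * w, 1) @ w_enc, 0.0)  # (H*W, C)
    return feats.reshape(h, w, -1)                            # (H, W, C)

def multitask_forward(image, params):
    """Run every task head on the same shared features."""
    feats = shared_encoder(image, params["enc"])
    return {
        "nodule_mask": sigmoid(feats @ params["nodule"]),  # (H, W) probabilities
        "gland_mask": sigmoid(feats @ params["gland"]),    # (H, W) probabilities
        "restored": feats @ params["rec"],                 # (H, W) reconstruction
        # Global average pooling followed by a linear regression head.
        "size": float(feats.mean(axis=(0, 1)) @ params["size"]),
    }

C = 8  # hypothetical number of feature channels
params = {
    "enc": rng.normal(size=(1, C)),
    "nodule": rng.normal(size=C),
    "gland": rng.normal(size=C),
    "rec": rng.normal(size=C),
    "size": rng.normal(size=C),
}
image = rng.random((224, 224))  # random stand-in for a 224 by 224 ultrasound image
out = multitask_forward(image, params)
print({k: getattr(v, "shape", v) for k, v in out.items()})
```

Because every head reads the same `feats`, a gradient from any task would update the shared encoder, which is the mechanism the talk credits for the improved robustness.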

Scene 6 (7m 24s)

[Audio] Instead of using a single decoder for both structures, we introduce two dedicated decoders: one for gland segmentation and one for nodule segmentation. The intuition is that gland and nodule structures have different characteristics. The gland occupies a larger region with smoother boundaries, while nodules are smaller, irregular, and often ambiguous. Sharing one decoder can limit specialization. Our segmentation module combines a CNN decoder and a Transformer decoder. The CNN decoder emphasizes local spatial refinement, which is essential for capturing fine boundary details. Meanwhile, the Transformer decoder performs mask-based attention, allowing the model to capture global context and long-range structural consistency. We further apply a coarse-to-fine attention refinement strategy. The CNN first produces an initial coarse prediction, and the Transformer refines it through attention mechanisms. This progressive refinement helps handle ambiguous boundaries and irregular shapes. As shown on the right, the module outputs both the gland mask and the nodule mask. Overall, this dual-decoder design enables structure-specific learning while maintaining global consistency, leading to tighter and more accurate segmentation boundaries.
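
The talk does not spell out how the coarse-to-fine refinement is computed, so the following is only a minimal sketch of one common form of mask-based attention, under stated assumptions: a single query vector, no learned projections, and a hard threshold on the current mask. The function `refine_mask` and its parameters (`steps`, `thresh`) are hypothetical. At each step the query attends only to pixels the current mask marks as foreground, is updated with the attended features, and then produces a new mask.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def refine_mask(features, query, coarse_prob, steps=3, thresh=0.5):
    """features: (N, d) pixel embeddings; query: (d,); coarse_prob: (N,) in [0, 1]."""
    n, d = features.shape
    prob = coarse_prob.copy()
    for _ in range(steps):
        fg = prob >= thresh
        if not fg.any():  # empty mask: fall back to attending everywhere
            fg = np.ones(n, dtype=bool)
        scores = features @ query / np.sqrt(d)
        scores = np.where(fg, scores, -np.inf)  # attention restricted to the mask
        attn = softmax(scores)
        query = query + attn @ features          # residual query update
        logits = features @ query / np.sqrt(d)
        prob = 1.0 / (1.0 + np.exp(-logits))     # refined mask for the next step
    return prob, query

# Toy example: 4 foreground-like pixels and 4 background-like pixels.
features = np.array([[2.0, 0.0]] * 4 + [[-2.0, 0.0]] * 4)
# The coarse mask misses the fourth foreground pixel (0.4 is below the threshold).
coarse = np.array([0.6, 0.6, 0.6, 0.4, 0.3, 0.3, 0.3, 0.3])
prob, _ = refine_mask(features, np.zeros(2), coarse)
print(np.round(prob, 3))
```

In this toy case the pixel missed by the coarse mask is recovered, because the refined query is built from the confident foreground pixels and then scored against all pixels.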

Scene 7 (8m 49s)

[Audio] We evaluate our model on three publicly available thyroid ultrasound datasets. DDTI contains 637 images. TN3K includes 3,493 images, with 2,879 used for training and 614 for testing. TG3K contains 3,585 images and is used primarily for auxiliary supervision. For preprocessing, all images are normalized and resized to 224 by 224. To improve robustness and reduce overfitting, we apply data augmentation including flipping, rotation, zoom-out, and stitching. Now moving to the training objective. For segmentation, both the nodule and gland tasks use the Dice loss, which directly optimizes region overlap and helps handle class imbalance. For the nodule size estimation task, we use the Mean Squared Error loss to perform regression on the predicted size. For the reconstruction branch, we adopt the Charbonnier loss, which is a smooth and robust alternative to L1 loss. This helps stabilize training and improves reconstruction quality. The overall loss is a weighted combination of all components: L_total equals alpha times L_nodule, plus beta times L_gland, plus gamma times L_size, plus eta times L_rec. These weights balance the contribution of each task during joint optimization. By combining segmentation, regression, and reconstruction objectives, the network learns more comprehensive and clinically meaningful representations.
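
The training objective described in this scene can be written out as a short NumPy sketch. The four loss terms and the weighted sum follow the spoken formula; the smoothing constants (`eps`) and the default weights of 1.0 are placeholder assumptions, since the talk gives no values for alpha, beta, gamma, or eta.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P*T| / (|P| + |T|), with pred values in [0, 1]."""
    pred, target = np.ravel(pred), np.ravel(target)
    intersection = np.sum(pred * target)
    return float(1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

def mse_loss(pred, target):
    """Mean squared error, used here for the size regression."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: mean of sqrt(diff^2 + eps^2), a smooth variant of L1."""
    diff = np.asarray(pred) - np.asarray(target)
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

def total_loss(l_nodule, l_gland, l_size, l_rec,
               alpha=1.0, beta=1.0, gamma=1.0, eta=1.0):
    """L_total = alpha*L_nodule + beta*L_gland + gamma*L_size + eta*L_rec."""
    return alpha * l_nodule + beta * l_gland + gamma * l_size + eta * l_rec

# Toy usage with random stand-ins for predictions and labels.
rng = np.random.default_rng(0)
mask_gt = (rng.random((224, 224)) > 0.7).astype(float)
mask_pred = np.clip(mask_gt + 0.1 * rng.normal(size=mask_gt.shape), 0.0, 1.0)
image = rng.random((224, 224))
loss = total_loss(
    dice_loss(mask_pred, mask_gt),          # nodule segmentation
    dice_loss(mask_pred, mask_gt),          # gland segmentation (same toy data)
    mse_loss([12.0], [10.5]),               # size regression
    charbonnier_loss(image + 0.01, image),  # reconstruction
)
print(loss)
```

In practice the four weights would be tuned so that no single term dominates, which is the balancing role the talk assigns to them.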

Scene 8 (10m 37s)

[Audio] Please notice three things in these results. First, we beat the state of the art on both metrics: our 78.34% IoU exceeds Deblurring-MIM by 3.38 percentage points, and our 86.94% DSC exceeds TnSeg by 1.23 points. Second, there is a large gain over the baseline: our multitask design adds 4.17 points of IoU over plain TransUNet. That is the payoff of combining semi-supervision with multitask learning. Third, our performance is the most stable: our standard deviation is ±0.15, versus ±0.43 for competitors. In clinics, consistency matters more than the highest performance.
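
For reference, the two reported metrics can be computed from binary masks as in the sketch below. These are the standard definitions of IoU and the Dice similarity coefficient (DSC); the function name `iou_and_dsc` and the convention of returning 1.0 when both masks are empty are assumptions, and the evaluation code behind the reported numbers may differ in details such as averaging.

```python
import numpy as np

def iou_and_dsc(pred_mask, gt_mask):
    """Return (IoU, DSC) for two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union > 0 else 1.0     # both masks empty: perfect match
    dsc = 2 * inter / total if total > 0 else 1.0
    return float(iou), float(dsc)

# Two 6-pixel rectangles that overlap in 4 pixels (union of 8 pixels).
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 1:4] = True
iou, dsc = iou_and_dsc(pred, gt)
print(iou, dsc)  # IoU = 4/8 = 0.5, DSC = 8/12 (about 0.667)
```

For a single pair of masks the two metrics are tied by DSC = 2·IoU / (1 + IoU), so DSC is always at least as large as IoU.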

Scene 9 (11m 22s)

[Audio] We start with the baseline model, which is TransUNet. It achieves an IoU of 74.17%. When we add the reconstruction branch, performance improves by 0.53%. This suggests that reconstruction supervision helps the encoder learn more robust representations. Next, when we introduce gland segmentation as an auxiliary task, the IoU increases by 1.20%. This indicates that gland-aware context provides meaningful anatomical guidance for nodule segmentation. Adding the size prediction task further improves performance by 1.09%. This shows that incorporating clinically relevant regression signals enhances feature learning. Finally, when we combine all components (reconstruction, gland segmentation, and size estimation), the full model achieves 78.34% IoU, resulting in a total gain of 4.17% over the baseline. We also observe consistent improvements in DSC, gland DSC, and size MAE, demonstrating that each task contributes positively and that the multitask design works synergistically rather than competitively. Overall, the ablation results validate that each module plays a complementary role, and the combined model yields the best performance.

Scene 10 (12m 52s)

[Audio] The green contours represent the ground truth, and the red contours are the model predictions. From left to right, we progressively add each component to the baseline. Starting with the baseline, we observe noticeable boundary deviations, especially around ambiguous regions and irregular shapes. When reconstruction is added, the predictions become slightly more stable, but boundary mismatches remain. After introducing gland segmentation, the contours better align with the anatomical structure, particularly along complex edges. Adding size prediction further refines the shape consistency. Finally, our full SSMT-Net model produces the tightest boundaries and the closest alignment with the ground truth in both samples. The improvement is especially clear in challenging areas where the boundaries are blurred or irregular. These qualitative results visually confirm the quantitative gains shown in the ablation table: each module incrementally improves boundary precision, and the combined model achieves the most accurate segmentation.

Scene 11 (14m 1s)

[Audio] We proposed SSMT-Net, a semi-supervised multitask framework for thyroid ultrasound segmentation that integrates a hybrid Transformer–CNN dual-decoder architecture. Our approach leverages unsupervised reconstruction to utilize unlabeled data, followed by joint supervised optimization across segmentation and size estimation tasks. This design enables more label-efficient and clinically meaningful representation learning. We further introduced an iterative coarse-to-fine attention refinement mechanism, which dynamically guides cross-attention toward evolving foreground regions. This significantly improves the delineation of small and low-contrast nodules, one of the key challenges in thyroid ultrasound imaging. Additionally, the explicit size estimation head injects scale and morphology priors, helping regularize boundary precision and enhance structural consistency. Extensive experiments demonstrate state-of-the-art performance on TN3K and strong transferability to DDTI under limited-label settings, supported by comprehensive ablation and robustness studies. Overall, our work shows that combining semi-supervision, multitask learning, and hybrid modeling provides a practical and effective direction for robust thyroid ultrasound segmentation.

Scene 12 (15m 28s)

[Audio] Thank you.