Document

Abstract

Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream task? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experienced a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features through providing complementary features. This design enables robust generalization by leveraging cross-datasets experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME's superiority, achieving significant mean AP improvements over state-of-the-art methods.

Details of Multiple Heterogeneous Ultrasound Dataset Integration

To develop a universal model for heterogeneous ultrasound (US) datasets, we built a benchmark of 4 breast and 4 thyroid US datasets. These datasets come from different sources and exhibit significant domain differences, such as variations in shadow artifacts, speckle noise, grayscale levels, and anatomical structures, as shown in Fig. 1

Fig. 1 (a) presents representative samples from eight datasets, highlighting their distinct imaging variations. Fig. 1 (b) illustrates the distribution of these datasets within the overall inter-organ mixed dataset, revealing a significant data imbalance.

Collaborative Mixture of Heterogeneous Source-Specific Experts (COME)

We propose the Collaborative Mixture of Heterogeneous Experts (COME), a novel framework for lesion detection across multiple heterogeneous US datasets that synergistically integrates dual structural-semantic priors with intra-dataset specific feature disentanglement, as illustrated in Fig. 2

Illustration of the proposed HSformer+ framework;

Fig. 2 (a) Overview of COME. (b) Two shared experts are constructed based on structure-semantic learning for the latent universal feature space. (c) Two clustering strategies in the S²E route tokens from the same dataset to specific experts, enabling expert specialization.

Results of Comparative Experiments

Tab. 1 compares COME with baselines across three training configurations: single-dataset, intra-organ (breast/thyroid) combinations, and inter-organ integration.

Table 1 - Lesion detection results on our benchmark, divided into top, middle, and bottom sections based on training data paradigms. We report mean AP score on each dataset. Among them, bold indicates optimal performance for three paradigms.

Method	Paradigms	BUSI	BUV	BUSBRA	BUSC	DDTI	TUD	TUS	TNSCUI	Mean
Open-GroundingDINO	Single	0.2495	0.3649	0.3025	0.3666	0.5391	0.7255	0.6112	0.6986	0.4822
YOLOv10	Single	0.3161	0.6216	0.5627	0.7052	0.3640	0.7450	0.4210	0.6907	0.5532
DINO	Single	0.4180	0.6754	0.5780	0.7313	0.4609	0.7373	0.6047	0.6945	0.6125
DINO-MoE	Single	0.3388	0.5510	0.5376	0.6706	0.4189	0.7088	0.5248	0.6669	0.5521
DAMEX	Single	0.3270	0.5653	0.5358	0.6458	0.4572	0.7101	0.5313	0.6727	0.5556
CerberusDet	Intra-Organ	0.3470	0.6980	0.6300	0.5770	0.5180	0.6780	0.6370	0.6830	0.5960
DINO	Intra-Organ	0.3509	0.6399	0.5193	0.7112	0.4487	0.6899	0.5666	0.6597	0.5732
DINO-MoE	Intra-Organ	0.3269	0.6795	0.5419	0.671	0.4898	0.7001	0.5801	0.6804	0.5837
DAMEX	Intra-Organ	0.3591	0.6990	0.5404	0.6693	0.5126	0.7000	0.5850	0.6881	0.5941
Our COME (Multi-Step)	Intra-Organ	0.4883	0.7995	0.6748	0.6879	0.5481	0.7032	0.5900	0.6875	0.6474
Our COME (Fine2Coarse)	Intra-Organ	0.4885	0.7876	0.6689	0.6434	0.5443	0.7039	0.5971	0.6922	0.6407
CerberusDet	Inter-Organ	0.4570	0.7940	0.6490	0.6300	0.5500	0.6600	0.6130	0.6620	0.6268
DINO	Inter-Organ	0.3825	0.6597	0.5395	0.7088	0.4583	0.6912	0.5475	0.6461	0.5792
DINO-MoE	Inter-Organ	0.4003	0.6838	0.5560	0.7227	0.5118	0.7103	0.5799	0.6973	0.6077
DAMEX	Inter-Organ	0.4098	0.7007	0.5731	0.7260	0.5214	0.7092	0.5994	0.6942	0.6167
Our COME (Multi-Step)	Inter-Organ	0.4958	0.7859	0.6912	0.7191	0.5503	0.7025	0.5972	0.7049	0.6558
Our COME (Fine2Coarse)	Inter-Organ	0.5159	0.8313	0.6719	0.7266	0.5371	0.7091	0.5725	0.7052	0.6587

Further, we present visualization results of four models trained on the integration benchmark of eight datasets, highlighting image differences and demonstrating the superiority of our COME, as shown in Fig. 3.

Fig. 3 We visualize lesion detection on the inter-organ integration benchmark, comparing the non-MoE model DINO, the MoE-based DINO-MoE and DAMEX, and ours. COME demonstrates strong performance across all heterogeneous datasets.

In the main text, we select one sample per dataset for comparison, as demonstrated in Fig. 3. Here, Fig. 4 shows additional lesion detection examples, demonstrating that our structure-semantic learning-based COME model delivers robust performance on diverse US images and holds promise for real-world clinical applications.

Fig. 4 Additional lesion detection examples from the inter-organ integrated dataset.

Details of Ablations

To validate the effectiveness of each component in COME, we implement five variants: (1) Ours w/o STE; (2) Ours w/o SEE; (3) Ours w/o Dual Shared Experts (-DSE): Discarding dual shared expert modules providing structural-semantic priors; (4) Ours w/o clustering; (5) Ours w/o Traceability loss. Quantitative and qualitative results are shown in Table. 2 and Fig. 5, respectively.

Tab. 2 - Quantitative Performance of the ablation study.

Method	BUSI	BUV	BUSBRA	BUSC	DDTI	TUD	TUS	TNSCUI	Mean
STE	0.4628	0.6802	0.6517	0.7123	0.5008	0.6827	0.5307	0.6704	0.6115
SEE	0.3849	0.6570	0.5927	0.7006	0.5231	0.6859	0.5397	0.6795	0.5954
Dual Shared Experts(-DSE)	0.3853	0.6445	0.5509	0.6897	0.5173	0.6687	0.5189	0.6802	0.5819
Clustering	0.4587	0.7093	0.6590	0.7003	0.5335	0.6952	0.5772	0.6779	0.6264
Traceability Loss	0.4721	0.7211	0.6605	0.6913	0.5341	0.6960	0.5826	0.6981	0.6320
Our CoMS²E(Fine2Coarse)	0.5159	0.8313	0.6719	0.7266	0.5371	0.7091	0.5725	0.7052	0.6587

Fig. 5 Qualitative Performance of the ablations. Sharpest performance drop upon DSE removal validates COME's shared feature space enabling collaborative expert specialization.

Details of Parameter Setting

To resolve potential conflicts in source routing, this section establishes two parameter analyses: [1] Dynamic configuration of number of source-specific experts. It is important to note that the inter-organ dataset (comprising 8 datasets) and the intra-organ dataset (comprising 4 datasets) differ in composition, leading to different ranges when selecting the number of experts. [2] Analysis of the impact of K on the Top‑K selection strategy.

Table 3(a) - Impact of the number of # Expert on the inter-organ integrated dataset.
# Experts	BUSI	BUV	BUSBRA	BUSC	DDTI	TUD	TUS	TNSCUI	Mean
4	0.4693	0.7585	0.6811	0.7179	0.5137	0.7139	0.5880	0.6906	0.6416
8	0.4898	0.8195	0.6919	0.6998	0.5508	0.7062	0.5883	0.7057	0.6565
10	0.4932	0.8061	0.6816	0.6953	0.5460	0.7120	0.5619	0.6713	0.6459

Table 3(b) - Impact of the number of # Expert on the intra-organ thyroid integrated dataset.
# Experts	TUD	TUS	DDTI	TNSCUI	Mean
2	0.5095	0.7092	0.5794	0.6921	0.6226
4	0.5594	0.6932	0.5750	0.6935	0.6303
8	0.5481	0.7032	0.5900	0.6875	0.6322
10	0.5595	0.6919	0.5580	0.6878	0.6243

Table 4 - Impact of Top-K on the S²E Module.
Top-K	BUSI	BUV	BUSBRA	BUSC	DDTI	TUD	TUS	TNSCUI	Mean
K=1	0.4898	0.8195	0.6919	0.6998	0.5508	0.7062	0.5883	0.7057	0.6565
K=2	0.4889	0.7905	0.6852	0.7248	0.5152	0.7067	0.5703	0.6882	0.6462
K=3	0.5159	0.8313	0.6719	0.7266	0.5371	0.7091	0.5725	0.7052	0.6587
K=4	0.5119	0.7646	0.6732	0.7231	0.5283	0.7108	0.4762	0.6783	0.6333

In-Depth Analysis

Q1: Does COME demonstrate zero-shot domain adaptation capability?

Four breast datasets exhibit significant distribution variations, such as source-specific noise and artifacts (similar heterogeneity in thyroid datasets). To validate COME's generalizability, we randomly train on the BUSI and test directly on the other three breast datasets. The performance comparison of DINO (non-MoE setup), DAMEX and our COME is demonstrated in Tab. 5.

Table 5. - The zero-shot domain adaptation ability of the proposed COME.
Method (Train on BUSI)	BUSI	BUV	BUSBRA	BUSC
DINO	0.418	0.481	0.357	0.043
DAMEX	0.327	0.409	0.344	0.115
COME (Fine2Coarse)	0.508	0.561	0.410	0.297

Q2: How Do Feature Representations Evolve Across Clustering Stages?

The proposed COME achieved optimal performance with its Fine2Coarse clustering strategy. As shown in Fig. 6, we visualized feature distributions across three stages: pre-clustering, post fine-grained clustering (16 centers), and final coarse-grained clustering with 8 centers.

Fig. 6 Feature Visualization of Fine2Coarse Clustering.

Simultaneously, we implement a five-step clustering, visualizing three of these steps in the Fig. 7.

Fig. 7 Feature Visualization of Multi-Step Clustering.

Conclusion

This study proposes a dual structure-semantic learning framework for multiple heterogeneous US analysis, demonstrating superior performance in lesion detection. Addressing multi-source data characteristics, we innovatively design two shared experts to construct a universal feature space by integrating structural invariance and semantic consistency, supporting source-specific feature mining. Then a source-specific expert routing module employs clustering for feature decoupling, enhanced by traceability loss to reinforce expert specialization and ensure consistent data routing. Additionally, empirical results validate the effectiveness of intra-/inter-organ data integration strategies, offering new perspectives for the US analysis community, particularly given the increasing availability of public datasets.