On 2025-06-02 17:22:57, user Karl Milcik wrote:
We reviewed this paper as part of our regular journal club. Below is a collection of the comments made by the various group members:<br />
--- 1 ---<br />
It's unclear why asymmetry in the latent embeddings is required.
There is no mention of whether the model collapses to trivial predictions during training due to the symmetric KL. An ablation might reveal that the loss weights require very careful tuning to avoid trivial solutions, or that the reference distribution is extremely important.
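To make the concern concrete, below is a minimal stdlib sketch of the kind of symmetric KL alignment term we mean, assuming diagonal-Gaussian posteriors and a weight `lambda_align` (both are our assumptions, not necessarily the authors' implementation). Note that if both encoders collapse to the same constant posterior, this term is zero for every input, which is exactly the trivial solution a careful loss weighting has to rule out:

```python
import math

def kl_diag_gauss(mu1, var1, mu2, var2):
    # KL(N(mu1, var1) || N(mu2, var2)) for diagonal Gaussians,
    # summed over latent dimensions.
    return 0.5 * sum(
        math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0
        for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)
    )

def symmetric_kl(mu1, var1, mu2, var2):
    # Symmetric alignment term between the two modality posteriors;
    # it vanishes whenever the posteriors coincide, regardless of input.
    return (kl_diag_gauss(mu1, var1, mu2, var2)
            + kl_diag_gauss(mu2, var2, mu1, var1))

# A total loss would then look like:
#   recon_a + recon_b + lambda_align * symmetric_kl(...)
# where lambda_align must be tuned against the reconstruction terms.
```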
There are a number of implicit assumptions built into the model architecture, primarily that there is sufficient shared information to align two datasets. This becomes an issue when combining datasets from very different modalities (e.g., scRNA-seq and single-cell proteomics). Adding multiple modalities is definitely possible, but the overlapping information shrinks with each addition, and modality-specific information is lost. It would be good to see where the model stops working. Small datasets will similarly carry little information: is there a minimum number of samples for the model to function as expected? (An exact number is not required, but getting a sense from a few datasets of different modalities and sizes would be informative.) As-is, we wouldn't expect the model to apply to most single-cell datasets.<br />
Aligning modalities that are of extremely-different dimensionality implies either redundant information in one modality or information loss. This should be discussed.
Specifics of training, hyperparam optimization, etc. would be better in a supplemental (assuming the targeted venue allows it). The main contribution appears to be the combination of the various losses. The article could be shortened by focusing on that when describing the method.
Re: the training procedure. There is no mention of balancing the different modalities. "Difficult" modalities would be learned more slowly, and early stopping could be preventing complex modalities from being sufficiently mapped, because the simpler modalities overfit faster than the complex ones are learned.
Evaluation metrics: NMI is closely related to the symmetric KL objective used to train the model, so I'm not sure it is a reliable, independent metric here.
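For reference, NMI is built from the same entropy quantities that appear in KL-style objectives, which is the root of the circularity concern. A stdlib-only sketch (arithmetic-mean normalization is assumed here; it is one of several conventions in common use):

```python
from collections import Counter
from math import log

def entropy(labels):
    # Shannon entropy of a labeling (natural log).
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    # Normalized mutual information between two labelings,
    # normalized by the arithmetic mean of the two entropies.
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    mi = sum(
        (c / n) * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in Counter(zip(a, b)).items()
    )
    denom = (entropy(a) + entropy(b)) / 2
    return mi / denom if denom > 0 else 1.0
```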
Fig. 2a: the figure amounts to "the model removed information," which is the point of batch correction, but it doesn't quantify what other information was lost along with it. Fig. 6 suggests that quite a bit of biological information is lost.
Fig. 3: the scRNA reconstruction produces high values for some genes when it shouldn't (purple cluster, top). If one were to use these reconstructions, one would conclude that those genes are highly differentially expressed when they are not in the original data. This is a fatal problem.
--- 2 ---<br />
1. Lack of Evaluation in Downstream Biological Applications<br />
While UniVI shows strong performance in latent space alignment and cross-modality prediction, its utility in downstream biological tasks (e.g., identifying novel cell subtypes, inferring regulatory programs, or reconstructing differentiation trajectories) remains underexplored. Demonstrating improvements in real biological discovery would substantially enhance the manuscript's impact.<br />
2. Insufficient Validation of Generalizability Across Conditions<br />
The datasets used in evaluation are mostly standard and clean (e.g., PBMCs from 10x Genomics). It is unclear whether UniVI generalizes well to more diverse or challenging settings (e.g., different sequencing technologies, species, or tissues).<br />
3. No Ablation Studies to Justify Model Design<br />
The architecture includes several important design choices (e.g., β-VAE, shared and private latent spaces, MoE layers), but the manuscript lacks ablation experiments to validate the contribution of each component.<br />
4. Lack of Interpretability for Latent Space Representations<br />
The latent space is central to UniVI’s function, but its biological interpretability is not addressed. It is unclear which features (genes, peaks, proteins) drive the alignment, or how latent dimensions relate to known biology.<br />
5. Failure Cases and Limitations Are Not Discussed<br />
The manuscript does not address situations where UniVI might fail or yield poor alignments. Understanding when and why the method breaks down would be critical for end users.
--- 3 ---<br />
1) They mention that scATAC-seq is not reliable for determining cell type specificity; why, then, did they include ATAC-seq at all?
2) The datasets they use are reliable, but I think it would be good for them to explain why exactly they preferred these datasets and databases; there is not much information about this.
--- 4 ---<br />
Figure 4: I recommend labeling panels rather than referring to "top left," etc. In the boxplots at the top left, UniVI and TotalVI look very similar in NMI, ARI, and ACC, but no formal statistical comparison is done.<br />
Usability may be limited if users have to manually fit the model to their own data.<br />
Is overfitting a problem with very small datasets? Is computational time a problem with very large datasets (e.g., is early stopping used)?
--- 5 ---<br />
-Use of the model to generate new data is stated and referenced throughout, but I felt the true utility of this is underexplored. Why would someone want to do this? The authors mentioned data augmentation, but the authors could be more explicit on any other uses.
-Did the authors consider using alternative methods to grid search for their training procedure (e.g., neural architecture search)? Also what were the ranges of values searched and with what step sizes?
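As one concrete alternative to grid search, a random search over log-uniform ranges is often more sample-efficient per trial. The search space below is purely hypothetical (the paper's actual ranges and step sizes are not stated), meant only to illustrate the suggestion:

```python
import random
from math import exp, log

# Hypothetical search space -- the paper does not state its ranges.
SPACE = {
    "beta": (0.1, 10.0),           # log-uniform
    "lr": (1e-4, 1e-2),            # log-uniform
    "latent_dim": [8, 16, 32, 64], # categorical
}

def sample_config(rng):
    # Draw one hyperparameter configuration at random.
    def log_uniform(lo, hi):
        return exp(rng.uniform(log(lo), log(hi)))
    return {
        "beta": log_uniform(*SPACE["beta"]),
        "lr": log_uniform(*SPACE["lr"]),
        "latent_dim": rng.choice(SPACE["latent_dim"]),
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]
```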
-For adding >2 modalities, are there any considerations around computational complexity and training time at a certain point? How would this scale to K > 2 modalities?
-In general, the paper is well organized and detailed, but almost to a fault. I suggest moving details less relevant to the average reader into a supplemental section. For example, knowing the function calls and variables probably isn't relevant to most readers; those who want that level of detail could look at the code, or the authors could point them to a supplement. These less relevant details were also mixed in with critical ones, such that I felt a little lost trying to pick out the most important parts of the methods.
-On the same note, simple details are often over-explained or restated multiple times in the text (e.g., the explanation for subsetting the data to obtain non-overlapping labels is repeated several times), while more complex concepts, such as the β term and the mixture-of-experts model, are underexplained in my opinion.
-For Figure 1, I am still confused about what exact benefit UniVI provides in some panels over simply looking at individual UMAPs annotated by the known labels, since these are already known. A more specific explanation of why a shared latent space is useful for finding new biology would help.
-Exploring more of the fringe cases in which data does not align would be interesting. For example, the authors mention cell 59 aligning closer to a dendritic cell than a B cell. They mention this could be biological variation or technical error, but exploring this 'misalignment' further in this and other datasets could be a key way of identifying unique insights from the model, though it would require biological validation. Perhaps the authors could suggest such experiments as future work, tying in dry- and wet-lab approaches/experimental designs that would complement this model in the lab.
--- 6 ---<br />
In the paper authors mention that approximately 1% of the dataset shows inconsistent alignment. Could you elaborate on how this might be interpreted as reflecting dynamic cellular states in continuous development? A deeper discussion of this would be very helpful.
--- 7 ---<br />
Figure 7: how can the authors prove that the reconstruction retains the biological signal, or illustrate this better?<br />
It's odd that the error did not increase significantly with the higher dropout rate.<br />
The same holds for the correlation.<br />
When no dropout is applied, the correlation between the raw and reconstructed data is only 0.52. Does this suggest that the pathways have changed significantly? It may be necessary to check which pathways have changed and which have not.
--- 8 --- <br />
There is no description of QC metrics or of whether any filtering was applied to the data; transparency about QC is missing.
--- 9 ---<br />
A limitation is that this can only be used for measurements made on the exact same cells; the framework cannot be applied to cells measured in parallel with different methods.
Figure 2: I'm not sure that they compared to CCA or OT, even though those were introduced as alternatives at the beginning.
Figure 2: I like that they show the measurement pairs for each cell. Can they quantify this globally somehow?
The distinction between "imputation" and alternative-modality reconstruction is unclear from their description. They mention fitting a Gaussian mixture model to their data and then using that as input; does that mean they use the true values from one measurement modality and all zeros for the other? Why not simply run a forward pass through one modality's encoder and then use the opposite decoder?
They comment on higher expression levels having higher reconstruction MSE. This is a common feature of autoencoders, which compress the range of predictions so as to minimize error from large-magnitude predictions. However, the methods claim to have used pp.scale(), which should have removed this effect of the measurements' original magnitude.
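For context, scanpy's pp.scale() z-scores each gene, so after scaling every gene has zero mean and unit variance and raw magnitude should no longer drive the MSE. A stdlib sketch of that behavior (omitting scanpy's optional clipping):

```python
from statistics import mean, pstdev

def scale_columns(matrix):
    # Z-score each column (gene): subtract its mean and divide by its
    # population standard deviation -- roughly what sc.pp.scale does.
    cols = list(zip(*matrix))
    scaled = []
    for col in cols:
        mu, sd = mean(col), pstdev(col)
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled)]
```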
It would be interesting to know what are the limits in terms of minimum (or maximum) features per modality and minimum measurements for training.
Based on Figure 4, the claim that UniVI "outperforms existing state of the art integration methods" does not appear to be statistically supported; it appears to be indistinguishable from TotalVI and perhaps even Seurat. The authors should compute p-values using random samples of the data with replacement (I think these experiments used identical samples, which would violate the assumption of independence for t-testing). TotalVI appears to have been published over 4 years ago in Nature Methods. However, they claim that TotalVI requires "modality-specific priors". This "prior" appears to be a specific model term that is learned from the data to account for background, so I agree that UniVI is more generalized, but not by as much as I thought before seeing this prior work.
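To be explicit about the suggested test, a paired bootstrap over per-sample scores sidesteps the independence assumption of a t-test. A minimal sketch (function and variable names are ours, not the authors'):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10000, seed=0):
    # 95% bootstrap confidence interval for the mean per-sample score
    # difference between two methods evaluated on the same samples,
    # resampling sample indices with replacement. If the interval
    # excludes 0, the claimed gap has statistical support.
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = sorted(
        sum(scores_a[i] - scores_b[i]
            for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(n_boot)
    )
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```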
The authors should be careful about statements of distance based on UMAP, e.g., "The model preserved meaningful cellular distinctions, with closely related populations remaining spatially proximate in the latent space, underscoring UniVI's ability to harmonize intra-modality variation while retaining biologically relevant structure."
Figure 6C is a neat application of this data. Does this scale beyond this dataset, and can the representations be made less diffuse ("slushy")?
Can this be fit on very deep single cell omic data and then applied to predict missing depth from more shallow studies?
It would be interesting to repeat the dropout experiment with multiple random dropouts to get a sense of variance in the genes that are dropped out.
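Concretely, repeating the experiment would just mean redrawing the dropout mask with a fresh seed each time and summarizing the resulting error metric as mean ± sd across realizations. A sketch with hypothetical names:

```python
import random

def apply_dropout(matrix, rate, rng):
    # Zero out each entry independently with probability `rate`,
    # simulating one realization of the dropout experiment.
    return [[0.0 if rng.random() < rate else v for v in row]
            for row in matrix]

def repeated_dropout(matrix, rate, n_repeats=10, seed=0):
    # Fresh mask per repeat, so the variance in which genes are
    # dropped (and in downstream error) can be estimated.
    rng = random.Random(seed)
    return [apply_dropout(matrix, rate, rng) for _ in range(n_repeats)]
```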
I'm confused why the pre- and post-reconstruction heatmaps in Figure 7 bear no resemblance to each other even with 0% dropout. Are these hierarchically clustered differently, or should we be able to compare the shapes between them?
Is there overlapping information between true SCP and SCT (beyond cite-seq where the proteomic measurement part is substantially limited based on the number of antibodies)?
Does this work well beyond measurements from blood cells (what seems like an easy case)?
--- 10 ---<br />
I was hoping to see more of the unified cell state concept play out in its experiments. I feel like they got sidetracked (or rather, realized they didn’t have enough to really fulfill that ambition), but it would be nice to have that addressed more clearly.
I was wondering if weights trained for a single modality as paired to a second modality could be transferred to a third modality comparison. Doubtful, but it would be interesting to explore.<br />
Not sure if this is something that you actually want to include in the review. It was more what I was focusing on and was somewhat dissatisfied by.
The text in the figures is too small to read, generally speaking. I found issues with all figures with the possible exception of the first.<br />
Figure 1b: the Cell-Cell Alignment panel is not intuitive. It goes from a UMAP to a decoded graph figure, and is not consistent with the batch-correction element of the same subfigure. It's an odd inconsistency.