Author response:
The following is the authors’ response to the original reviews.
Public Reviews:
Reviewer #1 (Public review):
Summary:
In the paper, the authors propose a new RNA velocity method, TSvelo, which predicts the transcription rate linearly based on the expression of RNA levels of transcription factors. This framework is an extension of its recent work TFvelo by including unspliced reads and designing a coherent neuralODE framework. Improved performance was demonstrated in six diverse datasets.
Strengths:
Overall, this method introduces innovative solutions to link cell differentiation and gene regulation, with a balance between model complexity (neuralODE) and interpretability (raw gene space).
We thank the reviewer for the positive evaluation of our work and for recognizing the novelty of the proposed framework. We appreciate the reviewer’s summary highlighting that TSvelo extends our previous method TFvelo by incorporating unspliced reads and introducing a coherent neuralODE framework to model transcription dynamics.
We are encouraged that the reviewer recognizes the potential of our approach to link cell differentiation with gene regulatory mechanisms, while maintaining a balance between model expressiveness and interpretability in the gene expression space. In the revised manuscript, we have further clarified several methodological details and strengthened the presentation to better highlight these aspects.
Weaknesses:
While it seems to provide convincing results, there are multiple technical concerns for the authors to clarify and double-check.
(1) The authors should clarify and discuss the TF-target map: here, the TF-target genes map is predefined by the TF binding's ChIP-seq data. This annotation is largely incomplete and mostly compiled from a set of bulk tissues. Therefore, for a certain population, the TF-target relation may change. This requires clarification and discussion, possibly exploring how to address this in the model. In addition, a regulon database could be added, e.g., DoRothEA?
We thank the reviewer for this important comment. The TF–target maps used in TSvelo (e.g., derived from ChIP-seq-based resources such as ENCODE) reflect aggregated TF binding evidence collected across diverse bulk cell types and experimental conditions. As such, they are inherently incomplete and do not capture fully context-specific regulatory activity in a given primary tissue. In TSvelo, we therefore do not treat these annotations as fixed or cell-type-specific ground truth regulatory relationships. Instead, they are used as a permissive prior that encodes a broad set of potential regulatory interactions.
Within the TSvelo framework, the contribution of each TF–target interaction is learned from data through weight estimation, allowing the model to down-weight or effectively ignore prior edges that are inconsistent with the observed single-cell expression dynamics. This design enables TSvelo to remain robust even when the prior TF–target map is noisy, incomplete, or derived from heterogeneous bulk contexts.
Following the reviewer’s suggestion, we additionally incorporated the DoRothEA regulon database as an alternative prior with confidence-level filtering. We further performed ablation studies on the pancreas dataset and the gastrulation erythroid dataset using different TF–target resources, including ChEA, ENCODE, and their combinations with DoRothEA.
The results on the pancreas dataset and the gastrulation erythroid dataset are shown in Figure S13 and Figure S14 respectively, which come up with the same conclusion. We observed highly consistent results across most TF–target prior combinations, including ChEA, ENCODE, ChEA+ENCODE, ChEA+DoRothEA, ENCODE+DoRothEA, and ChEA+ENCODE+DoRothEA. Using the pancreas dataset as example, the mean velocity consistency ranged from 0.985 to 0.995, the mean in-cluster coherence ranged from 0.983 to 0.992, and the mean cross-boundary direction correctness ranged from 0.719 to 0.740 across all settings. These consistently high and tightly bounded metrics indicate that TSvelo is largely insensitive to the specific choice of TF–target prior.
The only configuration showing reduced stability was the use of DoRothEA alone, particularly in terms of cross-boundary direction correctness. This is likely due to its comparatively limited coverage of TF–target interactions. For instance, in the pancreas dataset, only 81 out of 2000 highly variable genes (HVGs) could be associated with TFs based on DoRothEA, corresponding to 102 TF–target links in total, which may restrict downstream regulatory modeling. In contrast, ChEA covered 1793 genes with 13,976 TF–target links, and ENCODE covered 1854 genes with 33,076 links. These results further suggest that integrating multiple TF–target resources could improve performance, likely due to increased coverage and complementary regulatory information.
We further acknowledge that regulatory interactions are inherently context-dependent, and that no static TF–target resource can fully capture tissue-specific regulatory programs. In the revised Discussion, we explicitly clarify this limitation and highlight that incorporating context-specific regulatory data (e.g., single-cell chromatin accessibility or perturbation-based regulatory maps) represents an important direction for future improvement.
(2) The authors should clarify how example genes are selected. This is particularly unclear in Figure 2d.
We thank the reviewer for raising this point. The example genes shown in Fig. 2d were selected to illustrate representative scenarios where our method provides advantages, particularly cases in which the unspliced–spliced 2D phase portrait exhibits mixed or overlapping patterns that are difficult to model using conventional RNA velocity approaches. These examples are therefore intended to demonstrate the types of transcriptional dynamics that TSvelo is designed to better capture.
To avoid the impression of selective presentation, we note that our conclusions are based on systematic evaluation across all genes and datasets. Additional visualizations for a broader set of genes on this dataset are provided in Fig. S3. We have clarified the example gene selection criteria in the revised manuscript.
(3) The authors should clarify confidence in the statement in lines 179-180, that ANXA4 should initially decrease. This is particularly concerning, as TSvelo didn't capture the cell cycle transitions well during the initial part.
We thank the reviewer for raising this point. The statement that ANXA4 initially decreases is based on the observed expression pattern in the dataset rather than on cell-cycle–related dynamics inferred by the model. Specifically, ANXA4 shows higher expression in Ductal cells compared to Ngn3 EP cells, and Ductal represents an earlier stage in the developmental trajectory. Therefore, along the Ductal to Ngn3 EP transition, ANXA4 naturally exhibits an initial decrease in expression. We have clarified this point in the revised manuscript.
(4) A support reference should be added for the statement in line 260 that "neuron migrations are inside-out manner". There is no reference supporting this, and this statement is critical for the model assessment.
We thank the reviewer for this suggestion. This pattern has been reported in previous studies [1,2], which have been added into the revised manuscript.
To Improve clarity, we have also revised the statement in the manuscript as follows:
“During cortical development, neurons follow an inside-out layering pattern in which earlier-born neurons populate the deep cortical layers, whereas later-born neurons migrate past them to occupy more superficial layers.”
(1) Nadarajah, B., Parnavelas, J. Modes of neuronal migration in the developing cerebral cortex. Nat Rev Neurosci 3, 423–432 (2002).
(2) Li, C., Virgilio, M.C., Collins, K.L. et al. Multi-omic single-cell velocity models epigenome–transcriptome interactions and improves cell fate prediction. Nat Biotechnol 41, 387–398 (2023).
(5) The comparison to scMultiomics data is particularly interesting, as MultiVelo uses ATAC data to predict the transcription rate. It would be very insightful to add a direct comparison of the estimated transcription rate between using ATAC and directly using TFs' RNA expressions.
We thank the reviewer for suggesting this highly interesting comparison between ATAC-derived regulatory activity and TF RNA-based proxies for transcription rate estimation.
We have conducted the requested analysis by computing gene-wise chrome accessibility rate used in MultiVelo and the learned transcription rate from TSvelo, and evaluated their correlation across genes. As shown in Figure S15, the two estimates exhibit almost no global correlation across genes, indicating that they capture substantially different aspects of regulatory information.
This discrepancy is not unexpected and reflects the fundamental differences between these modalities. scATAC-seq measures chromatin accessibility, which provides a proxy for cis-regulatory potential of genomic regions. However, ATAC signals are inherently sparse and often exhibit a near-binary structure, limiting their ability to directly capture fine-grained temporal regulatory dynamics. In contrast, TF RNA expression reflects downstream transcriptional output, which is shaped by multiple regulatory layers, including post-transcriptional regulation, protein activity, temporal delays, and indirect regulation through intermediate transcriptional or signaling pathways. As a result, these two modalities are expected to capture complementary but not directly comparable aspects of gene regulation.
Overall, this result suggests that ATAC-based and TF RNA-based signals capture distinct aspects of gene regulation. This further implies that integrating both modalities may be beneficial for future models that aim to more comprehensively characterize transcriptional regulation. We have added this discussion to the supplementary information.
(6) In Figure 6g, it should be clarified how the lineage was determined. Did the authors use the LARRY barcodes, predicted cell fate, or any other methods? Here, the best way is probably using the LARRY barcodes for individual clones.
We thank the reviewer for this suggestion. The lineage assignment used in Fig. 6g is described in the Methods section (“Lineage segmentation and pseudotime initialization”). Briefly, lineages are inferred from the transcriptomic structure of the data by performing Leiden clustering followed by PAGA-based connectivity analysis. Starting from an initial Leiden cluster, the filtered PAGA graph defines the shortest paths to other clusters, which are considered as the detected lineages, and diffusion pseudotime (DPT) is then used to initialize pseudotime along each lineage. Thus, in this analysis lineages are determined from the expression-derived trajectory structure. We have clarified this point in the revised manuscript and refer readers to the Methods section.
Reviewer #2 (Public review):
Summary:
Li et al. propose TSvelo, a computational framework for RNA velocity inference that models transcriptional regulation and gene-specific splicing using a neural ODE approach. The method is intended to improve trajectory reconstruction and capture dynamic gene expression changes in scRNA-seq data. However, the manuscript in its current form falls short in several critical areas, including rigorous validation, quantitative benchmarking, clarity of definitions, proper use of prior knowledge, and interpretive caution. Many of the authors' claims are not fully supported by the evidence.
We thank the reviewer for the careful evaluation of our manuscript and for the constructive comments. We appreciate the concerns regarding validation, benchmarking, methodological clarity, and interpretation. In the revised manuscript, we have carefully addressed these points by adding additional analyses, clarifying methodological details, and moderating several claims to ensure they are fully supported by the data. Detailed responses to each comment are provided below.
Major comments:
(1) Modeling comments
(a) Lines 512-513: How does the U-to-S delay validate the accuracy of pseudotime? Using only a single gene as an example is not sufficient for "validation."
We thank the reviewer for this important clarification. In the revised manuscript, we have rephrased this part to clarify that Fig. 1a serves only as an illustrative example showing the U-to-S delay for a single gene. Accordingly, we have corrected our statement to indicate that the U-to-S delay is used to infer trajectory orientation, rather than to validate the accuracy of pseudotime.
In addition, we have expanded the description to explain that U-to-S delay signals are aggregated across all genes to provide a more robust and comprehensive assessment for this purpose. Additional analysis is provided in our response to the next comment.
(b) Lines 512-518: The authors propose a strategy for selecting the initial state, but do not benchmark how accurate this selection procedure is, nor do they provide sufficient rationale. While some genes may indeed exhibit U-to-S delay during lineage differentiation, why does the highest U-to-S delay score indicate the correct initiation states? Please provide mathematical justification and demonstrate accuracy beyond using a single gene example. Maybe a simulation with ground truth could help here, too.
We thank the reviewer for this insightful comment. In the revised manuscript, we have clarified both the intuition and justification of this approach. Briefly, along a correctly oriented trajectory, unspliced (U) expression is expected to precede spliced (S) expression due to transcriptional dynamics. Ideally, this U-to-S delay would be observable at the level of individual genes. However, due to the high noise inherent in scRNA-seq data, such delays are often not consistently detectable on a per-gene basis. To address this, we aggregate U-to-S delay signals across all genes and determine the lineage orientation by maximizing a global delay score. Under this criterion, the cluster from which all outgoing lineages exhibit the highest aggregated U-to-S delay is inferred to correspond to the initial state.
We emphasize that this approach relies on genome-wide aggregation rather than any single gene. Moreover, the same strategy is applied uniformly across all six datasets using identical parameter settings, demonstrating its robustness and stability. To further address the reviewer’s concern, we additionally present the U-to-S delay scores for each Leiden cluster when treated as the initial state across all datasets (Author response image 1). The results on all datasets suggest that the highest U-to-S delay scores can be used to detect the initial cluster.
Author response image 1.
The U-to-S delay scores for each Leiden cluster when treated as the initial state across all datasets.

Following your suggestions, we also add a simulation study. We generated synthetic single-cell RNA velocity datasets using a mechanistic transcriptional dynamics model with one or multiple developmental branches. The system included 200 genes, among which 30 were designated as transcription factors (TFs).
For each branch, we independently sampled a TF–target regulatory matrix W ϵ R<sup>30×200</sup> from a standard normal distribution to simulate distinct GRN structures. Gene expression dynamics were modeled using a coupled ordinary differential equation (ODE) system describing unspliced and spliced RNA abundances:


where u and s denote unspliced and spliced RNA levels, respectively. The transcription rate α was computed as a nonlinear function of TF expression, defined as a weighted sum of spliced TF abundance, followed by clipping to ensure bounded activation.
Each branch is initialized from the same randomly sampled initial condition drawn from a gamma distribution, allowing controlled divergence of trajectories driven solely by branch-specific regulatory programs.
To simulate observed sequencing counts, we introduced technical noise by scaling latent expression levels with cell-specific library sizes drawn from a log-normal distribution. The resulting expression counts were generated using a negative binomial sampling model:

where θ controls over dispersion, with smaller values corresponding to higher noise levels. The final datasets consist of paired unspliced (U) and spliced (S) count matrices with realistic transcriptional stochasticity and branching gene regulatory dynamics. For each branch, cells were further divided into three developmental stages for downstream analysis.
We evaluated TSvelo on multiple simulated datasets with varying numbers of branches and noise levels. There are two or three branches start from the same root cell groups in these datasets (Branch 1: stage 0 - stage 1 - stage 2. Branch 2: stage 0 - stage 3 - stage 4. Branch 3: stage 0 - stage 5 - stage 6). The results of initial state identification based on the unspliced-to-spliced (U-to-S) delay, along with the corresponding 2D velocity stream visualizations, are presented in Supplementary Figure S1. These results demonstrate that the U-to-S delay–based initialization is robust and consistently identifies cells corresponding to the earliest developmental stage (“stage 0”) across different simulation settings. All additional results have been included in the Supplementary Information.
(c) Equation (8): The formulation looks to be incorrect. If $$W \in \mathbb{R}^{G\times G}$$ and $$W' - \Gamma' \in \mathbb{R}^{K\times K}$$, how can they be aligned within the same row? Please clarify.
We thank the reviewer for pointing this out. This was a typographical error in the manuscript. In the third line of Equation (8), the term should be W’ instead of W. We have corrected this in the revised manuscript to ensure dimensional consistency.

(d) The use of prior knowledge graphs from ENCODE or ChEA to constrain regulation raises concerns. Much of the regulatory information in these databases comes from cell lines. How can such cell-line-based regulation be reliably applied to primary tissues, as is done throughout the manuscript? Additional experiments are needed to test the robustness of TSvelo with respect to prior knowledge.
We thank the reviewer for this important comment. In TSvelo, TF–target networks from resources such as ENCODE and ChEA are incorporated as priors that guide the model toward biologically plausible regulatory structures. Importantly, the contribution of each TF–target interaction is learned from the data, allowing the model to down-weight or override potentially inaccurate or context-mismatched regulatory links. By aggregating signals across a large number of genes, the model further reduces sensitivity to noise and incompleteness in any single prior network.
To evaluate robustness with respect to prior knowledge, we incorporated the DoRothEA regulon resource as an alternative TF–target prior with confidence-level filtering. We further performed ablation studies on the pancreas dataset and the gastrulation erythroid dataset using different TF–target resources, including ChEA, ENCODE, and their combinations with DoRothEA.
The results on the pancreas dataset and the gastrulation erythroid dataset are shown in Figure S13 and Figure S14 respectively, which come up with the same conclusion. We observed highly consistent results across most TF–target prior combinations, including ChEA, ENCODE, ChEA+ENCODE, ChEA+DoRothEA, ENCODE+DoRothEA, and ChEA+ENCODE+DoRothEA. Using the pancreas dataset as example, the mean velocity consistency ranged from 0.985 to 0.995, the mean in-cluster coherence ranged from 0.983 to 0.992, and the mean cross-boundary direction correctness ranged from 0.719 to 0.740 across all settings. These consistently high and tightly bounded metrics indicate that TSvelo is largely insensitive to the specific choice of TF–target prior. Notably, these results further suggest that even when the underlying regulatory resources differ in origin (e.g., cell-line-derived vs. curated or aggregated datasets), the inferred dynamics remain stable.
The only configuration showing reduced stability was the use of DoRothEA alone, particularly for cross-boundary direction correctness. This is likely due to its comparatively limited coverage of TF–target interactions. For instance, in the pancreas dataset, only 81 out of 2000 highly variable genes (HVGs) could be associated with TFs based on DoRothEA, corresponding to 102 TF–target links in total, which may limit downstream regulatory modeling. In contrast, ChEA covered 1793 genes with 13,976 TF–target links, and ENCODE covered 1854 genes with 33,076 links. These results further suggest that integrating multiple TF–target resources can improve performance, likely due to increased coverage and complementary regulatory information.
We agree that regulatory interactions derived from resources such as ENCODE and ChEA may not fully generalize to primary tissues due to their context-dependent nature. In the revised Discussion, we explicitly clarify this limitation, particularly their inability to capture tissue-specific regulatory programs. We further highlight that incorporating context-specific regulatory data, such as single-cell chromatin accessibility or perturbation-based regulatory maps, represents an important direction for future improvement.
(e) Lines 579-580: How is the grid search performed? More methodological details are required. If an existing method was used, please provide a citation.
The grid search for the time step means that the model evaluates the loss in equation (10) across all candidate values of t<sub>step</sub> in the set {0,1,2,...,999}. This strategy was originally adopted in scVelo for optimizing the time step parameter. We have now added the corresponding citation to scVelo in the revised manuscript.
(2) Application on pancreatic endocrine datasets
(a) Lines 140-141: What is the definition of the final pseudotime-fitted time t or velocity pseudotime?
There is no distinction between “final pseudotime”, “fitted time t” and “velocity pseudotime”. All of them refer to the same quantity in our framework. To eliminate any potential ambiguity, we have standardized the terminology by replacing “final pseudotime” with “pseudotime”.
(b) Lines 143-144: The use of the velocity consistency metric to benchmark methods in multi-lineage datasets is incorrect. In multi-lineage differentiation systems, cells (e.g., those in fate priming stages) may inherently show inconsistency in their velocity. Thus, it is difficult to distinguish inconsistency caused by estimation error from that arising from biological signals. Velocity consistency metrics are only appropriate in systems with unidirectional trajectories (e.g., cell cycling). The abnormally high consistency values here raise concerns about whether the estimated velocities meaningfully capture lineage differences.
We thank the reviewer for raising this important point regarding the use of the velocity consistency metric in multi-lineage systems. Velocity consistency was initially introduced by scVelo [1] and implemented as scvelo.velocity_confidence() in its package. Velocity consistency provides one of the few widely adopted quantitative criteria for benchmarking RNA velocities [2]. We agree that it is especially suitable for single-lineage processes. For datasets with clear multi-lineage differentiation (Fig. 5 and Fig. 6), we do not use this metric, precisely to avoid the issue highlighted by the reviewer.
However, the pancreatic endocrine dataset (Fig. 2) exhibits minimal branching, making velocity consistency be more appropriate. As introduced by veloVI study, RNA velocities are supposed to change smoothly over the phenotypic manifold [3]. Higher consistency indicates that neighboring cells show compatible velocity directions, reflecting stable and coherence of the inferred velocity field. Additionally, multiple previous studies used velocity consistency to evaluate model performance on this pancreas dataset [2,3,4], providing a standard point of comparison.
To better address your concerns, we have replaced the corresponding panel in Fig. 2 of the main text with an evaluation of cell-type separability in both the traditional 2D (unspliced–spliced) phase portrait and the learned 3D (α–unspliced–spliced) phase portrait by TSvelo (Author response image 4 in our response to your subsequent question). We appreciate your suggestions, as the comparison more clearly highlights the novelty and contribution of TSvelo and helps explain its improved performance. Now, the velocity consistency panel has been moved to the Supplementary Information. In addition, we have added a clearer explanation of the cross-boundary correctness metric in the revised manuscript.
(1) Bergen, V., Lange, M., Peidli, S., Wolf, F. A., & Theis, F. J. (2020). Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology, 38(12), 1408-1414.
(2) Luo, Y., Ren, J., Yang, Q. ... & Li, Q. (2026). Benchmarking RNA velocity methods across 17 independent studies, Cell Reports Methods, 101367.
(3) Gayoso, A., Weiler, P., Lotfollahi, M., Klein, D., Hong, J., Streets, A., ... & Yosef, N. (2024). Deep generative modeling of transcriptional dynamics for RNA velocity analysis in single cells. Nature Methods, 21(1), 50-59.
(4) Li, J., Pan, X., Yuan, Y., & Shen, H. B. (2024). TFvelo: gene regulation inspired RNA velocity estimation. Nature Communications, 15(1), 1387.
(c) The improvement of TSvelo over other methods in terms of cross-boundary direction correctness looks marginal; a statistical test would help to assess its significance.
We thank the reviewer for this insightful comment. In the revised manuscript, we have added statistical tests for evaluated metrics, including velocity consistency, cross-boundary direction correctness, and in-cluster coherence.
As shown in Author response image 2, TSvelo significantly outperforms all baseline methods in terms of velocity consistency across both datasets. For in-cluster coherence, TSvelo achieves significantly better performance on the gastrulation (erythroid) dataset, while on the pancreas dataset it performs comparably to the best-performing baselines (UniTVelo and TFvelo) and significantly outperforms several competing methods, including CellDancer, Dynamo, and scVelo.
For cross-boundary direction correctness, TSvelo shows consistent improvements in mean performance on the pancreas dataset (Author response image 3), and significantly outperforms Dynamo and scVelo on the gastrulation dataset. Although not all pairwise comparisons on cross-boundary direction correctness reach statistical significance, this is likely influenced by the limited number of independent samples (n = 7 and n = 4 for the two datasets, respectively), which reduces statistical power for detecting differences. Importantly, TSvelo still achieves the best average performance among all methods, indicating a consistent overall trend in favor of TSvelo.
We have added these results into the revised manuscript.
Author response image 2.
The quantitative comparison between TSvelo and baseline approaches on the pancreas dataset (panel a) and the gastrulation erythroid dataset (panel b). In each plot, methods are ranked in descending order of their mean values. Numbers at the bottom indicate the sample size for each metric. Significance is determined using a one-sided Mann–Whitney U test. *****, ***, ** and * represent p < 0.00001, 0.0001 ≤ p < 0.001, 0.001 ≤ p < 0.01, and 0.01 ≤ p < 0.05, respectively.

Author response image 3.
The comparison of mean cross-boundary direction correctness on the pancreas dataset.

(d) Lines 177-178: Based on the figure, TSvelo does not appear to clearly distinguish cell types. A quantitative metric, such as Adjusted Rand Index (ARI), should be provided.
We thank the reviewer for this helpful suggestion. To quantitatively assess whether TSvelo can distinguish cell types, we evaluated the separability of cell-type labels in both the 2D (unspliced–spliced) phase portrait adopted by previous RNA velocity approaches, and the 3D (α–unspliced–spliced, α denotes the transcriptional rate) phase portrait introduced by TSvelo.
Specifically, we evaluated how well the embedding preserves cell-type information using a k-nearest neighbors (kNN) classification accuracy with 5-fold cross-validation. Given an embedding matrix in 2D or 3D space (X 𝛜 ℝ<sup>n*d</sup>, where n is the number of cells and d is 2 or 3) and corresponding cell-type labels (y 𝛜 {1, … ,C}, we partition the data into five folds. For each fold (k), a kNN classifier with K = 5, denoted asf<sup>(k)</sup>, is trained on the training subset and evaluated on the held-out test subset. The classification accuracy for the k-th fold is defined as ℝ

where n<sub>k</sub> is the number of samples in the test set and 1(.)is the indicator function. The final score is obtained by averaging across all folds:

This metric directly assesses whether cells of the same type are positioned close to each other in the embedding space, and is widely used to quantify representation quality.
Using this evaluation, we observed that the 3D phase portrait consistently achieves significantly higher accuracy than the 2D phase portrait (Author response image 4). The improvement is highly statistically significant (one-sided Mann–Whitney U test, p-value = 4.37 × 10<sup>-10</sup>), demonstrating that the 3D representation provides substantially better separation of cell types.
We have added these quantitative results to the revised manuscript to complement the visual evidence and to clarify that TSvelo effectively distinguishes cell types in the learned representation.
Author response image 4.
The evaluation of the separability of cell-type labels in both the 2D (unspliced–spliced) phase portrait and the 3D (α–unspliced–spliced) phase portrait for the pancreas dataset.

(e) Lines 179-183: The claim that traditional methods cannot capture dynamics in the unspliced-spliced phase portrait is vague. What specific aspect is not captured-the fitted values or something else? Evidence is lacking. Please provide a detailed explanation and quantitative metrics to support this claim.
We thank the reviewer for this important comment. We have revised the text to more clearly illustrate this point using representative example genes as follows: “For instance, ANXA4 shows higher expression in Ductal cells compared to Ngn3 low EP cells, which mean its expression pattern exhibits an initial decrease followed by an increase. Such dynamics are not easily captured in the conventional unspliced–spliced phase portrait used by previous approaches, as many baseline methods implicitly assume a decreasing–then–increasing expression pattern. By comparison, TSvelo can still fit such expression pattern by using additional information from the 3D phase portrait.”
In addition, we also clarify that the 2D u–s representation has limited capacity to separate heterogeneous dynamic cell states, which can affect downstream velocity field estimation. In the conventional 2D u–s phase portrait, cells from different dynamic regimes may overlap in the same region of the embedding space. This overlap reduces the identifiability of underlying transcriptional states and makes the inferred local dynamics more ambiguous. In contrast, TSvelo introduces an additional latent variable α, forming a 3D (α, u, s) phase portrait, which helps disentangle these mixed trajectories and yields a more structured and separable representation of cell dynamics. We have provided quantitative evidence in the previous response (Author response image 4). Briefly, the proposed 3D representation achieves consistently higher kNN classification accuracy (5-fold cross-validation, k=5) for cell state identification compared to the 2D u–s embedding.
(3) Application to gastrulation erythroid datasets
(a) Lines 191-194: The observation that velocity genes are enriched for erythropoiesis-related pathways is trivial, since the analysis is restricted to highly variable genes (HVGs) from an erythropoiesis dataset. This enrichment is expected and therefore not informative.
We thank the reviewer for this comment and agree that such enrichment is expected given the use of HVGs from an erythropoiesis dataset. This analysis was included only as a preliminary sanity check to support the plausibility of the inferred velocity genes, rather than as a main result. We have accordingly simplified the description and clarified that this analysis serves only as a preliminary check in the revised manuscript.
(b) Lines 227-228: It remains unclear how TSvelo "accurately captures the dynamics." What is the definition of dynamics in this context? Figure 3g shows unspliced/spliced vs. fitted time plots and phase portraits, but without a quantitative definition or measure, the claim of superiority cannot be supported. Visualization of a single gene is insufficient; a systematic and quantitative analysis is needed.
We thank the reviewer for this important comment. We have revised the text to more clearly illustrate this point using representative example genes as follows: “For HSP90AB1, which exhibits a counter-clockwise pattern in the unspliced–spliced phase portrait, in contrast to the clockwise dynamics typically assumed by most baseline approaches, it is difficult for previous methods to capture this behavior, whereas TSvelo can still faithfully model such patterns. For genes such as RPS26, which have critical roles in the development in blood progenitors to erythroid40, the unspliced-spliced data is so noisy that cells of different types overlap in phase portrait. TSvelo can still captures the gene dynamics and reveals differences in transcription rates across cell types.”
In addition, we explicitly emphasize the role of the 3D (α, u, s) phase portrait, which provides a more structured and separable representation of transcriptional states compared to the conventional 2D u–s space. This improved representation is the key factor underlying the advantages of TSvelo in modeling transcriptional processes. In the conventional 2D u–s phase portrait, cells from different transcriptional states may overlap, leading to reduced separability. In contrast, introducing the latent variable α expands the representation to a 3D space, which helps disentangle these mixed states and yields a clearer phase structure. Similar to our previous response in Author response image 4, we provide quantitative evidence on this gastrulation erythroid dataset in Figure S7, showing that the 3D representation achieves consistently higher kNN classification accuracy for cell state separation compared to the 2D u–s embedding (one-sided Mann–Whitney U test, p-value = 0.002).
(4) Application to the mouse brain and other datasets
(a) Lines 280-281: The authors cannot claim that velocity streams are smoother in TSvelo than in Multivelo based solely on 2D visualization. Similarly, claiming that one model predicts the correct differentiation trajectory from a 2D projection is over-interpretation, as has been discussed in prior literature see PMID: 37885016.
We thank the reviewer for this important comment. Consistent with other RNA velocity studies, TSvelo employs the 2D UMAP stream plot for visualizing the results. We agree that conclusions based solely on 2D visualizations may lead to over-interpretation. Our intention was to provide an intuitive visualization rather than a rigorous quantitative comparison. Accordingly, we have revised the text to avoid making definitive claims about smoothness or correctness of differentiation trajectories based solely on 2D projections.
(b) Lines 304-306: Beyond transcriptional signal estimation, how is regulation inferred solely from scRNA-seq data validated, especially compared with scATAC-seq data? Are there cases where transcriptome-based regulatory inference is supported by epigenomic evidence, thereby demonstrating TSvelo's GRN inference accuracy?
We thank the reviewer for this important question regarding the validation of regulatory inference derived from scRNA-seq data and its comparison to scATAC-seq-based evidence.
We would like to first clarify the scope of TSvelo. Similar to existing RNA velocity methods, the primary goal of TSvelo is to model transcriptional dynamics and accurately infer cell state transitions and cell fate trajectories. In this context, gene regulatory information is not inferred de novo from data, but incorporated as prior knowledge from curated TF–target databases to guide and constrain the dynamics modeling process, as described in our Introduction.
We have conducted the requested analysis by computing gene-wise chrome accessibility rate used in MultiVelo and the learned transcription rate from TSvelo, and evaluated their correlation across genes. As shown in Figure S15, the two estimates exhibit almost no global correlation across genes, indicating that they capture substantially different aspects of regulatory information.
This discrepancy is not unexpected and reflects the fundamental differences between these modalities. scATAC-seq measures chromatin accessibility, which provides a proxy for cis-regulatory potential of genomic regions. In contrast, TF RNA expression reflects downstream transcriptional output, which is shaped by multiple regulatory layers, including post-transcriptional regulation, protein activity, temporal delays, and indirect regulation through intermediate transcriptional or signaling pathways. As a result, these two modalities are expected to capture complementary but not directly comparable aspects of gene regulation.
We acknowledge that scATAC-seq provides valuable complementary information on chromatin accessibility and regulatory potential, and will consider incorporating matched multi-omics data in future work. In the revised manuscript, we further clarify that TSvelo is an RNA velocity method that incorporates prior knowledge from curated TF–target databases, and we have added a discussion on the potential use of scATAC-seq data for future extension of our framework.
(c) The claim that TSvelo can model multi-lineage datasets hinges on its use of PAGA for lineage segmentation, followed by independent modeling of dynamics within each subset. However, the procedure for merging results across subsets remains unclear.
We thank the reviewer for pointing out that the merging step was not sufficiently described. After modeling dynamics independently within each lineage-specific subset, TSvelo integrates the results via a weighted aggregation procedure at the cell level.
For each cell and each inferred quantity (e.g., velocity or other dynamic variables), we collect the estimates obtained from different lineage-specific models and combine them using a weighted average. The weights are defined by the size of each lineage, reflecting its statistical support. We have clarified details about this merging procedure in the Methods section.
This aggregation reconciles multiple lineage-specific estimates for the same cell into a single value and mitigates discontinuities that could arise from directly combining independent lineage analyses. The resulting values define a unified set of dynamics for each cell across lineages.
Reviewer #3 (Public review):
Despite the abundance of RNA velocity tools, there are still major limitations, and there is strong skepticism about the results these methods lead to. In this paper, the authors try to address some limitations of current RNA velocity approaches by proposing a unified framework to jointly infer transcriptional and splicing dynamics. The method is then benchmarked on 6 real datasets against the most popular RNA velocity tools.
While the approach has the potential to be of interest for the field, and may present improvements compared to existing approaches, there are some major limitations that should be addressed, particularly concerning the benchmark (see major comment 1).
Major comments:
(1) My main criticism concerns the benchmarking: real data lack a ground truth, and are absolutely not ideal for comparing methods, because one can only speculate what results appear to be more plausible.
A solid and extensive simulation study, which covers various scenarios and possibly distinct data-generating models, is needed for comparing approaches. The authors should check, for example, the simulation studies in the BayVel approach (Section 4, BayVel: A Bayesian Framework for RNA Velocity Estimation in Single-Cell Transcriptomics). Clearly, all methods should be included in the simulation.
Following your recommendation, we have added the simulation analysis to compare TSvelo with existing RNA velocity approaches. We generated synthetic single-cell RNA velocity datasets using a mechanistic transcriptional dynamics model with one or multiple developmental branches. The system included 200 genes, among which 30 were designated as transcription factors (TFs).
For each branch, we independently sampled a TF–target regulatory matrix W ϵ ℝ<sup>30×200</sup> from a standard normal distribution to simulate distinct GRN structures. Gene expression dynamics were modeled using a coupled ordinary differential equation (ODE) system describing unspliced and spliced RNA abundances:


where u and s denote unspliced and spliced RNA levels, respectively. The transcription rate α was computed as a nonlinear function of TF expression, defined as a weighted sum of spliced TF abundance, followed by clipping to ensure bounded activation.
Each branch is initialized from the same randomly sampled initial condition drawn from a gamma distribution, allowing controlled divergence of trajectories driven solely by branch-specific regulatory programs.
To simulate observed sequencing counts, we introduced technical noise by scaling latent expression levels with cell-specific library sizes drawn from a log-normal distribution. The resulting expression counts were generated using a negative binomial sampling model:

where θ controls over dispersion, with smaller values corresponding to higher noise levels. The final datasets consist of paired unspliced (U) and spliced (S) count matrices with realistic transcriptional stochasticity and branching gene regulatory dynamics. For each branch, cells were further divided into three developmental stages for downstream analysis.
We evaluated TSvelo and those splicing-based RNA velocity approaches on multiple simulated datasets with varying numbers of branches and noise levels. There are one, two or three branches start from the same cell group in these datasets (Branch 1: stage 0 - stage 1 - stage 2. Branch 2: stage 0 - stage 3 - stage 4. Branch 3: stage 0 - stage 5 - stage 6). We primarily assessed performance using the cross-boundary direction correctness (CBDir) metric, as it directly evaluates inferred trajectories against ground-truth cell stage annotations, which have been widely adopted in RNA velocity studies such as VeloAE and UniTvelo. In detail, Cross-boundary direction correctness assesses the accuracy of transitions from a source cluster to a target cluster by examining the boundary cells, and requires ground truth annotations. We directly run the function unitvelo.evaluate() provided in UniTVelo to obtain the Cross-boundary direction correctness. In detail, the CBDir is calculated as follows:

where θ controls over dispersion, with smaller values corresponding to higher noise levels. The final datasets consist of paired unspliced (U) and spliced (S) count matrices with realistic transcriptional stochasticity and branching gene regulatory dynamics. For each branch, cells were further divided into three developmental stages for downstream analysis.
where C<sub>A</sub> denotes the set of cells in the target cluster A, and N(c) represents the neighboring cells of a given cell c v<sub>c</sub> and x<sub>c</sub> denote the low-dimensional velocity and state vectors of cell c, respectively, and x<sub>c’</sub> denotes the state vector of its neighboring cell.
As shown in Figure S2, TSvelo consistently achieves the highest accuracy across all simulation settings, particularly in scenarios with complex branching structures, which pose significant challenges for baseline methods.
(2) Related to the above: since a ground truth is missing, the real data analyses need to be interpreted with caution. I recommend avoiding strong statements, such as "successfully captures the correct gene dynamics", or "accurately infer", in favour of milder statements supported by the data, such as "... aligns with the biological processes described" (as in page 12), or "results are compatible with current biological knowledge", etc...
We thank the reviewer for this helpful comment. We agree that analyses on real datasets should be interpreted with appropriate caution because definitive ground truth is typically unavailable. Following the reviewer’s suggestion, we have revised the wording throughout the manuscript to avoid overly strong claims. For example, statements such as “successfully captures the correct gene dynamics” and “accurately infer” have been replaced with more cautious descriptions such as “consistent with known biological processes”.
(3) Many methods perform RNA velocity analyses. While there is a brief description, I think it'd be useful to have a schematic summary (e.g., via a Table) of the main conceptual, mathematical, and computational characteristics of each approach.
We thank the reviewer for this insightful suggestion. We agree that a structured summary of existing RNA velocity methods would improve clarity and accessibility. We have added a new summary table (Table S1) that systematically compares representative RNA velocity approaches in the supplementary information.
(4) Related to the above: I struggled to identify the main conceptual novelty of TSvelo, compared to existing approaches. I recommend explaining this aspect more extensively.
We thank the reviewer for this insightful comment. We agree that the conceptual novelty of TSvelo can be more clearly articulated.
In the revised manuscript, we have expanded the discussion at the beginning of the Results section to explicitly highlight the key distinctions between TSvelo and existing approaches. Specifically, we now clarify that most existing RNA velocity methods predominantly focus on splicing dynamics and typically operate in a gene-wise manner, without capturing coordinated dynamics across genes. In contrast, TSvelo models the full cascade of transcriptional regulation, transcription, and splicing within a unified framework, and estimates RNA velocity jointly across all genes, thereby capturing their coordinated dynamics at the system level.
(5) A computational benchmark is missing; I'd appreciate seeing the runtime and memory cost of all methods in a couple of datasets.
We thank the reviewer for this helpful suggestion regarding computational benchmarking. In the revised manuscript, we have added a systematic comparison of runtime and GPU memory usage across TSvelo and ba methods using simulated datasets of increasing scale (600, 1200, and 1800 cells) on our NVIDIA GeForce RTX 3090 device with 24 GB memory.
Table S2 shows differences in computational efficiency and resource requirements among methods. Specifically, classical methods such as scVelo and Dynamo exhibit very fast runtimes (10–24 seconds) and do not rely on GPU acceleration, reflecting their relatively lightweight modeling strategies. In contrast, deep learning–based approaches, including UniTVelo, cellDancer, and TSvelo, have higher computational costs due to their increased model complexity.
TSvelo exhibits a stable GPU memory footprint (~1.26 GB) across different dataset sizes, indicating that its memory usage is primarily determined by model architecture rather than the number of cells. This level of memory consumption is well within the capacity of modern GPUs and does not pose practical limitations. In terms of runtime, TSvelo scales approximately linearly with dataset size. The higher computational cost of TSvelo is mainly due to its EM-style optimization procedure, where each M-step also involves multiple optimization updates to infer gene regulatory effects in a global model. This design enables TSvelo to explicitly incorporate regulatory priors and jointly model gene interactions, which is not supported by these baseline methods.
To further improve runtime efficiency, TSvelo allows flexible control of the number of EM iterations. As shown in Figure S16 and Table S3, we evaluated performance under different iteration settings on the simulation dataset. The early stopping strategy employed in the EM framework of TSvelo, which will stop modeling if the loss is not further reduced in the last 3 iterations. Results show that convergence is typically achieved within 3 iterations for this dataset, and increasing the maximum number of iterations beyond this does not further change the results. Notably, even a single iteration already yields competitive performance, likely benefiting from the strong initialization based on unspliced-to-spliced temporal delay.
Overall, these results highlight a trade-off between computational efficiency and modeling expressiveness. While TSvelo is more computationally demanding than classical approaches, it provides a more flexible framework for incorporating regulatory information and capturing complex gene interactions, which we believe justifies the additional computational cost in scenarios requiring accurate dynamical inference.
(6) I think BayVel (mentioned above) should be added to the list of competing methods (both in the text and in the benchmarks). The package can be found here: https://github.com/elenasabbioni/BayVel_pkgJulia.
We thank the reviewer for suggesting BayVel and for providing the repository link. We carefully review the available resources, including both the BayVel_pkgJulia and the BayVel_notebooks, and we appreciate the authors’ efforts in making their code and data publicly available.
We note that BayVel repositories primarily provide scripts and data for reproducing the figures and results reported in their manuscript. However, at present, the available resources do not yet provide a complete guideline or standardized pipeline for applying BayVel to new datasets. To ensure a fair and reproducible comparison, we therefore tend to use BayVel results officially provided by the authors. We are grateful that the BayVel results on the pancreas dataset is released at BayVel_notebooks page: https://github.com/elenasabbioni/BayVel_notebooks/tree/main/real%20data/Pancreas/moments/output.
Based on these results, we conducted comparisons across all methods on the pancreas dataset, with quantitative evaluations shown in Author response image 55. In each plot, methods are ranked in descending order of their mean values. Numbers at the bottom indicate the sample size for each metric. Statistical significance is assessed using a one-sided Mann–Whitney U test, where *****, ***, **, and * denote p < 0.00001, 0.0001 ≤ p < 0.001, 0.001 ≤ p < 0.01, and 0.01 ≤ p < 0.05, respectively.
BayVel has now been included in the Introduction, and corresponding comparisons have been added in the revised manuscript.
Author response image 5.
The quantitative comparison between TSvelo and baseline approaches on the pancreas dataset. In each plot, methods are ranked in descending order of their mean values. Numbers at the bottom indicate the sample size for each metric. Significance is determined using a one-sided Mann–Whitney U test. *****, ****,***, ** and * represent p < 0.00001, 0.00001 ≤ p < 0.0001, 0.0001 ≤ p < 0.001, 0.001 ≤ p < 0.01, and 0.01 ≤ p < 0.05, respectively.

Recommendations for the authors:
Reviewer #1 (Recommendations for the authors):
Please carefully proofread the text. Some typos:
(1) Line 110: differentia -> differential.
(2) Line 280: ".," to be corrected.
(3) Line 566: optimize -> optimizes.
We thank the reviewer for carefully proofreading the manuscript and for pointing out these typographical errors. We have corrected the identified typos in the revised manuscript.
Reviewer #3 (Recommendations for the authors):
(1) Regarding Major Comment 1 in the Public Review, I contacted BayVel authors, who told me that they'll upload all their scripts here within a few days: https://github.com/elenasabbioni/BayVel_notebooks
Thank you very much for reaching out to the BayVel authors. We sincerely appreciate the BayVel authors’ efforts to make their scripts and results publicly available through BayVel_notebooks. We believe this is a valuable contribution that will greatly benefit the community.
We have followed the repository and have now included BayVel in the revised manuscript, with corresponding comparisons added to both the main text and the benchmarking results.
(2) Page 9 mentions "consistency", "coherence", and "correctness". Instead of these qualitative (and potentially subjective) evaluations, I'd appreciate using quantitative metrics or visual descriptions when differences are visually clear.
We thank the reviewer for this insightful comment. The terms “velocity consistency,” “in-cluster coherence,” and “cross-boundary correctness” used in our manuscript are not intended as subjective descriptions. They correspond to commonly used evaluation criteria in this field and have been adopted as quantitative metrics in previous studies, such as VeloAE[1] and UniTVelo[2]. We have incorporated the following updated definition into the Methods section.
(1) Velocity consistency (VCon). We used the scvelo.velocity_confidence() function from scVelo to evaluate velocity consistency, interpreting the results as a measure of how consistent velocities are within neighboring cells. Velocity consistency is especially suitable for evaluating the RNA velocity modeling on single lineage. For each cell , the velocity consistency is calculated as follows:

Where N (c) represents the neighboring cells of a given cell c v<sub>c</sub> v<sub>c’</sub> denote the low-dimensional velocity vectors of cell cand its neighboring cell c’.
(2) Cross-boundary direction correctness (CBDir). Cross-boundary direction correctness assesses the accuracy of transitions from a source cluster to a target cluster by examining the boundary cells, and requires ground truth annotations. We directly run the function unitvelo.evaluate() provided in UniTVelo to obtain the Cross-boundary direction correctness. In detail, the CBDir is calculated as follows:

Where C<sub>A</sub> denotes the set of cells in the target cluster A, and represents the neighboring cells of a given cell c v<sub>c</sub> v<sub>c’</sub> denote the low-dimensional velocity and state vectors of cell cand its neighboring cell c’.
(3) Within-cluster velocity coherence (ICCoh). Within-cluster velocity coherence measures the coherence of velocities within a single cluster using a cosine similarity score between cell velocities. We applied the function unitvelo.evaluate() provided by UniTVelo to directly compute the within-cluster velocity coherence. Using the same notation as defined above, the CBDir is calculated as follows:

(1) Qiao, C. & Huang, Y. Representation learning of RNA velocity reveals robust cell transitions. Proceedings of the National Academy of Sciences 118, e2105859118 (2021).
(2) Gao, M., Qiao, C. & Huang, Y. UniTVelo: temporally unified RNA velocity reinforces single-cell trajectory inference. Nature Communications 13, 6586 (2022).
(3) At page 3, some objects are not defined after formula (3):
ReLU finction, and w_gi
Additionally, parenthesis of ReLU function should be bigger.
We thank the reviewer for pointing this out. In the revised manuscript, we have explicitly defined the ReLU activation function and clarified that w<sub>gi</sub> represents the regulatory weight of TF i on the target gene g. In addition, we have adjusted the formatting of Eq. (3) by enlarging the parentheses in the ReLU function to improve readability.