Reviewer #2 (Public review):
Summary:
This paper presents a very interesting use of a causal graph framework to identify the "root genes" of a disease phenotype. Root genes are the genes that cause a cascade of events that ultimately leads to the disease phenotype, assuming the disease progression is linear.
Strengths:
- The methodology has a solid theoretical background.
- This is a novel use of the causal graph framework to infer root causes in a graph.
Weaknesses:
(1) General Comments
First, some general comments. I would argue that the main premise of the study might be inaccurate or incomplete. There are three major attributes of real biological systems that are not considered in this work.
One is that the process from health to disease is rarely linear: many checkpoints along the way act to prevent the disease phenotype, making the path from health to disease non-deterministic. In other words, with the same root-gene perturbations, and depending on factors outside of gene expression, one person may develop a phenotype in a year, another in ten years, and someone else never. Claiming that this information is captured in the error terms may not be sufficient to address this issue. The authors should discuss this limitation.
Two, the paper assumes that the network connectivity remains the same after perturbation. This is not always true due to backup mechanisms in the cell. For example, suppose that a cell wants to create product P and can do so through two alternative paths:
Path #1: A -> B -> P
Path #2: A -> C -> P
Now suppose that path #1 is more efficient, so while B can be produced, path #2 is inactive. Once a perturbation blocks B from being produced, the graph connectivity changes through activation of path #2. I did not see the authors take this into consideration, which seems to be a major limitation of using Perturb-seq results to infer connectivities.
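To make this concern concrete, here is a toy simulation (my own illustrative sketch; all variable names, noise levels, and the switching rule are assumptions, not the authors' model). Before the knockout, C is conditionally independent of P given the shared cause A; after B is knocked out, path #2 activates and C becomes a direct parent of P, so the effective connectivity changes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

def simulate(knockout_B=False):
    # A drives both B and C; the cell prefers the efficient path
    # A -> B -> P, falling back to A -> C -> P only when B is absent.
    A = rng.normal(1.0, 0.1, n)
    B = np.zeros(n) if knockout_B else A + rng.normal(0, 0.05, n)
    C = A + rng.normal(0, 0.05, n)
    P = np.where(B > 0.1, B, C) + rng.normal(0, 0.05, n)  # path switching
    return A, C, P

def partial_corr(x, y, z):
    # Correlation of x and y after regressing out the shared cause z.
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

A0, C0, P0 = simulate(knockout_B=False)
A1, C1, P1 = simulate(knockout_B=True)

# Before the knockout, C carries no information about P beyond A;
# after the knockout, C directly drives P.
print(round(partial_corr(C0, P0, A0), 2))  # near 0
print(round(partial_corr(C1, P1, A1), 2))  # clearly positive
```

Any method that pools unperturbed and perturbed cells under one fixed graph will mix these two regimes.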
Three, there is substantial system heterogeneity that may cause the same phenotype. This goes beyond the authors' claim that, although the initial gene causes of a disease may differ from person to person, at some point they will all converge to changes in the same set of "root genes". This is not true for many diseases, which are defined based on symptoms and lab tests at the patient level. Two completely different molecular pathologies can lead to the development of the same symptoms and test results; breast cancer with its subtypes is a prime example. In theory, this issue could be addressed with an infinite sample size, but that assumption is violated in essentially all existing biological datasets.
All of the above limit the usefulness of this method for most chronic diseases, although it may still lead to interesting discoveries in cancer, where the association between gene dysregulation and disease development is more direct and occurs over a shorter time.
With these in mind, the theoretical and algorithmic advances this paper offers are interesting. And the theoretical proofs are solid.
(2) Specific comments
I am curious how the simulated data were generated and processed. Specifically, were the values of the synthetic variables Z-scored? If not, I would expect the variance of every variable to increase from the roots of the graph to the leaves, which would give an advantage to any algorithm that identifies causal relations based on error terms. For fairness and completeness, the authors should Z-score the values in the synthetic data and compare the results.
The algorithm seems to require both RNA-seq and Perturb-seq data (Algorithm 1, page 14). Can it function with RNA-seq data only? What will be different in this case?
(3) Additional comments:
Although the manuscript is generally written clearly, some parts are unclear and others are missing details that make the narrative difficult to follow. Some specific examples:
- Synthetic data generation: how many different graphs (SEMs) did the authors start from? (30?) How many samples per graph? Did they test different sample sizes?
- The presentation of the comparative results (Suppl. figs. 4 and 7) is not clear. No details are given on how these results were generated. (What does "The first column denotes the standard deviation of the outputs for each algorithm" mean?) Why do all other methods have higher SD differences than RCSP? Is it a matter of scaling? Shouldn't they have at least some values near zero, since the authors "added the minimum value so that all histograms begin at zero"? Also, why do the RCSP results resemble a negative binomial distribution while every other method's look roughly normal?
- What is the significance of genes changing expression "from left to right" in a UMAP plot? (e.g., Fig. 3h and 3g)
The authors somewhat overstate the novelty of their algorithm. Representation of GRNs as causal graphs dates back to 2000 with the work of Nir Friedman in yeast. Other methods developed more recently examine regulatory network changes at the single-sample level, of which the authors do not seem to be aware (e.g., Ellington et al., NeurIPS 2023 workshop GenBio, and Bushur et al., 2019, Bioinformatics are two such examples). The methods they mention are for single-cell data and are not designed to connect single-sample-level changes to a person's phenotype. The RCS method needs to be put in the right background context in order to bring out what is really novel about it.