1. Last 7 days
    1. Abstract. Background: Single-cell RNA-seq suffers from unwanted technical variation between cells, caused by its complex experiments and shallow sequencing depths. Many conventional normalization methods try to remove this variation by calculating the relative gene expression per cell. However, their choice of the Maximum Likelihood estimator is not ideal for this application. Results: We present GTestimate, a new normalization method based on the Good-Turing estimator, which improves upon conventional normalization methods by accounting for unobserved genes. To validate GTestimate we developed a novel cell targeted PCR-amplification approach (cta-seq), which enables ultra-deep sequencing of single cells. Based on this data we show that the Good-Turing estimator improves relative gene expression estimation and cell-cell distance estimation. Finally, we use GTestimate’s compatibility with Seurat workflows to explore three common example data-sets and show how it can improve downstream results. Conclusion: By choosing a more suitable estimator for the relative gene expression per cell, we were able to improve scRNA-seq normalization, with potentially large implications for downstream results. GTestimate is available as an easy-to-use R-package and compatible with a variety of workflows, which should enable widespread adoption.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf084), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Amichai Painsky

      This paper introduces a Good-Turing (GT) estimation scheme for relative gene expression estimation and cell-cell distance estimation. The proposed method, GTestimate, claims to improve upon conventional normalization methods by accounting for unobserved genes. The idea behind this contribution is fairly straightforward: since the relative gene expression is defined over a large alphabet, a GT estimator is expected to perform better than a naive ML approach. However, I am not convinced that the authors applied it correctly. First, the proposed GT estimator (as it appears in (GT) in the text) assigns a zero estimate to unobserved genes (Cg = 0). This contradicts the entire essence of using a GT estimator. Second, it makes no sense to use this expression for every Cg > 0. In fact, any reasonable GT-based estimator applies GT for relatively small Cg and the ML estimator for large Cg. See [1] for a thorough discussion. The choice of a threshold between "small" and "large" Cg is the subject of many studies (for example [2], [1]), but it makes no sense to use the above expression for any Cg. Finally, notice that if N_{Cg} > 0 for some g but N_{Cg+1} = 0, the proposed estimator is not defined. There exist several smoothing solutions for such cases (for example [3]), but they need to be properly discussed. To conclude, I am not sure what effect these issues have on the experiments in the paper, which makes it difficult to assess the results.

      REFERENCES

      [1] A. Painsky, "Convergence guarantees for the Good-Turing estimator," Journal of Machine Learning Research, vol. 23, no. 279, pp. 1-37, 2022.

      [2] E. Drukh and Y. Mansour, "Concentration bounds for unigram language models," Journal of Machine Learning Research, vol. 6, no. 8, 2005.

      [3] W. A. Gale and G. Sampson, "Good-Turing frequency estimation without tears," Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217-237, 1995.
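
The reviewer's three points can be made concrete with a small sketch. It assumes the textbook unsmoothed Good-Turing form, (c + 1) · N_{c+1} / (N_c · N) for a gene with count c in a cell with total count N, where N_c is the number of genes with count c. This is an illustration only, not the GTestimate implementation; the function names, the toy count vector, and the threshold value are hypothetical, and the hybrid variant is not renormalized.

```python
from collections import Counter


def ml_estimate(counts):
    """Maximum-likelihood relative expression: c / N for each gene."""
    total = sum(counts)
    return [c / total for c in counts]


def good_turing_estimate(counts, threshold=None):
    """Unsmoothed Good-Turing relative expression for each gene.

    Observed genes (c > 0) get (c + 1) * N_{c+1} / (N_c * N).
    Unobserved genes (c == 0) get 0 here; the mass they share is
    estimated separately by missing_mass().
    If threshold is given (an illustrative assumption), genes with
    c > threshold fall back to the ML estimate c / N.
    """
    total = sum(counts)
    freq = Counter(counts)  # freq[c] is N_c; missing keys count as 0
    estimates = []
    for c in counts:
        if c == 0:
            estimates.append(0.0)
        elif threshold is not None and c > threshold:
            estimates.append(c / total)
        else:
            estimates.append((c + 1) * freq[c + 1] / (freq[c] * total))
    return estimates


def missing_mass(counts):
    """Good-Turing estimate of the total mass of unobserved genes: N_1 / N."""
    total = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return singletons / total


if __name__ == "__main__":
    # Toy cell: two unobserved genes, three singletons, one highly expressed gene.
    toy_counts = [0, 0, 1, 1, 1, 2, 2, 3, 10]
    print("ML:          ", ml_estimate(toy_counts))
    print("GT (pure):   ", good_turing_estimate(toy_counts))
    print("GT (c <= 2): ", good_turing_estimate(toy_counts, threshold=2))
    print("missing mass:", missing_mass(toy_counts))
```

In the toy cell, the gene with count 10 has no neighbour with count 11 (N_{11} = 0), so the unsmoothed formula collapses to zero for the most highly expressed gene, which is the breakdown that smoothing schemes such as the one in [3] are designed to avoid. With the threshold set to 2, that gene falls back to the ML value. The unobserved genes receive zero individually in both variants, although the estimated missing mass N_1 / N is positive.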