On 2016 Mar 13, Tamir Tuller commented:
We (and everyone) know that a correlation computed on binned data may be misleading if reported in a non-transparent way; specifically, binning tends to increase the correlation (but usually has a much weaker effect on the p-value). However, we frankly do not understand why this ‘lecturing’ on the topic appears here (next to our study). Our study includes various sophisticated statistical tests; the binning procedure is described clearly at the beginning; the correlation for various bin sizes is also reported (one can learn about the relation between binning and correlation simply by looking at Figure S5, without the need for this unnecessary correspondence :-)); etc. Thus, we believe that the nature of the signal and the challenging data should be very clear to readers who read the paper thoroughly (though we guess that it may be misleading, as any other paper would be, if you do not bother to read all the details :-)).
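For readers who want to see the general effect of binning on a correlation coefficient, here is a minimal sketch on synthetic data. It is not the analysis or the data from the study: the linear model, the slope of 0.05, the sample size, and the helper names `pearson` and `binned_means` are all assumptions chosen only for illustration. Points are sorted by x, grouped into equal-size bins, and the bin means are correlated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def binned_means(xs, ys, bin_size):
    """Sort points by x, group them into consecutive bins of `bin_size`
    points, and return the per-bin means of x and y.
    Leftover points that do not fill a whole bin are dropped."""
    pairs = sorted(zip(xs, ys))
    bx, by = [], []
    for i in range(0, len(pairs) - bin_size + 1, bin_size):
        chunk = pairs[i:i + bin_size]
        bx.append(sum(p[0] for p in chunk) / bin_size)
        by.append(sum(p[1] for p in chunk) / bin_size)
    return bx, by


# Synthetic data (assumption): a weak linear signal buried in unit noise.
rng = random.Random(0)
n = 30000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.05 * x + rng.gauss(0, 1) for x in xs]

# Correlation of bin means for several bin sizes (bin size 1 = raw data).
results = {}
for bin_size in (1, 10, 100, 1000):
    bx, by = binned_means(xs, ys, bin_size)
    results[bin_size] = pearson(bx, by)
    print(f"bin size {bin_size:>4}: {len(bx):>5} points, r = {results[bin_size]:.3f}")
```

In this toy setting, averaging within bins removes much of the per-point noise while the trend across bins remains, so the coefficient on bin means is far larger than on the raw points even though the underlying relation is unchanged. The sketch does not compute p-values.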
The statistical analysis in papers in our field (if it is performed accurately) should consider various aspects, including non-trivial biases in the data, discretization, various confounding variables/explanations, various aspects of molecular evolution, huge datasets, etc. Thus, the reader, and not only the author, should consider them when evaluating the results; specifically, the strength of a correlation should be evaluated in the light of all these aspects. The aim of mentioning other papers and ‘top statisticians’ was to demonstrate that there are many people (as opposed to Plotkin/Shah/Cherry) who do understand this point.
If the number of points in a typical systems biology study is ~300, the number of points analyzed in our study is 1,230,000-fold higher (!); a priori, a researcher with some minimal experience in the field should not expect to see similar levels of correlation in the two cases. Everyone also knows that increasing the number of points, specifically when dealing with non-trivial NGS data, tends to decrease the correlation very significantly. The aim of the binning was to align our signal with previously reported signals in the field (in terms of the number of points), and, as mentioned, the paper includes many other analyses that give the reader a greater context for the signal (including an explicit graph reporting the relation between bin size and the correlation); in addition, the non-binned correlation (0.02-0.07) is comparable to the level of correlation between two Hi-C measurements (~0.05) from different labs (!). It is clear that a typical signal in our field (e.g. higher than the correlation of 0.12 or even the “high” 0.38 mentioned in your paper Weinberg DE, 2016), if transferred via such a noisy/biased ‘channel’ with an increased number of points, will be orders of magnitude lower than the correlation in our non-binned data.
We, of course, do not expect that further back-and-forth will convince Plotkin and Shah of our points. But hopefully this exchange will at least have some value to the field for scientists who work to draw inferences from genomic datasets; specifically, we hope that other scientists will learn to thoroughly consider all the aspects mentioned above and below when reading/writing a scientific paper.
BTW, regarding the correlation of 0.12 that was improved to 0.38 in the new study (Weinberg DE, 2016): you still did not perform many of the required statistical controls (among others, controls for Kozak sequence and AA bias) according to our review [http://www.cs.tau.ac.il/~tamirtul/Shah_et_al_review.pdf].
Tamir Tuller & Alon Diament, Tel-Aviv University, March 13, 2016
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.