Reviewer #1 (Public Review):
Einarsson et al. have produced CAGE data from EBV-immortalised lymphoblastoid cells from more than a hundred individuals from two genetically diverse African populations (YRI and LWK), and used it to study how sequence variation affects promoter activity across individuals, both at the level of expression variability and at the level of transcription start site usage within promoters.
The dataset is very exciting, and the analyses were performed carefully and described well. The results show that promoters in the genome vary a lot with respect to their expression variability across individuals and that their level of variability is closely associated with their biological function and their sequence and architectural features. These results are often confirmatory - it is well established that promoters have different architectures associated with different sequence elements, different types of gene regulation and even differences across individual cells. In general, the multifarious observations boil down to one key distinction:
- Regulated genes have promoters that look and act differently from those of housekeeping genes.
While this is unsurprising, the authors then proceed to analyse other underlying differences between low variability (mostly housekeeping) and high variability (overwhelmingly regulated) promoters. Several observations have alternative and sometimes more elegant explanations if some of the previously worked out properties of housekeeping vs regulated promoters are taken into consideration:
- The authors are keen to interpret the architectural features of ubiquitously expressed (housekeeping) promoters as selected for robustness against mutations in ensuring stable and steady expression levels. However, there are some known facts about both housekeeping and regulated promoters that make alternative interpretations plausible.
- When discussing broad promoters, the authors disregard the well-known fact that the most commonly used transcription start positions are those with a YR (pyrimidine-purine) dinucleotide at the (-1,+1) position. Any mutation within the span of a broad promoter cluster that removes an existing YR or introduces a new one has the capacity to change both the TSS distribution pattern and the overall level of expression of that promoter - but only slightly. This way, broad promoters can be viewed as an adaptation not for robustness but for the ability to take many mutations of small effect size that will drive any _positive_ selection smoothly across a changing fitness landscape.
- Indeed, the main property of low variability promoters is that there isn't a single nucleotide change (either substitution or indel) that can substantially change their activity. (In that they are clearly different from e.g. TATA-dependent promoters, where one change can abolish TBP binding or deprive the promoter of a YR dinucleotide at a suitable distance from the TATA box.) This is achieved by their dependence on broad and weak sequence signatures such as GC composition and nucleosome positioning signals. However, most such genes are not known to have a strict requirement for dosage control. On the contrary, dosage seems to be much more critical for the functional classes that show variable expression in the authors' analysis.
- Whether it is the removal of a YR dinucleotide, the introduction of a new one, or a change in nucleosome positioning, it seems that the transcription level from housekeeping, low variability promoters is unaffected, or at least affected mildly enough that detecting the difference is beyond the statistical power of the CAGE data across different individuals. Rather than as robustness, this can be interpreted as competition - the architecture recruits the preinitiation complex at a fairly constant rate, and it is the different YR positions that "compete" to serve as the transcription initiation position, with the CAGE signal reflecting the relative effectiveness of each position in that competition. If one of the YR dinucleotides is removed, the neighbouring ones will often be used instead. The same might happen for potential multiple nucleosome positioning signals - if one becomes less efficient at stopping a nucleosome, another will be used more often.
- The fact that decomposed parts of housekeeping promoters add up to approximately the same expression level across individuals, even when they are reported as uncorrelated, suggests that they might actually be anticorrelated - indeed, in the UFSP2 plot in Figure 4E the two decomposed promoters look anticorrelated. That would argue against the independence of the decomposed promoters - it may again point to "competition", where a decrease in the use of one simply shifts most initiation events to the other.
- In general, not everything is a result of direct evolutionary selection, and whatever is should show clear hallmarks of purifying selection. On the contrary, promoters, especially housekeeping promoters, have vastly different nucleotide and dinucleotide compositions across Metazoa, both at large and at relatively short distances. This means that they can undergo concerted evolution as a group, and therefore that they should be "robust" to mutations in a way that allows them to change much more, and more rapidly, than some other promoter architectures - especially TATA-dependent architectures, whose key elements and the spacing between them have not substantially changed for more than a billion years, and possibly longer.
- While housekeeping promoters are broad but mostly not among the broadest, regulated promoters can be either broad or narrow. This is also known - while narrow promoters are overrepresented among tissue-specific and non-CGI promoters, promoters of Polycomb-bound developmental genes are often broad and have large CpG islands; the latter may account for some of the broadest CAGE clusters observed in the data. It would be an interesting finding if both TATA-dependent and developmental promoters were found to be variable across individuals in a non-trivial way (the trivial way being variability due to the larger dynamic range of their expression - e.g. the expression of SIX3 in many cell types is basically zero, while the dynamic range of RPL26L1 is very limited) - this should be checked by analysing the two classes separately.
- While broad promoters can be decomposed into subclusters with differential expression across individuals, the authors do not seem to allow for the decomposition of intertwined TSS positions within the cluster, but rather postulate hard boundaries between subclusters. This is different from e.g. overlapping maternal and zygotic promoter use (Haberle et al., Nature 2014), where the distributions of the used TSS positions differ but the clusters can overlap.
- Both Dreos et al. (PLOS Comp Biol 2016) and Haberle et al. (2014) show that one stable element of a broad promoter is the positioning signal of its first downstream nucleosome. As seen very convincingly in both Drosophila and zebrafish, the dominant TSS position of the broad promoter is highly predictive of the position of the first downstream nucleosome and its underlying positioning sequence, and the most plausible interpretation is that there is an "optimal" distance from the nucleosome for transcriptional initiation, resulting in the dominant (i.e. most often used) TSS position. In mammals, broad promoters are even broader than in those two species and might have multiple nucleosome positioning signals they can use. In such cases, mutations in one of the nucleosome positioning signals, or indels changing the spacing between the nucleosome and the part of the sequence that contains the TSS, might lead to differential use of one nucleosome signal vs. the other. This would be compatible with the authors' observation that, in low variability promoters, decomposed promoters are used to different extents in different individuals.
- If we were to look for sources of difference other than the actual sequence architecture, some differences between regulated and unregulated promoters can be explained by one key difference: the regulation of regulated genes comes from outside the core promoter, whereas the expression of housekeeping genes is largely dependent on the intrinsic activity of the core promoter itself. This way, for example, in the absence of a causative variant in the promoter itself, the observed variability in the SIX3 promoter might not be encoded in the promoter - instead, enhancer responsiveness might be encoded in the promoter, and the variability itself could be due to an enhancer that can be hundreds of kilobases away. Such a scenario, combined with a broad promoter, would likely result in decomposed promoters that are highly correlated across individuals - because they are both externally controlled by the same regulatory inputs.
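To make the YR argument above concrete, the following is a minimal sketch of how one might list candidate initiation positions in a broad promoter and see which of them a single-nucleotide variant removes or creates. The sequence, the variant positions and the function names are all hypothetical illustrations; this is not the authors' pipeline, and it ignores everything except the (-1,+1) dinucleotide.

```python
def yr_tss_positions(seq):
    """Return 0-based indices i where seq[i-1] is a pyrimidine (C/T) and
    seq[i] is a purine (A/G), i.e. i is a candidate +1 position."""
    seq = seq.upper()
    return [i for i in range(1, len(seq))
            if seq[i - 1] in "CT" and seq[i] in "AG"]


def yr_change(seq, pos, alt):
    """Apply a substitution at 0-based `pos` and return (lost, gained)
    candidate +1 positions relative to the reference sequence."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    before = set(yr_tss_positions(seq))
    after = set(yr_tss_positions(mutated))
    return sorted(before - after), sorted(after - before)


# Hypothetical 16-bp stretch of a broad promoter (not taken from the data).
ref = "GGCACGTGCCATTGCG"

print(yr_tss_positions(ref))     # candidate +1 positions in the reference
print(yr_change(ref, 10, "C"))   # a variant that only removes a YR
print(yr_change(ref, 12, "A"))   # a variant that shifts a YR by one base
```

In this toy case each variant touches at most one or two of the six candidate positions, which is the sense in which a single mutation in a broad promoter is expected to redistribute initiation only slightly rather than abolish it.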
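The "competition" interpretation of decomposed promoters also makes a simple, testable prediction. Since Var(A+B) = Var(A) + Var(B) + 2 Cov(A,B), two subclusters whose summed expression is nearly constant across individuals must have negative covariance. The sketch below illustrates this with invented numbers: a fixed total (an assumption standing in for a constant preinitiation complex recruitment rate) is split between two hypothetical subclusters in proportions that differ per individual. None of these values come from the manuscript.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Assumption: total initiation from the promoter is the same in everyone.
total = 100.0
# Hypothetical fraction of initiation events going to subcluster A,
# one value per individual.
frac_a = [0.30, 0.45, 0.55, 0.40, 0.65, 0.50]

sub_a = [total * f for f in frac_a]
sub_b = [total * (1.0 - f) for f in frac_a]

r = pearson(sub_a, sub_b)
print(r)  # close to -1 under pure competition with a fixed total
```

With real data the correlation would not be this extreme, but a clearly negative correlation between decomposed promoters whose sum is stable would favour competition over independence, and it would be worth reporting this correlation for all decomposed low variability promoters.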