Reviewer #2 (Public review):
Summary of goals:
Untranslated regions are key cis-regulatory elements that control mRNA stability, translation, and translocation. Through interactions with small RNAs and RNA binding proteins, UTRs form complex transcriptional circuitry that allows cells to fine-tune gene expression. Functional annotation of UTR variants has been very limited, and improvements could offer insights into disease-relevant regulatory mechanisms. The goals were to advance our understanding of the determinants of UTR regulatory elements and characterize the effects of a set of "disease-relevant" UTR variants.
Strengths:
The use of a massively parallel reporter assay allowed for analysis of a substantial set (6,555 pairs) of 5' and 3' UTR fragments compiled from known disease-associated variants. Two cell types were used.
The findings confirm previous work about the importance of AREs, which helps show validity and adds some detailed comparisons of specific AU-rich motif effects in these two cell types.
Using a Lasso regression, TA-dinucleotide content is identified as a strong regulator of RNA stability in a context-dependent manner based on GC content and presence of RNA binding protein binding motifs. The findings have potential importance, drawing attention to a UTR feature that is not well characterized.
The use of complementary datasets, including from half-life analyses of RNAs and from random sequence library MPRAs, is a useful addition and supports several important findings. The finding that TA dinucleotides have explanatory power separate from (and in some cases interacting with) GC content is valuable.
The functional enrichment analysis suggests some new ideas about how UTRs may contribute to regulation of certain classes of genes.
Weaknesses:
In this section, original reviewer comments about the initial submission and the responses of the authors are listed together with new reviewer responses to the authors:
Reviewer original comment 1: It is difficult to understand how the calculations for half-life were performed. The sequencing approach measures the relative frequency of each sequence at each time point (less stable sequences become relatively less frequent after time 0, whereas more stable sequences become relatively more frequent after time 0). Since there is no discussion of whether the abundance of the transfected RNA population is referenced to some external standard (e.g., housekeeping RNAs), it is not clear how absolute (rather than relative) half-lives were determined.
Author response: [The authors showed the equations used to calculate half lives based on read counts.] They stated that "The absolute abundance was not required for the half-life calculation."
Reviewer response to authors: The methods section states that DESeq2 was used to normalize read counts. DESeq2 normalization assumes that levels of most RNAs are not different between samples. That assumption is not valid here, since RNAs in the library are introduced into cells at time 0 and all RNAs decrease over time. If DESeq2 is applied without modification to normalize across timepoints, normalized reads from less stable RNAs will decrease over time (as expected) but normalized reads from more stable RNAs will increase. Can the authors please clarify in the methods how the read counts were normalized to account for this issue?
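The concern can be illustrated with a toy calculation. This is a minimal sketch, not the authors' pipeline: it assumes first-order decay, three hypothetical sequences with made-up half-lives, and simple total-count scaling as a stand-in for any normalization that holds the overall library signal constant across time points (DESeq2's median-of-ratios method differs in detail).

```python
def remaining(half_life_min, t_min):
    """Fraction of an RNA remaining after t_min under first-order decay."""
    return 0.5 ** (t_min / half_life_min)

# Hypothetical library: equal starting abundance, half-lives in minutes.
half_lives = {"unstable": 30.0, "typical": 60.0, "stable": 600.0}
timepoints = [0, 30, 75, 120]

def relative_abundance(t_min):
    """Share of the library contributed by each sequence at time t_min.

    Dividing by the total mimics a normalization that assumes the
    overall RNA pool is unchanged between samples (an assumption here).
    """
    absolute = {name: remaining(hl, t_min) for name, hl in half_lives.items()}
    total = sum(absolute.values())
    return {name: value / total for name, value in absolute.items()}

for t in timepoints:
    shares = relative_abundance(t)
    print(t, {name: round(share, 3) for name, share in shares.items()})
```

In this toy library every sequence decays in absolute terms, yet the normalized share of the stable sequence rises over time while the share of the unstable sequence falls. A decay constant fitted to such normalized values would be negative for the stable sequence unless an external reference is used.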
Reviewer original comment 2: Fig. S1A and B are used to assess reproducibility. They show that read counts at a given time point correlate well across replicate experiments. However, this is not a good way to assess the reproducibility or accuracy of the t1/2 measurements. (The major source of variability in read counts in these plots - especially at early time points - is likely starting abundance of each RNA sequence, not stability.) This creates concerns about how well the method is measuring t1/2. Also creating concern is the observation that many RNAs are associated with half-lives that are much longer than the time points analyzed in the study. For example, assuming we are reading Figure S1 and Table S1 correctly, the median t1/2 for the 5' UTR library in HEK cells appears to be >700 minutes. Given that RNA was collected at 30, 75, and 120 minutes, accurate measurements of RNAs with such long half-lives would seem to be very difficult.
Author response: ... The calculation of the half-life involves first determining the decay constant 𝜆, which represents a constant rate of decay. Since 𝜆 is a constant, it is possible to accurately calculate it without needing data over the entire decay range. Our experimental design considers this by selecting appropriate time points to ensure a reliable estimation of 𝜆, and thus, the half-life. To determine the most suitable time points, we conducted preliminary experiments using RT-PCR. These experiments indicated that 30, 75, and 120 minutes provided an effective range for capturing the decay dynamics of the transcripts.
Reviewer response to author comments: Based on Fig. S1D, for 3' UTRs in both cell types and for 5' UTRs in SH-SY5Y cells, median t1/2 is in the range of ~30 to 90 minutes (corresponding to ln t1/2 = 3.5 to 4.5). Measuring RNAs at 30, 75, and 120 minutes would therefore be a good choice for these cases. However, median t1/2 for 5' UTRs in HEK cells appears to be ~600 minutes (corresponding to ln t1/2 ~6.4). For a t1/2 of 600 minutes, RNA levels at the final time point (120 minutes) would be ~90% of those at the first time point (30 minutes), which illustrates why the method would need to reliably capture very small changes in RNA abundance to accurately measure t1/2 for transcripts with half-lives much longer than 120 minutes. As suggested in our original review, this concern could be addressed by showing the correlation of half-lives across replicates for the 5' and 3' UTR libraries in both cell types. Alternatively, the authors could show other measures of reproducibility for the half-life measurements across replicates. This requires no additional experimentation and can be done using the data from replicate runs shown in Fig. S1A and B. We remain concerned that for sequences with very long half-lives, extrapolating the half-life from small changes between 30 and 120 minutes will lead to imprecise measurements.
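The arithmetic behind the ~90% figure, and the sensitivity it implies, can be sketched as follows. The sketch assumes simple first-order decay and a hypothetical 5% error in the measured abundance ratio; the function names are illustrative and do not come from the manuscript.

```python
import math

def fraction_remaining(half_life_min, elapsed_min):
    """Abundance after elapsed_min relative to the starting abundance."""
    return 0.5 ** (elapsed_min / half_life_min)

def half_life_from_ratio(ratio, elapsed_min):
    """Invert first-order decay: half-life implied by an observed ratio (< 1)."""
    return elapsed_min * math.log(2) / -math.log(ratio)

elapsed = 120 - 30  # minutes between the first and last time points

for true_half_life in (60.0, 600.0):
    ratio = fraction_remaining(true_half_life, elapsed)
    # Hypothetical +/-5% error in the measured ratio.
    low = half_life_from_ratio(ratio * 0.95, elapsed)
    high = half_life_from_ratio(ratio * 1.05, elapsed)
    print(f"t1/2={true_half_life:.0f} min: ratio={ratio:.3f}, "
          f"estimated range {low:.0f} to {high:.0f} min")
```

Under these assumptions, a 5% error in the ratio moves the estimate for a 60-minute half-life by only a few minutes, whereas for a 600-minute half-life the same error shifts the estimate by hundreds of minutes.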
Reviewer original comment 3: There is no direct comparison of t1/2 between the two cell types for the full set of sequences studied. This would be helpful in understanding whether the regulatory effects of UTRs are generally similar across cell lines (as has been shown in some previous studies) or whether there are fundamental differences. The distribution of t1/2's is clearly quite different in the two cell lines, but it is important to know if this reflects generally slow RNA turnover in HEK cells or whether there are a large number of sequence-specific effects on stability between cell lines. A related issue is that it is not clear whether the relatively small number of significant variant effects detected in HEK cells versus SH-SY5Y cells is attributable to real biological differences between cell types or to technical issues (many fewer read counts and much longer half-lives in HEK cells).
Author response: For both cell lines, we selected oligonucleotides with R2 > 0.5 and mean squared error (MSE) < 1 for analysis when estimating half-life (λ) by linear regression. This selection criterion was implemented to minimize the effect of experimental noise. After quality control, we selected common UTRs and compared the RNA half-lives of the two cell lines using a scatter plot. The figure below shows that RNA half-lives are quite different between the cell lines, with a moderate similarity observed in the 5' UTRs (R = 0.21), while the correlation in the 3' UTRs is non-significant. Despite the low correlation of mRNA half-life between the two cell lines, UA-dinucleotide and UA-rich sequences consistently emerge as the most significant destabilizing features, suggesting a shared regulatory mechanism across diverse cellular environments.
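For readers unfamiliar with this type of filter, the following is a minimal sketch of a log-linear decay fit followed by an R2/MSE quality check. It is not the authors' code: the exact equations are not reproduced in this review, so the choice to fit ln(abundance) against time by ordinary least squares and to compute MSE on the log scale are assumptions, as are the function names.

```python
import math

def fit_decay(times, abundances):
    """Least-squares fit of ln(abundance) = intercept - lam * t."""
    ys = [math.log(a) for a in abundances]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [y - (intercept + slope * t) for t, y in zip(times, ys)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
    mse = ss_res / n  # assumption: MSE taken on the log scale
    lam = -slope      # decay constant
    half_life = math.log(2) / lam if lam > 0 else math.inf
    return {"lam": lam, "half_life": half_life, "r2": r2, "mse": mse}

def passes_qc(fit, r2_min=0.5, mse_max=1.0):
    """Thresholds quoted in the author response: R2 > 0.5 and MSE < 1."""
    return fit["r2"] > r2_min and fit["mse"] < mse_max

times = [30, 75, 120]
clean = fit_decay(times, [100 * 0.5 ** (t / 60) for t in times])
noisy = fit_decay(times, [100, 120, 95])
print(clean["half_life"], passes_qc(clean))
print(noisy["half_life"], passes_qc(noisy))
```

With only three time points, a slowly decaying sequence whose true change is small relative to measurement noise will tend to fail the R2 threshold, so a filter of this kind may preferentially remove long-lived sequences from the cross-cell-line comparison.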
Reviewer response to author comments: We appreciate that the authors shared this additional analysis of the data. We believe that this is an important finding and that the additional figure showing correlations of half-lives across cell types should be included in the manuscript or supplement. Discussion of this result in the manuscript would also be useful for readers. This result is surprising to us since we would have expected that widely expressed RNA-binding proteins would have led to more similar effects between the two cell types, as previously found using other approaches (e.g., studies of 3' UTR effects in MPRAs). It would also be appropriate to discuss that differences seen between the two cell types indicate that caution is warranted when trying to generalize the results of this study to other cell types.
Reviewer original comment 4 has been addressed adequately in the revised manuscript.
Appraisal and impact:
Reviewer original comment 1: The work adds to existing studies that previously identified sequence features, including AREs and other RNA binding protein motifs, that regulate stability and puts a new emphasis on the role of "TA" (better "UA") dinucleotides. It is not clear how potential problems with the RNA stability measurements discussed above might influence the overall conclusions, which may limit the impact unless these can be addressed.
It is difficult to understand whether the importance of TA dinucleotides is best explained by their occurrence in a related set of longer RBP binding motifs (see Fig 5J, these motifs may be encompassed by the "WWWWWW cluster") or whether some other explanation applies. Further discussion of this would be helpful. Does the LASSO method tend to collapse a more diverse set of longer motifs that are each relatively rare compared to the dinucleotide? It remains unclear whether TA dinucleotides are associated with lower stability independent of the presence of the known larger WWWWWW motif. As noted above, the importance of TA dinucleotides in the HEK experiments appears to be less than is implied in the text.
Author response: To ensure the representativeness of the features entered into the LASSO model, we pre-selected those with an occurrence greater than 10% among all UTRs. There is no evidence to support a preference for dinucleotides by LASSO. To address whether the destabilizing effect of UA dinucleotides is part of the broader WWWWWW motif, we divided UA dinucleotides into two groups: those within the WWWWWW motif and those outside of it. Specifically, we divided UTRs into two categories: 'at least one UA within a WWWWWW motif' and 'no UA within a WWWWWW motif,' and visualized the results using a boxplot. As shown in [figures provided to the reviewers], the destabilizing trend still remains for UA dinucleotides outside of the WWWWWW motif, although the effect appears to be more pronounced when UA is within the WWWWWW motif. This suggests that while UA dinucleotides have a destabilizing effect independently, their impact is amplified when they are part of the broader WWWWWW motif.
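The grouping described above can be sketched as a simple sequence classifier. This is an illustrative sketch only: it assumes that W follows the IUPAC convention (A or U), that the WWWWWW motif means a run of six or more consecutive W bases, and it adds a third "no UA" label that is not part of the authors' two categories. The authors' actual motif definition may differ.

```python
import re

# Assumption: WWWWWW = six or more consecutive A/U bases (IUPAC W).
W_RUN = re.compile(r"[AU]{6,}")

def classify_utr(seq):
    """Label a UTR by where its UA dinucleotides fall relative to W runs."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA alphabets
    if "UA" not in seq:
        return "no UA"
    for run in W_RUN.finditer(seq):
        if "UA" in run.group():
            return "UA within WWWWWW"
    return "UA only outside WWWWWW"

# Hypothetical example sequences.
for utr in ("GCGCUAGCGC", "GGAUUUAUUUAGG", "GGGCCC", "ggtatttaaacc"):
    print(utr, "->", classify_utr(utr))
```

Because U and A are both W bases, any UA dinucleotide lies entirely inside one maximal A/U run, so checking each run of length six or more is sufficient to decide whether a UA falls within the motif.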
Reviewer response to authors: These are useful additional analyses, and we suggest that the additional figure and discussion should be included in the manuscript/supplement so that readers can benefit from them.
Reviewer original comment 2: The inclusion of more than a single cell type is an acknowledgement of the importance of evaluating cell type-specific effects. The work suggests a number of cell type-specific differences, but due to technical issues (especially with the HEK data, as outlined above) and the use of only two cell lines, it is difficult to understand cell type effects from the work.
The inclusion of both 3' and 5' UTR sequences distinguishes this work from most prior studies in the field. Contrasting the effects of these regions on stability is of interest, although the role of these UTRs (especially the 5' UTR) in translational regulation is not assessed here.
Author response: We examined the role of UTR and UTR variants in translation regulation using polysome profiling. By both univariate analysis and an elastic regression model, we identified motifs of short repeated sequences, including SRSF2 binding sites, as mutation hotspots that lead to aberrant translation. Furthermore, these polysome-shifting mutations had a considerable impact on RNA secondary structures, particularly in upstream AUG-containing 5' UTRs. Integrating these features, our model achieved high accuracy (AUROC > 0.8) in predicting polysome-shifting mutations in the test dataset. Additionally, metagene analysis indicated that pathogenic variants were enriched at the upstream open reading frame (uORF) translation start site, suggesting changes in uORF usage underlie the translation deficiencies caused by these mutations. Illustrating this, we demonstrated that a pathogenic mutation in the IRF6 5' UTR suppresses translation of the primary open reading frame by creating a uORF. Remarkably, site-directed ADAR editing of the mutant mRNA rescued this translation deficiency. Because the regulation of translation and stability does not converge, we illustrate these two mechanisms in two separate manuscripts (this one and doi.org/10.1101/2024.04.11.589132).
Reviewer response to authors: This is useful context. No further comment.