On 2020-07-03 16:41:10, user Laura Sanchez wrote:
Dear Eldjárn et al., this preprint was discussed in a lab meeting and we would like to offer the following review. Thank you for posting this very interesting manuscript. Best, The Sanchez Lab
The manuscript by Eldjárn et al. describes the development of a computational method to enable large-scale linking of gene cluster families (GCFs) and molecular families (MFs). This method uses multiple complementary scoring functions, combining both feature-based and correlation-based approaches, which, the authors state, allows for more effective prioritization of valid links between GCFs and MFs than using the individual scoring functions alone. The introduction provides a very nice summary that integrates information from many different fields to set up the problem being addressed. However, we could not test the utility of this method because documentation was absent from the linked repositories. The manuscript would also benefit from a concrete example (an actual metabolite linked to a specific BGC) to show its advantages over current techniques more effectively. Below is a list of major and minor critiques for this preprint.
Major:
Figure 1 could be redone to better visually demonstrate the problem the authors are trying to address. For instance, a box around the higher-scoring population would help the reader, as the numbers in the figure legend are difficult to match to the visual. The purpose/conclusion of Figure 1B is unclear. Moreover, Figure 1B is unrelated to 1A and appears out of order relative to where IOKR is discussed in the manuscript.
Figure 2 shows a problem with the range of the expected value and variance (which varies with GCF and MF size) before standardizing the correlation score, yet there is no chart showing how this changes after standardization. The authors should consider adding such charts, along with an interpretation to help the reader understand why this change was necessary. For example, it is unclear whether a yellow (high) or blue (low) score is more desirable, and what the ideal distributions would look like. Additionally, the scales on the charts are inconsistent, which makes them difficult to interpret since they all use the same color gradient. We suggest labeling the scale high-low rather than with numbers if the values are not comparable. The authors should also consider inverting the Y axis so that it starts with zero at the bottom and increases moving up the chart.
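To illustrate the kind of before/after comparison we have in mind, here is a minimal sketch of standardizing a count-based correlation score. It is not the authors' implementation: we assume, purely for illustration, that the raw score is the number of strains shared by a GCF and an MF, and that under a null of no association this count follows a hypergeometric distribution. The function name and its parameters are hypothetical.

```python
import math


def standardised_overlap(n_strains, gcf_size, mf_size, overlap):
    """Z-score of an observed strain overlap under a hypergeometric null.

    Assumption (ours, not taken from the manuscript): the raw score is the
    number of strains containing both the GCF and the MF.
    n_strains: total strains in the dataset
    gcf_size:  strains in which the GCF is present
    mf_size:   strains in which the MF is present
    overlap:   strains in which both are present
    """
    if not (0 < gcf_size < n_strains and 0 < mf_size < n_strains):
        raise ValueError("family sizes must lie strictly between 0 and n_strains")

    # Hypergeometric mean and variance; both depend on the family sizes,
    # which is why raw scores are not comparable across GCF/MF pairs.
    mean = gcf_size * mf_size / n_strains
    var = (
        gcf_size * mf_size * (n_strains - gcf_size) * (n_strains - mf_size)
        / (n_strains ** 2 * (n_strains - 1))
    )
    return (overlap - mean) / math.sqrt(var)


# The same raw overlap of 5 shared strains reads very differently
# depending on how large the two families are.
small_pair = standardised_overlap(100, 10, 10, 5)   # well above expectation
large_pair = standardised_overlap(100, 50, 50, 5)   # well below expectation
print(small_pair, large_pair)
```

Plotting such a standardized score over the same grid of GCF and MF sizes used in Figure 2 would show readers directly whether the size dependence has been removed.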
It is unclear how Figure 1 and Figure 2 are related. Is Figure 2 explaining how the problem in Figure 1 was fixed? If so, the authors should consider merging Figure 1A with Figure 2 and presenting Figure 1B later in the text, alongside the section discussing the IOKR framework.
The authors should consider using more precise language to help communicate the utility and limitations of this method more effectively. For example, the authors should use the term “bacteria” instead of “microbe” because the databases feeding into the program are heavily biased toward bacterial metabolites, and fungal metabolites are not well represented there or in the three datasets tested in the manuscript. It would also be worth giving the new score introduced in this manuscript a name, and referring to it by that name throughout the paper to avoid confusion.
This manuscript could benefit from a concrete example (actual metabolites linked to specific BGCs). A firm example with compound names linked to a specific gene cluster would help the reader evaluate how well the method performs compared to traditional methods. The manuscript does evaluate the performance of this technique using “verified hits”, but the identity of those hits and how they were verified remain unclear (unless the verification was the original report, which was also somewhat unclear). For example, the BGC listed at the top of Figure 6 (BGC0000137) encodes rifamycin, a commonly known bacterial metabolite. The authors should consider revealing the identity of at least one verified metabolite and providing a list of “hits” for the BGC encoding that metabolite with associated scores. This would allow readers to more effectively evaluate the new scoring method and determine what a “good score” looks like. A good choice for an example would be a metabolite/BGC pair where a link was observed using this method and not other methods.
We appreciated that the authors discussed the data-dependent limitations. They were apparent when reading the manuscript, and although they were briefly addressed in the discussion section, they could be discussed more thoroughly. The authors should consider drawing attention to biases toward specific organisms or metabolite classes in the databases feeding into the program, and discuss how those biases might affect the results and limit the usefulness of this scoring method to specific applications.
In Figure 6, for BGC0001228, it appears as though IOKR alone provided a higher score for the verified link than the combined score. Can the authors comment on the features of the data that led to this anomaly, and provide suggestions on how to determine the best scoring method?
We are excited at the prospect of using this tool. At the time we accessed the paper for discussion on 6/23, NPLinker did not have any documentation in the GitHub repository, so we were not able to evaluate how well it functions. The authors should provide comprehensive documentation. Additionally, a figure outlining how to use NPLinker to analyze a real dataset would be helpful, either in the manuscript or in the documentation.
Minor:
The metabolite used as an example in the introduction (C35H56O13) has a highly distinctive molecular formula that is easy to link to a BGC (if the metabolite product is known). For example, NP Atlas returns only one hit for this molecular formula. The authors should consider picking an example that would have multiple hits.
The bottom of section 2.2 states “In the case of Figure 1, the standardised scores are 0.0 and 2.65, favoring the right-hand pair”, but in Figure 1 the two scenarios are positioned vertically, not horizontally.
The caption for Figure 5 says verified links are colored green, but in the figure they are red.